Getting Started#
Overview#
The Analysis Productions workflow is as follows:
Analysts create a merge request to the lhcb-datapkg/AnalysisProductions repository, adding their options and the associated metadata.
After the merge request is accepted and merged, the production is submitted to LHCbDIRAC.
Productions are run using the DIRAC transformation system and can be monitored on the Analysis Productions webpage. Issues are followed up in the associated GitLab issue, created upon submission.
After the transformations complete, the output data is replicated to CERN EOS.
Creating an Analysis Production#
From scratch#
To create and configure a new analysis production, you will need to clone the repository and create a new branch:
git clone ssh://git@gitlab.cern.ch:7999/lhcb-datapkg/AnalysisProductions.git
cd AnalysisProductions
git checkout -b ${USER}/my-analysis
Then you need to create a directory containing an info.yaml file and any options files your jobs may need. This directory name is the “analysis” name under which your samples will be grouped after they are ready, and this name must start with a letter followed by any sequence of alphanumeric characters and underscores.
Important
Make sure not to add your info.yaml file and options files to the top-level of the repository!
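For example, a minimal layout could look like this (MyAnalysis and make_ntuple.py are placeholder names):
AnalysisProductions/
└── MyAnalysis/
    ├── info.yaml
    └── make_ntuple.py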
Once you have added these, you can commit and push your changes, which will be reviewed and subsequently approved.
git add <new directory>
git commit -m "<meaningful commit message>"
git push -u origin ${USER}/my-analysis
Based on a previous version#
Important
If you just want to look at the code that was used in a previous production you can browse the different tags of the lhcb-datapkg/AnalysisProductions repository in the GitLab webpage.
Find the version number of the analysis production you wish to base this new production on. You can use the Analysis Productions website, or query it from the command line:
$ lb-ap versions B2OC B02DKPi
The available versions for B02DKPi are:
v0r0p1674088
v0r0p1735460
...
Now, create a new clone of the AnalysisProductions repository with lb-ap clone. For example:
$ lb-ap clone --clone-type krb5 B2OC B02DKPi v0r0p1674088 ${USER}-my-new-prod
This will clone the AnalysisProductions repository and create a new branch containing a single commit that restores the files to the requested version. At this point you can edit the files in your analysis's folder, then commit and push as usual.
Important
If there are samples which haven't changed, you should remove their jobs from the info.yaml file to avoid producing identical samples.
Alternatively, if you wish to create a new branch in an existing clone you can use lb-ap checkout. This should only be done in a clean repository, as it may otherwise result in data loss.
YAML configuration#
Write a file called info.yaml which will configure your jobs.
Each top-level key is the name of a job, and the value must be a dict whose allowed keys are:
Key | Type | Meaning
---|---|---
application | string | The application and version that you want to run, e.g. DaVinci/v45r4.
options | string or list | The options files to pass to a Run 1 or 2 application.
options | dict | Configures an lbexec-style (Run 3) application.
input | dict | The input to the job. You can use bk_query, transform_ids or job_name (see Input below).
output | string | The output file to be registered in the BookKeeping. NB: must be upper-case.
wg | string | The Working Group that the analysis belongs to. The allowed values are listed here.
inform * | string or list | Email address(es) to inform about the status of the production. (Default empty)
automatically_configure * | boolean | Deduce common options based on the input data. (Default false)
turbo * | boolean | Required to be true when running over Turbo data. (Default false)
root_in_tes * | string | Set the value of RootInTES, e.g. when running over MDST input.
checks * | dict | Additional tasks to perform while testing a job. See Checks below.

* optional keys.
A job can therefore be created like this:
My_job:
  application: DaVinci/v45r4
  wg: WG
  automatically_configure: yes
  turbo: no
  inform:
    - someone@cern.ch
  options:
    - make_ntuple.py
  input:
    bk_query: /some/bookkeeping/path.DST
  output: DVNtuple.root
Instead of defining the same values for every job, you can use the special key defaults.
defaults:
  application: DaVinci/v45r4
  wg: WG
  automatically_configure: yes
  turbo: no
  inform:
    - someone@cern.ch
  options:
    - make_ntuple.py
  output: DVNtuple.root
My_MagUp_job:
  input:
    bk_query: /some/MagUp/bookkeeping/path.DST
    n_test_lfns: 3 # only to be used in special cases, default=1
My_MagDown_job:
  input:
    bk_query: /some/MagDown/bookkeeping/path.DST
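A key given directly in a job's definition should take precedence over the value inherited from defaults, so individual jobs can deviate where needed; for example, to run a single job with a different application version (the job name, version and path below are purely illustrative):
My_2018_job:
  application: DaVinci/v46r5   # overrides the default application for this job only
  input:
    bk_query: /some/2018/bookkeeping/path.DST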
You can use the Jinja templating language to add some Python functionality, e.g. looping over years and polarities.
defaults:
  application: DaVinci/v45r4
  wg: WG
  automatically_configure: yes
  turbo: no
  inform:
    - someone@cern.ch
  options:
    - make_ntuple.py
  output: DVNtuple.root
{%- set datasets = [
  (11, 3500, '14', '21r1'),
  (12, 4000, '14', '21'),
  (15, 6500, '15a', '24r2'),
  (16, 6500, '16', '28r2'),
  (17, 6500, '17', '29r2'),
  (18, 6500, '18', '34'),
]%}
{%- for year, energy, reco, strip in datasets %}
{%- for polarity in ['MagDown', 'MagUp'] %}
My_20{{year}}_{{polarity}}_job:
  input:
    bk_query: /LHCb/Collision{{year}}/Beam{{energy}}GeV-VeloClosed-{{polarity}}/Real Data/Reco{{reco}}/Stripping{{strip}}/90000000/BHADRON.MDST
{%- endfor %}
{%- endfor %}
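For reference, the first iteration of the loops above (year 11, MagDown) renders to the following job definition, with the remaining keys taken from defaults:
My_2011_MagDown_job:
  input:
    bk_query: /LHCb/Collision11/Beam3500GeV-VeloClosed-MagDown/Real Data/Reco14/Stripping21r1/90000000/BHADRON.MDST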
Automatic Configuration#
If automatically_configure is enabled, the following attributes of the Gaudi application will be deduced:
DataType
InputType
Simulation
Lumi (takes the opposite value to Simulation)
CondDBtag and DDDBtag (using LatestGlobalTagByDataType if running over real data)
Enabling automatically_configure also allows the attributes Turbo and RootInTES to be configured from keys in info.yaml.
Important
For Run 3 jobs, automatically_configure does not work yet! Please configure conddb_tag and dddb_tag manually!
Input#
There are three ways to define what input a job should take.
bk_query#
A bookkeeping query will specify a particular part of the Dirac bookkeeping to take input files from. All LFNs in this location will be used as input.
2011_13164062_MagDown:
  input:
    bk_query: /MC/2011/Beam3500GeV-2011-MagDown-Nu2-Pythia8/Sim09f-ReDecay01/Trig0x40760037/Reco14c/Stripping21r1NoPrescalingFlagged/13164062/ALLSTREAMS.DST
  options:
    - MC_options/MC_13164062_ALLSTREAMS_options.py
A bookkeeping query can involve a selection of one or more runs, or run ranges. This is particularly handy if you are only interested in a handful of runs and want to save time and compute.
In addition, you can require specific Data Quality (DQ) flags through the dq_flags configurable. It takes a list of DQ flags, which are either “OK”, “UNCHECKED” or “BAD”. By default, APs only run on data that is flagged “OK”.
2011_13164062_MagDown:
  input:
    bk_query: /MC/2011/Beam3500GeV-2011-MagDown-Nu2-Pythia8/Sim09f-ReDecay01/Trig0x40760037/Reco14c/Stripping21r1NoPrescalingFlagged/13164062/ALLSTREAMS.DST
    runs:
      - 10000 # select run 10000 from the above BK path
      - 10001 # select run 10001 from the above BK path
      - 10003:10005 # select inclusive run range, runs 10003, 10004, 10005 from the above BK path
    dq_flags:
      - OK
      - UNCHECKED
      - BAD
  options:
    - MC_options/MC_13164062_ALLSTREAMS_options.py
The run numbers used in the above are pure examples. Be aware that if you try to select run numbers that don’t exist under the specified BK path, then your BK query will not select anything!
transform_ids#
Alternatively, one can specify the transformation IDs that correspond to the desired data/MC. It might be that a given bk_query corresponds to two different MC samples, so specifying the transformation IDs that correspond to just one of the samples avoids running over undesired MC.
2011_13164062_MagDown:
  input:
    transform_ids:
      - 132268
    filetype: ALLSTREAMS.DST
  options:
    - MC_options/MC_13164062_ALLSTREAMS_options.py
job_name#
It is also possible to set up “job chains”, so one job can take its input as the output of another job.
2015_12163001_MagDown_Strip:
  application: DaVinci/v44r10p5
  input:
    bk_query: /MC/2015/Beam6500GeV-2015-MagDown-Nu1.6-25ns-Pythia8/Sim09e-ReDecay01/Trig0x411400a2/Reco15a/Turbo02/Stripping24r1NoPrescalingFlagged/12163001/ALLSTREAMS.DST
  options:
    - strip_options/strip_ALLSTREAMS_options.py
  output: B02D0KPI.STRIP.DST
2015_12163001_MagDown:
  application: DaVinci/v45r6
  input:
    job_name: 2015_12163001_MagDown_Strip
    filetype: B02D0KPI.STRIP.DST
  options:
    - MC_options/MC_12163001_B02D0KPI.STRIP_options.py
This is useful for situations where you might want to restrip an existing MC sample. In that case, you should first run a stripping job with DaVinci, using the version that was used for the original stripping you are now trying to recreate. Then you can use a job chain to pass this restripped MC to a tupling job that uses the DaVinci version chosen for your analysis. A similar use case is a Run 3 chain of HLT1 -> HLT2 -> DaVinci.
Note
In the above example a version of DaVinci is set explicitly in both the stripping and tupling jobs; however, it is best practice to set the tupling job's DaVinci version as the default using the defaults: special key.
Note
DIRAC does not currently support using the output of one job as an input to multiple other jobs. Therefore, no job should share the same input job with another job.
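For example, a layout like the following sketch is not supported, because both tupling jobs would take their input from the same upstream job (the two tupling job names here are hypothetical):
2015_12163001_MagDown_KK:
  input:
    job_name: 2015_12163001_MagDown_Strip
    filetype: B02D0KPI.STRIP.DST
2015_12163001_MagDown_KPi:
  input:
    job_name: 2015_12163001_MagDown_Strip  # same upstream job as above - not allowed
    filetype: B02D0KPI.STRIP.DST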
Checks#
Optionally, some checks can be defined that are automatically run during the test of a job. You can use the special top-level key checks, for example:
checks:
  histogram_deltaM:
    type: range
    expression: Dstar_M-D0_M
    limits:
      min: 139
      max: 155
    blind_ranges:
      - min: 143
        max: 147
  at_least_50_entries:
    type: num_entries
    tree_pattern: TupleDstToD0pi_D0ToKK/DecayTree
    count: 50
You must then add each check to at least one of the jobs in your production. This is done by specifying a checks option within that job's definition, for example:
My_MagUp_job:
  input:
    bk_query: /some/MagUp/bookkeeping/path.DST
  checks:
    - histogram_deltaM
    - at_least_50_entries
You can also specify the checks under the special key defaults to apply that list of checks to all jobs. If you want to apply additional checks to specific jobs on top of a list of defaults, you can use the extra_checks keyword. The checks defined by extra_checks will be added to the ones defined in checks, for example:
defaults:
  checks:
    - at_least_50_entries
My_MagUp_job:
  input:
    bk_query: /some/MagUp/bookkeeping/path.DST
  extra_checks:
    - histogram_mag_up_only
My_MagDown_job:
  input:
    bk_query: /some/MagDown/bookkeeping/path.DST
  extra_checks:
    - histogram_mag_down_only
In this example, the check at_least_50_entries is applied to both jobs, but the magnet-specific checks histogram_mag_up_only and histogram_mag_down_only are only applied to their corresponding jobs.
The general form of a check definition is:
checks:
  <check_name>:
    type: <check_type>
    <check_options>
where <check_name> is your chosen name for the check, <check_type> is the type of check to be performed and <check_options> are the sub-keys for the chosen check type. For each check, you must specify a type; the available types of checks are explained below:
Check type | Type | Meaning | Additional notes
---|---|---|---
range | dict | Create a 1D histogram of the specified expression and check that it contains events within the specified limits. |
range_bkg_subtracted | dict | Create a 1D histogram of the specified expression and check that it contains events within the specified limits after subtracting background candidates. | The background is assumed to be linearly distributed in the control variable, and no fit is performed. In particular, signal ([m-s/2., m+s/2.]) and background ([m-b-delta, m-b] U [m+b, m+b+delta]) windows have to be defined on a control variable, where m is the expected centre of the signal peak, s the width of the signal window, b the offset of the background windows from m and delta their width.
range_nd | dict | Create a 2D or 3D histogram of the specified expressions and check that it contains events within the specified limits. | If 3 axes are specified, both the full 3D histogram and the set of all 2D histograms are created.
num_entries | dict | Check that the number of entries in the test TTree is at least a specified number. |
num_entries_per_invpb | dict | Check that the number of entries in the test TTree per pb^-1 of luminosity is at least a specified number. | This will only work with real data because Monte Carlo has no luminosity information.
branches_exist | dict | Check that the output ntuple contains a certain list of branches. |
See here for the full list of sub-keys for each check type.
You can specify a tree_pattern for each check. This is a regex that will be compared against all TTrees in the ntuple created by the test, and the check will be run on all matching trees. If not specified, it takes the default value r"(.*/DecayTree)|(.*/MCDecayTree)", i.e. the check will be run on all TTrees whose name ends in DecayTree or MCDecayTree.
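For illustration, a single check could be applied to several tuples at once by using a broader pattern (the tuple naming here is hypothetical):
checks:
  at_least_50_entries_all_tuples:
    type: num_entries
    tree_pattern: Tuple.*/DecayTree   # matches every DecayTree inside a Tuple* directory
    count: 50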
Checks are run when using lb-ap test for a job, e.g. lb-ap test MyProduction MyJobName. Any histograms created by tests can be found in the checks folder within the output directory for that job.
Checks can also be run on the ntuple created in an earlier test using the check command. This saves you from having to re-run DaVinci if you're only changing your checks and not your options files. This command requires the same two arguments as lb-ap test, plus one extra for the path to the ntuple from a previous test. For example:
lb-ap check MyProduction MyJobName local-tests/MyProduction-2021-06-10-18-09-01/output/00012345_00006789_1.D02KK.ROOT
If you want to save the results of these checks to a different folder, you can provide an optional fourth argument with a directory into which to save these files (if not specified, the results will be saved to the same checks folder as described above for lb-ap test).
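For example, to write the check results into a separate directory (my-check-results is a placeholder directory name):
lb-ap check MyProduction MyJobName local-tests/MyProduction-2021-06-10-18-09-01/output/00012345_00006789_1.D02KK.ROOT my-check-results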
Options files#
The options files to be used with your jobs must be placed in the same
folder as info.yaml
or a subdirectory of that folder.
Python environment#
The environment variables set by AnalysisProductions.xenv append the top-level directory of this datapackage to $PYTHONPATH. This means that as long as you include __init__.py files in your folders, you can import them as Python modules without needing to manipulate the environment yourself.
For example, say you have the following directory structure, where utils.py contains some classes or functions that you want to use in make_ntuple.py:
AnalysisProductions/
└── MyAnalysis/
    ├── __init__.py
    ├── info.yaml
    ├── make_ntuple.py
    └── utils.py
In make_ntuple.py you can add:
from MyAnalysis.utils import MyCoolClass, my_fantastic_function
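For completeness, a minimal utils.py providing those names could look like this (the bodies here are just placeholders for your own code):
# MyAnalysis/utils.py
class MyCoolClass:
    """Placeholder class used by make_ntuple.py."""

def my_fantastic_function(config):
    """Placeholder helper; replace with your own logic."""
    return config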
Adding large files#
In some cases you might need a relatively large file to be available for your production (e.g. some kind of MVA model). Adding these to the main Analysis Productions repository is problematic as it permanently increases the size of the repo for all future clones. Even after “deleting” the files they remain in the git history.
To avoid this problem you can use Git LFS to store these files. See the associated GitLab documentation for more details.
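As a rough sketch of the workflow (the file name below is hypothetical; see the GitLab documentation mentioned above for details):
git lfs install                               # set up Git LFS once per clone
git lfs track "MyAnalysis/my_mva_model.root"  # store this file with LFS instead of plain git
git add .gitattributes MyAnalysis/my_mva_model.root
git commit -m "Add MVA model via Git LFS"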
Next steps#
Extensive local and continuous integration testing is available. Please move on to Testing for details on testing your configuration or first read YAML Configuration Sub-Keys for a detailed description of the YAML configuration options.