Getting Started#

Overview#

The Analysis Productions workflow is as follows:

  1. Analysts create a merge request to the lhcb-datapkg/AnalysisProductions repository, adding their options and the associated metadata.

  2. After the merge request is accepted and merged, the production is submitted to LHCbDIRAC.

  3. Productions are run using the DIRAC transformation system and can be monitored on the Analysis Productions webpage. Issues are followed up in the associated GitLab issue, created upon submission.

  4. After the transformations complete, the output data is replicated to CERN EOS.

Creating an Analysis Production#

From scratch#

To create and configure a new analysis production, you will need to clone the repository and create a new branch:

git clone ssh://git@gitlab.cern.ch:7999/lhcb-datapkg/AnalysisProductions.git
cd AnalysisProductions
git checkout -b ${USER}/my-analysis

Then create a directory containing an info.yaml file and any options files your jobs may need. The directory name becomes the “analysis” name under which your samples are grouped once they are ready. It must start with a letter, followed by any sequence of alphanumeric characters and underscores.
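The naming rule above can be checked before pushing. The following is a minimal sketch, not part of the Analysis Productions tooling: the helper name is_valid_analysis_name is hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
import re

# Naming rule from the text: a letter first, then letters, digits or underscores.
# Assumption: only ASCII letters and digits count as alphanumeric.
ANALYSIS_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_analysis_name(name: str) -> bool:
    """Return True if `name` satisfies the directory naming rule."""
    return ANALYSIS_NAME.fullmatch(name) is not None


for candidate in ["B02DKPi", "my_analysis_2024", "2024_analysis", "my-analysis"]:
    print(candidate, is_valid_analysis_name(candidate))
```

Names that start with a digit or contain a hyphen are rejected by this sketch.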

Important

Do not add your info.yaml file or options files to the top level of the repository! They belong inside your analysis directory.

Once you have added these, commit and push your changes so that they can be reviewed and approved in a merge request.

git add <new directory>
git commit -m "<meaningful commit message>"
git push -u origin ${USER}/my-analysis

Based on a previous version#

Important

If you just want to look at the code that was used in a previous production, you can browse the tags of the lhcb-datapkg/AnalysisProductions repository on the GitLab webpage.

Find the version number of the analysis production on which you want to base the new production. You can use the Analysis Productions website or query from the command line:

$ lb-ap versions B2OC B02DKPi
The available versions for B02DKPi are:
v0r0p1674088
v0r0p1735460
...

Now, create a new clone of the AnalysisProductions repository with lb-ap clone. For example:

$ lb-ap clone --clone-type krb5 B2OC B02DKPi v0r0p1674088 ${USER}-my-new-prod

This will clone the AnalysisProductions repository and create a new branch, which will contain a single commit restoring the files to the version requested. At this point you can edit the files in your analysis’s folder, commit and push as usual.

Important

If there are samples which haven’t changed, you should remove their jobs from the info.yaml file to avoid producing identical samples.

Alternatively, if you wish to create a new branch in an existing clone, you can use lb-ap checkout. This should only be done in a clean repository, as it may result in data loss.

YAML configuration#

Write a file called info.yaml which will configure your jobs.

Each top-level key is the name of a job, and the value must be a dict whose allowed keys are:

| Key | Type | Meaning |
|---|---|---|
| application | string | The application and version that you want to run, e.g. DaVinci/v46r5. A platform can be set with the syntax e.g. A/B@platform. |
| options (gaudirun.py-style) | string or list | The options files to pass to a Run 1 or 2 application. |
| options (lbexec-style) | dict | Configures lbexec entrypoint and YAML extra_options to configure the Run 3 application. |
| input | dict | The input to the job. You can use bk_query for data in BookKeeping, or job_name to use the output of another job in the same production as input. transform_ids can be used to specify input as the output from a specific transformation ID. For information on all possible options for input see here. |
| output | string | The output file to be registered in the BookKeeping. NB: must be upper-case. |
| wg | string | The Working Group that the analysis belongs to. The allowed values are listed here. |
| inform | string or list | Email address(es) to inform about the status of the production. (Default empty) |
| automatically_configure* | boolean | Deduce common options based on the input data. (Default false, see automatic configuration) |
| turbo* | boolean | Required to be true if using turbo stream data and automatically_configure is enabled. (Default false) |
| root_in_tes* | string | Set the value of RootInTES for when running over MDST input. Requires automatically_configure to be enabled. |
| checks* | dict | Additional tasks to perform while testing a job. See checks. |

* optional keys.

A job can therefore be created like this:

My_job:
  application: DaVinci/v45r4
  wg: WG
  automatically_configure: yes
  turbo: no
  inform:
    - someone@cern.ch
  options:
    - make_ntuple.py
  input:
    bk_query: /some/bookkeeping/path.DST
  output: DVNtuple.root
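The application value follows the Name/version pattern, optionally suffixed with @platform. As an illustration only, the sketch below splits such a string into its parts. The parse_application helper and the Application tuple are hypothetical and are not part of lb-ap, and the platform string in the usage lines is purely illustrative.

```python
from typing import NamedTuple, Optional


class Application(NamedTuple):
    name: str
    version: str
    platform: Optional[str]


def parse_application(spec: str) -> Application:
    """Split 'Name/version' or 'Name/version@platform' into its parts."""
    app, _, platform = spec.partition("@")
    name, sep, version = app.partition("/")
    if not sep or not name or not version:
        raise ValueError(f"expected 'Name/version[@platform]', got {spec!r}")
    # An absent or empty platform is reported as None.
    return Application(name, version, platform or None)


print(parse_application("DaVinci/v46r5"))
# The platform below is only an example value.
print(parse_application("DaVinci/v46r5@x86_64_v2-centos7-gcc11-opt"))
```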

Instead of defining the same values for every job, you can use the special key defaults.

defaults:
  application: DaVinci/v45r4
  wg: WG
  automatically_configure: yes
  turbo: no
  inform:
    - someone@cern.ch
  options:
    - make_ntuple.py
  output: DVNtuple.root

My_MagUp_job:
  input:
    bk_query: /some/MagUp/bookkeeping/path.DST
    n_test_lfns: 3  # only to be used in special cases, default=1

My_MagDown_job:
  input:
    bk_query: /some/MagDown/bookkeeping/path.DST
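To make the effect of defaults concrete, here is a sketch of how the values could be folded into each job once the YAML has been loaded into a dictionary. It assumes a shallow merge in which job-level keys override default keys. The real tooling may merge differently, and apply_defaults is a hypothetical helper.

```python
SPECIAL_KEYS = {"defaults", "checks"}  # top-level keys that are not jobs


def apply_defaults(config: dict) -> dict:
    """Return {job_name: settings} with the `defaults` entry folded into each job."""
    defaults = config.get("defaults", {})
    jobs = {}
    for name, settings in config.items():
        if name in SPECIAL_KEYS:
            continue
        # Assumption: shallow merge, job-level keys win over defaults.
        jobs[name] = {**defaults, **settings}
    return jobs


config = {
    "defaults": {
        "application": "DaVinci/v45r4",
        "wg": "WG",
        "options": ["make_ntuple.py"],
        "output": "DVNtuple.root",
    },
    "My_MagUp_job": {
        "input": {"bk_query": "/some/MagUp/bookkeeping/path.DST", "n_test_lfns": 3}
    },
    "My_MagDown_job": {
        "input": {"bk_query": "/some/MagDown/bookkeeping/path.DST"}
    },
}

jobs = apply_defaults(config)
for name, settings in jobs.items():
    print(name, settings["application"], settings["input"]["bk_query"])
```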

You can use the Jinja templating language to add some Python-like functionality, e.g. looping over years and polarities.

defaults:
  application: DaVinci/v45r4
  wg: WG
  automatically_configure: yes
  turbo: no
  inform:
    - someone@cern.ch
  options:
    - make_ntuple.py
  output: DVNtuple.root

{%- set datasets = [
  (11, 3500, '14', '21r1'),
  (12, 4000, '14', '21'),
  (15, 6500, '15a', '24r2'),
  (16, 6500, '16', '28r2'),
  (17, 6500, '17', '29r2'),
  (18, 6500, '18', '34'),
]%}

{%- for year, energy, reco, strip in datasets %}
  {%- for polarity in ['MagDown', 'MagUp'] %}

My_20{{year}}_{{polarity}}_job:
  input:
    bk_query: /LHCb/Collision{{year}}/Beam{{energy}}GeV-VeloClosed-{{polarity}}/Real Data/Reco{{reco}}/Stripping{{strip}}/90000000/BHADRON.MDST

  {%- endfor %}
{%- endfor %}
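To see what the two nested loops in the template above expand to, the same iteration can be written in plain Python. This is only an illustration of the expansion, not how the template is rendered, and expand_jobs is a hypothetical helper.

```python
from itertools import product

# Same tuples as in the template: (year, beam energy, Reco version, Stripping version).
datasets = [
    (11, 3500, "14", "21r1"),
    (12, 4000, "14", "21"),
    (15, 6500, "15a", "24r2"),
    (16, 6500, "16", "28r2"),
    (17, 6500, "17", "29r2"),
    (18, 6500, "18", "34"),
]
polarities = ["MagDown", "MagUp"]


def expand_jobs() -> dict:
    """Map each generated job name to its bk_query, mirroring the template loops."""
    jobs = {}
    for (year, energy, reco, strip), polarity in product(datasets, polarities):
        name = f"My_20{year}_{polarity}_job"
        jobs[name] = (
            f"/LHCb/Collision{year}/Beam{energy}GeV-VeloClosed-{polarity}"
            f"/Real Data/Reco{reco}/Stripping{strip}/90000000/BHADRON.MDST"
        )
    return jobs


jobs = expand_jobs()
for name, bk_query in jobs.items():
    print(name, bk_query)
```

Six datasets and two polarities give twelve jobs, one per year and polarity.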

Automatic Configuration#

If automatically_configure is enabled, the following attributes of the Gaudi application will be deduced:

  • DataType

  • InputType

  • Simulation

  • Lumi (takes the opposite value to Simulation)

  • CondDBtag and DDDBtag (using LatestGlobalTagByDataType if running over real data)

Enabling automatically_configure also allows the attributes Turbo and RootInTES to be configured from keys in info.yaml.

Important

For Run 3 jobs, automatically_configure does not work yet! Please configure conddb_tag and dddb_tag manually!

Input#

There are three ways to define what input a job should take.

bk_query#

A bookkeeping query selects a particular location in the DIRAC bookkeeping from which to take input files. All LFNs in this location will be used as input.

2011_13164062_MagDown:
  input:
    bk_query: /MC/2011/Beam3500GeV-2011-MagDown-Nu2-Pythia8/Sim09f-ReDecay01/Trig0x40760037/Reco14c/Stripping21r1NoPrescalingFlagged/13164062/ALLSTREAMS.DST
  options:
    - MC_options/MC_13164062_ALLSTREAMS_options.py

A bookkeeping query can involve a selection of one or more runs, or run ranges. This is particularly handy if you are only interested in a handful of runs and want to save time and compute.

In addition, you can require specific Data Quality (DQ) flags through the dq_flags configurable. It takes a list of DQ flags, which are either “OK”, “UNCHECKED” or “BAD”. By default, APs only run on data that is flagged “OK”.

2011_13164062_MagDown:
  input:
    bk_query: /MC/2011/Beam3500GeV-2011-MagDown-Nu2-Pythia8/Sim09f-ReDecay01/Trig0x40760037/Reco14c/Stripping21r1NoPrescalingFlagged/13164062/ALLSTREAMS.DST
    runs:
     - 10000 # select run 10000 from the above BK path
     - 10001 # select run 10001 from the above BK path
     - 10003:10005 # select inclusive run range, runs 10003, 10004, 10005 from the above BK path
    dq_flags:
     - OK
     - UNCHECKED
     - BAD
  options:
    - MC_options/MC_13164062_ALLSTREAMS_options.py

The run numbers used above are purely illustrative. Be aware that if you select run numbers that do not exist under the specified BK path, your BK query will not select anything!
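The runs list mixes single run numbers with inclusive first:last ranges. The sketch below shows how such a list expands into individual run numbers. It illustrates the syntax only, and expand_runs is a hypothetical helper rather than part of the Analysis Productions code.

```python
def expand_runs(runs) -> list:
    """Expand single runs and inclusive 'first:last' ranges into a sorted list."""
    selected = set()
    for entry in runs:
        text = str(entry)
        if ":" in text:
            first, last = (int(part) for part in text.split(":"))
            if first > last:
                raise ValueError(f"invalid run range: {text}")
            # Ranges are inclusive, so the last run is part of the selection.
            selected.update(range(first, last + 1))
        else:
            selected.add(int(text))
    return sorted(selected)


print(expand_runs([10000, 10001, "10003:10005"]))
```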

transform_ids#

Alternatively, you can specify the transformation IDs that correspond to the desired data or MC. A given bk_query may match two different MC samples, so specifying the transformation IDs of just one of them avoids running over undesired MC.

2011_13164062_MagDown:
  input:
    transform_ids:
      - 132268
    filetype: ALLSTREAMS.DST
  options:
    - MC_options/MC_13164062_ALLSTREAMS_options.py

job_name#

It is also possible to set up “job chains”, in which one job takes the output of another job as its input.

2015_12163001_MagDown_Strip:
  application: DaVinci/v44r10p5
  input:
    bk_query: /MC/2015/Beam6500GeV-2015-MagDown-Nu1.6-25ns-Pythia8/Sim09e-ReDecay01/Trig0x411400a2/Reco15a/Turbo02/Stripping24r1NoPrescalingFlagged/12163001/ALLSTREAMS.DST
  options:
    - strip_options/strip_ALLSTREAMS_options.py
  output: B02D0KPI.STRIP.DST

2015_12163001_MagDown:
  application: DaVinci/v45r6
  input:
    job_name: 2015_12163001_MagDown_Strip
    filetype: B02D0KPI.STRIP.DST
  options:
    - MC_options/MC_12163001_B02D0KPI.STRIP_options.py

This is useful when you want to restrip an existing MC sample. In that case, first run a stripping job with the version of DaVinci that was used for the original stripping you are trying to recreate. Then use a job chain to pass the restripped MC to a tupling job that uses the version of DaVinci chosen for your analysis. A similar use case is a Run 3 chain of HLT1->HLT2->DaVinci.

Note

In the above example, a version of DaVinci is set explicitly in both the stripping and tupling jobs. However, it is best practice to set the tupling job’s DaVinci version as the default value using the defaults: special key.

Note

DIRAC does not currently support using the output of one job as an input to multiple other jobs. Therefore, no two jobs should take their input from the same job.
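The restriction on shared input jobs can be checked on the parsed configuration before submitting. The following sketch assumes the jobs are available as a dictionary keyed by job name. The shared_input_jobs helper and the job 2015_12163001_MagDown_Other in the second usage example are hypothetical.

```python
from collections import defaultdict


def shared_input_jobs(jobs: dict) -> dict:
    """Map each upstream job to its consumers when more than one job reads from it."""
    consumers = defaultdict(list)
    for name, settings in jobs.items():
        upstream = settings.get("input", {}).get("job_name")
        if upstream is not None:
            consumers[upstream].append(name)
    return {up: names for up, names in consumers.items() if len(names) > 1}


chain = {
    "2015_12163001_MagDown_Strip": {"input": {"bk_query": "/MC/2015/..."}},
    "2015_12163001_MagDown": {"input": {"job_name": "2015_12163001_MagDown_Strip"}},
}
print(shared_input_jobs(chain))  # empty: each input job has a single consumer

# Adding a second (hypothetical) consumer of the stripping job violates the rule.
chain["2015_12163001_MagDown_Other"] = {
    "input": {"job_name": "2015_12163001_MagDown_Strip"}
}
print(shared_input_jobs(chain))
```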

Checks#

Optionally, some checks can be defined that are automatically run during the test of a job. You can use the special top-level key checks, for example:

checks:
  histogram_deltaM:
    type: range
    expression: Dstar_M-D0_M
    limits:
      min: 139
      max: 155
    blind_ranges:
      - min: 143
        max: 147
  at_least_50_entries:
    type: num_entries
    tree_pattern: TupleDstToD0pi_D0ToKK/DecayTree
    count: 50

You must then add this check to at least one of the jobs in your production. This is done by specifying a checks option within that job’s definition, for example:

My_MagUp_job:
  input:
    bk_query: /some/MagUp/bookkeeping/path.DST
  checks:
    - histogram_deltaM
    - at_least_50_entries

You can also specify the checks under the special key defaults to apply that list of checks to all jobs. If you want to apply additional checks to specific jobs on top of a list of defaults, you can use the extra_checks keyword. The checks defined by extra_checks will be added to the ones defined in checks, for example:

defaults:
  checks:
    - at_least_50_entries

My_MagUp_job:
  input:
    bk_query: /some/MagUp/bookkeeping/path.DST
  extra_checks:
    - histogram_mag_up_only

My_MagDown_job:
  input:
    bk_query: /some/MagDown/bookkeeping/path.DST
  extra_checks:
    - histogram_mag_down_only

In this example, the check at_least_50_entries is applied to both jobs, but the magnet-specific checks histogram_mag_up_only and histogram_mag_down_only are only applied to their corresponding jobs.
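The way checks and extra_checks combine can be sketched as follows. This illustrates the behaviour described above under two assumptions: a job-level checks list replaces the one from defaults, and duplicates are dropped. effective_checks is a hypothetical helper, not the tool's implementation.

```python
def effective_checks(config: dict) -> dict:
    """Return the list of check names that applies to each job."""
    defaults = config.get("defaults", {})
    result = {}
    for name, settings in config.items():
        if name in ("defaults", "checks"):  # special keys, not jobs
            continue
        # Assumption: a job-level `checks` list overrides the default list.
        checks = list(settings.get("checks", defaults.get("checks", [])))
        # `extra_checks` are added on top of `checks`.
        for extra in settings.get("extra_checks", []):
            if extra not in checks:
                checks.append(extra)
        result[name] = checks
    return result


config = {
    "defaults": {"checks": ["at_least_50_entries"]},
    "My_MagUp_job": {"extra_checks": ["histogram_mag_up_only"]},
    "My_MagDown_job": {"extra_checks": ["histogram_mag_down_only"]},
}
resolved = effective_checks(config)
for job, names in resolved.items():
    print(job, names)
```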

The general form of a check definition is:

checks:
  <check_name>:
    type: <check_type>
    <check_options>

where <check_name> is your chosen name for the check, <check_type> is the type of check to be performed and <check_options> are the sub-keys for the chosen check type.

For each check, you must specify a type. The available types of checks are explained below:

| Key | Type | Meaning | Additional notes |
|---|---|---|---|
| range | dict | Create a 1D histogram of the specified expression and check that it contains events within the specified limits. | |
| range_bkg_subtracted | dict | Create a 1D histogram of the specified expression and check that it contains events within the specified limits after subtracting background candidates. | The background is assumed to be linearly distributed on the control variable, and no fit is performed. In particular, signal ([m-s/2., m+s/2.]) and background ([m-b-delta, m-b] U [m+b, m+b+delta]) windows have to be defined on a control variable. Here m = mean_sig, s = signal_window, delta = background_shift and b = background_window. |
| range_nd | dict | Create a 2D or 3D histogram of the specified expressions and check that it contains events within the specified limits. | If 3 axes are specified, both the full 3D histogram and the set of all 2D histograms are created. |
| num_entries | dict | Check that the number of entries in the test TTree is at least a specified number. | |
| num_entries_per_invpb | dict | Check that the number of entries in the test TTree per pb^-1 of luminosity is at least a specified number. | This will only work with real data because Monte Carlo has no luminosity information. |
| branches_exist | dict | Check that the output ntuple contains a certain list of branches. | |

See here for the full list of sub-keys for each check type.
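The signal and background windows used by range_bkg_subtracted can be written out explicitly. The sketch below computes the windows from the definitions in the table and applies a sideband subtraction for a linear background. The scaling by s / (2 * delta) follows from the linear-background assumption and is not taken from the tool's code, and both function names are hypothetical.

```python
def windows(mean_sig, signal_window, background_shift, background_window):
    """Return the (signal, lower sideband, upper sideband) intervals."""
    m, s, delta, b = mean_sig, signal_window, background_shift, background_window
    signal = (m - s / 2.0, m + s / 2.0)
    lower = (m - b - delta, m - b)
    upper = (m + b, m + b + delta)
    return signal, lower, upper


def background_subtracted_yield(n_signal, n_lower, n_upper, signal_window, background_shift):
    """Estimate the signal yield assuming a linear background (no fit).

    For a linear background, the two sidebands are symmetric about the mean
    and together span 2 * delta, so their combined count scales to the
    signal window by s / (2 * delta).
    """
    scale = signal_window / (2.0 * background_shift)
    return n_signal - (n_lower + n_upper) * scale


# Illustrative numbers only.
print(windows(mean_sig=145.0, signal_window=4.0, background_shift=3.0, background_window=5.0))
print(background_subtracted_yield(120, 15, 15, signal_window=4.0, background_shift=3.0))
```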

You can specify a tree_pattern for each check. This is a regex that is compared against all TTrees in the ntuple created by the test, and the check is run on all matching trees. If not specified, it takes the default value r"(.*/DecayTree)|(.*/MCDecayTree)", i.e. the check is run on every TTree named DecayTree or MCDecayTree.
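The effect of tree_pattern can be tried out with Python's re module. The sketch below assumes the pattern must match the full tree path (re.fullmatch), which may differ from how the tool applies it. The tree names in the usage lines are illustrative, and matching_trees is a hypothetical helper.

```python
import re

# Default pattern quoted in the text above.
DEFAULT_TREE_PATTERN = r"(.*/DecayTree)|(.*/MCDecayTree)"


def matching_trees(tree_names, pattern=DEFAULT_TREE_PATTERN):
    """Return the tree paths that the pattern selects."""
    regex = re.compile(pattern)
    # Assumption: the whole tree path has to match the pattern.
    return [name for name in tree_names if regex.fullmatch(name)]


trees = [
    "TupleDstToD0pi_D0ToKK/DecayTree",
    "MCTuple/MCDecayTree",
    "GetIntegratedLuminosity/LumiTuple",
]
print(matching_trees(trees))
print(matching_trees(trees, r"TupleDstToD0pi_D0ToKK/DecayTree"))
```

With the default pattern, the first two trees are selected and the luminosity tuple is skipped.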

Checks are run when using lb-ap test for a job, e.g. lb-ap test MyProduction MyJobName. Any histograms created by the checks can be found in the checks folder within the output directory for that job.

Checks can also be run on the ntuple created in an earlier test using the check command. This saves you from re-running DaVinci if you are only changing your checks and not your options files. The command takes the same two arguments as lb-ap test, plus a third giving the path to the ntuple from a previous test. For example: lb-ap check MyProduction MyJobName local-tests/MyProduction-2021-06-10-18-09-01/output/00012345_00006789_1.D02KK.ROOT. To save the results of these checks to a different folder, provide an optional fourth argument with the directory in which to save them. If it is not specified, the results are saved to the same checks folder as described above for lb-ap test.

Options files#

The options files to be used with your jobs must be placed in the same folder as info.yaml or a subdirectory of that folder.

Python environment#

The environment variables set by AnalysisProductions.xenv append the top-level directory of this datapackage to $PYTHONPATH. This means that as long as you include __init__.py files in your folders, you can import them as Python modules without needing to manipulate the environment yourself.

For example, say you have the following directory structure, where utils.py contains some classes or functions that you want to use in make_ntuple.py:

AnalysisProductions/
└── MyAnalysis/
    ├── __init__.py
    ├── info.yaml
    ├── make_ntuple.py
    └── utils.py

In make_ntuple.py you can add:

from MyAnalysis.utils import MyCoolClass, my_fantastic_function

Adding large files#

In some cases you might need a relatively large file to be available for your production (e.g. some kind of MVA model). Adding these to the main Analysis Productions repository is problematic as it permanently increases the size of the repo for all future clones. Even after “deleting” the files they remain in the git history.

To avoid this problem you can use Git LFS to store these files. See the associated GitLab documentation for more details.

Next steps#

Extensive local and continuous integration testing is available. Please move on to Testing for details on testing your configuration, or first read YAML Configuration Sub-Keys for a detailed description of the YAML configuration options.