Getting Started#
Overview#
The Analysis Productions workflow is as follows:
Analysts create a merge request to the lhcb-datapkg/AnalysisProductions repository, adding their options and the associated metadata.
After the merge request is accepted and merged, the production is submitted to LHCbDIRAC.
Productions are run using the DIRAC transformation system and can be monitored on the Analysis Productions webpage. Issues are followed up in the associated GitLab issue, created upon submission.
After the transformations complete, the output data is replicated to CERN EOS.
Creating an Analysis Production#
From scratch#
To create and configure a new analysis production, you will need to clone the repository and create a new branch:
git clone ssh://git@gitlab.cern.ch:7999/lhcb-datapkg/AnalysisProductions.git
cd AnalysisProductions
git checkout -b ${USER}/my-analysis
Then you need to create a directory containing an info.yaml file and any options files your jobs may need. This directory name is the “analysis” name under which your samples will be grouped after they are ready, and this name must start with a letter followed by any sequence of alphanumeric characters and underscores.
Important
Make sure not to add your info.yaml file and options files to the top-level of the repository!
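For example, a minimal layout could look like this (MyAnalysis and make_ntuple.py are placeholder names):
AnalysisProductions/
└── MyAnalysis/
    ├── info.yaml
    └── make_ntuple.py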
Once you have added these, you can commit and push your changes, which will be reviewed and subsequently approved.
git add <new directory>
git commit -m "<meaningful commit message>"
git push -u origin ${USER}/my-analysis
Based on a previous version#
Important
If you just want to look at the code that was used in a previous production you can browse the different tags of the lhcb-datapkg/AnalysisProductions repository in the GitLab webpage.
Find the version number of the analysis production you wish to base this new production on. You can use the Analysis Productions website, or query it from the command line:
$ lb-ap versions B2OC B02DKPi
The available versions for B02DKPi are:
v0r0p1674088
v0r0p1735460
...
Now, create a new clone of the AnalysisProductions repository with lb-ap clone. For example:
$ lb-ap clone --clone-type krb5 B2OC B02DKPi v0r0p1674088 ${USER}-my-new-prod
This will clone the AnalysisProductions repository and create a new branch containing a single commit that restores the files to the requested version. At this point you can edit the files in your analysis's folder, then commit and push as usual.
Important
If there are samples which haven't changed, you should remove their jobs from the info.yaml file to avoid producing identical samples.
Alternatively, if you wish to create a new branch in an existing clone you can use lb-ap checkout. This should only be done in a clean repository, as it may otherwise result in data loss.
YAML configuration#
Write a file called info.yaml which will configure your jobs.
Each top-level key is the name of a job, and the value must be a dict whose allowed keys are:
Key | Type | Meaning
---|---|---
application | string | The application and version that you want to run, e.g. DaVinci/v45r4.
options | string or list | The options files to pass to a Run 1 or 2 application.
options | dict | Configures an lbexec-style (Run 3) application.
input | dict | The input to the job. You can use bk_query, transform_ids or job_name (see Input below).
output | string | The output file to be registered in the BookKeeping. NB: must be upper-case.
wg | string | The Working Group that the analysis belongs to. The allowed values are listed here.
inform * | string or list | Email address(es) to inform about the status of the production. (Default empty)
automatically_configure * | boolean | Deduce common options based on the input data. (Default false)
turbo * | boolean | Required to be true when running over Turbo data. (Default false)
root_in_tes * | string | Set the value of RootInTES, e.g. when running over MDST input.
checks * | dict | Additional tasks to perform while testing a job. See Checks below.

* optional keys.
A job can therefore be created like this:
My_job:
  application: DaVinci/v45r4
  wg: WG
  automatically_configure: yes
  turbo: no
  inform:
    - someone@cern.ch
  options:
    - make_ntuple.py
  input:
    bk_query: /some/bookkeeping/path.DST
  output: DVNtuple.root
Instead of defining the same values for every job, you can use the special key defaults.
defaults:
  application: DaVinci/v45r4
  wg: WG
  automatically_configure: yes
  turbo: no
  inform:
    - someone@cern.ch
  options:
    - make_ntuple.py
  output: DVNtuple.root
My_MagUp_job:
  input:
    bk_query: /some/MagUp/bookkeeping/path.DST
    n_test_lfns: 3 # only to be used in special cases, default=1
My_MagDown_job:
  input:
    bk_query: /some/MagDown/bookkeeping/path.DST
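A key given directly in a job's definition should take precedence over the value inherited from defaults, so individual jobs can deviate where needed; for example, to run a single job with a different application version (the job name, version and path below are purely illustrative):
My_2018_job:
  application: DaVinci/v46r5   # overrides the default application for this job only
  input:
    bk_query: /some/2018/bookkeeping/path.DST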
You can use the Jinja templating language to add some Python functionality, e.g. looping over years and polarities.
defaults:
  application: DaVinci/v45r4
  wg: WG
  automatically_configure: yes
  turbo: no
  inform:
    - someone@cern.ch
  options:
    - make_ntuple.py
  output: DVNtuple.root
{%- set datasets = [
  (11, 3500, '14', '21r1'),
  (12, 4000, '14', '21'),
  (15, 6500, '15a', '24r2'),
  (16, 6500, '16', '28r2'),
  (17, 6500, '17', '29r2'),
  (18, 6500, '18', '34'),
]%}
{%- for year, energy, reco, strip in datasets %}
{%- for polarity in ['MagDown', 'MagUp'] %}
My_20{{year}}_{{polarity}}_job:
  input:
    bk_query: /LHCb/Collision{{year}}/Beam{{energy}}GeV-VeloClosed-{{polarity}}/Real Data/Reco{{reco}}/Stripping{{strip}}/90000000/BHADRON.MDST
{%- endfor %}
{%- endfor %}
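For reference, the first iteration of the loops above (year 11, MagDown) renders to the following job definition, with the remaining keys taken from defaults:
My_2011_MagDown_job:
  input:
    bk_query: /LHCb/Collision11/Beam3500GeV-VeloClosed-MagDown/Real Data/Reco14/Stripping21r1/90000000/BHADRON.MDST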
Automatic Configuration#
If automatically_configure is enabled, the following attributes of the Gaudi application will be deduced:
DataType
InputType
Simulation
Lumi (takes the opposite value to Simulation)
CondDBtag and DDDBtag (using LatestGlobalTagByDataType if running over real data)
Enabling automatically_configure also allows the attributes Turbo and RootInTES to be configured from keys in info.yaml.
Important
For Run 3 jobs, automatically_configure does not work yet! Please configure conddb_tag and dddb_tag manually!
Input#
There are three ways to define what input a job should take.
bk_query#
A bookkeeping query will specify a particular part of the Dirac bookkeeping to take input files from. All LFNs in this location will be used as input.
2011_13164062_MagDown:
  input:
    bk_query: /MC/2011/Beam3500GeV-2011-MagDown-Nu2-Pythia8/Sim09f-ReDecay01/Trig0x40760037/Reco14c/Stripping21r1NoPrescalingFlagged/13164062/ALLSTREAMS.DST
  options:
    - MC_options/MC_13164062_ALLSTREAMS_options.py
A bookkeeping query can involve a selection of one or more runs, or run ranges. This is particularly handy if you are only interested in a handful of runs and want to save time and compute.
In addition, you can require specific Data Quality (DQ) flags through the dq_flags configurable. It takes a list of DQ flags, which are either “OK”, “UNCHECKED” or “BAD”. By default, APs only run on data that is flagged “OK”.
2011_13164062_MagDown:
  input:
    bk_query: /MC/2011/Beam3500GeV-2011-MagDown-Nu2-Pythia8/Sim09f-ReDecay01/Trig0x40760037/Reco14c/Stripping21r1NoPrescalingFlagged/13164062/ALLSTREAMS.DST
    runs:
      - 10000 # select run 10000 from the above BK path
      - 10001 # select run 10001 from the above BK path
      - 10003:10005 # select inclusive run range, runs 10003, 10004, 10005 from the above BK path
    dq_flags:
      - OK
      - UNCHECKED
      - BAD
  options:
    - MC_options/MC_13164062_ALLSTREAMS_options.py
The run numbers used in the above are pure examples. Be aware that if you try to select run numbers that don’t exist under the specified BK path, then your BK query will not select anything!
transform_ids#
Alternatively, one can specify the transformation IDs that correspond to the desired data/MC. It might be that a given bk_query corresponds to two different MC samples, so specifying the transformation IDs that correspond to just one of the samples avoids running over undesired MC.
2011_13164062_MagDown:
  input:
    transform_ids:
      - 132268
    filetype: ALLSTREAMS.DST
  options:
    - MC_options/MC_13164062_ALLSTREAMS_options.py
job_name#
It is also possible to set up “job chains”, so one job can take its input as the output of another job.
2015_12163001_MagDown_Strip:
  application: DaVinci/v44r10p5
  input:
    bk_query: /MC/2015/Beam6500GeV-2015-MagDown-Nu1.6-25ns-Pythia8/Sim09e-ReDecay01/Trig0x411400a2/Reco15a/Turbo02/Stripping24r1NoPrescalingFlagged/12163001/ALLSTREAMS.DST
  options:
    - strip_options/strip_ALLSTREAMS_options.py
  output: B02D0KPI.STRIP.DST
2015_12163001_MagDown:
  application: DaVinci/v45r6
  input:
    job_name: 2015_12163001_MagDown_Strip
    filetype: B02D0KPI.STRIP.DST
  options:
    - MC_options/MC_12163001_B02D0KPI.STRIP_options.py
This is useful for situations where you might want to restrip an existing MC sample. In that case, you should first run a stripping job with DaVinci, using the version that was used for the original stripping you are now trying to recreate. Then you can use a job chain to pass this restripped MC to a tupling job that uses the DaVinci version chosen for your analysis. A similar use case is a Run 3 chain of HLT1 -> HLT2 -> DaVinci.
Note
In the above example a version of DaVinci is set explicitly in both the stripping and tupling jobs; however, it is best practice to set the tupling job's DaVinci version as the default using the defaults: special key.
Note
DIRAC does not currently support using the output of one job as an input to multiple other jobs. Therefore, no job should share the same input job with another job.
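For example, a layout like the following sketch is not supported, because both tupling jobs would take their input from the same upstream job (the two tupling job names here are hypothetical):
2015_12163001_MagDown_KK:
  input:
    job_name: 2015_12163001_MagDown_Strip
    filetype: B02D0KPI.STRIP.DST
2015_12163001_MagDown_KPi:
  input:
    job_name: 2015_12163001_MagDown_Strip  # same upstream job as above - not allowed
    filetype: B02D0KPI.STRIP.DST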
Checks#
Optionally, some checks can be defined that are automatically run during the test of a job. You can use the special top-level key checks, for example:
checks:
  histogram_deltaM:
    type: range
    expression: Dstar_M-D0_M
    limits:
      min: 139
      max: 155
    blind_ranges:
      - min: 143
        max: 147
  at_least_50_entries:
    type: num_entries
    tree_pattern: TupleDstToD0pi_D0ToKK/DecayTree
    count: 50
You must then add each check to at least one of the jobs in your production. This is done by specifying a checks option within that job's definition, for example:
My_MagUp_job:
  input:
    bk_query: /some/MagUp/bookkeeping/path.DST
  checks:
    - histogram_deltaM
    - at_least_50_entries
You can also specify the checks under the special key defaults to apply that list of checks to all jobs. If you want to apply additional checks to specific jobs on top of a list of defaults, you can use the extra_checks keyword. The checks defined by extra_checks will be added to the ones defined in checks, for example:
defaults:
  checks:
    - at_least_50_entries
My_MagUp_job:
  input:
    bk_query: /some/MagUp/bookkeeping/path.DST
  extra_checks:
    - histogram_mag_up_only
My_MagDown_job:
  input:
    bk_query: /some/MagDown/bookkeeping/path.DST
  extra_checks:
    - histogram_mag_down_only
In this example, the check at_least_50_entries is applied to both jobs, but the magnet-specific checks histogram_mag_up_only and histogram_mag_down_only are only applied to their corresponding jobs.
The general form of a check definition is:
checks:
  <check_name>:
    type: <check_type>
    <check_options>
where <check_name> is your chosen name for the check, <check_type> is the type of check to be performed and <check_options> are the sub-keys for the chosen check type. For each check, you must specify a type; the available types of checks are explained below:
Check type | Type | Meaning | Additional notes
---|---|---|---
range | dict | Create a 1D histogram of the specified expression and check that it contains events within the specified limits. |
range_bkg_subtracted | dict | Create a 1D histogram of the specified expression and check that it contains events within the specified limits after subtracting background candidates. | The background is assumed to be linearly distributed in the control variable, and no fit is performed. In particular, signal ([m-s/2., m+s/2.]) and background ([m-b-delta, m-b] U [m+b, m+b+delta]) windows have to be defined on a control variable, where m is the expected centre of the signal peak, s the width of the signal window, b the offset of the background windows from m and delta their width.
range_nd | dict | Create a 2D or 3D histogram of the specified expressions and check that it contains events within the specified limits. | If 3 axes are specified, both the full 3D histogram and the set of all 2D histograms are created.
num_entries | dict | Check that the number of entries in the test TTree is at least a specified number. |
num_entries_per_invpb | dict | Check that the number of entries in the test TTree per pb^-1 of luminosity is at least a specified number. | This will only work with real data because Monte Carlo has no luminosity information.
branches_exist | dict | Check that the output ntuple contains a certain list of branches. |
See here for the full list of sub-keys for each check type.
You can specify a tree_pattern for each check. This is a regex that will be compared against all TTrees in the ntuple created by the test, and the check will be run on all matching trees. If not specified, it takes the default value r"(.*/DecayTree)|(.*/MCDecayTree)", i.e. the check will be run on all TTrees whose name ends in DecayTree or MCDecayTree.
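For illustration, a single check could be applied to several tuples at once by using a broader pattern (the tuple naming here is hypothetical):
checks:
  at_least_50_entries_all_tuples:
    type: num_entries
    tree_pattern: Tuple.*/DecayTree   # matches every DecayTree inside a Tuple* directory
    count: 50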
Checks are run when using lb-ap test for a job, e.g. lb-ap test MyProduction MyJobName. Any histograms created by tests can be found in the checks folder within the output directory for that job.
Checks can also be run on the ntuple created in an earlier test using the check command. This saves you from having to re-run DaVinci if you're only changing your checks and not your options files. This command requires the same two arguments as lb-ap test, plus one extra for the path to the ntuple from a previous test. For example:
lb-ap check MyProduction MyJobName local-tests/MyProduction-2021-06-10-18-09-01/output/00012345_00006789_1.D02KK.ROOT
If you want to save the results of these checks to a different folder, you can provide an optional fourth argument with a directory into which to save these files (if not specified, the results will be saved to the same checks folder as described above for lb-ap test).
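For example, to write the check results into a separate directory (my-check-results is a placeholder directory name):
lb-ap check MyProduction MyJobName local-tests/MyProduction-2021-06-10-18-09-01/output/00012345_00006789_1.D02KK.ROOT my-check-results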
Options files#
The options files to be used with your jobs must be placed in the same
folder as info.yaml
or a subdirectory of that folder.
Python environment#
The environment variables set by AnalysisProductions.xenv append the top-level directory of this datapackage to $PYTHONPATH. This means that as long as you include __init__.py files in your folders, you can import them as Python modules without needing to manipulate the environment yourself.
For example, say you have the following directory structure, where utils.py contains some classes or functions that you want to use in make_ntuple.py:
AnalysisProductions/
└── MyAnalysis/
    ├── __init__.py
    ├── info.yaml
    ├── make_ntuple.py
    └── utils.py
In make_ntuple.py you can add:
from MyAnalysis.utils import MyCoolClass, my_fantastic_function
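For completeness, a minimal utils.py providing those names could look like this (the bodies here are just placeholders for your own code):
# MyAnalysis/utils.py
class MyCoolClass:
    """Placeholder class used by make_ntuple.py."""

def my_fantastic_function(config):
    """Placeholder helper; replace with your own logic."""
    return config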
Adding large files#
In some cases you might need a relatively large file to be available for your production (e.g. some kind of MVA model). Adding these to the main Analysis Productions repository is problematic as it permanently increases the size of the repo for all future clones. Even after “deleting” the files they remain in the git history.
To avoid this problem you can use Git LFS to store these files. See the associated GitLab documentation for more details.
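As a rough sketch of the workflow (the file name below is hypothetical; see the GitLab documentation mentioned above for details):
git lfs install                               # set up Git LFS once per clone
git lfs track "MyAnalysis/my_mva_model.root"  # store this file with LFS instead of plain git
git add .gitattributes MyAnalysis/my_mva_model.root
git commit -m "Add MVA model via Git LFS"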
Next steps#
Extensive local and continuous integration testing is available. Please move on to Testing for details on testing your configuration or first read YAML Configuration Sub-Keys for a detailed description of the YAML configuration options.