Accessing Output#

Once a production has completed, the output will be replicated to the CERN EOS Storage Element (SE).

Currently the easiest way to access production output directly is via XRootD. Several methods are available for accessing the XRootD PFNs (physical file names).
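
For example, a PFN can be opened directly with ROOT over XRootD. The sketch below reuses one of the PFNs returned by apd further down this page, and assumes valid grid or Kerberos credentials and a ROOT installation with XRootD support:

import ROOT

# One of the PFNs returned by apd in the example later on this page
pfn = "root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/MC/2012/BSDSTAUNU.ROOT/00173031/0000/00173031_00000001_1.bsdstaunu.root"

# TFile.Open understands XRootD URLs, so the file is read remotely from EOS
f = ROOT.TFile.Open(pfn)
f.ls()
f.Close()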

There are two components in the system that allow gathering information from the Analysis Productions:

  • A web application that allows viewing the status of the Analysis Productions and related pipelines. It also allows enriching the information about the produced datasets with meaningful tags, and archiving unused datasets.

  • A client Python package (apd) that allows querying the available Analysis Productions and retrieving the physical file names (PFNs) of the ntuples produced.

The Analysis Productions web application#

Information pertaining to available productions and datasets, grouped by analysis, is available on the LHCb Analysis Productions website.

It is possible to view the samples as a list, with their state:

../_images/list_datasets.png

Or to have a tree-view that shows the number of samples after grouping per tag:

../_images/grouped_tag.png

Tags are simple key/value mappings attached to samples produced by the Analysis Productions. Some of the tags are automatically generated, e.g.:

  • config: either data or MC

  • datatype: The LHCb datatype, e.g. 2018

  • polarity: magdown, magup

  • eventtype: in the case of MC, the event type of the simulated data

Other tags should be added in the web interface to describe the meaning of the ntuples produced.
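
These tags later double as keyword filters in the apd package described below. A minimal sketch, assuming the same working group and analysis names as in the example further down this page, with illustrative tag values:

import apd

# Filter samples using the automatically generated tags (illustrative values)
datasets = apd.get_analysis_data("sl", "rds_hadronic")
mc_magdown_2018 = datasets(config="mc", datatype="2018", polarity="magdown")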

TODO: Explain the tagging system as well as the check on the number of samples here.

The apd python package#

The apd python package is a programmatic interface to the Analysis Productions database that allows retrieving information about the samples produced. It queries a REST endpoint provided by the web application and caches the data locally.

Authentication#

As access to LHCb data is restricted, one must first log in to obtain a token that gives access to the REST endpoint:

$ apd-login
CERN SINGLE SIGN-ON

On your tablet, phone or computer, go to:
https://auth.cern.ch/auth/realms/cern/device
and enter the following code:
CODE-CODE

You may also open the following link directly and follow the instructions:
https://auth.cern.ch/auth/realms/cern/device?user_code=CODE-CODE

Waiting for login...
Login successful as xxxxx

After logging in via the CERN SSO in a browser, an access token is granted to the session.

Finding PFNs#

$ python
>>> import apd
>>> datasets = apd.get_analysis_data("sl", "rds_hadronic", polarity=["magdown", "magup"])
>>> paths = datasets(eventtype=13563002, datatype=2012, sign="rs")
>>> paths
['root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/MC/2012/BSDSTAUNU.ROOT/00173031/0000/00173031_00000001_1.bsdstaunu.root', 'root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/MC/2012/BSDSTAUNU.ROOT/00173023/0000/00173023_00000001_1.bsdstaunu.root']
>>> import ROOT
>>> rdf = ROOT.RDataFrame("SignalTuple/DecayTree", paths)

The list of paths obtained can then be used, for example, to create a ROOT.RDataFrame to analyze the data.
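
For instance, continuing the session above, a quick histogram of the B_M branch (the same branch and binning used in the Snakemake example below) can be filled and drawn:

>>> # Fill a histogram lazily from the RDataFrame and draw it
>>> h = rdf.Histo1D(("B_M", "B_M", 200, 0., 25e3), "B_M")
>>> h.Draw()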

N.B. To avoid an analysis accidentally running twice over the same file processed in different ways, apd raises an exception if the combination of tags matches several samples:

ValueError: Error loading data: 2 problem(s) found
{'eventtype': '13563002', 'datatype': '2012', 'polarity': 'magdown'}: 2 samples for the same configuration found, this is ambiguous:
    {'config': 'mc', 'polarity': 'magdown', 'eventtype': '13563002', 'datatype': '2012', 'sign': 'rs', 'version': 'v0r0p4882558', 'name': '2012_magdown_mc_bsdstaunu', 'state': 'ready'}
    {'config': 'mc', 'polarity': 'magdown', 'eventtype': '13563002', 'datatype': '2012', 'sign': 'ws', 'version': 'v0r0p4882558', 'name': '2012_magdown_mc_bsdstaunu_ws', 'state': 'ready'}
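
Adding a tag that distinguishes the two samples, as done with sign in the example above, resolves the ambiguity:

>>> # Specifying the sign tag selects exactly one of the two samples
>>> paths_rs = datasets(eventtype=13563002, datatype=2012, sign="rs")
>>> paths_ws = datasets(eventtype=13563002, datatype=2012, sign="ws")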

Snakemake integration#

apd allows querying the Analysis Productions database for PFNs, but it also provides an interface that is easy to integrate into Snakemake workflows. The following code illustrates a simple workflow to gather data from different samples:

# Import the apd tools, versions customized for Snakemake
from apd.snakemake import get_analysis_data, remote

# Get the APD dataset for my analysis
dataset = get_analysis_data("sl", "rds_hadronic")

# Parameters for the datasets to be analyzed
SHARED_AREA = "root://eoslhcb.cern.ch//eos/lhcb/user/l/lben/test"
CONFIG = "mc"
DATATYPE = "2012"
EVENTTYPE = ["13266069", "11266009"]
POLARITY = [ "magdown", "magup" ]

# main rule for this workflow, which requires building the file bmass.root
# in EOS. This uses the XRootD API to query information about the
# remote files, hence the need to wrap the path (str) with the remote method
# which returns a Snakemake XRootD remote
rule all:
    input:
        remote(f"{SHARED_AREA}/bmass.root")

# templated rule to produce a ROOT file with the histogram for B_M in a
# specific sample, notice that:
#  - the input uses the dataset object and specifies the wildcards to use
#  - the output is local in this case; we could wrap it with temp() if we
#    want it cleared after completion of the workflow
#
rule create_histo:
    input:
        data=lambda w: dataset(datatype=w.datatype, eventtype=w.eventtype, polarity=w.polarity)
    output: f"bmass_{{config}}_{{datatype}}_{{eventtype}}_{{polarity}}.root"
    run:
        # Embedded script, for demo only, not a good idea in general
        import ROOT
        inputfiles = list(input)
        print(f"==== Reading {inputfiles}")
        f = ROOT.TFile.Open(output[0], "RECREATE")
        rdf = ROOT.RDataFrame("SignalTuple/DecayTree", inputfiles)
        hname = f"B_M_{wildcards.config}_{wildcards.datatype}_{wildcards.eventtype}_{wildcards.polarity}"
        h = rdf.Histo1D((hname, hname, 200, 0., 25e3), "B_M")
        h.Write()
        f.Close()
        print(f"==== Created {output[0]}")

# Rule to gather the files produced by create_histo
# - the Snakemake expand() method can be used to create all the combinations
#   of all parameters
# - the remote() method is needed to have the final file created remotely via the XRootD
#   interface. Notice that in this case we request a token to WRITE the file
#   by specifying "rw=True"
rule gather:
    input:
        expand("bmass_{config}_{datatype}_{eventtype}_{polarity}.root",
               config=CONFIG, datatype=DATATYPE,
               eventtype=EVENTTYPE,
               polarity=POLARITY)
    output:
        remote(f"{SHARED_AREA}/bmass.root", rw=True)
    shell:
        "hadd {output} {input}"

N.B.
  • the AnalysisData object provided by the Snakemake interface returns XRootD remotes for the files, to make the integration easier.

  • When creating remote files, it is necessary to get an XRootD remote with the appropriate credentials (specify rw=True for the output). This is needed when using EOS tokens.

GitLab CI integration#

Running workflows and accessing data from GitLab CI requires credentials. A GitLab CI job can obtain EOS tokens for data access using the apd package.

GitLab CI jobs are given a JSON Web Token by the GitLab server (via the CI_JOB_JWT_V2 variable), and apd can use this to authenticate itself to the Analysis Productions server, which will provide it with EOS tokens for data access for the directories (and with the mode) specified in:

https://lhcb-analysis-productions.web.cern.ch/settings/

From Python code, the apd.auth function can be invoked to wrap an XRootD file URL with a read-only token, and apd.authw with a read-write token.
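
For example (a minimal sketch, reusing the shared EOS area and output file from the Snakemake example above):

import apd

# Wrap an XRootD URL with a read-only EOS token
url = apd.auth("root://eoslhcb.cern.ch//eos/lhcb/user/l/lben/test/bmass.root")

# Request a read-write token instead, e.g. before creating the file
url_rw = apd.authw("root://eoslhcb.cern.ch//eos/lhcb/user/l/lben/test/bmass.root")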

The Snakemake interface automatically wraps the input and output files with the appropriate tokens.

To enable this for a specific project, modify .gitlab-ci.yml in the following way:

before_script:
  - source /cvmfs/lhcb.cern.ch/lib/LbEnv
  - eval "$(apd-login)"

Accessing LHCb data from GitLab CI within Gaudi#

apd can be used to access LHCb data that is stored at CERN from within GitLab CI jobs. To do this, you must give Gaudi one or more LFNs as input data and provide the pool_xml_catalog.xml file that maps each LFN to a PFN. This works both locally and on GitLab CI. When running locally, your Kerberos credentials or grid proxy will be used as normal. When running in GitLab CI, you can use apd to inject short-lived access tokens into the pool_xml_catalog.xml file.

This can be set up as follows:

  1. Locally generate a pool_xml_catalog.xml file and commit it to your repository

    $ lb-dirac dirac-bookkeeping-genXMLCatalog -l /lhcb/MC/Dev/DIGI/00205444/0000/00205444_00000231_1.digi
    
  2. Adjust the input_files option of your job to be the LFN and set xml_file_catalog, i.e.

     input_files:
    -    - root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/MC/Dev/DIGI/00205444/0000/00205444_00000231_1.digi
    +    - LFN:/lhcb/MC/Dev/DIGI/00205444/0000/00205444_00000231_1.digi
    +xml_file_catalog: pool_xml_catalog.xml
    
  3. Adjust your .gitlab-ci.yml file to inject the tokens after sourcing LbEnv

     before_script:
      - source /cvmfs/lhcb.cern.ch/lib/LbEnv
      - eval "$(apd-login)"
    + - apd-add-tokens-pool-xml pool_xml_catalog.xml