Accessing Output#

Note

This documentation is incomplete and rapidly evolving as new features are developed. If you find anything out of date or unclear - please let us know!

Once a production has completed and been marked ready, the output will be stored at CERN on LHCb EOS.

Currently the easiest way to access production output directly is by using XRootD. There are various methods available to access the XRootD PFNs:

  • A web application that allows viewing the status of analysis productions, test pipelines, and sample metadata. It also allows modification of the produced sample metadata, via meaningful tags and assignment of publication numbers. Data management functions are additionally provided to enable a more efficient, fair use of LHCb’s storage.

  • A client Python package, apd, that allows querying the information about available analysis productions and retrieving the physical file names (PFNs) of the ntuples produced.

The web application#

All productions and datasets produced by the Analysis Productions system, grouped by analysis, are available on the LHCb Analysis Productions website. It is possible to view the samples as a list, together with their state:

[Image: list_datasets.png — samples shown as a list with their state]

Or to have a tree-view that shows the number of samples after grouping per tag:

[Image: grouped_tag.png — tree view of samples grouped per tag]

Tags#

Tags are simple key/value mappings attached to the samples produced by the Analysis Productions. Some tags are automatically generated, e.g.:

  • config: either data or mc

  • datatype: the LHCb datatype, e.g. 2018

  • polarity: magdown or magup

  • eventtype: in the case of MC, the event type of the simulated sample

Tags should be added in the web application to describe the meaning of the ntuples produced, ideally such that any combination of tags uniquely refers to one sample.

  • Navigate to your analysis.

  • Click the Select... button to begin selecting samples.

  • Select samples by clicking on individual rows in the table, and/or using the search functionality and buttons above the table.

  • Click Finalise selection to see how the metadata of selected samples may be modified.
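For reference, once tags have been set, the full tag mapping of a sample looks like the following (this example is taken from the apd error message shown further down this page; sign being a tag added via the web application):

{'config': 'mc', 'polarity': 'magdown', 'eventtype': '13563002', 'datatype': '2012', 'sign': 'rs', 'version': 'v0r0p4882558', 'name': '2012_magdown_mc_bsdstaunu', 'state': 'ready'}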

The apd python package#

The apd Python package is a programmatic interface to the Analysis Productions database that allows retrieving information about the samples produced. It queries a REST endpoint provided by the web application and caches the data locally.
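The package is part of the standard LHCb environment (it is available after sourcing LbEnv, as in the GitLab CI examples below). It is also published on PyPI, so outside of LbEnv it can usually be installed with pip (the package name apd is assumed here):

$ pip install apd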

Authentication#

As access to LHCb data is restricted, one must first log in to get a token that gives access to the REST endpoint:

$ apd-login
CERN SINGLE SIGN-ON

On your tablet, phone or computer, go to:
https://auth.cern.ch/auth/realms/cern/device
and enter the following code:
CODE-CODE

You may also open the following link directly and follow the instructions:
https://auth.cern.ch/auth/realms/cern/device?user_code=CODE-CODE

Waiting for login...
Login successful as xxxxx

After authenticating via the CERN SSO in a browser, an access token is granted to the session.

Finding PFNs#

$ python
>>> import ROOT
>>> import apd
>>> datasets = apd.get_analysis_data("sl", "rds_hadronic", polarity=["magdown", "magup"])
>>> paths = datasets(eventtype=13563002, datatype=2012, sign="rs")
>>> paths
['root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/MC/2012/BSDSTAUNU.ROOT/00173031/0000/00173031_00000001_1.bsdstaunu.root', 'root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/MC/2012/BSDSTAUNU.ROOT/00173023/0000/00173023_00000001_1.bsdstaunu.root']
>>> rdf = ROOT.RDataFrame("SignalTuple/DecayTree", paths)

The list of paths obtained can then, for example, be used to create a ROOT.RDataFrame to analyze the data, as in the last line of the example above.
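The same query can also be written using the apd.AnalysisData class, which appears again in the GitLab CI example below; this sketch assumes the class constructor accepts the same default tag arguments as get_analysis_data:

>>> import apd
>>> # Default tags given at construction, the remaining tags when calling the object
>>> datasets = apd.AnalysisData("sl", "rds_hadronic", polarity=["magdown", "magup"])
>>> paths = datasets(eventtype=13563002, datatype=2012, sign="rs")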

N.B. To avoid an analysis unintentionally running twice over the same file processed in different ways, apd raises an exception if the combination of tags matches several samples:

ValueError: Error loading data: 2 problem(s) found
{'eventtype': '13563002', 'datatype': '2012', 'polarity': 'magdown'}: 2 samples for the same configuration found, this is ambiguous:
    {'config': 'mc', 'polarity': 'magdown', 'eventtype': '13563002', 'datatype': '2012', 'sign': 'rs', 'version': 'v0r0p4882558', 'name': '2012_magdown_mc_bsdstaunu', 'state': 'ready'}
    {'config': 'mc', 'polarity': 'magdown', 'eventtype': '13563002', 'datatype': '2012', 'sign': 'ws', 'version': 'v0r0p4882558', 'name': '2012_magdown_mc_bsdstaunu_ws', 'state': 'ready'}
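In this case, adding the sign tag, which is the only tag differing between the two samples listed above, makes the selection unambiguous again:

>>> paths = datasets(eventtype=13563002, datatype=2012, polarity="magdown", sign="rs")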

Snakemake integration#

apd allows querying the Analysis Productions database for PFNs, and it also provides an interface that is easy to integrate into Snakemake workflows. The following code illustrates a simple workflow gathering data from different samples:

# Import the apd tools, versions customized for Snakemake
from apd.snakemake import get_analysis_data, remote

# Get the APD dataset for my analysis
dataset = get_analysis_data("sl", "rds_hadronic")

# Parameters for the datasets to be analyzed
SHARED_AREA = "root://eoslhcb.cern.ch//eos/lhcb/user/l/lben/test"
CONFIG = "mc"
DATATYPE = "2012"
EVENTTYPE = ["13266069", "11266009"]
POLARITY = [ "magdown", "magup" ]

# main rule for this workflow, which requires building the file bmass.root
# in EOS. This uses the XRootD API to query information about the
# remote files, hence the need to wrap the path (str) with the remote method
# which returns a Snakemake XRootD remote
rule all:
    input:
        remote(f"{SHARED_AREA}/bmass.root")

# Templated rule to produce a ROOT file with the histogram of B_M for a
# specific sample. Notice that:
#  - the input uses the dataset object and specifies the wildcards to use
#  - the output is local in this case; we could wrap it with temp() if we
#    want it removed after completion of the workflow
#
rule create_histo:
    input:
        data=lambda w: dataset(datatype=w.datatype, eventtype=w.eventtype, polarity=w.polarity)
    output: f"bmass_{{config}}_{{datatype}}_{{eventtype}}_{{polarity}}.root"
    run:
        # Embedded script, for demo only, not a good idea in general
        import ROOT
        inputfiles = list(input)
        print(f"==== Reading {inputfiles}")
        f = ROOT.TFile.Open(output[0], "RECREATE")
        rdf = ROOT.RDataFrame("SignalTuple/DecayTree", inputfiles)
        hname = f"B_M_{wildcards.config}_{wildcards.datatype}_{wildcards.eventtype}_{wildcards.polarity}"
        h = rdf.Histo1D((hname, hname, 200, 0., 25e3), "B_M")
        h.Write()
        f.Close()
        print(f"==== Created {output[0]}")

# Rule to gather the files produced by create_histo
# - the Snakemake expand() method can be used to create all the combinations
#   of all parameters
# - the remote() method is needed to have the final file created remotely via the XRootD
#   interface. Notice that in this case we request a token to WRITE the file
#   by specifying "rw=True"
rule gather:
        input:
            expand("bmass_{config}_{datatype}_{eventtype}_{polarity}.root",
                    config=CONFIG, datatype=DATATYPE,
                    eventtype=EVENTTYPE,
                    polarity=POLARITY)
        output:
            remote(f"{SHARED_AREA}/bmass.root", rw=True)
        shell:
            "hadd {output} {input}"
N.B.

  • The AnalysisData object returned by the Snakemake interface provides XRootD remotes for the files, to make the integration easier.

  • When creating remote files, it is necessary to get an XRootD remote with the appropriate credentials (specify rw=True for the output). This is needed when using EOS tokens.
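The workflow can then be launched with the usual snakemake command line, assuming snakemake and the XRootD bindings are available in your environment, for example:

$ snakemake --cores 4 all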

GitLab CI integration#

Accessing data on EOS requires credentials, typically provided in the form of Grid certificates and proxies (lhcb-proxy-init) or Kerberos tickets (kinit). In GitLab CI, to work around the need to embed long-lived tickets or grid proxies in CI variables, it is possible to use EOS tokens for data access instead, via the apd package.

GitLab CI jobs can be configured with the necessary credentials by adding the following to your .gitlab-ci.yml:

default:
  id_tokens:
    LBAP_CI_JOB_JWT:  # generates a JWT and assigns it to the env var $LBAP_CI_JOB_JWT
      aud: https://lbap.app.cern.ch
  before_script:
    - source /cvmfs/lhcb.cern.ch/lib/LbEnv
    - eval "$(apd-login)"  # authenticate with Analysis Productions using this JWT

The JSON Web Token (JWT) is generated by GitLab and made available inside the CI job in the environment variable LBAP_CI_JOB_JWT. The before_script: commands use this JWT to authenticate to the Analysis Productions API, which issues an EOS token in return.

When running apd inside a job with this environment variable set, the EOS tokens are made available and issued under the identity of the person who triggered the CI job. As these tokens grant access to files on EOS under your identity, any user triggering a CI workflow first needs to authorise the repository and the paths that can be accessed at the following link:

https://lhcb-analysis-productions.web.cern.ch/settings/
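With the default: section shown above in place, every job in the pipeline inherits the authenticated environment. As a sketch, a job running an analysis script could look like the following (the job name and script are placeholders):

make_plots:
  script:
    - python make_plots.py  # placeholder script, e.g. using apd as shown below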

To actually use these EOS tokens to access files, you’ll need to import apd directly and use the apd.auth function to add them correctly to each PFN, like this:

import apd
import numpy as np
import uproot
analysis_data = apd.AnalysisData("charm", "my_very_charming_analysis")
pfns = [
  {apd.auth(pfn): "MyWonderfulDecays/DecayTree"}
  for pfn in analysis_data(name="2016_magup", version="vXX")
]

df = uproot.concatenate(pfns, library="pd")

print(
  f"The mean of my Superb particle mass distribution is {np.mean(df.SuperbParticle_MASS)}"
)

The apd.auth function can be called to wrap an XRootD file URL with a read-only token, and apd.authw with a read-write token, should you need it (write access needs to be specifically granted at the link above).
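As a sketch, writing a remote file via XRootD could then look like the following; the output path here is a placeholder and write access to it must have been granted at the settings page above:

import apd
import ROOT

# Placeholder output PFN on EOS: replace with a path you have write access to
output_pfn = "root://eoslhcb.cern.ch//eos/lhcb/user/x/xyz/histograms.root"

# apd.authw wraps the URL with a read-write token
f = ROOT.TFile.Open(apd.authw(output_pfn), "RECREATE")
f.Close()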

The apd.snakemake interface automatically wraps the input and output files with the appropriate tokens.

Accessing LHCb data from GitLab CI within Gaudi#

apd can be used to access LHCb data stored at CERN from within GitLab CI jobs. To do this you must give Gaudi one or more LFNs as input data and provide the pool_xml_catalog.xml file that maps each LFN to a PFN. This works both locally and in GitLab CI: when running locally your Kerberos credentials or grid proxy are used as normal, while in GitLab CI you can use apd to inject short-lived access tokens into the pool_xml_catalog.xml file.

This can be set up as follows:

  1. Locally generate a pool_xml_catalog.xml file and commit it to your repository

    $ lb-dirac dirac-bookkeeping-genXMLCatalog -l /lhcb/MC/Dev/DIGI/00205444/0000/00205444_00000231_1.digi
    
  2. Adjust the input_files option of your job to be the LFN and set xml_file_catalog, i.e.

     input_files:
    -    - root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/MC/Dev/DIGI/00205444/0000/00205444_00000231_1.digi
    +    - LFN:/lhcb/MC/Dev/DIGI/00205444/0000/00205444_00000231_1.digi
    +xml_file_catalog: pool_xml_catalog.xml
    
  3. Adjust your .gitlab-ci.yml file to inject the tokens after sourcing LbEnv.

     id_tokens:
       LBAP_CI_JOB_JWT:
         aud: https://lbap.app.cern.ch
    
     before_script:
      - source /cvmfs/lhcb.cern.ch/lib/LbEnv
      - eval "$(apd-login)"
    + - apd-add-tokens-pool-xml pool_xml_catalog.xml
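After these changes, the relevant part of .gitlab-ci.yml (following the default: layout from the earlier example) reads:

default:
  id_tokens:
    LBAP_CI_JOB_JWT:
      aud: https://lbap.app.cern.ch
  before_script:
    - source /cvmfs/lhcb.cern.ch/lib/LbEnv
    - eval "$(apd-login)"
    - apd-add-tokens-pool-xml pool_xml_catalog.xml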