Overview#

This page describes the interfaces used to submit productions to the LHCbDIRAC production system. This primarily covers the following use cases:

  • Analysis Productions: LHCb members can request that data be processed by committing their application options and an info.yaml file to lhcb-datapkg/AnalysisProductions. These are then tested in CI and, upon merging, submitted to LHCbDIRAC. This system is the primary purpose of DPA-WP2 and the origin of the new interface to the LHCbDIRAC production system.

  • Open Data Productions: This is a work-in-progress system to accept production requests from researchers outside of the collaboration. Application configurations are created using the ntuple wizard, which is then used to create an Analysis Production. The data can then be exported to the CERN open data portal. This is being developed in collaboration with DPA-WP6 and CERN IT.

  • Monte Carlo Productions: Requests for simulated data are created in the LbMCSubmit YAML format. These are then committed to lhcb-simulation/mc-requests, where the CI converts them into the LHCbDIRAC YAML format and runs tests. When the MR is merged, productions are created. This is managed by Simulation’s WP-P.

  • Ad-hoc productions (e.g. sprucing, stripping, …): There is an occasional need for other types of productions which don’t neatly fit into a standard format. To accommodate this, LHCbDIRAC YAML files are written explicitly and then communicated to the LHCbDIRAC operations team using lhcb-dpa/prod-requests. The primary users of this are DPA, however other projects also use the same repository in the DPA GitLab namespace.

The main difference between these use cases is the format that is used as the source of the production. For common types of productions (AnalysisProductions/MC) we have specific high-level formats which allow users to request productions more simply and with a lower risk of bugs.

Regardless of the origin of the production, eventually one or more production requests are created in LHCbDIRAC via a YAML file, which can be tested locally with dirac-production-request-run-local and submitted with dirac-production-request-submit (a minimal sketch of this workflow follows the list below). This YAML format is designed to encapsulate all of the information used by the ProductionManagementSystem in LHCbDIRAC (note this is distinct from the ProductionSystem in vanilla DIRAC) that was previously managed through the LHCbDIRAC web portal. Production requests are used to create one or more transformations in the vanilla DIRAC TransformationSystem. These transformations are then responsible for creating the jobs in the vanilla DIRAC WorkloadManagementSystem. The group responsible for taking the production request and managing the operational aspects of running it varies by type:

  • AnalysisProductions: The operational task of running the productions is handled by DPA-WP2.

  • Open Data Productions: This is currently foreseen to piggy-back on the Analysis Productions system.

  • Simulation Productions: The dedicated simulation productions manager, currently Vladimir Romanovskiy.

  • Ad-hoc Productions: The dedicated data productions manager, currently Raja Nandakumar.
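As a rough sketch of the YAML workflow mentioned above (the two CLI entry points are taken from this page; the file name is a placeholder, and passing the YAML path as the sole argument is an assumption):

    # Minimal sketch: test a production request YAML locally, then submit it.
    import subprocess

    REQUEST_YAML = "my-request.yaml"  # hypothetical file name

    # Run the request locally first (assumed invocation: YAML path as argument)
    subprocess.run(["dirac-production-request-run-local", REQUEST_YAML], check=True)

    # If the local test looks good, create the request(s) in LHCbDIRAC
    subprocess.run(["dirac-production-request-submit", REQUEST_YAML], check=True)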

Disclaimers#

A few parts of this document are written from the perspective of how things will be rather than how things currently are.

LbAnalysisProductions#

This is a legacy component which is used for testing and submitting productions from lhcb-datapkg/AnalysisProductions. It works by impersonating the official GitLab CI runner, however this approach has several limitations. More detail about this approach can be found in the proceedings of CHEP-2018; no further documentation is included here as it should be replaced by the webhook-based system in the near future.

Testing production requests#

All productions (except Ad-hoc Productions) are automatically tested with CI prior to submission. The same CI is then used to submit the production requests to LHCbDIRAC in the “Submitted” state.

Creating a new test#

  • A merge request is opened in GitLab.

  • GitLab sends a webhook to LbAPI for all events in the repository.

  • LbAPI triggers a celery task which starts a pipeline and marks the commit status as pending using the GitLab commit status API (see here).

  • The celery task then looks at the webhook type that the repository was registered as (MC_REQUEST/ANA_PROD) and uses this to determine how to generate the YAML representation of each production in the request.

  • A test for each production is then submitted as a user job to LHCbDIRAC, running as the same user as the one who originally triggered the CI pipeline. This job is submitted to LHCbDIRAC with no input data defined so it can run at any site which has network connectivity to CERN. Each test job does however have a bearer token which allows the job to connect back to LbAPI, retrieve information about the test to run and upload its output in real time while the test runs. This connection is also used to abort jobs early if requested via the website.

  • When all tests have reported their status back to LbAPI, the commit status is updated in GitLab to show whether the tests succeeded (the start of this flow is sketched below).
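A minimal sketch of the start of this flow, assuming a FastAPI app and a celery application; the route, broker, token and check name are all hypothetical:

    import httpx
    from celery import Celery
    from fastapi import FastAPI, Request

    app = FastAPI()
    celery_app = Celery("lbapi", broker="redis://localhost")  # assumed broker

    @celery_app.task
    def start_pipeline(project_id: int, sha: str) -> None:
        # Mark the commit as pending using the GitLab commit status API
        httpx.post(
            f"https://gitlab.cern.ch/api/v4/projects/{project_id}/statuses/{sha}",
            params={"state": "pending", "name": "analysis-productions"},  # hypothetical name
            headers={"PRIVATE-TOKEN": "<token>"},
        )

    @app.post("/webhooks/gitlab")  # hypothetical route
    async def gitlab_webhook(request: Request):
        event = await request.json()
        # Offload the slow work to celery so the webhook responds promptly
        start_pipeline.delay(
            event["project"]["id"],
            event["object_attributes"]["last_commit"]["id"],
        )
        return {"status": "accepted"}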

Ensuring robustness#

To prevent tests from being excessively flaky, various methods are used to ensure test results are typically prompt and accurate:

  • Test jobs should always be marked as successful in LHCbDIRAC, even if the production under test is buggy. Therefore, any failed jobs in LHCbDIRAC which LbAPI thinks should still be running are retried (see https://gitlab.cern.ch/lhcb-dpa/analysis-productions/LbAPI/-/blob/95928217ac796a5f12b85c6a4e7e74a864c59473/src/LbAPI/celery/consistency.py#L33).

  • Jobs which haven’t sent any logging information in over an hour are killed and retried (see here).

  • Tests with input data download the data before starting the test and treat the production as if it were run with the DownloadInputData plugin with an already warm local cache. This avoids network issues affecting the test results or timing, and is especially useful as the input data may be at a different site. The input data is downloaded with a 10-minute timeout using xrdcp and a metalink file, allowing XRootD to download from multiple sources concurrently (sketched below).
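The download step in the last point could look roughly like the following (the helper and argument names are hypothetical; xrdcp’s metalink support and the timeout value are as described above):

    import subprocess

    def stage_input(metalink_path: str, dest_dir: str) -> None:
        # xrdcp understands metalink files, letting XRootD fetch from
        # multiple sources concurrently; enforce the 10-minute timeout
        subprocess.run(
            ["xrdcp", "--force", metalink_path, dest_dir],
            check=True,
            timeout=600,
        )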

Submission of productions#

Merge requests with productions are typically merged by project liaisons. The list of liaisons for each project is managed with a CERN e-group and the approval rule is associated with a GitLab group (e.g. lhcb-simulation/liasons). Upon merging a merge request in GitLab, a webhook is used to trigger the following (sketched after the lists below):

  • Submitting the most recently tested productions to LHCbDIRAC.

  • Opening an issue in the associated GitLab repository to track the production (this is monitored and closed automatically here).

  • Reverting the merge commit to ensure the repository is left in a clean state.

In the case of Analysis Productions:

  • Automatically generated options files are committed and the data package is tagged.

  • The tagged repository is deployed to CVMFS (using the standard data package deployment CI).
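A sketch of the merge-event handling with python-gitlab; the function, the submission helper and the issue text are hypothetical:

    import gitlab

    def on_merge(project_id: int, merge_commit_sha: str, request_id: int) -> None:
        gl = gitlab.Gitlab("https://gitlab.cern.ch", private_token="<token>")
        project = gl.projects.get(project_id)

        # 1. Submit the most recently tested productions to LHCbDIRAC
        # submit_productions(request_id)  # hypothetical helper elsewhere in LbAPI

        # 2. Open an issue to track the production
        project.issues.create({
            "title": f"Production request {request_id}",
            "description": "Tracks the state of the submitted productions.",
        })

        # 3. Revert the merge commit so the repository is left in a clean state
        commit = project.commits.get(merge_commit_sha)
        commit.revert(branch=project.default_branch)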

Software Components#

Deployment#

The various DPA/Simulation managed components are hosted in the lhcb-productions project of CERN’s Platform-as-a-Service (PaaS).

Helm is used to manage the configuration of the project, which is re-deployed for each update to the main branch of the lbap-helm-chart repository. The chart repository contains git submodules for LbAPI, LbAPWeb and LbAnalysisProductions. When pushes are made to the main branch of LbAPI/LbAPWeb/LbAnalysisProductions, the CI in the respective repository updates the submodule reference in the lbap-helm-chart repository, which in turn triggers the lhcb-productions project to be re-deployed (a sketch of this bump step follows).
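The submodule bump performed by each component’s CI amounts to something like the following, expressed with plain git commands; the paths and commit message are illustrative:

    import subprocess

    def bump_submodule(chart_repo: str, submodule: str, new_sha: str) -> None:
        def run(*cmd: str) -> None:
            subprocess.run(cmd, cwd=chart_repo, check=True)

        run("git", "-C", submodule, "fetch", "origin")
        run("git", "-C", submodule, "checkout", new_sha)
        run("git", "add", submodule)
        run("git", "commit", "-m", f"Update {submodule} to {new_sha}")
        # Pushing to main re-triggers deployment of the lhcb-productions project
        run("git", "push", "origin", "main")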

The lbap-helm-chart repository can also be used to run a local instance of LbAPI/LbAPWeb for development. See the repository’s README for details.

LbAPI#

LbAPI provides two main services:

  • A REST-like API using FastAPI: https://lbap.app.cern.ch/docs

  • A celery application which is used for offloading slow operations and triggering scheduled cron-style tasks with celery beat (see the sketch below).
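A minimal sketch of the celery beat side; the task name, schedule and broker are hypothetical:

    from celery import Celery
    from celery.schedules import crontab

    celery_app = Celery("lbapi", broker="redis://localhost")  # assumed broker

    celery_app.conf.beat_schedule = {
        "check-test-consistency": {
            "task": "lbapi.check_consistency",  # hypothetical task name
            "schedule": crontab(minute="*/10"),  # e.g. every 10 minutes
        },
    }

    @celery_app.task(name="lbapi.check_consistency")
    def check_consistency() -> None:
        """Retry failed or stalled test jobs (see 'Ensuring robustness')."""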

The REST-like API serves multiple functions:

  • Collect information in pseudo-realtime about test pipelines

  • Serve information about test pipelines

  • Act as a proxy to the AnalysisProductionsHandler in LHCbDIRAC, making it possible to interact without requiring grid certificates and a local installation of the DIRAC client.

  • Serve a stable API for apd

  • Allow CERN’s GitLab CI workers to access the following resources without requiring any secrets to be set up in the repository configuration (e.g. keytabs):

    • the /stable/ API routes

    • EOS tokens which can be used to access data on LHCb’s EOS instance at CERN

There are three ways to authenticate against LbAPI:

  • Using the application which is registered in the CERN SSO to get a JWT directly.

  • By exchanging the JWT which is provided by the CERN SSO for an LbAPI specific bearer token. This token grants limited access and is intended for interactive CLI use with LbAPLocal and apd.

  • By exchanging CI_JOB_JWT_V2 from CERN GitLab. This token is limited to being used from within CERN and can only be used with the /stable/ API routes (sketched below).
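A sketch of the third flow from within a GitLab CI job; the exchange route, payload and response shape are assumptions (the authoritative reference is https://lbap.app.cern.ch/docs):

    import os

    import httpx

    gitlab_jwt = os.environ["CI_JOB_JWT_V2"]  # injected by GitLab CI

    # Exchange the GitLab-issued JWT for an LbAPI bearer token
    resp = httpx.post(
        "https://lbap.app.cern.ch/token/gitlab",  # hypothetical route
        json={"token": gitlab_jwt},  # assumed payload
    )
    resp.raise_for_status()
    token = resp.json()["access_token"]  # assumed response shape

    # The resulting token can only be used with the /stable/ API routes
    stable = httpx.get(
        "https://lbap.app.cern.ch/stable/",  # plus a versioned route
        headers={"Authorization": f"Bearer {token}"},
    )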

LbAPWeb#

LbAPWeb is a static website which is built using React and NextJS and written in TypeScript. It interacts directly with the CERN SSO and LbAPI from within the user’s browser. This allows anyone to develop LbAPWeb locally without requiring any special access as the standard CERN SSO flow works from within a local instance of LbAPWeb.

LbAPLocal#

LbAPLocal provides the lb-ap CLI which is documented here.

LbAPDoc#

LbAPDoc is the source of the Analysis Productions documentation at https://lhcb-ap.docs.cern.ch/ and is hosted on GitLab pages at CERN.

apd#

apd is the Python package used by analysts to access the output of Analysis Productions. It retrieves its information from LbAPI via the stable API described below.

TODO
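A sketch of typical usage, based on the documented AnalysisData interface; the working group, analysis name and tag filters are placeholders:

    from apd import AnalysisData

    # Look up an "analysis" (these live in a per-WG namespace, see below)
    datasets = AnalysisData("wg", "my_analysis")

    # Filter the samples by their tags to obtain a list of file locations
    paths = datasets(datatype="2018", polarity="magdown")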

The stable API#

The API routes under /stable/ are used by apd and are versioned, with the intention of keeping the output of the API stable to ease data preservation.

Eventually it will cease to be possible for old software to access even the stable API endpoints (HTTPS certificates, TLS versions, authentication, …), at which point apd is designed to support running without any network access. This is done by allowing an environment variable to be set which points to a static snapshot of the required API responses. This mechanism also makes it possible to use apd in a fully offline mode in situations where this might be desirable.
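A sketch of how such a fallback could be wired up; the environment variable name and snapshot layout are assumptions:

    import json
    import os
    from pathlib import Path

    def get_stable_response(route: str) -> dict:
        # If a snapshot directory is configured, run fully offline
        snapshot_dir = os.environ.get("APD_METADATA_CACHE")  # name is an assumption
        if snapshot_dir is not None:
            # Responses were dumped to disk, e.g. one JSON file per route
            return json.loads((Path(snapshot_dir) / f"{route}.json").read_text())
        ...  # otherwise query the stable API over HTTPS as usual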

Analysis Productions specificities#

For Analysis Productions there is an additional set of components and infrastructure for getting enhanced information about productions after they have run.

AnalysisProductions Handler/Client/DB in LHCbDIRAC#

These components provide analysts with additional metadata and functionality for Analysis Productions, such as:

  • Aggregating the information from the ProductionManagementSystem, TransformationSystem, Bookkeeping and DFC.

  • Grouping production requests together into an “analysis” (in a per-WG namespace).

  • Allowing key-value pairs (“tags”) to be assigned to samples.

Some terminology:

  • Request: Associated with the original ProductionRequest which was generated via the lhcb-datapkg/AnalysisProductions CI.

  • Sample: An additional abstraction which has a foreign key relationship with the “request”, is assigned to an “analysis”, and holds analysis-specific metadata.

  • Archival: “Samples” have a validity_end field in the DB. When validity_end is not null and the current time is after validity_end, the sample is treated as archived.

  • Published: In principle samples can be assigned to LHCb publication numbers however this functionality has not yet been exposed.

Additional technicalities:

  • A first “request” and “sample” are created automatically after productions are submitted via the lhcb-datapkg/AnalysisProductions CI.

  • Additional samples can be assigned to a request using the web interface and LbAPI.

  • All strings in the AnalysisProductionsHandler are treated as case-insensitive with lower case being preferred.

  • The APSyncAgent in LHCbDIRAC is responsible for keeping some information in the AnalysisProductionsDB up to date.

  • Most information in the AnalysisProductionsDB (e.g. samples, custom tags) is never deleted; instead two columns (validity_start/validity_end) are used to enable “time travel” to see a previous state (see the sketch after this list).
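A sketch of the time-travel pattern with SQLAlchemy: rows are never deleted, so the state at time T is the set of rows whose validity window contains T. The column names follow the text above, while the table definition itself is illustrative:

    from datetime import datetime, timezone

    import sqlalchemy as sa

    samples = sa.table(
        "samples",
        sa.column("validity_start", sa.DateTime),
        sa.column("validity_end", sa.DateTime),
    )

    def valid_at(at: datetime) -> sa.Select:
        # Rows whose validity window contains the requested time
        return sa.select(samples).where(
            samples.c.validity_start <= at,
            sa.or_(samples.c.validity_end.is_(None), samples.c.validity_end > at),
        )

    # A sample is archived once validity_end is non-null and in the past
    archived = sa.select(samples).where(
        samples.c.validity_end.is_not(None),
        samples.c.validity_end <= datetime.now(timezone.utc),
    )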

Data lifecycle#

All AnalysisProductions output data is stored with a single replica at CERN in the CERN-DST and CERN_MC-DST storage elements. This is done to strongly discourage analysts from duplicating the data to other LHCb storage resources (eosuser, eoslhcb, AFS). In particular:

  • Having data split across multiple sites vastly increases the risk of being affected by downtimes.

  • Most tools used by analysts cannot transparently handle the concept of having multiple sources for the same file (with failover).

  • Data access is often affected by latency, and having all of the data in a single location makes the effects of latency more predictable and easier to debug.

apd provides mechanisms for analysts to transparently cache data on local or institute specific storage if desired.

When all samples for a given request have been archived, the data is a candidate for deletion from disk. No replicas are kept on tape as the output of Analysis Productions is typically iterated upon, with previous versions having little long-term value. All metadata about the creation of a request’s output files is preserved indefinitely.

When a sample has been used in a publication, it is proposed to archive it with two disk replicas for permanent storage.

Ownership#

Owners are assigned to an “analysis” which grants them permission to manage samples. Multiple sets of people can therefore own the same underlying production request.