Overview#
This page describes the interface which is used to submit productions to the LHCbDIRAC production system. This primarily includes these use cases:
Analysis Productions: LHCb members can request that data be processed by committing their application options and an AnalysisProductions info.yaml file to lhcb-datapkg/AnalysisProductions. These are then tested in CI and, upon merging, submitted to LHCbDIRAC. This system is the primary purpose of DPA-WP2 and the origin of the new interface to the LHCbDIRAC production system.
Open Data Productions: This is a work-in-progress system to accept production requests from researchers outside of the collaboration. Application configurations are created using the ntuple wizard and are then used to create an Analysis Production. The data can then be exported to the CERN open data portal. This is being developed in collaboration with DPA-WP6 and CERN IT.
Monte Carlo Productions: Requests for simulated data are created in the LbMCSubmit YAML format. These are then committed to lhcb-simulation/mc-requests, where the CI converts them into the LHCbDIRAC YAML format and runs tests. When the MR is merged, productions are created. This is managed by Simulation’s WP-P.
Ad-hoc productions (e.g. sprucing, stripping, …): There is an occasional need for other types of productions which don’t neatly fit into a standard format. To accommodate this, LHCbDIRAC YAML files are written explicitly and then communicated to the LHCbDIRAC operations team using lhcb-dpa/prod-requests. The primary users of this are DPA; however, other projects use the same repository in the DPA GitLab namespace.
The main difference between these use cases is the format that is used as the source of the production. For common types of productions (AnalysisProductions/MC) we have specific high-level formats which allow users to request productions more simply and with a lower risk of bugs.
Regardless of the origin of the production, eventually one or more production requests are created in LHCbDIRAC via a YAML file, which can be tested locally with dirac-production-request-run-local and submitted with dirac-production-request-submit. This YAML format is designed to encapsulate all of the information which is used by the ProductionManagementSystem in LHCbDIRAC (note this is distinct from the ProductionSystem in vanilla DIRAC) that was previously managed through the LHCbDIRAC web portal. Production requests are used to create one or more transformations in the vanilla DIRAC TransformationSystem. These transformations are then responsible for creating the jobs in the vanilla DIRAC WorkloadManagementSystem. The group responsible for taking the production request and managing the operational aspects of running it varies by the type:
AnalysisProductions: The operational task of running the productions is handled by DPA-WP2.
Open Data Productions: This is currently foreseen to piggy-back on the Analysis Productions system.
Simulation Productions: The dedicated simulation productions manager, currently Vladimir Romanovskiy.
Ad-hoc Productions: The dedicated data productions manager, currently Raja Nandakumar.
Testing production requests#
All productions (except Ad-hoc Productions) are automatically tested with CI prior to submission. The same CI is then used to submit the production requests in LHCbDIRAC in the “Submitted” state.
Creating a new test#
A merge request is opened in GitLab.
GitLab sends a webhook to LbAPI for all events in the repository.
LbAPI triggers a celery task which starts a pipeline and marks the commit status as pending using the GitLab commit status API [1]. (See here.)
The celery task then looks at the webhook type that the repository was registered as (MC_REQUEST/ANA_PROD) and uses this to determine how to generate the YAML representation of each production in the request.
A test for each production is then submitted as a user job to LHCbDIRAC, running as the same user as the one who originally triggered the CI pipeline. This job is submitted to LHCbDIRAC with no input data defined so it can run at any site which has network connectivity to CERN. Each test job does, however, have a bearer token which allows the job to connect back to LbAPI, retrieve information about the test to run and upload its output in real time while the test runs. This connection is also used to abort jobs early if requested via the website.
When all tests have reported their status back to LbAPI, the commit status is updated in GitLab to show whether the tests succeeded.
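As an illustration of the "mark the commit status as pending" step, the sketch below builds the HTTP request for GitLab's commit status endpoint (POST /projects/:id/statuses/:sha). It is not LbAPI's actual code: the function name, the status name "lbapi-tests" and the use of a private token header are assumptions made for the example, and the request is only constructed, never sent.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the REST API of CERN's GitLab instance.
GITLAB_API = "https://gitlab.cern.ch/api/v4"


def build_commit_status_request(project_path, sha, state, token, name="lbapi-tests"):
    """Build (but do not send) a request that sets the status of a commit.

    `name` and the authentication header are illustrative choices only.
    """
    allowed = {"pending", "running", "success", "failed", "canceled"}
    if state not in allowed:
        raise ValueError(f"unsupported state: {state!r}")
    # GitLab accepts the URL-encoded "namespace/project" path as the project ID.
    project = urllib.parse.quote(project_path, safe="")
    url = f"{GITLAB_API}/projects/{project}/statuses/{sha}"
    body = json.dumps({"state": state, "name": name}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
    )


request = build_commit_status_request(
    "lhcb-datapkg/AnalysisProductions", "abc123", "pending", token="<token>"
)
```

Once all tests have reported back, the same request shape with state "success" or "failed" would update the status shown on the merge request.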
Ensuring robustness#
To prevent tests from being excessively flaky, various methods are used to ensure test results are typically prompt and accurate:
Test jobs should always be marked as successful in LHCbDIRAC even if the production under test is buggy. Therefore, any failed jobs in LHCbDIRAC which LbAPI thinks should still be running are retried (see here: https://gitlab.cern.ch/lhcb-dpa/analysis-productions/LbAPI/-/blob/95928217ac796a5f12b85c6a4e7e74a864c59473/src/LbAPI/celery/consistency.py#L33).
Jobs which haven’t sent any logging information in over an hour are killed and retried (see here).
Tests with input data download the data before starting the test and treat the production as if it were run with the DownloadInputData plugin with an already warm local cache. This avoids network issues affecting the test results or timing and is especially useful as the input data may be at a different site. The input data is downloaded with a 10 minute timeout using xrdcp and a metalink file to allow XRootD to download from multiple sources concurrently.
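The first two rules above amount to a periodic consistency check. The following is a minimal sketch of that logic, not the implementation in LbAPI: the dictionary keys, the status string "Failed" and the function name are hypothetical, and only the one hour threshold comes from the text.

```python
from datetime import datetime, timedelta

STALL_LIMIT = timedelta(hours=1)  # "over an hour" without logging information


def jobs_to_retry(jobs, now):
    """Return the sorted IDs of test jobs that should be killed and retried.

    `jobs` maps a job ID to a dict with hypothetical keys:
      "dirac_status":     status reported by LHCbDIRAC, e.g. "Running" or "Failed"
      "expected_running": True if LbAPI still expects the test to be running
      "last_log":         datetime of the most recent log message received
    """
    retry = []
    for job_id, job in jobs.items():
        if not job["expected_running"]:
            continue  # the test already reported its result, nothing to do
        failed = job["dirac_status"] == "Failed"
        stalled = now - job["last_log"] > STALL_LIMIT
        if failed or stalled:
            retry.append(job_id)
    return sorted(retry)


now = datetime(2024, 1, 1, 12, 0)
jobs = {
    1: {"dirac_status": "Running", "expected_running": True, "last_log": now - timedelta(minutes=5)},
    2: {"dirac_status": "Failed", "expected_running": True, "last_log": now - timedelta(minutes=5)},
    3: {"dirac_status": "Running", "expected_running": True, "last_log": now - timedelta(hours=2)},
    4: {"dirac_status": "Failed", "expected_running": False, "last_log": now - timedelta(hours=3)},
}
stale = jobs_to_retry(jobs, now)
```

In this example jobs 2 and 3 are selected: job 2 failed in LHCbDIRAC while still expected to be running, and job 3 has been silent for more than an hour. Job 4 is ignored because its test is no longer expected to be running.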
Submission of productions#
Merge requests with productions are typically merged by project liaisons. The list of liaisons for each project is managed with a CERN egroup and the approval rule is associated with a GitLab group (e.g. lhcb-simulation/liasons).
Upon merging a merge request in GitLab, a webhook is used to trigger:
The submission of the most recent test production to LHCbDIRAC.
Opening an issue in the associated GitLab repository to track the production (this is monitored and closed automatically here).
Reverting the merge commit to ensure the repository is left in a clean state.
In the case of Analysis Productions:
Automatically generated options files are committed and the data package is tagged.
The tagged repository is deployed to CVMFS (using the standard data package deployment CI).
Software Components#
Deployment#
The various DPA/Simulation managed components are hosted in the lhcb-productions project of CERN’s Platform-as-a-Service (PaaS). Helm is used to manage the configuration of the project and the main branch is deployed for each update to the main branch of the lbap-helm-chart repository. The chart repository contains git submodules for LbAPI, LbAPCommon and LbAPWeb. When pushes are made to the main branch of LbAPI/LbAPWeb, the CI in the respective repository updates the submodule reference in the lbap-helm-chart repository, which in turn triggers the lhcb-productions project to be re-deployed.
The lbap-helm-chart repository can also be used to run a local instance of LbAPI/LbAPWeb for development. See the repository’s README for details.
LbAPI#
LbAPI provides two main services:
A REST-like API using FastAPI: https://lbap.app.cern.ch/docs
A celery application which is used for offloading slow operations and triggering scheduled cron-style tasks with celery beat.
The REST-like API serves multiple functions:
Collect information in pseudo-realtime about test pipelines
Serve information about test pipelines
Act as a proxy to the AnalysisProductionsHandler in LHCbDIRAC, making it possible to interact without requiring grid certificates and a local installation of the DIRAC client.
Serve a stable API for apd.
Allow CERN’s GitLab CI workers to access the following resources without requiring any secrets to be set up in the repository configuration (e.g. keytabs):
the /stable/ API routes
EOS tokens which can be used to access data on LHCb’s EOS instance at CERN
There are three ways to authenticate against LbAPI:
Using the application which is registered in the CERN SSO to get a JWT directly.
By exchanging the JWT which is provided by the CERN SSO for an LbAPI specific bearer token. This token grants limited access and is intended for interactive CLI use with LbAPLocal and apd.
By exchanging CI_JOB_JWT_V2 from CERN GitLab. This token is limited to being used from within CERN and can only be used with the /stable/ API routes.
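Whichever of these routes is used to obtain it, a bearer token is presented to an HTTP API in the Authorization header. The sketch below shows only that generic step, under stated assumptions: the host comes from the documentation link above, while the helper name and the path "/stable/example" are placeholders and not documented LbAPI routes. The token exchange itself is not shown, and the request is built but not sent.

```python
import urllib.request

LBAPI_URL = "https://lbap.app.cern.ch"  # host taken from the API docs link above


def authorised_request(route, token):
    """Build (but do not send) a GET request carrying a bearer token."""
    if not route.startswith("/"):
        raise ValueError("route must start with '/'")
    return urllib.request.Request(
        LBAPI_URL + route,
        headers={"Authorization": f"Bearer {token}"},
    )


# "/stable/example" is a placeholder path used purely for illustration.
request = authorised_request("/stable/example", token="<token>")
```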
LbAPWeb#
LbAPWeb is a static website which is built using React and NextJS and written in TypeScript. It interacts directly with the CERN SSO and LbAPI from within the user’s browser. This allows anyone to develop LbAPWeb locally without requiring any special access as the standard CERN SSO flow works from within a local instance of LbAPWeb.
LbAPLocal#
LbAPDoc#
LbAPDoc is the source of the Analysis Productions documentation at https://lhcb-ap.docs.cern.ch/ and is hosted on GitLab pages at CERN.
apd#
TODO
The stable API#
The API routes under /stable/ are used by apd and are versioned. The intention is to keep the output of the API stable to ease data preservation.
Eventually it will cease to be possible for old software to access even the stable API endpoints (HTTPS certificates, TLS versions, authentication, …). For this reason apd is designed to support running without any network access: an environment variable can be set which points to a static snapshot of the required API responses. This mechanism also makes it possible to use apd in a fully offline mode in situations where this might be desirable.
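The snapshot mechanism can be pictured as a lookup that prefers local files over the network. The sketch below is not apd's code: the variable name EXAMPLE_APD_SNAPSHOT, the one-JSON-file-per-route layout and the function name are all hypothetical and serve only to illustrate the idea.

```python
import json
import os
from pathlib import Path

# Hypothetical variable name; apd's real variable is not given in this page.
SNAPSHOT_ENV = "EXAMPLE_APD_SNAPSHOT"


def load_response(route, fetch):
    """Return the API response for `route`, preferring a local snapshot.

    `fetch` is a callable that performs the network request. It is only
    called when no snapshot directory is configured.
    """
    snapshot_dir = os.environ.get(SNAPSHOT_ENV)
    if snapshot_dir:
        # Hypothetical layout: one JSON file per route, "/" replaced by "__".
        filename = route.strip("/").replace("/", "__") + ".json"
        return json.loads((Path(snapshot_dir) / filename).read_text())
    return fetch(route)
```

With the variable unset the function simply delegates to `fetch`. With it set, no network call is made at all, which is what allows old software to keep working after the live endpoints become unreachable.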
Analysis Productions specificities#
For Analysis Productions there is an additional set of components and infrastructure for getting enhanced information about productions after they have been run.
AnalysisProductions Handler/Client/DB in LHCbDIRAC#
Additional metadata for Analysis Productions is provided to analysts such as:
Aggregating the information from the ProductionManagementSystem, TransformationSystem, Bookkeeping and DFC.
Grouping production requests together into an “analysis” (in a per-WG namespace).
Allowing key-value (“tags”) to be assigned to
Some terminology:
Request: Associated with the original ProductionRequest which was generated via the lhcb-datapkg/AnalysisProductions CI.
Sample: An additional abstraction which has a foreign key relationship with the “request”, is assigned to an “analysis” and holds analysis specific metadata.
Archival: “Samples” have a validity_end field in the DB. When validity_end is not null and the current time is after validity_end, the sample is treated as archived.
Published: In principle samples can be assigned to LHCb publication numbers; however, this functionality has not yet been exposed.
Additional technicalities:
Both a first request and sample are created automatically after productions are submitted via the lhcb-datapkg/AnalysisProductions CI.
Additional samples can be assigned to a request using the web interface and LbAPI.
All strings in the AnalysisProductionsHandler are treated as case-insensitive with lower case being preferred.
The APSyncAgent in LHCbDIRAC is responsible for keeping some information in the AnalysisProductionsDB up to date.
Most information in the AnalysisProductionsDB (e.g. samples, custom tags) is never deleted and instead two columns (validity_start/validity_end) are used to enable “time travel” to see a previous state.
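The validity columns can be read as a simple temporal filter. The sketch below illustrates the archival rule and the “time travel” query on plain dictionaries. It is not the AnalysisProductionsDB schema or query code: only the column names validity_start and validity_end come from the text, and the "tag" field and function names are hypothetical.

```python
from datetime import datetime


def is_archived(validity_end, now):
    """Archived means validity_end is set and lies strictly before `now`."""
    return validity_end is not None and now > validity_end


def state_at(rows, when):
    """Return the rows that were valid at the time `when`."""
    return [
        row
        for row in rows
        if row["validity_start"] <= when and not is_archived(row["validity_end"], when)
    ]


# Two versions of a hypothetical tag: the old row is closed rather than deleted.
rows = [
    {"tag": "status=draft", "validity_start": datetime(2023, 1, 1), "validity_end": datetime(2023, 6, 1)},
    {"tag": "status=final", "validity_start": datetime(2023, 6, 1), "validity_end": None},
]
past = [row["tag"] for row in state_at(rows, datetime(2023, 3, 1))]
present = [row["tag"] for row in state_at(rows, datetime(2024, 1, 1))]
```

Querying at a date in March 2023 returns only the draft tag, while querying at a later date returns only the final one, even though both rows remain stored.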
Data lifecycle#
All AnalysisProductions output data is stored with a single replica at CERN in the CERN-ANAPROD storage element. (Some legacy data remains on CERN-DST and CERN_MC-DST and should be migrated at some point.) This is done to strongly discourage analysts from duplicating the data to other LHCb storage resources (eosuser, eoslhcb, AFS). In particular:
Having data split across multiple sites vastly increases the risk of being affected by downtimes.
Most tools used by analysts cannot transparently handle the concept of having multiple sources for the same file (with failover).
Data access is often affected by latency and having the data all geolocated makes the effects of latency more predictable and easier to debug.
apd provides mechanisms for analysts to transparently cache data on local or institute specific storage if desired.
When all samples for a given request have been archived, the data is a candidate for being deleted from disk. No replicas are kept on tape, as the output of Analysis Productions is typically iterated upon with the previous versions having little long term value. All metadata about the creation of a request’s output files is preserved indefinitely.
When a sample has been used in a publication it is proposed to archive it to two tape replicas for permanent storage. The disk copy can then be removed following the previously mentioned procedure.
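The two rules above can be summarised as a small decision function. This is a sketch of the described policy only, not of any existing tool: the Sample class, the published flag and the function names are hypothetical, and the tape rule is stated in the text as a proposal rather than an implemented procedure.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Sample:
    name: str
    validity_end: Optional[datetime] = None
    published: bool = False  # hypothetical flag: used in an LHCb publication


def is_archived(sample, now):
    """A sample is archived once validity_end is set and in the past."""
    return sample.validity_end is not None and now > sample.validity_end


def plan_for_request(samples, now):
    """Summarise what the lifecycle rules imply for one request's output data."""
    return {
        # Disk copy may be deleted only when every sample has been archived.
        "disk_deletion_candidate": bool(samples) and all(is_archived(s, now) for s in samples),
        # Proposed rule: published samples get two tape replicas.
        "tape_replicas": 2 if any(s.published for s in samples) else 0,
    }


now = datetime(2024, 1, 1)
active = plan_for_request([Sample("a", datetime(2023, 1, 1)), Sample("b")], now)
retired = plan_for_request([Sample("a", datetime(2023, 1, 1), published=True)], now)
```

Here the first request stays on disk because sample "b" is still valid, while the second is a deletion candidate whose data would first be archived to two tape replicas.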
Ownership#
Owners are assigned to an “analysis” which grants them permission to manage samples. Multiple sets of people can therefore own the same underlying production request.