Flavour tagging performance of the updated Unified Particle Transformer algorithm with the CMS experiment at √s = 13.6 TeV
CDS link: DP-2025-081
Abstract
Recent advances in machine learning have delivered substantial gains across high-energy physics, particularly in object identification. In jet flavour classification, increasingly sophisticated architectures exploit low-level inputs and their correlations to improve performance. Over successive LHC data-taking periods, deep-learning methods have markedly enhanced the tagging of heavy-flavour jets originating from the hadronization of b- and c-quarks. This note reports the performance of the updated Unified Particle Transformer for jet flavour tagging, called UParT v2. This new training of the unified tagger targets the same classification and regression tasks as its predecessor: b/c-jet, s-jet, and hadronic τ identification, together with a flavour-aware jet-energy and resolution regression. Relative to the first release, UParT v2 is trained on a substantially larger dataset composed of the same physics processes as UParT v1, and benefits from architectural refinements, yielding improved accuracy. This update documents the training efficiency and scalability of deep neural networks for jet-algorithm tasks, demonstrating practical pathways to deploy high-capacity models under realistic computing constraints for the CMS experiment.
Glossary:
-
Particle-level jet: A jet clustered from stable (cτ > 1 cm) particles, excluding neutrinos, before any detector simulation.
-
Reconstruction-level jet: A jet clustered from reconstructed particle-flow (PF) candidates [1].
-
Small-cone jets (AK4 jets): Jets clustered from PF candidates using the anti-kT algorithm [2] with distance parameter R = 0.4. The Pileup Per Particle Identification (PUPPI) algorithm [3,4] is used for pileup (PU) mitigation.
-
Ghost association: The jet flavour is defined via ghost hadron and parton association [5]. It is a jet–truth matching technique in which a generator-level hadron or parton, scaled to infinitesimal momentum (a so-called "ghost"), is clustered inside the particle-level jet.
-
Heavy-flavour jets: Reconstruction-level jets matched with a particle-level jet associated with a b/c ghost hadron.
-
Light-flavour jets: Reconstruction-level jets matched with a particle-level jet associated with a light parton. The flavour is defined by the hardest associated parton.
-
Tau jets: Jets originating from tau leptons decaying to hadrons
-
Secondary vertices (SVs): The points where the b or c hadrons decay. Vertex reconstruction uses the adaptive vertex fitter (AVF) and inclusive vertex finding (IVF) algorithms [5]. The resulting list of vertices undergoes a cleaning procedure: SV candidates that share 70% or more of their tracks are rejected, and if the significance of the flight distance between two secondary vertices is less than 2, one of the two is dropped from the collection.
-
DeepJet: A multi-class deep-neural-network algorithm [6] employing general (low-level) properties of several charged and neutral particle-flow jet constituents, supplemented with properties of SVs associated with the jet. It was the state-of-the-art heavy-flavour tagger during the 2015–2018 data-taking period at √s = 13 TeV (Run 2).
-
ParticleNet: A Graph Neural Network architecture [7] customised for AK4 jet classification, performing inclusive heavy-flavour and hadronic τ identification combined with a flavour-aware jet-energy regression and jet-energy-resolution estimation. ParticleNet is a jet-tagging algorithm based on a Dynamic Graph Convolutional Neural Network. Instead of treating the jet as an ordered collection of constituents, as DeepJet does, it considers the jet as an unordered set of its constituent particles, a "particle cloud". This representation proves more efficient at incorporating additional low-level jet information and explicitly respects permutation symmetry.
-
RobustParT: A Particle Transformer [8] model specific to the classification of AK4 jets. The transformer model introduces pairwise "interaction" features between all input jet constituents and SVs. This additional layer of inputs gives a better view of the internal relations of the jet constituents, improving the performance of the model. For AK4 jet classification, a smaller model is used. In addition, Adversarial Training (AT) [9] is used to enhance the robustness of the model against mismodelling in the Monte Carlo (MC) simulation. AT distorts the input features with respect to the loss function of the neural network, allowing the model to learn to classify the jet flavour in a region around the input-feature distributions observed in the MC simulation, thereby reducing the impact of the mismodelling. The combination of these two approaches preserves the performance and improves the robustness of heavy-flavour tagging; the resulting tagger is called RobustParT.
-
UParT v1/2: A Particle Transformer [8] model designed for AK4 jet tasks, performing inclusive heavy-flavour and hadronic τ identification [10] combined with a flavour-aware jet-energy regression and jet-energy-resolution estimation [11]. It introduces pairwise "interaction" features between jet constituents and SVs, enhancing the modelling of internal jet relations. UParT [12] utilises a novel Adversarial Training (AT) paradigm that maintains the particle-cloud representation, allowing the model to learn from input-feature distortions and improve robustness against Monte Carlo (MC) simulation mismodelling. Additionally, it preserves the feature-importance mapping from the adversarial attack (AA) gradient, highlighting critical features for jet classification and ensuring high tagging performance and robustness. UParT also introduces an s-jet classifier, allowing for the first time the identification of jets originating from s-quarks. UParT v2 is an updated version with architectural and training improvements and a larger training dataset.
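The SV cleaning procedure described in the glossary above can be sketched as follows. This is an illustrative reimplementation, not the CMSSW code; the `flight_dist_significance` argument is a hypothetical user-supplied helper, and SVs are represented here as plain dicts with track-id sets.

```python
def shared_track_fraction(sv_a, sv_b):
    """Fraction of sv_a's tracks also used by sv_b (tracks given as id sets)."""
    if not sv_a["tracks"]:
        return 0.0
    return len(sv_a["tracks"] & sv_b["tracks"]) / len(sv_a["tracks"])

def clean_svs(svs, flight_dist_significance, frac_cut=0.70, sig_cut=2.0):
    """Drop an SV candidate if it shares >= 70% of its tracks with an
    already-accepted vertex, or if the flight-distance significance
    between the two vertices is below 2 (cuts taken from the glossary)."""
    accepted = []
    for sv in svs:
        keep = True
        for other in accepted:
            if (shared_track_fraction(sv, other) >= frac_cut
                    or flight_dist_significance(sv, other) < sig_cut):
                keep = False
                break
        if keep:
            accepted.append(sv)
    return accepted
```

The greedy loop keeps the first vertex of each overlapping pair, mirroring the "one of the two is dropped" prescription.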
Differences between UParT v1 and v2
| Hyperparameter | UParT v1 | UParT v2 |
|---|---|---|
| Number of layers | 6 | 6 |
| Embedding (hidden) size | 128 | 192 |
| Total number of parameters | 2M | 5.7M |
| Activation function | GELU | SiLU / SwiGLU |
| Normalization layer | LayerNorm | RMSNorm |
| Size of training dataset (after reweighting) | 35M | 70M |
| Optimizer | Ranger | AdamW |
| Learning rate | 2e-03 | 1e-03 |
Table 1: The values of different hyperparameters and their differences between versions of the Unified Particle Transformer.
Run 3 conditions and evaluation samples
-
BPix issue: After Technical Stop 1 of 2023 (June 19-24), 27 modules in Barrel Pixel layers 3 and 4 (BPix L3 and L4) became inoperable due to an issue in the distribution of the LHC clock signal to these modules. Since this incident, these modules have remained deactivated. They cover a sector spanning approximately 0.4 radians (~23 degrees) in phi at negative pseudorapidity (BmI Sector 7). Since the regions covered by these modules fully overlap in eta and phi across the two detector layers, a full gap in acceptance is produced when seeding tracks with the traditional "high purity" pixel-hit combinations (triplets and quadruplets). A dedicated jet energy scale and resolution correction is introduced to account for the energy loss in the affected regions. The dedicated simulations used for 2023 are sub-divided into pre-BPix and post-BPix periods.
-
FPix issue: After Technical Stop 1 of 2023 (June 19-24), communication with 27 modules in BPix L3 and L4 was lost because the QPLL circuit could no longer lock to the LHC clock. In early July 2024, communication with 14 modules in FPix Disk -1, attached to the same portcard (FPix_BmO_D1_PRT3), was also lost. All of these modules have been turned off since the respective incidents.
-
Training samples: The training dataset consists of simulated events from various Standard Model processes, including top quark pair production, QCD multijet, vector boson production in association with jets, as well as single Higgs and di-Higgs production processes. 85% of the jets are used for training the model while the remaining 15% serve for the validation of the performance during the model training. In total, about 70M jets serve for training and 12M for validation. The simulation conditions of the training dataset reflect the conditions of data taking after the 2023 BPix issue.
-
Evaluation samples: While the training was performed using the 2023 conditions with the BPix issue simulated, UParT v2 was introduced at the beginning of the 2024 data taking. We therefore report the performance of UParT v2 using simulation reflecting the 2024 data-taking conditions. The performance was evaluated in two distinct regimes. The first corresponds to the usual transverse-momentum (pT) region, pT > 30 GeV, and is assessed using top quark pair production: fully hadronic decays (tt̄ → 4q + 2b) for heavy-flavour tagging, and fully leptonic decays (tt̄ → 2ℓ2ν + 2b), together with Drell–Yan production with additional jets (DY+jets), for the assessment of τ tagging. The higher-pT regime, pT > 300 GeV, is evaluated using QCD multijet processes, with the scalar sum of jet transverse momenta in the range 600–800 GeV. For each process, a total of 2.5 million events were used for evaluation.
Tagging performance
Figure 1
Figure 1: b-tagging ROC curves. UParT v2 shows improved performance for both c and light jet rejection.
Figure 2
Figure 2. ROC curves of post-training reweighted discriminators for b-tagging. Both the BvsL (kc = 0) and BvsC (kc = 1) discriminators show the best rejection against their target flavour, but fail to achieve good rejection of the other. BvsAll is the usual b-jet discriminator used at CMS. The different weighted BvsAll variants show a trade-off between BvsC and BvsL, with a significant improvement in BvsL for a limited impact on BvsC performance.
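A plausible form of such a reweighted discriminator, built from the network's class probabilities after training, is a `kc`-weighted denominator that reduces to BvsL at kc = 0 and BvsC at kc = 1. The exact functional form used in the note is not spelled out here, so this is an assumed sketch:

```python
import numpy as np

def b_vs_weighted(p_b, p_c, p_udsg, kc):
    """Assumed post-training reweighted b discriminator:
    kc = 0 reproduces BvsL = p_b / (p_b + p_udsg),
    kc = 1 reproduces BvsC = p_b / (p_b + p_c),
    and intermediate kc trades light-jet against c-jet rejection
    without retraining the network."""
    p_b, p_c, p_udsg = map(np.asarray, (p_b, p_c, p_udsg))
    denom = p_b + kc * p_c + (1.0 - kc) * p_udsg
    return np.where(denom > 0, p_b / denom, 0.0)
```

Because only the output probabilities are recombined, the whole family of discriminators comes for free from a single trained model.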
Figure 3
Figure 3. Grid search for the best value of the post-training tuning parameter. Upper plot: the background misidentification rejection at 70% (40%) b (c) efficiency; the statistical uncertainty of the efficiency is negligible. Lower plot: the product of the b-tagging rejections, giving a balanced trade-off between BvsL and BvsC. A smoothing cubic B-spline is used to suppress noise and find the best tuning-parameter value.
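The grid-search-plus-spline step from the caption can be sketched as follows. The grid granularity, smoothing factor, and function names are illustrative assumptions, not taken from the note:

```python
import numpy as np
from scipy.interpolate import splrep, splev

def best_tuning_parameter(kc_grid, rej_bvsl, rej_bvsc, smooth=None):
    """Find the kc that maximises the product of the BvsL and BvsC
    background rejections. The noisy product is first fit with a cubic
    B-spline (smoothing factor `smooth`; None interpolates), then the
    maximum is located on a fine grid."""
    kc_grid = np.asarray(kc_grid)
    product = np.asarray(rej_bvsl) * np.asarray(rej_bvsc)
    tck = splrep(kc_grid, product, k=3, s=smooth)   # cubic B-spline fit
    fine = np.linspace(kc_grid[0], kc_grid[-1], 1001)
    smoothed = splev(fine, tck)
    return float(fine[np.argmax(smoothed)])
```

Maximising the product rather than either rejection alone is what yields the "balanced trade-off" the caption describes.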
Figure 4
Figure 4. b-tagging ROC curves of the pre- and post-training tuning. The post-training tuning significantly improves BvsL tagging performance.
Figure 5
Figure 5. c-tagging ROC curves. UParT v2 shows state-of-the-art performance for both b- and light-jet rejection, except in the high-pT regime for CvsB.
Figure 6
Figure 6. s-tagging ROC curves for UParT v2 (left) and UParT v1 (right). The performance indicates that a low-efficiency s-tagger is achievable.
Figure 9
Figure 9. τ-tagging ROC curves for tau vs all jets. Models show similar performance with ParticleNet performing slightly better.
Figure 10
Figure 10. τ-tagging ROC curves for tau vs leptons. ParticleNet and UParT show similar performance. ParticleNet performs better at high misidentification rate and UParT at lower rate.
Jet energy regression and quantile regression
Figure 12
Figure 12. Jet energy resolution. ParticleNet shows the best regressed jet-energy resolution. UParT v2 shows worse resolution performance due to the simulation conditions of its training data.
Scaling laws
Figure 13
Figure 13. The b-jet efficiency at 1% light-jet mistag rate for different model and training-dataset sizes. The numbers on the plot show the number of model parameters. The fit was performed with the Trust Region Reflective algorithm [13], and the uncertainty band was estimated using error propagation. A strong scaling law [14] is observed in both variables.
Figure 14
Figure 14. The b-jet efficiency at 1% charm-jet mistag rate for different model and training-dataset sizes. The numbers on the plot show the number of model parameters. The fit was performed with the Trust Region Reflective algorithm [13], and the uncertainty band was estimated using error propagation. A strong scaling law [14] is observed in both variables.
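A fit of this kind can be sketched with `scipy.optimize.curve_fit`, which switches to the Trust Region Reflective algorithm when parameter bounds are supplied. The saturating power-law functional form below is our assumption for how efficiency might scale with size, not the form used in the note:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, eff_inf, b, c):
    """Assumed saturating power law: efficiency approaches eff_inf as the
    model/dataset size n grows, with power-law exponent c."""
    return eff_inf - b * n ** (-c)

def fit_scaling(n_values, efficiencies):
    """Fit the scaling law; the bounds activate method='trf'
    (Trust Region Reflective), matching the algorithm cited as [13]."""
    popt, pcov = curve_fit(
        scaling_law, n_values, efficiencies,
        p0=[max(efficiencies), 1.0, 0.3],
        bounds=([0.0, 0.0, 0.0], [1.0, np.inf, 2.0]),
        method="trf",
    )
    return popt, np.sqrt(np.diag(pcov))  # best-fit values and 1-sigma errors
```

The uncertainty band in the figures would then follow from propagating the parameter covariance through `scaling_law`.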
Comparison with previous tagging algorithms
Using the Positive vs Negative discriminator defined earlier, the signal efficiency (signal events passing the cut over total signal events) for b⁺ jets is fixed at 90%, 70%, and 50%. The corresponding efficiencies for the remaining categories are then extracted. Results for b⁻ are compatible with those of b⁺, as the discriminant is symmetric.
Figure 15
Figure 15. Evolution of the light-jet (udsg, yellow bars) and c-jet (red bars) rejection at a fixed b-jet identification efficiency of 70%, for taggers from DeepJet to UParT v2. The BvsAll discriminator is used to derive all numbers. The weighted BvsAll discriminator used in this plot is tuned for this specific kinematic regime, yielding an optimal trade-off between light- and c-jet rejection.
Efficiencies and mistag rates
| | Loose [10%] | Medium [1%] | Tight [0.1%] | XT [0.05%] | XXT [0.01%] |
|---|---|---|---|---|---|
| DeepJet | 0.936 | 0.828 | 0.667 | 0.614 | 0.485 |
| RobustParT | 0.909 | 0.817 | 0.687 | 0.643 | 0.530 |
| ParticleNet | 0.952 | 0.858 | 0.706 | 0.650 | 0.512 |
| UParT v1 | 0.951 | 0.854 | 0.718 | 0.673 | 0.565 |
| UParT v2 | 0.955 | 0.859 | 0.734 | 0.694 | 0.596 |
Table 2: b-tagging efficiency at 10, 1, 0.1, 0.05, and 0.01% c+udsg-mistag rate. The working points are derived from the fully hadronic top quark pair events used for the evaluation of this model.
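Working points of this kind are thresholds on the discriminator chosen so that a fixed fraction of background jets passes; the tagging efficiency is then the fraction of b jets above the same threshold. A generic sketch (the names and rates follow Table 2, but the implementation is illustrative):

```python
import numpy as np

def derive_working_points(bkg_scores,
                          mistag_rates=(0.10, 0.01, 0.001, 0.0005, 0.0001)):
    """For each target mistag rate r, return the discriminator threshold
    above which a fraction r of the background sample lies. The default
    rates correspond to the Loose...XXT working points of Table 2."""
    bkg = np.asarray(bkg_scores)
    return {r: float(np.quantile(bkg, 1.0 - r)) for r in mistag_rates}

def efficiency_at(scores, threshold):
    """Fraction of jets in `scores` passing the threshold."""
    return float((np.asarray(scores) > threshold).mean())
```

With these two helpers, each row of Table 2 is `efficiency_at(b_jet_scores, wp)` evaluated at the thresholds derived from the background sample.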
Figure 16
Figure 16. b-tagging efficiency and light misidentification of UParT v1 versus the transverse momentum, pseudorapidity and number of primary vertices.
Figure 17
Figure 17. b-tagging efficiency and light misidentification of UParT v2 versus the transverse momentum, pseudorapidity and number of primary vertices.
Figure 18
Figure 18. b-tagging efficiency and light misidentification of UParT v1 versus the transverse momentum, pseudorapidity and number of primary vertices.
Figure 19
Figure 19. b-tagging efficiency and light misidentification of UParT v2 versus the transverse momentum, pseudorapidity and number of primary vertices.
Summary
-
Recent advances in deep learning have led to enhanced performance of jet-tagging algorithms. Building on this progress, a unified Particle Transformer framework (UParT) has been developed, delivering state-of-the-art results across multiple tagging tasks. The latest iteration, UParT v2, introduces a substantially enlarged training dataset and an optimised architecture, yielding further gains and establishing new performance benchmarks.
-
A retrospective from DeepJet to UParT v2 underscores the methodological advances and documents performance milestones. Relative to DeepJet, light-jet rejection in b-tagging improves by a factor of 11, while c-jet rejection improves by a factor of 5. The dependence on both dataset scale and model capacity indicates headroom for additional improvements through further scaling.
References
[1] CMS Collaboration, "Particle-flow reconstruction and global event description with the CMS detector", JINST 12 (2017) P10003, DOI:10.1088/1748-0221/12/10/P10003.
[2] M. Cacciari, G. P. Salam and G. Soyez, “The anti-kt jet clustering algorithm,” JHEP 0804 (2008) 063.
[3] Bertolini, Harris, Low, Tran, "Pileup Per Particle Identification". JHEP 10 (2014) 059, DOI:10.1007/JHEP10(2014)059.
[4] CMS Collaboration, "Pileup mitigation at CMS in 13 TeV data", JINST 15 (2020) P09018, DOI: 10.1088/1748-0221/15/09/P09018.
[5] CMS Collaboration, "Identification of heavy-flavour jets with the CMS detector in pp collisions at 13 TeV", JINST 13 (2018) P05011, DOI:10.1088/1748-0221/13/05/P05011.
[6] E. Bols, J. Kieseler, M. Verzetti, M. Stoye and A. Stakia, “Jet flavour classification using DeepJet” JINST 15 (2020) P12012.
[7] H. Qu and L. Gouskos, “Jet tagging via particle clouds”, Phys. Rev. D 101, 056019 (2020).
[8] H. Qu, C. Li, S. Qian, “Particle Transformer for Jet Tagging,” PMLR 162:18281-18292, 2022.
[9] H. Zhang et al., "Theoretically Principled Trade-off between Robustness and Accuracy", 2019, arXiv:1901.08573.
[10] CMS Collaboration, “Comparison of the performance of tau reconstruction and identification algorithms in Run 3”, CMS Detector Performance Summary CMS-DP-2025-073.
[11] CMS Collaboration, “Jet energy scale and resolution of jets with ParticleNet pT regression using Run3 data collected by the CMS experiment in 2022 and 2023 at 13.6 TeV”, CMS Detector Performance Summary CMS-DP-2024-064.
[12] CMS Collaboration, “A unified approach for jet tagging in Run 3 at √s=13.6 TeV in CMS”, CMS Detector Performance Summary CMS-DP-2024-066.
[13] M. Branch, T. Coleman and Y. Li, “A Subspace, Interior, and Conjugate Gradient Method for Large-Scale Bound-Constrained Minimization Problems”, SIAM Journal on Scientific Computing, V. 21, 1, 1999.
[14] J. Kaplan et al., "Scaling Laws for Neural Language Models", 2020, arXiv:2001.08361.