Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

In the following, the most important excerpts are described.

Annotated Description 

Load data using PANDAS data frames

...

Let's see what you have uploaded in your Colab notebook!

In [ ]:


# Look at the signal and bkg events before applying physical requirement

df['sig'] = pd.DataFrame(upfile['sig'][treename].arrays(VARS, library="np"),columns=VARS)
print(df['sig'].shape)

...

Let's print out the first rows of this data set!

...

df['sig'].head()

...


f_runf_eventf_weightf_massjjf_deltajjf_mass4lf_Z1massf_Z2massf_lept1_ptf_lept1_etaf_lept1_phif_lept2_ptf_lept2_etaf_lept2_phif_lept3_ptf_lept3_etaf_lept3_phif_lept4_ptf_lept4_etaf_lept4_phif_jet1_ptf_jet1_etaf_jet1_phif_jet2_ptf_jet2_etaf_jet2_phi
013852280.000176667.2714233.739947124.96657690.76861620.50827482.8904570.8222031.34370665.4869460.3829222.56848539.8385310.5469172.49720428.5622060.1746662.013540116.326035-1.126533-1.75923890.3338932.613415-0.096671
113852330.000127129.0858920.046317120.23192680.78231834.26172641.195362-0.5342452.80268424.911942-2.0659280.37115021.959597-1.219900-2.93891416.676077-0.1629151.783374105.4918823.253374-1.29728338.9784933.2070561.553476
213852540.000037285.1652223.166899125.25464691.39269325.69529080.7880020.9437780.72963235.5497210.9352411.28854923.2062840.236346-2.67054014.5818541.5166230.28465869.3151702.573589-2.03081151.972664-0.593310-2.799394
313852600.00004352.0067940.150803125.06700991.18370819.631315129.8834230.235406-1.72938437.9507901.226075-2.54035617.6784130.096546-1.5331208.197763-0.1575770.339215202.6894682.5308021.32578641.3437582.6816050.858582
413852630.0000921044.0834964.315164124.30574872.48051543.82650486.220734-0.2266530.11727780.451378-0.5367490.38567827.4972400.827591-0.07223621.243813-0.579560-0.884727127.192223-2.362456-2.945257115.2002721.9527082.053301
  • The first 2 columns contain information which are is provided by experiments at the LHC that will not be used in the training of our Machine Learning algorithms, therefore we skip our explanation to the next columns.

  • The next variable is the f_weights. This corresponds to the probability of having that particular kind of physical process on the whole experiment. Indeed, it is a product of Branching Ratio (BR), geometrical acceptance of the detector, and kinematic phase-space. It is very important for the trainings phase and you will use it later.

  • The variables f_massjj,f_deltajj,f_mass4l,f_Z1mass, and f_Z2mass are named high-level features (event features) since they contain overall information about the final-state particles (the mass of the two jets, their separation in space, the invariant mass of the four leptons, the masses of the two Z bosons). Note that the mass is lighter w.r.t. the one. Why is that? In the Higgs boson production (hypothesis of mass = 125 GeV) only one of the Z bosons is an actual particle that has the nominal mass of 91.18 GeV. The other one is a virtual (off-mass shell) particle.

  • The remnant columns represent the low-level features (object kinematics observables), the basic measurements which are made by the detectors for the individual final state objects (in our case four charged leptons and jets) such as f_lept1(2,3,4)_pt(phi,eta) corresponding to their transverse momentum and the spatial distribution of their tracks ().

...

# Part of the code in "#" can be used in the second part of the exercise
# for trying to use alternative datasets for the training of our ML algorithms

#df['bkg'] = pd.DataFrame(upfile['bkg'][treename].arrays(VARS, library="np"),columns=VARS)
#df['bkg'].head()
df['bkg_ggHtoZZto4mu'] = pd.DataFrame(upfile['bkg_ggHtoZZto4mu'][treename].arrays(VARS, library="np"),columns=VARS)
df['bkg_ggHtoZZto4mu'].head()
#df['bkg_ggHtoZZto4e'] = pd.DataFrame(upfile['bkg_ggHtoZZto4e'][treename].arrays(VARS, library="np"),columns=VARS)
#df['bkg_ggHtoZZto4e'].head()
#df['bkg_ZZto4e'] = pd.DataFrame(upfile['bkg_ZZto4e'][treename].arrays(VARS, library="np"),columns=VARS)
#df['bkg_ZZto4e'].head()


Out[ ]:


f_runf_eventf_weightf_massjjf_deltajjf_mass4lf_Z1massf_Z2massf_lept1_ptf_lept1_etaf_lept1_phif_lept2_ptf_lept2_etaf_lept2_phif_lept3_ptf_lept3_etaf_lept3_phif_lept4_ptf_lept4_etaf_lept4_phif_jet1_ptf_jet1_etaf_jet1_phif_jet2_ptf_jet2_etaf_jet2_phi
015816320.000225-999.0-999.0120.10110588.26235222.05154057.572330-0.433627-0.88607356.9337350.4965560.40467533.584896-0.0373870.29186610.881461-1.1129600.05109773.5412601.6832802.736636-999.0-999.0-999.0
115816590.000277-999.0-999.0124.59281282.17468317.61341750.3651200.0013620.93371331.5482250.598417-1.86355622.7580550.220867-2.76724617.2646260.361964-1.859138-999.000000-999.000000-999.000000-999.0-999.0-999.0
215816710.000278-999.0-999.0125.69223079.91576429.99801172.355927-0.238323-2.33562320.644920-0.2415601.85553616.031651-1.4469931.18501611.0682960.366903-0.60684564.4405441.8862441.635723-999.0-999.0-999.0
315817240.000336-999.0-999.0125.02750485.20095823.44015143.0592350.759979-1.71477819.2489830.5359790.42033716.595169-1.3303261.65606111.407483-0.6861181.295116-999.000000-999.000000-999.000000-999.0-999.0-999.0
415817440.000273-999.0-999.0124.91728265.97139014.96830552.585011-0.656421-2.93365135.095982-1.0025680.86517328.146715-0.730926-0.8764428.034222-1.0944361.783626-999.000000
df['bkg_ZZto4mu'] = pd.DataFrame(upfile['bkg_ZZto4mu'][treename].arrays(VARS, library="np"),columns=VARS)
df['bkg_ZZto4mu'].head()

f_runf_eventf_weightf_massjjf_deltajjf_mass4lf_Z1massf_Z2massf_lept1_ptf_lept1_etaf_lept1_phif_lept2_ptf_lept2_etaf_lept2_phif_lept3_ptf_lept3_etaf_lept3_phif_lept4_ptf_lept4_etaf_lept4_phif_jet1_ptf_jet1_etaf_jet1_phif_jet2_ptf_jet2_etaf_jet2_phi
0119911170.001420384.3941650.235409309.92147893.53839987.43604384.918190-0.073681-1.33923460.143539-1.229701-1.40914954.8926811.125339-1.04643342.1393970.9661092.593184240.8285060.1033002.408482195.8382260.3387080.285348
1119911920.000893110.5898440.956070326.48190392.94893685.379288124.2702181.388811-1.73809787.3797230.7665402.50284354.472603-0.1066141.62693318.5059592.0121722.22967777.2104112.061765-0.53257248.4323651.1056951.128457
2119913310.000839-999.000000-999.00000091.16704656.16121714.53508425.2415731.4105292.08008921.9712581.4658001.86850520.6483120.7871212.0178639.831321-1.3295392.28660066.6427921.176917-1.089489-999.000000-999.000000-999.000000
3119913640.000906-999.000000-999.000000323.42834588.71727094.94034665.728729-0.5611132.59644850.5285952.2279710.10131039.3923800.294608-1.75667433.1694870.367907-0.241346-999.000000-999.000000-999.000000-999.000000-999.000000-999.000000
4119913600.001034-999.000000-999.000000274.20791690.79927190.156898101.9313050.8287782.44013389.171135-0.052834
# Let's merge our background processes together!
df['bkg'] = pd.concat([df['bkg_ZZto4mu'],df['bkg_ggHtoZZto4mu']])
# Let's shuffle them!
df['bkg']= shuffle(df['bkg'])
# Let's see its shape!
print(df['bkg'].shape)
#print(len(df['bkg']))
#print(len(df['bkg_ZZto4mu']))
#print(len(df['bkg_ggHtoZZto4mu']))
#print(len(df['bkg_ggHtoZZto4e']))
#print(len(df['bkg_ZZto4e']))

...

Number of Signal events = 14260 
Number of Background events = 100724 
#Showing that the variable isSignal is correctly assigned for VBF signal events print(df['sig']['isSignal'])
0        1.0
1        1.0
2        1.0
3        1.0
4        1.0
        ... 
24858    1.0
24859    1.0
24860    1.0
24861    1.0
24862    1.0
Name: isSignal, Length: 14260, dtype: float64

References

Attachments