0        1.0
1        1.0
2        1.0
3        1.0
4        1.0
        ... 
24858    1.0
24859    1.0
24860    1.0
24861    1.0
24862    1.0
Name: isSignal, Length: 14260, dtype: float64
# Showing that the variable isSignal is correctly assigned for bkg events
# Some events are missing because of the selection, so we no longer have
# 134682 background events in total!
print(df['bkg']['isSignal'])
42646     0.0
619246    0.0
360856    0.0
727095    0.0
8984      0.0
         ... 
551642    0.0
315737    0.0
759363    0.0
535030    0.0
189636    0.0
Name: isSignal, Length: 100724, dtype: float64

Let's see in which way we have to use the f_weight variable!

# Renormalize the event weights so that they sum to one in both the signal and background dataframes.
# This is necessary for the ML algorithms to learn signal and background
# in the same proportion, independently of the number of events
# and of the absolute event weights in each sample!
# The relative contribution of each background process is retained, so the classifier
# learns to focus more on the important backgrounds and the background matches the data
# shape, but overall signal and background have equal importance (the classifier
# learns to identify signal and background equally well).
# In pandas, axis=0 runs along the index (rows) and axis=1 along the columns;
# for a single column such as f_weight, sum(axis=0) adds up all of its entries.

df['sig']['f_weight']=df['sig']['f_weight']/df['sig']['f_weight'].sum(axis=0)
df['bkg']['f_weight']=df['bkg']['f_weight']/df['bkg']['f_weight'].sum(axis=0)

# Note: the number of events remains unchanged after this "normalization procedure"
print("Number SIG events=", len(df['sig']['f_weight']))
print("Number BKG events=", len(df['bkg']['f_weight']))
Number SIG events= 14260
Number BKG events= 100724
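
The effect of this normalization can be checked on a small toy example. The sketch below assumes, as in the cells above, that df is a dictionary holding one dataframe per sample; the weight values are made up for illustration and are not taken from the real data set.

```python
import pandas as pd

# Toy stand-ins for the signal and background samples (weights are made up)
df = {
    "sig": pd.DataFrame({"f_weight": [0.2, 0.3, 0.5, 1.0]}),
    "bkg": pd.DataFrame({"f_weight": [2.0, 6.0, 12.0]}),
}

for key in ("sig", "bkg"):
    # Divide every weight by the sample total so that the weights sum to one
    df[key]["f_weight"] = df[key]["f_weight"] / df[key]["f_weight"].sum()

sig_sum = df["sig"]["f_weight"].sum()
bkg_sum = df["bkg"]["f_weight"].sum()

# Relative weights inside one sample are preserved:
# the second background event was 3 times heavier than the first and still is
bkg_ratio = df["bkg"]["f_weight"].iloc[1] / df["bkg"]["f_weight"].iloc[0]

print("sum of SIG weights =", sig_sum)
print("sum of BKG weights =", bkg_sum)
print("BKG weight ratio   =", bkg_ratio)
```

Both sums come out as one (up to floating-point rounding), while the number of rows and the ratios between weights within a sample stay the same.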

Let's merge our signal and background events!

# Concatenate the signal and background dfs into a single data frame
df_all = pd.concat([df['sig'],df['bkg']])

# Randomly shuffle the data set to mix signal and background events
# before splitting it into train and test datasets
df_all = shuffle(df_all)
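
As a side note, the same merge-and-shuffle step can be sketched with pandas alone. The example below uses made-up toy values and swaps the shuffle call for DataFrame.sample(frac=1), with a fixed random_state so that the result is reproducible; the seed value 42 is an arbitrary choice for this illustration, not something prescribed by the exercise.

```python
import pandas as pd

# Toy signal and background samples (values are made up for illustration)
sig = pd.DataFrame({"f_mass4l": [124.8, 125.3, 125.1], "isSignal": 1.0})
bkg = pd.DataFrame({"f_mass4l": [91.2, 180.5, 210.0, 95.7], "isSignal": 0.0})

# Concatenation keeps the original row labels, so labels can repeat
df_all = pd.concat([sig, bkg])
has_dupes = df_all.index.has_duplicates

# sample(frac=1) returns all rows in random order; reset_index(drop=True)
# then gives the shuffled rows a clean 0..N-1 index
df_all = df_all.sample(frac=1, random_state=42).reset_index(drop=True)

print("duplicate labels before reset:", has_dupes)
print(df_all)
```

Resetting the index is optional, but it avoids surprises later if rows are selected by label after the two samples have been merged.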

Preparing input features for the ML algorithms

We have our data set ready to train our ML algorithms! Before doing that we have to decide from which input variables the computer algorithms have to learn to distinguish between signal and background events.

We can use:

  1. The five high-level input variables f_massjj, f_deltajj, f_mass4l, f_Z1mass, and f_Z2mass.
  2. The 18 kinematic variables that characterize the four-lepton + two-jet final-state objects.

To make this choice, we can look at the two sets of correlation plots (the so-called scatter plots, made with the seaborn library) among the features at our disposal and see which set better captures the differences between signal and background events.
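
A much cheaper numerical complement to the scatter plots is to compare the linear correlation matrices of the two classes. The sketch below is only an illustration on randomly generated toy data: it borrows the feature names f_massjj and f_deltajj, but the numbers are not real events, and the correlation built into the toy "signal" is an assumption made purely for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
features = ["f_massjj", "f_deltajj"]

# Toy "signal": the second feature is constructed to follow the first one
x = rng.normal(size=n)
toy_sig = pd.DataFrame({
    "f_massjj": x,
    "f_deltajj": 0.8 * x + 0.2 * rng.normal(size=n),
})

# Toy "background": the two features are drawn independently
toy_bkg = pd.DataFrame({
    "f_massjj": rng.normal(size=n),
    "f_deltajj": rng.normal(size=n),
})

# Pearson correlation matrix of the chosen features, one per class
corr_sig = toy_sig[features].corr()
corr_bkg = toy_bkg[features].corr()

# Large entries mark feature pairs whose correlation differs between classes
corr_diff = (corr_sig - corr_bkg).abs()
print(corr_diff)
```

Correlation matrices only capture linear relations, so they do not replace the scatter plots, but they run quickly even on the full samples.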

Note: this operation takes quite a long time for both sets, since we are dealing with a large number of events. Skip the following two code cells and trust us in using the high-level features for building your ML models! Indeed, we will obtain better discriminator performance using the high-level features. You can always return to this part of the exercise and try the low-level features.

