Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

# Plot  dijets mass for signal, background and NN/RF selected events 
plt.xlabel('massjj (GeV)')
X = np.linspace(0.0,1000.,100)
plt.rcParams['figure.figsize'] = (10,5)
df_all['f_massjj'][(df_all['isSignal'] == 0)].plot.hist(bins=X, label='bkg',histtype='step', density=1 )
df_all['f_massjj'][(df_all['isSignal'] == 1)].plot.hist(bins=X, label='signal',histtype='step', density=1)
df_sel['f_massjj'].plot.hist(bins=X, label='NN',histtype='step', density=1)
df_sel_rf['f_massjj'].plot.hist(bins=X, label='RF',histtype='step', density=1)
plt.title('$m_{jj}$ normalized distribution',fontsize=12,fontweight='bold', color='r')
plt.legend(loc='upper right')
plt.xlim(0,1000)

# Plot  dijets mass for signal, background and NN/RF selected events 
plt.xlabel('f_mass4l (GeV)')
X = np.linspace(50, 400, 100)
plt.rcParams['figure.figsize'] = (10,5)
df_all['f_mass4l'][(df_all['isSignal'] == 0)].plot.hist(bins=X, label='bkg',histtype='step',log=True, density=1)
df_all['f_mass4l'][(df_all['isSignal'] == 1)].plot.hist(bins=X, label='signal',histtype='step',log=True, density=1)
df_sel['f_mass4l'].plot.hist(bins=X, label='NN',histtype='step', log=True, density=1)
df_sel_rf['f_mass4l'].plot.hist(bins=X, label='RF',histtype='step',log=True, density=1)
plt.title('$mass(4\mu)$ normalized distribution',fontsize=12,fontweight='bold', color='r')
plt.legend(loc='upper right')
plt.xlim(50,400)

Image Added

plt.xlabel('f_Z1mass (GeV)')
X = np.linspace(20, 150, 100)
plt.rcParams['figure.figsize'] = (10,5)
df_all['f_Z1mass'][(df_all['isSignal'] == 0)].plot.hist(bins=X, label='bkg',histtype='step',log=True ,density=1)
df_all['f_Z1mass'][(df_all['isSignal'] == 1)].plot.hist(bins=X, label='signal',histtype='step',log=True, density=1)
df_sel['f_Z1mass'].plot.hist(bins=X, label='NN',histtype='step', log=True,density=1)
df_sel_rf['f_Z1mass'].plot.hist(bins=X, label='RF',histtype='step',log=True, density=1)
plt.title('$mass(Z_{1})$ normalized distribution',fontsize=12,fontweight='bold', color='r')
plt.legend(loc='upper right')
plt.xlim(20,150)

Image Added

plt.xlabel('f_Z2mass (GeV)')
X = np.linspace(0., 150, 100)
plt.rcParams['figure.figsize'] = (10,5)
df_all['f_Z2mass'][(df_all['isSignal'] == 0)].plot.hist(bins=X, label='bkg',histtype='step', density=1)
df_all['f_Z2mass'][(df_all['isSignal'] == 1)].plot.hist(bins=X, label='signal',histtype='step', density=1)
df_sel['f_Z2mass'].plot.hist(bins=X, label='NN',histtype='step', density=1)
df_sel_rf['f_Z2mass'].plot.hist(bins=X, label='RF',histtype='step', density=1)
plt.title('$mass(Z_{2})$ normalized distribution',fontsize=12,fontweight='bold', color='r')
plt.legend(loc='upper right')
plt.xlim(0.,150)

Image Added

Let's do the same for some variables which we have not used during the training phase. What can you say about them?


# Plot Jet1 eta for signal, background and NN/RF selected events 
plt.xlabel('$\eta$(Jet1)')
X = np.linspace(-5.,5.,100)
plt.rcParams['figure.figsize'] = (10,5)
df_all['f_jet1_eta'][(df_all['isSignal'] == 0)].plot.hist(bins=X, label='bkg',histtype='step', density=1)
df_all['f_jet1_eta'][(df_all['isSignal'] == 1)].plot.hist(bins=X, label='signal',histtype='step', density=1)
df_sel['f_jet1_eta'].plot.hist(bins=X, label='NN',histtype='step', density=1)
df_sel_rf['f_jet1_eta'].plot.hist(bins=X, label='RF',histtype='step', density=1)
plt.legend(loc='upper right')
plt.title('$jet1(\eta)$ normalized distribution',fontsize=12,fontweight='bold', color='r')
plt.xlim(-5,5)

Image Added

# Plot Jet2 eta for signal, background and NN/RF selected events 
plt.xlabel('$\eta$(Jet2)')
X = np.linspace(-5.,5.,100)
plt.rcParams['figure.figsize'] = (10,5)
df_all['f_jet2_eta'][(df_all['isSignal'] == 0)].plot.hist(bins=X, label='bkg',histtype='step', density=1)
df_all['f_jet2_eta'][(df_all['isSignal'] == 1)].plot.hist(bins=X, label='signal',histtype='step', density=1)
df_sel['f_jet2_eta'].plot.hist(bins=X, label='NN',histtype='step', density=1)
df_sel_rf['f_jet2_eta'].plot.hist(bins=X, label='RF',histtype='step', density=1)
plt.title('$jet2(\eta)$ normalized distribution',fontsize=12,fontweight='bold', color='r')
plt.legend(loc='upper right')
plt.xlim(-5,5)
Image Added

Optional Exercise 1 - Change the decay channel

Question to students: What happens if you switch to the Image Added decay channel? You can submit your model (see the ML challenge below) for this physical process as well!

Optional Exercise 2 - Merge the backgrounds

Question to students: Merge the backgrounds used up to now for the training of our ML algos together with the ROOT File named ttH_HToZZ_4L.root. In this case you will use also the QCD Image Added irreducible background. Uncomment the correct lines of code to proceed!

Machine Learning challenge

Once you manage to improve the network (random forest) performances, you can submit your results and participate to our ML challenge. The challenge samples are available in this workspace, but the true labels (isSignal) are removed, so that you can't compute the AUC.

  • You can participate as a single participant or as a team
  • The winner is the one scoring the best AUC in the challenge samples!
  • In the next box, you will find some lines of code for preparing an output csv file, containing your y_predic for this new data sets!
  • Choose a meaningful name for your result csv file (i.e. your name, or your team name, the model used for the training phase,and the decay channel - 4Image Added or 4Image Added - but avoid to submit results.csv)
  • Download the csv file and upload it here: https://recascloud.ba.infn.it/index.php/s/CnoZuNrlr3x7uPI
  • You can submit multiple results, paying attention to name them accordingly (add the version number, such as v1, v34, etc.)
  • You can use this exercise as a starting point (train over constituents)
  • We will consider your best result for the final score.
  • The winner will be asked to present the ML architecture!

Have fun!

### Evaluate performance on an independent sample
# DO NOT CHANGE BELOW!
from google.colab import files


files = {
    "input_hl.csv":"dBHt9vsvKDUkJNt" #high level features
    }
!rm -f *.root
import os 
for file in files.items():
  if not os.path.exists(file[0]):
    b = os.system ( "wget -O %s --no-check-certificate 'https://recascloud.ba.infn.it/index.php/s/%s/download'" % file )
    if b: raise IOError ( "Error in downloading the file %s : (%s)" % file )

filename = {}

df_challenge = {}
#Open the file with dat aset without y_true (only features used for the training of the previous NN model)
filename['input'] = 'input_hl.csv'
df_challenge['input']  = pd.read_csv(filename['input'])
print(df_challenge['input'].shape)
df_challenge['input'].columns= NN_VARS 
X_challenge  = np.asarray( df_challenge['input'].values ).astype(np.float32)
ret = model.predict(X_challenge[:,0:NDIM])
print(ret.shape)
print(ret)
#Convert the y_pred in a dataframe 
df_answer= pd.DataFrame(ret)
df_answer.head()
(164560, 5)
(164560, 1)
[[1.7398037e-05]
 [3.2408145e-01]
 [1.1487612e-04]
 ...
 [2.4130943e-01]
 [1.4921818e-05]
 [8.3920550e-01]]

Out[ ]:



0
01.739804e-05
13.240815e-01
21.148761e-04
36.713818e-10
44.403101e-01
#Check of the input data set without y_true
df_challenge['input'].head()


Out[ ]:


f_massjj

f_deltajj

f_mass4lf_Z1massf_Z2mass
45539137.849240.265574250.7157990.25855092.479004
420184.817502.526370124.0913651.86719019.181890
57798398.078681.013983242.5197190.87160584.223770
588664311.600223.564375445.3876396.45542085.431280
694879314.229641.99784287.8745760.76875317.779630
# Converting the dataframe into a csv file
# Modify the 'answer.csv' string in the line code below and insert your name and the ML model trained (rf or nn)!
# Example: df_answer.to_csv('mario_rossi_rf_4mu.csv')
df_answer.to_csv('answer_2017_trial.csv')
print('Your y_pred has been created! Download it from your Drive directory!\n')
!ls -l
Your y_pred has been created! Download it from your Drive directory!

total 12884
-rw-r--r-- 1 root root   37013 Apr 22 18:11 05.08-decision-tree.png
-rw-r--r-- 1 root root   46864 Apr 22 18:10 ANN_model.h5
-rw-r--r-- 1 root root 2981984 Apr 22 18:13 answer_2017_trial.csv
-rw-r--r-- 1 root root 9160896 Apr 22 18:13 input_hl.csv
-rw-r--r-- 1 root root   44288 Apr 22 18:08 model.png
-rw-r--r-- 1 root root  902391 Apr 22 18:12 rf_model.pkl
drwxr-xr-x 1 root root    4096 Apr 21 13:39 sample_data

Upload your results here:

https://recascloud.ba.infn.it/index.php/s/CnoZuNrlr3x7uPI

References



Attachments