...

Preparing input features for the ML algorithms

We have our datasets ready to train our ML algorithms! Before doing that, we have to decide which input variables to pass to the algorithms so that the model can learn to distinguish between signal and background events.

We can use:

  1. The five high-level input variables f_massjj, f_deltajj, f_mass4l, f_Z1mass, and f_Z2mass.
  2. The 18 kinematic variables that characterize the four-lepton + two-jet final-state objects.

To make the best choice, we can look at the two sets of correlation plots (the so-called scatter plots, produced with the seaborn library) among the features at our disposal and see which set better captures the differences between signal and background events.

Note: this operation is quite long for both sets since we are dealing with a lot of events. Skip the following two code cells and trust us in using the high-level features for building your ML models! Indeed, we obtain better discriminator performance using the high-level features. You can always return to this part of the exercise and try the low-level features.

# It will take a while (5 minutes), you can skip it as said before.
# We leave you the output of this code cell using a .png format
# VAR = ['f_massjj', 'f_deltajj', 'f_mass4l', 'f_Z1mass', 'f_Z2mass', 'isSignal']
# sns.pairplot(data=df_all.filter(VAR), hue='isSignal', kind='scatter', diag_kind='auto');

...

Number NN input variables= 5
NN input variables= ['f_massjj', 'f_deltajj', 'f_mass4l', 'f_Z1mass', 'f_Z2mass']
(114984, 5)
(114984, 1)
(114984, 1)

Dividing the data into testing and training datasets

You can now split the dataset into two parts: one for the training and validation steps, and one for the testing phase.

Question for students: Have a look at the test_size parameter setting. Why did we choose such a small fraction of events for the testing phase?

# Classical way to proceed, using a scikit-learn algorithm:


# X_train_val, X_test, Y_train_val, Y_test, W_train_val, W_test =
#     train_test_split(X, Y, W, test_size=0.2, shuffle=None, stratify=None)

# Alternative way, the one that we chose in order to study the model's performance 
# with ease (with an analogous procedure used by TMVA in ROOT framework)
# to keep information about the flag isSignal in both training and test steps. 

size= int(len(X[:,0]))
test_size = int(0.2*len(X[:,0]))
print('X (features) before splitting')
print('\n')
print(X.shape)
print('X (features) splitting between test and training')
X_test= X[0:test_size+1,:]
print('Test:')
print(X_test.shape)
X_train_val= X[test_size+1:len(X[:,0]),:]
print('Training:')
print(X_train_val.shape)
print('\n')
print('Y (target) before splitting')
print('\n')
print(Y.shape)
print('Y (target) splitting between test and training ')
Y_test= Y[0:test_size+1,:]
print('Test:')
print(Y_test.shape)
Y_train_val= Y[test_size+1:len(Y[:,0]),:]
print('Training:')
print(Y_train_val.shape)
print('\n')
print('W (weights) before splitting')
print('\n')
print(W.shape)
print('W (weights) splitting between test and training ')
W_test= W[0:test_size+1,:]
print('Test:')
print(W_test.shape)
W_train_val= W[test_size+1:len(W[:,0]),:]
print('Training:')
print(W_train_val.shape)
print('\n')

...

  • The Sequential model, which is very straightforward (a simple list of layers), but is limited to single-input, single-output stacks of layers (as the name gives away).
  • The Functional API, which is an easy-to-use, fully-featured API that supports arbitrary model architectures. For most people and most use cases, this is what you should be using. This is the Keras "industry-strength" model. We will use it (a minimal sketch follows this list).
  • Model subclassing, where you implement everything from scratch on your own. Use this if you have complex, out-of-the-box research use cases.
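As a first taste of the Functional API, here is a minimal, self-contained sketch; the layer sizes and names below are illustrative only and are not the model built later in this exercise:

from tensorflow import keras
from tensorflow.keras import layers

# Minimal Functional API example: inputs and outputs are wired explicitly,
# which is what makes multi-input / multi-output architectures straightforward.
inputs = keras.Input(shape=(5,), name='features')        # e.g. the 5 high-level variables
x = layers.Dense(16, activation='relu')(inputs)           # one hidden layer (arbitrary size)
outputs = layers.Dense(1, activation='sigmoid')(x)        # binary-classification output
toy_model = keras.Model(inputs=inputs, outputs=outputs, name='toy_functional_model')
toy_model.summary()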

Introduction to the Neural Network algorithm

A Neural Network (NN) is a biology-inspired analytical model, but not a bio-mimetic one. It is formed by a network of basic elements called neurons or perceptrons (see the picture below), which receive input, change their state according to the input and produce an output.

The neuron/perceptron concept


The perceptron, while it has a simple structure, has the ability to learn and solve very complex problems.

  • It takes the inputs, which feed into the perceptron, multiplies them by their weights, and computes the sum. In the first iteration the weights are set randomly.
  • It adds the number one, multiplied by a “bias weight”.

...

  • It feeds the sum through the activation function; in a simple perceptron system, the activation function is a step function.
  • The result of the step function is the neuron output (a toy NumPy sketch of these steps follows this list).
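The following sketch illustrates these steps in plain NumPy with made-up numbers; it is only an illustration of the perceptron idea, not code used in the exercise:

import numpy as np

def step(z):
    # Step activation: the neuron fires (1) if the weighted sum is positive.
    return 1 if z > 0 else 0

x = np.array([0.5, -1.2, 3.0])            # toy input features
w = np.random.uniform(-1, 1, size=3)       # weights, set randomly at the first iteration
b = np.random.uniform(-1, 1)               # bias weight (multiplies the constant input 1)

z = np.dot(w, x) + 1.0 * b                 # weighted sum of the inputs plus the bias
output = step(z)                           # neuron output
print(output)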

Neural Network Topologies

Neural Networks (NN) can be classified according to the type of neuron interconnections and the flow of information.

Feed Forward Networks

A feedforward NN is a neural network where connections between the nodes do not form a cycle. In a feedforward network information always moves in one direction, from input to output, and it never goes backward. A feedforward NN can be viewed as a mathematical model of a function y = f(x) that maps the input variables to the output.

Recurrent Neural Network

A Recurrent Neural Network (RNN) is one that allows connections between nodes in the same layer, among each other, or with previous layers.

Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequential input data.

...

A Neural Network layer is called a dense layer to indicate that it is fully connected.

Information about Neural Network architectures can be found here: https://www.asimovinstitute.org/neural-network-zoo/

...

In an ANN the input variable values are passed to a first input layer, whose output is passed as input to the next layer, and so on.

The last output layer usually consists of a single node that provides the discriminant output. Intermediate layers between the input and the output layers are called hidden layers. Usually, if a Neural Network has more than one hidden layer, it is called a Deep Neural Network and is theoretically able to perform the feature extraction by itself (it becomes a Deep Learning algorithm).

Such a structure is also called Feedforward Multilayer Perceptron (MLP, see the picture).


The output of a node is computed as a weighted sum of its input variables, with weights that are subject to optimization via training.

...

Then a bias or threshold parameter is applied. This bias accounts for the random noise, in the sense that it measures how well the model fits the training set (i.e. how much the model is able to correctly predict the known outputs of the training examples). The output of a given node is y = φ(Σ_i w_i x_i + b), where x_i are the inputs, w_i the weights, b the bias, and φ the activation function.

...

A popular algorithm to optimize the weights consists of iteratively modifying the weights after each training observation, or after a batch of training observations, by minimizing the loss function.

The minimization usually proceeds via the so-called Stochastic Gradient Descent (SGD), which modifies each weight w at every iteration according to the update rule w → w − η ∂L/∂w, where η is the learning rate and L is the loss function.
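As a toy illustration of this update rule (all numbers below are made up, and a simple squared-error loss on a single example is assumed):

import numpy as np

eta = 0.01                        # learning rate (illustrative value)
w = np.array([0.2, -0.5])         # current weights
x = np.array([1.0, 3.0])          # one training example
y_true = 1.0                      # its known target

y_pred = np.dot(w, x)             # linear node output (no activation, for simplicity)
grad = 2 * (y_pred - y_true) * x  # dL/dw for the loss L = (y_pred - y_true)**2
w = w - eta * grad                # SGD update: w -> w - eta * dL/dw
print(w)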

Other, more complex optimization algorithms are available in the Keras API.

...

Metric functions are similar to loss functions, except that the results from evaluating a metric are not used when training the model. Note that you may use any loss function as a metric.

...

  • Number of Hidden Layers and units: the hidden layers are the layers between the input layer and the output layer. Many hidden units within a layer can increase accuracy. A smaller number of units may cause underfitting.
  • Network Weight Initialization: ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer. Mostly uniform distribution is used.
  • Activation functions: they are used to introduce nonlinearity to models, which allows deep learning models to learn nonlinear prediction boundaries.
  • Learning Rate: it defines how quickly a network updates its parameters. A low learning rate slows down the learning process but converges smoothly. A larger learning rate speeds up the learning but may not converge. Usually, a decaying learning rate is preferred (see the schedule sketch after this list).
  • Number of epochs: in terms of artificial neural networks, an epoch refers to one cycle through the full training dataset. Usually, training a neural network takes more than a few epochs. An epoch is often mixed up with an iteration: iterations are the number of batches, or steps through partitioned packets of the training data, needed to complete one epoch. You should increase the number of epochs until the validation accuracy starts decreasing even when the training accuracy is increasing (in order to avoid overfitting).
  • Batch size: the number of subsamples (events) given to the network, after which the parameters are updated. A good default for batch size might be 32; also try 64, 128, 256, and so on.
  • Dropout: a regularization technique used to avoid overfitting and thus increase the generalization power. Generally, use a small dropout value of 10%-50% of neurons. A value that is too low has minimal effect, while a value that is too high could result in the network under-learning.
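For instance, a decaying learning rate can be implemented in Keras with a learning-rate schedule. The following is only a sketch, with arbitrary values, of how such a schedule is attached to an optimizer; the exercise itself does not necessarily use it:

from tensorflow import keras

# Exponentially decaying learning rate: starts at 1e-3 and is multiplied by 0.9
# every 1000 optimizer steps (all values here are illustrative).
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
optimizer = keras.optimizers.Adam(learning_rate=lr_schedule)

# The other hyperparameters listed above enter when building and fitting the model,
# e.g. layers.Dropout(0.2) in the architecture and model.fit(..., epochs=50, batch_size=32).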

Applications in High Energy Physics

Nowadays ANNs are used for a variety of tasks: image and speech recognition, translation, filtering, playing games, medical diagnosis, autonomous vehicles. There are also many applications in High Energy Physics: classification of signal and background events, particle tagging, simulation, event reconstruction...

...

Layers are the basic building blocks of neural networks in Keras. A layer consists of a tensor-in tensor-out computation function (the layer's call method) and some state, held in TensorFlow variables (the layer's weights).

Callbacks API


A callback is an object that can perform actions at various stages of training (e.g. at the start or end of an epoch, before or after a single batch, etc).
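For example, two commonly used callbacks are EarlyStopping and ModelCheckpoint. The snippet below is only a sketch: the monitored quantity, patience and file name are illustrative, and the actual settings used later in the exercise may differ:

from tensorflow import keras

# EarlyStopping halts training when the monitored metric stops improving;
# ModelCheckpoint saves the best model seen so far to a file.
callbacks = [
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                  restore_best_weights=True),
    keras.callbacks.ModelCheckpoint('ANN_model.h5', monitor='val_loss',
                                    save_best_only=True),
]
# These are then passed to model.fit(..., callbacks=callbacks).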

...

Regularization layers: the Dropout layer


The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training time, which helps prevent overtraining. Inputs not set to 0 are scaled up by 1/(1-rate) such that the sum over all inputs is unchanged.

Note that the Dropout layer only applies when training is set to True, so that no values are dropped during inference. When using model.fit, training is set to True automatically; in other contexts, you can set the flag explicitly to True when calling the layer.
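A tiny illustration of this behaviour (the input values are arbitrary):

import tensorflow as tf

data = tf.reshape(tf.range(10, dtype=tf.float32), (5, 2))
dropout = tf.keras.layers.Dropout(rate=0.5)

# With training=True some entries are set to 0 and the others are scaled by 1/(1-rate);
# with training=False (inference) the layer leaves the input unchanged.
print(dropout(data, training=True).numpy())
print(dropout(data, training=False).numpy())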

Artificial Neural Network implementation

We can now start to define the first architecture. The simplest approach is to use fully connected layers (Dense layers in Keras/TensorFlow), with a selu activation function and a sigmoid final layer, since we are facing a binary classification problem.

We are using the binary_crossentropy loss function during training, a standard loss function for binary classification problems. We will optimize the model with the RMSprop algorithm and we will collect accuracy metrics while the model is trained.

To avoid overfitting we also use Dropout layers and some callback functions.
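A minimal sketch of such an architecture is shown below; the number of layers, their sizes and the dropout rates are illustrative and not necessarily those used in the actual notebook, and X_train_val, Y_train_val and W_train_val are assumed to come from the splitting step above:

from tensorflow import keras
from tensorflow.keras import layers

NDIM = X_train_val.shape[1]                          # number of input features (5 here)

inputs = keras.Input(shape=(NDIM,), name='input')
x = layers.Dense(64, activation='selu')(inputs)      # fully connected hidden layer
x = layers.Dropout(0.2)(x)                           # dropout against overfitting
x = layers.Dense(32, activation='selu')(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)   # binary-classification output

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.RMSprop(),
              loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

# history = model.fit(X_train_val, Y_train_val, sample_weight=W_train_val,
#                     validation_split=0.2, epochs=..., batch_size=...,
#                     callbacks=callbacks)  # epochs/batch size left to the exercise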

...

model = keras.models.load_model('ANN_model.h5')

Description of the Random Forest (RF) and Scikit-learn library

In this section you will find the following subsections:

  • Introduction to the Random Forest algorithm
    If you already know about RF you may skip it.
  • Optional exercise: draw a decision tree
    Here you will find an atypical exercise in which you are invited to think about the growth of a decision tree in this specific physics problem.

Here we will use it to build a Random Forest model and compare its discriminating power with that of the Neural Network previously implemented.

What is the Scikit-learn library?

Scikit-learn is a simple and efficient free tool for predictive data analysis, accessible to everybody and reusable in various contexts. It is built on the NumPy, SciPy, and matplotlib scientific libraries.

Introduction to the Random Forest algorithm


Decision Trees and their extension Random Forests are robust and easy-to-interpret machine learning algorithms for classification tasks.

Decision Trees represent a simple and fast way of learning a function that maps data x to outputs y, where x can be a mix of categorical and numeric variables and y can be categorical (for classification) or numeric (for regression).

...

(Deep) Neural Networks pretty much do the same thing. However, despite their power on larger and more complex datasets, they are extremely hard to interpret and they can take many iterations and hyperparameter adjustments before a good result is obtained.

One of the biggest advantages of using Decision Trees and Random Forests is the ease with which we can see which features or variables contribute to the classification or regression, and their relative importance based on their depth in the tree.

Decision Tree


A decision tree is a sequence of selection cuts that are applied in a specified order on the variables of a given dataset.

Each cut splits the sample into nodes, each of which corresponds to a given number of observations classified as class1 (signal) or as class2 (background).

A node may be further split by the application of the subsequent cut in the tree. Nodes in which either signal or background is largely dominant are classified as leaves, and no further selection is applied.

A node may also be classified as a leaf, and the selection path stopped, in case too few observations per node remain, or in case the total number of identified nodes is too large; different criteria have been proposed and applied in real implementations.

...

The gain due to the splitting of a node A into the nodes B1 and B2, which depends on the chosen cut, is given by ΔI = I(A) − I(B1) − I(B2), where I denotes the adopted metric (G or E, in the case of the Gini index or cross-entropy introduced above). By varying the cut, the optimal gain may be achieved.
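As a worked toy example (the signal/background counts below are made up), the Gini index of a node and the gain of a candidate cut can be computed as follows; note that scikit-learn additionally weights each child impurity by the fraction of events falling into it:

# Toy numbers: a parent node A (50 signal, 50 background) is split by a candidate
# cut into B1 (45 signal, 5 background) and B2 (5 signal, 45 background).
def gini(n_sig, n_bkg):
    # Gini index G = 1 - p^2 - (1 - p)^2, with p the signal purity of the node.
    p = n_sig / (n_sig + n_bkg)
    return 1.0 - p**2 - (1.0 - p)**2

I_A, I_B1, I_B2 = gini(50, 50), gini(45, 5), gini(5, 45)
delta_I = I_A - I_B1 - I_B2          # gain of the split, following the formula above
print(I_A, I_B1, I_B2, delta_I)      # approximately 0.5, 0.18, 0.18, 0.14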

Pruning Tree

A solution to overtraining is pruning, i.e. eliminating subtrees (branches) that seem too specific to the training sample:

  • a node and all its descendants turn into a leaf
  • stop tree growth during the building phase

Be careful: early stopping conditions may prevent the discovery of further useful splittings. Therefore, grow the full tree and, when the results from subtrees are not significantly different from those of the parent node, prune them!

From tree to the forest

The random forest algorithm consists of ‘growing’ a large number of individual decision trees that operate as an ensemble from replicas of the training samples obtained by randomly resampling the input data (features and examples).

...

Its main characteristics are:

  • No minimum size is required for leaf nodes. The final score of the algorithm is given by an unweighted average of the prediction (zero or one) by each individual tree.

  • Each individual tree in the random forest spits out a class prediction, and the class with the most votes becomes our model’s prediction.

  • As a large number of relatively uncorrelated models (trees) operating as a committee, this algorithm will outperform any of the individual constituent models. The reason for this wonderful effect is that the trees protect each other from their individual errors (as long as they don’t constantly all err in the same direction). While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction.

  • In a single decision tree, we consider every possible feature and pick the one that produces the best separation between the observations in the left node vs. those in the right node. In contrast, each tree in a random forest can pick only from a random subset of features (bagging). This forces even more variation amongst the trees in the model and ultimately results in lower correlation across trees and more diversification (a minimal scikit-learn sketch follows this list).
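The following is only a sketch of how such a model can be built with scikit-learn, reusing the arrays defined in the splitting step above; the hyperparameter values are arbitrary and the actual settings used in the exercise may differ:

from sklearn.ensemble import RandomForestClassifier

# Random forest: an ensemble of decision trees, each grown on a bootstrap replica
# of the training sample and on random subsets of the features.
rfc = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=7)
rfc.fit(X_train_val, Y_train_val.ravel(), sample_weight=W_train_val.ravel())

# Per-event signal probability and impurity-based feature importances.
y_score = rfc.predict_proba(X_test)[:, 1]
print(dict(zip(['f_massjj', 'f_deltajj', 'f_mass4l', 'f_Z1mass', 'f_Z2mass'],
               rfc.feature_importances_)))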

...

The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples to which they contribute can thus be used as an estimate of the relative importance of the features. In scikit-learn, the fraction of samples a feature contributes to is combined with the decrease in impurity from splitting on it to create a normalized estimate of the predictive power of that feature.

...

  • They are computed on statistics derived from the training dataset and therefore do not necessarily inform us on which features are most important to make good predictions on the held-out dataset.
  • They favor high-cardinality features, that is, features with many unique values. Permutation feature importance is an alternative to impurity-based feature importance that does not suffer from these flaws.

Due to its complexity, we will not show it in this exercise. For more details, see the following link: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html.

Optional exercise: Draw a decision tree

Exercise for students: Here is an example of how you can draw a decision tree by yourself! Try to imagine how the decision tree's growth could proceed in our analysis case and complete it! We give you some hints!


import matplotlib.pyplot as plt


%matplotlib inline
fig = plt.figure(figsize=(10, 4))
ax = fig.add_axes([0, 0, 0.8, 1], frameon=False, xticks=[], yticks=[])
ax.set_title('Decision Tree: Higgs Boson events Classification', size=24,color='red')

def text(ax, x, y, t, size=20, **kwargs):
    ax.text(x, y, t,
            ha='center', va='center', size=size,
            bbox=dict( boxstyle='round', ec='blue', fc='w' ), **kwargs)
    
# Here are the variables we can use for the training phase:
# --------------------------------------------------------------------------
# High level features:
# ['f_massjj', 'f_deltajj', 'f_mass4l', 'f_Z1mass' , 'f_Z2mass']
# --------------------------------------------------------------------------
# Low level features:
#        [ 'f_lept1_pt','f_lept1_eta','f_lept1_phi', \
#          'f_lept2_pt','f_lept2_eta','f_lept2_phi', \
#          'f_lept3_pt','f_lept3_eta','f_lept3_phi', \
#          'f_lept4_pt','f_lept4_eta','f_lept4_phi', \
#          'f_jet1_pt','f_jet1_eta','f_jet1_phi', \
#          'f_jet2_pt','f_jet2_eta','f_jet2_phi']
#---------------------------------------------------------------------------
text(ax, 0.5, 0.9, "How large is\n\"f_lept1_pt\"?", 20,color='red')
text(ax, 0.3, 0.6, "How large is\n\"f_lept2_pt\"?", 18,color='blue')
text(ax, 0.7, 0.6, "How large is\n\"f_lept3_pt\"?", 18)
text(ax, 0.12, 0.3, "How large is\n\"f_lept4_pt\"?", 14,color='magenta')
text(ax, 0.38, 0.3, "How large is\n\"f_jet1_eta\"?", 14,color='violet')
text(ax, 0.62, 0.3, "How large is\n\"f_jet2_eta\"?", 14,color='orange')
text(ax, 0.88, 0.3, "How large is\n\"f_jet1_phi\"?", 14,color='green')

text(ax, 0.4, 0.75, ">= 1 GeV", 12, alpha=0.4,color='red')
text(ax, 0.6, 0.75, "< 1 GeV", 12, alpha=0.4,color='red')

text(ax, 0.21, 0.45, ">= 3 GeV", 12, alpha=0.4,color='blue')
text(ax, 0.34, 0.45, "< 3 GeV", 12, alpha=0.4,color='blue')

text(ax, 0.66, 0.45, ">= 2 GeV", 12, alpha=0.4,color='black')
text(ax, 0.79, 0.45, "< 2 GeV", 12, alpha=0.4,color='black')

ax.plot([0.3, 0.5, 0.7], [0.6, 0.9, 0.6], '-k',color='red')
ax.plot([0.12, 0.3, 0.38], [0.3, 0.6, 0.3], '-k',color='blue')
ax.plot([0.62, 0.7, 0.88], [0.3, 0.6, 0.3], '-k')
ax.plot([0.0, 0.12, 0.20], [0.0, 0.3, 0.0], '--k')
ax.plot([0.28, 0.38, 0.48], [0.0, 0.3, 0.0], '--k')
ax.plot([0.52, 0.62, 0.72], [0.0, 0.3, 0.0], '--k')
ax.plot([0.8, 0.88, 1.0], [0.0, 0.3, 0.0], '--k')
ax.axis([0, 1, 0, 1])

fig.savefig('05.08-decision-tree.png')

...