Name | Institution | Mail Address |
---|---|---|
Matteo Migliorini | INFN Sezione di Padova | matteo.migliorini@pd.infn.it |
ML/DL Technologies | Feedforward neural network |
---|---|
Science Fields | High Energy Physics |
Difficulty | Low |
Language | English |
Type | Runnable |
Programming Language | Python |
ML Toolset | BigDL |
Additional libraries | BigDL, Spark |
Suggested Environments | INFN-Cloud VM, Spark Cluster |
The effective utilization at scale of complex machine learning techniques for HEP use cases poses several technological challenges, most importantly the actual implementation of dedicated end-to-end data pipelines. This use case presents a possible solution to these challenges, built with industry-standard big data tools such as Apache Spark. Spark is exploited at every step of the pipeline, from ingesting and processing the ROOT files containing the dataset to the distributed training of the model.
In this use case we focus on training one of the classifiers using Spark and Intel BigDL, an open-source deep learning library from Intel written on top of Spark, which therefore allows easy scale-out computing. Furthermore, BigDL makes it easy to distribute model training across the workers, yielding an almost linear speedup in training time. All of this complexity is hidden from the user, who is exposed only to an API similar to Keras/PyTorch.
In this simple example, we will train a deep neural network to classify three different kinds of events. A more detailed description of the HEP use case and of the model is provided in the original paper.
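To make the classification task concrete, here is a minimal plain-NumPy sketch of the forward pass of a three-class feedforward classifier. The feature count and layer sizes are illustrative assumptions, not the architecture used in the paper; the actual training in this use case is done with BigDL on Spark.

```python
import numpy as np

# Hypothetical sizes: the real classifier in the paper has its own
# input features and layer widths.
N_FEATURES = 14
N_CLASSES = 3  # the three event categories

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # subtract the row-wise max for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Randomly initialised weights for a two-hidden-layer network
W1 = rng.normal(size=(N_FEATURES, 50)); b1 = np.zeros(50)
W2 = rng.normal(size=(50, 20));         b2 = np.zeros(20)
W3 = rng.normal(size=(20, N_CLASSES));  b3 = np.zeros(N_CLASSES)

def forward(x):
    h1 = relu(x @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return softmax(h2 @ W3 + b3)

# One batch of fake events: each row is one event's feature vector
batch = rng.normal(size=(8, N_FEATURES))
probs = forward(batch)
print(probs.shape)  # (8, 3): one probability per class, per event
```

Each output row is a probability distribution over the three event classes; training would then minimize the cross-entropy between these outputs and the true labels.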
A Spark cluster is required in order to perform the training. This can be created on INFN Cloud by performing the following steps:
A subset of the dataset has already been uploaded to MinIO and should be accessible to everybody. The full dataset used in this example can be downloaded from here and uploaded to a workspace on MinIO.
In the notebooks, the first three paragraphs are needed, respectively, to
In principle they should not need to be modified and will work out of the box. If one wishes to use more workers for the training and more worker nodes are available, the number of executors can be increased in the Spark session via the options "spark.executor.instances" and "spark.cores.max" (which should be equal to spark.executor.cores * spark.executor.instances).
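The relation between those options can be sketched as follows. `spark_scaleout_options` is a hypothetical helper (not part of the notebooks), and the numbers are example values; the resulting dictionary would be passed to the Spark session via `SparkSession.builder.config(...)`.

```python
# Hypothetical helper that builds the scale-out options discussed above.
# The executor and core counts are examples, not recommended values.
def spark_scaleout_options(executor_instances, executor_cores):
    return {
        "spark.executor.instances": str(executor_instances),
        "spark.executor.cores": str(executor_cores),
        # spark.cores.max should equal executor.cores * executor.instances,
        # otherwise Spark may schedule fewer executors than requested
        "spark.cores.max": str(executor_instances * executor_cores),
    }

opts = spark_scaleout_options(executor_instances=4, executor_cores=2)
print(opts["spark.cores.max"])  # "8"
```

Keeping the three values consistent avoids a common misconfiguration where the cluster caps the total cores below what the requested executors need.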