Author(s)

Name	Institution	Mail Address
Matteo Migliorini	INFN Sezione di Padova	matteo.migliorini@pd.infn.it

General Information

ML/DL Technologies	Feedforward neural network
Science Fields	High Energy Physics
Difficulty	Low
Language	English
Type	Runnable

Software and Tools

Programming Language	Python
ML Toolset	BigDL
Additional libraries	BigDL, Spark
Suggested Environments	INFN-Cloud VM, Spark Cluster

Short Description of the Use Case

The effective utilization at scale of complex machine learning techniques for HEP use cases poses several technological challenges, most importantly the actual implementation of dedicated end-to-end data pipelines. In the original paper (see References) we presented a possible solution to these challenges, built using industry-standard big data tools such as Apache Spark. The pipeline exploits Spark in all steps, from ingesting and processing the ROOT files containing the dataset to the distributed training of the model.

...

In this simple example, we will train a deep neural network with the goal of classifying three different kinds of events. A more detailed description of the HEP use case and of the model used is provided in the original paper.
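To make the idea of a feedforward classifier with three output classes more concrete, the following is a minimal sketch of a forward pass in plain Python. It is not the BigDL model from the notebook: the layer sizes (14 inputs, hidden layers of 50, 20 and 10 units), the ReLU activations, the random weights and all function names are assumptions chosen for illustration only. Refer to the paper for the actual architecture.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (illustrative, untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features, layers):
    """Hidden layers use ReLU; the last layer uses softmax over the classes."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))


rng = random.Random(0)
sizes = [14, 50, 20, 10, 3]  # assumed sizes; the last entry is the 3 event classes
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
event = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]  # fake input features
probs = forward(event, layers)
print(probs)  # three class probabilities
```

The predicted class is the index of the largest probability. Training, which this sketch omits, adjusts the weights so that this index matches the true event type.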

How to execute it

A Spark cluster is required in order to perform the training. This can be created on INFN Cloud by performing the following steps:

...

In principle, they should work out of the box without modification. If one wishes to use more workers for the training and additional worker nodes are available, the number of executors can be increased in the Spark session through the options "spark.executor.instances" and "spark.cores.max" (the latter should be equal to executor.cores*executor.instances).
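The relation between these options can be sketched as follows. This is only an illustration, not the configuration used in the notebook: the helper names `executor_conf` and `build_session` and the example values (4 executors with 2 cores each) are hypothetical, and `build_session` assumes that PySpark is installed and a cluster is reachable.

```python
def executor_conf(instances, cores_per_executor):
    """Return Spark options that keep spark.cores.max consistent with
    spark.executor.cores * spark.executor.instances."""
    if instances < 1 or cores_per_executor < 1:
        raise ValueError("instances and cores_per_executor must be >= 1")
    return {
        "spark.executor.instances": str(instances),
        "spark.executor.cores": str(cores_per_executor),
        "spark.cores.max": str(instances * cores_per_executor),
    }


def build_session(app_name, conf):
    """Create a Spark session with the given options (requires PySpark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = executor_conf(instances=4, cores_per_executor=2)  # example values
print(conf["spark.cores.max"])  # 4 executors * 2 cores
```

On a running cluster, `build_session("HLF classifier", conf)` would then start the session with these settings; the application name here is also just a placeholder.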

References

A detailed description of the work around this example can be found in this paper:
Migliorini, M., Castellotti, R., Canali, L. et al. Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics. Comput Softw Big Sci 4, 8 (2020). https://doi.org/10.1007/s41781-020-00040-0
Data and code used for this work can be found in this repository.
Related blog entries:
- Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo
- Distributed Deep Learning for Physics with TensorFlow and Kubernetes
Presentations:
- Deep Learning Pipelines for High Energy Physics using Apache Spark with Distributed Keras on Analytics Zoo

Attachments

HLF classifier.ipynb

...
