
Author(s)

Institution: INFN Section Perugia
Mail Address: mirco.tracolli@pg.infn.it
Social Contacts: N/A




How to Obtain Support

Mail: mirco.tracolli@pg.infn.it
Social: N/A
Jira: N/A

General Information

ML/DL Technologies: ML/RL
Science Fields: High Energy Physics, Computing, Cache
Difficulty: medium
Language: English
Type: runnable, external resource

Software and Tools

Programming Language: Python3, Go
ML Toolset: Keras, TensorFlow, sklearn
Additional libraries:
Suggested Environments: bare Linux Node

Needed datasets

Data Creator: Ad hoc tool
Data Type: logs of data analysis file requests
Data Size: depends on the configuration of the generator tool
Data Source: data generator tool: https://github.com/Cloud-PG/dataset-generator


Short Description of the Use Case

Accessing data is a key step in the data analysis flow, and several frameworks and software layers are usually involved in making it possible. In particular, recent studies focus on Data Lake cache management to optimize the data flow to the clients. The caching layer is an important part of that flow and is worth optimizing, especially when the data are distributed and the compute centers are decentralized.

Since the infrastructure is under continuous development, a simulation environment is needed to test and experiment with different approaches to improving caching performance in a Data Lake. This project provides such a playground, where you can test new features or algorithms.
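To make the idea of a cache simulation concrete, the sketch below replays a list of file requests through a simple LRU (least recently used) cache and reports the hit rate. It is only a toy illustration and not the simulator shipped with this project: the LRU policy, the function name simulate_lru, and the request log (file names and sizes) are all assumptions chosen for the example.

```python
from collections import OrderedDict


def simulate_lru(requests, capacity):
    """Replay (file name, size) requests through an LRU cache; return the hit rate."""
    cache = OrderedDict()  # file name -> size, least recently used first
    used = 0
    hits = 0
    for name, size in requests:
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
            continue
        if size > capacity:
            continue  # the file can never fit: count it as a miss and skip it
        # Evict least recently used files until the new one fits.
        while used + size > capacity:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[name] = size
        used += size
    return hits / len(requests) if requests else 0.0


if __name__ == "__main__":
    # Hypothetical request log: (file name, size in GB).
    log = [("a", 4), ("b", 4), ("a", 4), ("c", 4), ("b", 4), ("a", 4)]
    for capacity in (8, 12):
        print(f"capacity={capacity} GB, hit rate={simulate_lru(log, capacity):.2f}")
```

Running the same log with different capacities, or swapping the eviction rule, is the kind of comparison a simulation environment makes cheap to repeat.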

How to execute it


Base requirement packages (Debian based distro)

  • git
  • python3 (python3-dev, python3-pip)
  • golang
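Before continuing, it can help to confirm that the tools listed above are actually on the PATH. The following is a minimal sketch using only the Python standard library. The executable names are assumptions derived from the package list (the golang package is assumed to provide a go binary, and python3-pip a pip3 binary); find_tools is a hypothetical helper, not part of the project.

```python
import shutil

# Executables implied by the package list above (assumed names):
# "go" from the golang package, "pip3" from python3-pip.
REQUIRED_TOOLS = ("git", "python3", "pip3", "go")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its location on PATH, or None when it is missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Any tool reported as NOT FOUND should be installed through the distribution's package manager before moving on.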

Get the tools

First, you need the data generator to create a synthetic dataset. The data generator used in this project is the following:

...

# Download the repository
git clone --branch v2.0.2 https://github.com/Cloud-PG/smart-cache.git

# Enter the project folder
cd smart-cache

# Install the Utilities
cd SmartCache/sim/Utilities
pip3 install -e .
cd ../../..

# Install the Probe module
cd Probe
python3 setup.py install
cd ..

# Install general requirements
pip3 install coloredlogs colorama dash_daq biokit

# Back to main project folder
cd ..

Create a dataset

You can use a preset config to generate a dataset with the following command:

...

After the dataset is created, you will find the dataset files in the dataset folder inside the main project folder.

Run the simulation

First, you need to compile the simulator:

...

$SIM sim simulation.conf.yaml

Explore the results

The simulation results will be stored in a folder named results/run_full_normal/aiRL_SCDL2-onK_10G_1Gbit_it/, which may change depending on the simulation configuration file. The folder contains a .csv file with the simulation results and other files containing some simulation statistics. You can load these results using a Python library like pandas, or you can examine them using the dashboard from the Probe module:

...

Finally, the dashboard will be available at http://localhost:8050/ by default. If you need a specific IP for the dashboard, you can set the corresponding parameter (e.g. -dash-ip 0.0.0.0).
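As an alternative to the dashboard, the .csv results can be inspected directly from Python. The sketch below uses only the standard library csv module so that it runs without extra packages; calling pandas.read_csv on the same path would give a DataFrame instead. The helper name load_results is hypothetical, and the column names are not assumed: the script simply prints whatever header the file contains. The folder name is the one quoted above and depends on the simulation configuration.

```python
import csv
from pathlib import Path


def load_results(results_dir):
    """Return {relative csv path: list of row dicts} for every .csv under results_dir."""
    root = Path(results_dir)
    tables = {}
    for csv_path in sorted(root.rglob("*.csv")):
        with csv_path.open(newline="") as handle:
            # DictReader uses the first line of the file as column names.
            tables[str(csv_path.relative_to(root))] = list(csv.DictReader(handle))
    return tables


if __name__ == "__main__":
    # Folder name taken from the text above; adjust it to your configuration.
    results = load_results("results/run_full_normal/aiRL_SCDL2-onK_10G_1Gbit_it")
    for name, rows in results.items():
        columns = list(rows[0].keys()) if rows else []
        print(f"{name}: {len(rows)} rows, columns: {columns}")
```

All values are read as strings, so numeric columns need an explicit conversion (for example with float) before computing statistics.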

Annotated Description

N/A

References

Attachments

N/A