Author(s)
Name | Institution | Mail Address | Social Contacts |
---|---|---|---|
Mirco Tracolli | INFN Section Perugia | mirco.tracolli@pg.infn.it | N/A |
How to Obtain Support
Mail | mirco.tracolli@pg.infn.it |
---|---|
Social | N/A |
Jira | N/A |
General Information
ML/DL Technologies | ML/RL |
---|---|
Science Fields | High Energy Physics, Computing, Cache |
Difficulty | medium |
Language | English |
Type | runnable, external resource |
Software and Tools
Programming Language | Python3, Go |
---|---|
ML Toolset | Keras, TensorFlow, scikit-learn |
Additional libraries | |
Suggested Environments | bare Linux Node |
Needed datasets
Data Creator | Ad hoc tool |
---|---|
Data Type | logs of data analysis file requests |
Data Size | depending on the configuration of the generator tool |
Data Source | data generator tool: https://github.com/Cloud-PG/dataset-generator |
Short Description of the Use Case
Data access is a central step of any data analysis workflow, and it usually relies on several frameworks and software layers. Recent studies focus in particular on Data Lake cache management as a way to optimize the flow of data to the clients. The caching layer is a key part of that flow and deserves optimization, especially when both the data and the compute centers are distributed.
Because the infrastructure is under continuous development, a simulation environment is needed to test and experiment with different approaches to improving caching performance in a Data Lake. This project therefore provides a playground in which to test new features and algorithms.
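To make the idea of a cache simulation concrete, here is a minimal sketch that replays a list of file requests through a size-limited cache and reports the hit rate. It is not the project's simulator: the LRU eviction policy, the function name `simulate_lru`, and the `(filename, size)` request format are assumptions chosen only for illustration.

```python
from collections import OrderedDict


def simulate_lru(requests, capacity):
    """Replay (filename, size) requests through an LRU cache of the given capacity.

    Returns the fraction of requests served from the cache (the hit rate).
    Sizes and capacity only need to share the same unit.
    """
    cache = OrderedDict()  # filename -> size, least recently used first
    used = 0
    hits = 0
    for name, size in requests:
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
            continue
        if size > capacity:
            continue  # the file can never fit, so it is served without caching
        # Evict least recently used files until the new file fits.
        while used + size > capacity:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[name] = size
        used += size
    return hits / len(requests) if requests else 0.0


if __name__ == "__main__":
    # Hypothetical trace: three files of 4 units each, requested six times.
    trace = [("a", 4), ("b", 4), ("a", 4), ("c", 4), ("b", 4), ("a", 4)]
    for capacity in (8, 12):
        print(f"capacity={capacity}: hit rate = {simulate_lru(trace, capacity):.3f}")
```

Running the same trace with different capacities or eviction rules is the kind of comparison a simulation playground is meant to support.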
How to execute it
Base requirement packages (Debian based distro)
- git
- python3 (python3-dev, python3-pip)
- golang
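Before going further, it can help to confirm that the required tools are actually installed. The following is a small sketch using only the Python standard library; the executable names in `REQUIRED_TOOLS` are assumptions about how the packages above appear on a Debian-based system (for example, the golang package is assumed to provide the `go` command).

```python
import shutil

# Executable names assumed for the packages listed above.
REQUIRED_TOOLS = ["git", "python3", "pip3", "go"]


def missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```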
Get the tools
First, you need the data generator to create a synthetic dataset. The data generator used in this project is the following:
...
# Download the repository
git clone --branch v2.0.2 https://github.com/Cloud-PG/smart-cache.git
# Enter the project folder
cd smart-cache
# Install the Utilities
cd SmartCache/sim/Utilities
pip3 install -e .
cd ../../..
# Install the Probe module
cd Probe
python3 setup.py install
cd ..
# Install general requirements
pip3 install coloredlogs colorama dash_daq biokit
# Back to main project folder
cd ..
Create a dataset
You can use a preset config to generate a dataset with the following command:
...
After the dataset has been created, you will find the dataset files in the dataset folder in the project's main directory.
Run the simulation
First, you need to compile the simulator:
...
$SIM sim simulation.conf.yaml
Explore the results
The simulation results will be stored in a folder named results/run_full_normal/aiRL_SCDL2-onK_10G_1Gbit_it/, which may change depending on the simulation configuration file. The folder contains a .csv file with the simulation results and other files containing simulation statistics. You can load these results using a Python library like pandas, or you can examine them using the dashboard from the Probe module:
...
Finally, the dashboard will be available at http://localhost:8050/ by default. If you need a specific IP address for the dashboard, you can set the corresponding parameter (e.g. -dash-ip 0.0.0.0).
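As an alternative to the dashboard, the .csv results can be inspected directly from Python. The sketch below uses only the standard library `csv` module so that it runs without extra dependencies; `pandas.read_csv` on the same file path is the more convenient route if pandas is installed. The results folder is the one named above and depends on your configuration. The helper names are hypothetical, and no assumption is made about the column names: every column whose values all parse as numbers is simply averaged.

```python
import csv
from pathlib import Path
from statistics import mean

# Folder name taken from the text above; it changes with the simulation configuration.
RESULTS_DIR = Path("results/run_full_normal/aiRL_SCDL2-onK_10G_1Gbit_it")


def find_csv_files(results_dir):
    """Return every .csv file found under results_dir, sorted by path."""
    root = Path(results_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.csv"))


def load_rows(csv_path):
    """Read one CSV file into a list of dicts keyed by the header row."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def numeric_column_means(rows):
    """Average each column whose values all parse as floats."""
    means = {}
    if not rows:
        return means
    for column in rows[0]:
        try:
            values = [float(row[column]) for row in rows]
        except (TypeError, ValueError):
            continue  # skip non-numeric columns such as dates or labels
        means[column] = mean(values)
    return means


if __name__ == "__main__":
    for csv_path in find_csv_files(RESULTS_DIR):
        rows = load_rows(csv_path)
        print(csv_path, "-", len(rows), "rows")
        for column, value in numeric_column_means(rows).items():
            print(f"  {column}: mean = {value:.4f}")
```

If the folder does not exist yet, the script prints nothing, which is a quick sign that the simulation has not been run or that it wrote its output elsewhere.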
Annotated Description
N/A
References
Attachments
N/A