Author(s)

Name: Mirco Tracolli
Institution: INFN Section Perugia
Mail Address: mirco.tracolli@pg.infn.it
Social Contacts: N/A




How to Obtain Support

Mail: mirco.tracolli@pg.infn.it
Social: N/A
Jira: N/A

General Information

ML/DL Technologies: ML/RL
Science Fields: High Energy Physics, Computing, Cache
Difficulty: medium
Language: English
Type: runnable, external resource

Software and Tools

Programming Language: Python3, Go
ML Toolset: Keras, TensorFlow, sklearn
Additional libraries: N/A
Suggested Environments: bare Linux node

Needed datasets

Data Creator: Ad hoc tool
Data Type: logs of data analysis file requests
Data Size: depends on the configuration of the generator tool
Data Source: data generator tool: https://github.com/Cloud-PG/dataset-generator

Short Description of the Use Case

Accessing data is a crucial task in the data analysis flow, and there are usually several frameworks and software layers that make it possible. In particular, recent studies focus on Data Lake cache management to optimize the flow of data to the clients. The caching layer is an important part of the data flow that should be optimized, especially when the data are distributed and the compute centers are decentralized.

Since the infrastructure is in continuous development, a simulation environment is needed to test and experiment with different approaches to improving the caching performance in a Data Lake. As a result, this project gives you a playground in which to test new features or algorithms.

How to execute it


Base requirement packages (Debian based distro)

  • git
  • python3 (python3-dev, python3-pip)
  • golang

Get the tools

First, you need the data generator to create a synthetic dataset. The data generator used in this project is https://github.com/Cloud-PG/dataset-generator (cloned below).

With such a generator, you can create a dataset whose requests are similar to those of the HEP context in which this project was born.

Note: all the Python commands refer to the Python 3 environment

# Create a folder for the whole project
mkdir myProject
cd myProject

# Download the repository
git clone https://github.com/Cloud-PG/dataset-generator.git

# Enter the project folder
cd dataset-generator

# Install dependencies
pip3 install -r requirements.txt

# Back to main project folder
cd ..

Second, you can download the simulation environment:

# Download the repository
git clone --branch v2.0.2 https://github.com/Cloud-PG/smart-cache.git

# Enter the project folder
cd smart-cache

# Install the Utilities
cd SmartCache/sim/Utilities
pip3 install -e .
cd ../../..

# Install the Probe module
cd Probe
python3 setup.py install
cd ..

# Install general requirements
pip3 install coloredlogs colorama dash_daq biokit

# Back to main project folder
cd ..

Create a dataset

You can use a preset config to generate a dataset with the following command:

python3 dataset-generator/dataset_generator.py gen dataset-generator/configs/HighFreqDataset.json --dest-folder ./dataset

Of course, you can edit the HighFreqDataset.json file in the configs folder to customize your data generator. Here you can see an example of such a configuration:

{
    "seed": 42,
    "num_days": 365,
    "num_req_x_day": -1,
    "dest_folder": "HighFrequencyDataset",
    "function": {
        "function_name": "HighFrequencyDataset",
        "kwargs": {
            "num_files": 1000,
            "min_file_size": 1000,
            "max_file_size": 4000,
            "lambda_less_req_files": 1.0,
            "lambda_more_req_files": 10.0,
            "perc_more_req_files": 25.0,
            "perc_files_x_day": 1.0,
            "size_generator_function": "gen_random_sizes"
        }
    }
}
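If you prefer to derive variants of this configuration programmatically, a minimal Python sketch like the following can do it. It relies only on the keys shown in the example above; the output file name MyDataset.json is a hypothetical choice:

import json

# Load the preset configuration shipped with the generator
with open("dataset-generator/configs/HighFreqDataset.json") as f:
    config = json.load(f)

# Tweak some parameters (keys taken from the example above)
config["seed"] = 1234
config["num_days"] = 30
config["function"]["kwargs"]["num_files"] = 5000

# Write the customized configuration to a new file (hypothetical name)
with open("dataset-generator/configs/MyDataset.json", "w") as f:
    json.dump(config, f, indent=4)

You can then pass the new file to dataset_generator.py exactly as done above.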

After the dataset creation, you will find the dataset files in the dataset folder at the top level of the project.

Run the simulation

First, you need to compile the simulator:

# Compile the simulator
python3 -m utils compile --release --fast

Then, you can get the path of the simulator executable with the following command:

# Get simulator exec path
export SIM=$(python3 -m utils sim-path)

# Check simulator executable is working
$SIM help

After that, you can run a simulation using the dataset previously generated. To do this, you need to create a proper simulation config file, like the following:

--- # Simulation parameters
sim:
  data: ./dataset
  outputFolder: ./results/
  type: normal
  window:
    start: 0
    stop: 52
  region: it
  overwrite: true
  cache:
    # Use Reinforcement learning AI
    type: aiRL
    watermarks: false
    # Create a cache with 10G size
    size:
      value: 10
      unit: G
    bandwidth: 
      value: 1
      redirect: true
  ai:
    rl:
      epsilon:
        decay: 0.001
      addition:
        featuremap: ./smart-cache/featureMaps/rlAdditionFeatureMap.json
      eviction:
        featuremap: ./smart-cache/featureMaps/rlEvictionFeatureMap.json

Create the above config with the following command:

cat << EOF > simulation.conf.yaml
--- # Simulation parameters
sim:
  data: $(pwd)/dataset
  outputFolder: $(pwd)/results
  type: normal
  window:
    start: 0
    stop: 52
  region: it
  overwrite: true
  cache:
    # Use Reinforcement learning AI
    type: aiRL
    watermarks: false
    # Create a cache with 10G size
    size:
      value: 10
      unit: G
    bandwidth: 
      value: 1
      redirect: true
  ai:
    rl:
      epsilon:
        decay: 0.001
      addition:
        featuremap: $(pwd)/smart-cache/featureMaps/rlAdditionFeatureMap.json
      eviction:
        featuremap: $(pwd)/smart-cache/featureMaps/rlEvictionFeatureMap.json
EOF
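Equivalently, if you prefer to build the configuration from Python, a small sketch using PyYAML (an extra dependency, installable with pip3 install pyyaml) can produce the same file with the same parameters as the heredoc above:

import os
import yaml  # PyYAML, extra dependency: pip3 install pyyaml

base = os.getcwd()

# Same parameters as the heredoc above, expressed as a Python dict
config = {
    "sim": {
        "data": os.path.join(base, "dataset"),
        "outputFolder": os.path.join(base, "results"),
        "type": "normal",
        "window": {"start": 0, "stop": 52},
        "region": "it",
        "overwrite": True,
        "cache": {
            "type": "aiRL",  # Reinforcement learning AI cache
            "watermarks": False,
            "size": {"value": 10, "unit": "G"},  # 10G cache
            "bandwidth": {"value": 1, "redirect": True},
        },
        "ai": {
            "rl": {
                "epsilon": {"decay": 0.001},
                "addition": {
                    "featuremap": os.path.join(
                        base, "smart-cache/featureMaps/rlAdditionFeatureMap.json"
                    )
                },
                "eviction": {
                    "featuremap": os.path.join(
                        base, "smart-cache/featureMaps/rlEvictionFeatureMap.json"
                    )
                },
            }
        },
    }
}

with open("simulation.conf.yaml", "w") as f:
    yaml.safe_dump(config, f, default_flow_style=False, sort_keys=False)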

Finally, run the simulation with:

$SIM sim simulation.conf.yaml

Explore the results

The simulation results will be stored in a folder named results/run_full_normal/aiRL_SCDL2-onK_10G_1Gbit_it/, whose exact name may change based on the simulation configuration file. The folder contains a .csv file with the simulation results and other files containing some simulation statistics. You can load these results using a Python library like pandas, or examine them using the dashboard from the Probe module, as shown below.
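A minimal pandas sketch (assuming pandas is installed, e.g. via pip3 install pandas; since the exact CSV file names depend on the configuration, the sketch simply searches the output folder for them):

import glob

import pandas as pd

# The exact CSV file names depend on the simulation configuration,
# so search the whole output folder for them
for path in glob.glob("results/run_full_normal/**/*.csv", recursive=True):
    df = pd.read_csv(path)
    print(path)
    print(df.columns.tolist())  # see which statistics are available
    print(df.head())

To launch the dashboard instead: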

python3 -m probe.results dashboard results

Finally, the dashboard will be available at http://localhost:8050/ by default. If you need to bind the dashboard to a specific IP address, you can set the proper parameter (e.g. -dash-ip 0.0.0.0).

Annotated Description

N/A

References

Attachments

N/A
