...

If a transfer fails, an error message is generated and stored. Failed transfers amount to a few hundred thousand per day. Understanding and, where possible, fixing the cause of failed transfers is part of the duties of the experiment operation teams. Due to the large number of failed transfers, not all of them can be addressed. We developed a pipeline to discover failure patterns from the analysis of FTS error logs. Error messages are read in, stripped of uninformative parts (file paths, host names), and the text is analysed using NLP (Natural Language Processing) techniques such as word2vec. Finally, the messages are grouped into clusters based on the similarity of their text, using either the Levenshtein distance or ML algorithms for unsupervised clustering such as DBSCAN. The biggest clusters and their relationship with the host names with the largest numbers of failing transfers are presented in a dedicated dashboard for the CMS experiment (access to the dashboard requires login with CERN SSO). The clusters can be used by the operation teams to quickly identify anomalies in user activities, to tackle site issues related to the backlog of data transfers, and in the future to implement automatic recovery procedures for the most common error types.
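As a rough illustration of this processing chain, the following Python sketch vectorises a few already cleaned and tokenised messages with word2vec and groups them with DBSCAN. The use of gensim, the averaging of word vectors into a per-message vector, the toy messages and all parameter values are assumptions made for the sake of the example, not the production configuration.

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import DBSCAN

# Toy corpus of already cleaned and tokenised error messages (illustrative only)
messages = [
    ["TRANSFER", "ERROR", "Copy", "failed", "with", "mode", "pull"],
    ["TRANSFER", "ERROR", "Copy", "failed", "with", "mode", "push"],
    ["CHECKSUM", "MISMATCH", "source", "and", "destination", "checksums", "differ"],
]

# Train a small word2vec model on the tokenised messages
w2v = Word2Vec(sentences=messages, vector_size=50, window=5, min_count=1, epochs=50)

# Represent each message as the average of its word vectors (one simple choice)
def message_vector(tokens):
    return np.mean([w2v.wv[t] for t in tokens], axis=0)

X = np.array([message_vector(m) for m in messages])

# Unsupervised clustering; eps and min_samples need tuning on real data
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(X)
print(labels)  # one cluster label per message, -1 marks noise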

...

The input file can be found in zipped form in the GitHub repo (message_example.zip), or read in from MinIO.
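A minimal sketch of reading that input, assuming the archive contains a single plain-text file with one error message per line; the layout inside message_example.zip and the MinIO endpoint, credentials, bucket and object names in the comments are placeholders, not the real values:

import zipfile

# Read the example messages shipped with the repo; the archive layout
# (one text file, one message per line) is an assumption
with zipfile.ZipFile("message_example.zip") as zf:
    with zf.open(zf.namelist()[0]) as f:
        messages = f.read().decode("utf-8").splitlines()

print(len(messages), "error messages loaded")

# Alternatively, fetch the same object from MinIO (placeholder values):
# from minio import Minio
# client = Minio("minio.example.cern.ch", access_key="...", secret_key="...")
# client.fget_object("fts-logs", "message_example.zip", "message_example.zip")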

Annotated Description

The error message analysis, and in general any text analysis, can be divided into two main phases: pre-processing, which consists of Data preparation and Tokenization, and processing, which is specific to the analysis. Before going deeper into the description of each phase, we have to define what we mean by similarity. This is a key concept for our purpose, since it is the metric that drives the clustering. In our approach we map words to numeric vectors, so that the similarity of x and y can be expressed, for instance, by the cosine of the angle between the corresponding vectors.
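A minimal sketch of that similarity measure, assuming the vectors have already been produced by the vectorisation step described later:

import numpy as np

def cosine_similarity(x, y):
    # Cosine of the angle between two vectors: close to 1 for vectors
    # pointing in the same direction, 0 for orthogonal (unrelated) ones
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 0.5, 0.0])
y = np.array([0.9, 0.6, 0.1])
print(cosine_similarity(x, y))  # close to 1, i.e. similar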

Pre-processing Phase

Let us begin by showing two error messages as examples:

TRANSFER ERROR: Copy failed with mode 3rd push, with error: Transfer failed: failure: problem sending data: 
java.security.cert.CertificateException: The peers certificate with subjects DN CN=pps05.lcg.triumf.ca, OU=triumf.ca, O=Grid, C=CA was rejected.
The peers certificate status is: FAILED The following validation errors were found:;error at position 0 in chain, problematic certificate
subject: CN=pps05.lcg.triumf.ca, OU=triumf.ca, O=Grid,C=CA (category: CRL): Signature of a CRL corresponding to this certificates CA is invalid
TRANSFER ERROR: Copy failed with mode 3rd pull, with error: copy 0) Could not get the delegation id: Could
not get proxy request: Error 404 fault: SOAP-ENV:Server [no subcode] HTTP/1.1 404 Not Found Detail:
<!DOCTYPE HTML PUBLIC -//IETF//DTD HTML 2.0//EN> <html><head> <title>404 Not Found</title> </head><body>
<h1>Not Found</h1> <p>The requested URL /gridsite-delegation was not found on this server.</p>
</body></html> \n.

As we can see, the error messages contain "particular information": words between tags, URLs, site names and usernames, to mention a few. The first step, Data preparation, therefore aims to reduce the variety of error messages by cleaning them. The cleaning is performed with regular expressions (regex).
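A minimal sketch of such a cleaning function; the patterns below (HTML tags, URLs, file paths, host names, numbers, punctuation) are only indicative of the kind of regex used, not the exact set applied in the pipeline:

import re

def clean_message(msg):
    # Strip the "particular information" so that messages differing only
    # in these details collapse onto the same pattern
    msg = re.sub(r"<[^>]+>", " ", msg)                      # HTML/XML tags
    msg = re.sub(r"https?://\S+", " ", msg)                 # URLs
    msg = re.sub(r"(/[\w.\-]+)+", " ", msg)                 # file paths
    msg = re.sub(r"\b[\w\-]+(\.[\w\-]+){2,}\b", " ", msg)   # host names
    msg = re.sub(r"\d+", " ", msg)                          # bare numbers
    msg = re.sub(r"[^\w\s\-]", " ", msg)                    # selected punctuation
    return re.sub(r"\s+", " ", msg).strip()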

After the cleaning, which also removes a selected set of punctuation, messages are broken into their constituent words (tokens). This is the Tokenization phase; a code sketch follows the examples below.


After Data Preparation:

TRANSFER ERROR Copy failed with mode pull with error copy Could not get the delegation id Could not get proxy request Error fault SOAP-ENVServer no subcode HTTP Not Found Detail .

After Tokenization:

TRANSFER, ERROR, Copy, failed, with, mode, pull, with, error, copy, Could, not, get, the, delegation, id, Could,
not, get, proxy, request, Error, fault, SOAP-ENVServer, no, subcode, HTTP, Not, Found, Detail, .
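
The tokenisation itself can be as simple as splitting the cleaned message on whitespace, as in this sketch (the actual tokeniser may differ):

def tokenize(cleaned_msg):
    # Break the cleaned message into its constituent words (tokens)
    return cleaned_msg.split()

cleaned = ("TRANSFER ERROR Copy failed with mode pull with error copy Could not "
           "get the delegation id Could not get proxy request Error fault "
           "SOAP-ENVServer no subcode HTTP Not Found Detail .")
print(tokenize(cleaned))
# ['TRANSFER', 'ERROR', 'Copy', 'failed', 'with', 'mode', 'pull', ...]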

Processing Phase



References

Attachments

message_example.zip

...