Python‎ > ‎

Quiet log noise with Python and machine learning

posted May 8, 2020, 6:49 AM by Chris G   [ updated May 8, 2020, 6:52 AM ]

Quiet log noise with Python and machine learning

Logreduce saves debugging time by picking out anomalies from mountains of log data.

radio communication signals
Image by : 

Continuous integration (CI) jobs can generate massive volumes of data. When a job fails, figuring out what went wrong can be a tedious process that involves investigating logs to discover the root cause—which is often found in a fraction of the total job output. To make it easier to separate the most relevant data from the rest, the Logreduce machine learning model is trained using previous successful job runs to extract anomalies from failed runs' logs.

This principle can also be applied to other use cases, for example, extracting anomalies from Journald or other systemwide regular log files.

Using machine learning to reduce noise

A typical log file contains many nominal events ("baselines") along with a few exceptions that are relevant to the developer. Baselines may contain random elements such as timestamps or unique identifiers that are difficult to detect and remove. To remove the baseline events, we can use a k-nearest neighbors pattern recognition algorithm (k-NN).

Log events must be converted to numeric values for k-NN regression. Using the generic feature extraction tool HashingVectorizer enables the process to be applied to any type of log. It hashes each word and encodes each event in a sparse matrix. To further reduce the search space, tokenization removes known random words, such as dates or IP addresses.

Once the model is trained, the k-NN search tells us the distance of each new event from the baseline.

This Jupyter notebook demonstrates the process and graphs the sparse matrix vectors.

Introducing Logreduce

The Logreduce Python software transparently implements this process. Logreduce's initial goal was to assist with Zuul CI job failure analyses using the build database, and it is now integrated into the Software Factory development forge's job logs process.

At its simplest, Logreduce compares files or directories and removes lines that are similar. Logreduce builds a model for each source file and outputs any of the target's lines whose distances are above a defined threshold by using the following syntax: distance | filename:line-number: line-content.

$ logreduce diff /var/log/audit/audit.log.1 /var/log/audit/audit.log
INFO  logreduce.Classifier - Training took 21.982s at 0.364MB/(1.314kl/s) (8.000 MB - 28.884 kilo-lines)
0.244 | audit.log:19963:        type=USER_AUTH acct="root" exe="/usr/bin/su"
INFO  logreduce.Classifier - Testing took 18.297s at 0.306MB/(1.094kl/s) (5.607 MB - 20.015 kilo-lines)
99.99% reduction (from 20015 lines to 1

A more advanced Logreduce use can train a model offline to be reused. Many variants of the baselines can be used to fit the k-NN search tree.

$ logreduce dir-train audit.clf /var/log/audit/audit.log.*
INFO  logreduce.Classifier - Training took 80.883s at 0.396MB/(1.397kl/s) (32.001 MB - 112.977 kilo-lines)
DEBUG logreduce.Classifier - audit.clf: written
$ logreduce dir-run audit.clf /var/log/audit/audit.log

Logreduce also implements interfaces to discover baselines for Journald time ranges (days/weeks/months) and Zuul CI job build histories. It can also generate HTML reports that group anomalies found in multiple files in a simple interface.