In the first post of this two-part series, we demonstrated how to use Striim’s ability to collect, integrate, process, analyze and visualize large streaming data in real time. In this post, we will provide more details about the machine learning pipeline to forecast network traffic and detect anomalies in Striim.
First, Striim reads raw data from files and then processes and aggregates the raw data. After aggregation, the data volume is dramatically reduced. In this application, it aggregates total network traffic for all IPs per minute. Thus it only has one data point for each minute to train the machine learning models.
In the machine learning pipeline, it first preprocesses the data, then fits the data into the machine learning models for prediction. Finally, it detects anomalies if the actual value is far from the predicted one.
- Data Processing: Before fitting data into sophisticated models, it is usually necessary to transform and preprocess the data. Striim supports a collection of data preprocessing methods including standardization and normalization to make training less sensitive to the scale of features, power transformation (like square root, log) to stabilize volatile data, and seasonality decomposition like Loess or STL decomposition to remove the seasonality in time-series data, etc. In this application, we use log transformation to make data more stable and improve prediction accuracy.
- Automated Prediction: Striim first transforms time series regression to a standard regression problem with lagged variables (i.e. autoregression), and then applies any regression model to it for prediction. Striim has a plethora of machine learning models including linear and non-linear regression, SVM, gaussian process regression, random forest, Bayesian networks, deep learning models, etc. In this application, we use Random Forest for several reasons:
- It is efficient enough to retrain continuously over streaming data. Some models, like deep neural networks, are time-consuming and not suitable for online training.
- It uses the ensembling method which combines multiple ML models to obtain better performance.
- It achieves good predictive performance compared to other models according to our comprehensive empirical study.
- Anomaly Detection: Striim detects anomalies based on its prediction. The main idea is that anomalies tend to have a large difference between the actual value and predicted value. Striim uses a statistical-based threshold to find anomalies if the percentage error between actual and predicted values is larger than the threshold. In this application, Striim assumes percentage errors follow the normal distribution and defines the threshold accordingly. e.g. we can define the threshold as mean + 2 * standard deviation of percentage errors.
The models can update continuously on the streaming data. One can bound the training data size and control update frequency. In this application, we use the recent 200 data points as training data and update the ML models when it receives a new data point.
For more details, please see our 2019 DEBS paper: A Demonstration of Striim.
A Streaming Integration and Intelligence Platform. Striim won Best Demo Award in the 13th ACM International Conference on Distributed and Event‐based Systems (DEBS).