Machine Learning for Streaming Data


The world generates an astonishing amount of data every second of every day, and the volume continues to grow exponentially. It was estimated that by 2020, the total amount of data in the world would reach 44 zettabytes. By 2025, an estimated 463 exabytes of data are expected to be created every single day.

Businesses in all industries are moving from batch processing to real-time data streams (also known as streaming data) to make the most of this data. Applying machine learning to streaming data can help organizations with a wide range of applications. These include fraud detection from real-time financial transactions, real-time operations management (e.g., stock monitoring in the supply chain), or sentiment analysis over live social media trends on Facebook, Twitter, etc. 

However, traditional machine learning methods are a poor fit for streaming data, because they were designed to process data in batches rather than handle continuous streams in real time. Online machine learning takes a different approach: it feeds data to the model incrementally, so the model can learn from continuous streams as they arrive.

Why Offline Machine Learning Is Not Good for Streaming Data

Offline learning, also known as batch learning, is an approach in which the system cannot learn incrementally. It can only train the model on all available data at once, which takes up a lot of computing resources. The bigger issue, however, is the time this process takes.

Streaming data arrives continuously. If you want an offline learning system to learn from it, you must train a new version of the system from scratch on the old data plus the new data. Once the new version is ready, it replaces the older one.

Thanks to automated tools, steps like training, evaluating, and launching the ML system can run smoothly without manual work, which allows an offline learning system to adapt to new data.

However, retraining on the full dataset can take a lot of time, which makes this approach unsuitable for a number of use cases, such as forecasting. Suppose your model has to forecast stock prices, that is, predict the expected price of a stock at some point in the future. For this purpose, the model has to be retrained on the latest data as it arrives. This isn’t practical with offline machine learning, where every retraining run has to go through the complete dataset.
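To put a rough number on that retraining cost, the sketch below counts how many rows each approach has to touch as new batches arrive. It is a back-of-the-envelope illustration, not a benchmark: the function name `rows_processed` and the figures (100 batches of 1,000 rows) are made up for the example, and it assumes training cost grows with the number of rows read.

```python
from typing import Tuple


def rows_processed(num_batches: int, batch_size: int) -> Tuple[int, int]:
    """Return (offline_total, online_total) rows read after num_batches arrivals."""
    offline_total = 0
    for batches_so_far in range(1, num_batches + 1):
        # Offline: every retraining run re-reads all data collected so far.
        offline_total += batches_so_far * batch_size
    # Online: each batch is read once, at the moment it arrives.
    online_total = num_batches * batch_size
    return offline_total, online_total


offline, online = rows_processed(num_batches=100, batch_size=1_000)
print(f"offline retraining reads {offline:,} rows, online learning reads {online:,}")
```

Under these assumptions, the offline total grows quadratically with the number of batches while the online total grows linearly, which is why full retraining falls further behind the longer the stream runs.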

How Online Machine Learning Can Make a Difference

Online machine learning is an approach in which training occurs incrementally: the model is fed data continuously as it arrives from the source. Data from real-time streams is broken down into mini-batches, and each mini-batch is used to update the model.
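As a minimal sketch of what incremental training can look like, the example below updates a one-feature linear model with one gradient step per mini-batch. Everything here is an assumption made for illustration: the class `OnlineLinearRegressor`, the synthetic `stream` generator (which produces y = 3x + 1 plus noise), and the learning rate. The method name `partial_fit` borrows a naming convention used by some ML libraries, but this is hand-written code, not any library's API.

```python
import random
from typing import Iterator, List, Tuple

Batch = List[Tuple[float, float]]


class OnlineLinearRegressor:
    """One-feature linear model y ~ w*x + b, updated one mini-batch at a time."""

    def __init__(self, learning_rate: float = 0.05) -> None:
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: float) -> float:
        return self.w * x + self.b

    def partial_fit(self, batch: Batch) -> None:
        """Take a single gradient step on the mean squared error of this batch."""
        n = len(batch)
        grad_w = 2 * sum((self.predict(x) - y) * x for x, y in batch) / n
        grad_b = 2 * sum(self.predict(x) - y for x, y in batch) / n
        self.w -= self.learning_rate * grad_w
        self.b -= self.learning_rate * grad_b


def stream(num_batches: int, batch_size: int, seed: int = 0) -> Iterator[Batch]:
    """Hypothetical data source: yields mini-batches of (x, y) with y = 3x + 1 + noise."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            x = rng.uniform(-1, 1)
            batch.append((x, 3 * x + 1 + rng.gauss(0, 0.1)))
        yield batch


model = OnlineLinearRegressor()
for batch in stream(num_batches=2000, batch_size=16):
    model.partial_fit(batch)  # the model is usable for predictions after every step

print(f"learned w={model.w:.2f}, b={model.b:.2f}")
```

The model never sees the whole dataset at once. After each mini-batch it is slightly better and immediately available for predictions, which is the property that matters for streams.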

Save computing resources

Online learning is ideal if you have minimal computing resources and little space to store streaming data. Once an online learning system is done learning from a mini-batch, it can discard that data or move it to a storage medium. This can save a significant amount of cost and space. You don’t need powerful, high-end hardware to process streaming data, because only one mini-batch is held in memory at a time, unlike offline machine learning, where everything has to be processed at once. This means that you can even use an affordable piece of hardware like a Raspberry Pi to perform online machine learning.
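The memory argument can be sketched with a generator that hands over one mini-batch at a time and a summary object that is updated in place. The example uses Welford's algorithm to keep a running mean and variance. The names `RunningStats` and `mini_batches`, the batch size, and the synthetic `readings` source are all assumptions for illustration.

```python
from typing import Iterable, Iterator, List


class RunningStats:
    """Running mean and population variance (Welford's algorithm), constant memory."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        return self._m2 / self.count if self.count else 0.0


def mini_batches(source: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Group a (possibly endless) source into lists of at most batch_size items."""
    batch: List[float] = []
    for item in source:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, smaller batch


# Hypothetical sensor feed, generated lazily so it is never fully in memory.
readings = (float(i % 10) for i in range(100_000))

stats = RunningStats()
for batch in mini_batches(readings, batch_size=256):
    for value in batch:
        stats.update(value)
    # The batch is dropped here; only the three numbers inside stats persist.

print(stats.count, stats.mean, stats.variance)
```

Memory use is bounded by the batch size rather than the length of the stream, so the same loop works whether the source yields a thousand readings or never ends.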

With online learning, your model keeps learning as new data arrives, without forcing you to rebuild it from scratch. This is useful for dealing with streaming data in large volumes, especially when your computing resources are limited.

Adapt to concept drift

Online machine learning can also address concept drift, a known problem in machine learning. Here, a concept is the variable or quantity that a machine learning model is trying to predict.

Concept drift is the phenomenon in which the statistical properties of the target concept change over time, for example through a sudden change in the mean, the variance, or other characteristics of the data. In online machine learning, the model processes one mini-batch of data at a time and can be updated on the fly. Online learning does not stop the drift itself, but because new data continuously updates the model, the model can adapt to the drift instead of going stale.
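To make this concrete, here is a small simulation in which the relationship between input and target flips halfway through the stream, and a model updated on every record follows the change. It is a toy sketch under stated assumptions: the function `run_stream`, the slopes (2.0 before the drift, -2.0 after), the noise level, and the learning rate are all invented for the example.

```python
import random
from typing import Dict


def run_stream(num_steps: int = 4000, drift_at: int = 2000,
               learning_rate: float = 0.1, seed: int = 1) -> Dict[str, float]:
    """Train y ~ weight * x online and record the weight before and after a drift."""
    rng = random.Random(seed)
    weight = 0.0
    snapshots: Dict[str, float] = {}
    for step in range(num_steps):
        # The concept changes abruptly at drift_at.
        true_slope = 2.0 if step < drift_at else -2.0
        x = rng.uniform(-1, 1)
        y = true_slope * x + rng.gauss(0, 0.05)

        # One stochastic gradient step on the squared error of this record.
        error = weight * x - y
        weight -= learning_rate * 2 * error * x

        if step == drift_at - 1:
            snapshots["before_drift"] = weight
        elif step == num_steps - 1:
            snapshots["after_drift"] = weight
    return snapshots


result = run_stream()
print(f"weight before drift: {result['before_drift']:.2f}")
print(f"weight after drift:  {result['after_drift']:.2f}")
```

A model trained once on the first half of this stream would keep predicting with a slope near 2.0 and be badly wrong after the change. The online model ends up near -2.0 simply because it never stopped updating.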

Learning from high-volume data streams can help with applications such as forecasting, spam filtering, and recommender systems. For example, if a user buys multiple products (e.g., a winter coat and gloves) within the space of a few minutes on an e-commerce website, an online machine learning model can use this real-time information to recommend products that complement the purchase (e.g., a scarf).

Use cases for online machine learning

Online machine learning can be an ideal solution for the following scenarios:

  • When your data has no end and is effectively continuous
  • When your training data is sensitive due to privacy concerns and you can’t move it to an offline environment
  • When you can’t transfer training data to an offline environment due to device or network limitations
  • When the training dataset is too large to fit into the memory of a single machine at one time

Striim Can Operationalize Machine Learning for Your Data Streams

While working with machine learning for streaming data, companies want to ensure that they can continuously serve their model for real-time predictions. However, there can be some challenges to operationalizing this type of machine learning. 

Your model has to handle high volumes of data in real time. You have to account for data velocity, the speed at which data is generated, distributed, and collected. You also have to manage high data variety (e.g., video files, text data).

Striim can help you in this regard with the following: 

  • Event-driven data capture and processing (transformation) to train models in an incremental fashion
  • Out-of-the-box machine learning models such as Linear Regression, Polynomial Regression, Least Squares and others
  • Capture schema changes from source systems and specify how you want to handle data drift
  • Handle enterprise workloads, including large volumes of streaming data from several sources at once.
  • Perform filtering, enriching, and data preparation operations on streaming data. This can be useful to handle data variety and prepare data in a format that your users want. 
  • Provide data-driven insights and predictions by integrating your trained model with your organization’s real-time data streams. 
  • Track how data evolves and assess model performance. Based on this monitoring, Striim can initiate retraining of the model on its own, reducing human intervention.
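The monitoring idea in the last point can be sketched in a few lines. The example below is a generic illustration and not Striim's API: the class `ErrorMonitor`, its window size, and its threshold are hypothetical. It tracks the mean absolute error over a sliding window and signals when that error crosses the threshold, which is the point at which a pipeline could kick off retraining.

```python
from collections import deque


class ErrorMonitor:
    """Flags when the mean absolute error over a sliding window exceeds a threshold."""

    def __init__(self, window_size: int = 50, threshold: float = 1.0) -> None:
        self.errors = deque(maxlen=window_size)  # oldest errors fall off automatically
        self.threshold = threshold

    def observe(self, predicted: float, actual: float) -> bool:
        """Record one prediction and return True if retraining looks warranted."""
        self.errors.append(abs(predicted - actual))
        window_full = len(self.errors) == self.errors.maxlen
        mean_error = sum(self.errors) / len(self.errors)
        return window_full and mean_error > self.threshold


monitor = ErrorMonitor(window_size=50, threshold=1.0)
first_alert = None
for step in range(300):
    actual = 10.0 if step < 150 else 15.0  # the data shifts at step 150
    predicted = 10.0                       # a stale model that is never updated
    if monitor.observe(predicted, actual) and first_alert is None:
        first_alert = step

print(f"retraining would be triggered at step {first_alert}")
```

Waiting for a full window before alerting avoids reacting to a handful of noisy records. The window size trades off how quickly a real shift is caught against how often a false alarm fires.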

With these benefits, Striim can be the foundational platform that reliably feeds your streaming data to your models as training data. Learn more in part 1 and part 2 of our series on operationalizing machine learning.