Best Practices for Real-Time Data Movement and Stream Processing
What are some of the best practices for real-time data movement, and for the stream processing necessary to give the data real value?
Streaming integration is all about data movement. Between the continuous, real-time collection of data and its delivery to enterprise and cloud destinations, data has to move reliably and at scale. While it is moving, the data often has to undergo stream processing, through transformation and enrichment, to give it real value. There are architectural and technology decisions to make at every step of the way, not just at design time but also at run time.
So what are the best practices for real-time data movement? How do you create reliable and scalable data movement pipelines?
1. Take a Streaming-First approach
The first and most important decision is to take a streaming-first approach to integration. This means that at least the initial collection of all data should be continuous and real-time. Batch or micro-batch collection cannot attain real-time latencies, so it cannot guarantee that your data is always up to date. Technologies such as change data capture (CDC) and file tailing need to be adopted to ensure your data is always fresh.
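To make file tailing concrete, here is a minimal sketch in Python. It polls a file and returns only the lines appended since the last read, tracking position by byte offset. The helper name read_new_lines and the offset-based approach are assumptions made for this illustration, not a description of how any particular product does it. A production tailer would also have to handle file rotation and truncation.

```python
import os
import tempfile


def read_new_lines(path, offset=0):
    """Return (complete lines appended since `offset`, new offset).

    Hypothetical helper for illustration only; it ignores rotation and truncation.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    # Consume only up to the last newline, so a half-written line
    # is left in place and picked up whole on the next poll.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").splitlines()
    return lines, offset + end


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "ab") as f:
        f.write(b"order 1 created\norder 2 cre")
    lines, offset = read_new_lines(path)
    print(lines)  # only the complete first line is emitted
    with open(path, "ab") as f:
        f.write(b"ated\n")
    lines, offset = read_new_lines(path, offset)
    print(lines)  # the second line arrives once it is complete
    os.remove(path)
```

Calling the function in a loop with a short sleep between polls gives continuous collection. The point is that each new line is picked up shortly after it is written, rather than waiting for a batch window to close.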
2. Raw data is seldom useful for analytics
Source data often contains superfluous fields that need to be removed, and even whole records that make no sense for an analytics use case and need to be filtered out. The data can be reduced further through redundancy removal, summarization, and change detection. It may also be necessary to enrich the data to add value, for example by denormalizing IDs in database records (replacing an ID with the descriptive values it refers to), or by correlating data from multiple sources to create structure in a destination. To reduce storage costs in the target and optimize analytics, it is often much more efficient to perform these tasks in-memory at the point of ingestion than after the data has been stored.
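The steps described above (filtering records, removing fields, change detection, and enrichment) can be sketched as a chain of Python generators that process each event in memory as it arrives. All function names, field names, and the sample lookup table here are hypothetical and chosen only for illustration.

```python
def drop_fields(events, unwanted):
    """Remove superfluous fields from every event."""
    for event in events:
        yield {k: v for k, v in event.items() if k not in unwanted}


def only_changes(events, key, field):
    """Change detection: emit an event only when `field` differs
    from the last value seen for the same `key`."""
    last_seen = {}
    for event in events:
        k = event[key]
        if k not in last_seen or last_seen[k] != event[field]:
            last_seen[k] = event[field]
            yield event


def enrich(events, lookup, id_field, name_field):
    """Denormalize an ID by adding the name it refers to."""
    for event in events:
        yield {**event, name_field: lookup.get(event[id_field], "unknown")}


def build_pipeline(events, customers):
    # Filter out whole records that are irrelevant to analytics.
    wanted = (e for e in events if e.get("status") != "test")
    trimmed = drop_fields(wanted, {"debug"})
    changed = only_changes(trimmed, "customer_id", "status")
    return enrich(changed, customers, "customer_id", "customer_name")


if __name__ == "__main__":
    events = [
        {"customer_id": 1, "status": "open", "debug": "x"},
        {"customer_id": 1, "status": "open", "debug": "y"},    # no change, dropped
        {"customer_id": 2, "status": "test", "debug": "z"},    # filtered out
        {"customer_id": 1, "status": "closed", "debug": "w"},
    ]
    for event in build_pipeline(events, {1: "Acme"}):
        print(event)
```

Because generators are lazy, each event flows through every stage before the next one is read, so nothing is written to intermediate storage and only the reduced, enriched records reach the target.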
3. Moving data at scale, with low latency, requires minimal disk I/O
The whole point of doing real-time data movement and stream processing is to deal with huge volumes of data with very low latency. If you are writing to disk at each stage of a data flow, then you risk slowing down the whole architecture. This includes the use of intermediate topics on a persistent messaging system such as Kafka. These should be used sparingly, possibly just at the point of ingestion, with all other processing and movement happening in-memory.
4. Real-time streaming data can often be used for more than one purpose
To optimize data flows and minimize resource usage, it is important that this data is collected only once but can still be processed in different ways and delivered to multiple endpoints. Our customers often utilize a single streaming source for delivery into Kafka, Hadoop on-premises, and cloud storage, simultaneously and in real time.
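A minimal sketch of this collect-once, deliver-many pattern follows. The fan_out function is a hypothetical name, and the sinks are plain Python callables standing in for real writers (a Kafka producer, an HDFS writer, a cloud storage client), none of which are shown here.

```python
def fan_out(source, sinks):
    """Read each event from `source` exactly once and hand it to every sink.

    Returns the number of events read from the source.
    """
    count = 0
    for event in source:
        for sink in sinks:
            sink(event)
        count += 1
    return count


if __name__ == "__main__":
    # Stand-ins for real destinations; each sink may process the event differently.
    raw_archive = []
    order_ids = []

    def archive_sink(event):
        raw_archive.append(event)

    def id_sink(event):
        order_ids.append(event["id"])

    source = ({"id": i, "amount": 10 * i} for i in range(3))
    read = fan_out(source, [archive_sink, id_sink])
    print(read, raw_archive, order_ids)
```

This simple version delivers synchronously, so one slow or failing sink holds up the others. A real deployment would typically decouple the sinks with per-sink buffering and error handling.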
5. Custom coding should not be required
Building data pipelines and working with streaming data should not require custom coding. Piecing together multiple open source components and writing processing code requires teams of developers, reduces flexibility, and causes maintenance headaches. The Striim platform enables those who know the data, including business analysts and data scientists, to work with it directly using SQL-based queries, which speeds development while the platform handles scalability and reliability issues automatically.
6. Real-time processing operates continuously
Real-time data movement and stream processing applications need to operate continuously for years. Administrators of these solutions need to understand the status of their data pipelines and be alerted immediately to any issues. Continuous validation of data movement from source to target, coupled with real-time monitoring, can provide peace of mind. This monitoring can incorporate intelligence, looking for anomalies in data formats, volumes, or seasonal characteristics, to support reliable mission-critical data flows.
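As one simple illustration of volume monitoring, the sketch below flags an interval whose event count deviates sharply from a rolling window of recent counts. The VolumeMonitor class, the window size, and the three-standard-deviation threshold are all assumptions made for this example. It does not model seasonality or data formats, and it is not a description of how Striim implements monitoring.

```python
from collections import deque
from statistics import mean, pstdev


class VolumeMonitor:
    """Flag event counts that deviate sharply from the recent window (hypothetical)."""

    def __init__(self, window=5, threshold=3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, count):
        """Record `count` and return True if it looks anomalous."""
        anomalous = False
        # Only judge once a full window of history is available.
        if len(self.history) == self.history.maxlen:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma == 0:
                # Perfectly flat history: treat any change as anomalous.
                anomalous = count != mu
            else:
                anomalous = abs(count - mu) / sigma > self.threshold
        self.history.append(count)
        return anomalous


if __name__ == "__main__":
    monitor = VolumeMonitor(window=4)
    for per_minute in [100, 102, 98, 100, 101, 5]:
        if monitor.observe(per_minute):
            print("ALERT: unusual volume:", per_minute)
```

Note that this version adds anomalous counts to its history, which skews the baseline after an incident. A more careful monitor would exclude flagged values or compare against the same period on previous days to account for seasonal patterns.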
The Striim platform directly implements all of these best practices, and more, to ensure real-time data movement and stream processing best support hybrid cloud, real-time applications, and fast data use cases. To learn more about the benefits of real-time data movement using Striim, visit our Striim Platform Overview page, schedule a quick demo with a Striim technologist, or test drive Striim today to get your data where you want it, when you want it.
Editor’s note: This was originally published as a blog post here.