Streaming Data Integration to AWS


As businesses adopt Amazon Web Services, streaming data integration to AWS – with change data capture (CDC) and stream processing – becomes a necessary part of the solution.

You’ve already decided that you want to enable integration to AWS. This could be to Amazon RDS or Aurora, Amazon Redshift, Amazon S3, Amazon Kinesis, Amazon EMR, or any number of other technologies.

You may want to migrate existing applications to AWS, scale elastically as necessary, or use the cloud for analytics or machine learning, but running applications in AWS, as VMs or containers, is only part of the problem. You also need to consider how you move data to the cloud, ensure your applications or analytics are always up to date, and make sure the data is in the right format to be valuable.

The most important starting point is ensuring you can stream data to the cloud in real time. Batch data movement can cause unpredictable load on cloud targets, and has a high latency, meaning your data is often hours old. For modern applications, having up-to-the-second information is essential, for example to provide current customer information, accurate business reporting, or for real-time decision making.

Streaming data integration to AWS from on-premises systems requires making use of appropriate data collection technologies. For databases, this is change data capture, or CDC, which directly and continuously intercepts database activity and collects all the inserts, updates, and deletes as events, as they happen. Log data requires file tailing, which reads from the end of one or more files across potentially multiple machines and streams the latest records as they are written. Other sources, such as IoT devices or third-party SaaS applications, also require specific treatment to ensure data can be streamed in real time.
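To make the file-tailing idea concrete, here is a minimal Python sketch of the mechanism: track a byte offset in a log file and, on each poll, return only the lines appended since the last read. The `FileTailer` class and its method names are illustrative, not a real collector API; a production collector would also handle file rotation, multi-file globs, partial lines, and multiple machines.

```python
import os

class FileTailer:
    """Minimal file-tailing sketch: remember a byte offset and return the
    lines appended since the last poll. Starts at the current end of the
    file, so only records written after startup are streamed."""

    def __init__(self, path):
        self.path = path
        # Begin at the current end of file; existing content is skipped.
        self.offset = os.path.getsize(path)

    def poll(self):
        """Return new complete lines written since the previous poll."""
        with open(self.path, "r") as f:
            f.seek(self.offset)
            data = f.read()
            self.offset = f.tell()
        return [line for line in data.splitlines() if line]
```

In a real pipeline, each returned line would be emitted as an event into the stream rather than collected in a list.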

Once you have streaming data, the next consideration is what processing is necessary to make the data valuable for your specific AWS destination, and this depends on the use-case.

For database migration or elastic scalability use-cases, where the target schema is similar to the source, moving raw data from on-premises databases to Amazon RDS or Aurora may be sufficient. The important consideration here is that the source applications typically cannot be stopped, and an initial load takes time. This is why collecting and delivering database changes, during and after the initial load, is essential for zero-downtime migrations.
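The zero-downtime pattern described above can be sketched in a few lines of Python. This is a deliberately simplified model, not any vendor's actual implementation: the source and target are dicts standing in for tables, and the change log is a list of CDC events captured while the initial load runs.

```python
# Hypothetical sketch: capture changes *before* starting the initial load,
# then replay everything captured during (and after) the load so the target
# converges on the source's current state without stopping the source.

def migrate(source, target, change_log):
    """source/target: dicts standing in for tables.
    change_log: ordered (op, key, value) CDC events captured while the
    source kept serving writes during the load."""
    # Phase 1: initial load. The source stays online, so writes continue
    # and the snapshot is already slightly stale when it finishes.
    target.update(dict(source))
    # Phase 2: apply the captured changes in order to catch the target up.
    for op, key, value in change_log:
        if op in ("insert", "update"):
            target[key] = value
        elif op == "delete":
            target.pop(key, None)
    return target
```

The key property is ordering: because CDC events are applied in the sequence they occurred, the target ends up consistent with the source even though the source was never taken offline.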

For real-time applications sourcing from Amazon Kinesis, or analytics use-cases built on Amazon Redshift or Amazon EMR, it may be necessary to perform stream processing before the data is delivered to the cloud. This processing can transform the data structure, and enrich it with additional context information, while the data is in-flight, adding value to the data and optimizing downstream analytics.
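As an illustration of in-flight enrichment, the following Python sketch joins each raw event with cached reference data before delivery, so the downstream target receives denormalized, analytics-ready records. The customer lookup table and field names are invented for the example.

```python
# Hypothetical enrichment step: merge context from a cached reference
# table (e.g. customer details) into each event while it is in flight.

CUSTOMERS = {  # reference cache, loaded once and kept in memory
    "c1": {"name": "Acme Corp", "region": "us-east-1"},
}

def enrich(events, reference):
    """Yield each event merged with its matching reference record."""
    for event in events:
        context = reference.get(event["customer_id"], {})
        # Event fields win nothing here; context is simply added alongside.
        yield {**event, **context}
```

Doing this transformation before delivery means targets like Amazon Redshift receive data already shaped for analysis, avoiding expensive joins downstream.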

Striim’s streaming integration to AWS can continuously collect data from on-premises or other cloud databases and deliver it to all of your Amazon Web Services endpoints. Striim can take care of initial loads, as well as CDC for the continuous application of changes, and these data flows can be created rapidly, then monitored and validated continuously through our intuitive UI.

With Striim, your cloud migrations, scaling, and analytics can be built and iterated on at the speed of your business, ensuring your data is always where you want it, when you want it.

To learn more about streaming integration to AWS with Striim, visit our “Striim for Amazon Web Services” product page, schedule a demo with a Striim technologist, or download a free trial of the platform.
