Cloud ETL with Reliability

Implementing Streaming Cloud ETL with Reliability

 

 

If you’re considering adopting a cloud-based data warehousing and analytics solution, the most important consideration is how to populate it with current on-premises and cloud data in real time using cloud ETL.

As noted in a recent study by IBM and Forrester, 88% of companies want to run near-real-time analytics on stored data. Streaming cloud ETL enables real-time analytics by loading data from operational systems in private and public clouds to your analytics cloud of choice.

A streaming cloud ETL solution enables you to continuously load data from your in-house or cloud-based mission-critical operational systems to your cloud-based analytical systems in a timely and dependable manner. Reliable and continuous data flow is essential to ensuring that you can trust the data you use to make important operational decisions.

In addition to high performance and scalability to handle your large data volumes, you need to look for data reliability and pipeline high availability. What should you look for in a streaming cloud ETL solution to be confident that it offers the high degree of reliability and availability your mission-critical applications demand?

For data reliability, the following two common issues with data pipelines should be avoided:

  1. Data loss. When data volumes increase and the pipelines are backed up, some solutions allow a portion of the data to be discarded. Also, when the CDC solution has limited data type support, it may silently skip columns that contain unsupported data types. If an outage or process interruption occurs, an incorrect recovery point can also lead to skipping some data.
  2. Duplicate data. After recovering from a process interruption, the system may create duplicate data. This issue becomes even more prevalent when processing the data with time windows after the CDC step.

How Striim Ensures Reliable and Highly Available Streaming Cloud ETL

Striim ingests, moves, processes, and delivers real-time data across heterogeneous and high-volume data environments. The software platform is built ground-up specifically to ensure reliability for streaming cloud ETL solutions with the following architectural capabilities.

Fault-tolerant architecture

Striim is designed with a built-in clustered environment with a distributed architecture to provide immediate failover. The metadata and clustering service watches for node failure, application failure, and failure of certain other services. If one node goes down, another node within the cluster immediately takes over without the need for users to do this manually or perform complex configurations.

Exactly Once Processing for zero data loss or duplicates

Striim’s advanced checkpointing capabilities ensure that no events are missed or processed twice while taking time window contents into account. It has been tested and certified for cloud solutions to offer real-time streaming to Microsoft Azure, AWS, and/or Google Cloud with event delivery guarantees.

During data ingest, checkpointing keeps track of all events the system processes and how far they got down various data pipelines. If something fails, Striim knows the last known good state and what position it needs to recover from. Advanced checkpointing is designed to eliminate loss in data windows when the system fails.

If you have a defined data window (say 5 minutes) and the system goes down, you cannot typically restart from where you left off because you will have lost the 5 minutes’ worth of data that was in the data window. That means your source and target will no longer be completely synchronized. Striim addresses this issue by coordinating with the data replay feature that many data sources provide to rewind sources to just the right spot if a failure occurs.

In cases where data sources don’t support data replay (for example, data coming from sensors), Striim’s persistent messaging stores and checkpoints data as it is ingested. Persistent messaging allows previously non-replayable sources to be replayed from a specific point. It also allows multiple flows from the same data source to maintain their own checkpoints. To offer exactly once processing, Striim checks to make sure input data has actually been written. As a result, the platform can checkpoint, confident in the knowledge that the data made it to the persistent queue.

End-to-end data flow management for simplified solution architecture and recovery

Striim’s streaming cloud ETL solution also delivers streamlined, end-to-end data integration between source and target systems that enhances reliability. The solution ingests data from the source in real time, performs all transformations such as masking, encryption, aggregation, and enrichment in memory as the data goes through the stream and then delivers the data to the target in a single network operation. All of these operations occur in one step without using a disk and deliver the streaming data to the target in sub-seconds. Because Striim does not require additional products, this simplified solution architecture enables a seamless recovery process and minimizes the risk of data loss or inaccurate processing.

In contrast, a data replication service without built-in stream processing requires data transformation to be performed in the target (or source), with an additional product and network hop. This minimum two-hop process introduces unnecessary data latency. It also complicates the solution architecture, exposing the customer to considerable recovery-related risks and requiring a great deal of effort for accurate data reconciliation after an outage.

For use cases where transactional integrity matters, such as migrating to a cloud database or continuously loading transactional data for a cloud-based business system, Striim also maintains ACID properties (atomicity, consistency, isolation, and durability) of database operations to preserve the transactional context.

The best way to choose a reliable streaming cloud ETL solution is to see it in action. Click here to request a customized demo for your specific environment.

 

Change Data Capture Methods

 

In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. Companies use change data capture for several use cases such as cloud adoption and enabling real-time data warehousing. There are multiple common change data capture methods that you can implement depending on your application requirements and tolerance for performance overhead.

  1. Introduction
  2. Audit Columns
  3. Table Deltas
  4. Triggers
  5. Log-based change data capture

 

Introduction

In high-velocity data environments where time-sensitive decisions are made, change data capture is an excellent fit to achieve low-latency, reliable, and scalable data integration. With over 80% of companies planning on implementing multi-cloud strategies by 2025, picking the right change data capture method for your business is more critical than ever given the need to replicate data across multiple environments.

The business transactions captured in relational databases are critical to understanding the state of business operations. Traditional batch-based approaches to move data once or several times a day introduce latency and reduce the operational value to the organization. Change Data Capture provides real-time or near real-time movement of data by moving and processing data continuously as new database events occur.

There are several change data capture methods to identify changes that need to be captured and moved. Here are the common methods, how they work, and their advantages as well as shortcomings.

Audit Columns

By using existing “LAST_UPDATED” or “DATE_MODIFIED” columns, or by adding one if not available in the application, you can create your own change data capture solution at the application level. This approach retrieves only the rows that have been changed since the data was last extracted.

The CDC logic for this technique would be as follows (a simplified SQL sketch appears after the list):

  1. Get the maximum value of both the target table’s ‘Created_Time’ and ‘Updated_Time’ columns.

  2. Select all the rows from the data source with a ‘Created_Time’ greater than (>) the target table’s maximum ‘Created_Time’, which are all the newly created rows since the last CDC process was executed.

  3. Select all rows from the source table that have an ‘Updated_Time’ greater than (>) the target table’s maximum ‘Updated_Time’ but a ‘Created_Time’ less than or equal to (<=) the target table’s maximum ‘Created_Time’. Rows with a later ‘Created_Time’ are excluded because they were already included in step 2.

  4. Insert new rows from step 2 and modify existing rows from step 3 in the target.
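
For illustration only, here is a minimal SQL sketch of these steps, assuming hypothetical source_orders and target_orders tables that both carry Created_Time and Updated_Time columns (names and exact syntax vary by database):

-- Steps 1 and 2: new rows created since the last CDC run
-- (the subquery supplies the target's maximum Created_Time)
SELECT s.*
FROM   source_orders s
WHERE  s.Created_Time > (SELECT MAX(t.Created_Time) FROM target_orders t);

-- Step 3: previously loaded rows that were updated since the last run;
-- rows already captured in step 2 are excluded by the Created_Time check
SELECT s.*
FROM   source_orders s
WHERE  s.Updated_Time > (SELECT MAX(t.Updated_Time) FROM target_orders t)
  AND  s.Created_Time <= (SELECT MAX(t.Created_Time) FROM target_orders t);

-- Step 4: insert the rows from step 2 and update the rows from step 3
-- in target_orders (for example with INSERT and UPDATE, or a MERGE).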

Pros of this method

  • It can be built with native application logic
  • It doesn’t require any external tooling

Cons of this method

  • Adds additional overhead to the database
  • DML statements such as deletes will not be propagated to the target without additional scripts to track deletes
  • Error prone and likely to cause issues with data consistency

This approach also requires CPU resources to scan the tables for the changed data and maintenance resources to ensure that the DATE_MODIFIED column is applied reliably across all source tables.

Table Deltas

You can use table delta or ‘tablediff’ utilities to compare the data in two tables for non-convergence. Then you can use additional scripts to apply the deltas from the source table to the target as another approach to change data capture. There are several examples of SQL scripts that can find the differences between two tables.
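
As a rough sketch of the idea, standard set operators can compute the deltas between two snapshots, assuming hypothetical source_orders and target_orders tables with identical structures (EXCEPT is spelled MINUS in Oracle):

-- Rows present in the source but missing or different in the target
-- (new or updated rows to apply)
SELECT * FROM source_orders
EXCEPT
SELECT * FROM target_orders;

-- Rows present in the target but no longer in the source
-- (candidates for deletion)
SELECT * FROM target_orders
EXCEPT
SELECT * FROM source_orders;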

Advantages of this approach:

  • It provides an accurate view of changed data while only using native SQL scripts

Disadvantages of this approach:

  • Demand for storage significantly increases because you need three copies of the data sources that are being used in this technique: the original data, previous snapshot, and current snapshot
  • It does not scale well in applications with heavy transactional workloads

Although this works better for managing deleted rows, the CPU resources required to identify the differences are significant, and the overhead increases linearly with the volume of data. The diff method also introduces latency and cannot be performed in real time. Some log-based change data capture tools come with the ability to compare source and target tables to ensure replication consistency.

Triggers

Another method for building change data capture at the application level is defining triggers and creating your own change log in shadow tables. Triggers fire before or after INSERT, UPDATE, or DELETE commands (that indicate a change) and are used to create a change log. Because triggers operate at the SQL level, some users prefer this approach, and some databases even have native support for triggers. A simplified example is shown below.
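
The sketch below illustrates the pattern using MySQL-style syntax and hypothetical orders and orders_changelog tables; UPDATE and DELETE would need their own triggers, and trigger syntax differs across databases:

-- Shadow table that records each change event
CREATE TABLE orders_changelog (
  change_id  BIGINT AUTO_INCREMENT PRIMARY KEY,
  operation  VARCHAR(10),
  order_id   BIGINT,
  changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Write one changelog row for every insert on the operational table
CREATE TRIGGER orders_after_insert
AFTER INSERT ON orders
FOR EACH ROW
  INSERT INTO orders_changelog (operation, order_id)
  VALUES ('INSERT', NEW.order_id);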

However, triggers are required for each table in the source database, and there is greater overhead associated with running triggers on operational tables while the changes are being made. In addition to having a significant impact on the performance of the application, maintaining the triggers as the application changes leads to a management burden.

Advantages of this approach:

  • Shadow tables can provide an immutable, detailed log of all transactions
  • Directly supported in the SQL API for some databases

Disadvantage of this approach:

  • Significantly reduces the performance of the database by requiring multiple writes to a database every time a row is inserted, updated, or deleted

Many application users do not want to risk changing the application’s behavior by introducing triggers to operational tables. DBAs and data architects should always heavily test the performance of any triggers added into their environment and decide if they can tolerate the additional overhead.

Log-Based Change Data Capture

Databases contain transaction (sometimes called redo) logs that store all database events allowing for the database to be recovered in the event of a crash. With log-based change data capture, new database transactions – including inserts, updates, and deletes – are read from source databases’ native transaction or redo logs.

The changes are captured without making application level changes and without having to scan operational tables, both of which add additional workload and reduce source systems’ performance.
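
The main source-side requirement is usually configuration rather than code: the database must write enough detail into its log for the CDC tool to read. As one example, Oracle-based log readers typically need supplemental logging enabled (MySQL-based readers typically need the binary log in ROW format); the statements below show the Oracle side, with a placeholder orders table:

-- Enable minimal supplemental logging at the database level
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

-- Log all columns of a specific table so change events carry full row images
ALTER TABLE orders ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;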

Advantages of this approach

  • Minimal impact on production database system – no additional queries required for each transaction
  • Can maintain ACID reliability across multiple systems
  • No requirement to change the production database system’s schemas or the need to add additional tables

Challenges of this approach

  • Parsing the internal logging format of a database is complex – most databases do not document the format nor do they announce changes to it in new releases. This would potentially require you to change your database log parsing logic with each new database release.
  • Requires a system to manage the metadata for database change events
  • Additional log levels required to produce scannable redo logs can add marginal performance overhead

Data integration platforms that natively perform change data capture can handle the complexity mentioned above by automatically mining the database change logs while managing additional metadata to ensure the replication between two or more systems is reliable.

If you would like to see how low-impact, real-time log-based change data capture works, or to talk to one of our CDC experts, you can schedule a demo of the Striim platform.

Oracle Change Data Capture (CDC)

Oracle Change Data Capture Tutorial – An Event-Driven Architecture for Cloud Adoption

 

 

All businesses rely on data. Historically, this data resided in monolithic databases, and batch ETL processes were used to move that data to warehouses and other data stores for reporting and analytics purposes. As businesses modernize, looking to the cloud for analytics, and striving for real-time data insights, they often find that these databases are difficult to completely replace, yet the data and transactions happening within them are essential for analytics. With over 80% of businesses noting that the volume and velocity of their data is rapidly increasing, scalable cloud adoption and change data capture from databases like Oracle, SQL Server, MySQL, and others is more critical than ever before. Oracle change data capture is specifically one area where companies are seeing an influx of modern data integration use cases.

To address this, more and more companies are moving to event-driven architectures, whose dynamic, distributed scalability makes it possible to share large volumes of data across systems.

In this post we will look at an example which replaces batch ETL by event-driven distributed stream processing: Oracle change data capture events are extracted as they are created; enriched with in-memory, SQL-based denormalization; then delivered to the Azure Cloud to provide scalable, real-time, low-cost analytics, without affecting the source database. We will also look at using the enriched events, optionally backed by Kafka, to incrementally add other event-driven applications or services.

Continuous Data Collection, Processing, Delivery, and Analytics with the Striim Platform

Event-Driven Architecture Patterns

Most business data is produced as a sequence of events, or an event stream: for example, web or mobile app interactions, devices, sensors, bank transactions, all continuously generate events. Even the current state of a database is the outcome of a sequence of events. Treating state as the result of a sequence of events forms the core of several event-driven patterns.

Event Sourcing is an architectural pattern in which the state of the application is determined by a sequence of events. As an example, imagine that each “event” is an incremental update to an entry in a database. In this case, the state of a particular entry is simply the accumulation of events pertaining to that entry. In the example below the stream contains the queue of all deposit and withdrawal events, and the database table persists the current account balances.

Imagine Each Event as a Change to an Entry in a Database
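
To make the event sourcing idea concrete, here is a minimal SQL sketch, assuming a hypothetical account_events table that records every deposit and withdrawal as a signed amount; the current balances are derived entirely from the event history:

-- Every deposit or withdrawal is one immutable event row;
-- the current balance is derived by replaying (summing) the events
SELECT account_id,
       SUM(amount) AS current_balance
FROM   account_events
GROUP BY account_id;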

The events in the stream can be used to reconstruct the current account balances in the database, but not the other way around. Databases can be replicated with a technology called Change Data Capture (CDC), which collects the changes being applied to a source database, as soon as they occur by monitoring its change log, turns them into a stream of events, then applies those changes to a target database. Source code version control is another well known example of this, where the current state of a file is some base version, plus the accumulation of all changes that have been made to it.

The Change Log can be used to Replicate a Database

What if you need to have the same set of data for different databases, for different types of use? With a stream, the same message can be processed by different consumers for different purposes. As shown below, the stream can act as a distribution point, where, following the polyglot persistence pattern, events can be delivered to a variety of data stores, each using the most suited technology for a particular use case or materialized view.

Streaming Events Delivered to a Variety of Data Stores

Event-Driven Streaming ETL Use Case Example

Below is a diagram of the Event-Driven Streaming ETL use case example:

Event-Driven Streaming ETL Use Case Diagram
  1. Striim’s low-impact, real-time Oracle change data capture (CDC) feature is used to stream database changes (inserts, updates and deletes) from an Operational Oracle database into Striim
  2. CDC Events are enriched and denormalized with Streaming SQL and Cached data, in order to make relevant data available together
  3. Enriched, denormalized events are streamed to CosmosDB for real-time analytics
  4. Enriched streaming events can be monitored in real time with the Striim Web UI, and are available for further Streaming SQL analysis, wizard-based dashboards, and other applications on-premise or in the cloud.

Replacing Batch Extract with Real Time Streaming of CDC Order Events

Striim’s easy-to-use CDC wizards automate the creation of applications that leverage change data capture, to stream events as they are created, from various source systems to various targets. In this example, shown below, we use Striim’s OracleReader (Oracle Change Data Capture) to read the Order OLTP transactions in Oracle redo logs and stream these insert, update, delete operations, as soon as the transactions commit, into Striim, without impacting the performance of the source database.

Configuring Database Properties for the Oracle CDC Data Source

Utilizing Caches For Enrichment

Relational Databases typically have a normalized schema which makes storage efficient, but causes joins for queries, and does not scale well horizontally. NoSQL databases typically have a denormalized schema which scales across a cluster because data that is read together is stored together.

Normalized Schema with Joins for Queries Does Not Scale Horizontally

With a normalized schema, a lot of the data fields will be in the form of IDs. This is very efficient for the database, but IDs carry little meaning or context for downstream queries or analytics. In this example we want to enrich the raw Orders data with reference data from the SalesRep table, correlated by the Order Sales_Rep_ID, to produce a denormalized record including the Sales Rep Name and Email information, making analysis easier by having this data available together.

Since the Striim platform is a high-speed, low latency, SQL-based stream processing platform, reference data also needs to be loaded into memory so that it can be joined with the streaming data without slowing things down. This is achieved through the use of the Cache component. Within the Striim platform, caches are backed by a distributed in-memory data grid that can contain millions of reference items distributed around a Striim cluster. Caches can be loaded from database queries, Hadoop, or files, and maintain data in-memory so that joining with them can be very fast. In this example, shown below, the cache is loaded with a query on the SalesRep table using the Striim DatabaseReader.

Configuring Database Properties for the Sales Rep Cache

Joining Streaming and Cache Data For Real Time Transforming and Enrichment With SQL

We can process and enrich data-in-motion using continuous queries written in Striim’s SQL-based stream processing language. Using a SQL-based language is intuitive for data processing tasks, and most common SQL constructs can be utilized in a streaming environment. The main differences between using SQL for stream processing, and its more traditional use as a database query language, are that all processing is in-memory, and data is processed continuously, such that every event on an input data stream to a query can result in an output.

Dataflow Showing Joining and Enrichment of CDC data with Cache

This is the query we will use to process and enrich the incoming data stream:

Full Transformation and Enrichment Query Joining the CDC Stream with Cache Data

In this query we select the Order stream and SalesRep cache fields that we want, apply transformations to convert data types, put the Order stream and SalesRep cache in the FROM clause, and include a join on SALES_REP_ID as part of the WHERE clause. The result of this query is to continuously output enriched (denormalized) events, shown below, for every CDC event that occurs for the Orders table. So with this approach we can join streams from an Oracle Change Data Capture reader with cached data for enrichment.
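
Since the query itself appears only as an image above, here is a hypothetical sketch of what such a continuous query could look like in Striim’s SQL-based language; the stream, cache, and field names are illustrative, and a real application would use the names defined in its own flow (type conversion functions would be applied in the SELECT list as needed):

CREATE CQ EnrichOrdersCQ
INSERT INTO EnrichedOrderStream
SELECT o.ORDER_ID,
       o.ORDER_TOTAL,
       o.ORDER_DATE,
       s.SALES_REP_NAME,
       s.SALES_REP_EMAIL
FROM   OrderStream o, SalesRepCache s
WHERE  o.SALES_REP_ID = s.SALES_REP_ID;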

Events After Transformation and Enrichment

Loading the Enriched Data to the Cloud for Real Time Analytics

Now the Oracle CDC (Oracle change data capture) data, streamed and enriched through Striim, can be stored simultaneously in Azure Cloud blob storage and Azure Cosmos DB, for elastic storage with advanced big data analytics, using the Striim AzureBlobWriter and the CosmosDBWriter shown below.

The image below shows the Striim flow web UI for our streaming ETL application. Flows define what data an application receives, how it processes the data, and what it does with the results.

End-to-End Data Flow

Using Kafka for Streaming Replay and Application Decoupling

The enriched stream of order events can be backed by or published to Kafka for stream persistence, laying the foundation for streaming replay and application decoupling. Striim’s native Integration with Apache Kafka makes it quick and easy to leverage Kafka to make every data source re-playable, enabling recovery even for streaming sources that cannot be rewound. This also acts to decouple applications, enabling multiple applications to be powered by the same data source, and for new applications, caches or views to be added later.

Streaming SQL for Aggregates

We can further use Striim’s streaming SQL on the denormalized data to make a real-time stream of summary metrics about the events being processed available to Striim Real-Time Dashboards and other applications. For example, to create a running count and sum of orders per SalesRep in the last hour, from the stream of enriched orders, you would use a window and the familiar GROUP BY clause, as shown below.

CREATE WINDOW OrderWindow
OVER EnrichCQ
KEEP WITHIN 1 HOUR
PARTITION BY sales_rep_id

SELECT sales_rep_id, sales_rep_Name,
COUNT(*) as orderCount,
SUM(order_total) as totalAmount
FROM OrderWindow
GROUP BY sales_rep_id

Monitoring

With the Striim Monitoring Web UI we can now monitor our data pipeline with real-time information for the cluster, application components, servers, and agents. The Main monitor page allows you to visualize summary statistics for Events Processed, App CPU%, Server Memory, or Server CPU%. Below, the Monitor App page displays our App Resources, Performance, and Components.

Striim Monitoring Web UI Monitor App Page

Clicking on an app component ‘more details’ button will display more detailed performance information such as CPU and Event rate as shown below:

Striim Monitoring Web UI Monitor App Component Details Page

Summary

In this blog post, we discussed how we can use Striim to:

  1. Perform Oracle Change Data Capture to stream database changes in real time
  2. Use streaming SQL and caches to easily denormalize data in order to make relevant data available together
  3. Load streaming enriched data to the cloud for real-time analytics
  4. Use Kafka for persistent streams
  5. Create rolling aggregates with streaming SQL
  6. Continuously monitor data pipelines

Additional Resources:

To read more about real-time data ingestion, please visit our Real-Time Data Integration solutions page.

To learn more about the power of streaming SQL, visit Striim Platform Overview product page, schedule a demo with a Striim technologist, or download a free trial of the platform and try it for yourself!

To learn more about Striim’s capabilities to support the data integration requirements for an Azure hybrid cloud architecture check out all of Striim’s solutions for Azure.

Striim 3.10.1 Further Speeds Cloud Adoption

 

 

We are pleased to announce the general availability of Striim 3.10.1 that includes support for new and enhanced Cloud targets, extends manageability and diagnostics capabilities, and introduces new ease of use features to speed our customers’ cloud adoption. Key Features released in Striim 3.10.1 are directly available through Snowflake Partner Connect to enable rapid movement of enterprise data into Snowflake.

Striim 3.10.1 Focus Areas Including Cloud Adoption

This new release introduces many new features and capabilities, summarized here:

3.10.1 Features Summary

 

Let’s review the key themes and features of this new release, starting with the new and expanded cloud targets.

Striim on Snowflake Partner Connect

From Snowflake Partner Connect, customers can launch a trial Striim Cloud instance directly as part of the Snowflake on-boarding process from the Snowflake UI and load data, optionally with change data capture, directly into Snowflake from any of our supported sources. You can read about this in a separate blog.

Expanded Support for Cloud Targets to Further Enhance Cloud Adoption

The Striim platform has been chosen as a standard for our customers’ cloud adoption use-cases partly because of the wide range of cloud targets it supports. Striim provides integration with databases, data warehouses, storage, messaging systems and other technologies across all three major cloud environments.

A major enhancement is the introduction of support for the Google BigQuery Streaming API. This not only enables real-time analytics on large scale data in BigQuery by ensuring that data is available within seconds of its creation, but it also helps with quota issues that can be faced by high volume customers. The integration through the BigQuery streaming API can support data transfer up to 1GB per second.

In addition to this, Striim 3.10.1 also has the following enhancements:

  • Optimized delivery to Snowflake and Azure Synapse that facilitates compacting multiple operations on the same data to a single operation on the target, resulting in much lower change volume
  • Delivery to MongoDB cloud and MongoDB API for Azure Cosmos DB
  • Delivery to Apache Cassandra, DataStax Cassandra, and Cassandra API for Azure Cosmos DB
  • Support for delivery of data in Parquet format to Cloud Storage and Cloud Data Lakes to further support cloud analytics environments

Schema Conversion to Simplify Cloud Adoption Workflows

As part of many cloud migration or cloud integration use-cases, especially during the initial phases, developers often need to create target schemas to match those of source data. Striim adds the capability to use source schema information from popular databases such as Oracle, SQL Server, and PostgreSQL and create appropriate target schema in cloud targets such as Google BigQuery, Snowflake and others. Importantly, these conversions understand data type and structure differences between heterogeneous sources and targets and act intelligently to spot problems and inconsistencies before progressing to data movement, simplifying cloud adoption.

Enhanced Monitoring, Alerting and Diagnostics

On-going data movement between on-premises and cloud environments, whether for migrations or for powering reporting and analytics solutions, is often part of an enterprise’s critical applications. As such, it demands deep insights into the status of all active data flows.

Striim 3.10.1 adds the capability to inherently monitor data from its creation in the source to successful delivery in a target, generate detailed lag reports, and alert on situations where lag is outside of SLAs.

End to End Lag Visualization

In addition, this release provides detailed status on checkpointing information for recovery and high availability scenarios, with insight into checkpointing history and currency.

Real-time Checkpointing Information

Simplified Working with Complex Data

As customers work with heterogeneous environments and adopt more complex integration scenarios, they often have to work with complex data types, or perform necessary data conversions. While always possible through user-defined functions, this release adds multiple commonly requested data manipulation functions out of the box. This simplifies working with JSON data and document structures, while also facilitating data cleansing and regular expression operations.

On-Going Support for Enterprise Sources

As customers upgrade their environments, or adopt new technologies, it is essential that their integration platform keeps pace. In Striim 3.10.1 we extend our support for the Oracle database to include Oracle 19c, including change data capture, add support for schema information and metadata for Oracle GoldenGate trails, and certify our support for Hive 3.1.0.

This is a high-level view of the new features of Striim 3.10.1. There is a lot more to discover to aid you on your cloud adoption journey. If you would like to learn more about the new release, please reach out to schedule a demo with a Striim expert.

Getting Started with Real-Time ETL to Azure SQL Database

 

 

Running production databases in the cloud has become the new norm. For us at Striim, real-time ETL to Azure SQL Database and other popular cloud databases has become a common use case. Striim customers run critical operational workloads in cloud databases and rely on our enterprise-grade streaming data pipelines to keep their cloud databases up-to-date with existing on-premises or cloud data sources.

Striim supports your cloud journey starting with the first step. In addition to powering fully-connected hybrid and multi-cloud architectures, the streaming data integration platform enables cloud adoption by minimizing risks and downtime during data migration. When you can migrate your data to the cloud without database downtime or data loss, it is easier to modernize your mission-critical systems. And when you liberate your data trapped in legacy databases and stream to Azure SQL DB in sub-seconds, you can run high-value, operational workloads in the cloud and drive business transformation faster.

Streaming Integration from Oracle to Azure SQL DB

Building continuous, streaming data pipelines from on-premises databases to production cloud databases for critical workloads requires a secure, scalable, and reliable integration solution. Especially if you have enterprise database sources that cannot tolerate performance degradation, traditional batch ETL will not suffice. Striim’s low-impact change data capture (CDC) feature minimizes overhead on the source systems while moving database operations (inserts, updates, and deletes) to Azure SQL DB in real time with security, reliability, and transactional integrity.

Striim is available as a PaaS offering in major cloud marketplaces such as Microsoft Azure Cloud, AWS, and Google Cloud. You can run Striim in the Azure Cloud to simplify real-time ETL to Azure SQL Database and other Azure targets, such as Azure Synapse Analytics, Azure Cosmos DB, Event Hubs, ADLS, and more. The service includes heterogeneous data ingestion, enrichment, and transformation in a single solution before delivering the data to Azure services with sub-second latency. What users love about Striim is that it offers a non-intrusive, quick-to-deploy, and easy-to-iterate solution for streaming data integration into Azure.

To illustrate the ease of use of Striim and to help you get started with your cloud database integration project, we have prepared a Tech Guide: Getting Started with Real-Time Data Integration to Microsoft Azure SQL Database. You will find step-by-step instructions on how to move data from an on-premises Oracle Database to Azure SQL Database using Striim’s PaaS offering available in the Azure Marketplace. In this tutorial you will see how Striim’s log-based CDC enables a solution that doesn’t impact your source Oracle Database’s performance.

If you have, or plan to have, Azure SQL Databases that run operational workloads, I highly recommend that you use a free trial of Striim along with this tutorial to find out how fast you can set up enterprise-grade, real-time ETL to Azure SQL Database. On our website you can find additional tutorials for different cloud databases. So be sure to check out our other resources as well. For any streaming integration questions, please feel free to reach out.

 

Cloud Adoption: How Streaming Integration Minimizes Risks

 

 

Last week, we hosted a live webinar, Cloud Adoption: How Streaming Integration Minimizes Risks. In just 35 minutes, we discussed how to eliminate database downtime and minimize other risks of cloud migration and ongoing integration for hybrid cloud architecture, including a live demo of Striim’s solution.

Our first speaker, Steve Wilkes, started the presentation by discussing the importance of cloud adoption for today’s pandemic-impacted, fragile business environment. He continued with the common risks of cloud data migration and how streaming data integration with low-impact change data capture minimizes both downtime and risks. Our second presenter, Edward Bell, gave us a live demonstration of Striim for zero downtime data migration. In this blog post, you can find my short recap of the key areas of the presentation. This summary certainly cannot do justice to the comprehensive discussion we had at the webinar. That’s why I highly recommend you watch the full webinar on-demand to access details on the solution architecture, its comparison to the batch ETL approach, customer examples, the live demonstration, and the interactive Q&A section.

Cloud adoption brings multiple challenges and risks that prevent many businesses from modernizing their business-critical systems.

Limited cloud adoption and modernization reduces the ability to optimize business operations. These challenges and risks include causing downtime and business disruption, and losing data during the migration, both of which are simply not acceptable for critical business systems. The risk list, however, is longer than these two. Switching over to the cloud without adequate testing, which can lead to failures, working with stale data in the cloud, and data security and privacy are also among the key concerns.

Steve emphasized the point that “rushing the testing of the new environment to reduce the downtime, if you cannot continually feed data, can also lead to failures down the line or problems with the application.” Later, he added that “Beyond the migration, how do you continually feed the system? Especially in integration use cases where you are maintaining the data where it was and also delivering somewhere else, you need to continuously refresh the data to prevent staleness.”

Each of the risks mentioned above is preventable with the right approach to data movement between the legacy and new cloud systems.

 

Streaming data integration plays a critical role in successful cloud adoption with minimized risks.

A reliable, secure, and scalable streaming data integration architecture with low-impact change data capture enables zero database downtime and zero data loss during data migration. Because the source system is not interrupted, you can test the new cloud system as long as you need before the switchover. You also have the option to failback to the legacy system after switchover by reversing the data flow and keeping the old system up-to-date with the cloud system until you are fully confident that it is stable.


Striim’s cloud data migration solution uses this modern approach. During the bulk load, Striim’s CDC component collects the source database changes in real time. As soon as the initial load is complete, Striim applies the changes to the target environment to maintain consistency between the legacy and cloud databases. With built-in exactly once processing (E1P), Striim avoids both data loss and duplicates. You also have the ability to use Striim’s real-time dashboards to monitor the data flow and various detailed performance metrics.

Continuous, streaming data integration for hybrid cloud architecture liberates your data for modernization and business transformation.

Cloud adoption and streaming integration are not limited to the lifting and shifting of your systems to the cloud. Ongoing integration post-migration is a crucial part of planning your cloud adoption. You cannot restrict it to database sources and database targets in the cloud, either. Your data lives in various systems and needs to be shared with different endpoints, such as your storage, data lake, or messaging systems in the cloud environment. Without enabling comprehensive and timely data flow from your enterprise systems to the cloud, what you can achieve in the cloud will be very limited.

“It is all about liberating your data.” Steve added in this part of the presentation. “Making it useful for the purpose you need it for. Continuous delivery in the correct format from a variety of sources relies on being able to filter that data, transform it, and possibly aggregate, join and enrich before you deliver to where needed. All of these can be done in Striim with a SQL-based language.”

A key point both Edward and Steve made is that Striim is very flexible. You can source from multiple sources and send to multiple targets. True data liberation and modernizing your data infrastructure needs that flexibility.

Striim also provides deployment flexibility. In fact, this was a question in the Q&A part, asking about deployment options and pricing. Unfortunately we could not answer all the questions we received. The short answer is: Striim can be deployed in the cloud, on-premises, or both via a hybrid topology. It is priced based on the CPUs of the servers where the Striim platform is installed. So you don’t need to worry about the sizes of your source and target systems.

There is much more covered in this short webinar we hosted on cloud adoption. I invite you to watch it on-demand at your convenience. If you would like to get a customized demo for cloud adoption or other streaming data integration use cases, please feel free to reach out.

Mitigating Data Migration and Integration Risks for Hybrid Cloud Architecture

 

Cloud computing has transformed how businesses use technology and drive innovation for improved outcomes. However, the journey to the cloud, which includes data migration from legacy systems, and integration of cloud solutions with existing systems, is not a trivial task. There are multiple cloud adoption risks that businesses need to mitigate to achieve the cloud’s full potential.

 

Common Risks in Data Migration and Integration to Cloud Environments

In addition to data security and privacy, there are additional concerns and risks in cloud migration and integration. These include:

Downtime: The bulk data loading technique, which takes a snapshot of the source database, requires you to lock the legacy database to preserve the consistent state. This translates to downtime and business disruption for your end users. While this disruption can be acceptable for some of your business systems, the mission-critical ones that need modernization are typically the ones that cannot tolerate even planned downtime. And sometimes, planned downtime extends beyond the expected duration, turning into unplanned downtime with detrimental effects on your business.

Data loss: Some data migration tools might lose or corrupt data in transit because of a process failure or network outage. Or they may fail to apply the data to the target system in the right transactional order. As a result, your cloud database ends up diverging from the legacy system, also negatively impacting your business operations.

Inadequate Testing: Many migration projects operate under tense time pressures to minimize downtime, which can lead to a rushed testing phase. When the new environment is not tested thoroughly, the end result can be an unstable cloud environment. Certainly, not the desired outcome when your goal is to take your business systems to the next level.

Stale Data: Many migration solutions focus on the “lift and shift” of existing systems to the cloud. While it is a critical part of cloud adoption, your journey does not end there. Having a reliable and secure data integration solution that keeps your cloud systems up-to-date with existing data sources is critical to maintaining your hybrid cloud or multi-cloud architecture. Working with outdated technologies can lead to stale data in the cloud and create delays, errors, and other inefficiencies for your operational workloads.

 

Upcoming Webinar on the Role of Streaming Data Integration for Data Migration and Integration to Cloud

Streaming data integration is a new approach to data integration that addresses the multifaceted challenges of cloud adoption. By combining bulk loading with real-time change data capture technologies, it minimizes downtime and risks mentioned above and enables reliable and continuous data flow after the migration.

Striim - Data Migration to Cloud

In our next live, interactive webinar, we dive into this particular topic: Cloud Adoption: How Streaming Data Integration Minimizes Risks. Our Co-Founder and CTO, Steve Wilkes, will present the practical ways you can mitigate data migration risks and handle integration challenges for cloud environments. Striim’s Solution Architect, Edward Bell, will walk you through a live demo of zero downtime data migration and continuous streaming integration to major cloud platforms, such as AWS, Azure, and Google Cloud.

I hope you can join this live, practical presentation on Thursday, May 7th 10:00 AM PT / 1:00 PM ET to learn more about how to:

  • Reduce migration downtime and data loss risks, as well as allow unlimited testing time of the new cloud environment.
  • Set up streaming data pipelines in just minutes to reliably support operational workloads in the cloud.
  • Handle strict security, reliability, and scalability requirements of your mission-critical systems with an enterprise-grade streaming data integration platform.

Until we see you at the webinar, and afterward, please feel free to reach out to get a customized Striim demo for data migration and integration to cloud to support your specific IT environment.

 

Top 4 Highlights from Our Streaming Data and Analytics Webinar with GigaOm

 

 

On April 9, 2020, Striim’s co-founder and CTO Steve Wilkes joined GigaOm analyst Andrew Brust in an interview-style webinar on “Streaming Data: The Nexus of Cloud Modernized Analytics.” Over the course of the hour, the two talked about the evolution of data integration needs, what defines streaming data integration, capturing transactional data through change data capture (CDC), comparative approaches for data integration, where companies typically start with streaming data, use case examples, how it supports cloud initiatives, providing a foundation for operational intelligence, and even its role in AI/ML advancements.

GigaOm and Striim Webinar Speakers

While we can’t cover it all in one blog post, here is a “top 4” list of our favorite things highlighted during the webinar, and we invite you to view the entire on-demand event by watching it online.

 

#1: “Today, People Expect to Have Up-to-the-Second Information” — Steve Wilkes

Andrew asked Steve to do a bit of “wayback machine” to trace how we arrived at the need for streaming, real-time data. “Twenty years ago, most data was created by humans working on applications with data stored in databases, and you’d use ETL to move and store the data in batches into a data warehouse. It was OK to see data hours or even days later, and everyone did that,” said Steve. But fast-forward to our daily lives today and how we get immediate updates on things like Twitter feeds, news alerts, instant messaging with friends, and expectations have changed.

“So the business world needs to work the same way, and this does drive competitive pressures,” he continued. “If you’re not having this view into your operations and what your customers need, someone else will and they can push you out of business.”

Related to this, Andrew said later in the webinar: “We have new modes of thinking. But using older modes of technology, we’re going to run into issues.”

GigaOm: Old vs New Approaches to Data Movement

#2: Cloud Adoption Driving the Need for Streaming Data

As Steve noted, there’s been a significant shift from all on-premises systems to cloud-based environments, but there is still the need to get data into the cloud in order to get use from it.

Steve shared with Andrew that what Striim sees across its global customer base in terms of adoption is that the majority have a first goal of building the ability to stream their data, and then using it to power the analytics.

“Initial use cases are often zero-downtime data migrations to cloud or feeding a cloud-based data warehouse…. Once they’ve stream-enabled a lot of their sources, they will start to think about what analytics they can promote to real time and where they can get value out of that,” said Steve.

 

#3: A Range of Business Use Cases

Throughout the webinar, Andrew mentioned a few possible use cases, particularly in the context of the global pandemic being faced. “There’s nothing more frustrating, especially in these times of lockdown, when it says something is in stock and then you go to confirm the purchase and it says it’s out of stock … or you find out later.”

From Steve: “That real immediacy into what customers are doing, need, and want is key to what streaming data can do.”

Another example Andrew used illustrated the need for operational intelligence using real-time data. He referenced his home state of New York as it faces the coronavirus pandemic, where the real-time sharing of data about medical supplies and personnel data across the state’s hospitals could improve decisions to best allocate and redistribute those assets.

Shifting to the analytics side, Steve described operational intelligence as being able to change what you know about your operations and the decisions you make, based on current information. He gave the example of being able to track down critical devices, such as wheelchairs, in settings such as airports and hospitals.

The two also discussed how streaming data fits with AI/ML, where Steve commented how streaming data can be used to get data ready and processed for AI models to improve efficiency and performance.

 

#4: Status of Streaming Data

Andrew polled attendees with the question of where they are today with having streaming data in their organization.

GigaOm Poll: Use of Streaming Data In Your Organization

At least half of the attendees said they are using streaming data at least occasionally, which suggests that streaming data integration will continue to grow in popularity and ubiquity. Another 25% are currently evaluating streaming data technology.

Andrew asked Steve for his thoughts on the 15% who felt they don’t have a need for streaming data. As Steve commented: “A lot of organizations have a perception of what a real-time application is and the categories of use cases they are good for. But if you are moving applications to the cloud and they are business-critical, if you can’t turn them off for a few days, how do you do that without turning them off when data is still changing? There’s a need for real-time streaming data there.”

As you can see, the two covered a lot of ground — and so much more during this interactive webinar event. It is available to watch on demand at your convenience, so please check it out. We thank GigaOm and Andrew Brust for hosting this engaging program.

Also, you can learn more about the topic of streaming integration in a new 100+ page book published by O’Reilly Media and co-authored by Steve Wilkes, who was the speaker of this webinar. Download your free PDF copy today.

 

MySQL to Google BigQuery using CDC

Tutorial: Migrating from MySQL to BigQuery for Real-Time Data Analytics

 

 

In this post, we will walk through an example of how to replicate and synchronize your data from on-premises MySQL to BigQuery using change data capture (CDC).

Data warehouses have traditionally been on-premises services that required data to be transferred using batch load methods. Ingesting, storing, and manipulating data with cloud data services like Google BigQuery makes the whole process easier and more cost effective, provided that you can get your data in efficiently.

The Striim real-time data integration platform allows you to move data in real time as changes are being recorded, using a technology called change data capture. This allows you to build real-time analytics and machine learning capabilities from your on-premises datasets with minimal impact.

Source MySQL Database

Before you set up the Striim platform to synchronize your data from MySQL to BigQuery, let’s take a look at the source database and prepare the corresponding database structure in BigQuery. For this example, I am using a local MySQL database with a simple purchases table to simulate a financial datastore that we want to ingest from MySQL to BigQuery for analytics and reporting.
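
For reference, a purchases table for this kind of walkthrough might be defined with MySQL DDL along the following lines; the columns are hypothetical, and the tutorial works with whatever structure your actual source table has:

-- Hypothetical source table used throughout this walkthrough
CREATE TABLE purchases (
  purchase_id  BIGINT PRIMARY KEY,
  customer_id  BIGINT,
  amount       DECIMAL(10, 2),
  purchased_at DATETIME
);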

I’ve loaded a number of initial records into this table and have a script to apply additional records once Striim has been configured to show how it picks up the changes automatically in real time.

Targeting Google BigQuery

You also need to make sure your instance of BigQuery has been set up to mirror the source or the on-premises data structure. There are a few ways to do this, but because you are using a small table structure, you are going to set this up using the Google Cloud Console interface. Open the Google Cloud Console, and select a project, or create a new one. You can now select BigQuery from the available cloud services. Create a new dataset to hold the incoming data from the MySQL database.

Once the dataset has been created, you also need to create a table structure. Striim can perform transformations while the data flows through the synchronization process. However, to make things a little easier here, I have replicated the same structure as the on-premises data source.
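
If you prefer DDL to the Console UI, an equivalent table can also be created with BigQuery Standard SQL; the striim_demo dataset name and the columns below are hypothetical and simply mirror the MySQL sketch shown earlier:

-- Hypothetical BigQuery target table mirroring the MySQL source structure
CREATE TABLE striim_demo.purchases (
  purchase_id  INT64,
  customer_id  INT64,
  amount       NUMERIC,
  purchased_at DATETIME
);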

You will also need a service account to allow your Striim application to access BigQuery. Open the service account option through the IAM window in the Google Cloud Console and create a new service account. Give the necessary permissions for the service account by assigning BigQuery Owner and Admin roles and download the service account key to a JSON file.

Set Up the Striim Application

Now you have your data in a table in the on-premises MySQL database and have a corresponding empty table with the same fields in BigQuery. Let’s now set up a Striim application on Google Cloud Platform for the migration service.

Open your Google Cloud Console and open or start a new project. Go to the marketplace and search for Striim. A number of options should return, but the option you are after is the first item that allows integration of real-time data to Google Cloud services.

Select this option and start the deployment process. For this tutorial, you are just using the defaults for the Striim server. In production, you would need to size appropriately depending on your load.

Click the deploy button at the bottom of this screen and start the deployment process.

Once this deployment has finished, the details of the server and the Striim application will be generated.

Before you open the admin site, you will need to add a few files to the Striim Virtual Machine. Open the SSH console to the machine and copy the JSON file with the service account key to a location Striim can access. I used /opt/striim/conf/servicekey.json.


Give this file the right permissions by running the following commands:

chown striim:striim <filename>

chmod 770 <filename>

You also need to restart the Striim services for this to take effect. The easiest way to do this is to restart the VM.

Once this is done, close the shell and click on the Visit The Site button to open the Striim admin portal.

Before you can use Striim, you will need to configure some basic details. Register your details and enter the cluster name (I used “DemoCluster”) and password, as well as an admin password. Leave the license field blank to get a trial license if you don’t have one, then wait for the installation to finish.

 

When you get to the home screen for Striim, you will see three options. Let’s start by creating an app to connect your on-premises database with BigQuery to perform the initial load of data. To create this application, you will need to start from scratch from the applications area. Give your application a name and you will be presented with a blank canvas.

The first step is to read data from MySQL, so drag a database reader from the sources tab on the left. Double-click on the database reader to set the connection string with a JDBC-style URL using the template:

jdbc:mysql://<server_ip>:<port>/<database>
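
For example, a local MySQL instance hosting the purchases data from this tutorial might use a URL such as jdbc:mysql://127.0.0.1:3306/purchases_db, where the host, port, and database name are placeholders for your own environment.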

You must also specify the tables to synchronize — for this example, purchases — as this allows you to restrict what is synchronized.

Finally, create a new output. I called mine PurchasesDataStream.

You also need to connect your BigQuery instance to your source. Drag a BigQuery writer from the targets tab on the left. Double-click on the writer and select the input stream from the previous step and specify the location of the service account key. Finally, map the source and target tables together using the form:

<source-database>.<source-table>,<target-database>.<target-table>
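
For example, with the hypothetical names used in this walkthrough, the mapping might read purchases_db.purchases,striim_demo.purchases; substitute your own source database, BigQuery dataset, and table names.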

For this use case this is just a single table on each side.

Once both the source and target connectors have been configured, deploy and start the application to begin the initial load process. Once the application is deployed and running, you can use the monitor menu option on the top left of the screen to watch the progress.

Because this example contains a small data load, the initial load application finishes pretty quickly. You can now stop this initial load application and move on to the synchronization.

Updating BigQuery with Change Data Capture

Striim has pushed your current database up into BigQuery, but ideally you want to update this every time the on-premises database changes. This is where the change data capture application comes into play.

Go back to the applications screen in Striim and create a new application from a template. Find and select the MySQL CDC to BigQuery option.

 

As with the first application, you need to configure the details for your on-premises MySQL source. Use the same basic settings as before; this time, however, the wizard adds the JDBC portion of the connection URL for you.

When you click Next, Striim verifies that it can connect to the local source and retrieves all the tables from it. Select the tables you want to sync; for this example, it’s just the purchases table.

Once the local tables are mapped, you need to connect to the BigQuery target. Again, you can use the same settings as before by specifying the same service key JSON file, table mapping, and GCP Project ID.

Once the setup of the application is complete, you can deploy and turn on the synchronization application. This will monitor the on-premises database for any changes, then synchronize them into BigQuery.

Let’s see this in action by clicking on the monitor button again and loading some data into your on-premises database. As the data loads, you will see the transactions being processed by Striim.
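For example, you could insert a row into the purchases table from a MySQL client; the column names below are purely illustrative, so substitute your own schema:

INSERT INTO purchases (purchase_id, customer_id, amount, purchased_at) VALUES (1001, 42, 19.99, NOW());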

Next Steps

As you can see, Striim makes it easy for you to synchronize your on-premises data from existing databases, such as MySQL, to BigQuery. By constantly moving your data into BigQuery, you could now start building analytics or machine learning models on top, all with minimal impact to your current systems. You could also start ingesting and normalizing more datasets with Striim to fully take advantage of your data when combined with the power of BigQuery.

To learn more about Striim for Google BigQuery, check out the related product page. Striim is not limited to MySQL to BigQuery integration, and supports many different sources and targets. To see how Striim can help with your move to cloud-based services, schedule a demo with a Striim technologist or download a free trial of the platform.

Advancement of Data Movement Technologies

Advancement of Data Movement Technologies: Whiteboard Wednesdays

 

In this Whiteboard Wednesday video, Irem Radzik, Head of Product Marketing at Striim, looks at how data movement technologies have evolved in response to changing user demands. Read on, or watch the 8-minute video:

Today we’re going to talk about the advancement of data movement technologies. We’re going to look at the ETL technologies that we started seeing in the ‘90s, then the CDC (Change Data Capture)/Logical Replication solutions that we started seeing a couple of decades ago, and then the streaming data integration solutions that we more commonly see today.

ETL

Let’s look at ETL technologies. ETL is known for its batch extract, then bringing the data into the transformation step in the middle-tier server, and then loading the target in bulk again, typically for next-day reporting. You end up having high latency with these types of solutions. That was good enough for the ‘90s, but then we started demanding fresher data for operational decision making. Latency became an issue with ETL solutions.

Data Movement - ETL

The other issue with ETL was the batch-window dependency. Because of the high impact on the production sources, there had to be a dedicated time for these batch extracts when the main users wouldn’t be able to use the production database. The batch window that was available for data extract became shorter and shorter as business demanded continuous access to the OLTP system.

The data volumes increased at the same time. You ended up not having enough time to move all the data you needed. That became a pain point for ETL users, driving them to look into other solutions.

Change Data Capture/Logical Replication

Change Data Capture/Logical Replication solutions addressed several of the key concerns that ETL had. Change Data Capture basically means that you continuously capture new transactions happening in the source database and deliver them to the target in real time.

Data Movement - CDC / Logical Replication

That obviously helps with the data latency problem. You end up having real-time, up-to-date data in the target for your operational decision making. The other plus of CDC is the minimal source impact.

When it’s using logs (database logs) to capture the data, it has negligible impact. The source production system is available for transaction users. There is no batch window needed and no limitations for how much time you have to extract and move the data.

The CDC/Logical Replication solutions handle some of the key concerns of ETL users. They are made more for the E and L steps. What ends up happening with these solutions is that you need to do transformations within the database or with another tool, in order to complete the transformation step for end users.

The transformation happening there creates an ELT architecture and requires another product, another step, another network hop in your architecture, which complicates the process.

When there’s an outage, when there is a process disruption, reconciling your data and recovering becomes more complicated. That’s the shortcoming CDC users have been facing. These solutions were mainly made for databases.

Once the cloud and big data solutions became popular, the CDC providers had to come up with new products for cloud and big data targets. These are add-ons, not part of the main platform.

Another shortcoming that we’ve seen with CDC/Logical Replication solutions is their single node architecture, which translates into a single point of failure. This is a shortcoming, especially for mission-critical systems that need continuous availability of the data integration processes.

Streaming Data Integration

In recent years, streaming data integration came about to address the issues that CDC/Logical Replication products raised. It is becoming increasingly common. With streaming data integration, you’re not limited to just database sources.

Data Movement - Streaming Data Integration

You can have your files, log data, your machine data, your system log files for example, all moving in a real-time fashion. Your cloud sources, your service bus or your messaging systems can be your source. Your sensor data can be moved in real time, in a streaming fashion to multiple targets. Again, not limited to just databases.

You can have cloud databases or other cloud services as your target. You can, in addition to databases, have messaging systems as your target, on-premises or in cloud, your big data solutions, on-premises or cloud. You can also deliver in file format.

Everything is like it was in a logical replication solution. It is continuous, in real time, and Change Data Capture is still a big component of the streaming data integration.

It’s built on top of the Change Data Capture technologies and brings additional data sources and additional data targets. Another important difference, and handling one of the challenges of logical replication, is the transformation piece. As we discussed, a transformation needs to happen and where it happens makes a big difference.

With streaming data integration, it’s happening in-flight. While the data is moving, you can have stream processing without adding more latency to your data. While the data is moving, it can be filtered, it can be aggregated, it can be masked and encrypted, and enriched with reference data, all in flight before it’s delivered to your target, so that it’s available in a consumable format. This streamlines your architecture, simplifies it, and makes all the recovery steps easier. It’s also delivering the data in the format that your users need.

Another important thing to highlight is the distributed architecture. This natively clustered environment helps with a single point of failure risk. When one node fails, the other one takes over immediately, so you have a highly available data pipeline. This distributed clustered environment also helps you to scale out very easily, add more servers as you have more data to process and move.

These solutions now come with a monitoring component. The real-time monitoring of the pipelines gives you an understanding of what’s happening with your integration flows. If there is an issue, such as high data latency or a process problem, you get immediate alerts so you can trust that everything is running.

Data reliability is critical, and so is the reliability of the whole pipeline. To make sure that there is no data loss or duplicates, data delivery validation can be included in some of these solutions. With the right solution, you can also make sure that everything is processed exactly once, and that you are not repeating or dropping data. There are checkpointing mechanisms to be able to do that.

As you see, the new streaming data integration solutions handle some of the challenges that we have seen in the past with outdated data movement technologies. To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

 

Evaluating Streaming Data Integration Platforms

Evaluating Streaming Data Integration Platforms: Whiteboard Wednesdays

 

 

In today’s Whiteboard Wednesday video, Steve Wilkes, founder and CTO of Striim, looks at what you need to consider when evaluating streaming data integration platforms. Read on, or watch the 15-minute video:

We’ve already gone through what the components of a streaming integration platform are. Today we’re going to talk about how you go about evaluating streaming data integration platforms based on these components.

Just to reiterate, you need the platform to be able to:

  • Do real-time continuous data collection
  • Move that data continuously from where it’s collected to where it’s going
  • Support delivery to all the different targets that you care about
  • Process the data as it’s moving, so stream processing
  • This all needs to be enterprise grade so that it is scalable and reliable, and all those other things that you care about for mission-critical data
  • Get insights and alerts on that data movement

Let’s think about the things that you need to consider in order to actually achieve this when you’re evaluating such platforms.

Data Collection & Delivery

For data collection and delivery, you care about quite a few different things. Firstly, it needs to be low latency. If it’s a streaming data integration platform, then just doing bulk loads or micro batch may not be sufficient. You want to be able to collect the data the instant it’s created, within milliseconds typically. You need low-latency data collection.

Evaluating Streaming Integration Platforms - Data Collection

It needs to be able to support all the sources that you care about. If you’re looking for a streaming integration platform, then you’re thinking of more than just one use case. You’re thinking “what platform is going to support all of the streaming data integration needs within my organization?” Supporting just one data source or a couple of data sources isn’t enough.

You need to be able to support all the sources that you care about now and may care about in the future. That could be databases, files, or messaging systems. It could even be IoT. So think about that when you’re evaluating whether the platform has all the sources that you need. Think about how it can deal with those sources in a number of different ways.

For databases, you may need to be able to do bulk loads into a streaming infrastructure, as well as doing Change Data Capture. This is important for collecting real-time change as it’s happening in a database, the inserts, updates, and deletes. For files, you may need to do bulk files, files that exist already, but also files as they’re created, streaming out the data as it’s being written. Supporting both bulk and change data is equally important.

You also need to consider whether the adapters are actually part of the platform or are they third party. If they are part of the platform and the platform is built well, then it means that they will be able to handle all the different requirements of the platform – scalability, reliability, and recoverability. All of those things are integrated end to end because the adapters are part of the platform.

If they’re third party, then that may not be the case. If you have to plug in third party components into your infrastructure, then you can have areas of brittleness where things may not work properly or problematic interfaces when things change. Try to avoid third party adapters wherever you can.

Data collection and data delivery need to be able to support the end to end recovery and reliability that is part of being enterprise grade. That means that from a database perspective, for example, you may need to be able to support maintaining a database transaction context from one end to the other. You need to be able to pick up from where you left off and make sure that data that is collected is delivered to all of the appropriate targets. These could be variable and different.

You might be delivering some data on-premise and some data to the cloud, but you still need to be able to make sure that all the data has made it there. You need to be able to validate that the data is being written to all the different sources and targets that the platform is supporting.

If it’s part of a platform and they’re not third party, you would expect that to be there. If they are third party, then you have to investigate whether all of those things are supported. Data collection and data delivery is the first part of how you evaluate the platform.

Data Movement

The next part is how does it do data movement? This is crucial to maintaining the kind of high throughput and low latency that you’d expect. Data movement is a number of different things. It’s between processing steps. Between your source collection and your data delivery.

Between source collection, maybe some in-memory processing or maybe some enrichment, and data delivery. Or it could be an even more complex pipeline with multiple steps in it. You’re moving data between each step.

It’s also between nodes. If you have a clustered platform and that platform is moving data between nodes for different processing steps, or maybe between source and target because the target is closer to one of the nodes than other nodes. You need to be able to ensure that the data movement happens efficiently, with high throughput and low latency, between nodes.

You also need to be able to support collecting data on-premise and delivering it into cloud environments, or collecting it from cloud environments and delivering it to on-premise, or moving between clouds. Supporting all these different topologies is all part of data movement.

Ideally as much of the data movement as possible should be in memory only. Try to avoid having to write to disk or do any kind of IO in between processing steps. The reason for this is that each processing step needs to perform optimally in order to get high throughput.

If you are persisting data, that can add latency. Ideally when you’re doing multiple processing steps in a pipeline, you’re doing all of that data movement in memory only, between the steps or just between nodes. You’re not persisting to disk.

You should only use persistent data movement or persistent data streams where needed. There are a couple of really good use cases for this. One is if you have data sources that you can’t rewind into for recoverability, you may want to use a persistent data stream as the first step in the process, but everything downstream can be in memory only.

If you’re collecting data in real time, but you have multiple applications all running at their own speeds against that data, you may want to think about having persistent data streams between different steps. Typically, you want to minimize the amount of persistent data streams that you have and use in-memory only data streams wherever possible. That will really aid in reducing your latency and increasing your throughput.

Stream Processing

The next thing that you need to be able to do is stream processing. Stream processing obviously has to be able to support all of the different types of processing that you want to do. For example, it needs to be able to support complex transformations. If it doesn’t support the transformations that you want, you should be able to add in your own components or your own user defined functions to do the transformations.

It needs to be able to combine and enrich data. This requires a lot of different constructs for stream processing. When you are combining data together from multiple data streams, they run at high speed and typically events aren’t going to happen at the same time.

You need a flexible windowing structure that can maintain a set of events from different data streams to combine together, in order to produce a combined output stream that joins the current event from one stream with the most recent data from each of the others.

When you’re enriching data, you need to be able to join streaming data with reference data. You can’t go back to a database or go back to the original source of the reference data for every event on a data stream. It’s just too slow. You need to be able to load, cache, and remember the data you are using for enrichment in memory so you can join it really efficiently, in order to keep and maintain the throughput that you’re looking for from the overall system.

You want the stream processing to be optimized. It should really run as fast as if you’d written it yourself manually. It also needs to be easy to use. We recommend that you look for SQL-based stream processing because SQL is the language of data. There are very few people that work with data that don’t understand SQL. It allows you to do filtering, transformation, and data enrichment through natural SQL constructs.
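As a rough sketch of the idea (the stream, cache, and column names here are hypothetical, and the exact SQL dialect varies by platform), a SQL-style continuous query that filters a stream and enriches it from an in-memory reference cache might look like this:

SELECT o.order_id, o.amount, c.customer_name, c.region
FROM OrderStream o
JOIN CustomerCache c ON o.customer_id = c.customer_id
WHERE o.amount > 100;

Because the reference data is cached in memory, the enrichment join runs per event without a round trip to the original database.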

Obviously if you want to do more complex things, you should also be allowed to import your own transformations and work with those. For SQL-based transformations, it enables anyone that knows data to be able to build and understand what the transformations are. You also want building pipelines to be as easily accessible as possible to all the people that want to work with the data.

You need to have a good UI for building the data pipelines and have as much of the process as possible automated through wizards and other UI based assistance. You need to be able to build multi-step stream processing, not just a single source into single target or a single source into single piece of processing into single target. Potentially with fan in and fan out. Multiple data sources coming in, going into multiple processing steps in a staged environment, where they go step by step by step, to potentially multiple targets coming out at the other end.

This all needs to be coordinated, well-maintained, and deployable across a cluster in order to be scalable. Your stream processing should be very rich, very capable, and also very high throughput.

Enterprise Grade

You also need to think about the enterprise-grade qualities of the platform. I’ve mentioned before, for it to be enterprise grade it needs to be scalable. You need to be able to handle increasing the throughput, increasing the number of sources, increasing the number of targets, and increasing the volume of data being generated from each one of those.

When you’re evaluating platforms and evaluating for a production scenario, you should test the platform with a reasonable throughput that corresponds to what you’re expecting in order to see how it behaves and how it scales, and measure the throughput and the latency from end to end as you’re evaluating the platform.

You also need it to be reliable. You need to be able to ensure that you have guaranteed delivery from source all the way to target. Even if something fails, if a network fails, if the source or the target goes down, if any of the processing nodes in the cluster go down or the whole cluster goes down, you need to be able to ensure that it picks up from where it left off and doesn’t miss any messages.

It has to be able to recover from failures as well. Guaranteed delivery in the normal “I’m always running” case so you don’t miss any messages, just because they disappeared into the ether somewhere. But also, that if you have a failure, you should recover and not lose any messages, not lose any events that come from the source into the target.

Of course, security is also paramount. You need to secure the data while it’s moving in transit, so it’s encrypted as it goes across the network. But you also need to secure who has access to the data, who can work with individual data streams, who can see the data on individual data streams, who can build applications, and who can view the results of those applications.

You need security that works across the whole end to end and deals with every single component, so that you can secure them and lock them down and make sure that only the people that need to work with data, can.

Insights & Alerts

Finally, you need to make sure that the platform gives you visibility into your data, that you can monitor the data flows and see what’s going on in real time, that you get alerts when anything happens. This could be when CPU or memory usage on any of the nodes goes above certain criteria. It could be when applications crash, or data flows crash. It could be when volume goes above or below what you expect, and doing that in a granular fashion. For example, when an individual database table goes above or below what you expect.

You need to be able to work with insights into the data flows that help you operationalize this and make sure that it’s working full time, 24/7, when you actually put it into production. You may even want to get insights on the data itself, drill down into the actual data that’s flowing, and do some analytics on that. If your streaming integration platform can also give you those valuable insights on the streaming data, then that’s the icing on the cake.

Just to summarize, when you’re evaluating streaming data integration platforms, you need to make sure that the platform can do everything that you need, to get your data from where it’s generated to where it needs to be, in order to get real value out of your data.

 

To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

HPE NonStop Bi-directional replication

HPE NonStop Community Prioritizes Bi-Directional Data Movement to the Cloud

As 2019 comes to a close, many of the posts and news releases coming from Striim have focused on the bidirectional data movement support between traditional databases and the cloud. For the HPE NonStop community, this is becoming a topic of conversation, as in many cases this is a compelling alternative to other options available.

While many NonStop users rely on multiple, geographically separated NonStop deployments to meet their business continuity requirements, having the option to back up mission-critical data to the cloud has its merits. The most compelling is that it is a lower-cost option than more traditional approaches to business continuity through global distribution, as practiced by some of the larger financial institutions.

HPE NonStop Bi-directional replication

There are instances too where implementing hybrid configurations involving NonStop and private, on-prem clouds necessitates the movement of data between the cloud and NonStop – particularly with the increased interest in NonStop SQL supporting database as a service (DBaaS) to applications running in the cloud.

Having the opportunity to move the data in both directions allows for a risk-reduced operation, whereby data can be moved in phases knowing that, should something fail in the process, the application can continue to run uninterrupted. This lends itself to satisfying the needs of those NonStop users who are running mission-critical applications.

In her post On-Premises-to-Cloud Migration: How to Minimize the Risks on the Striim blog, Irem Radzik writes about the inherent value that comes with Striim having embraced a change data capture (CDC) model to better ensure consistency between source and target databases. This is of great importance to the NonStop community and has been at the heart of business continuity implementations for more than a decade. According to Radzik:

“Here comes the good news that I love sharing: Today, newer, more sophisticated streaming data integration with change data capture technology minimizes disruptions and risks mentioned earlier. This solution combines initial batch load with real-time change data capture (CDC) and delivery capabilities.

“As the system performs the bulk load, the CDC component collects the changes in real time as they occur. As soon as the initial load is complete, the system applies the changes to the target environment to maintain the legacy and cloud database consistent.”

Bi-directional data replication featuring CDC was the theme of the article News from Gartner and news from Striim; one objective – (bi-directional) movement to the cloud! published in the December 2019 issue of NonStop Insider. In this article, Alok Pareek, Co-Founder and EVP of Product at Striim, is quoted:

“Responding to requests from marquee customers who use Striim to enable their hybrid cloud infrastructure, our engineering team has delivered a robust bi-directional data replication offering. These enterprise customers finally have a next-generation, zero-downtime, zero-data-loss solution for online phased database migrations, allowing them to seamlessly run their new cloud environments in parallel with the legacy systems for a gradual transition for their end users.”

For now, however, the extent of the opportunity for the NonStop community to leverage this capability (available in Striim release 3.9.7) requires additional input from NonStop users. As was noted in the article in NonStop Insider, when it comes to the NonStop community:

“Striim is still canvassing the NonStop community members for further feedback about their own use-case potential as Striim is to prioritize support of bidirectional NonStop SQL to / from Cloud based solely on these NonStop users requirements.”


Should you have any trouble at all with the above hyperlink to the article published in the digital publication NonStop Insider, you can always cut and paste this link into your browser:

https://www.nonstopinsider.com/uncategorised/news-from-gartner-and-news-from-striim-one-objective-bi-directional-movement-to-the-cloud/

For now, if the functionality of this latest release of Striim is of interest to you as a NonStop user and you would like to know more about the capabilities on offer with Striim Release 3.9.7, please email us or give us a call, and make sure you check out our web site at https://www.striim.com/ for all the news as it breaks.

PostgreSQL to Kafka

Streaming Data Integration Tutorial: Adding a Kafka Target to a Real-Time Data Pipeline

This is the second post in a two-part blog series discussing how to stream database changes into Kafka. You can read part one here. We will discuss adding a Kafka target to the CDC source from the previous post. The application will ingest database changes (inserts, updates, and deletes) from the PostgreSQL source tables and deliver them to Kafka to continuously update a Kafka topic.

What is Kafka?

Apache Kafka is a popular distributed, fault-tolerant, high-performance messaging system.

Why use Striim with Kafka?

The Striim platform enables you to ingest data into Kafka, process it for different consumers, analyze, visualize, and distribute to a broad range of systems on-premises and in the cloud with an intuitive UI and SQL-based language for easy and fast development.

How to add a Kafka Target to a Striim Dataflow

From the Striim Apps page, click on the app that we created in the previous blog post and select Manage Flow.

MyPostgreSQL CDC App
MyPostgreSQL-CDC App

This will open your application in the Flow Designer.

PostgreSQL CDC App Data Flow
MyPostgreSQL-CDC app data flow.

To write to Kafka, we need to add a Target component to the dataflow. Click on the data stream, then on the plus (+) button, and select “Connect next Target component” from the menu.

Connecting a target component to the Data Flow
Connecting a target component to the data flow.

Enter the Target Info

The next step is to specify how to write data to the target. In the New Target ADAPTER drop-down, select Kafka Writer Version 0.11.0, and enter a few connection properties including the target name, topic, and broker URL.

Configuring the Kafka Target
Configuring the Kafka target.
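As a concrete illustration (all values here are hypothetical), you might use:

Target name: PostgresCDCKafka
Topic: postgres_cdc
Broker URL: localhost:9092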

Data Formatting 

Different Kafka consumers may have different requirements for the data format. When writing to Kafka in Striim, you can choose the data format with the FORMATTER drop-down and optional configuration properties. Striim supports JSON, delimited, XML, Avro, and free-text formats; in this case, we are selecting the JSONFormatter.

Configuring the Kafka target formatter
Configuring the Kafka target FORMATTER.

Deploying and Starting the Data Flow

The resulting data flow can now be modified, deployed, and started through the UI. In order to run the application, it first needs to be deployed: click on the ‘Created’ dropdown and select ‘Deploy App’ to show the Deploy UI.

Deploying CDC app
Deploying the app.

The application can be deployed to all nodes, any one node, or predefined groups in a Striim cluster; the default is the least used node.

Deployment node selection.
Deployment node selection.

After deployment, the application is ready to start by selecting Start App.

Starting the app.
Starting the app.

Testing the Data Flow

You can use the PostgreSQL to Kafka sample integration application to insert, delete, and update the PostgreSQL CDC source table. You should then see data flowing in the UI, indicated by the msgs/s counter. (Note that the message sending happens quickly and the counter soon returns to 0.)

Testing the streaming data flow.
Testing the streaming data flow.

If you now click on the data stream in the middle and click on the eye icon, you can preview the data flowing between PostgreSQL and Kafka. Here you can see the data, metadata (these are all updates) and before values (what the data was before the update).

Previewing the data flowing from PostgreSQL to Kafka
Previewing the data flowing from PostgreSQL to Kafka.
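With the JSON formatter selected earlier, an update event on the Kafka topic might look roughly like the following; the table, column, and field names are illustrative, and the exact structure depends on the formatter configuration:

{
  "data": {"ID": 7, "NAME": "Acme", "STATUS": "active"},
  "before": {"ID": 7, "NAME": "Acme", "STATUS": "pending"},
  "metadata": {"OperationName": "UPDATE", "TableName": "public.testtable"}
}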

There are many other sources and targets that Striim supports for streaming data integration. Please request a demo with one of our lead technologists, tailored to your environment.

Change Data Capture - change log

Streaming Data Integration Tutorial: Using CDC to Stream Database Changes

 

This is the first in a two-part blog post discussing how to use Striim for streaming database changes to Apache Kafka. Striim offers continuous data ingestion from databases and other sources in real time; transformation and enrichment using Streaming SQL; delivery of data to multiple targets in the cloud or on-premise; and visualization of results. In this part, we will use Striim’s low-impact, real-time change data capture (CDC) feature to stream database changes (inserts, updates, and deletes) from an operational database into Striim.

What is Change Data Capture?

Databases maintain change logs that record all changes made to the database contents and metadata. These change logs can be used for database recovery in the event of a crash, and also for replication or integration.

Change data capture change log

With Striim’s log-based CDC, new database transactions – including inserts, updates, and deletes – are read from source databases’ change logs and turned into a stream of events without impacting the database workload. Striim offers CDC for Oracle, SQL Server, HPE NonStop, MySQL, PostgreSQL, MongoDB, and MariaDB.

Why use Striim’s CDC?

Businesses use Striim’s CDC capabilities to feed real-time data to their big data lakes, cloud databases, and enterprise messaging systems, such as Kafka, for timely operational decision making. They also migrate from on-premises databases to cloud environments without downtime and keep cloud-based analytics environments up-to-date with on-premises databases using CDC.

How to use Striim’s CDC?

Striim’s easy-to-use CDC template wizards automate the creation of applications that leverage change data capture, to stream events as they are created, from various source systems to various targets. Apps created with templates may be modified using Flow Designer or by exporting TQL, editing it, and importing the modified TQL. Striim has templates for many source-target combinations.

In addition, Striim offers pre-built integration applications for bulk loading and CDC from PostgreSQL source databases to target systems including PostgreSQL database, Kafka, and files. You can start these applications in seconds by going to the Applications section of the Striim platform.

Striim Pre-built Sample Integration Applications
Striim pre-built sample integration applications.

In this post, we will show how to use the PostgreSQL CDC (PostgreSQL Reader) with a Striim Target using the wizards for a custom application instead of using the pre-built application mentioned above. The instructions below assume that you are using the PostgreSQL instance that comes with the Striim platform. If you are using your own PostgreSQL database instance, please review our instructions on how to set up PostgreSQL for CDC.

Using the CDC Template

To start building the CDC application, in the Striim web UI, go to the Apps page and select Add App > Start with Template. Enter PostgreSQL in the search field to narrow down the sources and select “PostgreSQL Reader to Striim”.

CDC application template
Wizard template selection when creating a new app.

Next enter the name and namespace for your application (the namespace is a way of grouping applications together).

Creating a new application using Striim.

Specifying the Data Source Properties

In the SETUP POSTGRESQL READER step, specify the data source and table properties (an illustrative example follows the figure below):

  • the connection URL, username, and password.
  • the tables for which you want to read change data.
Configuring the Data Source in the Wizard
Configuring the data source in the wizard.
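For example, if PostgreSQL is running locally with a hypothetical database named demodb, a user named striim, and a single source table named public.testtable, the properties would look something like this (substitute your own host, credentials, and tables):

Connection URL: jdbc:postgresql://localhost:5432/demodb
Username: striim
Tables: public.testtable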

After you complete this step, your application will open in the Flow Designer.

The Wizard Generates a Data Flow
The wizard generates a data flow.

In the flow designer, you can add various processors, enrichers, transformers, and targets as shown below to complete your pipeline, in some cases with zero coding.

Flow designer enrichers and processors.

Flow designer event transformers and targets.

 

In the next blog post, we will discuss how to add a Kafka target to this data pipeline. In the meantime, please feel free to request a demo with one of our lead technologists, tailored to your environment.

 

Log-based Change Data Capture

Log-Based Change Data Capture: the Best Method for CDC

 

 

Change data capture, and in particular log-based change data capture, has become popular in the last two decades as organizations have discovered that sharing real-time transactional data from OLTP databases enables a wide variety of use-cases. The fast adoption of cloud solutions requires building real-time data pipelines from in-house databases, in order to ensure the cloud systems are continually up to date. Turning enterprise databases into a streaming source, without the constraints of batch windows, lays the foundation for today’s modern data architectures. In this blog post, I would like to discuss Striim’s CDC capabilities along with its unique features that enhance the change data capture, as well as its processing and delivery across a wide range of sources and targets.

Log-Based Change Data Capture

In our previous blog post on Change Data Capture Methods, we explained why log-based change data capture is a better method to identify and capture change data. Striim uses the log-based CDC technique for the same reasons we stated in that post: log-based CDC minimizes the overhead on the source systems, reducing the chances of performance degradation. In addition, it is non-intrusive. It does not require changes to the application, such as adding triggers to tables. It is a lightweight but highly performant way to ingest change data. While Striim reads DML operations (INSERTs, UPDATEs, DELETEs) from the database logs, these systems continue to run with high performance for their end users.

Striim’s strengths for real-time CDC are not limited to the ingestion point. Here are a few capabilities of the Striim platform that build on its real-time, log-based change data capture in enabling robust, end-to-end streaming data integration solutions:

  1. Log-based CDC from heterogeneous databases for non-intrusive, low-impact real-time data ingestion: Striim uses log-based change data capture when ingesting from major enterprise databases including Oracle, HPE NonStop, MySQL, PostgreSQL, and MongoDB, among others. It minimizes CPU overhead on sources and requires neither application changes nor substantial management overhead to maintain the solution.
  2. Ingestion from multiple, concurrent data sources to combine database transactions with semi-structured and unstructured data. Striim’s real-time data ingestion is not limited to databases and the CDC method. With Striim you can merge real-time transactional data from OLTP systems with real-time log data (i.e., machine data), messaging systems’ events, sensor data, NoSQL, and Hadoop data to obtain rich, comprehensive, and reliable information about your business.
  3. End-to-end change data integration: Striim is designed from the ground-up to ingest, process, secure, scale, monitor, and deliver change data across a diverse set of sources and targets in real time. It does so by offering several robust capabilities out of the box:
    • Transaction integrity: When ingesting the change data from database logs, Striim moves committed transactions with the transactional context (i.e., ACID properties) maintained. Throughout the whole data movement, processing, and delivery steps, this transactional context is preserved so that users can create reliable replica databases, such as in the case of cloud bursting.
    • In-flight change data processing: Striim offers out-of-the-box transformers, and in-memory stream processing capabilities to filter, aggregate, mask, transform, and enrich change data while it is in motion. Using SQL-based continuous queries, Striim immediately turns change data into a consumable format for end users, without losing transactional context.
    • Built-in checkpointing for reliability: As the data moves and gets processed through the in-memory components of the Striim platform, every operation is recorded and tracked by the solution. If there is an outage, Striim can replay the transactions from where it left off — without missing data or creating duplicates.
    • Distributed processing in a clustered environment: Striim comes with a clustered environment for scalability and high availability. Without much effort, and using inexpensive hardware, you can scale out for very high data volumes with failover and recoverability assurances. With Striim, you don’t need to build your own clusters with third-party products.
    • Continuous monitoring of change data streams: Striim continuously tracks change data capture, movement, processing, and delivery processes, as well as the end-to-end integration solution via real-time dashboards. With Striim’s transparent pipelines, you have a clear view into the health of your integration solutions.
    • Schema change replication: When the source Oracle database schema is modified via a DDL statement, Striim applies the schema change to the target system without pausing the processes.
    • Data delivery validation: For database sources and targets, Striim offers out-of-the-box data delivery verification. The platform continuously compares the source and target systems as the data is moving, validating that the databases are consistent and all changed data has been applied to the target. In use cases where data loss must be avoided, such as migration to a new cloud data store, this feature immensely minimizes migration risks.
    • Concurrent, real-time delivery to a wide range of targets: With the same software, Striim can deliver change data in real time not only to on-premise databases but also to databases running in the cloud, cloud services, messaging systems, files, IoT solutions, Hadoop, and NoSQL environments. Striim’s integration applications can have multiple targets with concurrent real-time data delivery.
    • Pre-packaged applications for initial load and CDC: Striim comes with example integration applications that include initial load and CDC for PostgreSQL environments. These integration applications enable setting up data pipelines in seconds, and serve as a template for other CDC sources as well.
  4. Turning Change Data into Time-Sensitive Insights: In addition to building real-time integration solutions for change data, Striim can perform streaming analytics with flexible time windows, allowing you to gain immediate insights from your data in motion. For example, if you are moving financial transactions using Striim, you can build real-time dashboards that alert on potential fraud cases before Striim delivers the data to your analytics solution.

Log-based change data capture is the modern way to turn databases into streaming data sources. However, ingesting the change data is only the first of many concerns that integration solutions should address. You can learn more about Striim’s CDC offering by scheduling a demo with a Striim technologist or experience its enterprise-grade streaming integration solution first-hand by downloading a free trial.

 

Microsoft SQL Server to Kafka

Microsoft SQL Server CDC to Kafka

By delivering high volumes of data using Microsoft SQL Server CDC to Kafka, organizations gain visibility of their business and the vital context needed for timely operational decision making. Getting maximum value from Kafka solutions requires ingesting data from a wide variety of sources – in real time – and delivering it to users and applications that need it to take informed action to support the business.

Microsoft SQL Server to Kafka

Traditional methods used to move data, such as ETL, are just not sufficient to support high-volume, high-velocity data environments. These approaches delay getting data to where it can be of real value to the organization. Moving all the data, regardless of relevance, to the target creates challenges in storing it and getting actionable data to the applications and users that need it. Microsoft SQL Server CDC to Kafka minimizes latency and prepares data so it is delivered in the correct format for different consumers to utilize.

In most cases, the data that resides in transactional databases like Microsoft SQL Server is the most valuable to the organization. The data is constantly changing, reflecting every event or transaction that occurs. Using non-intrusive, low-impact change data capture (CDC), the Striim platform moves and processes only the changed data. With Microsoft SQL Server CDC to Kafka, users manage their data integration processes more efficiently and in real time.

Using a drag-and-drop UI and pre-built wizards, Striim simplifies creating data flows for Microsoft SQL Server CDC to Kafka. Depending on the requirements of users, the data can either be delivered “as-is,” or in-flight processing can filter, transform, aggregate, mask, and enrich the data. This delivers the data in the format needed, with all the relevant context, to meet the needs of different Kafka consumers – with sub-second latency.

Striim is an end-to-end platform that delivers the security, recoverability, reliability (including exactly once processing), and scalability required by an enterprise-grade solution. Built-in monitoring also compares sources and targets and validates that all data has been delivered successfully. 

In addition to Microsoft SQL Server CDC to Kafka, Striim offers non-intrusive change data capture (CDC) solutions for a range of enterprise databases including Oracle, Microsoft SQL Server, PostgreSQL, MongoDB, HPE NonStop SQL/MX, HPE NonStop SQL/MP, HPE NonStop Enscribe, and MariaDB.

For more information about how to use Microsoft SQL Server CDC to Kafka to maintain real-time pipelines for continuous data movement, please visit our Change Data Capture solutions page.

If you would like a demo of how Microsoft SQL Server CDC to Kafka works and to talk to one of our technologists, please contact us to schedule a demo.

real time data ingestion diagram

Real-Time Data Ingestion – What Is It and Why Does It Matter?

 

 

The integration and analysis of data from both on-premises and cloud environments give an organization a deeper understanding of the state of their business. Real-time data ingestion for analytical or transactional processing enables businesses to make timely operational decisions that are critical to the success of the organization – while the data is still current.

Real-time data ingestion diagram

Transactional and operational data contain valuable insights that drive informed and appropriate actions. Achieving visibility into business operations in real time allows organizations to identify and act on opportunities and address situations where improvements are needed. Real-time data ingestion to feed powerful analytics solutions demands the movement of high volumes of data from diverse sources without impacting source systems and with sub-second latency.

Using traditional batch methods to move the data introduces unwelcome delays. By the time the data is collected and delivered it is already out of date and cannot support real-time operational decision making. Real-time data ingestion is a critical step in the collection and delivery of volumes of high-velocity data – in a wide range of formats – in the timeframe necessary for organizations to optimize their value.

The Striim platform enables the continuous movement of structured, semi-structured, and unstructured data – extracting it from a wide range of sources and delivering it to cloud and on-premises endpoints – in real time and available immediately to users and applications.

The Striim platform supports real-time data ingestion from sources including databases, log files, sensors, and message queues and delivery to targets that include Big Data, Cloud, Transactional Databases, Files, and Messaging Systems. Using non-intrusive Change Data Capture (CDC) Striim reads new database transactions from source databases’ transaction or redo logs and moves only the changed data without impacting the database workload.

Real-time data ingestion is critical to accessing data that delivers significant value to a business. With clear visibility into the organization, based on data that is current and comprehensive, organizations can make more informed operational decisions faster.

To read more about real-time data ingestion, please visit our Real-Time Data Integration solutions page.

To have one of our experts guide you through a brief demo of our real-time data ingestion offering, please schedule a demo.

Oracle CDC to Postgres

As an open source alternative, Postgres offers a lower total cost of ownership and the ability to store structured and unstructured data. Real-time movement of transactional data using Oracle CDC to Postgres is essential to creating a rich and up-to-date view of operations and improving customer experiences.

Oracle CDC to Postgres

IDC projects that by the year 2025, 80% of all data will be unstructured. Emails and social media posts are good examples of unstructured data. The ability to integrate unstructured, semi-structured and structured data from transactional databases into the enterprise is vital for timely and relevant analysis. To get a deep understanding from all the data an organization captures and records and to get the most value from it, it must be in the right place and in the right format – in real time.

Continuous movement of transactional data using Oracle CDC to Postgres ensures the organization is utilizing the real-time information from on-prem transactional databases and other data stores that is needed to make decisions that optimize user experience and drive higher revenue.

Moving data from enterprise databases to Postgres using traditional ETL processes introduces latency. Delays incurred while the data is being migrated or updated result in an out-of-date picture of the business and limit the extent to which decisions can have any significant impact. If they move all the data as is, organizations also face challenges storing it and getting to the data that can produce real value for the organization.

Striim enables organizations to generate real value from the transactional data residing in their existing Oracle databases. Using non-intrusive change data capture (CDC), Striim enables continuous data ingestion from Oracle to Postgres with sub-second latency. Users can easily set up ingestion via Striim’s pre-configured CDC wizards, and drag-and-drop UI.

Moving and processing data in-flight, Striim filters data that is not required and delivers what is important to Postgres – in real time. The data can also be transformed and enriched so it is delivered in the format required. Oracle CDC to Postgres allows organizations to gain access to critical insights sooner and make more informed operational decisions faster.

Once the real-time data pipelines are built and the initial data load using Oracle CDC to Postgres has been performed, continuous updating with every new database transaction ensures that analytics applications have the most up-to-date information. Built-in monitoring continuously compares the source and target, validating database consistency and providing assurance that the replicated environment is completely up-to-date with the on-prem Oracle instance.

For more information on real-time data integration and processing using Striim’s Oracle CDC to Postgres solution, please visit our Change Data Capture page.

To see first-hand how easy it is to move data to Postgres using Striim’s Oracle CDC to Postgres functionality, please schedule a demo with one of our technologists.