Tutorial: Real-Time Database Integration with Apache Kafka via Change Data Capture

 

 

In this series of blog-based tutorials, we are going to step you through the process of building data flows for streaming integration and analytics applications using the Striim platform. This blog series focuses on the continuous collection of streaming data from a database; integration with Apache Kafka; Kafka stream processing, including enrichment, using continuous SQL-based queries; delivery of data to multiple targets; and Kafka data analytics and visualization of results.

Continuous Data Collection, Processing, Delivery, and Analytics with the Striim Platform

The first step in any streaming integration is sourcing data. You are probably aware that Striim can continuously collect data from many sources. In this tutorial we are going to be using change data capture (CDC) to stream database DML activity (inserts, updates and deletes) from a MySQL database. The same process would be used to read from Oracle, SQL Server, and MariaDB, but MySQL was chosen so you can easily try this at home.

Using the Striim Application Templates

To start building the CDC application, navigate to Apps from either the drop-down menu or the home screen to view all previously created data flow applications. This list may be empty if you are just getting started with Striim.

To create the application click “Add App” in the top right-hand corner. This will give you three options for creating a new application:

  • Start with Template
  • Start from Scratch
  • Import Existing App

 

 

The Different Options for Creating Applications Using Striim

In this case we are going to start with a template. There are a lot of templates for many combinations of sources and targets. You can narrow the search by entering terms into the Search Templates box.

Striim Provides Many Pre-defined Application Templates

Since we are sourcing from MySQL, you can narrow the search by entering MySQL.

Our first dataflow will write into Kafka, so select the appropriate Kafka version as a target. You can now name your application. Within Striim, applications have fully qualified names, which allows the same application name (used by different departments or users) to exist in different namespaces. A namespace is really just a way to keep related things together.

This application is MySQLCDCToKafka, and we’ll put it in the ‘cdc’ namespace.

Creating a New Application Using Striim

When you click “Save,” the application is created and the wizard will walk you through the steps of the process:

  1. Enter the Data Source info
  2. Test the connection and CDC capabilities
  3. Select the required tables
    (Do optional mapping and filtering of data)
  4. Enter the Target info

You can move forwards and backwards through these steps as necessary if you need to make any corrections.

Enter the Data Source Info

The first step is to enter the required information to connect to the data source. This will vary by source, but you will always need to choose a unique (within the namespace) data source name. In the case of MySQL you need to enter a connection URL, username, password and database name.

The First Step in the Wizard is Configuring the Data Source

Note: If you are using Striim in the cloud, and have configured an on-premise agent to collect data to send to the cloud, you will need to click on ‘configure an on-premise source’. Don’t worry about the password; it is automatically encrypted by our server.

To validate the information and move to the next step, click on the “Next” button at the bottom right.

Testing the Connection

The Striim platform not only verifies that the database connection information you entered is correct, but also checks that the database user has the correct privileges, and that CDC is set up correctly. This step varies from data source to data source. If anything is wrong, you will be notified and told how to fix the problem.

For example, when Next was clicked here, the ‘striim’ user did not have the correct privileges to receive CDC information from the MySQL database, and the results of step two were as follows.

The Second Step in the Wizard is Testing the Connection and Showing How to Fix Issues

As you can see, the explanation includes the steps required to fix the issue: the ‘striim’ user needed two additional privileges to obtain the CDC information. Once the ‘Replication Client’ and ‘Replication Slave’ privileges were granted, the UI allowed us to move to the next step.
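For reference, granting these privileges in MySQL typically looks like the statements below, run as an administrative user. The 'striim'@'%' account here is just an example and should match the user and host you configured:

GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'striim'@'%';
FLUSH PRIVILEGES;

These are global privileges, which is why they are granted ON *.* rather than on a single schema.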

The Second Step in the Wizard Once All Issues Are Removed

Select the Required Tables

The Striim platform enables you to selectively choose which database schemas and tables to use CDC for. These can be chosen from a list, or through wildcard selection.

The Third Step in the Wizard is Selecting Tables for CDC

In this case we are just going to select the PRODUCT_INV table in the test database schema. Clicking Next takes us to an optional screen where you can do more complex data selection.

You Can Optionally Do Advanced Data Mapping, if Necessary

Enter the Target Info

The final step is to specify how we want to write data to the target. When we selected the wizard we chose to write data to Apache Kafka. To complete this step, we simply need to enter a few connection properties including the target name, topic and broker URL.

The Final Wizard Step is Configuring the Target

The last task, before clicking “Next” one final time, is to choose the format of the output data. The Striim platform supports JSON, Delimited, XML, Avro, and free text formats for Kafka topics. In this case we are selecting the JSONFormatter. This has a number of optional properties, but we are just going to go with the defaults.

Using the wizard, we quickly built an integration application that collects data continuously from MySQL using change data capture and writes it in real time to Apache Kafka in JSON format. After clicking “Next” you are shown the data flow, and can now edit, run, and test the data flow as necessary.

You can also use Striim to write data to managed Kafka services like Confluent Cloud and AWS managed Kafka. Striim also supports cloud-based message buses such as Google PubSub and Azure EventHub.

To simplify your deployments for cloud-based Kafka services, Striim is available on the AWS, Azure, and Google Cloud marketplaces.

Deploying and Starting the Data Flow

The resulting data flow can now be modified, deployed and started through the UI.

The Wizard Generates an End-to-End Data Flow

This is achieved through the state control button directly above the data flow. The initial state of the data flow is ‘created’. This means it is just a definition of what the application is supposed to do. In order to turn this into runtime objects, the application needs to be deployed.

 

 

Click on the ‘Created’ dropdown and select ‘Deploy App’ to show the Deploy UI.

 

 

Striim data flows can be deployed very flexibly. The default is to deploy the data flow to the least used node in a Striim cluster. However, the application can be deployed to all, or just some, of the nodes (in predefined groups) as necessary. You can even split applications into sub-flows and deploy pieces on different parts of the cluster if required. In this case it’s easy: we’ll just deploy this whole application to one node. After deployment, the application is ready to start by selecting Start App.

Testing the Data Flow

Assuming there is activity on the MySQL table you selected during table selection, you should see data flowing in the UI, indicated by a msgs/s count.

Testing the Streaming Data Flow

If you now click on the data stream in the middle and click on the eye icon, you can preview the data flowing between MySQL and Kafka.

Previewing the Data Flowing from MySQL to Kafka

Here you can see the data, metadata (these are all updates) and before values (what the data was before the update).

You can write a Kafka Consumer to see what this looks like in JSON format, or use a simple Striim application to read from Kafka, then use the built-in preview mechanism to see what it looks like.

Testing the Message Format Using a Striim Kafka Reader
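As a rough sketch, such a Kafka-reading Striim application might be defined in TQL as shown below. The KafkaReader and JSONParser adapter names and properties are assumptions based on the writer configuration used in this tutorial; check the adapter reference for your Striim version:

CREATE SOURCE ReadKafkaJSON USING KafkaReader VERSION '0.8.0' (
  brokerAddress: 'localhost:9092',
  Topic: 'mysqlcdc'
)
PARSE USING JSONParser ()
OUTPUT TO KafkaJSONStream;

Previewing the output stream with the same eye icon then shows the JSON records as they land on the topic.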

Exporting As A Script

With the Striim Platform you are not forced to use the UI (even though it is an incredibly fast and convenient way of doing things). You can instead use our scripting language.

Any streaming data flow application can be exported as a TQL script, and any TQL script can be imported then worked with in the UI.

If we go back to the original MySQLCDCToKafka application, and click on the configuration dropdown on the top right of the toolbar, you can export it.

The script for the application looks like this:

CREATE APPLICATION MySQLCDCToKafka;

CREATE SOURCE MySQLCDC USING MysqlReader (
  Username: 'striim',
  Password: 'Vx/QMzVeRTMY1rxZH+zycQ==',
  ConnectionURL: 'mysql://localhost:3306',
  Tables: 'test.PRODUCT_INV',
  DatabaseName: 'test',
  Password_encrypted: true
)
OUTPUT TO MySQLCDC_ChangeDataStream;

CREATE TARGET WriteToKafka USING KafkaWriter VERSION '0.8.0' (
  Mode: 'Sync',
  Topic: 'mysqlcdc',
  brokerAddress: 'localhost:9092'
)
FORMAT USING JSONFormatter ( )
INPUT FROM MySQLCDC_ChangeDataStream;

END APPLICATION MySQLCDCToKafka;

In our next tutorial, we will take the raw CDC data and use some in-memory SQL-based processing to transform and enrich it before we write the data out to Apache Kafka. In the meantime, please feel free to request a demo with one of our lead technologists, tailored to your environment.

Cloud ETL with Reliability

Implementing Streaming Cloud ETL with Reliability

 

 

If you’re considering adopting a cloud-based data warehousing and analytics solution, the most important consideration is how to populate it with up-to-date on-premises and cloud data in real time using cloud ETL.

As noted in a recent study by IBM and Forrester, 88% of companies want to run near-real-time analytics on stored data. Streaming cloud ETL enables real-time analytics by loading data from operational systems in private and public clouds to your analytics cloud of choice.

A streaming cloud ETL solution enables you to continuously load data from your in-house or cloud-based mission-critical operational systems to your cloud-based analytical systems in a timely and dependable manner. Reliable and continuous data flow is essential to ensuring that you can trust the data you use to make important operational decisions.

In addition to high-performance and scalability to handle your large data volumes, you need to look for data reliability and pipeline high availability. What should you look for in a streaming cloud ETL solution to be confident that it offers the high degree of reliability and availability your mission-critical applications demand?

For data reliability, the following two common issues with data pipelines should be avoided:

  1. Data loss. When data volumes increase and the pipelines are backed up, some solutions allow some of the data to be discarded. Also, when the CDC solution has limited data type support, it may not be able to capture the columns that contain that data type. If an outage or process interruption occurs, an incorrect recovery point can also lead to skipping some data.
  2. Duplicate data. After recovering from a process interruption, the system may create duplicate data. This issue becomes even more prevalent when processing the data with time windows after the CDC step.

How Striim Ensures Reliable and Highly Available Streaming Cloud ETL

Striim ingests, moves, processes, and delivers real-time data across heterogeneous and high-volume data environments. The software platform is built from the ground up specifically to ensure reliability for streaming cloud ETL solutions, with the following architectural capabilities.

Fault-tolerant architecture

Striim is designed with a built-in clustered environment with a distributed architecture to provide immediate failover. The metadata and clustering service watches for node failure, application failure, and failure of certain other services. If one node goes down, another node within the cluster immediately takes over without the need for users to do this manually or perform complex configurations.

Exactly Once Processing for zero data loss or duplicates

Striim’s advanced checkpointing capabilities ensure that no events are missed or processed twice while taking time window contents into account. It has been tested and certified for cloud solutions to offer real-time streaming to Microsoft Azure, AWS, and/or Google Cloud with event delivery guarantees.

During data ingest, checkpointing keeps track of all events the system processes and how far they got down various data pipelines. If something fails, Striim knows the last known good state and what position it needs to recover from. Advanced checkpointing is designed to eliminate loss in data windows when the system fails.

If you have a defined data window (say 5 minutes) and the system goes down, you cannot typically restart from where you left off because you will have lost the 5 minutes’ worth of data that was in the data window. That means your source and target will no longer be completely synchronized. Striim addresses this issue by coordinating with the data replay feature that many data sources provide to rewind sources to just the right spot if a failure occurs.

In cases where data sources don’t support data replay (for example, data coming from sensors), Striim’s persistent messaging stores and checkpoints data as it is ingested. Persistent messaging allows previously non-replayable sources to be replayed from a specific point. It also allows multiple flows from the same data source to maintain their own checkpoints. To offer exactly once processing, Striim checks to make sure input data has actually been written. As a result, the platform can checkpoint, confident in the knowledge that the data made it to the persistent queue.

End-to-end data flow management for simplified solution architecture and recovery

Striim’s streaming cloud ETL solution also delivers streamlined, end-to-end data integration between source and target systems that enhances reliability. The solution ingests data from the source in real time, performs all transformations such as masking, encryption, aggregation, and enrichment in memory as the data goes through the stream, and then delivers the data to the target in a single network operation. All of these operations occur in one step, without touching disk, and deliver the streaming data to the target in sub-seconds. Because Striim does not require additional products, this simplified solution architecture enables a seamless recovery process and minimizes the risk of data loss or inaccurate processing.

In contrast, a data replication service without built-in stream processing requires data transformation to be performed in the target (or source), with an additional product and network hop. This minimum two-hop process introduces unnecessary data latency. It also complicates the solution architecture, exposing the customer to considerable recovery-related risks and requiring a great deal of effort for accurate data reconciliation after an outage.

For use cases where transactional integrity matters, such as migrating to a cloud database or continuously loading transactional data for a cloud-based business system, Striim also maintains ACID properties (atomicity, consistency, isolation, and durability) of database operations to preserve the transactional context.

The best way to choose a reliable streaming cloud ETL solution is to see it in action. Click here to request a customized demo for your specific environment.

 

Change Data Capture Methods

 

In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. Companies use change data capture for several use cases such as cloud adoption and enabling real-time data warehousing. There are multiple common change data capture methods that you can implement depending on your application requirements and tolerance for performance overhead.

  1. Introduction
  2. Audit Columns
  3. Table Deltas
  4. Triggers
  5. Log-based change data capture

 

Introduction

In high-velocity data environments where time-sensitive decisions are made, change data capture is an excellent fit to achieve low-latency, reliable, and scalable data integration. With over 80% of companies planning on implementing multi-cloud strategies by 2025, picking the right change data capture method for your business is more critical than ever given the need to replicate data across multiple environments.

The business transactions captured in relational databases are critical to understanding the state of business operations. Traditional batch-based approaches to move data once or several times a day introduce latency and reduce the operational value to the organization. Change Data Capture provides real-time or near real-time movement of data by moving and processing data continuously as new database events occur.

There are several change data capture methods to identify changes that need to be captured and moved. Here are the common methods, how they work, and their advantages as well as shortcomings.

Audit Columns

By using existing “LAST_UPDATED” or “DATE_MODIFIED” columns, or by adding one if not available in the application, you can create your own change data capture solution at the application level. This approach retrieves only the rows that have been changed since the data was last extracted.

The CDC logic for this technique would be (see the SQL sketch below):

  1. Get the maximum value of both the target table’s ‘Created_Time’ and ‘Updated_Time’ columns.
  2. Select all the rows from the data source with a ‘Created_Time’ greater than (>) the target table’s maximum ‘Updated_Time’; these are all the newly created rows since the last CDC process was executed.
  3. Select all rows from the source table that have an ‘Updated_Time’ greater than (>) the target table’s maximum ‘Updated_Time’ but a ‘Created_Time’ less than (<) the target table’s maximum ‘Created_Time’. Rows created after that point are excluded because they were already picked up in step 2.
  4. Insert the new rows from step 2 and modify the existing rows from step 3 in the target.
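A minimal SQL sketch of this logic, assuming hypothetical source and target tables src.orders and tgt.orders that both carry Created_Time and Updated_Time audit columns (MySQL syntax):

-- Step 1: capture the high-water marks from the target table
SELECT MAX(Created_Time), MAX(Updated_Time)
  INTO @max_created, @max_updated
  FROM tgt.orders;

-- Step 2: rows created since the last CDC run
SELECT *
  FROM src.orders
 WHERE Created_Time > @max_updated;

-- Step 3: rows updated since the last CDC run (created earlier, so not in step 2)
SELECT *
  FROM src.orders
 WHERE Updated_Time > @max_updated
   AND Created_Time <= @max_created;

The rows from step 2 are inserted into the target, and the rows from step 3 are used to update it.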

Pros of this method

  • It can be built with native application logic
  • It doesn’t require any external tooling

Cons of this method

  • Adds additional overhead to the database
  • DML statements such as deletes will not be propagated to the target without additional scripts to track deletes
  • Error prone and likely to cause issues with data consistency

This approach also requires CPU resources to scan the tables for the changed data and maintenance resources to ensure that the DATE_MODIFIED column is applied reliably across all source tables.

Table Deltas

You can use table delta or ‘tablediff’ utilities to compare the data in two tables for non-convergence. Then you can use additional scripts to apply the deltas from the source table to the target as another approach to change data capture. There are several examples of SQL scripts that can find the difference between two tables, along the lines of the sketch below.
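For illustration only, a delta between a source table and its previous snapshot could be found with a query like this (table and column names are hypothetical):

-- Rows that are new or have changed since the previous snapshot
SELECT s.*
  FROM purchases s
  LEFT JOIN purchases_snapshot t
    ON  t.purchase_id = s.purchase_id
    AND t.amount      = s.amount
    AND t.status      = s.status
 WHERE t.purchase_id IS NULL;

A similar query with the tables swapped finds rows that were deleted from the source.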

Advantages of this approach:

  • It provides an accurate view of changed data while only using native SQL scripts

Disadvantages of this approach:

  • Demand for storage significantly increases because you need three copies of the data sources that are being used in this technique: the original data, previous snapshot, and current snapshot
  • It does not scale well in applications with heavy transactional workloads

Although this works better for managing deleted rows, the CPU resources required to identify the differences are significant, and the overhead increases linearly with the volume of data. The diff method also introduces latency and cannot be performed in real time. Some log-based change data capture tools come with the ability to compare source and target tables to ensure replication consistency.

Triggers

Another method for building change data capture at the application level is defining triggers and creating your own change log in shadow tables. Triggers fire before or after INSERT, UPDATE, or DELETE commands (that indicate a change) and are used to create a change log. Because this operates at the SQL level, some users prefer this approach. Some databases even have native support for triggers.
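For example, a trigger-based change log for a hypothetical purchases table might be set up roughly as follows (MySQL syntax; the shadow table and column names are illustrative):

CREATE TABLE purchases_changelog (
  change_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
  change_type VARCHAR(10),
  purchase_id INT,
  changed_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER purchases_after_update
AFTER UPDATE ON purchases
FOR EACH ROW
  INSERT INTO purchases_changelog (change_type, purchase_id)
  VALUES ('UPDATE', NEW.purchase_id);

Similar triggers would be needed for inserts and deletes.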

However, triggers are required for each table in the source database, and there is greater overhead associated with running triggers on operational tables while the changes are being made. In addition to having a significant impact on the performance of the application, maintaining the triggers as the application changes leads to a management burden.

Advantages of this approach:

  • Shadow tables can provide an immutable, detailed log of all transactions
  • Directly supported in the SQL API for some databases

Disadvantage of this approach:

  • Significantly reduces the performance of the database by requiring multiple writes to a database every time a row is inserted, updated, or deleted

Many application users do not want to risk changing application behavior by introducing triggers on operational tables. DBAs and data architects should always heavily test the performance of any triggers added to their environment and decide whether they can tolerate the additional overhead.

Log-Based Change Data Capture

Databases contain transaction logs (sometimes called redo logs) that store all database events, allowing the database to be recovered in the event of a crash. With log-based change data capture, new database transactions – including inserts, updates, and deletes – are read from source databases’ native transaction or redo logs.

The changes are captured without making application level changes and without having to scan operational tables, both of which add additional workload and reduce source systems’ performance.

Advantages of this approach

  • Minimal impact on production database system – no additional queries required for each transaction
  • Can maintain ACID reliability across multiple systems
  • No requirement to change the production database system’s schemas or the need to add additional tables

Challenges of this approach

  • Parsing the internal logging format of a database is complex – most databases do not document the format nor do they announce changes to it in new releases. This would potentially require you to change your database log parsing logic with each new database release.
  • Requires a system to manage the metadata for database change events
  • Additional log levels required to produce scannable redo logs can add marginal performance overhead

Data integration platforms that natively perform change data capture can handle the complexity mentioned above by automatically mining the database change logs while managing additional metadata to ensure the replication between two or more systems is reliable.

If you would like to see how low-impact, real-time, log-based change data capture works, or to talk to one of our CDC experts, you can schedule a demo of The Striim Platform.

Oracle Change Data Capture (CDC)

Oracle Change Data Capture Tutorial – An Event-Driven Architecture for Cloud Adoption

 

 

All businesses rely on data. Historically, this data resided in monolithic databases, and batch ETL processes were used to move that data to warehouses and other data stores for reporting and analytics purposes. As businesses modernize, looking to the cloud for analytics, and striving for real-time data insights, they often find that these databases are difficult to completely replace, yet the data and transactions happening within them are essential for analytics. With over 80% of businesses noting that the volume and velocity of their data is rapidly increasing, scalable cloud adoption and change data capture from databases like Oracle, SQL Server, MySQL, and others are more critical than ever before. Oracle change data capture in particular is one area where companies are seeing an influx of modern data integration use cases.

To resolve this, more and more companies are moving to event-driven architectures because of their dynamic, distributed scalability, which makes sharing large volumes of data across systems possible.

In this post we will look at an example which replaces batch ETL by event-driven distributed stream processing: Oracle change data capture events are extracted as they are created; enriched with in-memory, SQL-based denormalization; then delivered to the Azure Cloud to provide scalable, real-time, low-cost analytics, without affecting the source database. We will also look at using the enriched events, optionally backed by Kafka, to incrementally add other event-driven applications or services.

Continuous Data Collection, Processing, Delivery, and Analytics with the Striim Platform

Event-Driven Architecture Patterns

Most business data is produced as a sequence of events, or an event stream: for example, web or mobile app interactions, devices, sensors, bank transactions, all continuously generate events. Even the current state of a database is the outcome of a sequence of events. Treating state as the result of a sequence of events forms the core of several event-driven patterns.

Event Sourcing is an architectural pattern in which the state of the application is determined by a sequence of events. As an example, imagine that each “event” is an incremental update to an entry in a database. In this case, the state of a particular entry is simply the accumulation of events pertaining to that entry. In the example below the stream contains the queue of all deposit and withdrawal events, and the database table persists the current account balances.

Imagine Each Event as a Change to an Entry in a Database

The events in the stream can be used to reconstruct the current account balances in the database, but not the other way around. Databases can be replicated with a technology called Change Data Capture (CDC), which collects the changes being applied to a source database as soon as they occur (by monitoring its change log), turns them into a stream of events, and then applies those changes to a target database. Source code version control is another well-known example of this, where the current state of a file is some base version plus the accumulation of all changes that have been made to it.

The Change Log can be used to Replicate a Database

What if you need to have the same set of data for different databases, for different types of use? With a stream, the same message can be processed by different consumers for different purposes. As shown below, the stream can act as a distribution point, where, following the polyglot persistence pattern, events can be delivered to a variety of data stores, each using the most suited technology for a particular use case or materialized view.

Streaming Events Delivered to a Variety of Data Stores

Event-Driven Streaming ETL Use Case Example

Below is a diagram of the Event-Driven Streaming ETL use case example:

Event-Driven Streaming ETL Use Case Diagram
  1. Striim’s low-impact, real-time Oracle change data capture (CDC) feature is used to stream database changes (inserts, updates and deletes) from an Operational Oracle database into Striim
  2. CDC Events are enriched and denormalized with Streaming SQL and Cached data, in order to make relevant data available together
  3. Enriched, denormalized events are streamed to CosmosDB for real-time analytics
  4. Enriched streaming events can be monitored in real time with the Striim Web UI, and are available for further Streaming SQL analysis, wizard-based dashboards, and other applications on-premise or in the cloud.

Replacing Batch Extract with Real Time Streaming of CDC Order Events

Striim’s easy-to-use CDC wizards automate the creation of applications that leverage change data capture, to stream events as they are created, from various source systems to various targets. In this example, shown below, we use Striim’s OracleReader (Oracle Change Data Capture) to read the Order OLTP transactions in Oracle redo logs and stream these insert, update, delete operations, as soon as the transactions commit, into Striim, without impacting the performance of the source database.

Configuring Database Properties for the Oracle CDC Data Source
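In TQL, an Oracle CDC source definition follows the same pattern as the MySQL example earlier. The sketch below is illustrative only; the exact OracleReader property names and connection URL format may differ in your Striim version:

CREATE SOURCE OrderCDC USING OracleReader (
  Username: 'striim',
  Password: '********',
  ConnectionURL: 'localhost:1521:orcl',
  Tables: 'STORE.ORDERS'
)
OUTPUT TO OrderChangeStream;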

Utilizing Caches For Enrichment

Relational Databases typically have a normalized schema which makes storage efficient, but causes joins for queries, and does not scale well horizontally. NoSQL databases typically have a denormalized schema which scales across a cluster because data that is read together is stored together.

Normalized Schema with Joins for Queries Does Not Scale Horizontally

With a normalized schema, a lot of the data fields will be in the form of IDs. This is very efficient for the database, but IDs carry little meaning or context for downstream queries or analytics. In this example we want to enrich the raw Orders data with reference data from the SalesRep table, correlated by the Order Sales_Rep_ID, to produce a denormalized record that includes the Sales Rep Name and Email information, making analysis easier by having this data available together.

Since Striim is a high-speed, low-latency, SQL-based stream processing platform, reference data also needs to be loaded into memory so that it can be joined with the streaming data without slowing things down. This is achieved through the Cache component. Within the Striim platform, caches are backed by a distributed in-memory data grid that can contain millions of reference items distributed around a Striim cluster. Caches can be loaded from database queries, Hadoop, or files, and maintain data in memory so that joining with them can be very fast. In this example, shown below, the cache is loaded with a query on the SalesRep table using the Striim DatabaseReader.

Configuring Database Properties for the Sales Rep Cache
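A rough TQL sketch of such a cache is shown below. The type definition, query, and key-to-map property follow the general pattern of Striim caches, but the exact syntax and property names may differ by version:

CREATE TYPE SalesRepType (
  SALES_REP_ID integer,
  SALES_REP_NAME string,
  SALES_REP_EMAIL string
);

CREATE CACHE SalesRepCache USING DatabaseReader (
  Username: 'striim',
  Password: '********',
  ConnectionURL: 'jdbc:oracle:thin:@localhost:1521:orcl',
  Query: 'SELECT SALES_REP_ID, NAME, EMAIL FROM SALESREP'
)
QUERY (keytomap: 'SALES_REP_ID')
OF SalesRepType;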

Joining Streaming and Cache Data For Real Time Transforming and Enrichment With SQL

We can process and enrich data-in-motion using continuous queries written in Striim’s SQL-based stream processing language. Using a SQL-based language is intuitive for data processing tasks, and most common SQL constructs can be utilized in a streaming environment. The main differences between using SQL for stream processing, and its more traditional use as a database query language, are that all processing is in-memory, and data is processed continuously, such that every event on an input data stream to a query can result in an output.

Dataflow Showing Joining and Enrichment of CDC data with Cache

This is the query we will use to process and enrich the incoming data stream:

Full Transformation and Enrichment Query Joining the CDC Stream with Cache Data
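A simplified sketch of what such a continuous query might look like follows; the stream, cache, and field names are illustrative (matching the sketches above rather than the exact screenshot), and type-conversion functions are omitted:

CREATE CQ EnrichOrdersCQ
INSERT INTO EnrichedOrderStream
SELECT o.ORDER_ID,
       o.ORDER_DATE,
       o.ORDER_TOTAL,
       o.SALES_REP_ID,
       s.SALES_REP_NAME,
       s.SALES_REP_EMAIL
FROM OrderChangeStream o, SalesRepCache s
WHERE o.SALES_REP_ID = s.SALES_REP_ID;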

In this query we select the Order stream and SalesRep cache fields that we want, apply transformations to convert data types, put the Order stream and SalesRep cache in the FROM clause, and include a join on SALES_REP_ID as part of the WHERE clause. The result of this query is to continuously output enriched (denormalized) events, shown below, for every CDC event that occurs for the Orders table. So with this approach we can join streams from an Oracle Change Data Capture reader with cached data for enrichment.

Events After Transformation and Enrichment

Loading the Enriched Data to the Cloud for Real Time Analytics

Now the Oracle CDC (Oracle change data capture) data, streamed and enriched through Striim, can be stored simultaneously in Azure Cloud blob storage and Azure Cosmos DB, for elastic storage with advanced big data analytics, using the Striim AzureBlobWriter and the CosmosDBWriter shown below.

The image below shows the Striim flow web UI for our streaming ETL application. Flows define what data an application receives, how it processes the data, and what it does with the results.

End-to-End Data Flow

Using Kafka for Streaming Replay and Application Decoupling

The enriched stream of order events can be backed by or published to Kafka for stream persistence, laying the foundation for streaming replay and application decoupling. Striim’s native integration with Apache Kafka makes it quick and easy to leverage Kafka to make every data source replayable, enabling recovery even for streaming sources that cannot be rewound. This also acts to decouple applications, enabling multiple applications to be powered by the same data source, and allowing new applications, caches, or views to be added later.

Streaming SQL for Aggregates

We can further use Striim’s streaming SQL on the denormalized data to make a real-time stream of summary metrics about the events being processed available to Striim Real-Time Dashboards and other applications. For example, to create a running count and sum of orders per SalesRep in the last hour, from the stream of enriched orders, you would use a window and the familiar GROUP BY clause, wrapped in a continuous query (the CQ and output stream names below are illustrative).

CREATE WINDOW OrderWindow
OVER EnrichCQ
KEEP WITHIN 1 HOUR
PARTITION BY sales_rep_id;

CREATE CQ OrderSummaryCQ
INSERT INTO OrderSummaryStream
SELECT sales_rep_id, sales_rep_Name,
       COUNT(*) AS orderCount,
       SUM(order_total) AS totalAmount
FROM OrderWindow
GROUP BY sales_rep_id;

Monitoring

With the Striim Monitoring Web UI we can now monitor our data pipeline with real-time information for the cluster, application components, servers, and agents. The main monitor page allows you to visualize summary statistics for Events Processed, App CPU%, Server Memory, or Server CPU%. Below, the Monitor App page displays our App Resources, Performance, and Components.

Striim Monitoring Web UI Monitor App Page

Clicking on an app component’s ‘more details’ button will display more detailed performance information, such as CPU and event rate, as shown below:

Striim Monitoring Web UI Monitor App Component Details Page

Summary

In this blog post, we discussed how we can use Striim to:

  1. Perform Oracle Change Data Capture to stream database changes in real-time
  2. Use streaming SQL and caches to easily denormalize data in order to make relevant data available together
  3. Load streaming enriched data to the cloud for real-time analytics
  4. Use Kafka for persistent streams
  5. Create rolling aggregates with streaming SQL
  6. Continuously monitor data pipelines

Additional Resources:

To read more about real-time data ingestion, please visit our Real-Time Data Integration solutions page.

To learn more about the power of streaming SQL, visit Striim Platform Overview product page, schedule a demo with a Striim technologist, or download a free trial of the platform and try it for yourself!

To learn more about Striim’s capabilities to support the data integration requirements for an Azure hybrid cloud architecture check out all of Striim’s solutions for Azure.

MySQL to Google BigQuery using CDC

Tutorial: Migrating from MySQL to BigQuery for Real-Time Data Analytics

 

 

In this post, we will walk through an example of how to replicate and synchronize your data from on-premises MySQL to BigQuery using change data capture (CDC).

Data warehouses have traditionally been on-premises services that required data to be transferred using batch load methods. Ingesting, storing, and manipulating data with cloud data services like Google BigQuery makes the whole process easier and more cost effective, provided that you can get your data in efficiently.

The Striim real-time data integration platform allows you to move data in real time as changes are recorded, using a technology called change data capture. This allows you to build real-time analytics and machine learning capabilities from your on-premises datasets with minimal impact.

Source MySQL Database

Before you set up the Striim platform to synchronize your data from MySQL to BigQuery, let’s take a look at the source database and prepare the corresponding database structure in BigQuery. For this example, I am using a local MySQL database with a simple purchases table to simulate a financial datastore that we want to ingest from MySQL to BigQuery for analytics and reporting.

I’ve loaded a number of initial records into this table and have a script to apply additional records once Striim has been configured to show how it picks up the changes automatically in real time.
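The exact schema is not important for this walkthrough, but for illustration the purchases table might look something like this (hypothetical columns):

CREATE TABLE purchases (
  purchase_id   INT AUTO_INCREMENT PRIMARY KEY,
  customer_name VARCHAR(100),
  product       VARCHAR(100),
  amount        DECIMAL(10,2),
  purchased_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Whatever schema you use, the corresponding BigQuery table created later should mirror the same columns and types.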

Targeting Google BigQuery

You also need to make sure your instance of BigQuery has been set up to mirror the source or the on-premises data structure. There are a few ways to do this, but because you are using a small table structure, you are going to set this up using the Google Cloud Console interface. Open the Google Cloud Console, and select a project, or create a new one. You can now select BigQuery from the available cloud services. Create a new dataset to hold the incoming data from the MySQL database.

Once the dataset has been created, you also need to create a table structure. Striim can perform the transformations while the data flies through the synchronization process. However, to make things a little easier here, I have replicated the same structure as the on-premises data source.

You will also need a service account to allow your Striim application to access BigQuery. Open the service account option through the IAM window in the Google Cloud Console and create a new service account. Give the necessary permissions for the service account by assigning BigQuery Owner and Admin roles and download the service account key to a JSON file.

Set Up the Striim Application

Now you have your data in a table in the on-premises MySQL database and have a corresponding empty table with the same fields in BigQuery. Let’s now set up a Striim application on Google Cloud Platform for the migration service.

Open your Google Cloud Console and open or start a new project. Go to the marketplace and search for Striim. A number of options should return, but the option you are after is the first item that allows integration of real-time data to Google Cloud services.

Select this option and start the deployment process. For this tutorial, you are just using the defaults for the Striim server. In production, you would need to size appropriately depending on your load.

Click the deploy button at the bottom of this screen and start the deployment process.

Once this deployment has finished, the details of the server and the Striim application will be generated.

Before you open the admin site, you will need to add a few files to the Striim Virtual Machine. Open the SSH console to the machine and copy the JSON file with the service account key to a location Striim can access. I used /opt/striim/conf/servicekey.json.


Give these files the right permissions by running the following commands:

chown striim:striim <filename>

chmod 770 <filename>

You also need to restart the Striim services for this to take effect. The easiest way to do this is to restart the VM.

Once this is done, close the shell and click on the Visit The Site button to open the Striim admin portal.

Before you can use Striim, you will need to configure some basic details. Register your details and enter the cluster name (I used “DemoCluster”) and password, as well as an admin password. Leave the license field blank to get a trial license if you don’t have a license, then wait for the installation to finish.

 

When you get to the home screen for Striim, you will see three options. Let’s start by creating an app to connect your on-premises database with BigQuery to perform the initial load of data. To create this application, you will need to start from scratch from the applications area. Give your application a name and you will be presented with a blank canvas.

The first step is to read data from MySQL, so drag a database reader from the sources tab on the left. Double-click on the database reader to set the connection string with a JDBC-style URL using the template:

jdbc:mysql://<server_ip>:<port>/<database>

You must also specify the tables to synchronize — for this example, purchases — as this allows you to restrict what is synchronized.

Finally, create a new output. I called mine PurchasesDataStream.

You also need to connect your BigQuery instance to your source. Drag a BigQuery writer from the targets tab on the left. Double-click on the writer and select the input stream from the previous step and specify the location of the service account key. Finally, map the source and target tables together using the form:

<source-database>.<source-table>,<target-database>.<target-table>

For this use case this is just a single table on each side.

Once both the source and target connectors have been configured, deploy and start the application to begin the initial load process. Once the application is deployed and running, you can use the monitor menu option on the top left of the screen to watch the progress.

Because this example contains a small data load, the initial load application finishes pretty quickly. You can now stop this initial load application and move on to the synchronization.

Updating BigQuery with Change Data Capture

Striim has pushed your current database up into BigQuery, but ideally you want to update this every time the on-premises database changes. This is where the change data capture application comes into play.

Go back to the applications screen in Striim and create a new application from a template. Find and select the MySQL CDC to BigQuery option.

 

Like the first application, you need to configure the details for your on-premises MySQL source. Use the same basic settings as before. However, this time the wizard adds the JDBC component to the connection URL.

When you click Next, Striim will ensure that it can connect to the local source. Striim will retrieve all the tables from the source. Select the tables you want to sync. For this example, it’s just the purchases table.

Once the local tables are mapped, you need to connect to the BigQuery target. Again, you can use the same settings as before by specifying the same service key JSON file, table mapping, and GCP Project ID.

Once the setup of the application is complete, you can deploy and turn on the synchronization application. This will monitor the on-premises database for any changes, then synchronize them into BigQuery.

Let’s see this in action by clicking on the monitor button again and loading some data into your on-premises database. As the data loads, you will see the transactions being processed by Striim.

Next Steps

As you can see, Striim makes it easy for you to synchronize your on-premises data from existing databases, such as MySQL, to BigQuery. By constantly moving your data into BigQuery, you could now start building analytics or machine learning models on top, all with minimal impact to your current systems. You could also start ingesting and normalizing more datasets with Striim to fully take advantage of your data when combined with the power of BigQuery.

To learn more about Striim for Google BigQuery, check out the related product page. Striim is not limited to MySQL to BigQuery integration, and supports many different sources and targets. To see how Striim can help with your move to cloud-based services, schedule a demo with a Striim technologist or download a free trial of the platform.

Advancement of Data Movement Technologies

Advancement of Data Movement Technologies: Whiteboard Wednesdays

 

In this Whiteboard Wednesday video, Irem Radzik, Head of Product Marketing at Striim, looks at how data movement technologies have evolved in response to changing user demands. Read on, or watch the 8-minute video:

Today we’re going to talk about the advancement of data movement technologies. We’re going to look at the ETL technologies that we started seeing in ‘90s, then the CDC (Change Data Capture)/Logical Replication solutions that we started seeing a couple of decades ago, and then streaming data integration solutions that we more commonly see today.

ETL

Let’s look at ETL technologies. ETL is known for its batch extract, then bringing the data into the transformation step in the middle tier server, and then loading the target in bulk again, typically for next-day reporting. You end up having high latency with these types of solutions. That was good enough for the ‘90s, but then we started demanding more fresh data for operational decision making. Latency became an issue with ETL solutions.

Data Movement - ETL

The other issue with ETL was the batch-window dependency. Because of the high impact on the production sources, there had to be a dedicated time for these batch extracts when the main users wouldn’t be able to use the production database. The batch window that was available for data extract became shorter and shorter as business demanded continuous access to the OLTP system.

The data volumes increased at the same time. You ended up not having enough time to move all the data you needed. That became a pain point for ETL users, driving them to look into other solutions.

Change Data Capture/Logical Replication

Change Data Capture/Logical Replication solutions addressed several of the key concerns that ETL had. Change Data Capture basically means that you continuously capture new transactions happening in the source database and deliver them to the target in real time.

Data Movement - CDC / Logical Replication

That obviously helps with the data latency problem. You end up having real-time, up-to-date data in the target for your operational decision making. The other plus of CDC is the source impact.

When it’s using logs (database logs) to capture the data, it has negligible impact. The source production system is available for transaction users. There is no batch window needed and no limitations for how much time you have to extract and move the data.

The CDC/Logical Replication solutions handle some of the key concerns of ETL users. They are made more for the E and L steps. What ends up happening with these solutions is that you need to do transformations within the database or with another tool, in order to complete the transformation step for end users.

The transformation happening there creates an ELT architecture and requires another product, another step, another network hop in your architecture, which complicates the process.

When there’s an outage, when there is a process disruption, reconciling your data and recovering becomes more complicated. That’s the shortcoming CDC users have been facing. These solutions were mainly made for databases.

Once the cloud and big data solutions became popular, the CDC providers had to come up with new products for cloud and big data targets. These are add-ons, not part of the main platform.

Another shortcoming that we’ve seen with CDC/Logical Replication solutions is their single node architecture, which translates into a single point of failure. This is a shortcoming, especially for mission-critical systems that need continuous availability of the data integration processes.

Streaming Data Integration

In recent years, streaming data integration came about to address the issues that CDC/Logical Replication products raised. It is becoming increasingly common. With streaming data integration, you’re not limited to just database sources.

Data Movement - Streaming Data Integration

You can have your files, log data, your machine data, your system log files for example, all moving in a real-time fashion. Your cloud sources, your service bus or your messaging systems can be your source. Your sensor data can be moved in real time, in a streaming fashion, to multiple targets. Again, not limited to just databases.

You can have cloud databases or other cloud services as your target. You can, in addition to databases, have messaging systems as your target, on-premises or in cloud, your big data solutions, on-premises or cloud. You can also deliver in file format.

Everything is like it was in a logical replication solution. It is continuous, in real time, and Change Data Capture is still a big component of the streaming data integration.

It’s built on top of the Change Data Capture technologies and brings additional data sources and additional data targets. Another important difference, and handling one of the challenges of logical replication, is the transformation piece. As we discussed, a transformation needs to happen and where it happens makes a big difference.

With streaming data integration, it’s happening in-flight. While the data is moving, you can have stream processing without adding more latency to your data. While the data is moving, it can be filtered, it can be aggregated, it can be masked and encrypted, and enriched with reference data, all in flight before it’s delivered to your target, so that it’s available in a consumable format. This streamlines your architecture, simplifies it, and makes all the recovery steps easier. It’s also delivering the data in the format that your users need.
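As a simple illustration of in-flight processing, a Striim-style continuous query can filter and reshape events while they are moving, before delivery to the target. The stream and field names below are purely illustrative:

CREATE CQ FilterLargeOrdersCQ
INSERT INTO FilteredOrderStream
SELECT order_id,
       order_total,
       sales_rep_id
FROM RawOrderStream
WHERE order_total > 100;

Masking, aggregation, and enrichment against cached reference data follow the same pattern, with the processing applied as the events pass through the stream.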

Another important thing to highlight is the distributed architecture. This natively clustered environment helps with a single point of failure risk. When one node fails, the other one takes over immediately, so you have a highly available data pipeline. This distributed clustered environment also helps you to scale out very easily, add more servers as you have more data to process and move.

These solutions now come with built-in monitoring. The real-time monitoring of the pipelines gives you an understanding of what’s happening with your integration flows. If there is an issue, such as high data latency or a process problem, you get immediate alerts, so you can trust that everything is running.

Data reliability is critical, and so is the reliability of the whole pipeline. To make sure that there is no data loss or duplicates, data delivery validation can be included in some of these solutions. You can also make sure, with the right solution, that everything is processed exactly once, and you are not repeating or dropping data. There are checkpointing mechanisms to make that possible.

As you see, the new streaming data integration solutions handle some of the challenges that we have seen in the past with outdated data movement technologies. To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

 

5 Streaming Cloud Integration Use Cases: Whiteboard Wednesdays

 

 

Today we’re going to talk about five streaming cloud integration use cases. Streaming cloud integration moves data continuously in real time between heterogeneous databases, with in-flight data processing. Read on, or watch the 9-minute video:

Let’s focus on how to use streaming data integration in cloud initiatives, and the five common scenarios that we see.

Use Case #1 – Online Migration/Cloud Adoption

Let’s start with the first one. It is basically adopting the cloud, or getting your data to the cloud. When you want to move your data to the cloud, streaming cloud integration helps you with online database migration. You have your legacy database and you want to move it to the cloud. If this is a critical database, you do not want to pause it during this migration; you want it to remain operational to support your business.

Streaming data integration offers Change Data Capture technology. This captures all new transactions, change transactions, as soon as they happen and delivers them to the target. While you’re doing the initial load to the cloud database, you can start Change Data Capture, keep the system open to transactions, and capture all the new transactions happening with the CDC feature.

Once the initial load is done, you can apply the change data to the target system so that the two are in sync. Because the source database stays open to transactions, you basically have no database downtime. It remains available for users, and once the change data is applied you also have the ability to validate that the two databases are in sync and that there was no data loss during the migration process. There are tools that provide this validation.

Because this database has production data and is up-to-date and the other one is still functional, you have unlimited time to test the new database. You can control the tests and be comfortable before you point any users or any applications to this cloud database. This unlimited testing minimizes your risks. The time pressure is gone, and you can be comfortable with your move to the cloud.

You also have the ability to perform phased migration. Bi-directional data flow between the legacy system and the cloud system allows you to have users on both sides. You can move some of your users to the cloud database and some of them are still in the legacy.

The streaming cloud integration solution can apply the changes happening in the cloud to the legacy and the changes happening in the legacy back to the cloud, so they stay in-sync. You can gradually move your users to the cloud database when you feel comfortable. Phased migration is another way to minimize your risk of moving your mission-critical systems to the cloud.

Use Case #2 – Hybrid Cloud Architecture for Analytics

We have discussed how to ease your cloud adoption, but once you’re in the cloud and you have adopted a cloud solution, you also need to treat it as part of your data center and build continuous data movement between your existing data sources and the new cloud solution.

Hybrid cloud architecture

We see quite a bit of cloud analytics solutions and that’s our second use case. Many organizations these days offload their analytics to cloud solutions. Modern cloud solutions give them tons of new features to modernize their analytics environment and transform their business.

We help with moving all kinds of enterprise data. This can be your databases, your machine data, all kinds of log files (security files and system log files), your existing cloud sources, your messaging systems, and your sensor data. All of them can be moved in real time continuously to your cloud analytics solution.

I would like to add that some streaming cloud integration solutions give you the ability to do in-flight data processing. Transformations happen in-flight so that you deliver the data, without adding latency, to the target system in a consumable format that it needs. You end up having data flowing in the right format for your cloud analytics solution.

The main value from this is that you can now run operational workloads, high value or high operational value producing analytics applications in your analytics solution. You can influence the decisions, operational decisions, happening in your business. That will help you gain faster business transformation throughout your enterprise.

Use Case #3 – Building New Applications in the Cloud

Building applications in the cloud

We talked about the analytics use case; here is another similar one. As part of your hybrid cloud architecture, you might be building new applications in the cloud. You still need data coming from your enterprise data sources to your cloud environment. By moving this diverse set of data in real time to your cloud messaging systems, cloud databases, or storage solutions, you are able to easily build applications in the cloud.

These modern applications move your business forward because the data is available. You can make better use of these cloud applications if you have this real-time bridge between your existing data center and your new cloud environment. Streaming Integration helps you to move your data so you can quickly build new applications for your business, to help it move forward with more modern solutions.

Use Case #4 – Multi-Cloud Integration

We also see multi-cloud use cases. A lot of companies now have one cloud solution for one purpose and another cloud solution for another purpose, and are working with multiple vendors. You have the option to feed your data to multiple targets: after you capture it once, you can feed it to all kinds of different targets, maybe one for analytics and one for supporting new applications. You have the ability to distribute your data to multiple cloud solutions.

Use Case #5 – Inter-Cloud Integration

Similarly, if you’re working with multiple cloud vendors, you will need to connect these solutions with each other. If you have an operational database in one cloud and you have an analytics solution in Inter-cloud integrationanother cloud, you need to move the data from this cloud solution to the other one in real time, so you can have operational reporting or operational analytics solutions in this cloud.

Streaming cloud integration gives you the agility to move your data wherever you want. The cloud can easily become part of your data center, a seamless part of your data infrastructure, when you move your data to that environment.

You can use streaming cloud integration to ease your migration to the cloud and adoption of cloud solutions by minimizing your risk and business disruption. You can also maintain your hybrid cloud architecture and multi-cloud architecture with a continuous data flow from your existing data sources.

To learn more about streaming data integration for your cloud solutions, please visit our Hybrid Cloud Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

 

Evaluating Streaming Data Integration Platforms

Evaluating Streaming Data Integration Platforms: Whiteboard Wednesdays

 

 

In today’s Whiteboard Wednesday video, Steve Wilkes, founder and CTO of Striim, looks at what you need to consider when evaluating streaming data integration platforms. Read on, or watch the 15-minute video:

We’ve already gone through what the components of a streaming integration platform are. Today we’re going to talk about how you go about evaluating streaming data integration platforms based on these components.

Just to reiterate, you need the platform to be able to:

  • Do real-time continuous data collection
  • Move that data continuously from where it’s collected to where it’s going
  • Support delivery to all the different targets that you care about
  • Process the data as it’s moving, so stream processing
  • This all needs to be enterprise grade so that it is scalable and reliable, and all those other things that you care about for mission-critical data
  • Get insights and alerts on that data movement

Let’s think about the things that you need to consider in order to actually achieve this when you’re evaluating such platforms.

Data Collection & Delivery

For data collection and delivery, you care about quite a few different things. Firstly, it needs to be low latency. If it’s a streaming data integration platform, then just doing bulk loads or micro batch may not be sufficient. You want to be able to collect the data the instant it’s created, within milliseconds typically. You need low-latency data collection.

It needs to be able to support all the sources that you care about. If you’re looking for a streaming integration platform, then you’re thinking of more than just one use case. You’re thinking “what platform is going to support all of the streaming data integration needs within my organization?” Supporting just one data source or a couple of data sources isn’t enough.

You need to be able to support all the sources that you care about now and may care about in the future. That could be databases, files, or messaging systems. It could even be IoT. So think about that when you’re evaluating whether the platform has all the sources that you need. Think about how it can deal with those sources in a number of different ways.

For databases, you may need to be able to do bulk loads into a streaming infrastructure, as well as doing Change Data Capture. This is important for collecting real-time change as it’s happening in a database – the inserts, updates, and deletes. For files, you may need to read bulk files that exist already, but also files as they’re created, streaming out the data as it’s being written. Supporting both bulk and change data is equally important.
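As a rough illustration of that distinction, a bulk (initial load) source and a CDC source might be declared like this in Striim-style TQL. The adapter and property names here are assumptions for illustration, not copied from the product documentation:

```sql
-- Illustrative Striim-style TQL; adapter and property names are assumptions, not product docs.

-- Bulk load of the rows that already exist, via JDBC.
CREATE SOURCE OrdersInitialLoad USING DatabaseReader (
  ConnectionURL: 'jdbc:mysql://dbhost:3306/sales',   -- hypothetical connection details
  Username: 'striim',
  Password: '******',
  Tables: 'sales.orders'
)
OUTPUT TO OrdersInitialLoadStream;

-- Continuous change data capture of new inserts, updates, and deletes from the same table.
CREATE SOURCE OrdersCDC USING MysqlReader (
  ConnectionURL: 'jdbc:mysql://dbhost:3306/sales',
  Username: 'striim',
  Password: '******',
  Tables: 'sales.orders'
)
OUTPUT TO OrdersChangeStream;
```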

You also need to consider whether the adapters are actually part of the platform or whether they are third party. If they are part of the platform and the platform is built well, then they will be able to handle all the different requirements of the platform – scalability, reliability, and recoverability. All of those things are integrated end to end because the adapters are part of the platform.

If they’re third party, then that may not be the case. If you have to plug in third party components into your infrastructure, then you can have areas of brittleness where things may not work properly or problematic interfaces when things change. Try to avoid third party adapters wherever you can.

Data collection and data delivery need to be able to support the end-to-end recovery and reliability that is part of being enterprise grade. That means that from a database perspective, for example, you may need to be able to support maintaining a database transaction context from one end to the other. You need to be able to pick up from where you left off and make sure that data that is collected is delivered to all of the appropriate targets. Those targets can be varied and widely distributed.

You might be delivering some data on-premise and some data to the cloud, but you still need to be able to make sure that all the data has made it there. You need to be able to validate that the data is being written correctly to all the different targets that the platform is supporting.

If it’s part of a platform and they’re not third party, you would expect that to be there. If they are third party, then you have to investigate whether all of those things are supported. Data collection and data delivery is the first part of how you evaluate the platform.

Data Movement

The next part is how the platform does data movement. This is crucial to maintaining the kind of high throughput and low latency that you’d expect. Data movement covers a number of different things. It happens between processing steps: between your source collection and your data delivery, or between source collection, some in-memory processing or enrichment, and data delivery.

It could also be an even more complex pipeline with multiple steps in it. You’re moving data between each step.

It’s also between nodes. If you have a clustered platform and that platform is moving data between nodes for different processing steps, or maybe between source and target because the target is closer to one of the nodes than other nodes. You need to be able to ensure that the data movement happens efficiently, with high throughput and low latency, between nodes.

You also need to be able to support collecting data on-premise and delivering it into cloud environments, or collecting it from cloud environments and delivering it to on-premise, or moving between clouds. Supporting all these different topologies is all part of data movement.

Ideally as much of the data movement as possible should be in memory only. Try to avoid having to write to disk or do any kind of IO in between processing steps. The reason for this is that each processing step needs to perform optimally in order to get high throughput.

If you are persisting data, that can add latency. Ideally when you’re doing multiple processing steps in a pipeline, you’re doing all of that data movement in memory only, between the steps or just between nodes. You’re not persisting to disk.

You should only use persistent data movement or persistent data streams where needed. There are a couple of really good use cases for this. One is if you have data sources that you can’t rewind into for recoverability, you may want to use a persistent data stream as the first step in the process, but everything downstream can be in memory only.

If you’re collecting data in real time, but you have multiple applications all running at their own speeds against that data, you may want to think about having persistent data streams between different steps. Typically, you want to minimize the amount of persistent data streams that you have and use in-memory only data streams wherever possible. That will really aid in reducing your latency and increasing your throughput.

Stream Processing

The next thing that you need to be able to do is stream processing. Stream processing obviously has to be able to support all of the different types of processing that you want to do. For example, it needs to be able to support complex transformations. If it doesn’t support the transformations that you want, you should be able to add in your own components or your own user defined functions to do the transformations.

It needs to be able to combine and enrich data. This requires a lot of different constructs for stream processing. When you are combining data together from multiple data streams, they run at high speed and typically events aren’t going to happen at the same time.

You need a flexible windowing structure that can maintain a set of events from the different data streams to combine, so that you can produce a combined output stream that pairs the current event from one stream with the latest data from every other stream.

When you’re enriching data, you need to be able to join streaming data with reference data. You can’t go back to a database or go back to the original source of the reference data for every event on a data stream. It’s just too slow. You need to be able to load, cache, and remember the data you are using for enrichment in memory so you can join it really efficiently, in order to keep and maintain the throughput that you’re looking for from the overall system.

You want the stream processing to be optimized. It should really run as fast as if you’d written it yourself manually. It also needs to be easy to use. We recommend that you look for SQL-based stream processing because SQL is the language of data. There are very few people that work with data that don’t understand SQL. It allows you to do filtering, transformation, and data enrichment through natural SQL constructs.

Obviously if you want to do more complex things, you should also be allowed to import your own transformations and work with those. With SQL-based transformations, anyone who knows data can build and understand what the transformations are. You also want building pipelines to be as accessible as possible to all the people who want to work with the data.

You need a good UI for building the data pipelines, with as much of the process as possible automated through wizards and other UI-based assistance. You need to be able to build multi-step stream processing, not just a single source into a single target, or a single source into a single piece of processing into a single target, but potentially with fan-in and fan-out: multiple data sources coming in, flowing through multiple processing steps in stages, and out to potentially multiple targets at the other end.

This all needs to be coordinated, well-maintained, and deployable across a cluster in order to be scalable. Your stream processing should be very rich, very capable, and also very high throughput.

Enterprise Grade

You also need to think about the enterprise-grade qualities of the platform. I’ve mentioned before, for it to be enterprise grade it needs to be scalable. You need to be able to handle increasing the throughput, increasing the number of sources, increasing the number of targets, and increasing the volume of data being generated from each one of those.

When you’re evaluating platforms and evaluating for a production scenario, you should test the platform with a reasonable throughput that corresponds to what you’re expecting in order to see how it behaves and how it scales, and measure the throughput and the latency from end to end as you’re evaluating the platform.

You also need it to be reliable. You need to be able to ensure that you have guaranteed delivery from source all the way to target. Even if something fails, if a network fails, if the source or the target goes down, if any of the processing nodes in the cluster go down or the whole cluster goes down, you need to be able to ensure that it picks up from where it left off and doesn’t miss any messages.

It has to be able to recover from failures as well. You need guaranteed delivery in the normal, always-running case so you don’t miss any messages just because they disappeared into the ether somewhere. But if you have a failure, you should also recover without losing any messages or any events that travel from the source to the target.

Of course, security is also paramount. You need to secure the data while it’s in transit, so it’s encrypted as it goes across the network. But you also need to secure who has access to the data, who can work with individual data streams, who can see the data on individual data streams, who can build applications, and who can view the results of those applications.

You need security that works across the whole end-to-end flow and covers every single component, so that you can lock them down and make sure that only the people who need to work with the data can do so.

Insights & Alerts

Finally, you need to make sure that the platform gives you visibility into your data: that you can monitor the data flows and see what’s going on in real time, and that you get alerts when anything happens. This could be when CPU or memory usage on any of the nodes goes above certain thresholds. It could be when applications or data flows crash. It could be when volume goes above or below what you expect, ideally in a granular fashion – for example, when the volume for an individual database table goes above or below what you expect.

You need to be able to work with insights into the data flows that help you operationalize this and make sure that it’s working full time, 24/7, when you actually put it into production. You may even want to get insights on the data itself, drill down into the actual data that’s flowing, and do some analytics on that. If your streaming integration platform can also give you those valuable insights on the streaming data, then that’s the icing on the cake.

Just to summarize, when you’re evaluating streaming data integration platforms, you need to make sure that the platform can do everything that you need, to get your data from where it’s generated to where it needs to be, in order to get real value out of your data.

 

To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

Use Cases for Streaming Data Integration

The Top 4 Use Cases for Streaming Data Integration: Whiteboard Wednesdays

 

 

Today we are talking about the top four use cases for streaming data integration. If you’re not familiar with streaming data integration, please check out our channel for a deeper dive into the technology. In this 7-minute video, let’s focus on the use cases.

 

Use Case #1 Cloud Adoption – Online Database Migration

The first one is cloud adoption – specifically online database migration. When you have your legacy database and you want to move it to the cloud and modernize your data infrastructure, if it’s a critical database, you don’t want to experience downtime. The streaming data integration solution helps with that. When you’re doing an initial load from the legacy system to the cloud, the Change Data Capture (CDC) feature captures all the new transactions happening in the legacy database as they happen. Once the cloud database is loaded and ready, all the changes that happened in the legacy database can be applied in the cloud. During the migration, your legacy system is open for transactions – you don’t have to pause it.

While the migration is happening, CDC helps you to keep these two databases continuously in-sync by moving the real-time data between the systems. Because the system is open to transactions, there is no business interruption. And if this technology is designed for both validating the delivery and checkpointing the systems, you will also not experience any data loss.

Because this cloud database has production data, is open to transactions, and is continuously updated, you can take your time to test it before you move your users. So you have basically unlimited testing time, which helps you minimize your risks during such a major transition. Once the system is completely in-sync and you have checked it and tested it, you can point your applications at it and run on your cloud database.

This is a single switch-over scenario. But streaming data integration gives you the ability to move the data bi-directionally. You can have both systems open to transactions. Once you test this, you can run some of your users in the cloud and some of your users in the legacy database.

All the changes happening with these users can be moved between databases, synchronized so that they’re constantly in-sync. You can gradually move your users to the cloud database to further minimize your risk. Phased migration is a very popular use case, especially for mission-critical systems that cannot tolerate risk and downtime.

Use Case #2 Hybrid Cloud Architecture

Once you’re in the cloud and you have a hybrid cloud architecture, you need to maintain it. You need to connect it with the rest of your enterprise. It needs to be a natural extension of your data center. Continuous real-time data moment with streaming data integration allows you to have your cloud databases and services as part of your data center.

The important thing is that these workloads in the cloud can be operational workloads because there’s fresh information (i.e., continuously updated information) available. Data from your databases, machine data, log files, other cloud sources, messaging systems, and sensors can move continuously to enable operational workloads.

What do we see in hybrid cloud architectures? Heavy use of cloud analytics solutions. If you want operational reporting or operational intelligence, you want comprehensive data delivered continuously so that you can trust it’s up-to-date and gain operational intelligence from your analytics solutions.

You can also connect your data sources with the messaging systems in the cloud to support event distribution for your new apps that you’re running in the cloud so that they are completely part of your data center. If you’re adopting multi-cloud solutions, you can again connect your new cloud systems with existing cloud systems, or send data to multiple cloud destinations.

Use Case #3 Real-Time Modern Applications

A third use case is real-time modern applications. Cloud is a big trend right now, but not everything is necessarily in the cloud. You can have modern applications on-premises. So, if you’re building any real-time app or modern new system that needs timely information, you need continuous real-time data pipelines. Streaming data integration enables you to run real-time apps with real-time data.

Use Case #4 Hot Cache

Last, but not least, when you have an in-memory data grid to help with your data retrieval performance, you need to make sure it is continuously up-to-date so that you can rely on that data – it’s something that users can depend on. If the source system is updated, but your cache is not updated, it can create business problems. By continuously moving real-time data using CDC technology, streaming data integration helps you to keep your data grid up-to-date. It can serve as your hot cache to support your business with fresh data.

 

To learn more about streaming data integration use cases, please visit our Solutions section, schedule a demo with a Striim expert, or download the Striim platform to get started.

 

Build vs. Buy for Streaming Data Integration

Build vs. Buy for Streaming Data Integration: Whiteboard Wednesdays

In this Whiteboard Wednesday video, Steve Wilkes, founder and CTO of Striim, asks the question, “Is it better to build or buy a Streaming Data Integration platform?”  Read on, or watch the 10-minute video:

You want to use streaming data integration to move data from existing sources, that generate lots of data, to targets. But is it better to build a streaming data integration platform from lots of readily available open source components, or buy one and have that work done for me?

Build

First, you need all the components that we’ve mentioned in previous videos that go into making a streaming data integration platform. You need to build data collectors and manage data delivery – which is not just one single thing. You need many adapters that support databases, files, maybe IoT, and cloud targets – essentially a whole bunch of different technologies to get the data from where it is now to where you want it to be.

You need to think about data movement. What messaging system do I use? Do I use multiple ways of doing data movement? How am I going to do stream processing? What is the engine for that? How do I define it? Is there a UI in open source? Not everything has a UI. How do I define the stream processing? Does it support SQL, which, as we mentioned in a previous video, is the best approach for working with streaming data? How do I enrich data? Does the platform include an in-memory data grid or other mechanism by which I can enrich data at high speed? How do I ensure that all of this is enterprise grade? Is it scalable, reliable, and secure? How do I get insights and alerts? How do I view the data flows and understand what’s going on if I’m piecing this together myself? This is essential for a lot of operational reasons.

Build vs. Buy for Streaming Data Integration

So, assume that you are going to try and build this from lots of different open source pieces. For every piece that goes into making up this platform, including all the framework pieces and the adapters that you will need on either end, you need to go through a process. This process involves designing the overall platform – how all the pieces fit together. Then for each piece you need to look at what is available and evaluate each option, not just on its own, but on how it fits in with the other pieces you are looking at. Once you have done that for each piece, you need to try and integrate all the pieces together. That integration involves building a lot of glue code on top of the pieces you have chosen, to abstract them so that they are easier for people to work with and to ensure that the result is enterprise grade: it scales together, it is reliable, and it is secure. Assuming that you have built all of this, you can then start to build your applications.

Maintenance

Now when it comes to maintenance, open source isn’t always maintained forever. People can stop supporting it. In that case you have to identify another piece of open source that can take its place and will fit in. Then you have to integrate again. Even if it’s just upgraded, that could mean the APIs have changed or the way that it functions has changed. You’re going to have to perform the integration and test everything again to make sure that everything is working. Once that is complete you are going to have to test your applications on the platform that you’ve just built again and make sure that’s working too.

Support

If you have issues, you need to go to support. Some open source platforms offer support, some don’t. There may be pieces that you’re supporting yourself, and because you’ve chosen to put multiple things together – in the case of Change Data Capture (CDC), for example – there may be multiple vendors providing support. That becomes a headache because different organizations tend to blame each other. You need to have support for all the pieces. If you need to upgrade, then you need to reintegrate again. Maintenance can become a big issue.

Buy

Now the difference between doing all of that and using a prebuilt streaming data integration (SDI) solution is that you’re not doing any of these pieces in the middle. In going the buy route, you replace all those pieces of open source that you would have put together yourself with the SDI solution straight out of the box. You simply download the solution and start building your data flows. That is going to massively reduce the amount of time for you to start building your data integrations.

Build vs. Buy - Streaming Data Integration

The Differences Between Build vs. Buy

When it comes to summarizing all of this, what are the differences between the costs and risks of build vs. buy?

Development Costs

We went through all the steps of building a solution. Each one of those steps is going to involve engineers. They’re going to involve people that understand different technologies, and there is going to have to be a team around that to manage building the platform. It can be a pretty high development cost, and often that development cost can outweigh the cost of licensing a platform that you purchase.

Time to Value

There’s a long time to market because you have to go through all those steps for each component. If components are upgraded or deprecated, you have to repeat all of those steps. So the time it takes from you saying, “I want to move data from my on-premise databases into my cloud data warehouse” or “I want to move data from IOT on premise into storage so I can do machine learning”. The time it takes for you to go from thinking “I want to do that to being able to do that” is as long as it takes you to build your framework and infrastructure, build your own SDI platform.

Whereas if you’re buying it, the time it takes is how long does it take to download it, which is much faster. The time to a solution is massively reduced because you’re not going through all the steps it takes to build it and during that time not focusing on building business value.

A lot of organizations out there are not software development companies; they are companies with a different purpose – a bank, a finance organization, or a healthcare organization. Value to the organization is massively increased if you’re not spending huge amounts of time building something that someone else has already built. Focusing on business value enables you to build out solutions and bring value to your business much more quickly by utilizing a prebuilt solution rather than building it yourself.

Build vs. Buy - Streaming Data Integration

Support

I’ve mentioned that open source can become obsolescent. There are cases where people invested in open source technologies which people just stopped developing because it was no longer the hot new thing. They now had to try and maintain it themselves and understand someone else’s code in order to maintain the open source that they’d heavily invested in.

If you are buying a streaming data integration solution from a vendor, then that vendor is going to support you. Building it yourself, on the other hand, requires a wide range of specialized skills. Each one of the pieces of open source requires a different skill set to understand it and work with it. It may be a different programming language; it may be a different environment that you need to set up. There’s also a meta skill required in understanding how all those components fit together. This is an architectural role that has to understand all the different APIs and components and work out how to add all the enterprise-grade features so that they all scale, are reliable, and work together. If you’re buying, the costs really are the license cost and limited visibility into the code – into the engine that makes it work.

Depending on who you are, that may or may not be a big issue. As long as what you’ve built and the concepts around it are transportable, your investment in streaming data integration is also transportable.

What are the differences between build vs. buy? It really comes down to this: do you want to get going with building your solution and providing business value straight away? Or do you want to focus on being a software development organization, building a platform on which you can then build the applications that will provide business value?

To learn more about build vs. buy for streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

Whiteboard Wednesdays - Streaming Data Integration

Streaming Data Integration: Whiteboard Wednesdays

 

 

In this Whiteboard Wednesday video, Steve Wilkes, founder and CTO of Striim, takes a look at Streaming Data Integration – what it is, what it is used for, and most importantly, what is needed to set it up and manage it. Read on, or watch the 10-minute video:

Like all enterprises, you have a lot of data sources that generate data – databases, machine logs, or other files that are being produced in your systems. You may also have messaging systems, sensors, or IoT devices generating data. This data, generated all of the time, may not be in the format you need or where you need it. This limits your ability to get value from it or make decisions using it.

In terms of moving it somewhere else, maybe you want to put it into cloud storage or other types of storage. Maybe you want to put it into different databases or data warehouses. Maybe you want to put it onto messaging systems to build applications for analytics and reporting, so that analysts, data scientists, decision makers, and your customers can get value out of it. Streaming data integration is about being able to collect data in real time from the systems that are generating it, and moving it to where it needs to be to get value out of it.

Data Collection

Continuous data collection means being able to get the data in real time. Each type of data source requires a different way of collecting the data. Databases are quite often associated with records of what has happened in the past. Running queries against a database can provide bulk responses of historical data. In order to capture real-time data – that is, the data as it’s being generated – you need to use a technology called change data capture or CDC.

CDC will allow you to collect the inserts, updates, and deletes that are happening in the database in real time as they’re being generated. Similarly, you need specific technologies to get the data you want from other sources of data such as files or messaging systems. In the case of files, you may need to be able to tail the file, reading from the end of the file as new data is written, or collect multiple files across multiple systems in parallel. Once you’ve got that data, you need to be able to deliver it to the target systems.
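For example, tailing a log file as it is written might be declared along these lines in Striim-style TQL (a rough sketch; the reader, parser, and property names are assumptions, not taken from the product documentation):

```sql
-- Illustrative only; reader, parser, and property names are assumptions.
CREATE SOURCE AppLogTail USING FileReader (
  directory: '/var/log/myapp',   -- hypothetical log directory
  wildcard: 'app*.log',
  positionByEOF: true            -- start at the end of the file and stream new lines as they arrive
)
PARSE USING FreeFormatTextParser ()
OUTPUT TO AppLogStream;
```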

Streaming Data Integration - Sources

Data Delivery

Data delivery depends on the type of target system. If you are using database change data capture on the source side, the delivery process also needs to be change aware, so that an insert, an update, or a delete is managed correctly in the target system. For example, if you are synchronizing an on-premise database with a cloud database, then you need to initially load the data, but you also need to apply changes to it in order to keep it synchronized. The delivery process needs to understand that, instead of just inserting all the events into the database, it treats an insert as an insert, an update as an update, and a delete as a delete. This needs to be applied in a heterogeneous fashion.
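To make this concrete, consider a single CDC event and how different targets should handle it (plain SQL; the table and column names are made up for illustration):

```sql
-- A CDC event: an UPDATE on customers, key customer_id = 42, with a new email value.

-- A change-aware database target applies the same operation type to the corresponding row:
UPDATE customers
SET    email = 'jane.new@example.com'
WHERE  customer_id = 42;

-- A naive, insert-only delivery would instead append the event as a new row, producing
-- duplicates and leaving the old value in place:
INSERT INTO customers (customer_id, email) VALUES (42, 'jane.new@example.com');

-- Append-style targets, such as a Kafka topic or a file, legitimately receive the change as a
-- record, with the operation type (INSERT/UPDATE/DELETE) carried in metadata rather than applied.
```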

You also need to be able to work with lots of different APIs. Again, just as you need to be able to work with databases, with storage, and with messaging, you also need to be able to support different cloud systems. The security mechanisms used by the different cloud vendors are all different, so you need to be able to manage that.

Data Movement

Streaming data integration needs to continuously move the data that’s being collected into the appropriate target. You may need to collect data from multiple sources, combine it, and push it out to a single target, or take data from a single source and deliver it continuously to multiple targets.

Data movement has to be smart. It has to be able to move data between processes, between nodes, in a clustered environment. It has to be able to move data from on-premise to cloud, or from cloud to on-premise, or across different networks and even across different clouds. Also, this data movement has to be at really high speeds and low latency. You need to be able to deliver the data from one place to another really quickly.

In a lot of use cases, you’re not just dealing with the raw data. If you’re doing database replication, then you can collect change on one side and apply that change exactly as is to the target. But in a lot of cases you want to manipulate the data in some way. The format, structure, or content of the data you collect may not be correct or sufficient for where you want to put it, so you may need to do some kind of stream processing.

Data Processing

Data processing is about being able to take the data that’s on the data stream and manipulate it. You may want to filter it. Maybe not everything you collect needs to be written to every target you deliver to. Maybe only a subset is written to storage; maybe only a subset goes into messaging. You need to be able to filter that data and choose what goes where.

You also need to be able to transform it, to give it the structure that you want, and to apply functions to it in order to make it look like you need.

You also need to worry about security. Maybe you need to mask some of the data, obfuscate it in some way. If you’re doing cloud analytics and you have Personally Identifiable Information (PII) coming from your sources, then it’s probably a good idea to anonymize that data before you put it into the cloud for analytics. Being able to do that as the data streams to a target is really important.

Maybe you need to be able to encrypt the data or combine data together. You can take data from multiple sources and combine it into a single data stream that you then apply to a target. Maybe you need to aggregate or summarize the data before you write it into a data warehouse. Maybe it doesn’t contain enough information by itself. Maybe you need to denormalize it or enrich it before it is delivered somewhere else. A streaming integration platform needs to support all of these different types of processing on the data in order to get it into the form that you want before you deliver it.
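A minimal sketch of this kind of in-flight processing in a SQL-based stream query might look like the following. The stream, column, and function names are illustrative assumptions, not a specific product’s built-ins:

```sql
-- Illustrative continuous query; stream, column, and function names are assumptions.
CREATE CQ ProcessPayments
INSERT INTO ProcessedPayments
SELECT  p.paymentId,
        p.amount,
        maskCardNumber(p.cardNumber) AS maskedCard,   -- hypothetical masking function for PII
        upper(p.currency)            AS currency      -- simple in-flight transformation
FROM    PaymentStream p
WHERE   p.amount > 0;                                 -- filter out events no target needs
```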

Data Insights

You also need to be able to get insights on the data, to see how the data’s flowing, whether there are any issues in the data flow, maybe even investigate the data itself if there are any issues. This can be done ad hoc, but from an operational perspective, it’s also important that you get alerts if anything is failing or if there is an unusual volume of data coming from one particular place. There suddenly could be an unusual volume coming from your audit table, or maybe there isn’t sufficient volume and it’s fallen behind what you’d expect for that time of day. Being able to get alerts on the data flows themselves can be really, really important.

Enterprise-Grade Integration

You can play about with streaming data integration components in the lab. But if you’re going to put things into production, streaming data integration needs to be enterprise-grade. Everything needs to be able to scale. Typically, you’re talking about large volumes of data being generated continuously for streaming integration. You need to be able to deal with scalability and adding additional scale over time.

You are probably looking at a clustered, distributed environment where you can add new nodes as you need to in order to handle scale. Of course, for mission-critical platforms and mission-critical data flows, it needs to be reliable.

You need to be able to say: if data is generated on the source side, then I have to be able to guarantee that it makes it into my targets. If I am replicating a database that’s on-premise into a data warehouse that is in the cloud, I need to make sure that every single operation makes it from one end to the other, and that I can investigate and validate that this has actually happened.

For streaming data integration to be enterprise grade and work with mission-critical applications, you need to be able to ensure that it is reliable. At a minimum, there should be at-least-once processing, so that everything generated on the source makes it to the target. In certain cases, you need “exactly-once processing.” This means that not only does the data make it to the target, but you are guaranteed that even in cases of failure, it only ever makes it to the target once.

Security is important. We’ve talked about data security in terms of being able to mask and encrypt data as it flows. As things are moving across networks, all of that data should be encrypted so that people can’t tap into the network flow. From a security perspective, you need to be able to lock down who has access to what in the system that is doing all of these things. If you’re generating a data stream on-premise and it has personally identifiable information in it, you need to be able to lock down which employees can access that data stream, even through the streaming platform. Being able to do authentication and authorization within the streaming platform itself is also essential.

In summary, streaming data integration is about continually collecting data from enterprise and cloud sources, being able to process that data while it’s moving, delivering it into any target that you want, whether it’s on-premise or cloud, being able to get insights into how it’s moving, and doing all of this in an enterprise-grade way that is scalable, reliable, and secure.

To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.


Why Choose Striim + Snowflake

 

 

With greater speed, ease, and flexibility, Snowflake’s cloud data warehouse helps you gain meaningful insights by providing the performance and simplicity that traditional data warehouse offerings could not support.

Adopting a data warehouse in the cloud with Snowflake requires a modern approach to the movement of enterprise data. This data is often generated from diverse data sources deployed in various locations – including on-prem data centers, major public clouds, and devices. Snowflake users need real-time data movement capabilities to realize the full potential of data warehousing in the cloud, and benefit from more meaningful operational intelligence.

While there are many vendors that provide data movement to Snowflake, Striim’s real-time, enterprise-grade streaming ETL solution offers advanced capabilities that other vendors can’t match, including the ability to:

  • Bring a wide range of data sets (including security log data, Kafka/messaging, IoT data, OLAP and OLTP) in a consumable format to Snowflake to achieve rich, timely, and reliable insights fast
  • Use robust, reliable, low-impact change data capture (CDC) from major enterprise databases
  • Aggregate, filter, denormalize, enrich, and mask real-time data in-flight using a SQL-based language before delivering to Snowflake to rapidly gain time-sensitive insights
  • Combine and correlate machine data, OLAP, OLTP and IoT data with other data sources in-flight for complete and rich operational intelligence
  • Perform online migration from existing on-prem data warehouses such as Oracle Exadata or Teradata to Snowflake with minimal interruption and risk
  • Offer an enterprise-grade solution designed for mission-critical, high data volume environments with built-in HA, scalability, exactly-once-processing (no data loss or duplicates) and security, all in a patented distributed platform

In fact, many current Striim + Snowflake customers previously deployed a solution from another vendor, only to find that the solution did not meet their needs for scalability, reliability, in-flight processing, or, simply, data access.

Let’s drill down on the ways Striim supports Snowflake’s advanced analytics applications with modern, enterprise-grade streaming ETL, and further allows customers to gain operational value from their Snowflake solutions.

Moving data in real time from diverse sources

Data ingestion providers that only collect data from a limited range of sources cannot support complete and rich operational intelligence in Snowflake. The Striim platform ingests real-time streaming data from a variety of sources out-of-the-box, including data warehouses, databases (including Oracle, SQL Server, HPE NonStop, MongoDB, Amazon RDS, and MySQL), log files from security devices and other systems, sensors, messaging systems, and Hadoop solutions.

Avoiding batch ETL-related inefficiencies

Data integration providers that use traditional batch ETL for the movement of data into Snowflake are unable to support real-time operational intelligence and time-sensitive analytics use cases in Snowflake. When users adopt an ELT architecture by using replication solutions with another tool for in-target transformations, that architecture creates complexities, especially during process recovery. Striim offers an end-to-end solution with a simplified architecture to bring in a wide range of data in real time.

Minimizing source impact and interruption

While there are providers that offer very simplistic movement of file data and do not support CDC from databases, these solutions cannot scale and may require additional products to be integrated into the configuration. Using Striim, businesses can easily adopt a cloud data warehouse, performing online migration from existing data warehouses with minimal disruption and risk.

Assuring security and reliability

Striim is an enterprise-grade solution with built-in HA, security, scalability, and exactly-once processing (no data loss or duplication) for business-critical production systems.

Applying transformations and enrichments

Striim applies filtering, transformations, aggregations, masking, and enrichment using static or streaming reference data in real time – as the data is being delivered into Snowflake in a consumable format – to accelerate the delivery of rich, timely, and reliable insights.
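As a rough sketch, an in-flight enrichment step followed by delivery to Snowflake might look like this in a SQL-based streaming language, modeled loosely on Striim’s TQL (the writer name and properties are assumptions, not taken from Striim’s documentation):

```sql
-- Loosely modeled on Striim-style TQL; adapter and property names are assumptions.
CREATE CQ PrepareForSnowflake
INSERT INTO SnowflakeReadyStream
SELECT  t.txnId,
        t.amount,
        t.txnTime,
        c.customerName,          -- enrichment from an in-memory reference cache
        c.customerSegment
FROM    TxnStream t
JOIN    CustomerCache c ON t.customerId = c.customerId;

CREATE TARGET SnowflakeTarget USING SnowflakeWriter (
  connectionUrl: 'jdbc:snowflake://myaccount.snowflakecomputing.com',   -- hypothetical account
  username: 'striim',
  password: '******',
  tables: 'ANALYTICS.PUBLIC.TRANSACTIONS'
)
INPUT FROM SnowflakeReadyStream;
```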

Striim + Snowflake Customer Use Cases

To illustrate the above points, we would like to share a couple of Snowflake customers that chose Striim to gain operational intelligence for critical operations.

A Leading International Cruise Line

  • Ingests real-time POS data from spa, casinos, and stores to enterprise data warehouse on Snowflake to generate near real-time offers for travelers.
  • Striim moves spending data from Oracle, SQL Server databases and GoldenGate Trail Files to Snowflake.
  • The solution provides real-time promotional offers and discounts to customers based on their spending behavior to improve both satisfaction and revenue.

A European HR Solutions Provider

  • Ingests data from Oracle and SQL Server databases to Snowflake using real-time CDC.
  • Chose Striim for low data latency with built-in scalability, security, and reliability.
  • A zero-maintenance solution and pay-per-use model were also key considerations.

By streaming enterprise data to Snowflake with built-in scalability, security, and reliability, Striim simplifies the adoption of a modern, cloud data warehouse for time-sensitive, operational decision making.

We invite you to learn more about Striim’s Snowflake offering by visiting our Snowflake solutions page. Feel free to contact us if we can help you reliably move data in real time to Snowflake.

Bi-Directional Replication with Striim

What’s New in Striim v. 3.9.7

In the last few months, the Striim Team launched two new releases: 3.9.6, and more recently, 3.9.7.  Before I go into the details…

Did you know that Striim…? 

    • enables bi-directional data flows for phased, live cloud migrations as part of data modernization.

    • supports Azure Synapse Analytics.

    • supports the Cosmos DB Cassandra API to provide more options to connect Cosmos DB with other systems.

    • provides an Incremental Batch Reader for Google Cloud Spanner to integrate its data with other Google Cloud services in a timely manner.

    • enables commit timestamp propagation for Google Cloud Spanner to enable change-based streaming data flows from the Cloud Spanner database.

    • offers real-time data ingestion from a MariaDB Galera cluster to support customers’ high-availability solutions.

    • enhanced its JMS Reader for broker-side acknowledgments for additional event processing guarantees.

    • enhanced its FileReader to automatically generate output streams based on headers, improving ease of use.

With this overview, we can dig into the key new features that these releases introduced. At a high level, we can summarize the goals of these releases as:

  • Accelerating data migration to the cloud and enabling true hybrid cloud architecture with robust bidirectional replication, now including loop detection
  • Enhancing cloud adoption solutions with added support for additional real-time data ingestion and delivery capabilities
  • High-availability, reliability, and performance-related enhancements for end-to-end streaming data integration solutions

Let’s start with the bi-directional replication offering with a new loop detection feature that streamlines phased cloud migration solutions.

Bi-Directional Replication to Enable Hybrid Cloud Architectures and Minimize Risks with a Phased Cloud Migration

Data migration is one of the key challenges in transforming critical business systems to run on the cloud. Striim helps leading enterprises transform their business operations by simplifying cloud adoption and maintaining a connected hybrid cloud architecture. By continuously moving data from legacy systems to cloud databases in real-time, Striim eliminates database downtime and minimizes business disruption for data migration. With Striim’s bidirectional replication capabilities that include the new loop detection feature, you can now seamlessly move your applications to the cloud environment in a phased manner and further minimize risks. 

With bi-directional replication, the changes occurring in the cloud databases are applied to the legacy database, while the changes happening in the legacy database continue to be delivered to the cloud database to keep them in sync. The loop detection feature recognizes which changes occur due to the replication function versus the application transactions writing to the source database. This bi-directional configuration allows a phased migration, where a portion of your application users use the legacy system and the rest use the new cloud system. You can decide the speed at which users transition to the new environment. If the new system is not ready for production use, you also have the fallback option to move all users back to the legacy database. Striim version 3.9.7 offers bi-directional replication for Oracle, SQL Server, MySQL, PostgreSQL, and MariaDB databases.

Using the bi-directional replication configuration across heterogeneous databases, you can minimize your cloud migration risks with:

  • Zero database downtime, 
  • Zero data loss via continuous data validation,
  • The ability to test the new environment without time restrictions,
  • The fallback option to the old environment after the switchover, if the new environment has any issues.

When such critical migration hurdles are removed, you can rapidly tap into the power of a modern cloud environment and transform your business operations to the fullest degree possible.

Bi-directional replication is not limited to data migration use cases, and supports post-migration hybrid cloud architecture as well. Businesses that have applications running on-premises and in the cloud using disparate databases can also use bi-directional replication on an ongoing basis to keep their databases in-sync.

Boosting Cloud Integration Solution with Improved Real-Time Data Ingestion and Delivery 

Striim continues to invest in its cloud integration solution with enhancements to the out-of-the-box adapters that connect with the leading cloud services. The latest releases introduce the following new features:

    • Support for Cassandra API for real-time data delivery to Azure Cosmos DB to provide Cosmos DB users more options for connecting Cosmos DB with other services and data environments in real-time.
    • Support for real-time data delivery to Azure Synapse Analytics.

Striim for Azure Synapse Analytics

    • Certification for Azure SQL Database Managed Instance to enable SQL Server and Azure SQL Database customers to use Striim when moving their critical workloads to Azure.
    • Ability to capture low-latency data from Google Cloud Spanner tables in incremental mode to share the data with other Google Cloud services in a timely manner.

Google Cloud Spanner - Incremental Batch Reader

    • Enhancements to Azure Event Hubs Writer and S3 writer to broaden data security configurations.
    • Further improvements to Google Cloud Pub/Sub Writer’s reliability in supporting high data volumes.

Enhancements for High-availability, Reliability, Security and Performance of the End-to-End Solutions

Enabling and supporting high-availability, reliability, and security of our customers’ end-to-end solutions has been our key differentiator from day one. So, the following new features are part of our efforts to continue to make Striim an enterprise-grade offering for mission-critical environments:

    • Striim now offers the ability to read changed data from, and write to, MariaDB Galera cluster environments in real-time. This means that MariaDB customers that choose to boost their high availability with the Galera cluster solution can seamlessly share real-time data with other on-premises and cloud solutions using Striim.
    • With broker side acknowledgments, Striim enhanced its JMS Reader for additional event processing guarantees. Striim groups messages as transactions at the broker level and tracks their full delivery before removing them from the queue to guarantee every message is processed. 
    • Striim further improves data security by supporting AES-256 encryption for data-in-motion.
    • A powerful Striim feature, capturing real-time data from Oracle GoldenGate Trail Files, has also been enhanced. Striim can ingest real-time data from GoldenGate version 18.1 and supports an extended list of data types, helping GoldenGate customers distribute the data coming into GoldenGate to an extensive list of additional destinations in real-time.
    • Striim’s FileReader improves ease of use by automatically creating the output streams based on the header information.
    • Striim’s database writer has also been enhanced to increase initial load performance.

You can find more detailed information about Striim’s new features in the documentation for 3.9.7 and 3.9.6, respectively. You may also like to download a free trial of Striim’s latest version or reach out to us to see a customized demo for your specific environment.

PostgreSQL to Kafka

Streaming Data Integration Tutorial: Adding a Kafka Target to a Real-Time Data Pipeline

This is the second post in a two-part blog series discussing how to stream database changes into Kafka. You can read part one here. We will discuss adding a Kafka target to the CDC source from the previous post. The application will ingest database changes (inserts, updates, and deletes) from the PostgreSQL source tables and deliver them to Kafka to continuously update a Kafka topic.

What is Kafka?

Apache Kafka is a popular distributed, fault-tolerant, high-performance messaging system.

Why use Striim with Kafka?

The Striim platform enables you to ingest data into Kafka, process it for different consumers, analyze, visualize, and distribute to a broad range of systems on-premises and in the cloud with an intuitive UI and SQL-based language for easy and fast development.

How to add a Kafka Target to a Striim Dataflow

From the Striim Apps page, click on the app that we created in the previous blog post and select Manage Flow.

MyPostgreSQL CDC App
MyPostgreSQL-CDC App

This will open your application in the Flow Designer.

PostgreSQL CDC App Data Flow
MyPostgreSQLCDC app data flow.

To do the writing to Kafka, we need to add a Target component into the dataflow. Click on the data stream, then on the plus (+) button, and select “Connect next Target component” from the menu.

Connecting a target component to the Data Flow
Connecting a target component to the data flow.

Enter the Target Info

The next step is to specify how to write data to the target.  With the New Target ADAPTER drop-down, select Kafka Writer Version 0.11.0, and enter a few connection properties including the target name, topic and broker URL.

Configuring the Kafka Target
Configuring the Kafka target.

Data Formatting 

Different Kafka consumers may have different requirements for the data format. When writing to Kafka in Striim, you can choose the data format with the FORMATTER drop-down and optional configuration properties. Striim supports JSON, Delimited, XML, Avro, and free text formats; in this case we are selecting the JSONFormatter.

Configuring the Kafka target formatter
Configuring the Kafka target FORMATTER.
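Under the covers, the target configured above corresponds to a TQL definition along these lines (a rough sketch; the stream name is a placeholder and the property names may differ slightly from the product):

```sql
-- Rough sketch of the Kafka target in TQL; property names may differ slightly.
CREATE TARGET PostgresCDCToKafka USING KafkaWriter VERSION '0.11.0' (
  brokerAddress: 'localhost:9092',       -- Kafka broker URL
  Topic: 'postgres-cdc-events'           -- hypothetical topic name
)
FORMAT USING JSONFormatter ()            -- format each change event as JSON
INPUT FROM PostgresCDCStream;            -- placeholder name for the CDC output stream
```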

Deploying and Starting the Data Flow

The resulting data flow can now be modified, deployed, and started through the UI. In order to run the application, it first needs to be deployed. Click on the ‘Created’ dropdown and select ‘Deploy App’ to show the Deploy UI.

Deploying CDC app
Deploying the app.

The application can be deployed to all nodes, any one node, or predefined groups in a Striim cluster; the default is the least used node.

Deployment node selection.
Deployment node selection.

After deployment, the application is ready to start by selecting Start App.

Starting the app.
Starting the app.

Testing the Data Flow

You can use the PostgreSQL to Kafka sample integration application to insert, delete, and update the PostgreSQL CDC source table. You should then see data flowing in the UI, indicated by the number of msgs/s. (Note that the message sending happens quickly and the rate soon returns to 0.)

Testing the streaming data flow.
Testing the streaming data flow.
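If you prefer to generate the changes yourself, any DML against the source table will do; for example (the table and column names here are placeholders, not the sample app’s schema):

```sql
-- Placeholder table and columns; adjust to match your CDC source table.
INSERT INTO public.orders (order_id, amount) VALUES (1001, 49.99);
UPDATE public.orders SET amount = 59.99 WHERE order_id = 1001;
DELETE FROM public.orders WHERE order_id = 1001;
```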

If you now click on the data stream in the middle and click on the eye icon, you can preview the data flowing between PostgreSQL and Kafka. Here you can see the data, metadata (these are all updates) and before values (what the data was before the update).

Previewing the data flowing from PostgreSQL to Kafka
Previewing the data flowing from PostgreSQL to Kafka.

There are many other sources and targets that Striim supports for streaming data integration. Please request a demo with one of our lead technologists, tailored to your environment.

Change Data Capture - change log

Streaming Data Integration Tutorial: Using CDC to Stream Database Changes

 

This is the first in a two-part blog post discussing how to use Striim for streaming database changes to Apache Kafka. Striim offers continuous data ingestion from databases and other sources in real time; transformation and enrichment using Streaming SQL; delivery of data to multiple targets in the cloud or on-premise; and visualization of results. In this part, we will use Striim’s low-impact, real-time change data capture (CDC) feature to stream database changes (inserts, updates, and deletes) from an operational database into Striim.

What is Change Data Capture

Databases maintain change logs that record all changes made to the database contents and metadata. These change logs can be used for database recovery in the event of a crash, and also for replication or integration.

Change data capture change log

With Striim’s log-based CDC, new database transactions – including inserts, updates, and deletes – are read from source databases’ change logs and turned into a stream of events without impacting the database workload. Striim offers CDC for Oracle, SQL Server, HPE NonStop, MySQL, PostgreSQL, MongoDB, and MariaDB.
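As a rough illustration (the field names below are simplified examples, not Striim’s exact event schema), an UPDATE read from the change log becomes an event carrying the new values, the prior values, and metadata describing the operation:

```python
# Illustrative only: a simplified view of what a captured UPDATE might carry.
# These field names are examples, not Striim's exact event schema.
change_event = {
    "metadata": {
        "operation": "UPDATE",
        "table": "public.orders",
        "timestamp": "2019-06-01T12:34:56Z",
    },
    "data":   {"order_id": 42, "status": "SHIPPED"},   # values after the change
    "before": {"order_id": 42, "status": "PENDING"},   # values before the change
}
```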

Why use Striim’s CDC?

Businesses use Striim’s CDC capabilities to feed real-time data to their big data lakes, cloud databases, and enterprise messaging systems, such as Kafka, for timely operational decision making. They also migrate from on-premises databases to cloud environments without downtime and keep cloud-based analytics environments up-to-date with on-premises databases using CDC.

How to use Striim’s CDC?

Striim’s easy-to-use CDC template wizards automate the creation of applications that leverage change data capture to stream events, as they are created, from various source systems to various targets. Apps created with templates may be modified using the Flow Designer or by exporting the TQL, editing it, and importing the modified TQL. Striim has templates for many source-target combinations.

In addition, Striim offers pre-built integration applications for bulk loading and CDC from PostgreSQL source databases to target systems including PostgreSQL database, Kafka, and files. You can start these applications in seconds by going to the Applications section of the Striim platform.

Striim Pre-built Sample Integration Applications
Striim pre-built sample integration applications.

In this post, we will show how to use PostgreSQL CDC (the PostgreSQL Reader) with a Striim target, using the wizards to build a custom application rather than the pre-built application mentioned above. The instructions below assume you are using the PostgreSQL instance that comes with the Striim platform. If you are using your own PostgreSQL database instance, please review our instructions on how to set up PostgreSQL for CDC.
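If you do use your own instance, note that log-based CDC on PostgreSQL generally relies on logical decoding, which requires wal_level to be set to logical and replication slots to be available. The snippet below is only a quick sanity check (connection settings are assumptions); Striim’s setup documentation remains the authoritative reference.

```python
# Quick sanity check that a self-managed PostgreSQL instance allows logical
# decoding, which log-based CDC typically requires. Connection settings are
# assumptions; follow Striim's PostgreSQL CDC setup guide for the full steps.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="demo", user="striim", password="striim"
)
with conn.cursor() as cur:
    cur.execute("SHOW wal_level;")
    print("wal_level =", cur.fetchone()[0])              # expect 'logical'
    cur.execute("SHOW max_replication_slots;")
    print("max_replication_slots =", cur.fetchone()[0])  # expect > 0
conn.close()
```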

Using the CDC Template

To start building the CDC application, in the Striim web UI, go to the Apps page and select Add App > Start with Template. Enter PostgreSQL in the search field to narrow down the sources and select “PostgreSQL Reader to Striim”.

CDC application template
Wizard template selection when creating a new app.

Next enter the name and namespace for your application (the namespace is a way of grouping applications together).

Creating a new application using Striim.

Specifying the Data Source Properties

In the SETUP POSTGRESQL READER step, specify the data source and table properties:

  • The connection URL, username, and password.
  • The tables for which you want to read change data.

Configuring the Data Source in the Wizard
Configuring the data source in the wizard.

After you complete this step, your application will open in the Flow Designer.

The Wizard Generates a Data Flow
The wizard generates a data flow.

In the Flow Designer, you can add various processors, enrichers, transformers, and targets, as shown below, to complete your pipeline, in some cases with zero coding.

Flow designer enrichers and processors.

Flow designer event transformers and targets.

 

In the next blog post, we will discuss how to add a Kafka target to this data pipeline. In the meantime, please feel free to request a demo with one of our lead technologists, tailored to your environment.

 

How to Migrate Oracle Database to Google Cloud SQL for PostgreSQL with Streaming Data Integration

 

 

For those who need to migrate an Oracle database to Google Cloud, the ability to move mission-critical data in real time between on-premises and cloud environments, without database downtime or data loss, is paramount. In this video, Alok Pareek, Founder and EVP of Products at Striim, demonstrates how the Striim platform enables Google Cloud users to build streaming data pipelines from their on-premises databases into their Cloud SQL environment with reliability, security, and scalability. The full 8-minute video is available to watch below:

Striim offers an easy-to-use platform that maximizes the value gained from cloud initiatives, including cloud adoption, hybrid cloud data integration, and in-memory stream processing. This demonstration illustrates how Striim feeds real-time data from mission-critical applications, across a variety of on-prem and cloud-based sources, to Google Cloud without interrupting critical business operations.

Oracle database to Google Cloud

Through different interactive views, Striim users can develop Apps to build data pipelines to Google Cloud, create custom Dashboards to visualize their data, and Preview the Source data as it streams to ensure they’re getting the data they need. For this demonstration, Apps is the starting point from which to build the data pipeline.

There are two critical phases in this zero-downtime data migration scenario. The first is the initial load of data from the on-premises Oracle database into the Cloud SQL for PostgreSQL database. The second is the synchronization phase, which uses specialized readers to keep the source and target consistent.

Oracle database to Google Cloud
Striim Flow Designer

The pipeline from the source to the target is built using a flow designer that easily creates and modifies streaming data pipelines. The data can also be transformed while in motion, to be realigned or delivered in a different format. Through the interface, the properties of the Oracle database can also be configured – allowing users extensive flexibility in how the data is moved.

Once the application is started, the data can be previewed, and progress monitored. While in-motion, data can be filtered, transformed, aggregated, enriched, and analyzed before delivery. With up-to-the-second visibility of the data pipeline, users can quickly and easily verify the ingestion, processing, and delivery of their streaming data.

Oracle database to Google Cloud

During the initial load, the source data in the database is continually changing. Striim keeps the Cloud SQL PostgreSQL database up to date with the on-premises Oracle database using change data capture (CDC). By reading the database transactions in the Oracle redo logs, Striim collects the insert, update, and delete operations as soon as the transactions commit, and applies only those changes to the target. This is done without impacting the performance of the source systems and without any outage to the production database.

By generating DML activity with a simulator, the demonstration shows how inserts, updates, and deletes are handled. Running DML operations against the orders table, the preview shows not only the data being captured, but also metadata including the transaction ID, the system change number (SCN), the table name, and the operation type. When you log into the target database and query the orders table, the data is present.

The initial upload of data from the source to the target, followed by change data capture to keep the source and target in sync, allows businesses to move data from on-premises databases into Google Cloud with the peace of mind that there will be no data loss and no interruption of mission-critical applications.

Additional Resources:

To learn more about Striim’s capabilities to support the data integration requirements for a Google hybrid cloud architecture, check out all of Striim’s solutions for Google Cloud Platform.

To read more about real-time data integration, please visit our Real-Time Data Integration solutions page.

To learn more about how Striim can help you migrate Oracle database to Google Cloud, we invite you to schedule a demo with a Striim technologist.

 

Network Traffic Monitoring with Striim 2

Network Traffic Monitoring with Striim – Part 2

In the first post of this two-part series, we demonstrated how to use Striim to collect, integrate, process, analyze, and visualize large volumes of streaming data in real time. In this post, we provide more details about the machine learning pipeline Striim uses to forecast network traffic and detect anomalies.

First, Striim reads the raw data from files, then processes and aggregates it. After aggregation, the data volume is dramatically reduced: in this application, the total network traffic across all IPs is aggregated per minute, so there is only one data point per minute with which to train the machine learning models.

In the machine learning pipeline, the data is first preprocessed and then fit to machine learning models for prediction. Finally, anomalies are flagged when the actual value is far from the predicted one.

  • Data Processing: Before fitting data into sophisticated models, it is usually necessary to transform and preprocess the data. Striim supports a collection of preprocessing methods, including standardization and normalization to make training less sensitive to the scale of features, power transformations (such as square root or log) to stabilize volatile data, and seasonality decomposition (such as STL, which uses Loess) to remove seasonality from time-series data. In this application, we use a log transformation to make the data more stable and improve prediction accuracy.
  • Automated Prediction: Striim first transforms time-series regression into a standard regression problem with lagged variables (i.e., autoregression), and then applies a regression model to it for prediction. Striim offers a wide range of machine learning models, including linear and non-linear regression, SVMs, Gaussian process regression, random forests, Bayesian networks, and deep learning models. In this application, we use a random forest for several reasons:
    • It is efficient enough to retrain continuously over streaming data. Some models, such as deep neural networks, are time-consuming to train and not suitable for online training.
    • It is an ensemble method that combines multiple models to obtain better performance.
    • It achieves good predictive performance compared to other models according to our comprehensive empirical study.
  • Anomaly Detection: Striim detects anomalies based on its predictions. The main idea is that anomalies tend to show a large difference between the actual and predicted values. Striim uses a statistical threshold and flags an anomaly when the percentage error between the actual and predicted values exceeds that threshold. In this application, Striim assumes the percentage errors follow a normal distribution and defines the threshold accordingly; for example, the threshold can be defined as the mean plus two standard deviations of the percentage errors. A sketch combining these three steps follows this list.
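The following is a minimal, batch-style sketch of that pipeline in Python: a log transform, lagged features, random forest prediction, and a mean plus two standard deviations percentage-error threshold. The synthetic traffic series and all parameter values are illustrative assumptions, not Striim’s internal implementation.

```python
# Batch-style sketch of the pipeline above: log transform, lagged (autoregressive)
# features, random forest prediction, and a mean + 2*std percentage-error threshold.
# The synthetic data and parameters are assumptions for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor  # pip install scikit-learn

def make_lagged(series, n_lags=5):
    """Build (X, y) where each row of X holds the previous n_lags values."""
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i - n_lags:i])
        y.append(series[i])
    return np.array(X), np.array(y)

# Synthetic per-minute traffic (MB) standing in for the aggregated stream.
rng = np.random.default_rng(0)
traffic = 50 + 10 * np.sin(np.arange(300) / 10) + rng.normal(0, 2, 300)
traffic[250] += 40                       # injected spike acting as an anomaly

log_traffic = np.log1p(traffic)          # log transform stabilizes the series
X, y = make_lagged(log_traffic)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:-50], y[:-50])              # train on the older points
pred = model.predict(X[-50:])            # predict the most recent points

actual, forecast = np.expm1(y[-50:]), np.expm1(pred)
pct_err = np.abs(actual - forecast) / actual

threshold = pct_err.mean() + 2 * pct_err.std()   # statistical threshold
print("anomalous minutes (relative index):", np.where(pct_err > threshold)[0])
```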

The models update continuously on the streaming data. You can bound the training data size and control the update frequency. In this application, we use the most recent 200 data points as training data and update the ML models each time a new data point arrives.
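As a rough illustration of that bounded-window strategy (the 200-point window comes from the description above; the lag count is assumed), a retraining loop might look like the following:

```python
# Rough sketch of bounded-window retraining: keep the most recent 200 points and
# refit the model whenever a new point arrives. The lag count is an assumption.
from collections import deque

import numpy as np
from sklearn.ensemble import RandomForestRegressor

window = deque(maxlen=200)   # training data bounded to the latest 200 points
model = RandomForestRegressor(n_estimators=50, random_state=0)
N_LAGS = 5

def on_new_point(value):
    """Call once per newly aggregated data point; returns a one-step-ahead forecast."""
    window.append(value)
    if len(window) <= N_LAGS:
        return None                      # not enough history yet
    series = np.array(window)
    X = np.array([series[i - N_LAGS:i] for i in range(N_LAGS, len(series))])
    y = series[N_LAGS:]
    model.fit(X, y)                      # refit on the bounded window
    return model.predict(series[-N_LAGS:].reshape(1, -1))[0]
```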

For more details, please see our 2019 DEBS paper, “A Demonstration of Striim: A Streaming Integration and Intelligence Platform.” Striim won the Best Demo Award at the 13th ACM International Conference on Distributed and Event-Based Systems (DEBS).

 

Network Traffic Monitoring with Striim 1

Network Traffic Monitoring with Striim – Part 1

In this post, we demonstrate how to use Striim to monitor network traffic, showcasing Striim’s ability to collect, integrate, process, analyze, and visualize large volumes of streaming data in real time.

The dataset we use is UNSW-NB15, a popular dataset for network traffic analysis. It contains about 2 million records of simulated tcpdump data covering both normal and malicious traffic. The application pipeline is as follows:

  1. Striim first reads network traffic data from the source file.
  2. Striim then extracts features from the data records, including the IP address, the source and destination port numbers, the transaction timestamp, and the transaction bytes. It also transforms data on the fly, e.g., converting the unit from bytes to megabytes.
  3. Striim aggregates the total bytes of network traffic for each IP address per minute and finds the hot IPs that have the heaviest network traffic (a batch-style sketch of steps 2-3 follows this list).
  4. Striim predicts the future network traffic one step ahead with machine learning models (such as random forest).
  5. Striim detects anomalies when the actual network traffic is far from the predicted value.
  6. Finally, Striim writes the results to the target file.
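A batch-style sketch of the feature-extraction and aggregation steps using pandas is shown below; the file name and column names follow the UNSW-NB15 feature list but are assumptions here, and the Striim application performs the same work continuously over the stream with SQL-based queries.

```python
# Batch-style sketch of steps 2-3: extract a few features, convert bytes to
# megabytes, and aggregate traffic per source IP per minute. The file name and
# column names (srcip, sbytes, stime) are assumptions based on UNSW-NB15.
import pandas as pd

df = pd.read_csv("UNSW-NB15.csv")                    # assumed local copy with headers

df["stime"] = pd.to_datetime(df["stime"], unit="s")  # transaction timestamp
df["mbytes"] = df["sbytes"] / 1e6                    # bytes -> megabytes

per_ip_per_minute = (
    df.groupby(["srcip", pd.Grouper(key="stime", freq="1min")])["mbytes"]
      .sum()
      .reset_index(name="total_mb")
)

# "Hot" IPs: the heaviest talkers over the whole capture.
print(per_ip_per_minute.groupby("srcip")["total_mb"].sum().nlargest(10))
```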

The models are updated periodically to maintain high prediction accuracy over continuous data streams. As shown in the image above, Striim detects some peaks and troughs as anomalies (the red points); the intuition is that a sudden increase or drop in network traffic is abnormal. The monitoring dashboard is updated in real time, so a newly detected anomaly shows up immediately.

Striim supports a number of preprocessing methods and machine learning models. In our pipeline, we apply a log transformation to stabilize the volatile data, fit random forest models, and use a dynamic percentage-error threshold to detect anomalies. This model selection performs well in our empirical evaluation, and it is efficient enough to evolve continuously over the unbounded data stream and support real-time decision making. Some models, such as deep neural networks, are time-consuming to retrain and not suitable for the streaming setting.

For more details, please see our 2019 DEBS paper, “A Demonstration of Striim: A Streaming Integration and Intelligence Platform.” Striim won the Best Demo Award at the 13th ACM International Conference on Distributed and Event-Based Systems (DEBS).