Tutorial: Real-Time Database Integration with Apache Kafka via Change Data Capture

In this series of blog-based tutorials, we are going to step you through the process of building data flows for streaming integration and analytics applications using the Striim platform. This blog series focuses on the continuous collection of streaming data from a database; integration with Apache Kafka; Kafka stream processing, including enrichment, using continual SQL-based queries; delivery of data to multiple targets; Kafka data analytics and visualization of results.

Continuous Data Collection, Processing, Delivery, and Analytics with the Striim Platform

The first step in any streaming integration is sourcing data. You are probably aware that Striim can continuously collect data from many sources. In this tutorial we are going to be using change data capture (CDC) to stream database DML activity (inserts, updates and deletes) from a MySQL database. The same process would be used to read from Oracle, SQL Server, and MariaDB, but MySQL was chosen so you can easily try this at home.

Using the Striim Application Templates

To start building the CDC application, navigate to Apps from either the drop-down menu or the home screen to view all previously created data flow applications. This list may be empty if you are just getting started with Striim.

To create the application, click “Add App” in the top right-hand corner. This will give you three options for creating a new application:

  • Start with Template
  • Start from Scratch
  • Import Existing App

The Different Options for Creating Applications Using Striim

In this case we are going to start with a template. There are a lot of templates for many combinations of sources and targets. You can narrow the search by entering terms into the Search Templates box.

Striim Provides Many Pre-defined Application Templates

Since we are sourcing from MySQL, you can narrow the search using MySQL.

Our first dataflow will write into Kafka, so select the appropriate Kafka version as a target. You are now able to name your application. Within Striim, applications have fully qualified names that enable the same name (for different departments, or users) to exist in different namespaces. A namespace is really just a way to keep related things together.

This application is MySQLCDCToKafka, and we’ll put it in the ‘cdc’ namespace.

Creating a New Application Using Striim

When you click “Save,” the application is created and the wizard will walk you through the steps of the process:

  1. Enter the Data Source info
  2. Test the connection and CDC capabilities
  3. Select the required tables
    (Do optional mapping and filtering of data)
  4. Enter the Target info

You can move forwards and backwards through these steps if you need to make any corrections.

Enter the Data Source Info

The first step is to enter the required information to connect to the data source. This will vary by source, but you will always need to choose a unique (within the namespace) data source name. In the case of MySQL you need to enter a connection URL, username, password and database name.

The First Step in the Wizard is Configuring the Data Source

Note: If you are using Striim in the cloud, and have configured an on-premise agent to collect data to send to the cloud, you will need to click on ‘configure an on-premise source’. Don’t worry about the password; it is automatically encrypted by our server.

To validate the information and move to the next step, click on the “Next” button at the bottom right.

Testing the Connection

The Striim platform not only verifies that the database connection information you entered is correct, but also checks that the database user has the correct privileges, and that CDC is set up correctly. This step varies from data source to data source. If anything is wrong, you will be notified and told how to fix the problem.

For example, when Next was clicked previously, the ‘striim’ user did not have the correct privileges to receive CDC information from the MySQL database, and the results of step two were as follows.

The Second Step in the Wizard is Testing the Connection and Showing How to Fix Issues

As you can see, the explanation includes the steps required to fix the issue – an additional two permissions were needed by the ‘striim’ user to obtain the CDC information. Once ‘Replication Client’ and ‘Replication Slave’ privileges were granted, the UI permits you to move to the next step.
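
For reference, here is a minimal sketch of the grants involved, assuming the Striim database user is named ‘striim’ and is allowed to connect from any host (adjust the host to match your environment):

GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'striim'@'%';
FLUSH PRIVILEGES;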

The Second Step in the Wizard Once All Issues Are Removed

Select the Required Tables

The Striim platform enables you to selectively choose which database schemas and tables to use CDC for. These can be chosen from a list, or through wildcard selection.

The Third Step in the Wizard is Selecting Tables for CDC

In this case we are just going to select the PRODUCT_INV table in the test database schema. Clicking Next takes us to an optional screen where you can do more complex data selection.

You Can Optionally Do Advanced Data Mapping, if Necessary

Enter the Target Info

The final step is to specify how we want to write data to the target. When we selected the wizard we chose to write data to Apache Kafka. To complete this step, we simply need to enter a few connection properties including the target name, topic and broker URL.

The Final Wizard Step is Configuring the Target

The final step, before clicking “Next” one last time, is to choose the format of the output data. The Striim platform supports JSON, Delimited, XML, Avro and free text formats for Kafka topics. In this case we are selecting the JSONFormatter. This has a number of optional properties, but we are just going to go with the defaults.

Using the wizard we quickly built an integration application that collects data continuously from MySQL using change data capture, and writes it in real time to Apache Kafka in JSON format. After clicking “Next” you are shown the data flow, and can now edit, run and test the data flow as necessary.

You can also use Striim to write data to managed Kafka services like Confluent Cloud and AWS managed Kafka. Striim also supports cloud-based message buses such as Google PubSub and Azure EventHub.

To simplify your deployments for cloud-based Kafka services, Striim is available on AWS, Azure, and Google Cloud Marketplace.

Deploying and Starting the Data Flow

The resulting data flow can now be modified, deployed and started through the UI.

The Wizard Generates an End-to-End Data Flow

This is achieved through the state control button directly above the data flow. The initial state of the data flow is ‘created’. This means it is just a definition of what the application is supposed to do. In order to turn this into runtime objects, the application needs to be deployed.

Click on the ‘Created’ dropdown and select ‘Deploy App’ to show the Deploy UI.

Striim data flows can be deployed very flexibly. The default is to deploy the data flow to the least used node in a Striim cluster. However, the application can be deployed to all, or just some of the nodes (in predefined groups) as necessary. You can even split applications into sub-flows and deploy pieces on different parts of the cluster if required. In this case it’s easy: we’ll just deploy the whole application to one node. After deployment, the application is ready to start by selecting Start App.

Testing the Data Flow

Assuming there is activity on the MySQL table you chose during table selection, you should see data flowing in the UI, indicated by a number of msgs/s.

Testing the Streaming Data Flow

If you now click on the data stream in the middle and click on the eye icon, you can preview the data flowing between MySQL and Kafka.

Previewing the Data Flowing from MySQL to Kafka

Here you can see the data, metadata (these are all updates) and before values (what the data was before the update).

You can write a Kafka Consumer to see what this looks like in JSON format, or use a simple Striim application to read from Kafka, then use the built-in preview mechanism to see what it looks like.

Testing the Message Format Using a Striim Kafka Reader

Exporting As A Script

With the Striim Platform you are not forced to use the UI, even though it is an incredibly fast and convenient way of doing things. You can instead use our scripting language.

Any streaming data flow application can be exported as a TQL script, and any TQL script can be imported then worked with in the UI.

If you go back to the original MySQLCDCToKafka application and click on the configuration dropdown on the top right of the toolbar, you can export it.

The script for the application looks like this:

CREATE APPLICATION MySQLCDCToKafka;

CREATE SOURCE MySQLCDC USING MysqlReader (
  Username: 'striim',
  Password: 'Vx/QMzVeRTMY1rxZH+zycQ==',
  ConnectionURL: 'mysql://localhost:3306',
  Tables: 'test.PRODUCT_INV',
  DatabaseName: 'test',
  Password_encrypted: true
)
OUTPUT TO MySQLCDC_ChangeDataStream;

CREATE TARGET WriteToKafka USING KafkaWriter VERSION '0.8.0' (
  Mode: 'Sync',
  Topic: 'mysqlcdc',
  brokerAddress: 'localhost:9092'
)
FORMAT USING JSONFormatter ()
INPUT FROM MySQLCDC_ChangeDataStream;

END APPLICATION MySQLCDCToKafka;

In our next tutorial, we will take the raw CDC data and use some in-memory SQL-based processing to transform and enrich before we write the data out to Apache Kafka. In the meantime, please feel free to request a demo with one of our lead technologists, tailored to your environment.

Change Data Capture Methods

In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. Companies use change data capture for several use cases such as cloud adoption and enabling real-time data warehousing. There are multiple common change data capture methods that you can implement depending on your application requirements and tolerance for performance overhead.

  1. Introduction
  2. Audit Columns
  3. Table Deltas
  4. Triggers
  5. Log-based change data capture

Introduction

In high-velocity data environments where time-sensitive decisions are made, change data capture is an excellent fit to achieve low-latency, reliable, and scalable data integration. With over 80% of companies planning on implementing multi-cloud strategies by 2025, picking the right change data capture method for your business is more critical than ever given the need to replicate data across multiple environments.

The business transactions captured in relational databases are critical to understanding the state of business operations. Traditional batch-based approaches to move data once or several times a day introduce latency and reduce the operational value to the organization. Change Data Capture provides real-time or near real-time movement of data by moving and processing data continuously as new database events occur.

There are several change data capture methods to identify changes that need to be captured and moved. Here are the common methods, how they work, and their advantages as well as shortcomings.

Audit Columns

By using existing “LAST_UPDATED” or “DATE_MODIFIED” columns, or by adding one if not available in the application, you can create your own change data capture solution at the application level. This approach retrieves only the rows that have been changed since the data was last extracted.

The CDC logic for this technique would be as follows (a SQL sketch of these steps appears after the list):

  1. Get the maximum value of both the target table’s ‘Created_Time’ and ‘Updated_Time’ columns
  2. Select all rows from the source table with a ‘Created_Time’ greater than (>) the target table’s maximum ‘Created_Time’; these are all the rows created since the last CDC run
  3. Select all rows from the source table with an ‘Updated_Time’ greater than (>) the target table’s maximum ‘Updated_Time’ and a ‘Created_Time’ less than or equal to (<=) the target table’s maximum ‘Created_Time’. Rows with a later create time are excluded because they were already picked up in step 2
  4. Insert the new rows from step 2 and update the existing rows from step 3 in the target
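
The following is a minimal SQL sketch of that logic, assuming hypothetical source and target tables named src_orders and tgt_orders that both carry ‘Created_Time’ and ‘Updated_Time’ columns; :max_created and :max_updated stand for the high-water marks captured in step 1:

-- Step 1: capture the high-water marks from the target
SELECT MAX(Created_Time), MAX(Updated_Time) FROM tgt_orders;

-- Step 2: rows created since the last run
SELECT * FROM src_orders WHERE Created_Time > :max_created;

-- Step 3: rows updated since the last run, excluding the new rows from step 2
SELECT * FROM src_orders
WHERE Updated_Time > :max_updated
  AND Created_Time <= :max_created;

-- Step 4: insert the step 2 rows into the target and update it with the step 3 rows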

Pros of this method

  • It can be built with native application logic
  • It doesn’t require any external tooling

Cons of this method

  • Adds additional overhead to the database
  • DML statements such as deletes will not be propagated to the target without additional scripts to track deletes
  • Error prone and likely to cause issues with data consistency

This approach also requires CPU resources to scan the tables for the changed data and maintenance resources to ensure that the DATE_MODIFIED column is applied reliably across all source tables.

Table Deltas

You can use table delta or ‘tablediff’ utilities to compare the data in two tables for non-convergence. Then you can use additional scripts to apply the deltas from the source table to the target as another approach to change data capture. There are several examples of SQL scripts that can find the difference between two tables.
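
As an illustration, assuming two hypothetical snapshot tables with identical schemas, previous_snapshot and current_snapshot, the deltas can be computed with standard SQL set operators (EXCEPT is MINUS in Oracle, and older MySQL versions need a NOT EXISTS rewrite):

-- Rows that are new or changed since the previous snapshot
SELECT * FROM current_snapshot
EXCEPT
SELECT * FROM previous_snapshot;

-- Rows that were deleted since the previous snapshot
SELECT * FROM previous_snapshot
EXCEPT
SELECT * FROM current_snapshot;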

Advantages of this approach:

  • It provides an accurate view of changed data while only using native SQL scripts

Disadvantages of this approach:

  • Demand for storage significantly increases because you need three copies of the data sources that are being used in this technique: the original data, previous snapshot, and current snapshot
  • It does not scale well in applications with heavy transactional workloads

Although this works better for managing deleted rows, the CPU resources required to identify the differences are significant, and the overhead increases linearly with the volume of data. The diff method also introduces latency and cannot be performed in real time. Some log-based change data capture tools come with the ability to compare source and target tables to ensure replication consistency.

Triggers

Another method for building change data capture at the application level is defining triggers and creating your own change log in shadow tables. Triggers fire before or after INSERT, UPDATE, or DELETE commands (that indicate a change) and are used to create a change log. Because this approach operates at the SQL level, some users prefer it. Some databases even have native support for triggers.
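
For illustration only, a trigger-based change log on a hypothetical purchases table might look roughly like this in MySQL (separate triggers would be needed for inserts, updates, and deletes):

-- Hypothetical shadow table that records each change
CREATE TABLE purchases_changelog (
  change_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
  op_type     CHAR(1)   NOT NULL,                 -- 'I', 'U', or 'D'
  purchase_id INT       NOT NULL,
  changed_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Record every update made to the source table
CREATE TRIGGER purchases_after_update
AFTER UPDATE ON purchases
FOR EACH ROW
  INSERT INTO purchases_changelog (op_type, purchase_id) VALUES ('U', NEW.purchase_id);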

However, triggers are required for each table in the source database, and they add overhead to the operational tables while the changes are being made. In addition to having a significant impact on the performance of the application, maintaining the triggers as the application changes leads to a management burden.

Advantages of this approach:

  • Shadow tables can provide an immutable, detailed log of all transactions
  • Directly supported in the SQL API for some databases

Disadvantage of this approach:

  • Significantly reduces the performance of the database by requiring multiple writes to a database every time a row is inserted, updated, or deleted

Many application users do not want to risk the application behavior by introducing triggers to operational tables. DBAs and data architects should always heavily test the performance of any triggers added into their environment and decide if they can tolerate the additional overhead.

Log-Based Change Data Capture

Databases contain transaction logs (sometimes called redo logs) that store all database events, allowing the database to be recovered in the event of a crash. With log-based change data capture, new database transactions – including inserts, updates, and deletes – are read from source databases’ native transaction or redo logs.

The changes are captured without making application level changes and without having to scan operational tables, both of which add additional workload and reduce source systems’ performance.
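
For example, with MySQL (the source used in the tutorial above), log-based CDC relies on row-based binary logging. A rough sketch of how to check and, where the version allows it, adjust these settings:

-- Check whether the binary log is enabled and which format it uses
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';

-- Switch to row-based logging at runtime (enabling log_bin itself requires a server configuration change)
SET GLOBAL binlog_format = 'ROW';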

Advantages of this approach

  • Minimal impact on production database system – no additional queries required for each transaction
  • Can maintain ACID reliability across multiple systems
  • No requirement to change the production database system’s schemas or the need to add additional tables

Challenges of this approach

  • Parsing the internal logging format of a database is complex – most databases do not document the format nor do they announce changes to it in new releases. This would potentially require you to change your database log parsing logic with each new database release.
  • Requires a system to manage the metadata of database change events
  • Additional log levels required to produce scannable redo logs can add marginal performance overhead

Data integration platforms that natively perform change data capture can handle the complexity mentioned above by automatically mining the database change logs while managing additional metadata to ensure the replication between two or more systems is reliable.

If you would like to see how low-impact, real-time, log-based change data capture works, or to talk to one of our CDC experts, you can schedule a demo of the Striim Platform.

Striim 3.10.1 Further Speeds Cloud Adoption

We are pleased to announce the general availability of Striim 3.10.1 that includes support for new and enhanced Cloud targets, extends manageability and diagnostics capabilities, and introduces new ease of use features to speed our customers’ cloud adoption. Key Features released in Striim 3.10.1 are directly available through Snowflake Partner Connect to enable rapid movement of enterprise data into Snowflake.

Striim 3.10.1 Focus Areas Including Cloud Adoption

This new release introduces many new features and capabilities, summarized here:

3.10.1 Features Summary

Let’s review the key themes and features of this new release, starting with the new and expanded cloud targets.

Striim on Snowflake Partner Connect

From Snowflake Partner Connect, customers can launch a trial Striim Cloud instance directly as part of the Snowflake on-boarding process from the Snowflake UI and load data, optionally with change data capture, directly into Snowflake from any of our supported sources. You can read about this in a separate blog.

Expanded Support for Cloud Targets to Further Enhance Cloud Adoption

The Striim platform has been chosen as a standard for our customers’ cloud adoption use-cases partly because of the wide range of cloud targets it supports. Striim provides integration with databases, data warehouses, storage, messaging systems and other technologies across all three major cloud environments.

A major enhancement is the introduction of support for the Google BigQuery Streaming API. This not only enables real-time analytics on large scale data in BigQuery by ensuring that data is available within seconds of its creation, but it also helps with quota issues that can be faced by high volume customers. The integration through the BigQuery streaming API can support data transfer up to 1GB per second.

In addition to this, Striim 3.10.1 also has the following enhancements:

  • Optimized delivery to Snowflake and Azure Synapse that facilitates compacting multiple operations on the same data to a single operation on the target resulting in much lower change volume
  • Delivery to MongoDB cloud and MongoDB API for Azure Cosmos DB
  • Delivery to Apache Cassandra, DataStax Cassandra, and Cassandra API for Azure Cosmos DB

  • Support for delivery of data in Parquet format to Cloud Storage and Cloud Data Lakes to further support cloud analytics environments

Schema Conversion to Simplify Cloud Adoption Workflows

As part of many cloud migration or cloud integration use-cases, especially during the initial phases, developers often need to create target schemas to match those of source data. Striim adds the capability to use source schema information from popular databases such as Oracle, SQL Server, and PostgreSQL and create appropriate target schema in cloud targets such as Google BigQuery, Snowflake and others. Importantly, these conversions understand data type and structure differences between heterogeneous sources and targets and act intelligently to spot problems and inconsistencies before progressing to data movement, simplifying cloud adoption.

Enhanced Monitoring, Alerting and Diagnostics

Ongoing data movement between on-premise and cloud environments, whether for migrations or for powering reporting and analytics solutions, is often part of an enterprise’s critical applications. As such, it demands deep insights into the status of all active data flows.

Striim 3.10.1 adds the capability to inherently monitor data from its creation in the source to successful delivery in a target, generate detailed lag reports, and alert on situations where lag is outside of SLAs.

End to End Lag Visualization

In addition, this release provides detailed status on checkpointing information for recovery and high availability scenarios, with insight into checkpointing history and currency.

Real-time Checkpointing Information

Simplifies Working with Complex Data

As customers work with heterogeneous environments and adopt more complex integration scenarios, they often have to work with complex data types, or perform necessary data conversions. While always possible through user defined functions, this release adds multiple commonly requested data manipulation functions out of the box. This simplifies working with JSON data and document structures, while also facilitating data cleansing, and regular expression operations.

On-Going Support for Enterprise Sources

As customers upgrade their environments, or adopt new technologies, it is essential that their integration platform keeps pace. In Striim 3.10.1 we extend our support for the Oracle database to include Oracle 19c, including change data capture, add support for schema information and metadata for Oracle GoldenGate trails, and certify our support for Hive 3.1.0.

This is a high-level view of the new features of Striim 3.10.1. There is a lot more to discover to aid your cloud adoption journey. If you would like to learn more about the new release, please reach out to schedule a demo with a Striim expert.

Mitigating Data Migration and Integration Risks for Hybrid Cloud Architecture

Cloud computing has transformed how businesses use technology and drive innovation for improved outcomes. However, the journey to the cloud, which includes data migration from legacy systems, and integration of cloud solutions with existing systems, is not a trivial task. There are multiple cloud adoption risks that businesses need to mitigate to achieve the cloud’s full potential.

Common Risks in Data Migration and Integration to Cloud Environments

In addition to data security and privacy, there are additional concerns and risks in cloud migration and integration. These include:

Downtime: The bulk data loading technique, which takes a snapshot of the source database, requires you to lock the legacy database to preserve the consistent state. This translates to downtime and business disruption for your end users. While this disruption can be acceptable for some of your business systems, the mission-critical ones that need modernization are typically the ones that cannot tolerate even planned downtime. And sometimes, planned downtime extends beyond the expected duration, turning into unplanned downtime with detrimental effects on your business.

Data loss: Some data migration tools might lose or corrupt data in transit because of a process failure or network outage. Or they may fail to apply the data to the target system in the right transactional order. As a result, your cloud database ends up diverging from the legacy system, also negatively impacting your business operations.

Inadequate Testing: Many migration projects operate under tense time pressures to minimize downtime, which can lead to a rushed testing phase. When the new environment is not tested thoroughly, the end result can be an unstable cloud environment. Certainly, not the desired outcome when your goal is to take your business systems to the next level.

Stale Data: Many migration solutions focus on the “lift and shift” of existing systems to the cloud. While it is a critical part of cloud adoption, your journey does not end there. Having a reliable and secure data integration solution that keeps your cloud systems up-to-date with existing data sources is critical to maintaining your hybrid cloud or multi-cloud architecture. Working with outdated technologies can lead to stale data in the cloud and create delays, errors, and other inefficiencies for your operational workloads.

Upcoming Webinar on the Role of Streaming Data Integration for Data Migration and Integration to Cloud

Streaming data integration is a new approach to data integration that addresses the multifaceted challenges of cloud adoption. By combining bulk loading with real-time change data capture technologies, it minimizes downtime and risks mentioned above and enables reliable and continuous data flow after the migration.

Striim - Data Migration to Cloud

In our next live, interactive webinar, we dive into this particular topic: “Cloud Adoption: How Streaming Data Integration Minimizes Risks.” Our Co-Founder and CTO, Steve Wilkes, will present the practical ways you can mitigate data migration risks and handle integration challenges for cloud environments. Striim’s Solution Architect, Edward Bell, will walk you through a live demo of zero downtime data migration and continuous streaming integration to major cloud platforms, such as AWS, Azure, and Google Cloud.

I hope you can join this live, practical presentation on Thursday, May 7th 10:00 AM PT / 1:00 PM ET to learn more about how to:

  • Reduce migration downtime and data loss risks, as well as allow unlimited testing time of the new cloud environment.
  • Set up streaming data pipelines in just minutes to reliably support operational workloads in the cloud.
  • Handle strict security, reliability, and scalability requirements of your mission-critical systems with an enterprise-grade streaming data integration platform.

Until we see you at the webinar, and afterward, please feel free to reach out to get a customized Striim demo for data migration and integration to cloud to support your specific IT environment.

MySQL to Google BigQuery using CDC

Tutorial: Migrating from MySQL to BigQuery for Real-Time Data Analytics

In this post, we will walk through an example of how to replicate and synchronize your data from on-premises MySQL to BigQuery using change data capture (CDC).

Data warehouses have traditionally been on-premises services that required data to be transferred using batch load methods. Ingesting, storing, and manipulating data with cloud data services like Google BigQuery makes the whole process easier and more cost effective, provided that you can get your data in efficiently.

The Striim real-time data integration platform allows you to move data in real time as changes are recorded, using a technology called change data capture. This allows you to build real-time analytics and machine learning capabilities from your on-premises datasets with minimal impact.

Source MySQL Database

Before you set up the Striim platform to synchronize your data from MySQL to BigQuery, let’s take a look at the source database and prepare the corresponding database structure in BigQuery. For this example, I am using a local MySQL database with a simple purchases table to simulate a financial datastore that we want to ingest from MySQL to BigQuery for analytics and reporting.
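
The exact schema does not matter for this walkthrough; a hypothetical purchases table along these lines is enough to follow along:

CREATE TABLE purchases (
  purchase_id  INT           NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer_id  INT           NOT NULL,
  product_name VARCHAR(100)  NOT NULL,
  amount       DECIMAL(10,2) NOT NULL,
  purchased_at TIMESTAMP     DEFAULT CURRENT_TIMESTAMP
);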

I’ve loaded a number of initial records into this table and have a script to apply additional records once Striim has been configured to show how it picks up the changes automatically in real time.

Targeting Google BigQuery

You also need to make sure your instance of BigQuery has been set up to mirror the source or the on-premises data structure. There are a few ways to do this, but because you are using a small table structure, you are going to set this up using the Google Cloud Console interface. Open the Google Cloud Console, and select a project, or create a new one. You can now select BigQuery from the available cloud services. Create a new dataset to hold the incoming data from the MySQL database.

Once the dataset has been created, you also need to create a table structure. Striim can perform transformations while the data flows through the synchronization process. However, to make things a little easier here, I have replicated the same structure as the on-premises data source.
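
If you prefer SQL to clicking through the console, the matching table can also be created with BigQuery DDL. This sketch assumes a dataset named mysql_replica and mirrors the hypothetical schema shown earlier:

CREATE TABLE mysql_replica.purchases (
  purchase_id  INT64,
  customer_id  INT64,
  product_name STRING,
  amount       NUMERIC,
  purchased_at TIMESTAMP
);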

You will also need a service account to allow your Striim application to access BigQuery. Open the service account option through the IAM window in the Google Cloud Console and create a new service account. Give the necessary permissions for the service account by assigning BigQuery Owner and Admin roles and download the service account key to a JSON file.

Set Up the Striim Application

Now you have your data in a table in the on-premises MySQL database and have a corresponding empty table with the same fields in BigQuery. Let’s now set up a Striim application on Google Cloud Platform for the migration service.

Open your Google Cloud Console and open or start a new project. Go to the marketplace and search for Striim. A number of options should return, but the option you are after is the first item that allows integration of real-time data to Google Cloud services.

Select this option and start the deployment process. For this tutorial, you are just using the defaults for the Striim server. In production, you would need to size appropriately depending on your load.

Click the deploy button at the bottom of this screen and start the deployment process.

Once this deployment has finished, the details of the server and the Striim application will be generated.

Before you open the admin site, you will need to add a few files to the Striim Virtual Machine. Open the SSH console to the machine and copy the JSON file with the service account key to a location Striim can access. I used /opt/striim/conf/servicekey.json.


Give these files the right permissions by running the following commands:

chown striim:striim <filename>

chmod 770 <filename>

You also need to restart the Striim services for this to take effect. The easiest way to do this is to restart the VM.

Once this is done, close the shell and click on the Visit The Site button to open the Striim admin portal.

Before you can use Striim, you will need to configure some basic details. Register your details and enter the cluster name (I used “DemoCluster”) and password, as well as an admin password. If you don’t have a license, leave the license field blank to get a trial license, then wait for the installation to finish.

When you get to the home screen for Striim, you will see three options. Let’s start by creating an app to connect your on-premises database with BigQuery to perform the initial load of data. To create this application, you will need to start from scratch from the applications area. Give your application a name and you will be presented with a blank canvas.

The first step is to read data from MySQL, so drag a database reader from the sources tab on the left. Double-click on the database reader to set the connection string with a JDBC-style URL using the template:

jdbc:mysql://<server_ip>:<port>/<database>
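
For example, a MySQL server reachable at 10.128.0.5 with a database named test (hypothetical values) would use jdbc:mysql://10.128.0.5:3306/test.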

You must also specify the tables to synchronize — for this example, purchases — as this allows you to restrict what is synchronized.

Finally, create a new output. I called mine PurchasesDataStream.

You also need to connect your BigQuery instance to your source. Drag a BigQuery writer from the targets tab on the left. Double-click on the writer and select the input stream from the previous step and specify the location of the service account key. Finally, map the source and target tables together using the form:

<source-database>.<source-table>,<target-database>.<target-table>

For this use case this is just a single table on each side.
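
Using the hypothetical names from earlier, the mapping would look something like test.purchases,mysql_replica.purchases.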

Once both the source and target connectors have been configured, deploy and start the application to begin the initial load process. Once the application is deployed and running, you can use the monitor menu option on the top left of the screen to watch the progress.

Because this example contains a small data load, the initial load application finishes pretty quickly. You can now stop this initial load application and move on to the synchronization.

Updating BigQuery with Change Data Capture

Striim has pushed your current database up into BigQuery, but ideally you want to update this every time the on-premises database changes. This is where the change data capture application comes into play.

Go back to the applications screen in Striim and create a new application from a template. Find and select the MySQL CDC to BigQuery option.

Like the first application, you need to configure the details for your on-premises MySQL source. Use the same basic settings as before. However, this time the wizard adds the JDBC component to the connection URL.

When you click Next, Striim will ensure that it can connect to the local source. Striim will retrieve all the tables from the source. Select the tables you want to sync. For this example, it’s just the purchases table.

Once the local tables are mapped, you need to connect to the BigQuery target. Again, you can use the same settings as before by specifying the same service key JSON file, table mapping, and GCP Project ID.

Once the setup of the application is complete, you can deploy and turn on the synchronization application. This will monitor the on-premises database for any changes, then synchronize them into BigQuery.

Let’s see this in action by clicking on the monitor button again and loading some data into your on-premises database. As the data loads, you will see the transactions being processed by Striim.

Next Steps

As you can see, Striim makes it easy for you to synchronize your on-premises data from existing databases, such as MySQL, to BigQuery. By constantly moving your data into BigQuery, you could now start building analytics or machine learning models on top, all with minimal impact to your current systems. You could also start ingesting and normalizing more datasets with Striim to fully take advantage of your data when combined with the power of BigQuery.

To learn more about Striim for Google BigQuery, check out the related product page. Striim is not limited to MySQL to BigQuery integration, and supports many different sources and targets. To see how Striim can help with your move to cloud-based services, schedule a demo with a Striim technologist or download a free trial of the platform.

Striim Security Enhancements

Striim 3.9.8 Adds Advanced Security Features for Cloud Adoption

We are pleased to announce the general availability of Striim 3.9.8 with a rich set of features that span multiple areas, including advanced data security, enhanced development productivity, data accountability, performance and scalability, and extensibility with new data targets.

The new release brings many new features that are summarized here:

Let’s review the key themes and features of the new release starting with the security topic.

Advanced Platform and Adapter Security:

With a sharp focus on business-critical systems and use cases, the Striim team has been boosting the platform’s security features for the last several years. However, in version 3.9.8, we introduced a broad range of advanced security features to both the platform and its adapters to provide users with robust security for the end-to-end solution, and higher control for managing data security.

The new platform security features include the following components:

  • Striim KeyStore, which is a secured, centralized repository based on Java Keystore, for storing passwords and encryption keys, streamlines security management across the platform.
  • Ultra-secure algorithms for user password encryption across all parts of the platform, reducing the platform’s vulnerabilities to external or internal breaches.
  • Stronger encryption support for inter-node cluster communication with internally generated, long string password and unified security management for all nodes and agents.
  • Multi-layered application security via advanced support for exporting and importing pipeline applications within the platform. In Striim, all password properties of an application are encrypted using their own keys. When exporting applications containing passwords or other encrypted property values, you can now add a second level of encryption with a passphrase that will be required at the time of import, to strengthen the application security.
  • Encryption support using a customer-provided key for securing permanent files, via the File Writer, and intermediate temporary files, via the Google Cloud Storage Writer. Supported encryption algorithm types include RSA, AES, and PGP. You can generate keys for encryption with any of several tools available online or with an in-house Java program, and easily configure the encryption settings of the adapters via the Encryption policy property in the UI.

Overall, these new security features enable:

  • Enhanced platform and adapter security for hybrid cloud deployments and mission-critical environments
  • Strengthened end-to-end data protection from ingestion to file delivery
  • Enhanced compliance with strict security policies and regulations
  • Secured application sharing between platform users

Improved Data Accountability:

Striim version 3.9.8 includes an application-specific exception store for storing events discarded by the application, including discarded records. The feature allows viewing discarded records and their details in real time. You can configure this feature with a simple on/off option when building an application. With this feature, Striim improves its accountability for all data passing through the platform and allows users to build applications for replaying and processing discarded records.

Enhanced Application Development Support and Ease of Use

The new release also includes features that accelerate and ease developing integration applications, especially in high-volume data environments.

  • A New Enrichment Transformer: Expanding the existing library of out-of-the-box transformers, the new enrichment transformer function allows you to enrich your streaming data in-flight without any manual coding step. You only need Striim’s drag and drop UI to create a real-time data pipeline that performs in-memory data lookups. With this transformer, you can, for example, add City Name and County Name fields to an event containing Zip Code.

  • External Lookups: Striim provides an in-memory data cache to enrich data in-flight at very high speeds. With the new release, Striim gives you the option to enrich data with lookups from external data stores. The platform can now execute a database query to fetch data from an external database and return the data as a batch. The external lookup option helps users avoid preloading data in the Striim cache. This is especially beneficial for lookups from or joining with large data sets. External lookups also eliminate the need for a cache refresh since the data is fetched from the external database. The external lookups are supported for all major databases, including Oracle, SQL Server, MySQL, PostgreSQL, HPE NonStop.
  • The Option to Use Sample Data for Continuous Queries: With this ability, Striim reduces the data required for computation or displaying results via the dashboards. You can select to use only a portion of your streaming data for the application, if your use case can benefit from this approach. As a result, it increases the speed for computation and displaying the results, especially when working with very large data volumes.
  • Dynamic Output Names for Writers: The Striim platform makes it now easy to organize and consume the files and objects on the target system by giving flexible options for naming them. Striim file and object output names can include data, metadata, and user data field values from the source event. This dynamic output naming feature is available for the following targets: Azure Data Lake Store Gen 1 and Gen 2, Azure Blob Storage, Azure File Storage, Google Cloud Storage, Apache HDFS, Amazon S3.
  • Event-Augmented Kafka Message Header: Starting with Apache Kafka 0.11, Striim 3.9.8 introduced a new property called MessageHeader that enriches the Kafka message header with a mix of the event’s dynamic and static values before delivering with sub-second latency. With the help of this additional contextual information, downstream consumer applications can rapidly determine how to use the messages arriving via Striim.
  • Simplified User Experience: The new UI for configuring complex adapter properties, such as rollover policy, flush policy, encryption policy, speeds new application development.

  • New sample application for real-time dashboards: Striim version 3.9.8 added a new sample dashboarding application that uses real-time data from the Meetup website and displays details of the meetup events happening around the globe using the Vector Map visualization.

Other platform improvements for ease of use and manageability include:

  • The Open Processor component, which allows bringing external code into the Striim platform, can be loaded and unloaded dynamically without having to restart Striim.
  • The Striim REST API allows safely deleting or post-processing the files processed by the Striim File Reader.
  • The Striim REST API for application monitoring reports consolidated statistics of various application components within a specified time range.

Increased Performance and Scalability:

For further improving performance and scalability, we have multiple features, including dynamic partitioning and performance fine-tuning for writers:

  • Dynamic Partitioning with a Higher Level of Control: Partitions allow parallel processing of the events in a stream by splitting them across multiple servers in the deployment. Striim’s partitioning distributes events dynamically at run-time across server nodes in a cluster and enables high performance and easy scalability. In prior releases, Striim used one or more fields of the events in the stream as the key for partitioning. In the new release, users have additional, flexible options for distributing and processing large data volumes in streams or windows. Striim 3.9.8 allows the partitioning key to be one or more expressions composed of the fields of the events in the stream. Striim’s flexible partitioning enables load-balancing applications that are deployed on multi-node clusters and process large data volumes. Window-based partitioning enables grouping the events into windows that can, for example, be consumed by specific downstream writers. As a result, you can perform load-balancing across multiple writers to improve writing performance.
  • Writer Fine-Tuning Options: Striim 3.9.8 now offers the ability to configure the number of parallel threads for writing into the target system and simplifies writer configuration for achieving even higher throughput from the platform. The fine-tuning option is available for the following list of writers at this time: Azure Synapse Analytics and Azure SQL Data Warehouse, Google BigQuery, Google Cloud Spanner, Azure Cosmos DB, Apache HBase, Apache Kudu, MapR Database, Amazon Redshift, and Snowflake.

Increased Extensibility with New Data Targets

  • The Striim platform now supports SAP Hana as a target with direct integration. SAP Hana customers can now stream real-time data from a diverse set of sources into the platform with in-flight, in-memory data processing. With the availability of real-time data pipelines to SAP Hana, deployed on-premises or in the cloud, customers can rapidly develop time-sensitive analytics applications that transform their business operations.
  • Expanding the HTTP Reader capabilities to send custom responses back to the requestor. The HTTP Reader can now defer responding until events reach a corresponding HTTP Writer. This feature enables users to build REST services using Striim.

Other extensibility improvements are:

  • Improved support for handling special characters for table names in Oracle and SQL Server databases
  • Hazelcast Writer supports multi-column primary keys to enable more complex Hot Cache use cases
  • Performance improvement options for the SQL Server CDC Reader

These are only a portion of the new features of Striim 3.9.8. There is more to discover. If you would like to learn more about the new release, please reach out to schedule a demo with a Striim expert.

Advancement of Data Movement Technologies

Advancement of Data Movement Technologies: Whiteboard Wednesdays

In this Whiteboard Wednesday video, Irem Radzik, Head of Product Marketing at Striim, looks at how data movement technologies have evolved in response to changing user demands. Read on, or watch the 8-minute video:

Today we’re going to talk about the advancement of data movement technologies. We’re going to look at the ETL technologies that we started seeing in ‘90s, then the CDC (Change Data Capture)/Logical Replication solutions that we started seeing a couple of decades ago, and then streaming data integration solutions that we more commonly see today.

ETL

Let’s look at ETL technologies. ETL is known for its batch extract, then bringing the data into the transformation step in the middle-tier server, and then loading the target in bulk again, typically for next-day reporting. You end up having high latency with these types of solutions. That was good enough for the ‘90s, but then we started demanding fresher data for operational decision making. Latency became an issue with ETL solutions.

Data Movement - ETL

The other issue with ETL was the batch-window dependency. Because of the high impact on the production sources, there had to be a dedicated time for these batch extracts when the main users wouldn’t be able to use the production database. The batch window that was available for data extract became shorter and shorter as business demanded continuous access to the OLTP system.

The data volumes increased at the same time. You ended up not having enough time to move all the data you needed. That became a pain point for ETL users, driving them to look into other solutions.

Change Data Capture/Logical Replication

Change Data Capture/Logical Replication solutions addressed several of the key concerns that ETL had. Change Data Capture basically means that you continuously capture new transactions happening in the source database and deliver them to the target in real time.

Data Movement - CDC / Logical Replication

That obviously helps with the data latency problem. You end up having real-time, up to date data in the target for your operational decision making. The other plus of CDC is the source impact.

When it’s using logs (database logs) to capture the data, it has negligible impact. The source production system is available for transaction users. There is no batch window needed and no limitations for how much time you have to extract and move the data.

The CDC/Logical Replication solutions handle some of the key concerns of ETL users. They are made more for the E and L steps. What ends up happening with these solutions is that you need to do transformations within the database or with another tool, in order to complete the transformation step for end users.

The transformation happening there creates an ELT architecture and requires another product, another step, another network hop in your architecture, which complicates the process.

When there’s an outage, when there is a process disruption, reconciling your data and recovering becomes more complicated. That’s the shortcoming CDC users have been facing. These solutions were mainly made for databases.

Once the cloud and big data solutions became popular, the CDC providers had to come up with new products for cloud and big data targets. These are add-ons, not part of the main platform.

Another shortcoming that we’ve seen with CDC/Logical Replication solutions is their single node architecture, which translates into a single point of failure. This is a shortcoming, especially for mission-critical systems that need continuous availability of the data integration processes.

Streaming Data Integration

In recent years, streaming data integration came about to address the issues that CDC/Logical Replication products raised. It is becoming increasingly common. With streaming data integration, you’re not limited to just database sources.

Data Movement - Streaming Data Integration

You can have your files, log data, your machine data, your system log files for example, all moving in a real-time fashion. Your cloud sources, your service bus or your messaging systems can be your source. Your sensor data can be moved in real time, in a streaming fashion to multiple targets. Again, not limited to just databases.

You can have cloud databases or other cloud services as your target. You can, in addition to databases, have messaging systems as your target, on-premises or in cloud, your big data solutions, on-premises or cloud. You can also deliver in file format.

Everything is like it was in a logical replication solution. It is continuous, in real time, and Change Data Capture is still a big component of the streaming data integration.

It’s built on top of the Change Data Capture technologies and brings additional data sources and additional data targets. Another important difference, and handling one of the challenges of logical replication, is the transformation piece. As we discussed, a transformation needs to happen and where it happens makes a big difference.

With streaming data integration, it’s happening in-flight. While the data is moving, you can have stream processing without adding more latency to your data. While the data is moving, it can be filtered, it can be aggregated, it can be masked and encrypted, and enriched with reference data, all in flight before it’s delivered to your target, so that it’s available in a consumable format. This streamlines your architecture, simplifies it, and makes all the recovery steps easier. It’s also delivering the data in the format that your users need.

Another important thing to highlight is the distributed architecture. This natively clustered environment helps with a single point of failure risk. When one node fails, the other one takes over immediately, so you have a highly available data pipeline. This distributed clustered environment also helps you to scale out very easily, add more servers as you have more data to process and move.

These solutions now come with a monitoring component. The real-time monitoring of the pipelines gives you an understanding of what’s happening with your integration flows. If there is an issue, such as high data latency or a process problem, you get immediate alerts so you can trust that everything is running.

Data reliability is critical, and so is the reliability of the whole pipeline. To make sure that there is no data loss or duplicates, data delivery validation can be included in some of these solutions. You can also make sure, with the right solution, that everything is processed exactly once, and that you are not repeating or dropping data. There are checkpointing mechanisms to be able to do that.

As you see, the new streaming data integration solutions handle some of the challenges that we have seen in the past with outdated data movement technologies. To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

5 Streaming Cloud Integration Use Cases: Whiteboard Wednesdays

Today we’re going to talk about five streaming cloud integration use cases. Streaming cloud integration moves data continuously in real time between heterogeneous databases, with in-flight data processing. Read on, or watch the 9-minute video:

Let’s focus on how to use streaming data integration in cloud initiatives, and the five common scenarios that we see.

Use Case #1 – Online Migration/Cloud Adoption

Let’s start with the first one. It is basically adopting the cloud, or getting to the cloud. When you want to move your data to the cloud, streaming cloud integration helps you with online database migration. You have your legacy database and you want to move it to the cloud. If this is a critical database, you do not want to pause it during this migration; you want it to remain operational to support your business.

Streaming data integration offers Change Data Capture technology. This captures all new transactions, change transactions, as soon as they happen and delivers them to the target. While you’re doing the initial load to the cloud database, you can start Change Data Capture, keep the system open to transactions, and capture all the new transactions happening with the CDC feature.

Once the initial load is done here, you can apply the change data to the system so that these two are in-sync. Because the system, this database, is open to transactions you basically have no database downtime. This is available for users and once the data is applied here you also have the ability to validate that these two databases are in-sync, and there is no data loss during this migration process. There are tools that provide this validation, to give you no data loss during the migration.

Because this database has production data and is up-to-date and the other one is still functional, you have unlimited time to test the new database. You can control the tests and be comfortable before you point any users or any applications to this cloud database. This unlimited testing minimizes your risks. The time pressure is gone, and you can be comfortable with your move to the cloud.

You also have the ability to perform phased migration. Bi-directional data flow between the legacy system and the cloud system allows you to have users on both sides. You can move some of your users to the cloud database and some of them are still in the legacy.

The streaming cloud integration solution can apply the changes happening in the cloud to the legacy and the changes happening in the legacy back to the cloud, so they stay in-sync. You can gradually move your users to the cloud database when you feel comfortable. Phased migration is another way to minimize your risk of moving your mission-critical systems to the cloud.

Use Case #2 – Hybrid Cloud Architecture for Analytics

We have discussed how to ease your cloud adoption, but once you’re in the cloud and you have adopted a cloud solution, you also need to treat it as part of your data center and build continuous data movement between your existing data sources and the new cloud solution.

Hybrid cloud architecture

We see quite a bit of cloud analytics solutions and that’s our second use case. Many organizations these days offload their analytics to cloud solutions. Modern cloud solutions give them tons of new features to modernize their analytics environment and transform their business.

We help with moving all kinds of enterprise data. This can be your databases, your machine data, all kinds of log files (security files and system log files), your existing cloud sources, your messaging systems, and your sensor data. All of them can be moved in real time continuously to your cloud analytics solution.

I would like to add that some streaming cloud integration solutions give you the ability to do in-flight data processing. Transformations happen in-flight so that you deliver the data, without adding latency, to the target system in a consumable format that it needs. You end up having data flowing in the right format for your cloud analytics solution.

The main value from this is that you can now run operational workloads, high value or high operational value producing analytics applications in your analytics solution. You can influence the decisions, operational decisions, happening in your business. That will help you gain faster business transformation throughout your enterprise.

Use Case #3 – Building New Applications in the Cloud

Building applications in the cloud

We talked about the analytics use case, here is another similar one. As part of your hybrid cloud architecture, you might be building new applications in the cloud. You still need data coming from your enterprise data sources to your cloud environment. By moving all this diverse set of data in real time to your cloud messaging systems or cloud databases or storage solutions, you are able to easily build applications in the cloud.

These modern applications move your business forward because the data is available. You can make better use of these cloud applications if you have this real-time bridge between your existing data center and your new cloud environment. Streaming Integration helps you to move your data so you can quickly build new applications for your business, to help it move forward with more modern solutions.

Use Case #4 – Multi-Cloud Integration

Multi-cloud integration

We also see multi-cloud use cases. A lot of companies now have one cloud solution for one purpose, another cloud solution for another purpose, and are working with multiple vendors. You have the option to feed your data to multiple targets. After you capture the data once, you can feed it to all kinds of different targets, maybe one of them for analytics and one of them for supporting new applications. You have the ability to distribute your data to multiple cloud solutions.

Use Case #5 – Inter-Cloud Integration

Inter-cloud integration

Similarly, if you’re working with multiple cloud vendors, you will need to connect these solutions with each other. If you have an operational database in one cloud and an analytics solution in another cloud, you need to move the data from one cloud solution to the other in real time, so you can run operational reporting or operational analytics in that cloud.

Streaming cloud integration gives you the agility and the ability to move your data wherever you want. The cloud can easily become part of your data center, a seamless part of your data infrastructure, once you move your data to that environment.

You can use streaming cloud integration to ease your migration to the cloud and adoption of cloud solutions by minimizing your risk and business disruption. You can also maintain your hybrid cloud architecture and multi-cloud architecture with a continuous data flow from your existing data sources.

To learn more about streaming data integration for your cloud solutions, please visit our Hybrid Cloud Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

 

Evaluating Streaming Data Integration Platforms

Evaluating Streaming Data Integration Platforms: Whiteboard Wednesdays

 

 

In today’s Whiteboard Wednesday video, Steve Wilkes, founder and CTO of Striim, looks at what you need to consider when evaluating streaming data integration platforms. Read on, or watch the 15-minute video:

We’ve already gone through what the components of a streaming integration platform are. Today we’re going to talk about how you go about evaluating streaming data integration platforms based on these components.

Just to reiterate, you need the platform to be able to:

  • Do real-time continuous data collection
  • Move that data continuously from where it’s collected to where it’s going
  • Support delivery to all the different targets that you care about
  • Process the data as it’s moving, so stream processing
  • Be enterprise grade, so that it is scalable, reliable, and meets all those other requirements you care about for mission-critical data
  • Get insights and alerts on that data movement

Let’s think about the things that you need to consider in order to actually achieve this when you’re evaluating such platforms.

Data Collection & Delivery

For data collection and delivery, you care about quite a few different things. Firstly, it needs to be low latency. If it’s a streaming data integration platform, then just doing bulk loads or micro batch may not be sufficient. You want to be able to collect the data the instant it’s created, within milliseconds typically. You need low-latency data collection.

Evaluating Streaming Integration Platforms - Data Collection

It needs to be able to support all the sources that you care about. If you’re looking for a streaming integration platform, then you’re thinking of more than just one use case. You’re thinking “what platform is going to support all of the streaming data integration needs within my organization?” Supporting just one data source or a couple of data sources isn’t enough.

You need to be able to support all the sources that you care about now and may care about in the future. That could be databases, files, or messaging systems. It could even be IoT. So think about that when you’re evaluating whether the platform has all the sources that you need. Think about how it can deal with those sources in a number of different ways.

For databases, you may need to be able to do bulk loads into a streaming infrastructure, as well as Change Data Capture. This is important for collecting real-time change as it’s happening in a database: the inserts, updates, and deletes. For files, you may need to handle bulk files that already exist, but also files as they’re created, streaming out the data as it’s being written. Supporting both bulk and change data is equally important.

You also need to consider whether the adapters are actually part of the platform or are they third party. If they are part of the platform and the platform is built well, then it means that they will be able to handle all the different requirements of the platform – scalability, reliability, and recoverability. All of those things are integrated end to end because the adapters are part of the platform.

If they’re third party, then that may not be the case. If you have to plug in third party components into your infrastructure, then you can have areas of brittleness where things may not work properly or problematic interfaces when things change. Try to avoid third party adapters wherever you can.

Data collection and data delivery need to be able to support the end-to-end recovery and reliability that is part of being enterprise grade. That means that, from a database perspective for example, you may need to be able to support maintaining a database transaction context from one end to the other. You need to be able to pick up from where you left off and make sure that data that is collected is delivered to all of the appropriate targets, which can vary widely.

You might be delivering some data on-premise and some data to the cloud, but you still need to be able to make sure that all the data has made it there. You need to be able to validate that the data is being written to all the different sources and targets that the platform is supporting.

If it’s part of a platform and they’re not third party, you would expect that to be there. If they are third party, then you have to investigate whether all of those things are supported. Data collection and data delivery is the first part of how you evaluate the platform.

Data Movement

The next part is how the platform does data movement. This is crucial to maintaining the kind of high throughput and low latency that you’d expect. Data movement happens in a number of places. It’s between processing steps. It’s between your source collection and your data delivery.

That could be source collection, then some in-memory processing or enrichment, then data delivery. Or it could be an even more complex pipeline with multiple steps in it. You’re moving data between each step.

It’s also between nodes. If you have a clustered platform, that platform may be moving data between nodes for different processing steps, or between source and target because the target is closer to one of the nodes than the others. You need to be able to ensure that the data movement between nodes happens efficiently, with high throughput and low latency.

You also need to be able to support collecting data on-premise and delivering it into cloud environments, or collecting it from cloud environments and delivering it to on-premise, or moving between clouds. Supporting all these different topologies is all part of data movement.

Ideally as much of the data movement as possible should be in memory only. Try to avoid having to write to disk or do any kind of IO in between processing steps. The reason for this is that each processing step needs to perform optimally in order to get high throughput.

If you are persisting data, that can add latency. Ideally when you’re doing multiple processing steps in a pipeline, you’re doing all of that data movement in memory only, between the steps or just between nodes. You’re not persisting to disk.

You should only use persistent data movement or persistent data streams where needed. There are a couple of really good use cases for this. One is if you have data sources that you can’t rewind into for recoverability, you may want to use a persistent data stream as the first step in the process, but everything downstream can be in memory only.

If you’re collecting data in real time, but you have multiple applications all running at their own speeds against that data, you may want to think about having persistent data streams between different steps. Typically, you want to minimize the amount of persistent data streams that you have and use in-memory only data streams wherever possible. That will really aid in reducing your latency and increasing your throughput.

Stream Processing

The next thing that you need to be able to do is stream processing. Stream processing obviously has to be able to support all of the different types of processing that you want to do. For example, it needs to be able to support complex transformations. If it doesn’t support the transformations that you want, you should be able to add in your own components or your own user defined functions to do the transformations.

It needs to be able to combine and enrich data. This requires a lot of different constructs for stream processing. When you are combining data together from multiple data streams, they run at high speed and typically events aren’t going to happen at the same time.

You need a flexible windowing structure that can maintain a set of events from the different data streams to combine together, so that you can produce a combined output stream containing the current event from the stream being processed alongside the most recent event from each of the other streams.

When you’re enriching data, you need to be able to join streaming data with reference data. You can’t go back to a database or go back to the original source of the reference data for every event on a data stream. It’s just too slow. You need to be able to load, cache, and remember the data you are using for enrichment in memory so you can join it really efficiently, in order to keep and maintain the throughput that you’re looking for from the overall system.

You want the stream processing to be optimized. It should really run as fast as if you’d written it yourself manually. It also needs to be easy to use. We recommend that you look for SQL-based stream processing because SQL is the language of data. There are very few people that work with data that don’t understand SQL. It allows you to do filtering, transformation, and data enrichment through natural SQL constructs.

Obviously if you want to do more complex things, you should also be allowed to import your own transformations and work with those. SQL-based transformations enable anyone that knows data to build and understand what the transformations are. You also want building pipelines to be as easily accessible as possible to all the people that want to work with the data.
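To make this concrete, here is a minimal sketch of what SQL-based stream processing could look like: a continuous query that filters and reshapes an incoming event stream. The stream and field names are invented for this example, and the exact syntax of any given platform’s continuous-query language will differ.

```sql
-- Minimal sketch of a SQL-style continuous query: filter and transform events
-- as they flow. Stream and field names are hypothetical; real continuous-query
-- dialects wrap a SELECT like this in their own constructs.
SELECT  o.order_id,
        o.customer_id,
        UPPER(o.currency)   AS currency,
        o.amount * 1.2      AS amount_with_tax    -- simple transformation
FROM    RawOrderStream o
WHERE   o.amount > 1000;                          -- simple filter
```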

You need to have a good UI for building the data pipelines and have as much of the process as possible automated through wizards and other UI-based assistance. You need to be able to build multi-step stream processing, not just a single source into a single target, or a single source through a single piece of processing into a single target. You potentially need fan-in and fan-out: multiple data sources coming in, going into multiple processing steps in a staged environment, where they move step by step, to potentially multiple targets coming out at the other end.

This all needs to be coordinated, well-maintained, and deployable across a cluster in order to be scalable. Your stream processing should be very rich, very capable, and also very high throughput.

Enterprise Grade

You also need to think about the enterprise-grade qualities of the platform. I’ve mentioned before, for it to be enterprise grade it needs to be scalable. You need to be able to handle increasing the throughput, increasing the number of sources, increasing the number of targets, and increasing the volume of data being generated from each one of those.

When you’re evaluating platforms and evaluating for a production scenario, you should test the platform with a reasonable throughput that corresponds to what you’re expecting in order to see how it behaves and how it scales, and measure the throughput and the latency from end to end as you’re evaluating the platform.

You also need it to be reliable. You need to be able to ensure that you have guaranteed delivery from source all the way to target. Even if something fails, if a network fails, if the source or the target goes down, if any of the processing nodes in the cluster go down or the whole cluster goes down, you need to be able to ensure that it picks up from where it left off and doesn’t miss any messages.

It has to be able to recover from failures as well. You need guaranteed delivery in the normal, always-running case, so you don’t miss any messages just because they disappeared into the ether somewhere. But you also need assurance that if you have a failure, the platform recovers without losing any messages or events on the way from source to target.

Of course, security is also paramount. You need to be able to secure the data while it’s in transit, so it’s encrypted as it goes across the network. But you also need to secure who has access to the data: who can work with individual data streams, who can see the data on individual data streams, who can build applications, and who can view the results of those applications.

You need security that works across the whole end to end and deals with every single component, so that you can secure them and lock them down and make sure that only the people that need to work with data, can.

Insights & Alerts

Finally, you need to make sure that the platform gives you visibility into your data, that you can monitor the data flows and see what’s going on in real time, that you get alerts when anything happens. This could be when CPU or memory usage on any of the nodes goes above certain criteria. It could be when applications crash, or data flows crash. It could be when volume goes above or below what you expect, and doing that in a granular fashion. For example, when an individual database table goes above or below what you expect.

You need to be able to work with insights into the data flows that help you operationalize this and make sure that it’s working full time, 24/7, when you actually put it into production. You may even want to get insights on the data itself, drill down into the actual data that’s flowing, and do some analytics on that. If your streaming integration platform can also give you those valuable insights on the streaming data, then that’s the icing on the cake.

Just to summarize, when you’re evaluating streaming data integration platforms, you need to make sure that the platform can do everything that you need, to get your data from where it’s generated to where it needs to be, in order to get real value out of your data.

 

To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

Build vs. Buy for Streaming Data Integration

Build vs. Buy for Streaming Data Integration: Whiteboard Wednesdays

In this Whiteboard Wednesday video, Steve Wilkes, founder and CTO of Striim, asks the question, “Is it better to build or buy a Streaming Data Integration platform?”  Read on, or watch the 10-minute video:

You want to use streaming data integration to move data from existing sources that generate lots of data to your targets. But is it better to build a streaming data integration platform from lots of readily available open source components, or buy one and have that work already done for you?

Build

First, you need all the components that we’ve mentioned in previous videos that go into making a streaming data integration platform. You need to build data collectors and manage data delivery – which is not just one single thing. You need many adapters that support databases, files, maybe IoT or cloud targets, essentially a whole bunch of different technologies to get the data from where it is now to where you want it to be.

You need to think about data movement. What messaging system do I use? Do I use multiple ways of doing data movement? How am I going to do stream processing? What is the engine for that? How do I define it? Is there a UI in open source? Not everything has a UI. How do I define the stream processing? Does it support SQL, which, as we mentioned in a previous video, is the best approach for working with streaming data? How do I enrich data? Does the platform include an in-memory data grid or other mechanism by which I can enrich data at high speed? How do I ensure that all of this is enterprise grade? Is it scalable, reliable, and secure? How do I get insights and alerts? How do I view the data flows and understand what’s going on if I’m piecing this together myself? This is essential for a lot of operational reasons.

Build vs. Buy for Streaming Data Integration

So, assume that you are going to try and build this from lots of different open source pieces. For every piece that goes into making up this platform, including all the framework pieces and the adapters that you will need on either end, you need to go through a process. This process involves designing the overall platform and how all the pieces fit together. Then for each piece you need to look at what is available and evaluate each option, not just on its own, but on how it fits in with the other pieces that you are looking at. Once you have done that for each piece, you need to try and integrate all the pieces together. That integration involves building a lot of glue code on top of the pieces that you have chosen, to abstract them so that they are easier for people to work with, and to ensure that the result is enterprise grade: it scales together, it is reliable, and it is secure. Assuming that you have built all of this, you can then start to build your applications.

Maintenance

Now when it comes to maintenance, open source isn’t always maintained forever. People can stop supporting it. In that case you have to identify another piece of open source that can take its place and will fit in. Then you have to integrate again. Even if it’s just upgraded, that could mean the APIs have changed or the way that it functions has changed. You’re going to have to perform the integration and test everything again to make sure that everything is working. Once that is complete you are going to have to test your applications on the platform that you’ve just built again and make sure that’s working too.

Support

If you have issues, you need to go to support. Some open source platforms offer support, some don’t. There may be pieces that you end up supporting yourself, and because you’ve chosen to put multiple things together, as in the case of Change Data Capture (CDC), there may be multiple vendors providing support. That becomes a headache because different organizations tend to blame each other. You need to have support for all the pieces. If you need to upgrade, then you need to reintegrate again. Maintenance can become a big issue.

Buy

Now the difference between doing all of that and using a prebuilt streaming data integration (SDI) solution is that you’re not doing any of those pieces in the middle. In going the buy route, you replace all those pieces of open source that you would have put together yourself with the SDI solution straight out of the box. You simply download the solution and start building your data flows. That is going to massively reduce the amount of time it takes for you to start building your data integrations.

Build vs. Buy - Streaming Data Integration

The Differences Between Build vs. Buy

When it comes to summarizing all of this, what are the differences between the costs and risks of build vs. buy?

Development Costs

We went through all the steps of building a solution. Each one of those steps is going to involve engineers. They’re going to involve people that understand different technologies, and there is going to have to be a team around that to manage building the platform. It can be a pretty high development cost, and often that development cost can outweigh the cost of licensing a platform that you purchase.

Time to Value

There’s a long time to market because you have to go through all those steps for each component, and if components are upgraded or deprecated, you have to repeat all of those steps. So the time it takes to go from saying, “I want to move data from my on-premise databases into my cloud data warehouse” or “I want to move data from on-premise IoT into storage so I can do machine learning” to actually being able to do it is as long as it takes you to build your framework and infrastructure, in other words to build your own SDI platform.

Whereas if you’re buying it, the time it takes is simply how long it takes to download, which is much faster. The time to a solution is massively reduced because you’re not going through all the steps it takes to build the platform, and you’re not spending that time away from building business value.

A lot of organizations out there are not software development companies, they are companies with a different purpose. They are a bank, or a finance organization, or a healthcare organization. Value to the organization is massively increased if you’re not spending huge amounts of time building something that someone else has already built. Focusing on business value enables you to build out solutions quickly and bring value to your business much more quickly – utilizing a prebuilt solution rather than building it yourself.

Build vs. Buy - Streaming Data Integration

Support

I’ve mentioned that open source can become obsolescent. There are cases where organizations invested in open source technologies that the community simply stopped developing because they were no longer the hot new thing. Those organizations then had to try and maintain the code themselves, understanding someone else’s code in order to keep the open source they’d heavily invested in working.

If you are buying a streaming data integration solution from a vendor, then that vendor is going to support you. Building it yourself, by contrast, requires a wide range of specialized skills. Each one of the pieces of open source requires a different skill set in order to understand it and work with it. It may be a different programming language; it may be a different environment that you need to set up. There’s also a meta-skill required in understanding how all those components fit together. This is an architectural role that has to understand all the different APIs and components, and work out how to add all the enterprise-grade features so that they all scale, are reliable, and work together. If you’re buying, the costs really are the license cost and limited visibility into the code, into the engine that makes it work.

Depending on who you are, that may or may not be a big issue. As long as what you’ve built, and the concepts around what you’ve built, are transportable, then your investment in streaming data integration is also transportable.

What are the differences between build vs. buy? It really comes down to this: do you want to get going with building your solution and providing business value straight away? Or do you want to focus on being a software development organization, building a platform on which you can then build the applications that will provide business value?

To learn more about build vs. buy for streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

Whiteboard Wednesdays - Streaming Data Integration

Streaming Data Integration: Whiteboard Wednesdays

 

 

In this Whiteboard Wednesday video, Steve Wilkes, founder and CTO of Striim, takes a look at Streaming Data Integration – what it is, what it is used for, and most importantly, what is needed to set it up and manage it. Read on, or watch the 10-minute video:

Like all enterprises, you have a lot of data sources that generate data – databases, machine logs, or other files that are being produced in your systems. You may also have messaging systems, sensors, or IoT devices generating data. This data, generated all of the time, may not be in the format you need or where you need it. This limits your ability to get value from it or make decisions using it.

In terms of moving it somewhere else, maybe you want to put it into cloud storage or other types of storage. Maybe you want to put it into different databases or data warehouses. Maybe you want to put it onto messaging systems to build applications for analytics and reporting that enable analysts, data scientists, decision makers, and your customers to get value out of it. Streaming data integration is about being able to collect data in real time from the systems that are generating it, and moving it to where it needs to be to get value out of it.

Data Collection

Continuous data collection means being able to get the data in real time. Each type of data source requires a different way of collecting the data. Databases are quite often associated with records of what has happened in the past. Running queries against a database can provide bulk responses of historical data. In order to capture real-time data – that is, the data as it’s being generated – you need to use a technology called change data capture, or CDC.

CDC will allow you to collect the inserts, updates, and deletes that are happening in the database in real time as they’re being generated. Similarly, you need specific technologies to get the data you want from other sources such as files or messaging systems. In the case of files, you may need to tail the file, reading from the end of the file as it’s written, or collect multiple files across multiple systems in parallel. Once you’ve got that data, you need to be able to deliver it to the target systems.

Streaming Data Integration - Sources

Data Delivery

Data delivery depends on the type of target system. If you are using database change data capture on the source side, the delivery process also needs to be change aware, so that an insert, an update, or a delete is managed correctly in the target system. For example, if you are synchronizing an on-premise database with a cloud database, then you need to initially load the data, but you also need to apply change to it in order to keep it synchronized. The delivery process needs to understand that, instead of just inserting all the events into the database, it should treat an insert as an insert, an update as an update, and a delete as a delete. This needs to be applied in a heterogeneous fashion.
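As a concrete illustration, the sketch below shows how a change-aware delivery step might apply captured operations to a target table rather than blindly inserting every event. The table and column names are hypothetical, and in practice a delivery component generates this DML for you.

```sql
-- Sketch only: a change-aware apply step maps each captured operation to the
-- equivalent DML on the target. Table and column names are hypothetical.

-- A captured INSERT event becomes an insert:
INSERT INTO orders_target (order_id, status, amount)
VALUES (:order_id, :status, :amount);

-- A captured UPDATE event becomes an update of the matching row:
UPDATE orders_target
SET    status = :status,
       amount = :amount
WHERE  order_id = :order_id;

-- A captured DELETE event becomes a delete of the matching row:
DELETE FROM orders_target
WHERE  order_id = :order_id;
```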

You also need to be able to work with lots of different APIs. Again, just as you need to be able to work with databases, with storage, and with messaging, you also need to be able to support different cloud systems. The security mechanisms used by the different cloud vendors are all different, so you need to be able to manage that.

Data Movement

Streaming data integration needs to continuously move the data that’s being collected into the appropriate target. It may be required to collect data from multiple sources, combine them together, and push them out to a single target, or take data from a single source and deliver it continuously to multiple targets.

Data movement has to be smart. It has to be able to move data between processes, between nodes, in a clustered environment. It has to be able to move data from on-premise to cloud, or from cloud to on-premise, or across different networks and even across different clouds. Also, this data movement has to be at really high speeds and low latency. You need to be able to deliver the data from one place to another really quickly.

In a lot of use cases, you’re not just dealing with the raw data. If you’re doing database replication, then you can collect change on one side and apply that change exactly as is to the target. But in a lot of cases you want to manipulate the data in some way. The format, structure, or content of the data you collect may not be correct or sufficient for where you want to put it, so you may need to do some kind of stream processing.

Streaming Data Integration - Targets

Data Processing

Data processing is about being able to take the data that’s on the data stream and manipulate it. You may want to filter it. Maybe everything you collect doesn’t need to be written to everything you deliver to. Maybe only a subset is written to storage; maybe only a subset goes into messaging. You need to be able to filter that data and choose what goes where.

You also need to be able to transform it, to give it the structure that you want, and to apply functions to it in order to make it look like you need.

You also need to worry about security. Maybe you need to mask some of the data, obfuscate it in some way. If you’re doing cloud analytics and you have Personally Identifiable Information (PII) coming from your sources, then it’s probably a good idea to anonymize that data before you put it into the cloud for analytics. Being able to do that as the data streams to a target is really important.
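For example, masking can be expressed as a simple in-flight transformation. The following SQL-style sketch is illustrative only: the stream and field names are invented, and HASH() stands in for whatever tokenization or masking function a given platform provides.

```sql
-- Illustrative only: anonymize PII in-flight before delivery to cloud
-- analytics. Stream and field names are hypothetical; HASH() is a placeholder
-- for a real tokenization or masking function.
SELECT  c.customer_id,
        '***-**-' || SUBSTR(c.ssn, 8, 4)  AS ssn_masked,    -- keep only last 4 digits
        HASH(c.email)                      AS email_token,   -- tokenized, not readable
        c.city,
        c.purchase_amount
FROM    CustomerEventStream c;
```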

Maybe you need to be able to encrypt the data or combine data together. You can take data from multiple sources and combine it into a single data stream that you then apply into a target. Maybe you need to aggregate or summarize the data before you write it into a data warehouse. Maybe it doesn’t contain enough information by itself. Maybe you need to denormalize it or enrich it before it is delivered somewhere else. For streaming integration platforms, you need to be able to do all of these different types of processing on the data in order to get it in the form that you want before you deliver it.

Data Insights

You also need to be able to get insights on the data, to see how the data’s flowing, whether there are any issues in the data flow, maybe even investigate the data itself if there are any issues. This can be done ad hoc, but from an operational perspective, it’s also important that you get alerts if anything is failing or if there is an unusual volume of data coming from one particular place. There suddenly could be an unusual volume coming from your audit table, or maybe there isn’t sufficient volume and it’s fallen behind what you’d expect for that time of day. Being able to get alerts on the data flows themselves can be really, really important.

Streaming Data Integration - Enterprise Grade

Enterprise-Grade Integration

You can play about with streaming data integration components in the lab. But if you’re going to put things into production, streaming data integration needs to be enterprise-grade. Everything needs to be able to scale. Typically, you’re talking about large volumes of data being generated continuously for streaming integration. You need to be able to deal with scalability and adding additional scale over time.

You are probably looking at a cluster distributed environment where you can add new nodes into that environment as you need to in order to be able to handle scale. Of course, for mission-critical platforms for mission-critical data flows, it needs to be reliable.

You need to be able to say, if I collect data, if data is generated on this source side, then I have to be able to guarantee that it makes it into my targets. If I am replicating a database that’s on-premise into a data warehouse that is in the cloud, I need to make sure that every single operation makes it from one end to the other, and that I can investigate that and validate that that has actually happened.

For streaming data integration to be enterprise grade and work with mission-critical applications, you need to be able to ensure that it is reliable. At a minimum, there should be at-least-once processing, ensuring that everything that’s generated on the source makes it to the target. In certain cases, you need exactly-once processing. This means that not only does the data make it to the target, but you are guaranteed that, even in the case of failures, it only ever makes it to the target once.

Security is important. We’ve talked about data security in terms of being able to mask and encrypt data as it flows. As things are moving across networks, all of that data should be encrypted so that people can’t tap into the network flow. From a security perspective, you need to be able to lock down who has access to what in the system that is doing all of these things. If you’re generating a data stream on-premise and it has personally identifiable information in it, you need to be able to lock down which employees can access that data stream, even through the streaming platform. Being able to do authentication and authorization within the streaming platform itself is also essential.

In summary, streaming data integration is about continually collecting data from enterprise and cloud sources, being able to process that data while it’s moving, delivering it into any target that you want, whether it’s on-premise or cloud, being able to get insights into how it’s moving, and doing all of this in an enterprise-grade way that is scalable, reliable, and secure.

To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

Change Data Capture - change log

Streaming Data Integration Tutorial: Using CDC to Stream Database Changes

 

This is the first in a two-part blog post discussing how to use Striim for streaming database changes to Apache Kafka. Striim offers continuous data ingestion from databases and other sources in real time; transformation and enrichment using Streaming SQL; delivery of data to multiple targets in the cloud or on-premise; and visualization of results. In this part, we will use Striim’s low-impact, real-time change data capture (CDC) feature to stream database changes (inserts, updates, and deletes) from an operational database into Striim.

What is Change Data Capture

Databases maintain change logs that record all changes made to the database contents and metadata. These change logs can be used for database recovery in the event of a crash, and also for replication or integration.

Change data capture change log

With Striim’s log-based CDC, new database transactions – including inserts, updates, and deletes – are read from source databases’ change logs and turned into a stream of events without impacting the database workload. Striim offers CDC for Oracle, SQL Server, HPE NonStop, MySQL, PostgreSQL, MongoDB, and MariaDB.

Why use Striim’s CDC?

Businesses use Striim’s CDC capabilities to feed real-time data to their big data lakes, cloud databases, and enterprise messaging systems, such as Kafka, for timely operational decision making. They also migrate from on-premises databases to cloud environments without downtime and keep cloud-based analytics environments up-to-date with on-premises databases using CDC.

How to use Striim’s CDC?

Striim’s easy-to-use CDC template wizards automate the creation of applications that leverage change data capture, to stream events as they are created, from various source systems to various targets. Apps created with templates may be modified using Flow Designer or by exporting TQL, editing it, and importing the modified TQL. Striim has templates for many source-target combinations.

In addition, Striim offers pre-built integration applications for bulk loading and CDC from PostgreSQL source databases to target systems including PostgreSQL database, Kafka, and files. You can start these applications in seconds by going to the Applications section of the Striim platform.

Striim Pre-built Sample Integration Applications
Striim pre-built sample integration applications.

In this post, we will show how to use the PostgreSQL CDC (PostgreSQL Reader) with a Striim Target using the wizards for a custom application instead of using the pre-built application mentioned above. The instructions below assume that you are using the PostgreSQL instance that comes with the Striim platform. If you are using your own PostgreSQL database instance, please review our instructions on how to set up PostgreSQL for CDC.
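If you are preparing your own PostgreSQL instance, the checks below show the kind of server-side configuration that log-based CDC via logical decoding typically relies on. Treat this as a general sketch rather than the official setup procedure; the Striim documentation has the authoritative requirements.

```sql
-- General sketch of PostgreSQL-side checks for logical-decoding-based CDC.
-- Exact requirements come from the Striim setup documentation.
SHOW wal_level;                -- should be 'logical' for logical decoding
SHOW max_replication_slots;    -- must allow at least one slot for the reader
SHOW max_wal_senders;          -- must allow at least one WAL sender

-- If wal_level is not 'logical', a superuser can change it
-- (the server must be restarted for this setting to take effect):
ALTER SYSTEM SET wal_level = 'logical';
```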

Using the CDC Template

To start building the CDC application, in the Striim web UI, go to the Apps page and select Add App > Start with Template. Enter PostgreSQL in the search field to narrow down the sources and select “PostgreSQL Reader to Striim”.

CDC application template
Wizard template selection when creating a new app.

Next enter the name and namespace for your application (the namespace is a way of grouping applications together).

Creating a new application using Striim.

Specifying the Data Source Properties

In the SETUP POSTGRESQL READER step, specify the data source and table properties:

  • the connection URL, username, and password.
  • the tables for which you want to read change data.
Configuring the Data Source in the Wizard
Configuring the data source in the wizard.

After you complete this step, your application will open in the Flow Designer.

The Wizard Generates a Data Flow
The wizard generates a data flow.

In the flow designer, you can add various processors, enrichers, transformers, and targets as shown below to complete your pipeline, in some cases with zero coding.

Flow designer enrichers and processors.

Flow designer event transformers and targets.

 

In the next blog post, we will discuss how to add a Kafka target to this data pipeline. In the meantime, please feel free to request a demo with one of our lead technologists, tailored to your environment.

 

How to Migrate Oracle Database to Google Cloud SQL for PostgreSQL with Streaming Data Integration

 

 

For those who need to migrate an Oracle database to Google Cloud, the ability to move mission-critical data in real time between on-premises and cloud environments without either database downtime or data loss is paramount. In this video, Alok Pareek, Founder and EVP of Products at Striim, demonstrates how the Striim platform enables Google Cloud users to build streaming data pipelines from their on-premises databases into their Cloud SQL environment with reliability, security, and scalability. The full 8-minute video is available to watch below:

Striim offers an easy-to-use platform that maximizes the value gained from cloud initiatives, including cloud adoption, hybrid cloud data integration, and in-memory stream processing. This demonstration illustrates how Striim feeds real-time data from mission-critical applications, from a variety of on-prem and cloud-based sources, to Google Cloud without interruption of critical business operations.

Oracle database to Google Cloud

Through different interactive views, Striim users can develop Apps to build data pipelines to Google Cloud, create custom Dashboards to visualize their data, and Preview the Source data as it streams to ensure they’re getting the data they need. For this demonstration, Apps is the starting point from which to build the data pipeline.

There are two critical phases in this zero-downtime data migration scenario. The first involves the initial load of data from the on-premise Oracle database into the Cloud SQL Postgres database. The second is the synchronization phase, achieved through specialized readers to keep the source and target consistent.

Oracle database to Google Cloud
Striim Flow Designer

The pipeline from the source to the target is built using a flow designer that easily creates and modifies streaming data pipelines. The data can also be transformed while in motion, to be realigned or delivered in a different format. Through the interface, the properties of the Oracle database can also be configured – allowing users extensive flexibility in how the data is moved.

Once the application is started, the data can be previewed, and progress monitored. While in-motion, data can be filtered, transformed, aggregated, enriched, and analyzed before delivery. With up-to-the-second visibility of the data pipeline, users can quickly and easily verify the ingestion, processing, and delivery of their streaming data.

Oracle database to Google Cloud

During the initial load, the source data in the database is continually changing. Striim keeps the Cloud SQL Postgres database up-to-date with the on-premises Oracle database using change data capture (CDC). By reading the database transactions in the Oracle redo logs, Striim collects the insert, update, and delete operations as soon as the transactions commit, and applies only those changes to the target. This is done without impacting the performance of source systems, while avoiding any outage to the production database.

By generating DML activity using a simulator, the demonstration shows how inserts, updates, and deletes are managed. Running DML operations against the orders table, the preview shows not only the data being captured, but also metadata including the transaction ID, the system commit number, the table name, and the operation type. When you then query the orders table on the target, the data is present in the table.

The initial upload of data from the source to the target, followed by change data capture to ensure source and target remain in-sync, allows businesses to move data from on-premises databases into Google Cloud with the peace of mind that there will be no data loss and no interruption of mission-critical applications.

Additional Resources:

To learn more about Striim’s capabilities to support the data integration requirements for a Google hybrid cloud architecture, check out all of Striim’s solutions for Google Cloud Platform.

To read more about real-time data integration, please visit our Real-Time Data Integration solutions page.

To learn more about how Striim can help you migrate Oracle database to Google Cloud, we invite you to schedule a demo with a Striim technologist.

 

Network Traffic Monitoring with Striim 2

Network Traffic Monitoring with Striim – Part 2

In the first post of this two-part series, we demonstrated how to use Striim’s ability to collect, integrate, process, analyze and visualize large streaming data in real time. In this post, we will provide more details about the machine learning pipeline to forecast network traffic and detect anomalies in Striim.

First, Striim reads raw data from files and then processes and aggregates the raw data. After aggregation, the data volume is dramatically reduced. In this application, it aggregates total network traffic for all IPs per minute. Thus it only has one data point for each minute to train the machine learning models.

In the machine learning pipeline, it first preprocesses the data, then fits the data into the machine learning models for prediction. Finally, it detects anomalies if the actual value is far from the predicted one.

  • Data Processing: Before fitting data into sophisticated models, it is usually necessary to transform and preprocess the data. Striim supports a collection of data preprocessing methods including standardization and normalization to make training less sensitive to the scale of features, power transformation (like square root, log) to stabilize volatile data, and seasonality decomposition like Loess or STL decomposition to remove the seasonality in time-series data, etc. In this application, we use log transformation to make data more stable and improve prediction accuracy.
  • Automated Prediction: Striim first transforms time series regression into a standard regression problem with lagged variables (i.e., autoregression), and then applies any regression model to it for prediction. Striim has a plethora of machine learning models including linear and non-linear regression, SVM, Gaussian process regression, random forest, Bayesian networks, deep learning models, etc. In this application, we use Random Forest for several reasons:
    • It is efficient enough to retrain continuously over streaming data. Some models, like deep neural networks, are time-consuming and not suitable for online training. 
    • It uses the ensembling method which combines multiple ML models to obtain better performance. 
    • It achieves good predictive performance compared to other models according to our comprehensive empirical study.
  • Anomaly Detection: Striim detects anomalies based on its predictions. The main idea is that anomalies tend to have a large difference between the actual value and the predicted value. Striim uses a statistics-based threshold, flagging an anomaly when the percentage error between the actual and predicted values is larger than the threshold. In this application, Striim assumes percentage errors follow a normal distribution and defines the threshold accordingly; for example, the threshold can be the mean plus two standard deviations of the percentage errors, as sketched in the query after this list.
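The anomaly rule described above could be expressed as a simple query over the prediction results. This is a sketch only; the table and column names are hypothetical and do not come from the application itself.

```sql
-- Sketch only: flag an anomaly when the percentage error between actual and
-- predicted traffic exceeds mean + 2 * stddev of recent percentage errors.
-- Table and column names are hypothetical.
SELECT  minute_ts,
        actual_bytes,
        predicted_bytes,
        ABS(actual_bytes - predicted_bytes) / predicted_bytes AS pct_error
FROM    traffic_predictions
WHERE   ABS(actual_bytes - predicted_bytes) / predicted_bytes
        > :mean_pct_error + 2 * :stddev_pct_error;
```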

The models can update continuously on the streaming data. One can bound the training data size and control the update frequency. In this application, we use the most recent 200 data points as training data and update the ML models whenever a new data point arrives.

For more details, please see our 2019 DEBS paper: A Demonstration of Striim, A Streaming Integration and Intelligence Platform. Striim won the Best Demo Award at the 13th ACM International Conference on Distributed and Event-based Systems (DEBS).

 

Real-Time Continuous Data Movement and Processing

Real-Time Data is for Much More Than Just Analytics

Striim’s Real-Time Data is for Much More Than Just Analytics article was originally published on Forbes.

The conversation around real-time data, fast data and streaming data is getting louder and more energetic. As the age of big data fades into the sunset — and many industry folks are even reluctant to use the term — there is much more focus on fast data and obtaining timely insights. The focus of many of these discussions is on real-time analytics (otherwise known as streaming analytics), but this only scratches the surface of what real-time data can be used for.

If you look at how real-time data pipelines are actually being utilized, you find that about 75% of the use cases are integration related. That is, continuous data collection creates real-time data streams, which are processed and enriched and then delivered to other systems. Often these other systems are not themselves streaming. The target could be a database, data warehouse or cloud storage, with a goal of ensuring that these systems are always up to date. This leaves only about 25% of companies doing immediate streaming analytics on real-time data. But these are the use cases that are getting much more attention.

There are many reasons why streaming data integration is more common, but the main reason is quite simple: This is a relatively new technology, and you cannot do streaming analytics without first sourcing real-time data. This is known as a “streaming first” data architecture, where the first problem to solve is obtaining real-time data feeds.

Organizations can be quite pragmatic about this and approach stream-enabling their sources on a need-to-have, use-case-specific basis. This could be because batch ETL systems no longer scale or batch windows have gone away in a 24/7 enterprise. Or, they want to move to more modern technologies, which are most suitable for the task at hand, and keep them continually up to date as part of a digital transformation initiative.

Cloud Is Driving Streaming Data Integration

The rise of cloud has made a streaming-first approach to data integration much more attractive. Simple use cases, like migrating an on-premise database that services an in-house business application to the cloud, are often not even viable without streaming data integration.

The naive approach would be to back up the database, load it into the cloud and point the cloud application at it. However, this assumes a few things:

1. You can afford application downtime.

2. Your application can be stopped while you are doing this.

3. You can spin up and use the cloud application without testing it.

For most business-critical applications, none of these things are true.

A better approach to minimizing or eliminating downtime is an online migration that keeps the application running. To perform this task, you source changes from the in-house database as real-time data streams, using a technology called change data capture (CDC); load the database to the cloud; then apply any changes from the real-time stream that happened while you were doing the loading. The change delivery to the cloud can be kept running while you test the cloud application, and when you cut over, it will already be up to date.

Streaming data integration is a crucial element of this type of use case, and it can also be applied to cloud bursting, operational machine learning, large scale cloud analytics or any other scenario where having up-to-the-second data is essential.

Streaming Data Integration Is The Precursor To Streaming Analytics

Once organizations are doing real-time data collection, typically for integration purposes, it then opens the door to doing streaming analytics. But you can’t put the cart before the horse and do streaming analytics unless you already have streaming data.

Streaming analytics also requires preprepared data. It’s a commonly known metric that 80% of the time spent in data science is in data preparation. This is true for machine learning and also true for streaming analytics. Obtaining the real-time data feed is just the beginning. You may also need to transform, join, cleanse and enrich data streams to give the data more context before performing analytics.

As a simple example, imagine you are performing CDC on a source database and have a stream of orders being made by customers. In any well-normalized, relational database, these tables are mostly just numbers relating to detail contained in other tables.

This might be perfect for a relational, transactional system, but it’s not very useful for analytics. However, if you can join the streaming data with reference data for customers and items, you have now added more context and more value. The analytics can now show real-time sales by customer location or item category and truly provide business insights.
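As an illustration, the enrichment described here could look like the following SQL-style sketch, which joins the change stream with in-memory caches of customer and item reference data. The stream, cache, and field names are invented for this example.

```sql
-- Illustrative only: enrich a CDC order stream with cached reference data so
-- analytics can report sales by customer location and item category.
-- Stream, cache, and field names are hypothetical.
SELECT  o.order_id,
        o.quantity * o.unit_price  AS order_value,
        c.city                     AS customer_city,
        i.category                 AS item_category
FROM    OrderChangeStream o
JOIN    CustomerCache     c ON o.customer_id = c.customer_id
JOIN    ItemCache         i ON o.item_id     = i.item_id;
```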

Without the processing steps of streaming data integration, the streaming analytics would lose value, again showing how important the real-time integration layer really is.

Busting The Myth That Real-Time Data Is Prohibitively Expensive

A final consideration is cost. Something that has been said repeatedly is that real-time systems are expensive and should only be used when absolutely necessary. The typically cited use cases are algorithmic trading and critical control systems.

While this may have been true in the past, the massive improvements in the price-performance equation for CPU and memory over the last few decades have made real-time systems, and in-memory processing in general, affordable for mass consumption. Coupled with cloud deployments and containerization, the capability to have real-time data streamed to any system is within reach of any enterprise.

While real-time analytics and instant operational insights may get the most publicity and represent the long-term goal of many organizations, the real workhorse behind the scenes is streaming data integration. 

Simplify Your Azure Hybrid Cloud Architecture with Streaming Data Integration

While the typical conversation about Azure hybrid cloud architecture may be centered around scaling applications, VMs, and microservices, the bigger consideration is the data. Spinning up additional services on-demand in Azure is useless if the cloud services cannot access the data they need, when they need it.

“According to a March 2018 hybrid cloud report from 451 Research and NTT Communications, around 63% of firms have a formal strategy for hybrid infrastructure. In this case, hybrid cloud does not simply mean using a public cloud and a private cloud. It means having a seamless flow of data between all clouds, on and off-premises.” – Data Foundry

To help simplify providing a seamless flow of data to your Microsoft Azure hybrid cloud infrastructure, we’re happy to announce that the Striim platform is available in the Microsoft Azure Marketplace.

How Streaming Data Integration Simplifies Your Azure Hybrid Cloud Architecture

Enterprise-grade streaming data integration enables continuous real-time data movement and processing for hybrid cloud, connecting on-prem data sources and cloud environments, as well as bridging a wide variety of cloud services. With in-memory stream processing for hybrid cloud, companies can store only the data they need, in the format that they need. Additionally, streaming data integration enables delivery validation and data pipeline monitoring in real time.

Streaming data integration simplifies real-time streaming data pipelines for cloud environments. Through non-intrusive change data capture (CDC), organizations can collect real-time data without affecting source transactional databases. This enables cloud migration with zero database downtime and minimized risk, and feeds real-time data to targets with full context – ready for rich analytics on the cloud – by performing filtering, transformation, aggregation, and enrichment on data-in-motion.

Azure Hybrid Cloud Architecture

Key Traits of a Streaming Data Integration Solution for Your Azure Hybrid Cloud Architecture

There are three important objectives to consider when implementing a streaming data integration solution in an Azure hybrid cloud architecture:

  • Make it easy to build and maintain – The ability to use a graphical user interface (GUI) and a SQL-based language can significantly reduce the complexity of building streaming data pipelines, allowing more team members within the company to maintain the environment.
  • Make it reliable – Enterprise hybrid cloud environments require a data integration solution that is inherently reliable with failover, recovery and exactly-once processing guaranteed end-to-end, not just in one slice of the architecture.
  • Make it secure – Security needs to be treated holistically, with a single authentication and authorization model protecting everything from individual data streams to complete end-user dashboards. The security model should be role-based with fine-grained access, and provide encryption for sensitive resources.

Striim for Microsoft Azure

The Striim platform for Azure is an enterprise-grade data integration platform that simplifies an Azure-based hybrid cloud infrastructure. Striim provides real-time data collection and movement from a variety of sources such as enterprise databases (e.g., Oracle, HPE NonStop, SQL Server, PostgreSQL, Amazon RDS for Oracle, and Amazon RDS for MySQL, via low-impact, log-based change data capture), as well as log files, sensors, messaging systems, NoSQL and Hadoop solutions.

Once the data is collected in real time, it can be streamed to a wide variety of Azure services including Azure Cosmos DB, Azure SQL Database, Azure SQL Data Warehouse, Azure Event Hubs, Azure Data Lake Storage, and Azure Database for PostgreSQL.

While the data is streaming to Azure, Striim enables in-stream processing such as filtering, transformations, aggregations, masking, and enrichment, making the data more valuable when it lands. This is all done with sub-second latency, reliability, and security via an easy-to-use interface and SQL-based programming language.

To learn more about Striim’s capabilities to support the data integration requirements for an Azure hybrid cloud architecture, read today’s press release announcing the availability of the Striim platform in the Microsoft Azure Marketplace, and check out all of Striim’s solutions for Azure.

Log-based Change Data Capture

Log-Based Change Data Capture: the Best Method for CDC

 

 

Change data capture, and in particular log-based change data capture, has become popular in the last two decades as organizations have discovered that sharing real-time transactional data from OLTP databases enables a wide variety of use-cases. The fast adoption of cloud solutions requires building real-time data pipelines from in-house databases, in order to ensure the cloud systems are continually up to date. Turning enterprise databases into a streaming source, without the constraints of batch windows, lays the foundation for today’s modern data architectures. In this blog post, I would like to discuss Striim’s CDC capabilities along with its unique features that enhance the change data capture, as well as its processing and delivery across a wide range of sources and targets.


In our previous blog post on Change Data Capture Methods, we explained why log-based change data capture is a better method to identify and capture change data. Striim uses the log-based CDC technique for the same reasons we stated in that post: log-based CDC minimizes the overhead on the source systems, reducing the chances of performance degradation. In addition, it is non-intrusive: it does not require changes to the application, as adding triggers to tables would. It is a lightweight yet highly performant way to ingest change data. While Striim reads DML operations (INSERTs, UPDATEs, DELETEs) from the database logs, these systems continue to run with high performance for their end users.
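
As a rough illustration of how such a log-based source is declared in Striim’s SQL-based language (TQL), consider the hedged sketch below. The adapter name (MysqlReader) and property keys are assumptions modeled on Striim’s MySQL CDC adapter and should be verified against the documentation; the connection details are placeholders.

    -- Hedged TQL sketch: a log-based CDC source that reads committed DML
    -- from the MySQL binlog and publishes it as a stream (names assumed).
    CREATE SOURCE OrdersCDC USING MysqlReader (
      ConnectionURL: 'mysql://dbhost:3306',
      Username: 'striim',
      Password: '********',
      Tables: 'sales.orders'          -- capture changes only for this table
    )
    OUTPUT TO OrdersChangeStream;     -- downstream queries and targets consume this stream

No triggers or application changes are involved; the reader tails the database log, so the source system keeps serving its users at full speed.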

Striim’s strengths for real-time CDC are not limited to the ingestion point. Here are a few capabilities of the Striim platform that build on its real-time, log-based change data capture in enabling robust, end-to-end streaming data integration solutions:

  1. Log-based CDC from heterogeneous databases for non-intrusive, low-impact real-time data ingestion: Striim uses log-based change data capture when ingesting from major enterprise databases, including Oracle, HPE NonStop, MySQL, PostgreSQL, and MongoDB, among others. It minimizes CPU overhead on sources, does not require application changes, and avoids substantial management overhead in maintaining the solution.
  2. Ingestion from multiple, concurrent data sources to combine database transactions with semi-structured and unstructured data. Striim’s real-time data ingestion is not limited to databases and the CDC method. With Striim you can merge real-time transactional data from OLTP systems with real-time log data (i.e., machine data), messaging systems’ events, sensor data, NoSQL, and Hadoop data to obtain rich, comprehensive, and reliable information about your business.
  3. End-to-end change data integration: Striim is designed from the ground-up to ingest, process, secure, scale, monitor, and deliver change data across a diverse set of sources and targets in real time. It does so by offering several robust capabilities out of the box:
    • Transaction integrity: When ingesting the change data from database logs, Striim moves committed transactions with the transactional context (i.e., ACID properties) maintained. Throughout the whole data movement, processing, and delivery steps, this transactional context is preserved so that users can create reliable replica databases, such as in the case of cloud bursting.
    • In-flight change data processing: Striim offers out-of-the-box transformers and in-memory stream processing capabilities to filter, aggregate, mask, transform, and enrich change data while it is in motion. Using SQL-based continuous queries, Striim immediately turns change data into a consumable format for end users, without losing transactional context (a rough sketch of such a query follows this list).
    • Built-in checkpointing for reliability: As the data moves and gets processed through the in-memory components of the Striim platform, every operation is recorded and tracked by the solution. If there is an outage, Striim can replay the transactions from where it left off, without missing data or creating duplicates.
    • Distributed processing in a clustered environment: Striim comes with a clustered environment for scalability and high availability. Without much effort, and using inexpensive hardware, you can scale out for very high data volumes with failover and recoverability assurances. With Striim, you don’t need to build your own clusters with third-party products.
    • Continuous monitoring of change data streams: Striim continuously tracks change data capture, movement, processing, and delivery processes, as well as the end-to-end integration solution via real-time dashboards. With Striim’s transparent pipelines, you have a clear view into the health of your integration solutions.
    • Schema change replication: When the source Oracle database schema is modified via a DDL statement, Striim applies the schema change to the target system without pausing the processes.
    • Data delivery validation: For database sources and targets, Striim offers out-of-the-box data delivery verification. The platform continuously compares the source and target systems as the data is moving, validating that the databases are consistent and all changed data has been applied to the target. In use cases where data loss must be avoided, such as migration to a new cloud data store, this feature significantly reduces migration risk.
    • Concurrent, real-time delivery to a wide range of targets: With the same software, Striim can deliver change data in real time not only to on-premise databases but also to databases running in the cloud, cloud services, messaging systems, files, IoT solutions, and Hadoop and NoSQL environments. Striim’s integration applications can have multiple targets with concurrent real-time data delivery.
    • Pre-packaged applications for initial load and CDC: Striim comes with example integration applications that include initial load and CDC for PostgreSQL environments. These integration applications enable setting up data pipelines in seconds, and serve as a template for other CDC sources as well.
  4. Turning change data into time-sensitive insights: In addition to building real-time integration solutions for change data, Striim can perform streaming analytics with flexible time windows, allowing you to gain immediate insights from your data in motion. For example, if you are moving financial transactions using Striim, you can build real-time dashboards that alert on potential fraud cases before Striim delivers the data to your analytics solution.
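
To make the in-flight processing and checkpointing points above slightly more concrete, here is a hedged TQL sketch of an application declared with a recovery interval that enriches change events by joining them with a cache of reference data. The recovery clause, cache and type declarations, adapter names, and join syntax are all assumptions for illustration; the exact syntax should be checked against the Striim documentation.

    -- Hedged TQL sketch: a recoverable application that enriches change data
    -- in flight with cached reference data (all names and syntax assumed).
    CREATE APPLICATION EnrichOrders RECOVERY 10 SECOND INTERVAL;  -- periodic checkpointing (assumed clause)

    CREATE TYPE CustomerType (
      customer_id String,
      customer_name String,
      region String
    );

    -- Reference data loaded into memory for low-latency lookups.
    CREATE CACHE CustomerCache USING FileReader (
      directory: '/data/reference',
      wildcard: 'customers.csv'
    )
    PARSE USING DSVParser (header: 'yes')
    QUERY (keytomap: 'customer_id') OF CustomerType;

    -- Join each change event with the matching customer record,
    -- adding business context without losing the event itself.
    CREATE CQ EnrichOrderEvents
    INSERT INTO EnrichedOrders
    SELECT o.data[0] AS order_id,          -- field positions are assumptions
           c.customer_name,
           c.region
    FROM OrdersChangeStream o, CustomerCache c
    WHERE o.data[1] = c.customer_id;

    END APPLICATION EnrichOrders;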

Log-based change data capture is the modern way to turn databases into streaming data sources. However, ingesting the change data is only the first of many concerns that integration solutions should address. You can learn more about Striim’s CDC offering by scheduling a demo with a Striim technologist or experience its enterprise-grade streaming integration solution first-hand by downloading a free trial.

 

Microsoft SQL Server to Kafka

Microsoft SQL Server CDC to Kafka

By delivering high volumes of data using Microsoft SQL Server CDC to Kafka, organizations gain visibility into their business and the vital context needed for timely operational decision making. Getting maximum value from Kafka solutions requires ingesting data from a wide variety of sources – in real time – and delivering it to the users and applications that need it to take informed action to support the business.

Traditional methods used to move data, such as ETL, are just not sufficient to support high-volume, high-velocity data environments. These approaches delay getting data to where it can be of real value to the organization. Moving all the data, regardless of relevance, to the target creates challenges in storing it and getting actionable data to the applications and users that need it. Microsoft SQL Server CDC to Kafka minimizes latency and prepares data so it is delivered in the correct format for different consumers to utilize.

In most cases, the data that resides in transactional databases like Microsoft SQL Server is the most valuable to the organization. The data is constantly changing, reflecting every event or transaction that occurs. Using non-intrusive, low-impact change data capture (CDC), the Striim platform moves and processes only the changed data. With Microsoft SQL Server CDC to Kafka, users manage their data integration processes more efficiently and in real time.

Using a drag-and-drop UI and pre-built wizards, Striim simplifies creating data flows for Microsoft SQL Server CDC to Kafka. Depending on users’ requirements, the data can either be delivered “as-is,” or in-flight processing can filter, transform, aggregate, mask, and enrich the data. This delivers the data in the format needed, with all the relevant context, to meet the needs of different Kafka consumers – with sub-second latency.
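
For a sense of what such a flow looks like beneath the wizard, here is a hedged TQL sketch of a SQL Server CDC to Kafka pipeline. The adapter names (MSSqlReader, KafkaWriter, JSONFormatter), the Kafka version string, and the property keys are assumptions based on typical Striim applications and should be confirmed against the documentation for your release.

    -- Hedged TQL sketch: SQL Server CDC to Kafka with optional in-flight filtering
    -- (adapter names, versions, and properties are assumptions).
    CREATE APPLICATION SQLServerCDCToKafka;

    CREATE SOURCE SQLServerCDC USING MSSqlReader (
      ConnectionURL: 'jdbc:sqlserver://dbhost:1433;DatabaseName=sales',
      Username: 'striim',
      Password: '********',
      Tables: 'dbo.orders'
    )
    OUTPUT TO OrdersChangeStream;

    -- Optional in-flight processing: forward only INSERT operations.
    CREATE CQ KeepInsertsOnly
    INSERT INTO OrderInserts
    SELECT * FROM OrdersChangeStream
    WHERE META(OrdersChangeStream, 'OperationName') = 'INSERT';  -- metadata access syntax assumed

    CREATE TARGET OrdersToKafka USING KafkaWriter VERSION '0.11.0' (
      brokerAddress: 'kafka-broker:9092',
      Topic: 'orders_cdc'
    )
    FORMAT USING JSONFormatter ()
    INPUT FROM OrderInserts;

    END APPLICATION SQLServerCDCToKafka;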

Striim is an end-to-end platform that delivers the security, recoverability, reliability (including exactly once processing), and scalability required by an enterprise-grade solution. Built-in monitoring also compares sources and targets and validates that all data has been delivered successfully. 

In addition to Microsoft SQL Server CDC to Kafka, Striim offers non-intrusive change data capture (CDC) solutions for a range of enterprise databases including Oracle, Microsoft SQL Server, PostgreSQL, MongoDB, HPE NonStop SQL/MX, HPE NonStop SQL/MP, HPE NonStop Enscribe, and MariaDB.

For more information about how to use Microsoft SQL Server CDC to Kafka to maintain real-time pipelines for continuous data movement, please visit our Change Data Capture solutions page.

If you would like a demo of how Microsoft SQL Server CDC to Kafka works and to talk to one of our technologists, please contact us to schedule a demo.