Implementing Streaming Cloud ETL with Reliability

If you’re considering adopting a cloud-based data warehousing and analytics solution, one of the most important questions is how to populate it with current on-premises and cloud data in real time using cloud ETL.

As noted in a recent study by IBM and Forrester, 88% of companies want to run near-real-time analytics on stored data. Streaming cloud ETL enables real-time analytics by loading data from operational systems in private and public clouds to your analytics cloud of choice.

A streaming cloud ETL solution enables you to continuously load data from your in-house or cloud-based mission-critical operational systems to your cloud-based analytical systems in a timely and dependable manner. Reliable and continuous data flow is essential to ensuring that you can trust the data you use to make important operational decisions.

In addition to high-performance and scalability to handle your large data volumes, you need to look for data reliability and pipeline high availability. What should you look for in a streaming cloud ETL solution to be confident that it offers the high degree of reliability and availability your mission-critical applications demand?

For data reliability, the following two common issues with data pipelines should be avoided:

  1. Data loss. When data volumes increase and pipelines back up, some solutions allow a portion of the data to be discarded. Also, when the CDC solution has limited data type support, it may not be able to capture columns that contain unsupported data types. If an outage or process interruption occurs, an incorrect recovery point can also lead to skipping some data.
  2. Duplicate data. After recovering from a process interruption, the system may create duplicate data. This issue becomes even more prevalent when processing the data with time windows after the CDC step.

How Striim Ensures Reliable and Highly Available Streaming Cloud ETL

Striim ingests, moves, processes, and delivers real-time data across heterogeneous and high-volume data environments. The software platform is built ground-up specifically to ensure reliability for streaming cloud ETL solutions with the following architectural capabilities.

Fault-tolerant architecture

Striim is designed with a built-in clustered environment with a distributed architecture to provide immediate failover. The metadata and clustering service watches for node failure, application failure, and failure of certain other services. If one node goes down, another node within the cluster immediately takes over without the need for users to do this manually or perform complex configurations.

Exactly Once Processing for zero data loss or duplicates

Striim’s advanced checkpointing capabilities ensure that no events are missed or processed twice, even when time window contents are taken into account. The platform has been tested and certified to offer real-time streaming to Microsoft Azure, AWS, and Google Cloud with event delivery guarantees.

During data ingestion, checkpointing keeps track of all events the system processes and how far they have progressed through each data pipeline. If something fails, Striim knows the last known good state and the position from which it needs to recover. Advanced checkpointing is designed to eliminate loss in data windows when the system fails.

If you have a defined data window (say 5 minutes) and the system goes down, you cannot typically restart from where you left off because you will have lost the 5 minutes’ worth of data that was in the data window. That means your source and target will no longer be completely synchronized. Striim addresses this issue by coordinating with the data replay feature that many data sources provide to rewind sources to just the right spot if a failure occurs.

In cases where data sources don’t support data replay (for example, data coming from sensors), Striim’s persistent messaging stores and checkpoints data as it is ingested. Persistent messaging allows previously non-replayable sources to be replayed from a specific point. It also allows multiple flows from the same data source to maintain their own checkpoints. To offer exactly once processing, Striim checks to make sure input data has actually been written; as a result, the platform can checkpoint, confident that the data made it to the persistent queue.
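To make the recovery idea concrete, here is a minimal sketch of checkpoint-based recovery under stated assumptions: the source (or a persistent queue in front of it) can be replayed from a given position, and the checkpoint is persisted only after delivery to the target is confirmed. The function names and checkpoint file are illustrative, not Striim internals; on its own this pattern gives at-least-once delivery, and exactly-once additionally requires the kind of target-side write verification or idempotent delivery described above.

```python
# A minimal sketch of checkpoint-based recovery, assuming a replayable source and a
# file-based checkpoint store; the names here are illustrative, not Striim internals.
import json
import os

CHECKPOINT_FILE = "pipeline.checkpoint"

def load_checkpoint() -> int:
    """Return the last source position known to be safely delivered to the target."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["position"]
    return 0

def save_checkpoint(position: int) -> None:
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"position": position}, f)

def run_pipeline(read_from_source, write_to_target):
    # On restart, rewind the source to the last known good position so nothing is skipped.
    position = load_checkpoint()
    for event, new_position in read_from_source(start=position):
        write_to_target(event)          # deliver first...
        save_checkpoint(new_position)   # ...checkpoint only once delivery is confirmed

# Example with a replayable in-memory "source" and a print-based "target".
events = [("event-1", 1), ("event-2", 2), ("event-3", 3)]
run_pipeline(lambda start: (e for e in events if e[1] > start), print)
```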

End-to-end data flow management for simplified solution architecture and recovery

Striim’s streaming cloud ETL solution also delivers streamlined, end-to-end data integration between source and target systems, which enhances reliability. The solution ingests data from the source in real time, performs transformations such as masking, encryption, aggregation, and enrichment in memory as the data moves through the stream, and then delivers the data to the target in a single network operation. All of these operations occur in one step, without touching disk, and deliver the streaming data to the target in sub-seconds. Because Striim does not require additional products, this simplified solution architecture enables a seamless recovery process and minimizes the risk of data loss or inaccurate processing.

In contrast, a data replication service without built-in stream processing requires data transformation to be performed in the target (or source), with an additional product and network hop. This minimum two-hop process introduces unnecessary data latency. It also complicates the solution architecture, exposing the customer to considerable recovery-related risks and requiring a great deal of effort for accurate data reconciliation after an outage.

For use cases where transactional integrity matters, such as migrating to a cloud database or continuously loading transactional data for a cloud-based business system, Striim also maintains ACID properties (atomicity, consistency, isolation, and durability) of database operations to preserve the transactional context.

The best way to choose a reliable streaming cloud ETL solution is to see it in action. Click here to request a customized demo for your specific environment.

Top Data Integration Use Cases For The Year Ahead

Over 80% of companies are set to use multiple cloud vendors for their data and analytics needs by 2025. Real-time data integration platforms are vital for making these plans a reality. They connect different cloud and on-premise sources and help move data around in real-time.

But the potential of this technology extends beyond cloud integration. Near instantaneous data transfer helps companies detect anomalies, make predictions, drive sales, apply machine learning (ML) models, and more. It provides a much-needed competitive edge.

As we wrap up what was an eventful year (to say the least), let’s take a look at some of the most popular data integration use cases in 2020 while looking ahead to new trends in cloud data platforms.

Moving on-premise data to the cloud

Moving data from legacy databases to the cloud in real-time reduces downtime, prevents business interruptions, and keeps databases synced.

A software process called Change Data Capture (CDC) is vital for reducing downtime. CDC allows real-time data integration (DI) to track and capture changes in the legacy system and then apply them to the cloud target once the initial load completes. CDC works later on as well, continuously syncing the two databases. This technology allows companies to move data to the cloud without locking the legacy database.

Data can also be moved bidirectionally, with some users kept in the cloud and some in the legacy database. Data can then be migrated gradually to reduce risk, which is useful if you’re dealing with mission-critical systems and can’t afford any business interruptions.

Transferring data to the cloud in real-time enables companies to offer innovative services. Courier businesses, for instance, may use real-time DI to move data from on-premise Oracle databases to Google BigQuery and run real-time analytics and reporting. They’re then able to provide customers with live shipment tracking.

Enabling real-time data warehousing in the cloud

Many companies are also turning to cloud data warehouses. This storage option is growing in popularity as it allows users to reduce the cost of ownership, improve speed, secure data, improve integration, and leverage the cloud.

But real-time analysis of data in cloud warehouses requires real-time integration platforms. They collect data from various on-prem and cloud-based sources – such as transactional databases, logs, IoT sensors – and move it to cloud warehouses.

These real-time integration platforms rely on CDC to ingest data from multiple sources without causing any modification or disruption to data production systems.

Data is then delivered to cloud warehouses with sub-second latency and in a consumable form. It’s processed in-flight using techniques such as denormalization, filtering, enrichment, and masking. In-flight data processing has multiple benefits including minimized ETL workload, reduced architecture complexity, and improved compliance with privacy regulations.

DI platforms also respect the ordering and transactionality of changes applied to cloud warehouses. And streaming integration also makes it possible to synchronize cloud data warehouses with on-premises relational databases. As a result, data can be moved to the cloud in a phased migration without disrupting the legacy environment.

Other businesses may prefer data lakes. This storage option doesn’t necessarily require data to be formatted or transformed because it can be stored in its raw state.

Adopting a multi-cloud strategy with cloud integration

Furthermore, real-time data integration allows you to be agile. You get to connect data, infrastructure, and applications in multiple cloud environments.

You can then avoid vendor lock-in and combine the cloud solutions that fit your needs.

For instance, you can have your applications write data to a data warehouse like Amazon Redshift. Meanwhile, the same records can also be inserted into another cloud vendor’s low-cost storage solution, such as Google Cloud Storage (GCS). If you later want to migrate from Redshift to BigQuery, your data will be ready in GCS for a low-friction migration.
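As a rough illustration of this dual-target pattern (not a Striim configuration), the sketch below writes the same record to Redshift over its PostgreSQL-compatible interface and mirrors it to GCS as JSON; the cluster endpoint, credentials, bucket, and table names are placeholders.

```python
# A hedged sketch of the dual-target pattern described above; the cluster endpoint,
# credentials, bucket, and table names are placeholders, and Redshift is reached through
# its PostgreSQL-compatible interface via psycopg2.
import json
import psycopg2
from google.cloud import storage

record = {"order_id": 42, "amount_usd": 129.99}

# 1) Insert the record into the Redshift data warehouse.
redshift = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="...",
)
with redshift, redshift.cursor() as cur:
    cur.execute(
        "INSERT INTO orders (order_id, amount_usd) VALUES (%s, %s)",
        (record["order_id"], record["amount_usd"]),
    )

# 2) Mirror the same record to Google Cloud Storage as JSON, ready for a later
#    low-friction migration to BigQuery if one is ever needed.
bucket = storage.Client().bucket("example-orders-archive")
bucket.blob(f"orders/{record['order_id']}.json").upload_from_string(json.dumps(record))
```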

Powering real-time applications and operations

Data integration enables companies to run real-time applications (RTA), whether these apps use on-premise or cloud databases. Real-time integration solutions move data with sub-second latency, and users perceive the functioning of RTAs as immediate and current.

Data integration can also support RTAs by transforming and cleaning data or running analytics. And applications from a wide range of fields — videoconferencing, online games, VoIP, instant messaging, ecommerce — can benefit from real-time integration.

Macy’s, for instance, makes great use of data integration platforms to scale its operations in the cloud. The giant US retailer is running real-time data pipelines in hybrid cloud environments for both operational and analytics needs. Its cloud and business apps need real-time visibility into orders and inventory; otherwise, the company might have to deal with out-of-stock items or inventory surpluses. It’s vital to avoid this scenario, especially during peak shopping periods, such as Black Friday and Cyber Monday, when Macy’s processes as many as 7,500 transactions per second.

Furthermore, real-time data pipelines are important for cloud-first apps, too. Designed specifically for cloud environments, these real-time apps can outperform on-premise competitors but require continuous data processing.

Also, real-time DI products can enhance operational reporting. Companies would receive up-to-date data from different sources and could detect immediate operational needs. Whether it’s about monitoring financial transactions, production chains, or store inventories, operational reporting adds value only if it’s delivered fast.

Detecting anomalies and making predictions

A real-time data pipeline allows companies to collect data and run various types of analytics, including anomaly detection and prediction. These two types of analytics are critical for making timely decisions. And they can be of help in many different ways.

Real-time data integration platforms, for instance, help companies manipulate IoT data produced by a range of sensor sources. Once cleaned and collected in a unified environment, this IoT data can be analyzed. The system may detect anomalies, such as high temperatures or rising pressure, and instruct a manager to act and prevent damage. Or, the data may reveal failing industrial robots that need replacement. Integration technologies also allow you to combine IoT sensor data with other data sources for better insights. Legacy technologies are rarely up to this challenge.

Besides factories and robots, sensors also monitor planes, cars, and trucks. Analyzing vehicle data can reveal if an engine is likely to fail soon if certain parts aren’t replaced. But this benefit can only be realized if various types of data are collected and analyzed in real-time. Otherwise, companies wouldn’t be able to fix engines on time. Data integration is thus vital for predictive maintenance.

Anomaly detection capabilities are especially useful in the cybersecurity field. Real-time collection and analysis of logs, IP addresses, sessions, and other pieces of information enable teams to detect and prevent suspicious transactions or credit card fraud.

Real-time analytics can also make the difference between scoring a sale or losing a customer. Up-to-the-minute suggestions based on customer emotions can push online visitors to buy products instead of going away. Companies can bring together data from multiple sources to help the system make the most relevant prediction.

Supporting machine learning solutions

Real-time DI platforms can help teams run ML models more effectively.

DI programs can save you the time you’d spend on cleaning, enriching, and labeling data. They deliver prepared data that can be pushed into algorithms.

Also, real-time architecture ensures ML models are fed with up-to-date data from various sources instead of obsolete data, as was historically the case. These real-time data streams can be used to train ML models and prepare for their deployment. Companies can develop an algorithm to spot a specific type of malicious behavior by correlating data from multiple sources.

Or, you could pass the streams through already trained algorithms and get real-time results. ML programs would be processing cleansed data from real-time pipelines and raising an alarm or executing an action once a predefined event is detected. These insights can then guide further decision-making.

Syncing records to multiple systems

Near-instantaneous data integration enables companies to sync records across multiple systems and ensure all departments always have access to up-to-date information. There are many situations in which this ability can make a difference.

Take, for example, two beverage producers that recently merged. They’ll likely have many retail customers and chemical suppliers in common but keep information about them in different databases. Some details, such as phone numbers or product prices, may not even agree. But now that those two producers are a single company, they need to find a way to merge or sync data. Integration platforms can take data from multiple repositories and update records in both companies.

Or, different departments in the same company might use siloed systems. The finance team’s system may not be linked with the receiving team’s system, which means that data updates won’t be visible to everyone. Real-time DI can link these systems and ensure data is synced.

Creating a sales and marketing performance dashboard

Companies can also use integration technologies to improve sales and marketing performance. This is done by using real-time DI products to integrate data points from internal and external sources into a unified environment. As Kelsey Fecho, growth lead at Avo, says, “If you have data in multiple platforms – point and click behavioral analytics tools, marketing tools, raw databases – the data integration tools will help you unify your data structures and control what data goes where from a user-friendly UI.”

Companies can then track sales, open rates, conversion metrics, and various other KPIs in a single dashboard. Data is visualized using charts and graphs, making it easier to spot trends in real-time and have a better sense of ROI.

And the rise of online sales and advertising makes this capability ever more relevant. Businesses now have vast amounts of data on sales and marketing activities and look for ways to extract more value.

Creating a 360-degree view of a customer

Real-time data integration platforms enable businesses to build other types of dashboards, such as a 360-degree view of a customer.

In this case, customer data is pulled from multiple systems, such as CRM, ERP, or support, into a single environment. Details on past calls, emails, purchases, chat sessions, and various other activities are added as well. And integration tech can further enrich the dashboard with external data taken from social media or data brokers.

Companies can apply predictive analytics to this wealth of data. The system could then make a personalized product recommendation or provide tips to agents dealing with demanding customers. And agents will also get to save some time. They no longer have to put customers on hold to collect information from other departments when solving an inquiry. All details are readily available. Customers will be more satisfied, too, as their problems are solved promptly.

Data integration platforms help you scale faster

The world is becoming increasingly data-driven. Realizing value from this trend starts with bringing data from disparate sources together and making it work for you. In that regard, real-time DI platforms are a game-changer. From moving data to running analytics to optimizing sales, they enable you to step up your data game and take on the competition. And to achieve these benefits, it’s vital to choose cutting-edge integration solutions that can rise to this challenge.

Change Data Capture Methods

In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. Companies use change data capture for several use cases such as cloud adoption and enabling real-time data warehousing. There are multiple common change data capture methods that you can implement depending on your application requirements and tolerance for performance overhead.

  1. Introduction
  2. Audit Columns
  3. Table Deltas
  4. Triggers
  5. Log-Based Change Data Capture

Introduction

In high-velocity data environments where time-sensitive decisions are made, change data capture is an excellent fit to achieve low-latency, reliable, and scalable data integration. With over 80% of companies planning on implementing multi-cloud strategies by 2025, picking the right change data capture method for your business is more critical than ever given the need to replicate data across multiple environments.

The business transactions captured in relational databases are critical to understanding the state of business operations. Traditional batch-based approaches to move data once or several times a day introduce latency and reduce the operational value to the organization. Change Data Capture provides real-time or near real-time movement of data by moving and processing data continuously as new database events occur.

There are several change data capture methods to identify changes that need to be captured and moved. Here are the common methods, how they work, and their advantages as well as shortcomings.

Audit Columns

By using existing “LAST_UPDATED” or “DATE_MODIFIED” columns, or by adding one if not available in the application, you can create your own change data capture solution at the application level. This approach retrieves only the rows that have been changed since the data was last extracted.

The CDC logic for this technique, sketched in code after the list, would be:

  1. Get the maximum values of the target table’s ‘Created_Time’ and ‘Updated_Time’ columns.

  2. Select all rows from the source table with a ‘Created_Time’ greater than (>) the target table’s maximum ‘Created_Time’; these are the rows created since the last CDC process was executed.

  3. Select all rows from the source table with an ‘Updated_Time’ greater than (>) the target table’s maximum ‘Updated_Time’ and a ‘Created_Time’ less than or equal to (<=) the target table’s maximum ‘Created_Time’. Rows created after that point are excluded because they were already captured in step 2.

  4. Insert the new rows from step 2 and update the existing rows from step 3 in the target.
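The following is a minimal sketch of that logic in code, assuming a hypothetical orders table present in both the source and the target with numeric (epoch-style) Created_Time and Updated_Time columns; the table, column, and file names are illustrative only.

```python
# A minimal sketch of the audit-column logic above; names are illustrative only.
import sqlite3

src = sqlite3.connect("source.db")
tgt = sqlite3.connect("target.db")

# Step 1: watermarks from the target table
max_created, max_updated = tgt.execute(
    "SELECT COALESCE(MAX(Created_Time), 0), COALESCE(MAX(Updated_Time), 0) FROM orders"
).fetchone()

# Step 2: rows created since the last CDC run
new_rows = src.execute(
    "SELECT id, amount, Created_Time, Updated_Time FROM orders WHERE Created_Time > ?",
    (max_created,),
).fetchall()

# Step 3: rows updated since the last run, excluding the new rows already found in step 2
changed_rows = src.execute(
    "SELECT id, amount, Created_Time, Updated_Time FROM orders "
    "WHERE Updated_Time > ? AND Created_Time <= ?",
    (max_updated, max_created),
).fetchall()

# Step 4: apply inserts, then updates, to the target
tgt.executemany(
    "INSERT INTO orders (id, amount, Created_Time, Updated_Time) VALUES (?, ?, ?, ?)",
    new_rows,
)
tgt.executemany(
    "UPDATE orders SET amount = ?, Created_Time = ?, Updated_Time = ? WHERE id = ?",
    [(amount, created, updated, rid) for (rid, amount, created, updated) in changed_rows],
)
tgt.commit()
```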

Pros of this method

  • It can be built with native application logic
  • It doesn’t require any external tooling

Cons of this method

  • Adds additional overhead to the database
  • DML statements such as deletes will not be propagated to the target without additional scripts to track deletes
  • Error prone and likely to cause issues with data consistency

This approach also requires CPU resources to scan the tables for the changed data and maintenance resources to ensure that the DATE_MODIFIED column is applied reliably across all source tables.

Table Deltas

You can use table delta or ‘tablediff’ utilities to compare the data in two tables for non-convergence. You can then use additional scripts to apply the deltas from the source table to the target as another approach to change data capture. There are several examples of SQL scripts that can find the differences between two tables.
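As a simple illustration (not a specific vendor’s tablediff utility), the sketch below compares two hypothetical snapshot tables with SQL EXCEPT to surface the deltas; the table and file names are placeholders.

```python
# A simple sketch of the table-delta idea using SQL EXCEPT, assuming hypothetical
# "orders_snapshot_prev" and "orders_snapshot_curr" tables.
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Rows that are new or changed since the previous snapshot
new_or_changed = conn.execute(
    "SELECT * FROM orders_snapshot_curr EXCEPT SELECT * FROM orders_snapshot_prev"
).fetchall()

# Rows that were deleted, or old versions of rows that were updated
removed_or_stale = conn.execute(
    "SELECT * FROM orders_snapshot_prev EXCEPT SELECT * FROM orders_snapshot_curr"
).fetchall()

# A separate script would then apply these deltas to the target table.
print(len(new_or_changed), "new/changed rows;", len(removed_or_stale), "removed/stale rows")
```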

Advantages of this approach:

  • It provides an accurate view of changed data while only using native SQL scripts

Disadvantage of this approach:

  • Demand for storage significantly increases because you need three copies of the data sources that are being used in this technique: the original data, previous snapshot, and current snapshot
  • It does not scale well in applications with heavy transactional workloads

Although this approach handles deleted rows better, the CPU resources required to identify the differences are significant, and the overhead increases linearly with the volume of data. The diff method also introduces latency and cannot be performed in real time. Some log-based change data capture tools come with the ability to compare tables to ensure replication consistency.

Triggers

Another method for building change data capture at the application level is defining triggers and creating your own change log in shadow tables. Triggers fire before or after INSERT, UPDATE, or DELETE commands (which indicate a change) and are used to create a change log. Because triggers operate at the SQL level, some users prefer this approach, and some databases even have native support for triggers.
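Here is a small, self-contained sketch of the idea using SQLite, which supports triggers natively; the customers table and customers_changes shadow table are illustrative names, not a prescribed design.

```python
# A minimal sketch of trigger-based CDC with a shadow table acting as the change log.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);

-- Shadow table acting as the change log
CREATE TABLE customers_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, id INTEGER, name TEXT, email TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
    INSERT INTO customers_changes (op, id, name, email)
    VALUES ('I', NEW.id, NEW.name, NEW.email);
END;

CREATE TRIGGER customers_upd AFTER UPDATE ON customers BEGIN
    INSERT INTO customers_changes (op, id, name, email)
    VALUES ('U', NEW.id, NEW.name, NEW.email);
END;

CREATE TRIGGER customers_del AFTER DELETE ON customers BEGIN
    INSERT INTO customers_changes (op, id, name, email)
    VALUES ('D', OLD.id, OLD.name, OLD.email);
END;
""")

conn.execute("INSERT INTO customers (id, name, email) VALUES (1, 'Ada', 'ada@example.com')")
conn.execute("UPDATE customers SET email = 'ada@new.example.com' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

# A downstream process would read this change log and ship it to the target.
for row in conn.execute("SELECT op, id, name, email FROM customers_changes ORDER BY change_id"):
    print(row)
```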

However, triggers are required for each table in the source database, and there is greater overhead associated with running triggers on operational tables while the changes are being made. In addition to having a significant impact on the performance of the application, maintaining the triggers as the application changes adds to the management burden.

Advantages of this approach:

  • Shadow tables can provide an immutable, detailed log of all transactions
  • Directly supported in the SQL API for some databases

Disadvantage of this approach:

  • Significantly reduces the performance of the database by requiring multiple writes to a database every time a row is inserted, updated, or deleted

Many application users do not want to risk changing application behavior by introducing triggers to operational tables. DBAs and data architects should always heavily test the performance of any triggers added to their environment and decide whether they can tolerate the additional overhead.

Log-Based Change Data Capture

Databases contain transaction logs (sometimes called redo logs) that store all database events, allowing the database to be recovered in the event of a crash. With log-based change data capture, new database transactions, including inserts, updates, and deletes, are read from the source databases’ native transaction or redo logs.

The changes are captured without making application level changes and without having to scan operational tables, both of which add additional workload and reduce source systems’ performance.
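PostgreSQL is one of the databases that exposes its change stream in a documented way, through logical decoding. The sketch below shows roughly what consuming that stream looks like with psycopg2, assuming a logical replication slot named cdc_slot already created with an output plugin such as wal2json; the connection details are placeholders, and commercial CDC platforms hide this plumbing while adding checkpointing and metadata management on top.

```python
# A hedged sketch of consuming PostgreSQL's change stream via logical decoding; the DSN
# and slot name are placeholders.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=appdb user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="cdc_slot", decode=True)

def consume(msg):
    print(msg.payload)                                  # one decoded change, in commit order
    msg.cursor.send_feedback(flush_lsn=msg.data_start)  # acknowledge so WAL can be recycled

cur.consume_stream(consume)  # blocks, streaming inserts, updates, and deletes as they commit
```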

Advantages of this approach

  • Minimal impact on production database system – no additional queries required for each transaction
  • Can maintain ACID reliability across multiple systems
  • No requirement to change the production database system’s schemas or the need to add additional tables

Challenges of this approach

  • Parsing the internal logging format of a database is complex – most databases do not document the format nor do they announce changes to it in new releases. This would potentially require you to change your database log parsing logic with each new database release.
  • Would need system to manage the database change events metadata
  • Additional log levels required to produce scannable redo logs can add marginal performance overhead

Data integration platforms that natively perform change data capture can handle the complexity mentioned above by automatically mining the database change logs while managing additional metadata to ensure the replication between two or more systems is reliable.

If you would like to see how low-impact, real-time, log-based change data capture works, or to talk to one of our CDC experts, you can schedule a demo of the Striim platform.

Types of Data Integration: ETL vs ELT and Batch vs Real-Time

Overview

1. Introduction
2. Batch Data Integration
3. Real-Time Data Integration
4. ETL vs ELT
5. Why Real-Time Matters

Introduction

The world is drowning in data. More than 80% of respondents in an IBM study said that the sources, the volume, and the velocity of data they work with had increased. So it comes as no surprise that companies are eager to take advantage of these trends; the World Economic Forum’s report lists data analysts and scientists as the most in-demand job role across industries in 2020. And although companies are ramping up efforts in this field, there are major obstacles on the road ahead.

Not only are most analysts forced to work with unreliable and outdated data, but many also lack tools to quickly integrate data from different sources into a unified view. Traditional batch data integration is hardly up to this challenge.

That’s why a growing number of companies are looking for more effective and faster types of data integration. One solution is real-time data integration, a technology superior to batch methods because it enables rapid decision-making, breaks down data silos, future-proofs your business, and offers many other benefits.

Different types of data integration

Depending on their business needs and IT infrastructures, companies opt for different types of data integration. Some choose to ingest, process, and deliver data in real time, while others might use batch integration. Let’s quickly dive into each one of those.

Batch data integration

Batch data integration involves storing all the data in a single batch and moving it at scheduled periods of time or only once a certain amount is collected. This approach is useful if you can wait to receive and analyze data.

Batch data integration, for instance, can be used for maintaining an index of company files. You don’t necessarily need an index to be refreshed each time a document is added or modified; once or twice a day should be sufficient.

Electric bills are another relevant example. Your electric consumption is collected during a month and then processed and billed at the end of that period. Banks also use batch processing, which is why some card transactions might take time to be reflected in your online banking dashboard.

Real-time data integration with change data capture

Real-time data integration involves processing and transferring data as soon as it’s collected. The process isn’t literally instantaneous, though. It takes a fraction of a second to transfer, transform, and analyze data using change data capture (CDC), transform-in-flight, and other technologies.

Imagine Each Event as a Change to an Entry in a Database

CDC involves tracking the database’s change logs and then turning inserts, updates, and other events into a stream of data applied to a target database. In many situations, however, data needs to be delivered in a specific format. That’s where the transform-in-flight feature comes into play: it turns data in motion into the required format and enriches it with inputs from other sources. Data is delivered to the master file in a consumable form, ready for processing.
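A minimal sketch of the transform-in-flight idea, using a hypothetical change-event shape and an in-memory lookup cache for enrichment; the field names and mappings are illustrative only.

```python
# A minimal transform-in-flight sketch; event shape and mappings are assumptions.
from datetime import datetime, timezone

# Reference data held in memory for enrichment (e.g., loaded from a dimension table)
PRODUCT_NAMES = {101: "Espresso Machine", 102: "Grinder"}

def transform(event: dict) -> dict:
    """Turn a raw change event into the shape the target expects, enriching as it flows."""
    row = event["after"]  # column values after the change
    return {
        "order_id": row["ID"],
        "product_id": row["PRODUCT_ID"],
        "product_name": PRODUCT_NAMES.get(row["PRODUCT_ID"], "unknown"),  # enrichment
        "amount_usd": round(row["AMOUNT_CENTS"] / 100, 2),                # reformatting
        "op": event["op"],                                                # I / U / D
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }

cdc_event = {"op": "I", "after": {"ID": 1, "PRODUCT_ID": 101, "AMOUNT_CENTS": 129900}}
print(transform(cdc_event))  # a ready-to-load record for the target
```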

Real-time data integration can be deployed in a range of time-sensitive use cases. Take, for example, reservation systems: When you book a vacation at your favorite hotel, its master database is automatically updated to prevent others from booking the same room. Point-of-sale terminals rely on the same data-processing tech. As you type your PIN and then take money from a terminal, your account is automatically updated to reflect this action.

ETL vs ELT

ETL (extract, transform, load) is another approach to data integration and has been the standard for decades. It consists of three parts, sketched in code after the list:

  • The first component of this method involves extracting data from the source systems using database queries (JDBC, SQL) or change data capture in the case of real-time data integration.
  • Transform, a second component of ETL, includes processing the data so it can be consumed properly in the target system. Examples of transformation include data type mapping, re-formatting data (e.g. removing special characters), or deriving aggregated values from raw data.
  • And load is the third component of ETL. It relates to the writing of the data to the target platform. This can be as simple as writing to a delimited file. Or, it can be as complex as creating schemas in a database or performing merge operations in a data warehouse.
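Here is that three-part flow as a minimal sketch, with an assumed source table, a simple transformation, and a delimited file as the target; all names are illustrative.

```python
# A minimal extract-transform-load sketch of the three components above.
import csv
import sqlite3

# Extract: pull rows from the source system with a database query
source = sqlite3.connect("source.db")
rows = source.execute("SELECT id, email, amount_cents FROM orders").fetchall()

# Transform: map data types, reformat values, and derive new fields
transformed = [
    {"id": rid, "email": email.strip().lower(), "amount_usd": amount_cents / 100}
    for rid, email, amount_cents in rows
]

# Load: write the prepared records to the target, here a simple delimited file
with open("orders_load.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "email", "amount_usd"])
    writer.writeheader()
    writer.writerows(transformed)
```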

ELT (Extract, load, transform) re-orders the equation by allowing the target data platform to handle transformation while the integration platform simply collects and delivers the data.

There are a few factors that have led to the recent popularity of ELT:

  • The cost of compute has been optimized over time with open source tools (Spark, Hadoop) and cloud infrastructure such as AWS, Microsoft Azure, and Google Cloud.
  • Modern cloud data platforms like Snowflake and Databricks provide analysts and cloud architects with a simple user experience to analyze disparate data sources in one platform. ELT tools load raw and unstructured data into these types of data platforms so analysts can join and correlate the data.
  • ETL has increasingly become synonymous with legacy, batch data integration workloads that poorly integrate with the modern data stack

Andreessen Horowitz’s recent paper on modern data infrastructure highlighted ELT as a core component of next-generation data stacks while referring to ETL as ‘brittle’. It’s unclear why they categorize all ETL tools as brittle, but it’s clear there’s a perception that ETL has become synonymous with legacy, outdated data management practices.

However, real-time data integration modernizes ETL by using the latest paradigms to transform and correlate streaming data in-flight so it’s ready for analysis the moment it’s written to the target platform. This allows analysts to avoid data transformation headaches, reduce their cloud resource usage, and simply start analyzing their data in their platform of choice.

And real-time data processing is evolving and growing in popularity because it helps solve many difficult challenges and offers a range of benefits.

Real-time data flows allow rapid decision-making

By 2023, there will be over 5 billion internet users and 29.3 billion networked devices, each producing ever-larger amounts of different types of data. Real-time integration allows companies to act quickly on this information.

Data from on-premises and cloud-based sources can easily be fed, in real-time, into cloud-based analytics built on, for instance, Kafka (including cloud-hosted versions such as Google PubSub, AWS Kinesis, Azure EventHub), Snowflake, or BigQuery, providing timely insights and allowing fast decision making.

And speed is becoming a critical resource. Detecting and blocking fraudulent credit card usage requires matching payment details with a set of predefined parameters in real time. If, in this case, data processing took hours or even minutes, fraudsters could get away with stolen funds. But real-time data integration allows banks to collect and analyze information rapidly and cancel suspicious transactions.

Companies that ship their products also need to make decisions quickly. They require up-to-date information on inventory levels so that customers don’t order out-of-stock products. Real-time data integration prevents this problem because all departments have access to continuously updated information, and customers are notified about sold-out goods.

Real-time data integration breaks down data silos

When deciding which types of data integration to use, data silos are another obstacle companies have to account for. When data sets are scattered across ERP, CRM, and other systems, they’re isolated from each other. Engineers then find it hard to connect the dots, uncover insights, and make better decisions. Fortunately, real-time data integration helps businesses break down data silos.

From relational databases and data warehouses to IoT sensors and log files, real-time data integration delivers data with sub-second latency from various sources to a new environment. Organizations then have better visibility into their processes. Hospitals, for example, can integrate their radiology units with other departments and ensure that patient imaging data is shared with all stakeholders instead of being siloed.

Real-time data integration future-proofs your business

Speed is essential in a world that produces more and more data. Annual mobile traffic alone will reach almost a zettabyte by 2022, changing the existing technologies and giving rise to new ones. Thriving in this digital revolution requires handling an array of challenges and opportunities. It also requires navigating between different types of data integration options, with real-time tech capable of future-proofing your business in many different ways.

Avoid vendor lock-in with a multi-cloud strategy

According to IBM, 81% of all enterprises have a multi-cloud strategy already laid out or in the works.
Real-time data integration allows your team to get more value from the cloud by making it possible to experiment with or adopt different technologies. You’d be able to use a broader range of cloud services and, by extension, build better applications and improve machine-learning models. And these capabilities are critical to a resilient and flexible IT architecture that underpins innovation efforts across on-premises and cloud environments.

Improving customer service ops

Your support reps can better serve customers by having data from various sources readily available. Agents with real-time access to purchase history, inventory levels, or account balances will delight customers with an up-to-the-minute understanding of their problems. Rapid data flows also allow companies to be creative with customer engagement. They can program their order management system to inform a CRM system to immediately engage customers who purchased products or services.

Better customer experiences then translate into increased revenue, profits, and brand loyalty. Almost 75% of consumers say a good experience is critical for brand loyalty, while most businesses consider customer experience a competitive differentiator vital for their survival and growth.

Optimizing business productivity

Spotting inefficiencies and taking corrective actions is another important goal for today’s companies. Manufacturers, for instance, achieve this goal by deploying various improvement methodologies, such as Lean production, Six Sigma, or Kaizen.

Whichever of those or other productivity tactics they choose, companies need access to real-time data and continuously updated dashboards. Relying on periodically refreshed data can slow down progress. Instead of tackling problems in real time, managers take a lot of time to spot problems, causing unnecessary costs and increased waste.

Therefore, the key to optimizing business productivity is collecting, transferring, and analyzing data in real time. And many companies agree with this argument. According to an IBM study, businesses expect that fast data will allow them to “make better informed decisions using insights from analytics (44%), improved data quality and consistency (39%), increased revenue (39%), and reduced operational costs (39%).”

Harnessing the power of digital transformation

Among different types of data integration, real-time tech is the one that allows companies to truly take their data game to the next level. No longer constrained by batch processing, businesses can innovate more, build better products, and drive profits. Harnessing the power of data will provide them with a much-needed competitive edge. And that can make all the difference between growth and stagnation as the digital revolution reshapes the world.

Definitions

Batch data integration involves storing all the data in a single batch and moving it only once a certain amount is collected or at scheduled periods of time.

Real-time data integration involves processing and transferring data as soon as it’s collected using change data capture (CDC), transform-in-flight, and other technologies.

Benefits of real-time data integration

  • Enables rapid decision-making
  • Accelerates ELT with faster loads
  • Modernizes ETL with high throughput transformations
  • Breaks down data silos
  • Prepares teams for a multi-cloud, anti-vendor lock-in strategy
  • Improves customer experiences

Online Enterprise Database Migration to Google Cloud

Migrate to cloud

Migrating existing workloads to the cloud is a formidable step in the journey of digital transformation for enterprises. Moving an enterprise application from on premises to run in the cloud, or modernizing it to make the best use of cloud-native technologies, is only part of the challenge. A major part of this task is to move the existing enterprise databases while the business continues to operate at full speed.

Pause never

How the data is extracted and loaded into the new cloud environment plays a big role in keeping the business critical systems performant. Particularly for enterprise databases supporting mission-critical applications, avoiding downtime is a must-have requirement during migrations to minimize both the risk and operational disruption.

For business-critical applications, the acceptable downtime approaches zero. Meanwhile, moving large amounts of data and performing essential testing of those business-critical applications can take days, weeks, or even months.

Keep running your business

The best practice for minimizing, or even eliminating, downtime during enterprise database migration is to use online database migration that keeps the application running.

In the online migration, changes from the enterprise source database are captured non-intrusively as real-time data streams using Change Data Capture (CDC) technology. This capability is available for most major databases, including Oracle, Microsoft SQL Server, HPE NonStop, MySQL, PostgreSQL, MongoDB, and Amazon RDS, but has to be harnessed in the correct way.

In online database migration, you first perform an initial load of the source database to the cloud. Then, any changes that occurred in the source database while the initial load was running are applied continuously to the target cloud database from the real-time data stream. The source and target databases remain in sync until you are ready to cut over completely. You also have the option to fall back to the source at any point, further minimizing risk.
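The following toy sketch captures that sequencing with in-memory stand-ins for the source, the target, and the captured change stream; it illustrates the pattern, not any particular migration tool.

```python
# A toy, in-memory sketch of online migration: initial load, then continuous apply of the
# changes captured while the load was running. The dict "databases" are purely illustrative.
source = {1: "alice@example.com", 2: "bob@example.com"}
target = {}
captured_changes = []  # recorded non-intrusively by CDC

def write_to_source(op, key, value=None):
    if op in ("INSERT", "UPDATE"):
        source[key] = value
    elif op == "DELETE":
        source.pop(key, None)
    captured_changes.append((op, key, value))  # CDC captures the change as it happens

# Phase 1: initial load of the source into the cloud target
snapshot = dict(source)
write_to_source("UPDATE", 2, "bob@new.example.com")  # the application keeps writing mid-load
target.update(snapshot)

# Phase 2: apply the captured change stream, in order, until source and target converge;
# the same loop keeps running for continuous replication until (or beyond) cutover.
for op, key, value in captured_changes:
    if op in ("INSERT", "UPDATE"):
        target[key] = value
    elif op == "DELETE":
        target.pop(key, None)

print(target == source)  # True: the target has caught up, and the source remains a fallback
```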

Integrate continuously

Online database migration also provides essential data integration services for new application development in the cloud. The change delivery can be kept running while you develop and test the new cloud applications. You may even choose to keep the target and source databases in sync indefinitely, typically for continuous database replication in hybrid or multi-cloud use cases.

Keep fresh

Once the real-time streaming data pipelines to the cloud are set up, businesses can easily build new applications, and seamlessly adopt new cloud services to get the most operational value from the cloud environment. Real-time streaming is a crucial element in all such data movement use cases, and it can be widely applied to hybrid or multi-cloud architectures, operational machine learning, analytics offloading, large scale cloud analytics, or any other scenario where having up-to-the-second data is essential to the business.

Change Data Capture

Striim, in strategic partnership with Google Cloud, offers online database migrations and real-time hybrid cloud data integration to Google Cloud through non-intrusive Change Data Capture (CDC) technologies. Striim enables real-time, continuous data integration from on-premises and other cloud data sources to BigQuery, Cloud Spanner, Cloud SQL for PostgreSQL, MySQL, and SQL Server, as well as Cloud Pub/Sub, Cloud Storage, and other databases running in Google Cloud.

Replicate to Google Cloud

In addition to data migration, data replication is an important use case as well. In contrast to data migration, data replication continuously replicates data from a source system to a target system “forever” without the intent to shut down the source system.

An example target system in the context of data replication is BigQuery, the data analytics platform of choice in Google Cloud. Striim supports continuous data streaming (replication) from an on-premises database to BigQuery in Google Cloud when the operational database has to remain on-premises and cannot be migrated. Striim bridges the two worlds and makes Google Cloud data analytics accessible by supporting the hybrid environment.

Transform in flight

Data migration and continuous streaming in many cases transport the data unmodified from the source to the target systems. However, many use cases require data to be transformed to match the target systems, or to be enriched and combined with data from different sources in order to complement and complete the target data set for increased value and expressiveness in a simple and robust architecture. This method is frequently referred to as extract, transform, load, or ETL.

Striim provides a very flexible and powerful in-flight transformation and augmentation functionality in order to support use cases that go beyond simple one-time data migration.

More to migrate? Keep replicating!

Enterprises in general have several data migration and online streaming use cases at the same time. Often data migration takes place for some source databases, while data replication is ongoing for others.

A single Striim installation can support several use cases at the same time, reducing the need for management and operational supervision. The Striim platform supports high-volume, high velocity data with built-in validation, security, high-availability, reliability, and scalability as well as backup-driven disaster recovery addressing enterprise requirements and operational excellence.

The following architecture shows an example where migration and online streaming are implemented at the same time. On the left, the database in the Cloud is migrated to the Cloud SQL database on the right. After a successful migration, the source database will be removed. In addition, the two source databases on the left in an on-premises data center are continuously streamed (replicated) to BigQuery for analytics and Cloud Spanner for in-Cloud processing.

Keep going

In addition, Striim as the data migration technology is implemented in a high-availability configuration. The three servers on Compute Engine form a cluster, and each of the servers is executing in a different zone, making the cluster highly available and protecting the migration and online streaming from zone failures or zone outages.

Accelerate Cloud adoption

As organizations modernize their data infrastructure, integrating mission-critical databases is essential to ensure information is accessible, valuable, and actionable. Striim and Google Cloud’s partnership supports Google customers with smooth data movement and continuous integration solutions, accelerating Google Cloud adoption and driving business growth.

Learn more

To learn more about enterprise cloud data integration, feel free to reach out to Striim and check out these references:

Google Cloud Solution Architecture: Architecting database migration and replication using Striim

Blog: Zero downtime database migration and replication to and from Cloud Spanner

Tutorial: Migrating from MySQL to BigQuery for Real-Time Data Analytics

Striim Google Virtual Hands-On Lab: Online Database Migration to Google Cloud using Striim

Self-paced Hands-on Lab: Online Data Migration to Cloud Spanner using Striim

Striim 3.10.1 Further Speeds Cloud Adoption

We are pleased to announce the general availability of Striim 3.10.1, which includes support for new and enhanced cloud targets, extends manageability and diagnostics capabilities, and introduces new ease-of-use features to speed our customers’ cloud adoption. Key features released in Striim 3.10.1 are directly available through Snowflake Partner Connect to enable rapid movement of enterprise data into Snowflake.

This new release introduces many new features and capabilities.

Let’s review the key themes and features of this new release, starting with the new and expanded cloud targets.

Striim on Snowflake Partner Connect

From Snowflake Partner Connect, customers can launch a trial Striim Cloud instance directly as part of the Snowflake on-boarding process from the Snowflake UI and load data, optionally with change data capture, directly into Snowflake from any of our supported sources. You can read about this in a separate blog.

Expanded Support for Cloud Targets to Further Enhance Cloud Adoption

The Striim platform has been chosen as a standard for our customers’ cloud adoption use-cases partly because of the wide range of cloud targets it supports. Striim provides integration with databases, data warehouses, storage, messaging systems and other technologies across all three major cloud environments.

A major enhancement is the introduction of support for the Google BigQuery Streaming API. This not only enables real-time analytics on large scale data in BigQuery by ensuring that data is available within seconds of its creation, but it also helps with quota issues that can be faced by high volume customers. The integration through the BigQuery streaming API can support data transfer up to 1GB per second.
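For context, this is roughly what the underlying streaming ingestion call looks like when made directly with the google-cloud-bigquery client; Striim drives this integration for you, and the table identifier and rows below are placeholders.

```python
# A rough illustration of BigQuery streaming inserts; table id and rows are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.orders"  # placeholder project.dataset.table

rows = [
    {"order_id": 1, "amount_usd": 129.99, "op": "I"},
    {"order_id": 2, "amount_usd": 45.50, "op": "I"},
]

# Streaming inserts make rows queryable within seconds of their creation
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```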

In addition to this, Striim 3.10.1 also has the following enhancements:

  • Optimized delivery to Snowflake and Azure Synapse that facilitates compacting multiple operations on the same data to a single operation on the target resulting in much lower change volume
  • Delivery to MongoDB cloud and MongoDB API for Azure Cosmos DB
  • Delivery to Apache Cassandra, DataStax Cassandra, and Cassandra API for Azure Cosmos DB

  • Support for delivery of data in Parquet format to Cloud Storage and Cloud Data Lakes to further support cloud analytics environments

Schema Conversion to Simplify Cloud Adoption Workflows

As part of many cloud migration or cloud integration use cases, especially during the initial phases, developers often need to create target schemas to match those of the source data. Striim adds the capability to use source schema information from popular databases such as Oracle, SQL Server, and PostgreSQL to create appropriate target schemas in cloud targets such as Google BigQuery, Snowflake, and others. Importantly, these conversions understand data type and structure differences between heterogeneous sources and targets and act intelligently to spot problems and inconsistencies before progressing to data movement, simplifying cloud adoption.
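To illustrate the general idea (these are not Striim’s actual conversion rules), schema conversion boils down to mapping source column types to the target’s types and emitting DDL, as in this simplified Oracle-to-BigQuery sketch with made-up table and column names.

```python
# A toy schema-conversion sketch; mappings and names are simplified assumptions.
ORACLE_TO_BIGQUERY = {
    "VARCHAR2": "STRING",
    "NUMBER": "NUMERIC",
    "DATE": "DATETIME",
    "TIMESTAMP": "TIMESTAMP",
}

source_columns = [("ORDER_ID", "NUMBER"), ("CUSTOMER", "VARCHAR2"), ("CREATED", "TIMESTAMP")]

def to_bigquery_ddl(table: str, columns) -> str:
    cols = ", ".join(f"{name} {ORACLE_TO_BIGQUERY[src_type]}" for name, src_type in columns)
    return f"CREATE TABLE {table} ({cols})"

print(to_bigquery_ddl("analytics.orders", source_columns))
# CREATE TABLE analytics.orders (ORDER_ID NUMERIC, CUSTOMER STRING, CREATED TIMESTAMP)
```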

Enhanced Monitoring, Alerting and Diagnostics

Ongoing data movement between on-premise and cloud environments, whether for migrations or for powering reporting and analytics solutions, is often part of an enterprise’s critical applications. As such, it demands deep insight into the status of all active data flows.

Striim 3.10.1 adds the capability to inherently monitor data from its creation in the source to successful delivery in a target, generate detailed lag reports, and alert on situations where lag is outside of SLAs.

In addition, this release provides detailed status on checkpointing information for recovery and high availability scenarios, with insight into checkpointing history and currency.

Simplifies Working with Complex Data

As customers work with heterogeneous environments and adopt more complex integration scenarios, they often have to work with complex data types, or perform necessary data conversions. While always possible through user defined functions, this release adds multiple commonly requested data manipulation functions out of the box. This simplifies working with JSON data and document structures, while also facilitating data cleansing, and regular expression operations.

On-Going Support for Enterprise Sources

As customers upgrade their environments or adopt new technologies, it is essential that their integration platform keeps pace. In Striim 3.10.1 we extend our support for the Oracle database to include Oracle 19c, including change data capture, add support for schema information and metadata for Oracle GoldenGate trails, and certify our support for Hive 3.1.0.

This has been a high-level view of the new features of Striim 3.10.1; there is a lot more to discover to aid your cloud adoption journey. If you would like to learn more about the new release, please reach out to schedule a demo with a Striim expert.

Implementing Gartner’s Cloud Smart FEVER selection process using Striim

In their recent research note, “Move From Cloud First to Cloud Smart to Improve Cloud Journey Success” (February 2020), Gartner introduced the concept of using the FEVER selection process to prioritize workloads to move to cloud.

According to the research note, to ensure rapid results by building on the knowledge of earlier experiences with cloud, IT leaders “should prioritize the workloads to move to cloud by using a ‘full circle’ continuous loop selection process: faster, easier, valuable, efficient and repeat (FEVER; see Figure 2). This allows them to deliver results in waves of migrations according to the organization’s delivery capacities.”

While thinking about this concept I realized that following this approach is one of the reasons that Striim’s customers are so successful with their cloud migration and integration initiatives.  They are utilizing a cloud smart approach for real-world use-cases, including online database migrations enabled by change data capture, offloading reporting to cloud environments, and continuous data delivery for cloud analytics.

Faster

The speed of solutions is critical to many of our customers that have strict SLAs, and limited timeframes in which they want to complete their projects. Striim allows customers to build and test data flows supporting cloud adoption very quickly, while Striim’s optimized architecture enables rapid transfer of data from data sources to cloud for both initial load, and on-going real-time data delivery.

Easier

Customers don’t want to spend days or weeks learning a new solution. In order to implement quickly, the solution must be easy to learn and work with. Striim’s wizard-based approach and intuitive UI enables our customers to rapidly build out their data pipelines, and transfer knowledge for on-going operations.

Valuable

Many of our customers are already ‘Cloud Smart’ and approach cloud initiatives in a pragmatic way. They often start with highly critical but simple migrations that give them the highest value in the shortest time. Once all the “lowest-hanging fruit” is picked and successfully implemented, they move on to more complex scenarios or integrate additional sources.

Efficient

Cost-efficiency for our customers is more than just the on-going cost reductions inherent in moving to a cloud solution. It also includes the time taken by their valuable employees to build and maintain the solution, and the data ingress costs inherent in moving their data to the cloud. By utilizing Striim, they can reduce the amount of time spent to achieve success and reduce their data movement costs by utilizing one-time loads, with on-going change delivery.

Repeat

It is seldom that our customers have only a single migration or cloud adoption to perform. Repeatability and reusability of the cloud migration or integration are essential to their long-term plans. Not only do they want to be able to repeat similar migrations, but they also want to be able to use the same platform for all of their cloud adoption initiatives. By standardizing on Striim, our customers can take advantage of the large number of sources and cloud targets we support and focus on the business imperatives without having to worry about whether it’s possible.

If you would like to learn more about becoming cloud smart, you can access the full report “Move From Cloud First to Cloud Smart to Improve Cloud Journey Success” (February 2020), for a limited time using this link.

Move From Cloud First to Cloud Smart to Improve Cloud Journey Success, Henrique Cecci, 25 February 2020

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally, and is used herein with permission. All rights reserved.

This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Striim.

New Quick Start Tutorial for Streaming Kafka Integration

If you have adopted Apache Kafka as your high-performance, fault-tolerant messaging system, Kafka’s real-time integration with your critical data sources and consumers is essential for getting the most business value. With real-time, streaming Kafka integration, you can build modern applications that access timely data in the right format, enabling time-sensitive operational decisions. However, because Kafka was designed for developers, organizations typically rely on developers’ manual efforts and specialized skill sets to stream data into and out of Kafka and to automate data processing for its consumers.

Striim has offered SQL-query-based processing and analytics for Kafka since 2015. Drag-and-drop UI, pre-built applications and wizards for configuring Kafka integration, and custom utilities make Striim the easiest solution to build streaming integration pipelines with built-in analytics applications.

Our new tech guide, “Quick Start Tutorial for Streaming Kafka Integration,” details Striim’s capabilities for Kafka integration and stream processing. It illustrates how users can get optimal value from Kafka by simplifying the real-time ingestion, stream processing, and delivery of a wide range of data types, including transactional data from enterprise databases, without impacting their performance. The guide offers step-by-step instructions on how to build a Kafka integration solution for moving data from a MySQL database to Kafka with log-based change data capture and in-flight data processing. You can easily adapt these instructions for other data sources supported by Striim.

 

Some of the key areas covered in this tech guide include how to:

  • Ingest data to Kafka in a streaming fashion from enterprise databases, such as Oracle, SQL Server, MySQL, PostgreSQL, and HPE NonStop using low-impact change data capture. Other data sources, such as system logs, sensors, Hadoop, and cloud data stores, are also discussed in this section.
  • Use SQL-based stream processing to put Kafka data into a consumable format before delivering it with sub-second latency (see the sketch after this list). Data formatting and the use of an in-memory data cache for in-flight data enrichment are explained as well.
  • Support mission-critical applications with built-in scalability, security, exactly-once-processing (E1P), and high-availability.
  • Perform SQL-based in-memory analytics, as the data is flowing through, and rapidly visualize the results of the analytics without needing to code manually.
  • Deliver real-time data from Kafka to other systems, including cloud solutions, databases, data warehouses, other messaging systems, and files with pre-built adapters.
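
To make the SQL-based processing bullet above more concrete, here is a rough sketch of the kind of query involved, written as plain SQL over a hypothetical CDC stream of order changes. All table, column, and stream names are illustrative, and Striim’s continuous-query syntax differs in its details, so treat this as a conceptual example rather than copy-paste code.

-- Filter and reshape CDC change events before delivering them to a Kafka topic.
-- orders_cdc_stream, its columns, and op_type are hypothetical names.
SELECT
    order_id,
    customer_id,
    status,
    amount,
    op_type          -- INSERT, UPDATE, or DELETE, taken from the CDC metadata
FROM orders_cdc_stream
WHERE status = 'COMPLETED'
  AND amount > 0;

In Striim, a query of this shape would run as a continuous query whose output stream feeds a Kafka writer, keeping the filtering and formatting logic in declarative SQL rather than in custom consumer code.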

The Striim platform addresses the complexity of Kafka integration with an end-to-end, enterprise-grade software platform. By downloading the new tech guide, “Quick Start Tutorial for Streaming Kafka Integration,” you can get started rapidly building integration and analytics solutions without extensive coding or specialized skillsets. That capability enables your data scientists, business analysts, and other data professionals to focus on delivering fast business value to transform your operations.

For more resources on integrating high volumes of data in and out of Kafka, please visit our Kafka Integration and Stream Processing solution page. If you prefer to discuss your specific requirements, we would be happy to provide you with a customized demo of streaming Kafka integration or other relevant use cases.

 

Mitigating Data Migration and Integration Risks for Hybrid Cloud Architecture

 

Cloud computing has transformed how businesses use technology and drive innovation for improved outcomes. However, the journey to the cloud, which includes data migration from legacy systems, and integration of cloud solutions with existing systems, is not a trivial task. There are multiple cloud adoption risks that businesses need to mitigate to achieve the cloud’s full potential.

 

Common Risks in Data Migration and Integration to Cloud Environments

In addition to data security and privacy, there are additional concerns and risks in cloud migration and integration. These include:

Downtime: The bulk data loading technique, which takes a snapshot of the source database, requires you to lock the legacy database to preserve a consistent state. This translates to downtime and business disruption for your end users. While this disruption can be acceptable for some of your business systems, the mission-critical ones that need modernization are typically the ones that cannot tolerate even planned downtime. And sometimes planned downtime extends beyond the expected duration, turning into unplanned downtime with detrimental effects on your business.

Data loss: Some data migration tools might lose or corrupt data in transit because of a process failure or network outage. Or they may fail to apply the data to the target system in the right transactional order. As a result, your cloud database ends up diverging from the legacy system, also negatively impacting your business operations.

Inadequate Testing: Many migration projects operate under intense time pressure to minimize downtime, which can lead to a rushed testing phase. When the new environment is not tested thoroughly, the end result can be an unstable cloud environment. That is certainly not the desired outcome when your goal is to take your business systems to the next level.

Stale Data: Many migration solutions focus on the “lift and shift” of existing systems to the cloud. While it is a critical part of cloud adoption, your journey does not end there. Having a reliable and secure data integration solution that keeps your cloud systems up-to-date with existing data sources is critical to maintaining your hybrid cloud or multi-cloud architecture. Working with outdated technologies can lead to stale data in the cloud and create delays, errors, and other inefficiencies for your operational workloads.

 

Upcoming Webinar on the Role of Streaming Data Integration for Data Migration and Integration to Cloud

Streaming data integration is a new approach to data integration that addresses the multifaceted challenges of cloud adoption. By combining bulk loading with real-time change data capture technologies, it minimizes the downtime and risks mentioned above and enables reliable and continuous data flow after the migration.

Striim - Data Migration to Cloud

In our next live, interactive webinar, we dive into this particular topic: Cloud Adoption: How Streaming Data Integration Minimizes Risks. Our Co-Founder and CTO, Steve Wilkes, will present practical ways you can mitigate data migration risks and handle integration challenges for cloud environments. Striim’s Solution Architect, Edward Bell, will walk you through a live demo of zero downtime data migration and continuous streaming integration to major cloud platforms, such as AWS, Azure, and Google Cloud.

I hope you can join this live, practical presentation on Thursday, May 7th, at 10:00 AM PT / 1:00 PM ET to learn more about how to:

  • Reduce migration downtime and data loss risks, as well as allow unlimited testing time of the new cloud environment.
  • Set up streaming data pipelines in just minutes to reliably support operational workloads in the cloud.
  • Handle strict security, reliability, and scalability requirements of your mission-critical systems with an enterprise-grade streaming data integration platform.

Until we see you at the webinar, and afterward, please feel free to reach out to get a customized Striim demo for data migration and integration to the cloud that supports your specific IT environment.

 

Top 4 Highlights from Our Streaming Data and Analytics Webinar with GigaOm

 

 

On April 9, 2020, Striim’s co-founder and CTO Steve Wilkes joined GigaOm analyst Andrew Brust in an interview-style webinar on “Streaming Data: The Nexus of Cloud Modernized Analytics.” Over the course of the hour, the two talked about the evolution of data integration needs, what defines streaming data integration, capturing transactional data through change data capture (CDC), comparative approaches to data integration, where companies typically start with streaming data, use case examples, how streaming data supports cloud initiatives, how it provides a foundation for operational intelligence, and even its role in AI/ML advancements.

While we can’t cover it all in one blog post, here is a “top 4” list of our favorite things highlighted during the webinar — and we invite you to view the entire on-demand event by watching it online

 

#1: “Today, People Expect to Have Up-to-the-Second Information” — Steve Wilkes

Andrew asked Steve to take a bit of a “wayback machine” trip to trace how we arrived at the need for streaming, real-time data. “Twenty years ago, most data was created by humans working on applications with data stored in databases, and you’d use ETL to move and store the data in batches into a data warehouse. It was OK to see data hours or even days later, and everyone did that,” said Steve. But fast-forward to our daily lives today, with immediate updates from Twitter feeds, news alerts, and instant messages from friends, and expectations have changed.

“So the business world needs to work the same way, and this does drive competitive pressures,” he continued. “If you’re not having this view into your operations and what your customers need, someone else will and they can push you out of business.”

Related to this, Andrew said later in the webinar: “We have new modes of thinking. But using older modes of technology, we’re going to run into issues.”

GigaOm: Old vs New Approaches to Data Movement

#2: Cloud Adoption Driving the Need for Streaming Data

As Steve noted, there’s been a significant shift from all on-premises systems to cloud-based environments, but there is still the need to get data into the cloud in order to get value from it.

Steve shared with Andrew that, in terms of adoption across Striim’s global customer base, the majority have a first goal of building the ability to stream their data, and then use it to power analytics.

“Initial use cases are often zero-downtime data migrations to cloud or feeding a cloud-based data warehouse…. Once they’ve stream-enabled a lot of their sources, they will start to think about what analytics they can promote to real time and where they can get value out of that,” said Steve.

 

#3: A Range of Business Use Cases

Throughout the webinar, Andrew mentioned a few possible use cases, particularly in the context of the global pandemic being faced. “There’s nothing more frustrating, especially in these times of lockdown, when it says something is in stock and then you go to confirm the purchase and it says it’s out of stock … or you find out later.”

From Steve: “That real immediacy into what customers are doing, need, and want is key to what streaming data can do.”

Another example Andrew used illustrated the need for operational intelligence using real-time data. He referenced his home state of New York as it faces the coronavirus pandemic, where the real-time sharing of data about medical supplies and personnel data across the state’s hospitals could improve decisions to best allocate and redistribute those assets.

Shifting to the analytics side, Steve described operational intelligence as being able to change what you know about your operations and the decisions you make, based on current information. He gave the example of being able to track down critical devices, such as wheelchairs, in settings such as airports and hospitals.

The two also discussed how streaming data fits with AI/ML, where Steve commented how streaming data can be used to get data ready and processed for AI models to improve efficiency and performance.

 

#4: Status of Streaming Data

Andrew polled attendees with the question of where they are today with having streaming data in their organization.

GigaOm Poll: Use of Streaming Data In Your Organization

At least half of the attendees said they are using streaming data at least occasionally, which suggests that streaming data integration will continue to grow in popularity and ubiquity. Another 25% are currently evaluating streaming data technology.

Andrew asked Steve for his thoughts on the 15% who felt they don’t have a need for streaming data. As Steve commented: “A lot of organizations have a perception of what a real-time application is and the categories of use cases they are good for. But if you are moving applications to the cloud and they are business-critical, if you can’t turn them off for a few days, how do you do that without turning it off when data is still changing. There’s a need for real-time streaming data there.”

As you can see, the two covered a lot of ground — and so much more during this interactive webinar event. It is available to watch on demand at your convenience, so please check it out. We thank GigaOm and Andrew Brust for hosting this engaging program.

Also, you can learn more about the topic of streaming integration in a new 100+ page book published by O’Reilly Media and co-authored by Steve Wilkes, who spoke in this webinar. Download your free PDF copy today.

 

A New Comprehensive Guide to Streaming ETL for Google Cloud

 

 

Not to brag, but since we literally wrote the book on data modernization with streaming data integration, it is our pleasure to provide you with a guide book on using streaming ETL for Google Cloud Platform. This eBook will help your company unleash innovative services and solutions by combining the power of streaming data integration with Google Cloud Platform services.

As part of your data modernization and cloud adoption efforts, you cannot ignore how you collect and move your data to your new data management platform. But, like adopting any new technology, there is complexity in the move and a number of things to consider, especially when dealing with mission-critical systems. We realize that the process of researching options, building requirements, getting consensus, and deciding on a streaming ETL solution for Google Cloud is never a trivial task.

As a technology partner of Google Cloud, we at Striim are thrilled to invite you to easily tap into the power of streaming ETL by way of our new eBook: A Buyer’s Guide to Streaming Data Integration for Google Cloud. If you’ve been looking to move to Google Cloud or get more operational value in your cloud adoption journey, this eBook is your go-to guide.

This eBook provides an in-depth analysis of the game-changing trends of digital transformation. It explains why a new approach to data integration is required, and how streaming data integration (SDI) fits into a modern data architecture. With many use case examples, the eBook shows you how streaming ETL for Google Cloud provides business value, and why this is a foundational step. You’ll discover how this technology is enabling the business innovations of today – from ride sharing and fintech, to same-day delivery and retail/e-retail.

Here’s a rundown of what we hope you’ll learn through this eBook:

  • A clear definition of what streaming integration is, and how it compares and contrasts to traditional extract/transform/load (ETL) tools
  • An understanding of how SDI fits into existing as well as emerging enterprise architectures
  • The role streaming data integration architecture plays with regard to cloud migration, hybrid cloud, multi-cloud, etc.
  • The true business value of adopting SDI
  • What companies and IT professionals should be looking for in a streaming data integration solution, focusing on the value of combining SDI and stream processing in one integrated platform
  • Modern SDI use cases, and how these are helping organizations to transform their business
  • Specifically, the benefits of using the Striim SDI platform in combination with the Google Cloud Platform

The digital business operates in real time, and the limitations of legacy integration approaches will hold you back from the limitless potential that cloud platforms bring to your business. To ease your journey into adopting streaming ETL for Google Cloud, please accept our tested and proven guidance in this new eBook: A Buyer’s Guide to Streaming Data Integration for Google Cloud. By following the practical steps it provides, you can reap the full benefits of Google Cloud for your enterprise. For further information on streaming data integration or the Striim platform, please feel free to contact us.

MySQL to Google BigQuery using CDC

Tutorial: Migrating from MySQL to BigQuery for Real-Time Data Analytics

 

 

In this post, we will walk through an example of how to replicate and synchronize your data from on-premises MySQL to BigQuery using change data capture (CDC).

Data warehouses have traditionally been on-premises services that required data to be transferred using batch load methods. Ingesting, storing, and manipulating data with cloud data services like Google BigQuery makes the whole process easier and more cost effective, provided that you can get your data in efficiently.

The Striim real-time data integration platform allows you to move data in real time, as changes are recorded, using a technology called change data capture (CDC). This allows you to build real-time analytics and machine learning capabilities from your on-premises datasets with minimal impact.

Source MySQL Database

Before you set up the Striim platform to synchronize your data from MySQL to BigQuery, let’s take a look at the source database and prepare the corresponding database structure in BigQuery. For this example, I am using a local MySQL database with a simple purchases table to simulate a financial datastore that we want to ingest from MySQL to BigQuery for analytics and reporting.

I’ve loaded a number of initial records into this table and have a script to apply additional records once Striim has been configured to show how it picks up the changes automatically in real time.
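
The post does not show the table definition itself, so here is a minimal, hypothetical sketch of what such a purchases table and a couple of seed rows might look like in MySQL. The column names and types are assumptions made purely for illustration.

CREATE TABLE purchases (
    purchase_id   INT PRIMARY KEY,
    customer_id   INT NOT NULL,
    amount        DECIMAL(10,2) NOT NULL,
    purchased_at  DATETIME NOT NULL
);

INSERT INTO purchases (purchase_id, customer_id, amount, purchased_at)
VALUES (1, 101, 49.99, '2020-04-01 09:15:00'),
       (2, 102, 120.00, '2020-04-01 09:20:00');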

Targeting Google BigQuery

You also need to make sure your instance of BigQuery has been set up to mirror the source or the on-premises data structure. There are a few ways to do this, but because you are using a small table structure, you are going to set this up using the Google Cloud Console interface. Open the Google Cloud Console, and select a project, or create a new one. You can now select BigQuery from the available cloud services. Create a new dataset to hold the incoming data from the MySQL database.

Once the dataset has been created, you also need to create a table structure. Striim can perform transformations while the data flows through the synchronization process. However, to make things a little easier here, I have replicated the same structure as the on-premises data source.
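
If you prefer a scripted setup over clicking through the console, a BigQuery table mirroring the hypothetical purchases schema sketched above could be created with standard SQL along these lines. The dataset name (striim_demo) and the column names are assumptions carried over from that sketch.

-- Run in the BigQuery query editor after the dataset has been created.
CREATE TABLE striim_demo.purchases (
    purchase_id   INT64,
    customer_id   INT64,
    amount        NUMERIC,
    purchased_at  DATETIME
);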

You will also need a service account to allow your Striim application to access BigQuery. Open the service account option through the IAM window in the Google Cloud Console and create a new service account. Give the necessary permissions for the service account by assigning BigQuery Owner and Admin roles and download the service account key to a JSON file.

Set Up the Striim Application

Now you have your data in a table in the on-premises MySQL database and have a corresponding empty table with the same fields in BigQuery. Let’s now set up a Striim application on Google Cloud Platform for the migration service.

Open your Google Cloud Console and open or start a new project. Go to the marketplace and search for Striim. A number of options should return, but the option you are after is the first item that allows integration of real-time data to Google Cloud services.

Select this option and start the deployment process. For this tutorial, you are just using the defaults for the Striim server. In production, you would need to size appropriately depending on your load.

Click the deploy button at the bottom of this screen and start the deployment process.

Once this deployment has finished, the details of the server and the Striim application will be generated.

Before you open the admin site, you will need to add the service account key file to the Striim virtual machine. Open the SSH console to the machine and copy the JSON file with the service account key to a location Striim can access. I used /opt/striim/conf/servicekey.json.

Give this file the right permissions by running the following commands:

chown striim:striim <filename>

chmod 770 <filename>

You also need to restart the Striim services for this to take effect. The easiest way to do this is to restart the VM.

Once this is done, close the shell and click on the Visit The Site button to open the Striim admin portal.

Before you can use Striim, you will need to configure some basic details. Register your details and enter the cluster name (I used “DemoCluster”) and password, as well as an admin password. Leave the license field blank to get a trial license if you don’t have one, then wait for the installation to finish.

 

When you get to the home screen for Striim, you will see three options. Let’s start by creating an app to connect your on-premises database with BigQuery to perform the initial load of data. To create this application, you will need to start from scratch from the applications area. Give your application a name and you will be presented with a blank canvas.

The first step is to read data from MySQL, so drag a database reader from the sources tab on the left. Double-click on the database reader to set the connection string with a JDBC-style URL using the template:

jdbc:mysql://<server_ip>:<port>/<database>

You must also specify the tables to synchronize — for this example, purchases — as this allows you to restrict what is synchronized.

Finally, create a new output. I called mine PurchasesDataStream.

You also need to connect your BigQuery instance to your source. Drag a BigQuery writer from the targets tab on the left. Double-click on the writer and select the input stream from the previous step and specify the location of the service account key. Finally, map the source and target tables together using the form:

<source-database>.<source-table>,<target-database>.<target-table>

For this use case, it is just a single table on each side.
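
Putting the two templates together with the hypothetical names used earlier, and assuming the source MySQL database is called storefront, the filled-in values might look like this (the IP address, database, and dataset names are placeholders):

jdbc:mysql://192.0.2.10:3306/storefront

storefront.purchases,striim_demo.purchases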

Once both the source and target connectors have been configured, deploy and start the application to begin the initial load process. Once the application is deployed and running, you can use the monitor menu option on the top left of the screen to watch the progress.

Because this example contains a small data load, the initial load application finishes pretty quickly. You can now stop this initial load application and move on to the synchronization.

Updating BigQuery with Change Data Capture

Striim has pushed your current database up into BigQuery, but ideally you want to update this every time the on-premises database changes. This is where the change data capture application comes into play.

Go back to the applications screen in Striim and create a new application from a template. Find and select the MySQL CDC to BigQuery option.

 

Like the first application, you need to configure the details for your on-premises MySQL source. Use the same basic settings as before. However, this time the wizard adds the JDBC component to the connection URL.

When you click Next, Striim will ensure that it can connect to the local source. Striim will retrieve all the tables from the source. Select the tables you want to sync. For this example, it’s just the purchases table.
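
One detail worth checking before the wizard connects: log-based MySQL CDC generally requires binary logging to be enabled and a database user with replication privileges. The statements below are a hedged sketch of the typical setup, with a placeholder user name; the authoritative list of requirements is in the Striim documentation.

-- Create a dedicated CDC user and grant the privileges log-based capture typically needs.
CREATE USER 'striim_cdc'@'%' IDENTIFIED BY 'choose-a-strong-password';
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'striim_cdc'@'%';

-- Confirm that row-based binary logging is enabled on the source.
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';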

Once the local tables are mapped, you need to connect to the BigQuery target. Again, you can use the same settings as before by specifying the same service key JSON file, table mapping, and GCP Project ID.

Once the setup of the application is complete, you can deploy and turn on the synchronization application. This will monitor the on-premises database for any changes, then synchronize them into BigQuery.

Let’s see this in action by clicking on the monitor button again and loading some data into your on-premises database. As the data loads, you will see the transactions being processed by Striim.
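
To simulate ongoing changes, you could run a few more inserts against the hypothetical purchases table on the source and then confirm the rows arrive in BigQuery. The names below carry over from the earlier sketch and remain placeholders.

-- On the source MySQL database:
INSERT INTO purchases (purchase_id, customer_id, amount, purchased_at)
VALUES (3, 103, 15.50, '2020-04-01 10:05:00');

-- A few moments later, in the BigQuery query editor:
SELECT COUNT(*) FROM striim_demo.purchases;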

Next Steps

As you can see, Striim makes it easy for you to synchronize your on-premises data from existing databases, such as MySQL, to BigQuery. By constantly moving your data into BigQuery, you could now start building analytics or machine learning models on top, all with minimal impact to your current systems. You could also start ingesting and normalizing more datasets with Striim to fully take advantage of your data when combined with the power of BigQuery.

To learn more about Striim for Google BigQuery, check out the related product page. Striim is not limited to MySQL to BigQuery integration, and supports many different sources and targets. To see how Striim can help with your move to cloud-based services, schedule a demo with a Striim technologist or download a free trial of the platform.

Streaming Data: The Nexus of Cloud-Modernized Analytics

 

 

On April 9th I am going to be having a conversation with Andrew Brust of GigaOm about the role of streaming integration in digital transformation initiatives, especially cloud modernization and real-time analytics. The format of this webinar is light on PowerPoint and rich in lively discussion and interaction, so we hope you can join us.

Streaming Data: The Nexus of Cloud-Modernized Analytics

APR 9, 2020 - 10:00 AM PDT / 1:00 PM EDT

Digital transformation is the integration of digital technology into all areas of a business, resulting in fundamental changes to how businesses operate and how they deliver value to customers. Cloud has been the number one driving technology in a majority of such transformations. It could be that you have a cloud-first strategy, with all new applications being built in the cloud, or you may need to migrate online databases without taking downtime. You may want to take advantage of cloud scale for virtually unlimited data storage, coupled with machine learning to gain new insights and make proactive decisions.

In all cases, the key component is data. The data for your new applications, cloud analytics, or data migration could originate on-premises, in another cloud, or be generated by millions of IoT devices. It is essential that this data can be collected, processed, and delivered rapidly, reliably, and at scale. This is why streaming data is a major component of data modernization, and why streaming integration platforms are vital to the success of digital transformation initiatives.

In a modern data architecture, the goal is to harvest your existing data sources and enable your analysts and data scientists to provide value in the form of applications, visualizations, and alerts to your decision makers, customers, and partners.

In this webinar we will discuss the key aspects of this architecture, including the role of change data capture (CDC) and IoT technologies in data collection, options for data processing, and the differing requirements for data delivery. You will also learn how streaming integration platforms can be utilized for cloud modernization, large scale and stream analytics, and machine learning operationalization, in a reliable and scalable way.

I hope you can join us on April 9th, and see why streaming integration is the engine of data modernization for digital transformation.

 

Introducing Our New O’Reilly Book: Streaming Integration: A Guide to Realizing the Value of Real-Time Data

 

 

Data modernization is key to surviving and thriving in challenging market conditions, providing the foundation needed for improved customer experience, optimized processes, and the ability to outperform the competition. Streaming integration is the engine of data modernization enabling higher operational value from real-time data. More and more businesses today tap into the power of real-time data to transform how they conduct daily operations.

We often write about this concept in our blogs. Now, we are excited to introduce you to a new O’Reilly book: Streaming Integration: A Guide to Realizing the Value of Real-Time Data. If you are building a modern data architecture as the foundation for business transformation, you will enjoy reading this book, as it dives deep into the architecture and technology best practices for streaming integration. At 100+ pages, it’s truly a comprehensive resource for you and your team.

The authors, Steve Wilkes and Alok Pareek, who are also co-founders of Striim, describe how streaming integration can be applied to solve the very real business challenges you face in a world transformed by digital technologies. The book provides a history of data before introducing streaming integration and discussing various use cases, such as:

  • how to apply streaming integration to achieve a zero-downtime data migration to a cloud database;
  • how to continuously update a data warehouse;
  • how to ingest, aggregate, and react to IoT data; and
  • how to perform real-time machine learning predictions.

Beyond the use cases for reaping fast value from operational data, the book provides a deep dive into technology and architecture best practices for streaming integration with details on:

  • how to ingest data from a wide range of heterogeneous sources, especially from transactional databases using low-impact CDC.
  • how to put the data into the form that is needed through filtering, transformation, enrichment, and correlation using in-memory components.
  • how to maintain sub-second delivery of data to both cloud and on-premises endpoints in high-volume environments.

Last but not least, the book covers the key considerations for an enterprise-grade streaming integration platform to meet the needs of today’s business-critical solutions. With decades of expertise in streaming integration solutions serving many industry-leading organizations, Steve and Alok offer architectural guidelines to ensure the data security, high availability, scalability, and reliability of the technology.

I invite you to read this new O’Reilly book: Streaming Integration: A Guide to Realizing the Value of Real-Time Data to discover what to look for in a streaming integration platform to truly be the engine of your data modernization and how to address real-world business challenges.

We are entering a new era – the era of real-time data, as the book’s conclusion states. At Striim, we are proud to be pioneers in bringing a robust streaming integration platform that enables your data revolution. If you would like to learn more about streaming data integration, please feel free to contact us.

Striim Security Enhancements

Striim 3.9.8 Adds Advanced Security Features for Cloud Adoption

 

 

We are pleased to announce the general availability of Striim 3.9.8 with a rich set of features that span multiple areas, including advanced data security, enhanced development productivity, data accountability, performance and scalability, and extensibility with new data targets.

The new release brings many new features, which are summarized below.

Let’s review the key themes and features of the new release starting with the security topic.

Advanced Platform and Adapter Security:

With a sharp focus on business-critical systems and use cases, the Striim team has been boosting the platform’s security features for the last several years. However, in version 3.9.8, we introduced a broad range of advanced security features to both the platform and its adapters to provide users with robust security for the end-to-end solution, and higher control for managing data security.

The new platform security features include the following components:

  • Striim KeyStore, a secured, centralized repository based on the Java KeyStore for storing passwords and encryption keys, which streamlines security management across the platform.
  • Ultra-secure algorithms for user password encryption across all parts of the platform, reducing the platform’s vulnerability to external or internal breaches.
  • Stronger encryption support for inter-node cluster communication, with an internally generated, long-string password and unified security management for all nodes and agents.
  • Multi-layered application security via advanced support for exporting and importing pipeline applications within the platform. In Striim, all password properties of an application are encrypted using their own keys. When exporting applications containing passwords or other encrypted property values, you can now add a second level of encryption with a passphrase that will be required at the time of import, strengthening application security.
  • Encryption support using customer-provided keys for securing permanent files via the File Writer, and for intermediate temporary files via the Google Cloud Storage Writer. Supported encryption algorithm types include RSA, AES, and PGP. You can generate encryption keys with a variety of available tools or an in-house Java program, and easily configure the adapters’ encryption settings via the Encryption Policy property in the UI.

Overall, these new security features enable:

  • Enhanced platform and adapter security for hybrid cloud deployments and mission-critical environments
  • Strengthened end-to-end data protection from ingestion to file delivery
  • Enhanced compliance with strict security policies and regulations
  • Secured application sharing between platform users

Improved Data Accountability:

Striim version 3.9.8 includes an application-specific exception store for events discarded by the application. The feature allows you to view discarded records and their details in real time, and you can configure it with a simple on/off option when building an application. With this feature, Striim improves its accountability for all data passing through the platform and allows users to build applications for replaying and processing discarded records.

Enhanced Application Development Support and Ease of Use

The new release also includes features that accelerate and ease developing integration applications, especially in high-volume data environments.

  • A New Enrichment Transformer: Expanding the existing library of out-of-the-box transformers, the new enrichment transformer allows you to enrich your streaming data in-flight without any manual coding. You only need Striim’s drag-and-drop UI to create a real-time data pipeline that performs in-memory data lookups. With this transformer, you can, for example, add City Name and County Name fields to an event containing a Zip Code, as sketched after this list.

  • External Lookups: Striim provides an in-memory data cache to enrich data in-flight at very high speeds. With the new release, Striim gives you the option to enrich data with lookups from external data stores. The platform can now execute a database query to fetch data from an external database and return the data as a batch. The external lookup option helps users avoid preloading data into the Striim cache, which is especially beneficial for lookups against, or joins with, large data sets. External lookups also eliminate the need for a cache refresh, since the data is fetched from the external database. External lookups are supported for all major databases, including Oracle, SQL Server, MySQL, PostgreSQL, and HPE NonStop.
  • The Option to Use Sample Data for Continuous Queries: With this ability, Striim reduces the data required for computation or for displaying results via the dashboards. You can choose to use only a portion of your streaming data for the application, if your use case can benefit from this approach. As a result, it increases the speed of computation and of displaying results, especially when working with very large data volumes.
  • Dynamic Output Names for Writers: The Striim platform now makes it easy to organize and consume the files and objects on the target system by giving you flexible options for naming them. Striim file and object output names can include data, metadata, and user data field values from the source event. This dynamic output naming feature is available for the following targets: Azure Data Lake Store Gen 1 and Gen 2, Azure Blob Storage, Azure File Storage, Google Cloud Storage, Apache HDFS, and Amazon S3.
  • Event-Augmented Kafka Message Header: For Apache Kafka version 0.11 and later, Striim 3.9.8 introduces a new property called MessageHeader that enriches the Kafka message header with a mix of the event’s dynamic and static values before delivering it with sub-second latency. With the help of this additional contextual information, downstream consumer applications can rapidly determine how to use the messages arriving via Striim.
  • Simplified User Experience: The new UI for configuring complex adapter properties, such as rollover policy, flush policy, encryption policy, speeds new application development.

  • New sample application for real-time dashboards: Striim version 3.9.8 added a new sample dashboarding application that uses real-time data from the Meetup website and displays details of the meetup events happening around the globe, demonstrating the Vector Map visualization.
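
Conceptually, both the enrichment transformer and external lookups behave like a join between the event stream and a reference table. As a rough illustration in plain SQL, using entirely hypothetical names and not Striim’s own syntax, the Zip Code example above amounts to something like this:

-- Enrich each event with the city and county looked up by zip code.
SELECT
    e.event_id,
    e.zip_code,
    z.city_name,
    z.county_name
FROM event_stream AS e
JOIN zip_reference AS z
  ON e.zip_code = z.zip_code;

With the in-memory cache, the reference data is preloaded into Striim; with external lookups, an equivalent query is pushed to the external database at runtime.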

Other platform improvements for ease of use and manageability include:

  • The Open Processor component, which allows bringing external code into the Striim platform, can be loaded and unloaded dynamically without having to restart Striim.
  • The Striim REST API allows safely deleting or post-processing the files processed by the Striim File Reader.
  • The Striim REST API for application monitoring reports consolidated statistics of various application components within a specified time range.

Increased Performance and Scalability:

To further improve performance and scalability, the new release adds multiple features, including dynamic partitioning and performance fine-tuning for writers:

  • Dynamic Partitioning with a Higher Level of Control: Partitions allow parallel processing of the events in a stream by splitting them across multiple servers in the deployment. Striim’s partitioning distributes events dynamically at run time across server nodes in a cluster, enabling high performance and easy scalability. In prior releases, Striim used one or more fields of the events in the stream as the key for partitioning. In the new release, users have additional, flexible options for distributing and processing large data volumes in streams or windows. Striim 3.9.8 allows the partitioning key to be one or more expressions composed from the fields of the events in the stream. Striim’s flexible partitioning enables load balancing for applications that are deployed on multi-node clusters and process large data volumes. Window-based partitioning enables grouping the events in windows that can, for example, be consumed by specific downstream writers. As a result, you can perform load balancing across multiple writers to improve writing performance.
  • Writer Fine-Tuning Options: Striim 3.9.8 now offers the ability to configure the number of parallel threads for writing into the target system and simplifies writer configuration for achieving even higher throughput from the platform. The fine-tuning option is available for the following list of writers at this time: Azure Synapse Analytics and Azure SQL Data Warehouse, Google BigQuery, Google Cloud Spanner, Azure Cosmos DB, Apache HBase, Apache Kudu, MapR Database, Amazon Redshift, and Snowflake.

Increased Extensibility with New Data Targets

  • The Striim platform now supports SAP Hana as a target with direct integration. SAP Hana customers can now stream real-time data from a diverse set of sources into the platform with in-flight, in-memory data processing. With the availability of real-time data pipelines to SAP Hana, deployed on-premises or in the cloud, customers can rapidly develop time-sensitive analytics applications that transform their business operations.
  • Expanding the HTTP Reader capabilities to send custom responses back to the requestor. The HTTP Reader can now defer responding until events reach a corresponding HTTP Writer. This feature enables users to build REST services using Striim.

Other extensibility improvements are:

  • Improved support for handling special characters for table names in Oracle and SQL Server databases
  • Hazelcast Writer supports multi-column primary keys to enable more complex Hot Cache use cases
  • Performance improvement options for the SQL Server CDC Reader

These are only a portion of the new features of Striim 3.9.8. There is more to discover. If you would like to learn more about the new release, please reach out to schedule a demo with a Striim expert.

Advancement of Data Movement Technologies

Advancement of Data Movement Technologies: Whiteboard Wednesdays

 

In this Whiteboard Wednesday video, Irem Radzik, Head of Product Marketing at Striim, looks at how data movement technologies have evolved in response to changing user demands. Read on, or watch the 8-minute video:

Today we’re going to talk about the advancement of data movement technologies. We’re going to look at the ETL technologies that we started seeing in the ‘90s, then the CDC (Change Data Capture)/Logical Replication solutions that emerged a couple of decades ago, and then the streaming data integration solutions that we more commonly see today.

ETL

Let’s look at ETL technologies. ETL is known for its batch extract, then bringing the data into the transformation step in a middle-tier server, and then loading the target in bulk again, typically for next-day reporting. You end up with high latency with these types of solutions. That was good enough for the ‘90s, but then we started demanding fresher data for operational decision making. Latency became an issue with ETL solutions.

Data Movement - ETL

The other issue with ETL was the batch-window dependency. Because of the high impact on the production sources, there had to be a dedicated time for these batch extracts when the main users wouldn’t be able to use the production database. The batch window that was available for data extract became shorter and shorter as business demanded continuous access to the OLTP system.

The data volumes increased at the same time. You ended up not having enough time to move all the data you needed. That became a pain point for ETL users, driving them to look into other solutions.

Change Data Capture/Logical Replication

Change Data Capture/Logical Replication solutions addressed several of the key concerns that ETL had. Change Data Capture basically means that you continuously capture new transactions happening in the source database and deliver it to the target in real time.

Data Movement - CDC / Logical Replication

That obviously helps with the data latency problem. You end up having real-time, up to date data in the target for your operational decision making. The other plus of CDC is the source impact.

When it’s using logs (database logs) to capture the data, it has negligible impact. The source production system is available for transaction users. There is no batch window needed and no limitations for how much time you have to extract and move the data.

The CDC/Logical Replication solutions handle some of the key concerns of ETL users. They are made more for the E and L steps. What ends up happening with these solutions is that you need to do transformations within the database or with another tool, in order to complete the transformation step for end users.

The transformation happening there creates an ELT architecture and requires another product, another step, another network hop in your architecture, which complicates the process.

When there’s an outage, when there is a process disruption, reconciling your data and recovering becomes more complicated. That’s the shortcoming CDC users have been facing. These solutions were mainly made for databases.

Once the cloud and big data solutions became popular, the CDC providers had to come up with new products for cloud and big data targets. These are add-ons, not part of the main platform.

Another shortcoming that we’ve seen with CDC/Logical Replication solutions is their single node architecture, which translates into a single point of failure. This is a shortcoming, especially for mission-critical systems that need continuous availability of the data integration processes.

Streaming Data Integration

In recent years, streaming data integration came about to address the issues that CDC/Logical Replication products left unresolved. It is becoming increasingly common. With streaming data integration, you’re not limited to just database sources.

Data Movement - Streaming Data Integration

You can have your files, log data, your machine data, your system log files for example, all moving in a real-time fashion. Your cloud sources, your service bus or your messaging systems can be your source. Your sensor data can be moved in real time, in a streaming fashion to multiple targets. Again, not limited to just databases.

You can have cloud databases or other cloud services as your target. You can, in addition to databases, have messaging systems as your target, on-premises or in cloud, your big data solutions, on-premises or cloud. You can also deliver in file format.

Everything is like it was in a logical replication solution. It is continuous, in real time, and Change Data Capture is still a big component of the streaming data integration.

It’s built on top of the Change Data Capture technologies and brings additional data sources and additional data targets. Another important difference, and handling one of the challenges of logical replication, is the transformation piece. As we discussed, a transformation needs to happen and where it happens makes a big difference.

With streaming data integration, the transformation happens in-flight. While the data is moving, you can apply stream processing without adding more latency. The data can be filtered, aggregated, masked and encrypted, and enriched with reference data, all in flight, before it’s delivered to your target, so that it’s available in a consumable format. This streamlines and simplifies your architecture and makes recovery easier. It also delivers the data in the format that your users need.
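
As a rough illustration of this kind of in-flight processing, expressed as plain SQL over a hypothetical payments stream (Striim expresses this as continuous queries, and every name here is made up for the example):

-- Filter and mask records while they are in motion, before delivery to the target.
SELECT
    payment_id,
    CONCAT('****-****-****-', RIGHT(card_number, 4)) AS card_number_masked,
    amount
FROM payments_stream
WHERE amount > 0;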

Another important thing to highlight is the distributed architecture. This natively clustered environment helps with a single point of failure risk. When one node fails, the other one takes over immediately, so you have a highly available data pipeline. This distributed clustered environment also helps you to scale out very easily, add more servers as you have more data to process and move.

These solutions now come with a monitoring component. Real-time monitoring of the pipelines gives you an understanding of what’s happening with your integration flows. If there is an issue, such as high data latency or a process problem, you get immediate alerts, so you can trust that everything is running.

Data reliability is really critical, and so is the reliability of the whole pipeline. To make sure that there is no data loss or duplicates, data delivery validation can be included in some of these solutions. With the right solution, you can also make sure that everything is processed exactly once, and that you are not repeating or dropping data. There are checkpointing mechanisms to make that possible.

As you see, the new streaming data integration solutions handle some of the challenges that we have seen in the past with outdated data movement technologies. To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

 

MySQL to Google Cloud SQL

Tutorial: Migrating from MySQL to Google Cloud SQL with Change Data Capture

 

 

Migrating from MySQL to Google Cloud SQL opens up cloud services that offer a wealth of capabilities with low management overhead and cost. But moving your existing on-premises applications to the cloud can be a challenge, because those applications are built on top of on-premises deployments of databases like MySQL. In this blog post, we are going to use a database technology called change data capture (CDC) to synchronize data from MySQL into a Google Cloud SQL instance.

Introduction

One of the major hurdles when migrating applications, whether you’re changing the technology or moving to the cloud, is migrating your data. The older and bigger the application, the more difficult that migration becomes. Traditional Extract, Transform, and Load (ETL) tools require multiple passes and, potentially, significant downtime to handle data migration activities. This is where real-time ETL tools like Striim shine.

There are a number of benefits in migrating applications this way, such as being able to:

  • Add a new, client-facing cloud application by synchronizing an existing, traditionally on-premises application’s data set.
  • Migrate one or more on-premises applications (with data) to the cloud for production testing with almost zero impact on the existing application.

Let’s walk through an example of connecting an on-premises instance of MySQL to Google Cloud SQL for MySQL.

Set Up the MySQL Database

Before we dive into Striim, we are assuming you have an on-premises MySQL instance already configured and containing relevant data. For the purpose of this post, we have loaded data from a GitHub source (https://github.com/datacharmer/test_db) into a local MySQL instance. The data set is pretty large, which is perfect for our purposes, and contains a dummy set of employee information, including salaries.

Rather than importing all the data this data set contains, I’ve excluded the load_salaries2.dump and load_salaries3.dump files. This will allow us to insert a lot of data after Striim has been configured to show how powerful Change Data Capture is.
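
For reference, the salaries table in that GitHub data set, which is the one we will query later to watch changes arrive, has roughly the following shape (the foreign key to the employees table is omitted here; check the repository for the authoritative definition):

CREATE TABLE salaries (
    emp_no     INT  NOT NULL,
    salary     INT  NOT NULL,
    from_date  DATE NOT NULL,
    to_date    DATE NOT NULL,
    PRIMARY KEY (emp_no, from_date)
);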

Set Up the Striim Application

Now that we have an on-premises data set in MySQL, let’s set up a new Striim application on Google Cloud Platform to act as the migration service.

Open your Google Cloud console and open or start a new project. Go to the marketplace and search for Striim.

A number of options should return, but the one we’re after is the first item, which allows integration of real-time data to GCP.

Select this option and start the deployment process by pressing the deploy button at the bottom of this screen. For this tutorial, we’ll use the basic defaults for a Striim server. In production, however, you’d need to size appropriately depending on your load.

Create a Target Database

While we wait for the Striim server to deploy, let’s create the Google Cloud SQL database to which we’ll migrate our data. Select the SQL option from the side menu in Google Cloud and create a new MySQL instance.

Once again, we’ll use the defaults for a basic MySQL instance. Open the instance, copy the instance connection name for use later, and take note of the IP address.

We also need to create the database structure for the data we imported into the local MySQL instance. To do this, open the Google Cloud shell, log into the MySQL server, and run the SQL to create the table structure. Striim also needs a checkpoint table to keep the state in the event of failures, so create that table structure using the following:

CREATE TABLE CHKPOINT (
    id VARCHAR(100) PRIMARY KEY,
    sourceposition BLOB,
    pendingddl BIT(1),
    ddl LONGTEXT
);

Initial Load Application

Open the Google Console and go back to the Deployment Manager, and click “Visit site”. It’s important to note that the Striim VM currently has a dynamic external IP address. In a production environment, you’ll want to set this to static so it won’t change.

When you first visit the site, you’ll see a congratulations screen. Click accept and fill in the basic details. Leave the license field blank for the trial version of Striim, or add your license key if you have one.

The first thing we need to do is create an application that performs the initial load of our current data set. There is no wizard for the initial load application we require, so go to Apps and create an app from scratch.

First, let’s add a MySQL reader from the sources tab on the left. This will access our local database to load the initial set of data. To read from a local server we need to use a JDBC style URL using the template: jdbc:mysql://<server_ip>:<port>/<database>. We are also mapping the tables we want to sync by specifying them in the tables folder using <database>.<tablename>. This allows us to restrict what is synchronized. Finally, under output to, specify a new WAEvent type for this connector.

Once we have our source wired up, we need to add a target to the flow so our data starts to transfer. Using a process similar to the one we used previously, add the GoogleCloudWriter target, with the Cloud SQL instance’s connection details in the connection URL. For the tables, this time we need to match the source and target tables together using the form:

<source-database>.<source-table>,<target-database>.<target-table>;

Once both the source and target connectors have been configured, deploy and start the application to begin the initial load process.

After the application goes through the starting process, we can click on the monitor button to show the performance of the application. This will take a little while to complete, depending on your database size.

Change Data Capture

While the initial load takes place, let’s create the Change Data Capture (CDC) application to get ready for the synchronization process.

This time we are going to use a wizard to create the application. Click on Add Apps, then select the option to start with a Template. Striim comes with a lot of templates for different use cases out of the box. Scroll down to Streaming Integration for MySQL, click “show more,” then look for MySQL CDC to Cloud SQL MySQL. This option sets up a CDC application for MySQL to Google Cloud SQL.

Fill out the connection information for your on-premises application and click next. This should connect to the agent and ensure everything is correct.

Once everything is connected, check the tables you selected in the first application. These will synchronize any changes that occur.

Now we need to link our source to our target. Specify the connection details for your Google SQL instance using the IP address from the previous step. Fill in the username, password, and list of tables from the source database and click next. When you’ve finished the wizard, the application should be ready to go.

If the previous data load application has finished, stop the data load application and start the Change Data Capture application. Once the application has started, start loading transactions into your on-premises database. This should start synchronizing the data that changes up to your Google Cloud instance.

Open the Change Data Capture application and select monitor. You should see both the input and output figures as the application keeps track of your on-premises database. The activity chart should be showing the throughput of the records synchronizing from one location to another.

If you open the database console in Google Cloud and run a “SELECT COUNT(salary) FROM salaries” statement a couple of times, you should see the count figure rising.

Adding More Load

While the servers are synchronizing, let’s go back to our local MySQL and add some other transactions. Import the remaining two salaries files, load_salaries2.dump and load_salaries3.dump. This will provide additional transactions to be synchronized and you’ll see Striim continue to add transactions as they happen without needing to do anything else.
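
Assuming the dump files are in your working directory, one way to load them is from the mysql client with the SOURCE command; the paths are placeholders, so adjust them to wherever you cloned the repository.

-- From the mysql client, connected to the local instance:
USE employees;
SOURCE load_salaries2.dump;
SOURCE load_salaries3.dump;

Re-running the SELECT COUNT(salary) FROM salaries query against the Cloud SQL instance should then show the count continuing to climb as Striim applies the changes.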

Next Steps

We looked at a really quick and easy way to synchronize an on-premises instance of MySQL to Google Cloud SQL using Striim. At this point, you could start using the cloud database to run additional applications or do data analysis — without affecting the performance and use of your existing system.

If you open the menu on the Striim admin page, then open the apps section, and finally open this application, you’ll also see other steps you could add to this flow that support even more complex use cases, such as adding in transforms, combining multiple sources, or even splitting across targets.

To learn more about migrating from MySQL to Google Cloud SQL, check out the product page at https://www.striim.com/striim-for-google-cloud-sql/. To see how Striim can help with your move to cloud-based services, schedule a demo with a Striim technologist, or download a free trial of the platform.

 

 

Use Cases for Streaming Data Integration

The Top 4 Use Cases for Streaming Data Integration: Whiteboard Wednesdays

 

 

Today we are talking about the top four use cases for streaming data integration. If you’re not familiar with streaming data integration, please check out our channel for a deeper dive into the technology. In this 7-minute video, let’s focus on the use cases.

 

Use Case #1 Cloud Adoption – Online Database Migration

The first one is cloud adoption – specifically, online database migration. When you have a legacy database and you want to move it to the cloud and modernize your data infrastructure, if it’s a critical database, you don’t want to experience downtime. A streaming data integration solution helps with that. While you’re doing the initial load from the legacy system to the cloud, the Change Data Capture (CDC) feature captures all the new transactions happening in the legacy database as they happen. Once the cloud database is loaded and ready, all the changes that happened in the legacy database can be applied there. During the migration, your legacy system remains open for transactions – you don’t have to pause it.

While the migration is happening, CDC helps you to keep these two databases continuously in-sync by moving the real-time data between the systems. Because the system is open to transactions, there is no business interruption. And if this technology is designed for both validating the delivery and checkpointing the systems, you will also not experience any data loss.

Because this cloud database has production data, is open to transactions, and is continuously updated, you can take your time to test it before you move your users. You have essentially unlimited testing time, which helps you minimize your risks during such a major transition. Once the systems are completely in sync and you have checked and tested the cloud database, you can point your applications to it and run on your cloud database.

This is a single switch-over scenario. But streaming data integration also gives you the ability to move the data bi-directionally, so you can have both systems open to transactions. Once you have tested this, you can run some of your users in the cloud and some of your users on the legacy database.

All the changes made by these users are moved between the databases and synchronized so that both stay constantly in sync. You can gradually move your users to the cloud database to further minimize your risk. Phased migration is a very popular use case, especially for mission-critical systems that cannot tolerate risk and downtime.

Use Case #2 Hybrid Cloud Architecture

Once you’re in the cloud and you have a hybrid cloud architecture, you need to maintain it. You need to connect it with the rest of your enterprise so that it becomes a natural extension of your data center. Continuous real-time data movement with streaming data integration allows your cloud databases and services to operate as part of your data center.

The important thing is that these workloads in the cloud can be operational workloads because fresh (i.e., continuously updated) information is available. Data from your databases, machine data, log files, other cloud sources, messaging systems, and sensors can move continuously to enable operational workloads.

What do we see in hybrid cloud architectures? Heavy use of cloud analytics solutions. If you want operational reporting or operational intelligence, you want comprehensive data delivered continuously so that you can trust it is up-to-date and gain operational intelligence from your analytics solutions.

You can also connect your data sources with messaging systems in the cloud to support event distribution for the new apps you’re running in the cloud, so that they are fully part of your data center. If you’re adopting multi-cloud solutions, you can likewise connect your new cloud systems with existing cloud systems, or send data to multiple cloud destinations.

Use Case #3 Real-Time Modern Applications

A third use case is real-time modern applications. Cloud is a big trend right now, but not everything is necessarily in the cloud; you can have modern applications on-premises as well. If you’re building any real-time app or modern system that needs timely information, you need continuous real-time data pipelines. Streaming data integration enables you to run real-time apps with real-time data.

Use Case #4 Hot Cache

Last, but not least, when you have an in-memory data grid to improve data retrieval performance, you need to make sure it is continuously up-to-date so that users can depend on its data. If the source system is updated but your cache is not, it can create business problems. By continuously moving real-time data using CDC technology, streaming data integration helps you keep your data grid up-to-date, so it can serve as your hot cache and support your business with fresh data.

To learn more about streaming data integration use cases, please visit our Solutions section, schedule a demo with a Striim expert, or download the Striim platform to get started.