Cloud ETL with Reliability

Implementing Streaming Cloud ETL with Reliability

 

 

If you’re considering adopting a cloud-based data warehousing and analytics solution, the most important consideration is how to populate it with current on-premises and cloud data in real time using cloud ETL.

As noted in a recent study by IBM and Forrester, 88% of companies want to run near-real-time analytics on stored data. Streaming cloud ETL enables real-time analytics by loading data from operational systems in private and public clouds to your analytics cloud of choice.

A streaming cloud ETL solution enables you to continuously load data from your in-house or cloud-based mission-critical operational systems to your cloud-based analytical systems in a timely and dependable manner. Reliable and continuous data flow is essential to ensuring that you can trust the data you use to make important operational decisions.

In addition to high performance and scalability to handle your large data volumes, you need to look for data reliability and pipeline high availability. What should you look for in a streaming cloud ETL solution to be confident that it offers the high degree of reliability and availability your mission-critical applications demand?

For data reliability, the following two common issues with data pipelines should be avoided:

  1. Data loss. When data volumes increase and the pipelines are backed up, some solutions allow some of the data to be discarded. Also, when the CDC solution has limited data type support, it may not be able to capture the columns that contain that data type. If an outage or process interruption occurs, an incorrect recovery point can also lead to skipping some data.
  2. Duplicate data. After recovering from a process interruption, the system may create duplicate data. This issue becomes even more prevalent when processing the data with time windows after the CDC step.

How Striim Ensures Reliable and Highly Available Streaming Cloud ETL

Striim ingests, moves, processes, and delivers real-time data across heterogeneous and high-volume data environments. The software platform is built ground-up specifically to ensure reliability for streaming cloud ETL solutions with the following architectural capabilities.

Fault-tolerant architecture

Striim is designed with a built-in clustered environment with a distributed architecture to provide immediate failover. The metadata and clustering service watches for node failure, application failure, and failure of certain other services. If one node goes down, another node within the cluster immediately takes over without the need for users to do this manually or perform complex configurations.

Exactly Once Processing for zero data loss or duplicates

Striim’s advanced checkpointing capabilities ensure that no events are missed or processed twice while taking time window contents into account. It has been tested and certified to offer real-time streaming to Microsoft Azure, AWS, and Google Cloud with event delivery guarantees.

During data ingest, checkpointing keeps track of all events the system processes and how far they got down various data pipelines. If something fails, Striim knows the last known good state and what position it needs to recover from. Advanced checkpointing is designed to eliminate loss in data windows when the system fails.

If you have a defined data window (say 5 minutes) and the system goes down, you cannot typically restart from where you left off because you will have lost the 5 minutes’ worth of data that was in the data window. That means your source and target will no longer be completely synchronized. Striim addresses this issue by coordinating with the data replay feature that many data sources provide to rewind sources to just the right spot if a failure occurs.
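To make the data window concept concrete, a 5-minute window over an order stream could be declared roughly as follows. This is only a sketch in the style of the window syntax used in Striim's streaming SQL (the stream name, key, and exact keywords are illustrative and may differ slightly):

-- A bounded, moving 5-minute view over an incoming order stream
CREATE WINDOW RecentOrders
OVER OrderStream
KEEP WITHIN 5 MINUTE
PARTITION BY order_id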

In cases where data sources don’t support data replay (for example, data coming from sensors), Striim’s persistent messaging stores and checkpoints data as it is ingested. Persistent messaging allows previously non-replayable sources to be replayed from a specific point. It also allows multiple flows from the same data source to maintain their own checkpoints. To offer exactly once processing, Striim checks to make sure input data has actually been written. As a result, the platform can checkpoint, confident in the knowledge that the data made it to the persistent queue.

End-to-end data flow management for simplified solution architecture and recovery

Striim’s streaming cloud ETL solution also delivers streamlined, end-to-end data integration between source and target systems, which enhances reliability. The solution ingests data from the source in real time; performs all transformations such as masking, encryption, aggregation, and enrichment in memory as the data moves through the stream; and then delivers the data to the target in a single network operation. All of these operations occur in one step, without landing data on disk, and deliver the streaming data to the target with sub-second latency. Because Striim does not require additional products, this simplified solution architecture enables a seamless recovery process and minimizes the risk of data loss or inaccurate processing.

In contrast, a data replication service without built-in stream processing requires data transformation to be performed in the target (or source), with an additional product and network hop. This minimum two-hop process introduces unnecessary data latency. It also complicates the solution architecture, exposing the customer to considerable recovery-related risks and requiring a great deal of effort for accurate data reconciliation after an outage.

For use cases where transactional integrity matters, such as migrating to a cloud database or continuously loading transactional data for a cloud-based business system, Striim also maintains ACID properties (atomicity, consistency, isolation, and durability) of database operations to preserve the transactional context.

The best way to choose a reliable streaming cloud ETL solution is to see it in action. Click here to request a customized demo for your specific environment.

 

Oracle Change Data Capture (CDC)

Oracle Change Data Capture Tutorial – An Event-Driven Architecture for Cloud Adoption

 

 

All businesses rely on data. Historically, this data resided in monolithic databases, and batch ETL processes were used to move that data to warehouses and other data stores for reporting and analytics purposes. As businesses modernize, looking to the cloud for analytics, and striving for real-time data insights, they often find that these databases are difficult to completely replace, yet the data and transactions happening within them are essential for analytics. With over 80% of businesses noting that the volume and velocity of their data are rapidly increasing, scalable cloud adoption and change data capture from databases like Oracle, SQL Server, MySQL, and others are more critical than ever before. Oracle change data capture is specifically one area where companies are seeing an influx of modern data integration use cases.

To address this, more and more companies are moving to event-driven architectures, because their dynamic, distributed scalability makes sharing large volumes of data across systems possible.

In this post we will look at an example that replaces batch ETL with event-driven, distributed stream processing: Oracle change data capture events are extracted as they are created; enriched with in-memory, SQL-based denormalization; and then delivered to the Azure Cloud to provide scalable, real-time, low-cost analytics, without affecting the source database. We will also look at using the enriched events, optionally backed by Kafka, to incrementally add other event-driven applications or services.

Continuous Data Collection, Processing, Delivery, and Analytics with the Striim Platform

Event-Driven Architecture Patterns

Most business data is produced as a sequence of events, or an event stream: for example, web or mobile app interactions, devices, sensors, and bank transactions all continuously generate events. Even the current state of a database is the outcome of a sequence of events. Treating state as the result of a sequence of events forms the core of several event-driven patterns.

Event Sourcing is an architectural pattern in which the state of the application is determined by a sequence of events. As an example, imagine that each “event” is an incremental update to an entry in a database. In this case, the state of a particular entry is simply the accumulation of events pertaining to that entry. In the example below the stream contains the queue of all deposit and withdrawal events, and the database table persists the current account balances.

Imagine Each Event as a Change to an Entry in a Database
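To make the pattern concrete, the sketch below rebuilds current account balances by folding the deposit and withdrawal events into per-account state. This is plain SQL over a hypothetical account_events table; the table and column names are illustrative rather than taken from the diagram above:

-- One row per event: deposits as positive amounts, withdrawals as negative.
-- The current state of each account is simply the accumulation of its events.
SELECT account_id,
       SUM(amount) AS current_balance
FROM account_events
GROUP BY account_id;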

The events in the stream can be used to reconstruct the current account balances in the database, but not the other way around. Databases can be replicated with a technology called Change Data Capture (CDC), which collects the changes being applied to a source database as soon as they occur by monitoring its change log, turns them into a stream of events, and then applies those changes to a target database. Source code version control is another well-known example of this, where the current state of a file is some base version plus the accumulation of all changes that have been made to it.

The Change Log can be used to Replicate a Database

What if you need to have the same set of data for different databases, for different types of use? With a stream, the same message can be processed by different consumers for different purposes. As shown below, the stream can act as a distribution point where, following the polyglot persistence pattern, events can be delivered to a variety of data stores, each using the technology best suited to a particular use case or materialized view.

Streaming Events Delivered to a Variety of Data Stores

Event-Driven Streaming ETL Use Case Example

Below is a diagram of the Event-Driven Streaming ETL use case example:

Event-Driven Streaming ETL Use Case Diagram
  1. Striim’s low-impact, real-time Oracle change data capture (CDC) feature is used to stream database changes (inserts, updates and deletes) from an Operational Oracle database into Striim
  2. CDC Events are enriched and denormalized with Streaming SQL and Cached data, in order to make relevant data available together
  3. Enriched, denormalized events are streamed to CosmosDB for real-time analytics
  4. Enriched streaming events can be monitored in real time with the Striim Web UI, and are available for further Streaming SQL analysis, wizard-based dashboards, and other applications on-premise or in the cloud.

Replacing Batch Extract with Real Time Streaming of CDC Order Events

Striim’s easy-to-use CDC wizards automate the creation of applications that leverage change data capture, to stream events as they are created, from various source systems to various targets. In this example, shown below, we use Striim’s OracleReader (Oracle Change Data Capture) to read the Order OLTP transactions in Oracle redo logs and stream these insert, update, delete operations, as soon as the transactions commit, into Striim, without impacting the performance of the source database.

Configuring Database Properties for the Oracle CDC Data Source

Utilizing Caches For Enrichment

Relational databases typically have a normalized schema, which makes storage efficient but requires joins for queries and does not scale well horizontally. NoSQL databases typically have a denormalized schema, which scales across a cluster because data that is read together is stored together.

Normalized Schema with Joins for Queries Does Not Scale Horizontally

With a normalized schema, many of the data fields are in the form of IDs. This is very efficient for the database, but IDs carry little meaning or context for downstream queries or analytics. In this example we want to enrich the raw Orders data with reference data from the SalesRep table, correlated by the Order Sales_Rep_ID, to produce a denormalized record that includes the Sales Rep Name and Email information, making analysis easier by having this data available together.

Since the Striim platform is a high-speed, low-latency, SQL-based stream processing platform, reference data also needs to be loaded into memory so that it can be joined with the streaming data without slowing things down. This is achieved through the use of the Cache component. Within the Striim platform, caches are backed by a distributed in-memory data grid that can contain millions of reference items distributed around a Striim cluster. Caches can be loaded from database queries, Hadoop, or files, and they maintain data in memory so that joins with them can be very fast. In this example, shown below, the cache is loaded with a query on the SalesRep table using the Striim DatabaseReader.

Configuring Database Properties for the Sales Rep Cache
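The cache-load query itself can be very simple. Under the assumptions of this example it only needs to return the reference columns that the enrichment step will use (the column names below are illustrative):

-- Illustrative load query for the SalesRep cache
SELECT SALES_REP_ID, NAME, EMAIL
FROM SALESREP;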

Joining Streaming and Cache Data For Real Time Transforming and Enrichment With SQL

We can process and enrich data-in-motion using continuous queries written in Striim’s SQL-based stream processing language. Using a SQL-based language is intuitive for data processing tasks, and most common SQL constructs can be utilized in a streaming environment. The main differences between using SQL for stream processing and its more traditional use as a database query language are that all processing is in memory and data is processed continuously, such that every event arriving on a query’s input stream can result in an output.

Dataflow Showing Joining and Enrichment of CDC data with Cache

This is the query we will use to process and enrich the incoming data stream:

Full Transformation and Enrichment Query Joining the CDC Stream with Cache Data

In this query we select the Order stream and SalesRep cache fields that we want, apply transformations to convert data types, put the Order stream and SalesRep cache in the FROM clause, and include a join on SALES_REP_ID as part of the WHERE clause. The result of this query is to continuously output enriched (denormalized) events, shown below, for every CDC event that occurs for the Orders table. So with this approach we can join streams from an Oracle Change Data Capture reader with cached data for enrichment.
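Based on that description, the continuous query has roughly the following shape. This is a SQL-style sketch rather than the exact query in the screenshot; the stream, cache, and column names are illustrative, and the data type conversions are indicated only by a comment:

-- Join each CDC event on the Orders stream with the in-memory SalesRep cache.
-- Type conversions on the selected Order fields would be applied in this SELECT list.
SELECT o.ORDER_ID,
       o.ORDER_DATE,
       o.ORDER_TOTAL,
       s.NAME AS SALES_REP_NAME,
       s.EMAIL AS SALES_REP_EMAIL
FROM OrderStream o, SalesRepCache s
WHERE o.SALES_REP_ID = s.SALES_REP_ID;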

Events After Transformation and Enrichment

Loading the Enriched Data to the Cloud for Real Time Analytics

Now the Oracle CDC (Oracle change data capture) data, streamed and enriched through Striim, can be stored simultaneously in Azure Cloud blob storage and Azure Cosmos DB, for elastic storage with advanced big data analytics, using the Striim AzureBlobWriter and the CosmosDBWriter shown below.

The image below shows the Striim flow web UI for our streaming ETL application. Flows define what data an application receives, how it processes the data, and what it does with the results.

End-to-End Data Flow

Using Kafka for Streaming Replay and Application Decoupling

The enriched stream of order events can be backed by or published to Kafka for stream persistence, laying the foundation for streaming replay and application decoupling. Striim’s native integration with Apache Kafka makes it quick and easy to leverage Kafka to make every data source replayable, enabling recovery even for streaming sources that cannot be rewound. This also decouples applications, enabling multiple applications to be powered by the same data source, and allowing new applications, caches, or views to be added later.

Streaming SQL for Aggregates

We can further use Striim’s Streaming SQL on the denormalized data to make a real-time stream of summary metrics about the events being processed available to Striim Real-Time Dashboards and other applications. For example, to create a running count and sum of orders per SalesRep in the last hour from the stream of enriched orders, you would use a window and the familiar GROUP BY clause:

CREATE WINDOW OrderWindow
OVER EnrichCQ
KEEP WITHIN 1 HOUR
PARTITION BY sales_rep_id

SELECT sales_rep_id, sales_rep_Name,
COUNT(*) as orderCount,
SUM(order_total) as totalAmount
FROM OrderWindow
GROUP BY sales_rep_id

Monitoring

With the Striim Monitoring Web UI we can now monitor our data pipeline with real-time information for the cluster, application components, servers, and agents. The Main monitor page lets you visualize summary statistics for Events Processed, App CPU%, Server Memory, and Server CPU%. Below, the Monitor App page displays our app’s Resources, Performance, and Components.

Striim Monitoring Web UI Monitor App Page

Clicking an app component’s ‘more details’ button displays more detailed performance information, such as CPU and event rate, as shown below:

Striim Monitoring Web UI Monitor App Component Details Page

Summary

In this blog post, we discussed how we can use Striim to:

  1. Perform Oracle Change Data Capture to stream database changes in real time
  2. Use streaming SQL and caches to easily denormalize data in order to make relevant data available together
  3. Load streaming enriched data to the cloud for real-time analytics
  4. Use Kafka for persistent streams
  5. Create rolling aggregates with streaming SQL
  6. Continuously monitor data pipelines

Additional Resources:

To read more about real-time data ingestion, please visit our Real-Time Data Integration solutions page.

To learn more about the power of streaming SQL, visit Striim Platform Overview product page, schedule a demo with a Striim technologist, or download a free trial of the platform and try it for yourself!

To learn more about Striim’s capabilities to support the data integration requirements for an Azure hybrid cloud architecture check out all of Striim’s solutions for Azure.

Getting Started with Real-Time ETL to Azure SQL Database

 

 

Running production databases in the cloud has become the new norm. For us at Striim, real-time ETL to Azure SQL Database and other popular cloud databases has become a common use case. Striim customers run critical operational workloads in cloud databases and rely on our enterprise-grade streaming data pipelines to keep their cloud databases up-to-date with existing on-premises or cloud data sources.

Striim supports your cloud journey starting with the first step. In addition to powering fully connected hybrid and multi-cloud architectures, the streaming data integration platform enables cloud adoption by minimizing risks and downtime during data migration. When you can migrate your data to the cloud without database downtime or data loss, it is easier to modernize your mission-critical systems. And when you liberate data trapped in legacy databases and stream it to Azure SQL DB in sub-seconds, you can run high-value, operational workloads in the cloud and drive business transformation faster.

Building continuous, streaming data pipelines from on-premises databases to production cloud databases for critical workloads requires a secure, scalable, and reliable integration solution. Especially if you have enterprise database sources that cannot tolerate performance degradation, traditional batch ETL will not suffice. Striim’s low-impact change data capture (CDC) feature minimizes overhead on the source systems while moving database operations (inserts, updates, and deletes) to Azure SQL DB in real time with security, reliability, and transactional integrity.

Striim is available as a PaaS offering in major cloud marketplaces such as Microsoft Azure Cloud, AWS, and Google Cloud. You can run Striim in the Azure Cloud to simplify real-time ETL to Azure SQL Database and other Azure targets, such as Azure Synapse Analytics, Azure Cosmos DB, Event Hubs, ADLS, and more. The service includes heterogeneous data ingestion, enrichment, and transformation in a single solution before delivering the data to Azure services with sub-second latency. What users love about Striim is that it offers a non-intrusive, quick-to-deploy, and easy-to-iterate solution for streaming data integration into Azure.

To illustrate the ease of use of Striim and to help you get started with your cloud database integration project, we have prepared a Tech Guide: Getting Started with Real-Time Data Integration to Microsoft Azure SQL Database. You will find step-by-step instructions on how to move data from an on-premises Oracle Database to Azure SQL Database using Striim’s PaaS offering available in the Azure Marketplace. In this tutorial you will see how Striim’s log-based CDC enables a solution that doesn’t impact your source Oracle Database’s performance.

If you have, or plan to have, Azure SQL Databases that run operational workloads, I highly recommend that you use a free trial of Striim along with this tutorial to find out how fast you can set up enterprise-grade, real-time ETL to Azure SQL Database. On our website you can find additional tutorials for different cloud databases. So be sure to check out our other resources as well. For any streaming integration questions, please feel free to reach out.

 

A New Comprehensive Guide to Streaming ETL for Google Cloud

 

 

Not to brag, but since we literally wrote the book on data modernization with streaming data integration, it is our pleasure to provide you with a guide book on using streaming ETL for Google Cloud Platform. This eBook will help your company unleash innovative services and solutions by combining the power of streaming data integration with Google Cloud Platform services.

As part of your data modernization and cloud adoption efforts, you cannot ignore how you collect and move your data to your new data management platform. But, like adopting any new technology, there is complexity in the move and a number of things to consider, especially when dealing with mission-critical systems. We realize that the process of researching options, building requirements, getting consensus, and deciding on a streaming ETL for Google Cloud is never a trivial task.

As a technology partner of Google Cloud, we at Striim are thrilled to invite you to easily tap into the power of streaming ETL by way of our new eBook: A Buyer’s Guide to Streaming Data Integration for Google Cloud. If you’ve been looking to move to the Google Cloud or get more operational value in your cloud adoption journey, this eBook is your go-to guide.

This eBook provides an in-depth analysis of the game-changing trends of digital transformation. It explains why a new approach to data integration is required, and how streaming data integration (SDI) fits into a modern data architecture. With many use case examples, the eBook shows you how streaming ETL for Google Cloud provides business value, and why this is a foundational step. You’ll discover how this technology is enabling the business innovations of today – from ride sharing and fintech, to same-day delivery and retail/e-retail.

Here’s a rundown of what we hope you’ll learn through this eBook:

  • A clear definition of what streaming integration is, and how it compares and contrasts to traditional extract/transform/load (ETL) tools
  • An understanding of how SDI fits into existing as well as emerging enterprise architectures
  • The role streaming data integration architecture plays with regard to cloud migration, hybrid cloud, multi-cloud, etc.
  • The true business value of adopting SDI
  • What companies and IT professionals should be looking for in a streaming data integration solution, focusing on the value of combining SDI and stream processing in one integrated platform
  • Modern SDI use cases, and how these are helping organizations to transform their business
  • Specifically, the benefits of using the Striim SDI platform in combination with the Google Cloud Platform

The digital business operates in real time, and the limitations of legacy integration approaches will hold you back from the limitless potential that cloud platforms bring to your business. To ease your journey into adopting streaming ETL to Google Cloud, please accept our tested and proven guidance with this new eBook: A Buyer’s Guide to Streaming Data Integration for Google Cloud. By following the practical steps provided for you, you can reap the full benefits of Google Cloud for your enterprise. For further information on streaming data integration or the Striim platform, please feel free to contact us.

Advancement of Data Movement Technologies

Advancement of Data Movement Technologies: Whiteboard Wednesdays

 

In this Whiteboard Wednesday video, Irem Radzik, Head of Product Marketing at Striim, looks at how data movement technologies have evolved in response to changing user demands. Read on, or watch the 8-minute video:

Today we’re going to talk about the advancement of data movement technologies. We’re going to look at the ETL technologies that we started seeing in the ‘90s, then the CDC (Change Data Capture)/Logical Replication solutions that we started seeing a couple of decades ago, and then the streaming data integration solutions that we more commonly see today.

ETL

Let’s look at ETL technologies. ETL is known for its batch extract, then bringing the data into the transformation step in a middle-tier server, and then loading the target in bulk again, typically for next-day reporting. You end up with high latency with these types of solutions. That was good enough for the ‘90s, but then we started demanding fresher data for operational decision making. Latency became an issue with ETL solutions.

Data Movement - ETL

The other issue with ETL was the batch-window dependency. Because of the high impact on the production sources, there had to be a dedicated time for these batch extracts when the main users wouldn’t be able to use the production database. The batch window that was available for data extract became shorter and shorter as business demanded continuous access to the OLTP system.

The data volumes increased at the same time. You ended up not having enough time to move all the data you needed. That became a pain point for ETL users, driving them to look into other solutions.

Change Data Capture/Logical Replication

Change Data Capture/Logical Replication solutions addressed several of the key concerns that ETL had. Change Data Capture basically means that you continuously capture new transactions happening in the source database and deliver them to the target in real time.

Data Movement - CDC / Logical Replication

That obviously helps with the data latency problem. You end up having real-time, up-to-date data in the target for your operational decision making. The other plus of CDC is the source impact.

When it’s using logs (database logs) to capture the data, it has negligible impact. The source production system is available for transaction users. There is no batch window needed and no limitations for how much time you have to extract and move the data.

The CDC/Logical Replication solutions handle some of the key concerns of ETL users. They are made more for the E and L steps. What ends up happening with these solutions is that you need to do transformations within the database or with another tool, in order to complete the transformation step for end users.

The transformation happening there creates an ELT architecture and requires another product, another step, another network hop in your architecture, which complicates the process.

When there’s an outage, when there is a process disruption, reconciling your data and recovering becomes more complicated. That’s the shortcoming CDC users have been facing. These solutions were mainly made for databases.

Once the cloud and big data solutions became popular, the CDC providers had to come up with new products for cloud and big data targets. These are add-ons, not part of the main platform.

Another shortcoming that we’ve seen with CDC/Logical Replication solutions is their single node architecture, which translates into a single point of failure. This is a shortcoming, especially for mission-critical systems that need continuous availability of the data integration processes.

Streaming Data Integration

In recent years, streaming data integration came about to address the shortcomings of CDC/Logical Replication products. It is becoming increasingly common. With streaming data integration, you’re not limited to just database sources.

Data Movement - Streaming Data Integration

You can have your files, log data, your machine data, your system log files for example, all moving in a real-time fashion. Your cloud sources, your service bus or your messaging systems can be your source. Your sensor data can be moved in real time, in a streaming fashion, to multiple targets. Again, not limited to just databases.

You can have cloud databases or other cloud services as your target. You can, in addition to databases, have messaging systems as your target, on-premises or in cloud, your big data solutions, on-premises or cloud. You can also deliver in file format.

Everything is like it was in a logical replication solution. It is continuous, in real time, and Change Data Capture is still a big component of the streaming data integration.

It’s built on top of the Change Data Capture technologies and brings additional data sources and additional data targets. Another important difference, and handling one of the challenges of logical replication, is the transformation piece. As we discussed, a transformation needs to happen and where it happens makes a big difference.

With streaming data integration, it’s happening in-flight. While the data is moving, you can have stream processing without adding more latency to your data. While the data is moving, it can be filtered, it can be aggregated, it can be masked and encrypted, and enriched with reference data, all in flight before it’s delivered to your target, so that it’s available in a consumable format. This streamlines your architecture, simplifies it, and makes all the recovery steps easier. It’s also delivering the data in the format that your users need.

Another important thing to highlight is the distributed architecture. This natively clustered environment helps with a single point of failure risk. When one node fails, the other one takes over immediately, so you have a highly available data pipeline. This distributed clustered environment also helps you to scale out very easily, add more servers as you have more data to process and move.

These solutions now come with monitoring capabilities. Real-time monitoring of the pipelines gives you an understanding of what’s happening with your integration flows. If there is an issue, such as high data latency or a process problem, you get immediate alerts so you can trust that everything is running.

Data reliability is critical, and so is the reliability of the whole pipeline. To make sure that there is no data loss and no duplicates, data delivery validation can be included in some of these solutions. With the right solution, you can also make sure that everything is processed exactly once, and you are not repeating or dropping data. There are checkpointing mechanisms to be able to do that.

As you see, the new streaming data integration solutions handle some of the challenges that we have seen in the past with outdated data movement technologies. To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

 

Why Choose Striim + Snowflake


 

 

With greater speed, ease, and flexibility, Snowflake’s cloud data warehouse helps you gain meaningful insights by providing the performance and simplicity that traditional data warehouse offerings could not support.

Adopting a data warehouse in the cloud with Snowflake requires a modern approach to the movement of enterprise data. This data is often generated from diverse data sources deployed in various locations – including on-prem data centers, major public clouds, and devices. Snowflake users need real-time data movement capabilities to realize the full potential of data warehousing in the cloud, and benefit from more meaningful operational intelligence.

While there are many vendors that provide data movement to Snowflake, Striim’s real-time, enterprise-grade streaming ETL solution offers advanced capabilities that other vendors can’t match, including the ability to:

  • Bring a wide range of data sets (including security log data, Kafka/messaging, IoT data, OLAP and OLTP) in a consumable format to Snowflake to achieve rich, timely, and reliable insights fast
  • Use robust, reliable, low-impact change data capture (CDC) from major enterprise databases
  • Aggregate, filter, denormalize, enrich, and mask real-time data in-flight using a SQL-based language before delivering to Snowflake to rapidly gain time-sensitive insights
  • Combine and correlate machine data, OLAP, OLTP and IoT data with other data sources in-flight for complete and rich operational intelligence
  • Perform online migration from existing on-prem data warehouses (such as Oracle Exadata or Teradata) to Snowflake with minimal interruption and risk
  • Offer an enterprise-grade solution designed for mission-critical, high-data-volume environments with built-in HA, scalability, exactly-once processing (no data loss or duplicates), and security, all in a patented distributed platform

In fact, many current Striim + Snowflake customers previously deployed a solution from another vendor, only to find that the solution did not meet their needs for scalability, reliability, in-flight processing, or, simply, data access.

Let’s drill down on the ways Striim supports Snowflake’s advanced analytics applications with modern, enterprise-grade streaming ETL, and further allows customers to gain operational value from their Snowflake solutions.

Moving data in real time from diverse sources

Data ingestion providers that only collect data from a limited range of sources cannot support complete and rich operational intelligence in Snowflake. The Striim platform ingests real-time streaming data from a variety of sources out of the box, including data warehouses and databases (such as Oracle, SQL Server, HPE NonStop, MongoDB, Amazon RDS, and MySQL), log files from security devices and other systems, sensors, messaging systems, and Hadoop solutions.

Avoiding batch ETL-related inefficiencies

Data integration providers that use traditional batch ETL for the movement of data into Snowflake are unable to support real-time operational intelligence and time-sensitive analytics use cases in Snowflake. When users adopt an ELT architecture, using replication solutions with another tool for in-target transformations, this approach creates complexities, especially during process recovery. Striim offers an end-to-end solution with a simplified architecture to bring a wide range of data in real time.

Minimizing source impact and interruption

While there are providers that offer very simplistic movement of file data and do not support CDC from databases, these solutions cannot scale and may require additional products to be integrated into the configuration. Using Striim, businesses can easily adopt a cloud data warehouse, with online migration from existing data warehouses involving minimal disruption and risk.

Assuring security and reliability

Striim is an enterprise-grade solution with built-in HA, security, scalability, and exactly once processing (no data loss or duplicates) for business-critical production systems.

Applying transformations and enrichments

Striim applies filtering, transformations, aggregations, masking, and enrichment using static or streaming reference data in real time – as the data is being delivered into Snowflake in a consumable format – to accelerate the delivery of rich, timely, and reliable insights.

Striim + Snowflake Customer Use Cases

To illustrate the above points, we would like to share a couple of Snowflake customers that chose Striim to gain operational intelligence for critical operations.

A Leading International Cruise Line

  • Ingests real-time POS data from spa, casinos, and stores to enterprise data warehouse on Snowflake to generate near real-time offers for travelers.
  • Striim moves spending data from Oracle and SQL Server databases and GoldenGate Trail Files to Snowflake.
  • The solution provides real-time promotional offers and discounts to customers based on their spending behavior to improve both satisfaction and revenue.

A European HR Solutions Provider

  • Ingests data from Oracle and SQL Server databases into Snowflake using real-time CDC.
  • Chose Striim for low data latency with built-in scalability, security, and reliability.
  • A zero-maintenance solution and pay-per-use pricing were also key considerations.

By streaming enterprise data to Snowflake with built-in scalability, security, and reliability, Striim simplifies the adoption of a modern, cloud data warehouse for time-sensitive, operational decision making.

We invite you to learn more about Striim’s Snowflake offering by visiting our Snowflake solutions page. Feel free to contact us if we can help you reliably move data in real time to Snowflake.

Microsoft SQL Server to Kafka

Microsoft SQL Server CDC to Kafka

By delivering high volumes of data using Microsoft SQL Server CDC to Kafka, organizations gain visibility into their business and the vital context needed for timely operational decision making. Getting maximum value from Kafka solutions requires ingesting data from a wide variety of sources – in real time – and delivering it to users and applications that need it to take informed action to support the business.

Traditional methods used to move data, such as ETL, are just not sufficient to support high-volume, high-velocity data environments. These approaches delay getting data to where it can be of real value to the organization. Moving all the data, regardless of relevance, to the target creates challenges in storing it and getting actionable data to the applications and users that need it. Microsoft SQL Server CDC to Kafka minimizes latency and prepares data so it is delivered in the correct format for different consumers to utilize.

In most cases, the data that resides in transactional databases like Microsoft SQL Server is the most valuable to the organization. The data is constantly changing, reflecting every event or transaction that occurs. Using non-intrusive, low-impact change data capture (CDC), the Striim platform moves and processes only the changed data. With Microsoft SQL Server CDC to Kafka, users manage their data integration processes more efficiently and in real time.

Using a drag-and-drop UI and pre-built wizards, Striim simplifies creating data flows for Microsoft SQL Server CDC to Kafka. Depending on users’ requirements, the data can either be delivered “as-is,” or in-flight processing can filter, transform, aggregate, mask, and enrich the data. This delivers the data in the format needed, with all the relevant context, to meet the needs of different Kafka consumers – with sub-second latency.
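As a rough illustration of that in-flight step, a continuous query feeding the Kafka target might filter and mask records along the following lines. This is a SQL-style sketch, not Striim's exact syntax; the stream name, fields, and masking approach are assumptions made for the example:

-- Keep only completed orders and mask the card number before delivery to Kafka.
SELECT ORDER_ID,
       ORDER_TOTAL,
       STATUS,
       'XXXX-XXXX-XXXX-XXXX' AS CARD_NUMBER
FROM SqlServerCDCStream
WHERE STATUS = 'COMPLETED';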

Striim is an end-to-end platform that delivers the security, recoverability, reliability (including exactly once processing), and scalability required by an enterprise-grade solution. Built-in monitoring also compares sources and targets and validates that all data has been delivered successfully. 

In addition to Microsoft SQL Server, Striim offers non-intrusive change data capture (CDC) solutions for a range of enterprise databases including Oracle, PostgreSQL, MongoDB, HPE NonStop SQL/MX, HPE NonStop SQL/MP, HPE NonStop Enscribe, and MariaDB.

For more information about how to use Microsoft SQL Server CDC to Kafka to maintain real-time pipelines for continuous data movement, please visit our Change Data Capture solutions page.

If you would like a demo of how Microsoft SQL Server CDC to Kafka works and to talk to one of our technologists, please contact us to schedule a demo.

Real-Time Data Integration to Google Cloud Spanner

Striim Announces Real-Time Data Migration to Google Cloud Spanner

The Striim team has been working closely with Google to deliver an enterprise-grade solution for online data migration to Google Cloud Spanner. We’re happy to announce that it is available in the Google Cloud Marketplace. This PaaS solution facilitates the initial load of data (with exactly once processing and delivery validation), as well as the ongoing, continuous movement of data to Cloud Spanner.

The real-time data pipelines enabled by Striim from both on-prem and cloud sources are scalable, reliable and high-performance. Cloud Spanner users can further leverage change data capture to replicate data in transactional databases to Cloud Spanner without impacting the source database, or interrupting operations.

Google Cloud Spanner is a cloud-based database system that is ACID compliant, horizontally scalable, and global. Spanner is the database that underlies much of Google’s own data collection, and it has been designed to offer the consistency of a relational database with the scale and performance of a non-relational database.

Migration to Google Cloud Spanner requires a low-latency, low-risk solution to feed mission-critical applications. Striim offers an easy-to-use solution to move data in real time from Oracle, SQL Server, PostgreSQL, MySQL, and HPE NonStop to Cloud Spanner while ensuring zero downtime and zero data loss. Striim is also used for real-time data migration from Kafka, Hadoop, log files, sensors, and NoSQL databases to Cloud Spanner.

While the data is streaming, Striim enables in-flight processing and transformation of the data to maximize usability of the data the instant it lands in Cloud Spanner.

To learn more about Striim’s Real-Time Migration to Google Cloud Spanner, read the related press release, view our Striim for Google Cloud Spanner product page, or provision Striim’s Real-Time Data Integration to Cloud Spanner in the Google Cloud Marketplace.

Setting Up Streaming ETL to Snowflake

Snowflake, the data warehouse built for the cloud, is designed to bring power and simplicity to your cloud-based analytics solutions, especially when combined with streaming ETL to Snowflake running in the cloud.

Snowflake helps you make better and faster business decisions using your data on a massive scale, fueling data-driven organizations. Just take a look at Snowflake’s example use cases and you can see how companies are creating value from their data with Snowflake. There’s just one key caveat – how do you get your data into Snowflake in the first place?

Approaches – ETL/CDC/ELT
There are plenty of options when it comes to using data integration technologies, including ETL to Snowflake.

Let’s start with traditional ETL. Now a decades-old legacy technology, ETL was the genesis of data movement and enabled batch, disk-based transformations. While ETL is still used for advanced transformation capabilities, the high latencies and immense load on your source databases leave something to be desired.

Next, there was Change Data Capture (CDC). Pioneered by the founders of Striim at their previous company, GoldenGate Software (acquired by Oracle), CDC technology enabled use cases such as zero-downtime database migration and heterogeneous data replication. However, CDC lacks transformational capabilities, forcing you into an ELT approach – first landing the data in a staging area, then transforming it to its final form. While this works, the multiple hops increase your end-to-end latency and architectural complexity.

Continuously Integrating Transactional Data into Snowflake

Enter Striim. Striim is an evolution from GoldenGate that combines the real-time nature of CDC with many of the transformational capabilities of ETL into a next-generation streaming solution for ETL to Snowflake and other analytics platforms, on-premises or in the cloud. Enabling real-time data movement into Snowflake, Striim continuously ingests data from on-premises systems and other cloud environments to Snowflake. In this quick start guide, we will walk you through, step by step, how to use Striim for streaming ETL to Snowflake by loading data in real time, whether you run Snowflake on Azure or AWS.

Data Flow
We’ll get started with an on-premises Oracle to Snowflake application with in-line transformations and denormalization. This guide assumes you already have Striim installed either on-premises or in the cloud, along with your Oracle database and Snowflake account configured.

After installing Striim, there are a variety of ways to create applications, or data pipelines, from a source to a target. Here, I’ll focus on using our pre-built wizards and drag-and-drop UI, but you can also build applications with the drag-and-drop UI from scratch, or by using a declarative language via the CLI.

We will show how you can set up the flow between the source and target, and then how you can enrich records using an in-memory cache that’s preloaded with reference data.

1. In the Add App page, select Start with Template.

Striim for ETL to Snowflake

2. In the following App Wizard screen, search for Snowflake.

Striim for ETL to Snowflake

3. For this example, we’ll choose Oracle CDC to Snowflake.

Striim for ETL to Snowflake

4. Name the application whatever you’d like – we’ll choose oracleToSnowflake. Go ahead and use the default admin Namespace. Namespaces are used for application organization and enable a microservices approach when you have multiple data pipelines. Click Save.

Striim for ETL to Snowflake

5. Follow the wizards, entering first your on-premises Oracle configuration properties, and then your Snowflake connection properties. In this case I’m migrating an Oracle orders table. Click Save, and you’ll be greeted by our drag and drop UI with the source and target pre-populated. If you want to just do a straight source-to-target migration, that’s it! However, we’ll continue this example with enrichment and denormalization, editing our application using the connectors located on the left-hand side menu bar.

Striim for ETL to Snowflake

6. In this use case, we’ll enrich the Orders table with another table of the same on-premises Oracle database. Locate the Enrichment tab on the left-hand menu bar, and drag and drop the DB Cache to your canvas.

Striim for ETL to Snowflake

7. First, name the cache whatever you’d like – I chose salesRepCache. Then, specify the Type of your cache. In this case, my enrichment table contains three fields: ID, Name, and Email. Specify a Key to map. This tells Striim’s in-memory cache how to position the data for the fastest possible joins. Finally, specify your Oracle Username, the JDBC Connection URL, your password, and the tables that you want to use as a cache. Click Save.

Striim for ETL to Snowflake

8. Now we’ll go ahead and join our streaming CDC source with the static Database Cache. Click the circular stream beneath your Oracle source, and click Connect next CQ component.

Striim for ETL to Snowflake

9. Application logic in Striim is expressed using Continuous Queries, or CQs. You write them using standard SQL syntax, with optional Java functionality for custom scenarios. Unlike a query on a database, where you run one query and receive one result, a CQ is constantly running, executing the query on an event-by-event basis as the data flows through Striim. Data can be easily pre-formatted or denormalized using CQs.

10. In this example, we are doing a few simple transformations of the fields of the streaming Oracle CDC source, as well as enriching the source with the database cache – adding in the SALES_REP_NAME and SALES_REP_EMAIL fields where the SALES_REP_ID of the streaming CDC source equals the SALES_REP_ID of the static database cache. Specify the name of the stream you want to output the result to, and click Save. Your logic here may vary depending on your use case.

Striim for ETL to Snowflake
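The query in the screenshot is not reproduced here, but based on the description in step 10 it looks roughly like the following SQL-style sketch. The CDC stream name is a placeholder; salesRepCache and the SALES_REP_* fields come from the earlier steps, and the result is routed to enrichedStream as the query's output stream:

-- Enrich each CDC event with the sales rep's name and email from the cache.
SELECT o.*,
       c.NAME AS SALES_REP_NAME,
       c.EMAIL AS SALES_REP_EMAIL
FROM OracleCDCStream o, salesRepCache c
WHERE o.SALES_REP_ID = c.SALES_REP_ID;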

11. Lastly, we have to configure our SnowflakeTarget to read from the enrichedStream, not the original CDC source. Click on your Snowflake target and change the Input Stream from the Oracle source stream to your enriched stream. Click Save.

Striim for ETL to Snowflake

12. Now you’re good to go! In the top menu bar, click on Created and press Deploy App.

Striim for ETL to Snowflake

13. The deployment page allows you to specify where you want specific parts of your data pipeline to run. In this case I have a very simple deployment topology – I’m just running Striim on my laptop, so I’ll choose the default option.

Striim for ETL to Snowflake

14. Click the eye next to your enrichedStream to preview your data as it’s flowing through, and press Start App in the top menu bar.

Striim for ETL to Snowflake

15. Now that the app is running, let’s generate some data. In this case I just have a sample data generator that is connecting to my source on-premises Oracle database.

Striim for ETL to Snowflake

16. Data is flowing through the Striim platform, and you can see the enriched Sales Rep Names and Emails.

Striim for ETL to Snowflake

17. Lastly, let’s go to our Snowflake warehouse and just do a simple select * query. Data is now being continuously written to Snowflake.

Striim for ETL to Snowflake

That’s it! Without any coding, you now have set up streaming ETL to Snowflake to load data continuously, in real time.

Interested in learning more about streaming ETL to Snowflake? Check out our Striim for Snowflake solution page, schedule a demo with a Striim technologist, or download the Striim platform to get started!

5 Advantages of Real-Time ETL for Snowflake

 

 

If you have Snowflake or are considering it, now is the time to think about your ETL for Snowflake. This blog post describes the advantages of real-time ETL and how it increases the value gained from Snowflake implementations.

With instant elasticity, high performance, and secure data sharing across multiple clouds, Snowflake has become highly in demand for its cloud-based data warehouse offering. As organizations adopt Snowflake for business-critical workloads, they also need to look for a modern data integration approach. A streaming ETL approach loads data to Snowflake from diverse sources – such as transactional databases, security systems’ logs, and IoT sensors/devices – in real time, while simultaneously meeting scalability, latency, security, and reliability requirements.

Striim offers an out-of-the-box adapter for Snowflake to stream real-time data from enterprise databases (using low-impact change data capture), log files from security devices and other systems, IoT sensors and devices, messaging systems, and Hadoop solutions, and to provide in-flight transformation capabilities. With built-in HA, security, reliability, and scalability features, Striim makes it easy for Snowflake customers to seamlessly adopt a modern, enterprise data warehouse that uses up-to-date data and supports operational decision making.

There are several advantages of using a streaming data integration solution such as Striim to enable real-time ETL for Snowflake. Here are the Top 5:

  1. Using Striim’s real-time data synchronization capabilities, businesses have the option to implement a phased migration to Snowflake from existing on-prem or cloud-based data warehouses. As such, there is no downtime for the legacy environment, and risks are minimized by allowing for extensive testing of the new Snowflake environment.
  2. When combined with low-impact CDC, real-time ETL turns enterprise databases into a streaming source of critical business transactions, enabling richer business insights. In addition to log files, sensors, and messaging systems, Striim continuously ingests real-time data from cloud-based or on-premises data warehouses and databases such as Oracle, Oracle Exadata, Teradata, Netezza, Amazon Redshift, SQL Server, HPE NonStop, MongoDB, and MySQL.
  3. Real-time ETL enables low-latency data for time-sensitive analytics use cases, such as detecting and predicting security threats instantaneously and enabling location-based marketing, that provide significant operational value to the business.
  4. Before loading the data to Snowflake with sub-second latency, Striim allows users to perform in-line transformations, including denormalization, filtering, enrichment, and masking, using a SQL-based language. In-flight data processing reduces the time needed for data preparation as it delivers the data in a consumable form.
  5. In-flight transformation also enables a simplified and scalable data architecture that has many related benefits including:
    • Minimizing ETL workloads by performing transformations while data is in motion
    • Optimizing data storage by filtering out unnecessary data
    • Enabling end-to-end recoverability and full resiliency without needing to handle many different components and network hops across these components
    • Supporting compliance with privacy-related regulations by enabling data masking before delivery

Striim runs on both AWS and Azure, and enables easy deployment and fast time-to-market for real-time ETL to Snowflake with its wizards-based, drag-and-drop UI.

I invite you to learn more about how Striim supports Real-Time ETL for Snowflake by visiting our Striim for Snowflake solution page, schedule a demo with a Striim technologist, or download the Striim platform to get started!

 

Why Hybrid Cloud Requires Enterprise-Grade Streaming

Today, we are discussing why hybrid cloud and hybrid IT infrastructures require not just streaming data integration, but enterprise-grade streaming data integration. While streaming integration linked with hybrid deployment environments is necessary for organizations to take full advantage of their real-time data, the combination is not sufficient to make this a reality.

For the exploration and validation example above, any grade of streaming integration will do. “Free” open-source facilities, such as Apache Software Foundation projects, can be used for the validation of a scenario or a proof of concept. However, as organizations move their efforts toward mission-critical operations of a business, they need to have a technical infrastructure that provides stability and enterprise-grade performance.

The traits of an enterprise-grade streaming integration facility include the following:

  • Ease of use for configuration and management – Configuring and managing streaming integration is no easy task. However, an enterprise-grade graphical user interface (GUI) is key to allowing multiple members of an organization to configure or manage events within the environment. In particular, data center support teams need the ability to monitor and manage processes and data pipelines without having to involve senior technical resources.
  • Ability to recover from failure – Similar to the ability to easily monitor and manage via a GUI, organizations need their streaming integration facilities to provide the ability to recover from failures. If upstream providers of data make unannounced or unintended adjustments to data formats, streaming platforms need to be able to handle those changes gracefully without crashing the environment.
  • Security and privacy provisions – With greater amounts of sensitive personal information being generated and distributed around an organization, the security and privacy features of a streaming integration solution need to be up to the task of meeting enterprise security requirements. Breakdowns in these areas are not just mistakes: they often carry government regulatory, corporate compliance, and public reputation consequences that cause financial damage to organizations.
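
As a rough illustration of the "ability to recover from failure" trait, the sketch below shows one common defensive pattern: validating each incoming event and quarantining or defaulting unexpected formats instead of letting the pipeline crash. The field names and the dead-letter handling are assumptions, not a description of how Striim handles schema drift.

```python
import json
import logging

logger = logging.getLogger("pipeline")

EXPECTED_FIELDS = {"event_id", "timestamp", "payload"}

def parse_event(raw: bytes):
    """Parse one upstream event, tolerating unannounced format changes."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        # Quarantine malformed input rather than crashing the environment.
        logger.warning("Malformed event routed to dead-letter handling: %r", raw[:80])
        return None

    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        # Upstream schema drift: degrade gracefully and alert operators.
        logger.warning("Event %s missing fields %s; applying defaults",
                       record.get("event_id"), sorted(missing))
        for field in missing:
            record[field] = None
    return record
```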

To learn more about the capabilities of Striim’s enterprise-grade streaming integration platform, visit our Striim Platform Overview product page, schedule a demo with a Striim technologist, or download a free trial of the Striim platform and try it for yourself.

Evolution of Streaming ETL

The Inevitable Evolution from Batch ETL to Streaming ETL (Part 2 of 2)

 

Following on from Part 1 of this two-part blog series on the evolution from traditional batch ETL to real-time streaming ETL, I would like to discuss how Striim, a patented streaming data integration platform, supports this shift by offering fast-to-deploy, real-time streaming ETL solutions for on-premises and cloud environments.

In the first part of this blog series, we talked about the requirement for data integration solutions to work seamlessly with the new data platforms in the cloud, and to offer both the transformation and enrichment capabilities of traditional ETL and the low-latency data capture and distribution capabilities of logical replication. Striim’s streaming data integration software was designed to support these requirements by enabling the following (a simplified end-to-end sketch appears after the list):

  • Real-time, low-impact data ingestion from relational databases using CDC, and from non-relational sources such as log files, IoT sensors, Hadoop, MongoDB, and Kafka
  • In-memory, SQL-based in-line data transformations, such as denormalization, and enrichments before delivering the data in a consumable format and with sub-second latency
  • Delivery to relational databases, Kafka and other messaging solutions, Hadoop, NoSQL, files, on-premises or in the cloud in real time
  • Continuous monitoring of data pipelines and delivery validation for zero data loss
  • Enterprise-grade solutions with built-in exactly-once processing, security, high availability, and scalability using a patented distributed architecture
  • Fast time to market with a wizards-based, drag-and-drop UI

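For orientation, the following is a deliberately simplified sketch of the pipeline shape implied by the list above: a replayable source feeding in-memory processing and then a target writer. The component names and wiring are hypothetical and are not Striim's API.

```python
from typing import Callable, Iterable, Optional

def run_pipeline(source: Iterable[dict],
                 transform: Callable[[dict], Optional[dict]],
                 deliver: Callable[[dict], None]) -> None:
    """Continuously read events, process them in memory, and deliver the results."""
    for event in source:
        processed = transform(event)
        if processed is not None:   # events filtered out in-flight are simply dropped
            deliver(processed)

# Hypothetical wiring: run_pipeline(cdc_change_stream, transform, snowflake_writer)
```
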
As a result of this shift to real-time streaming ETL, businesses can now offload their high-value operational workloads to the cloud because they can easily stream their on-premises and cloud-based data to cloud-based analytics solutions in real time and in the right format. They can also rapidly build real-time analytics applications on Kafka, NoSQL, or Kudu that use real-time data from a wide range of sources. These real-time analytics applications help them capture perishable, time-sensitive insights that improve customer experience, manage risks and SLAs effectively, and improve operational efficiency across their organizations.

From an architecture perspective, streaming ETL with distributed stream processing reduces complexity not only in ensuring high availability but also in maintaining consistent data states in the event of a process failure. Because Striim manages data flows end-to-end and in memory, after an outage it can ensure that data processing and delivery are consistent across all flows, so there is no missing data or out-of-sync situation in the target systems.
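
The general checkpoint-and-replay pattern behind that kind of guarantee can be sketched as follows; this is a simplified illustration under assumed APIs (a replayable source with a seek method and a file-based checkpoint store), not Striim's internal implementation.

```python
import json
import pathlib

CHECKPOINT_FILE = pathlib.Path("pipeline.checkpoint")

def save_checkpoint(source_position: int, window_contents: list) -> None:
    """Persist the source position together with buffered window contents."""
    CHECKPOINT_FILE.write_text(
        json.dumps({"position": source_position, "window": window_contents}))

def restore_checkpoint():
    """Return the last known-good position and window contents (or an empty state)."""
    if not CHECKPOINT_FILE.exists():
        return 0, []
    state = json.loads(CHECKPOINT_FILE.read_text())
    return state["position"], state["window"]

def recover(replayable_source, window: list) -> None:
    """Rewind the source and rebuild the in-memory window after an outage."""
    position, saved_window = restore_checkpoint()
    window.extend(saved_window)        # restore buffered events so nothing is lost
    replayable_source.seek(position)   # hypothetical replay API: rewind to last position
```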

Change is the only constant in our lives and in the technologies we work with. The evolution from traditional ETL to streaming ETL helps digital businesses meet their customers’ needs effectively and maintain a competitive edge. I invite you to have a conversation with us about using streaming ETL in your organization by scheduling a technical demo with one of our experts. You can also download Striim for free to experience its broad range of capabilities first-hand.

Evolution of Real-Time ETL

The Inevitable Evolution from Batch ETL to Real-Time ETL (Part 1 of 2)

 

 

Traditional extract, transform, load (ETL) solutions have, by necessity, evolved into real-time ETL solutions as digital businesses have increased both the speed at which they execute transactions and the need to share larger volumes of data across systems faster. In this two-part blog post series, I will describe the transition from traditional ETL to streaming, real-time ETL and how that shift benefits today’s data-driven organizations.

Data integration has been the cornerstone of digital innovation for the last several decades, enabling the movement and processing of data across the enterprise to support data-driven decision making. In decades past, when businesses collected and shared data primarily for strategic decision making, batch-based ETL solutions served these organizations well. A traditional ETL solution extracts data from databases (typically at the end of the day), transforms the data extensively on disk in a middle-tier server into a consumable form for analytics, and then loads it in batch to a target data warehouse with a significantly different schema to enable various reporting and analytics solutions.
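
For contrast with the streaming approach discussed later, here is a minimal sketch of that nightly batch flow. The table names, schema, and use of SQLite and CSV files are assumptions chosen only to keep the example self-contained.

```python
import csv
import sqlite3

def extract(conn: sqlite3.Connection, out_path: str) -> None:
    """Extract: dump the day's transactions to an intermediate file on disk."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in conn.execute(
                "SELECT id, customer_id, amount, created_at FROM orders"):
            writer.writerow(row)

def transform(in_path: str, out_path: str) -> None:
    """Transform: reshape rows on disk in the middle tier into a warehouse-friendly form."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for order_id, customer_id, amount, created_at in csv.reader(src):
            # e.g., round amounts and keep only the date portion of the timestamp
            writer.writerow([order_id, customer_id, round(float(amount), 2),
                             created_at[:10]])

def load(conn: sqlite3.Connection, in_path: str) -> None:
    """Load: bulk-insert the transformed batch into the warehouse fact table."""
    with open(in_path, newline="") as f:
        conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)",
                         csv.reader(f))
    conn.commit()
```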

As consumers demanded faster transaction processing, personalized experiences, and self-service with up-to-date data access, the data integration approach had to adapt to collect and distribute data to customer-facing and analytical applications more efficiently and with lower latency. In response, roughly two decades ago, logical data replication with change data capture (CDC) capabilities emerged. CDC moves only the changed data in real time, as opposed to extracting all available data as a snapshot, and delivers it to various databases.
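
A minimal sketch of the CDC idea, under assumed event and target shapes: only committed changes are read from a (hypothetical) change stream and applied to a replica, rather than re-extracting a full snapshot.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class ChangeEvent:
    operation: str   # "INSERT", "UPDATE", or "DELETE"
    key: int
    values: dict

def apply_changes(changes: Iterable[ChangeEvent], replica: dict) -> None:
    """Apply only the changed rows to a target keyed by primary key."""
    for change in changes:
        if change.operation in ("INSERT", "UPDATE"):
            replica[change.key] = change.values
        elif change.operation == "DELETE":
            replica.pop(change.key, None)
```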

These “new” technologies enabled businesses to create real-time replicas of their databases to support customer applications, migrate databases without downtime, and allow real-time operational decision making. Because CDC was not designed for extensive transformation of the data, logical replication and CDC tools led to an “extract, load, and transform” (ELT) approach, in which significant transformation and enrichment are required on the target system to put the data in the desired form for analytical processing. Many of the original logical replication offerings are also architected to run as single processes on one node, which creates a single point of failure and requires an orchestration layer to achieve true high availability.

The next wave of change came as analytical solutions shifted from traditional on-premises data warehousing on relational databases to Hadoop and NoSQL environments and Kafka-based streaming data platforms, deployed heavily in the cloud. Traditional ETL now had to evolve further into a real-time ETL solution that works seamlessly with data platforms both on-premises and in the cloud, and combines the robust transformation and enrichment capabilities of traditional ETL with the low-latency data capture and distribution capabilities of logical replication and CDC.

In Part 2 of this blog post, I will discuss these real-time ETL solutions in more detail, particularly focusing on Striim’s streaming data integration software which moves data across cloud and on-premises environments with in-memory stream processing before delivering data in milliseconds to target data platforms. In the meantime, please check out our product page to learn more about Striim’s real-time ETL capabilities.

Feel free to schedule a technical demo with one of our lead technologists, or download or provision Striim for free to experience its broad range of capabilities first-hand.