Advancement of Data Movement Technologies

Advancement of Data Movement Technologies: Whiteboard Wednesdays

In this Whiteboard Wednesday video, Irem Radzik, Head of Product Marketing at Striim, looks at how data movement technologies have evolved in response to changing user demands. Read on, or watch the 8-minute video:

Today we’re going to talk about the advancement of data movement technologies. We’re going to look at the ETL technologies that we started seeing in the ‘90s, then the CDC (Change Data Capture)/Logical Replication solutions that we started seeing a couple of decades ago, and then the streaming data integration solutions that we more commonly see today.

ETL

Let’s look at ETL technologies. ETL is known for its batch extract, followed by a transformation step in a middle-tier server, and then a bulk load into the target, typically for next-day reporting. You end up having high latency with these types of solutions. That was good enough for the ‘90s, but then users started demanding fresher data for operational decision making. Latency became an issue with ETL solutions.

Data Movement - ETL

The other issue with ETL was the batch-window dependency. Because of the high impact on the production sources, there had to be a dedicated time for these batch extracts, when the main users wouldn’t be able to use the production database. The batch window available for data extraction became shorter and shorter as the business demanded continuous access to the OLTP system.

The data volumes increased at the same time. You ended up not having enough time to move all the data you needed. That became a pain point for ETL users, driving them to look into other solutions.

Change Data Capture/Logical Replication

Change Data Capture/Logical Replication solutions addressed several of the key concerns that ETL had. Change Data Capture basically means that you continuously capture new transactions happening in the source database and deliver them to the target in real time.

Data Movement - CDC / Logical Replication

That obviously helps with the data latency problem. You end up having real-time, up-to-date data in the target for your operational decision making. The other plus of CDC is its minimal source impact.

When it uses database logs to capture the data, it has negligible impact. The source production system remains available to transaction users. There is no batch window needed and no limit on how much time you have to extract and move the data.

The CDC/Logical Replication solutions handle some of the key concerns of ETL users, but they are designed mainly for the E and L steps. What ends up happening with these solutions is that you need to do the transformations within the target database or with another tool in order to complete the transformation step for end users.

The transformation happening there creates an ELT architecture and requires another product, another step, another network hop in your architecture, which complicates the process.

When there’s an outage, when there is a process disruption, reconciling your data and recovering becomes more complicated. That’s the shortcoming CDC users have been facing. These solutions were mainly made for databases.

Once the cloud and big data solutions became popular, the CDC providers had to come up with new products for cloud and big data targets. These are add-ons, not part of the main platform.

Another shortcoming that we’ve seen with CDC/Logical Replication solutions is their single node architecture, which translates into a single point of failure. This is a shortcoming, especially for mission-critical systems that need continuous availability of the data integration processes.

Streaming Data Integration

In recent years, streaming data integration came about to address the issues that CDC/Logical Replication products raised. It is becoming increasingly common. With streaming data integration, you’re not limited to just database sources.

Data Movement - Streaming Data Integration

You can have your files, log data, your machine data, your system log files for example, all moving in a real-time fashion. Your cloud sources, your service bus or your messaging systems can be your source. Your sensor data can be moved in real time, in a streaming fashion to multiple targets. Again, not limited to just databases.

You can have cloud databases or other cloud services as your target. In addition to databases, you can have messaging systems as your target, on-premises or in the cloud, and big data solutions, on-premises or in the cloud. You can also deliver in file format.

Everything works as it did in a logical replication solution: it is continuous and in real time, and Change Data Capture is still a big component of streaming data integration.

It’s built on top of the Change Data Capture technologies and brings additional data sources and additional data targets. Another important difference, and handling one of the challenges of logical replication, is the transformation piece. As we discussed, a transformation needs to happen and where it happens makes a big difference.

With streaming data integration, it’s happening in-flight. While the data is moving, you can apply stream processing without adding more latency to your data. While the data is moving, it can be filtered, aggregated, masked and encrypted, and enriched with reference data, all in flight before it’s delivered to your target, so that it’s available in a consumable format. This streamlines and simplifies your architecture, and makes all the recovery steps easier. It also delivers the data in the format that your users need.
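
To make this concrete, here is a minimal sketch of what such an in-flight query could look like, written in the style of the streaming SQL examples elsewhere on this blog. The stream, cache, and column names are hypothetical, and the masking expression is just a stand-in rather than exact Striim syntax.

SELECT o.orderId, o.amount,
       '****-****-****-' + o.cardLast4 AS maskedCard,   -- mask sensitive data in flight
       c.name, c.segment                                -- enrich from an in-memory reference cache
FROM OrderStream o, CustomerCache c
WHERE o.custId = c.id
AND   o.amount > 100                                    -- filter before delivery to the target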

Another important thing to highlight is the distributed architecture. This natively clustered environment helps with the single-point-of-failure risk. When one node fails, the other one takes over immediately, so you have a highly available data pipeline. This distributed, clustered environment also helps you to scale out very easily, adding more servers as you have more data to process and move.

These solutions now come with a monitoring component. Real-time monitoring of the pipelines gives you an understanding of what’s happening with your integration flows. If there is an issue, such as high data latency or a process problem, you get immediate alerts, so you can trust that everything is running.

Data reliability is critical, and so is the reliability of the whole pipeline. To make sure that there is no data loss and no duplicates, some of these solutions include data delivery validation. With the right solution, you can also make sure that everything is processed exactly once, so you are not repeating or dropping data. There are checkpointing mechanisms to be able to do that.

As you see, the new streaming data integration solutions handle some of the challenges that we have seen in the past with outdated data movement technologies. To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

5 Streaming Cloud Integration Use Cases: Whiteboard Wednesdays

Today we’re going to talk about five streaming cloud integration use cases. Streaming cloud integration moves data continuously in real time between heterogeneous databases, with in-flight data processing. Read on, or watch the 9-minute video:

Let’s focus on how to use streaming data integration in cloud initiatives, and the five common scenarios that we see.

Use Case #1 – Online Migration/Cloud Adoption

Let’s start with the first one. It is basically adopting the cloud, or getting to the cloud. When you want to move your data to the cloud, streaming cloud integration helps you with online database migration. You have your legacy database and you want to move it to the cloud. If this is a critical database, you do not want to pause it during this migration; you want it to remain operational to support your business.

Streaming data integration offers Change Data Capture technology. This captures all new transactions, change transactions, as soon as they happen and delivers them to the target. While you’re doing the initial load to the cloud database, you can start Change Data Capture, keep the system open to transactions, and capture all the new transactions happening with the CDC feature.

Once the initial load is done, you can apply the change data to the target so that the two databases are in-sync. Because the source database is open to transactions, you basically have no database downtime. It remains available for users, and once the change data is applied you also have the ability to validate that the two databases are in-sync and that there is no data loss during this migration process. There are tools that provide this validation, so you can be confident there is no data loss during the migration.

Because the cloud database now has up-to-date production data and the legacy one is still functional, you have unlimited time to test the new database. You can control the tests and be comfortable before you point any users or any applications to this cloud database. This unlimited testing minimizes your risks. The time pressure is gone, and you can be comfortable with your move to the cloud.

You also have the ability to perform phased migration. Bi-directional data flow between the legacy system and the cloud system allows you to have users on both sides. You can move some of your users to the cloud database and some of them are still in the legacy.

The streaming cloud integration solution can apply the changes happening in the cloud to the legacy and the changes happening in the legacy back to the cloud, so they stay in-sync. You can gradually move your users to the cloud database when you feel comfortable. Phased migration is another way to minimize your risk of moving your mission-critical systems to the cloud.

Use Case #2 – Hybrid Cloud Architecture for Analytics

We have discussed how to ease your cloud adoption, but once you’re in the cloud and you have adopted a cloud solution, you also need to treat it as part of your data center and build continuous data movement between your existing data sources and the new cloud solution.

Hybrid cloud architecture

We see quite a few cloud analytics solutions, and that’s our second use case. Many organizations these days offload their analytics to cloud solutions. Modern cloud solutions give them tons of new features to modernize their analytics environment and transform their business.

We help with moving all kinds of enterprise data. This can be your databases, your machine data, all kinds of log files (security files and system log files), your existing cloud sources, your messaging systems, and your sensor data. All of them can be moved in real time continuously to your cloud analytics solution.

I would like to add that some streaming cloud integration solutions give you the ability to do in-flight data processing. Transformations happen in-flight so that you deliver the data, without adding latency, to the target system in the consumable format that it needs. You end up having data flowing in the right format for your cloud analytics solution.
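
As a simple illustration, an in-flight query along these lines could reduce a raw machine-log stream to just the events and fields the analytics target needs. This is only a sketch in the style of the streaming SQL examples elsewhere on this blog; the stream and column names are hypothetical.

SELECT logTime, hostName, severity, message
FROM SysLogStream
WHERE severity IN ('ERROR', 'WARN')   -- keep only the events worth analyzing, drop the rest in flight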

The main value from this is that you can now run operational workloads, high-value analytics applications that produce operational value, in your analytics solution. You can influence the operational decisions happening in your business. That will help you gain faster business transformation throughout your enterprise.

Use Case #3 – Building New Applications in the Cloud

Building applications in the cloud

We talked about the analytics use case; here is another similar one. As part of your hybrid cloud architecture, you might be building new applications in the cloud. You still need data coming from your enterprise data sources to your cloud environment. By moving all of this diverse set of data in real time to your cloud messaging systems, cloud databases, or storage solutions, you are able to easily build applications in the cloud.

These modern applications move your business forward because the data is available. You can make better use of these cloud applications if you have this real-time bridge between your existing data center and your new cloud environment. Streaming Integration helps you to move your data so you can quickly build new applications for your business, to help it move forward with more modern solutions.

Use Case #4 – Multi-Cloud Integration

Multi-cloud integration

We also see multi-cloud use cases. A lot of companies now have one cloud solution for one purpose, another cloud solution for another purpose, and are working with multiple vendors. You have the option to feed your data to multiple targets. After you capture it once you can feed it to all kinds of different targets, maybe one of them for analytics and one of them for supporting new applications. You have the ability to distribute your data to multiple cloud solutions.

Use Case #5 – Inter-Cloud Integration

Inter-cloud integration

Similarly, if you’re working with multiple cloud vendors, you will need to connect these solutions with each other. If you have an operational database in one cloud and you have an analytics solution in another cloud, you need to move the data from this cloud solution to the other one in real time, so you can have operational reporting or operational analytics solutions in this cloud.

Streaming cloud integration gives you the agility and the ability to move your data wherever you want. The cloud can easily be part of your data center, seamlessly part of your data infrastructure, when you move your data to that environment.

You can use streaming cloud integration to ease your migration to the cloud and adoption of cloud solutions by minimizing your risk and business disruption. You can also maintain your hybrid cloud architecture and multi-cloud architecture with a continuous data flow from your existing data sources.

To learn more about streaming data integration for your cloud solutions, please visit our Hybrid Cloud Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

Evaluating Streaming Data Integration Platforms

Evaluating Streaming Data Integration Platforms: Whiteboard Wednesdays

In today’s Whiteboard Wednesday video, Steve Wilkes, founder and CTO of Striim, looks at what you need to consider when evaluating streaming data integration platforms. Read on, or watch the 15-minute video:

We’ve already gone through what the components of a streaming integration platform are. Today we’re going to talk about how you go about evaluating streaming data integration platforms based on these components.

Just to reiterate, you need the platform to be able to:

  • Do real-time continuous data collection
  • Move that data continuously from where it’s collected to where it’s going
  • Support delivery to all the different targets that you care about
  • Process the data as it’s moving, so stream processing
  • Be enterprise grade, so that it is scalable and reliable, and all those other things that you care about for mission-critical data
  • Get insights and alerts on that data movement

Let’s think about the things that you need to consider in order to actually achieve this when you’re evaluating such platforms.

Data Collection & Delivery

For data collection and delivery, you care about quite a few different things. Firstly, it needs to be low latency. If it’s a streaming data integration platform, then just doing bulk loads or micro batch may not be sufficient. You want to be able to collect the data the instant it’s created, within milliseconds typically. You need low-latency data collection.

Evaluating Streaming Integration Platforms - Data Collection

It needs to be able to support all the sources that you care about. If you’re looking for a streaming integration platform, then you’re thinking of more than just one use case. You’re thinking “what platform is going to support all of the streaming data integration needs within my organization?” Supporting just one data source or a couple of data sources isn’t enough.

You need to be able to support all the sources that you care about now and may care about in the future. That could be databases, files, or messaging systems. It could even be IoT. So think about that when you’re evaluating whether the platform has all the sources that you need. Think about how it can deal with those sources in a number of different ways.

For databases, you may need to be able to do bulk loads into a streaming infrastructure, as well as doing Change Data Capture. This is important for collecting real-time change as it’s happening in a database, the inserts, updates, and deletes. For files, you may need to do bulk files, files that exist already, but also files as they’re created, streaming out the data as it’s being written. Supporting both bulk and change data is equally important.

You also need to consider whether the adapters are actually part of the platform or whether they are third party. If they are part of the platform and the platform is built well, then they will be able to handle all the different requirements of the platform – scalability, reliability, and recoverability. All of those things are integrated end to end because the adapters are part of the platform.

If they’re third party, then that may not be the case. If you have to plug in third party components into your infrastructure, then you can have areas of brittleness where things may not work properly or problematic interfaces when things change. Try to avoid third party adapters wherever you can.

Data collection and data delivery need to be able to support the end to end recovery and reliability that is part of being enterprise grade. That means that from a database perspective, for example, you may need to be able to support maintaining a database transaction context from one end to the other. You need to be able to pick up from where you left off and make sure that data that is collected is delivered to all of the appropriate targets. These could be variable and different.

You might be delivering some data on-premise and some data to the cloud, but you still need to be able to make sure that all the data has made it there. You need to be able to validate that the data is being written to all the different sources and targets that the platform is supporting.

If it’s part of a platform and they’re not third party, you would expect that to be there. If they are third party, then you have to investigate whether all of those things are supported. Data collection and data delivery is the first part of how you evaluate the platform.

Data Movement

The next part is how does it do data movement? This is crucial to maintaining the kind of high throughput and low latency that you’d expect. Data movement is a number of different things. It’s between processing steps. Between your source collection and your data delivery.

Between source collection, maybe some in-memory processing or some enrichment, and data delivery. Or it could be an even more complex pipeline with multiple steps in it. You’re moving data between each step.

It’s also between nodes. You may have a clustered platform that is moving data between nodes for different processing steps, or maybe between source and target because the target is closer to one of the nodes than the others. You need to be able to ensure that the data movement happens efficiently, with high throughput and low latency, between nodes.

You also need to be able to support collecting data on-premises and delivering it into cloud environments, or collecting it from cloud environments and delivering it on-premises, or moving between clouds. Supporting all these different topologies is all part of data movement.

Ideally as much of the data movement as possible should be in memory only. Try to avoid having to write to disk or do any kind of IO in between processing steps. The reason for this is that each processing step needs to perform optimally in order to get high throughput.

If you are persisting data, that can add latency. Ideally when you’re doing multiple processing steps in a pipeline, you’re doing all of that data movement in memory only, between the steps or just between nodes. You’re not persisting to disk.

You should only use persistent data movement or persistent data streams where needed. There are a couple of really good use cases for this. One is if you have data sources that you can’t rewind into for recoverability, you may want to use a persistent data stream as the first step in the process, but everything downstream can be in memory only.

If you’re collecting data in real time, but you have multiple applications all running at their own speeds against that data, you may want to think about having persistent data streams between different steps. Typically, you want to minimize the amount of persistent data streams that you have and use in-memory only data streams wherever possible. That will really aid in reducing your latency and increasing your throughput.

Stream Processing

The next thing that you need to be able to do is stream processing. Stream processing obviously has to be able to support all of the different types of processing that you want to do. For example, it needs to be able to support complex transformations. If it doesn’t support the transformations that you want, you should be able to add in your own components or your own user defined functions to do the transformations.

It needs to be able to combine and enrich data. This requires a lot of different constructs for stream processing. When you are combining data together from multiple data streams, they run at high speed and typically events aren’t going to happen at the same time.

You need a flexible windowing structure that can maintain a set of events from different data streams to combine together, in order to produce a combined output stream that pairs the current event from the triggering stream with the latest data from each of the other streams.

When you’re enriching data, you need to be able to join streaming data with reference data. You can’t go back to a database or go back to the original source of the reference data for every event on a data stream. It’s just too slow. You need to be able to load, cache, and remember the data you are using for enrichment in memory so you can join it really efficiently, in order to keep and maintain the throughput that you’re looking for from the overall system.

You want the stream processing to be optimized. It should really run as fast as if you’d written it yourself manually. It also needs to be easy to use. We recommend that you look for SQL-based stream processing because SQL is the language of data. There are very few people that work with data that don’t understand SQL. It allows you to do filtering, transformation, and data enrichment through natural SQL constructs.
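
For example, a filter and a couple of transformations read as ordinary SQL. This is only a sketch with hypothetical stream and column names, not exact Striim syntax.

SELECT orderId,
       UPPER(country) AS countryCode,    -- simple in-flight transformation
       price * quantity AS orderTotal    -- derived value computed as each event passes through
FROM OrderStream
WHERE status = 'COMPLETED'               -- filtering with a plain WHERE clause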

Obviously, if you want to do more complex things, you should also be able to import your own transformations and work with those. SQL-based transformations enable anyone who knows data to build and understand what the transformations are. You also want building pipelines to be as accessible as possible to all the people that want to work with the data.

You need to have a good UI for building the data pipelines, with as much of the process as possible automated through wizards and other UI-based assistance. You need to be able to build multi-step stream processing, not just a single source into a single target, or a single source into a single piece of processing into a single target. You potentially need fan-in and fan-out: multiple data sources coming in, going into multiple processing steps in a staged environment, where they go step by step, to potentially multiple targets coming out at the other end.

This all needs to be coordinated, well-maintained, and deployable across a cluster in order to be scalable. Your stream processing should be very rich, very capable, and also very high throughput.

Enterprise Grade

You also need to think about the enterprise-grade qualities of the platform. I’ve mentioned before, for it to be enterprise grade it needs to be scalable. You need to be able to handle increasing the throughput, increasing the number of sources, increasing the number of targets, and increasing the volume of data being generated from each one of those.

When you’re evaluating platforms and evaluating for a production scenario, you should test the platform with a reasonable throughput that corresponds to what you’re expecting in order to see how it behaves and how it scales, and measure the throughput and the latency from end to end as you’re evaluating the platform.

You also need it to be reliable. You need to be able to ensure that you have guaranteed delivery from source all the way to target. Even if something fails, if a network fails, if the source or the target goes down, if any of the processing nodes in the cluster go down or the whole cluster goes down, you need to be able to ensure that it picks up from where it left off and doesn’t miss any messages.

It has to be able to recover from failures as well. You need guaranteed delivery in the normal, “I’m always running” case, so you don’t miss any messages just because they disappeared into the ether somewhere. But also, if you have a failure, you should recover and not lose any messages, not lose any events, on the way from the source to the target.

Of course, security is also paramount. You need to be able to secure the data while it’s moving in transit, so it’s encrypted as it goes across the network. But you also need to be able to secure who has access to the data, who can work with individual data streams, who can see the data on individual data streams, who can build applications, and who can view the results of those applications.

You need security that works across the whole end to end and deals with every single component, so that you can secure them and lock them down and make sure that only the people that need to work with data, can.

Insights & Alerts

Finally, you need to make sure that the platform gives you visibility into your data, that you can monitor the data flows and see what’s going on in real time, that you get alerts when anything happens. This could be when CPU or memory usage on any of the nodes goes above certain criteria. It could be when applications crash, or data flows crash. It could be when volume goes above or below what you expect, and doing that in a granular fashion. For example, when an individual database table goes above or below what you expect.
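
The per-table volume check described above can itself be expressed as a windowed, SQL-style query, in the same spirit as the streaming SQL examples elsewhere on this blog. The window interval, stream, and column names here are illustrative only.

CREATE WINDOW ChangeWindow
OVER ChangeStream
KEEP WITHIN 5 MINUTE
PARTITION BY tableName

SELECT tableName,
       COUNT(*) AS eventCount   -- alert when this rises above or falls below the expected range
FROM ChangeWindow
GROUP BY tableName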

You need to be able to work with insights into the data flows that help you operationalize this and make sure that it’s working full time, 24/7, when you actually put it into production. You may even want to get insights on the data itself, drill down into the actual data that’s flowing, and do some analytics on that. If your streaming integration platform can also give you those valuable insights on the streaming data, then that’s the icing on the cake.

Just to summarize, when you’re evaluating streaming data integration platforms, you need to make sure that the platform can do everything that you need, to get your data from where it’s generated to where it needs to be, in order to get real value out of your data.

To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

Use Cases for Streaming Data Integration

The Top 4 Use Cases for Streaming Data Integration: Whiteboard Wednesdays

Today we are talking about the top four use cases for streaming data integration. If you’re not familiar with streaming data integration, please check out our channel for a deeper dive into the technology. In this 7-minute video, let’s focus on the use cases.

Use Case #1 Cloud Adoption – Online Database Migration

The first one is cloud adoption – specifically, online database migration. When you have your legacy database and you want to move it to the cloud and modernize your data infrastructure, if it’s a critical database, you don’t want to experience downtime. The streaming data integration solution helps with that. When you’re doing an initial load from the legacy system to the cloud, the Change Data Capture (CDC) feature captures all the new transactions happening in the legacy database as they happen. Once the cloud database is loaded and ready, all the changes that happened in the legacy database can be applied in the cloud. During the migration, your legacy system is open for transactions – you don’t have to pause it.

While the migration is happening, CDC helps you to keep these two databases continuously in-sync by moving the real-time data between the systems. Because the system is open to transactions, there is no business interruption. And if this technology is designed for both validating the delivery and checkpointing the systems, you will also not experience any data loss.

Because this cloud database has production data, is open to transactions, and is continuously updated, you can take your time to test it before you move your users. So you have basically unlimited testing time, which helps you minimize your risks during such a major transition. Once the system is completely in-sync and you have checked it and tested it, you can point your applications to it and run on your cloud database.

This is a single switch-over scenario. But streaming data integration gives you the ability to move the data bi-directionally. You can have both systems open to transactions. Once you test this, you can run some of your users in the cloud and some of your users in the legacy database.

All the changes happening with these users can be moved between databases, synchronized so that they’re constantly in-sync. You can gradually move your users to the cloud database to further minimize your risk. Phased migration is a very popular use case, especially for mission-critical systems that cannot tolerate risk and downtime.

Cloud adoption

Use Case #2 Hybrid Cloud Architecture

Once you’re in the cloud and you have a hybrid cloud architecture, you need to maintain it. You need to connect it with the rest of your enterprise. It needs to be a natural extension of your data center. Continuous real-time data movement with streaming data integration allows you to have your cloud databases and services as part of your data center.

The important thing is that these workloads in the cloud can be operational workloads because there’s fresh information (i.e., continuously updated information) available. Your databases, your machine data, your log files, your other cloud sources, messaging systems, and sensors can move continuously to enable operational workloads.

What do we see in hybrid cloud architectures? Heavy use of cloud analytics solutions. If you want operational reporting or operational intelligence, you want comprehensive data delivered continuously so that you can trust it’s up-to-date, and gain operational intelligence from your analytics solutions.

You can also connect your data sources with the messaging systems in the cloud to support event distribution for your new apps that you’re running in the cloud so that they are completely part of your data center. If you’re adopting multi-cloud solutions, you can again connect your new cloud systems with existing cloud systems, or send data to multiple cloud destinations.

Hybrid Cloud Architecture

Use Case #3 Real-Time Modern Applications

A third use case is real-time modern applications. Cloud is a big trend right now, but not everything is necessarily in the cloud. You can have modern applications on-premises. So, if you’re building any real-time app or modern new system that needs timely information, you need to have continuous real-time data pipelines. Streaming data integration enables you to run real-time apps with real-time data.

Use Case #4 Hot Cache

Last, but not least, when you have an in-memory data grid to help with your data retrieval performance, you need to make sure it is continuously up-to-date so that you can rely on that data – it’s something that users can depend on. If the source system is updated, but your cache is not updated, it can create business problems. By continuously moving real-time data using CDC technology, streaming data integration helps you to keep your data grid up-to-date. It can serve as your hot cache to support your business with fresh data.

To learn more about streaming data integration use cases, please visit our Solutions section, schedule a demo with a Striim expert, or download the Striim platform to get started.

How to Migrate Oracle Database to Google Cloud SQL for PostgreSQL with Streaming Data Integration

For those who need to migrate an Oracle database to Google Cloud, the ability to move mission-critical data in real time between on-premises and cloud environments without either database downtime or data loss is paramount. In this video, Alok Pareek, Founder and EVP of Products at Striim, demonstrates how the Striim platform enables Google Cloud users to build streaming data pipelines from their on-premises databases into their Cloud SQL environment with reliability, security, and scalability. The full 8-minute video is available to watch below:

Striim offers an easy-to-use platform that maximizes the value gained from cloud initiatives, including cloud adoption, hybrid cloud data integration, and in-memory stream processing. This demonstration illustrates how Striim feeds real-time data from mission-critical applications, from a variety of on-prem and cloud-based sources, to Google Cloud without interruption of critical business operations.

Oracle database to Google Cloud

Through different interactive views, Striim users can develop Apps to build data pipelines to Google Cloud, create custom Dashboards to visualize their data, and Preview the Source data as it streams to ensure they’re getting the data they need. For this demonstration, Apps is the starting point from which to build the data pipeline.

There are two critical phases in this zero-downtime data migration scenario. The first involves the initial load of data from the on-premises Oracle database into the Cloud SQL Postgres database. The second is the synchronization phase, achieved through specialized readers that keep the source and target consistent.

Oracle database to Google Cloud
Striim Flow Designer

The pipeline from the source to the target is built using a flow designer that easily creates and modifies streaming data pipelines. The data can also be transformed while in motion, to be realigned or delivered in a different format. Through the interface, the properties of the Oracle database can also be configured – allowing users extensive flexibility in how the data is moved.

Once the application is started, the data can be previewed, and progress monitored. While in-motion, data can be filtered, transformed, aggregated, enriched, and analyzed before delivery. With up-to-the-second visibility of the data pipeline, users can quickly and easily verify the ingestion, processing, and delivery of their streaming data.

Oracle database to Google Cloud

During the initial load, the source data in the database is continually changing. Striim keeps the Cloud SQL Postgres database up-to-date with the on-premises Oracle database using change data capture (CDC). By reading the database transactions in the Oracle redo logs, Striim collects the insert, update, and delete operations as soon as the transactions commit, and applies only the changes to the target. This is done without impacting the performance of source systems, while avoiding any outage to the production database.

By generating DML activity using a simulator, the demonstration shows how inserts, updates, and deletes are managed. As DML operations run against the orders table, the preview shows not only the data being captured, but also metadata including the transaction ID, the system commit number, the table name, and the operation type. When you look at the orders table on the target, the data is present in the table.

The initial upload of data from the source to the target, followed by change data capture to ensure source and target remain in-sync, allows businesses to move data from on-premises databases into Google Cloud with the peace of mind that there will be no data loss and no interruption of mission-critical applications.

Additional Resources:

To learn more about Striim’s capabilities to support the data integration requirements for a Google hybrid cloud architecture, check out all of Striim’s solutions for Google Cloud Platform.

To read more about real-time data integration, please visit our Real-Time Data Integration solutions page.

To learn more about how Striim can help you migrate Oracle database to Google Cloud, we invite you to schedule a demo with a Striim technologist.

Continuously Move Data to Snowflake

Enterprises must continuously move data to Snowflake to take full advantage of this data warehouse built for the cloud.

You chose Snowflake to provide rapid insights into your data on a massive scale, on AWS or Azure. However, most of your source data resides elsewhere – in a wide variety of on-premise or cloud sources. How do you continually move data to Snowflake in real-time, processing it along the way, so that your fast analytics and insights are reporting on timely data?

Snowflake was built for the cloud, and built for speed. By separating compute from storage you can easily scale up and down as needed. This gives you instant elasticity supporting any amount of data, and high speed queries for any number of users, coupled with the peace of mind provided by secure data sharing. The per-second pricing and support for multiple clouds allows you to choose your infrastructure and only pay when you are using the data warehouse.

However, residing in the cloud means you have to determine how to most effectively move data to Snowflake. This could be migrating an existing Teradata or Exadata data warehouse, or continually populating Snowflake with newly generated on-premises data from operational databases, logs, or device information. In order for the warehouse to provide up-to-date information, there should be as little latency as possible between the original data creation and its delivery to Snowflake.

The Striim platform can help with all these requirements and more. Our database adapters support change data capture, or CDC, from enterprise or cloud databases. CDC directly intercepts database activity and collects all the inserts, updates, and deletes as they happen, ready to stream into Snowflake. Adapters for machine logs and other files read from the end of multiple files in parallel to stream out data as it is written, removing the inherent latency of batch. Data from devices and messaging systems can be collected easily, independent of its format, through a variety of high-speed adapters and parsers.

After being collected continuously, the streaming data can be delivered directly into Snowflake with very low latency, or pushed through a data pipeline where it can be pre-processed through filtering, transformation, enrichment, and correlation using SQL-based queries, before delivery into Snowflake. This enables such things as data denormalization, change detection, de-duplication, and quality checking before the data is ever stored.

In addition to this, because Striim is an enterprise grade platform, it can scale with Snowflake and reliably guarantee delivery of source data while also providing built-in dashboards and verification of data pipelines for operational monitoring purposes.

The Striim wizard-based UI enables users to rapidly create a new data flow to move data to Snowflake. In this example, real-time change data from Oracle is being continually delivered to Snowflake. The wizard walks you through all the configuration steps, checking that everything is set up properly, and results in a data flow application. This data flow can be enhanced to filter, transform and enrich the data through SQL-based queries. In the video, we add a name and email address from a cache, based on an ID present in the original data.
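
Roughly, that enrichment step might look like the following SQL-based query. The stream, cache, and column names are illustrative rather than the exact objects from the video.

SELECT o.orderId, o.custId, o.amount,
       c.name, c.email            -- added in flight from an in-memory cache keyed on the customer ID
FROM OracleOrderStream o,
     CustomerCache c
WHERE o.custId = c.id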

When the application is started, data flows in real-time from Oracle to Snowflake. Making changes in Oracle results in the transformed data being written continually to Snowflake, visible through the Snowflake UI.

Striim and Snowflake can change the way you do analytics, with Snowflake providing rapid insight to the real-time data provided by Striim. The data warehouse that is built for the cloud needs data delivered to the cloud, and Striim can continuously move data to Snowflake to support your business operations and decision-making.

To learn more about how Striim makes it easy to continuously move data to Snowflake, visit our Striim for Snowflake product page, schedule a demo with a Striim technologist, or download the platform and try it for yourself. 

The Power of Streaming SQL

The Power of Streaming SQL

You’ve heard about streaming integration, the need for stream processing, and often hear the term streaming SQL. But what is streaming SQL, and why is it so essential to real-world real-time solutions?

IBM created the Structured Query Language, or SQL, in the 1970s as a declarative mechanism for working with relational data. It has been used for four decades as a way of creating, modifying and querying data in almost every database on the planet. However, because databases store data before it is available for querying, this data is invariably old.

In the world of real-time data and streaming systems there is also a need to work with data, and Striim chose 5 years ago to use a variant of SQL for stream processing. This streaming SQL looks very much like the static database variant, but needs new constructs to deal with the differences between stored and real-time continuous data.

Database SQL works against an existing set of data and produces a result set. If the data changes, the SQL needs to be run again. Streaming SQL receives a continuous, never-ending flow of data, and continually produces new results as new data arrives.

The simplest things that can be done with this data are filtering and transformation. These operations work event-by-event with every input potentially creating zero or one output.

For example, if we want to limit data moving from one stream to another to a certain location, we could write a simple WHERE clause.

SELECT *
FROM OrderStream
WHERE zip = 94301

And if we want to combine first and last names into full name, we can use concatenation, with other, more complex, functions of course available.

SELECT *,
       FirstName + ' ' + LastName as FullName
FROM OrderStream
WHERE zip = 94301

However, because streaming queries receive events one-by-one, additional constructs are required for aggregate queries that work against a set of data, so windows and event tables need to be introduced.

A window contains a set of events bounded by some criteria. This could be the last 5 minutes worth of data, last 100 events, or hold events until no more arrive within a certain time. Windows can also be partitioned, so the sets are based on the criteria per some data value, for example last 100 actions carried out per customer. Event tables hold the last event that occurred for some key, for example the last temperature reading per room.

Streaming SQL can work against windows and event tables and will output results whenever there is any change. Aggregate queries against windows will recalculate whenever the window is updated, giving running counts, sums over micro-batches, or activity within a session.

For example, to create a running count and sum of purchases per item in the last hour from a stream of orders, you would use a window and the familiar GROUP BY clause.

CREATE WINDOW OrderWindow
OVER OrderStream
KEEP WITHIN 1 HOUR
PARTITION BY itemId
 
SELECT itemId, itemName,
       COUNT(*) as itemCount,
       SUM(price) as totalAmount
FROM OrderWindow
GROUP BY itemId

Enriching data is just as easy; it uses the standard notion of a JOIN. The Striim platform supports all types of joins familiar to database users, including inner, outer, cross, and self-joins through nested queries. Striim enables users to load large amounts of data into in-memory caches and event tables from databases, files, HDFS, and other sources. This can be reference, context, or historical data, and can be updated through the incorporation of CDC.

For example, if we want to enrich the orders stream to include details about customer and location, we can join with reference data loaded into caches from the customer table and location database.

SELECT o.orderid, o.itemname,
       o.custid, o.price, o.quantity,
       c.name, c.age, c.gender, c.zip,
       z.city, z.state, z.country
FROM OrderStream o,
     CustInfo c, ZipInfo z
WHERE o.custid = c.id
AND   c.zip = z.zip

Of course, this just scratches the surface of what can be achieved through Streaming SQL. Production queries can be much more complex, utilizing case statements and even pattern matching syntax.
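
For instance, a CASE expression in a streaming query looks just like its database counterpart; the stream and column names here are hypothetical.

SELECT orderId, price,
       CASE
         WHEN price >= 1000 THEN 'HIGH'    -- bucket each order as it streams past
         WHEN price >= 100  THEN 'MEDIUM'
         ELSE 'LOW'
       END AS priceBand
FROM OrderStream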

To learn more about the power of streaming SQL, visit the Striim Platform Overview product page, schedule a demo with a Striim technologist, or download a free trial of the platform and try it for yourself!

Enabling Real-Time Data Warehousing with Azure SQL Data Warehouse

In this post, we will discuss how to enable real-time data warehousing for modern analytics through streaming integration with Striim to Azure SQL Data Warehouse.

Azure SQL Data Warehouse provides a fully managed, fast, flexible, and scalable cloud analytics platform. It enables massive parallel processing and elasticity working with Azure Data Lake Store and other Azure services to load raw and processed data. However, much of your data may currently reside elsewhere, for example, locked up on-premises, in a variety of clouds, in Oracle Exadata, Teradata, Amazon Redshift, operational databases, and other locations.

A requirement for real-time data warehousing and modern analytics is to continuously integrate data into Azure cloud analytics so that you are always acting on current information. This new hybrid cloud integration strategy must enable the continuous movement of enterprise data – to, from, and between clouds – providing continuous ingestion, storage, preparation, and serving of enterprise data in real time, not batch. Data from on-prem and cloud sources needs to be delivered into multiple Azure endpoints, including a one-time load and continuous change delivery, with in-flight processing to ensure up-to-the-second information for analytics.

Striim is a next-generation streaming integration and intelligence platform that supports your hybrid cloud initiatives, enabling integration with multiple Azure cloud technologies. Please watch the embedded video to see how Striim can provide continuous data integration into Azure SQL Data Warehouse via Azure Data Lake Store through a pipeline for the ingestion, storage, preparation, and serving of enterprise data.

  • Ingest. Striim makes it easy to continuously and non-intrusively ingest all your enterprise data from a variety of sources in real time. In the video example, Striim collects live transactions from an Oracle Exadata orders table.
  • Store. Striim can continuously deliver data to a variety of Azure targets including Azure Data Lake Store. Striim can be used to pre-process your data in real time as it is being delivered into the store to speed downstream activities.
  • Prep & Train. Azure DataBricks uses the data that Striim writes to Azure Data Lake Store for machine learning and transformations. Results can be loaded into Azure SQL Data Warehouse, and the machine learning model could be used by Striim for live scoring.
  • Model & Serve. Striim orchestrates the process to ensure fast, reliable, and scalable poly-based delivery to Azure SQL Data Warehouse from Azure Data Lake Store, enabling analytics applications to always be up-to-date.

See how Striim can enable your hybrid cloud initiatives and accelerate the adoption of Azure SQL Data Warehouse for flexible and scalable cloud analytics. Read more about Striim for Azure SQL Data Warehouse. Get started with Striim now with a trial download on our website, or via Striim’s integration offerings in the Azure Marketplace.

Striim Joins Microsoft, Statistica, Fujitsu, Dell at Hannover Messe

The Striim Team is excited to join with key partners including Microsoft, Statistica, Dell and Fujitsu at Hannover Messe 2017. Through joint demos, presentations and interactive experiences, Striim is showcasing a wide variety of real-time IoT integration and analysis solutions to address the needs of Industrie 4.0.

Taking place April 24-28, 2017 in Hannover, Germany, Hannover Messe is the world’s leading industrial trade show. This year’s lead theme is Integrated Industry, and the show features over 500 Industrie 4.0 solutions. Look for Striim at the Striim + Statistica Booth – Digital Factory, Hall 6, Booth G52.

Participation with Microsoft

We’ve joined with Microsoft to highlight a demo of our integrated solution enabling the continuous exchange and analysis of IoT data across all levels of an IoT infrastructure. This solution, which provides an edge-to-cloud smart data architecture, helps fulfill the Industrie 4.0 promise of enabling industries to be more intelligent, efficient and secure. To learn more, please click on the following links to watch a short video and read the related press release. Or stop by the Microsoft booth in the Digital Factory, Hall 7, Booth C40 to see the demo.

For more information regarding integration of the Striim platform with Microsoft IoT Azure technologies, check out the Striim solutions on the Microsoft Azure Marketplace.

Presentation in Microsoft Booth

Steve Wilkes, founder and CTO of Striim, will present the following session in Microsoft’s booth:

Ensure Manufacturing Quality, Safety and Security
Through Digital Transformation at the Edge
Wednesday, April 27
10:00am local time (GMT+2)
Digital Factory, Hall 7, Microsoft Booth C40

Participation with Microsoft, Statistica, Dell

Striim, Statistica, Microsoft and Dell have joined their IoT hardware and software to enable digital transformation through IoT, integrating machines, devices, sensors and people. A live and interactive demo at the Striim booth will feature a fully functional, model-sized factory floor that provides true-to-life sensor readings and events, feeding an end-to-end solution for real-time data processing, analytics, visualization and statistical analysis/model building. Stop by the Striim + Statistica booth in the Digital Factory, Hall 6, Booth G52 to experience first-hand the relationship between a factory’s IoT systems, and the real-time integration and analysis of the IoT data. To learn more, click here to view a short video.

Participation with Fujitsu

Furthermore, we’ve joined forces with Fujitsu to bring an advanced security appliance for discrete manufacturing companies using IoT edge analytics. The Striim platform, powered by Fujitsu servers, analyzes machine log data with sensor data from physical devices for reliable and timely assessment of any potential breach affecting the factory floor. To learn more about the Striim Fujitsu Cybersecurity Appliance, click here to watch a short video, or stop by the Striim + Statistica booth.

Please reach out if you are interested in scheduling a demo or a briefing during Hannover Messe, or would like additional materials to learn more.

You may also wish to download Gartner’s 2017 Market Guide to In-Memory Computing Technologies, and see why Striim is one of only a few vendors to address 4 out of 5 areas of In-Memory Computing, and the only vendor to do so in a single, end-to-end platform.

Hazelcast Striim Hot Cache

Introducing Hazelcast Striim Hot Cache

Today, we are thrilled to announce the availability of Hazelcast Striim Hot Cache. This joint solution with Hazelcast’s in-memory data grid uses Striim’s Change Data Capture to solve the cache consistency problem.

With Hazelcast Striim Hot Cache, you can reduce the latency of propagation of data from your backend database into your Hazelcast cache to milliseconds. Now you have the flexibility to run multiple applications off a single database, keeping Hazelcast cache refreshes up-to-date while adhering to low latency SLAs.

Check out this 5-minute Introduction and Demo of Hazelcast Striim Hot Cache:

https://www.youtube.com/watch?v=B1PYcIQmya4

Imagine that you have an application that works by retrieving and storing information in a database. To get faster response times, you utilize a Hazelcast in-memory cache for rapid access to data.

However, other applications also make database updates, which leads to inconsistent data in the cache. When this happens, the application is suddenly showing out-of-date or invalid information.

Hazelcast Striim Hot Cache solves this by using streaming change data capture to synchronize the cache with the database in real time. This ensures that both the cache and associated application always have the most up-to-date data.

Through CDC, Striim is able to recognize which tables and key values have changed. Striim immediately captures these changes with their table and key, and, using the Hazelcast Striim writer, pushes those changes into the cache.

We make it easy to leverage Striim’s change data capture functionality by providing CDC Wizards. These Wizards help you quickly configure the capture of change data from enterprise databases – including Oracle, MS SQL Server, MySQL and HPE NonStop – and propagate that data to a Hazelcast cache.

You can also use Striim to facilitate the initial load of the cache.

To learn more, please read the full press release, visit the Hazelcast Striim Hot Cache product page, or jump right in and download a fully loaded evaluation copy of Striim for Hazelcast Hot Cache.

Make Your Data Strategy Work Through Streaming

Make Your Data Strategy Work Through Streaming

Data strategy transcends all use cases. Are you tasked with enhancing customer experience or ensuring SLAs? Do you monitor infrastructure, equipment, or replication between databases? Do you need to detect fraud or security issues in real time? Or does your business generate extreme volumes of IoT data?

There is a general approach for all of these use cases which Steve Wilkes, Striim Founder and CTO, covered in his recent Strata+Hadoop World presentation.
https://youtu.be/c8JsXX909_o

In general, you need current, accurate and complete data in order to make sound decisions. If there is a single takeaway from Steve’s talk, he’d like people to “Think Streaming First!” because if you go batch, you can’t go back. Don’t boil the ocean and try to cover everything in one go. And please reconsider dumping all your data into a data lake, because that’s a recipe for failure.

If you plan ahead and do it methodically – taking one stream at a time no matter what use case you are trying to solve – you will be able to reap the benefits of streaming data. Here are the five steps Steve recommends to make your data strategy work through streaming.

Step 0 – One Use-Case

There’s a business reason for all use cases. People need to determine if the benefits to the business outweigh how much it will cost to implement. Next, if results are generated, is the business in a position to actually act on those results? The business needs to determine if this is strategically important for growing and improving. Lastly, the company needs to determine if it’s even achievable, or just a pie-in-the-sky dream. All of these considerations, some basic rules, and software are required for it all to work smoothly.

Before you begin, take time to determine what you want to achieve from your data strategy, and what questions you want answered with data. Ask yourself what form the answers to those questions are going to take. From there, you should be able to work out what sources the answers can possibly come from, and how you are going to get that data. Finally, you can consider how you will process, manipulate, enhance, and enrich the data in order to give you the information you need to fulfill your use case.

Make Your Data Strategy Work Through Streaming

Step 1 – Collect Streaming Data

Once you finalize the business side from step 0 and get the technology set up, it’s time to think about streaming data. You can’t think about streaming data after the fact. You can’t do batch processes, then stream data. You can, however, stream data and then do batches.

Data collection can come from various sources: message queues, sensors, log files, Change Data Capture (CDC), etc. With sensors, you can't possibly process and store all of the data being generated in a central location, so you must do edge processing. Log files typically arrive in batches, and waiting hours for them means the data is already outdated; instead, you must stream all the log data in real time, in parallel, across all of your machines.
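As a concrete illustration of streaming log data as it is written, rather than collecting it in hourly batches, here is a minimal tail-style sketch; the file path is just an example and the hand-off is a plain print.

```python
import time

def tail(path):
    """Continuously yield new lines appended to a log file (a simplified tail -f)."""
    with open(path, "r") as f:
        f.seek(0, 2)                      # start at the end: only new events matter
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")   # emit each event as soon as it is written
            else:
                time.sleep(0.5)           # briefly wait for more data

for event in tail("/var/log/app/server.log"):   # example path
    print(event)                                # hand the event to the processing step
```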

Lastly, databases aren't inherently streaming, and you can't simply run SQL against production databases to pull data out. DBAs won't allow you to run massive table scans, so you have to use Change Data Capture. CDC is approved and recommended by most DBAs as a way of getting enterprise data out of production databases. They use it all the time, behind the scenes, for creating database replicas or delivering data into an operational data store (ODS). The same is true for getting data from your databases into a duplicate Hadoop, Kafka, or Cloud environment: you must use CDC for this.

Collect Streaming Data

Step 2 – Process

Once your data is streaming, how do you shape it into the right form in order to answer the questions you had from step 0?

The first thing you can do is filter out any unnecessary data. You can manipulate that data and transform it into a format that is easily queryable. This will depend on your use case, where you are landing the data, and what formats you need for queries.

You may need to aggregate data or remove redundancy. This is where viewing trends over time is valuable. IoT data and devices can send you lots of similar data all the time. An example is a Nest thermostat sending the same temperature every second for an hour. That's 3,600 data points with only one piece of information. You don't need all that data, which is where edge processing can help.
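To make the thermostat example concrete, here is a minimal deduplication sketch: it forwards a reading only when the value actually changes, collapsing an hour of identical per-second data points into one. The reading format is an assumption.

```python
def dedupe_readings(readings):
    """Drop consecutive readings whose value has not changed."""
    last = None
    for reading in readings:          # reading: (device_id, temperature)
        if reading != last:
            yield reading             # the value changed, so it is worth keeping
            last = reading

# One hour of per-second readings containing a single temperature change.
raw = [("nest-1", 21.0)] * 1800 + [("nest-1", 21.5)] * 1800
print(list(dedupe_readings(raw)))     # -> [('nest-1', 21.0), ('nest-1', 21.5)]
```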

Add Process

Step 3 – Deliver As Use-Case Dictates

Next is delivering the data, which really depends on your use case. You may need to deliver into a database or file system. Many people move data onto Kafka and utilize it as an enterprise data bus. Maybe you are delivering into the Cloud and you need elastic storage and scalability. Or maybe you are delivering into a data lake.
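For instance, delivering a processed event onto Kafka might look like the sketch below, assuming the kafka-python client and a local broker; the topic name and event fields are illustrative.

```python
import json
from kafka import KafkaProducer   # pip install kafka-python

# Connect to the Kafka cluster acting as the enterprise data bus
# (the broker address and topic name are examples).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

event = {"device": "nest-1", "temperature": 21.5}
producer.send("processed-events", event)   # deliver the processed event downstream
producer.flush()
```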

Deliver As Use-Case Dictates

Step 4 – Add Context

Oftentimes, data that's processed and delivered in step #3 doesn't contain enough context to make decisions. This is where it is important to enrich the data to make it more queryable.

An example of this is a normalized database with tables that you are trying to query. Imagine using Change Data Capture and putting all of that into Hadoop. What you're going to get is a lot of IDs and foreign keys. If those land in your data lake, there's not much value in querying IDs, and you can't go back and ask important questions. You would have to write enormous queries just to reassemble the context.

Instead, as your data is moving through your data streams, you enrich it with external data to give it the context that you need. That’s the only way you can start answering some of these questions you started with.
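A minimal sketch of that kind of in-stream enrichment, assuming reference data held in memory; the table and field names are made up for illustration.

```python
# Reference data cached in memory (in practice loaded from a dimension table or cache).
customers = {
    42: {"customer_name": "Acme Corp", "segment": "Enterprise"},
}

def enrich(events):
    """Replace bare foreign keys with queryable context before landing the data."""
    for event in events:
        context = customers.get(event["customer_id"], {})
        yield {**event, **context}   # the enriched event carries its own context

orders = [{"order_id": 9001, "customer_id": 42, "amount": 125.0}]
print(list(enrich(orders)))
# -> [{'order_id': 9001, 'customer_id': 42, 'amount': 125.0,
#      'customer_name': 'Acme Corp', 'segment': 'Enterprise'}]
```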

Add Context

It's really important to think about streaming as part of your data strategy. If you start with batch, you will not be able to move to real-time queries later.

Lastly, it's recommended to build analytics into the process. You can search for patterns, find anomalies, correlate time and space patterns, alert on issues, visualize and analyze, and trigger business workflows.
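As a toy example of the anomaly-detection side of this, the sketch below flags a value that sits far above the recent average of the stream; the window size and threshold are arbitrary assumptions.

```python
from collections import deque

def detect_anomalies(values, window=20, threshold=3.0):
    """Yield (index, value) pairs that sit far above the recent running average."""
    recent = deque(maxlen=window)
    for i, value in enumerate(values):
        if len(recent) == window:
            average = sum(recent) / window
            if value > threshold * average:
                yield i, value    # anomaly: could raise an alert or trigger a workflow
        recent.append(value)

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
            10, 9, 11, 10, 12, 10, 11, 10, 9, 10, 95]
print(list(detect_anomalies(readings)))   # -> [(20, 95)]
```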

Build Analytics on Streaming Integration

For more information, please visit the following resources:

Change Data Capture – capturing change in production databases for analysis and visualization

Replication Monitoring – providing visibility into replication lag and in-flight transactions

SLA Monitoring – proactively act before infringements occur

Security & Risk – leveraging streaming data to prevent fraud and minimize exposure

IoT Monitoring – real-time analysis of sensor data in context

You may also wish to take a look at the Striim Overview data sheet, request a demo with one of our lead technologists, or download the Striim platform.

Real-Time Financial Transaction Monitoring

Financial Monitoring Application

Building complex, financial transaction monitoring applications used to be a time-consuming task. Once you had the business case worked out, you needed to work with a team of analysts, DBAs and engineers to design the system, source the data, build, test, and rollout the software. Typically it wouldn’t be correct the first time, so rinse and repeat.

Not so with Striim. In this video you will see a financial transaction monitoring application that was built and deployed in four days. The main use case is to spot increases in the rate at which customer transactions are declined, and alert on that. But a whole host of additional monitoring capabilities were also built into the application. Increasing decline rates often indicate issues with the underlying ATM and Point of Sale networks, and need to be resolved quickly to prevent potential penalties and a drop in customer satisfaction.

The application consists of a real-time streaming dashboard, with multiple drill-downs, coupled with a continuous back-end dataflow that is performing the analytics, driving the dashboard and generating alerts. Streaming data is sourced in real time from a SQL Server database using Change Data Capture (CDC), and used to drive a number of analytics pipelines.

The processing logic is all implemented using in-memory continuous queries written in our easy-to-use, SQL-like language, and the entire application was built using our UI and dashboard builder. The CDC data first goes through some initial data preparation, and is then fed into parallel processing flows. Each flow analyzes the data in a different way and stores the results in our built-in results store to facilitate deeper analysis later.
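The heart of that decline-rate analytic can be sketched in a few lines. This is not the application's actual continuous-query code (which is written in Striim's SQL-like language); the window length, the 5% threshold, and the field meanings are assumptions for illustration.

```python
from collections import deque
import time

WINDOW_SECONDS = 300     # look at the last five minutes of transactions (assumption)
ALERT_THRESHOLD = 0.05   # alert if more than 5% of transactions are declined (assumption)

window = deque()         # (timestamp, declined) pairs currently inside the window

def on_transaction(declined, now=None):
    """Add a transaction to the sliding window and alert on a high decline rate."""
    now = time.time() if now is None else now
    window.append((now, declined))
    while window and window[0][0] < now - WINDOW_SECONDS:
        window.popleft()                     # expire transactions older than the window
    rate = sum(1 for _, d in window if d) / len(window)
    if rate > ALERT_THRESHOLD:
        print(f"ALERT: decline rate {rate:.1%} over the last {WINDOW_SECONDS}s")

# Example: a run of approvals followed by a burst of declines triggers the alert.
for declined in [False] * 50 + [True] * 5:
    on_transaction(declined)
```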

If you want to learn how to build complex monitoring and analytics applications quickly, take 6 minutes to watch this video.

 

ATM Remote Device Monitoring with Predictive Analytics

Watch this video to learn how Striim can monitor ATM components in order to avoid downtime and keep customers happy. This application is a specific example of the remote device monitoring solution developed by Striim, which alerts technicians to possible machine outages and cash shortages. Key ATM components are monitored in real time, including CPU, printer, card reader, temperature, and cash balances. The app also acquires streaming data about current ATM transactions, along with historical data for comparison against past behavior. With visibility into these data streams you can predict ATM component failures and service your machines as needed, rather than on a fixed schedule. The application includes:

  • a summary that shows an overview of all locations being monitored on a global level
  • a location page that shows machines on a local level
  • an ATM page highlighting activity, issues, and component metric prediction charts – showing data collected in 20-minute windows and providing predictions for the next 10-minute window

The Striim Platform processes all types of data in innovative ways so users can react to diverse sets of information in massive volumes and maximize the value of big data within their enterprise. The ATM Component Monitor App efficiently acquires and processes streaming data in-memory on commodity hardware, enabling instant alerts and real-time visualizations of your ATM needs. The same approach can be used for other applications, such as monitoring slot machines.
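The video does not spell out how the 10-minute predictions mentioned above are computed; as a rough illustration only, the sketch below fits a straight-line trend to a 20-minute window of per-minute samples and extrapolates it forward. The metric and numbers are invented.

```python
def predict_next(samples, horizon=10):
    """Fit a least-squares line to per-minute samples and extrapolate horizon minutes ahead."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n + i) for i in range(horizon)]

# Twenty minutes of ATM cash-balance readings trending downward (illustrative numbers).
cash = [50000 - 900 * minute for minute in range(20)]
print(predict_next(cash))   # projected balances for the next ten minutes
```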

 

Top 10 Challenges in Delivering High-Velocity Big Data Analytics

In 10 minutes, Steve Wilkes, WebAction Cofounder and CTO, tackles what is needed to overcome the Top 10 Challenges in Delivering High-Velocity Big Data Analytics. Watch the video from the Strata+Hadoop conference in San Jose, CA to learn about the challenges and solutions in using stream analytics to get the most from your Big Data.

“It’s good if you can see what’s happening, it’s better if you can see what’s happening in realtime – as it happens and get deep insight…to investigate your data as it happens,” Steve Wilkes, WebAction Cofounder and CTO.

10. Big Data Plays Both Fast and Loose
9. Timing is Everything
8. Data Waits For No-one*
7. Failure is Not an Option
6. Security is Paramount
5. Scale-Out is the New Scale-up
4. You Need to be Alert
3. Don’t Dis “Integrate!”
2. Requirements Change as Fast as Data

and the #1 Challenge in Delivering High-Velocity Big Data Analytics is…

1. A Moving Picture is worth 1M Data Points

Streaming Multi-log Correlation Using the Striim Platform

Learn how the Striim Multi-log micro application correlates interesting events across multiple data streams in real time. This video is a high-level demonstration of the Multi-log app correlating two live data streams, a web server log and an app server log. The Multi-log application can be extended to continuously correlate all structured, semi-structured, and transactional data sources allowing businesses to watch events and add context as they unfold.

https://youtu.be/NlAn0m_VRzU
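Conceptually, correlating a web server log with an app server log boils down to joining events on a shared key. Here is a minimal batch-style sketch of that join (in the product this happens continuously over live streams); the session field and event shapes are assumed.

```python
from collections import defaultdict

# Sample events from the two streams, keyed by a shared session id (fields are illustrative).
web_log = [{"session": "s1", "url": "/checkout", "status": 500}]
app_log = [{"session": "s1", "error": "PaymentTimeout"}]

def correlate(web_events, app_events):
    """Join web-server and app-server events that share a session id."""
    by_session = defaultdict(dict)
    for event in web_events:
        by_session[event["session"]]["web"] = event
    for event in app_events:
        by_session[event["session"]]["app"] = event
    # Emit only sessions seen in both streams - the interesting, correlated events.
    return [pair for pair in by_session.values() if "web" in pair and "app" in pair]

print(correlate(web_log, app_log))
```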

Multi-log Correlation Real-time Use Cases

  • VIP activity identification
  • cross log correlation
  • user activity enrichment
  • hack attempts
  • blacklist cross checks
  • large response times
  • zero content check
  • stream enrichment
  • real-time contextual marketing offers