Advancement of Data Movement Technologies

Advancement of Data Movement Technologies: Whiteboard Wednesdays

 

In this Whiteboard Wednesday video, Irem Radzik, Head of Product Marketing at Striim, looks at how data movement technologies have evolved in response to changing user demands. Read on, or watch the 8-minute video:

Today we’re going to talk about the advancement of data movement technologies. We’re going to look at the ETL technologies that we started seeing in the ‘90s, then the CDC (Change Data Capture)/Logical Replication solutions that we started seeing a couple of decades ago, and then the streaming data integration solutions that we more commonly see today.

ETL

Let’s look at ETL technologies. ETL is known for its batch extract, then bringing the data into the transformation step in the middle-tier server, and then loading the target in bulk again, typically for next-day reporting. You end up having high latency with these types of solutions. That was good enough for the ‘90s, but then we started demanding fresher data for operational decision making. Latency became an issue with ETL solutions.

Data Movement - ETL

The other issue with ETL was the batch-window dependency. Because of the high impact on the production sources, there had to be a dedicated time for these batch extracts when the main users wouldn’t be able to use the production database. The batch window that was available for data extract became shorter and shorter as business demanded continuous access to the OLTP system.

The data volumes increased at the same time. You ended up not having enough time to move all the data you needed. That became a pain point for ETL users, driving them to look into other solutions.

Change Data Capture/Logical Replication

Change Data Capture/Logical Replication solutions addressed several of the key concerns that ETL had. Change Data Capture basically means that you continuously capture new transactions happening in the source database and deliver them to the target in real time.

Data Movement - CDC / Logical Replication

That obviously helps with the data latency problem. You end up having real-time, up-to-date data in the target for your operational decision making. The other plus of CDC is the source impact.

When it’s using database logs to capture the data, the impact on the source is negligible. The source production system remains available for transaction users. There is no batch window needed and no limit on how much time you have to extract and move the data.
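To make the mechanics a bit more concrete, here is a minimal, illustrative Python sketch of what a log-based change event and a capture loop could look like. The event fields and the read_committed_changes function are hypothetical stand-ins, not any particular product’s or database’s CDC API.

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class ChangeEvent:
    # Hypothetical shape of a log-based CDC record.
    position: int          # position in the database's transaction log
    table: str
    operation: str         # "INSERT", "UPDATE", or "DELETE"
    row: dict

# A stand-in for the database transaction log; a real reader would
# follow the redo/WAL log instead of an in-memory list.
TRANSACTION_LOG = [
    ChangeEvent(1, "orders", "INSERT", {"id": 101, "status": "NEW"}),
    ChangeEvent(2, "orders", "UPDATE", {"id": 101, "status": "SHIPPED"}),
]

def read_committed_changes(after_position: int) -> Iterator[ChangeEvent]:
    """Yield committed changes recorded after a given log position."""
    for event in TRANSACTION_LOG:
        if event.position > after_position:
            yield event

def deliver(event: ChangeEvent) -> None:
    # In a real pipeline this would write to the target in real time.
    print(f"deliver {event.operation} on {event.table}: {event.row}")

if __name__ == "__main__":
    for change in read_committed_changes(after_position=0):
        deliver(change)
```

The key point is that the reader follows the transaction log rather than querying the tables themselves, which is why the impact on the production source stays negligible.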

The CDC/Logical Replication solutions handle some of the key concerns of ETL users. They are made more for the E and L steps. What ends up happening with these solutions is that you need to do transformations within the database or with another tool, in order to complete the transformation step for end users.

The transformation happening there creates an ELT architecture and requires another product, another step, another network hop in your architecture, which complicates the process.

When there’s an outage, when there is a process disruption, reconciling your data and recovering becomes more complicated. That’s the shortcoming CDC users have been facing. These solutions were mainly made for databases.

Once the cloud and big data solutions became popular, the CDC providers had to come up with new products for cloud and big data targets. These are add-ons, not part of the main platform.

Another shortcoming that we’ve seen with CDC/Logical Replication solutions is their single node architecture, which translates into a single point of failure. This is a shortcoming, especially for mission-critical systems that need continuous availability of the data integration processes.

Streaming Data Integration

In recent years, streaming data integration came about to address the issues that CDC/Logical Replication products raised. It is becoming increasingly common. With streaming data integration, you’re not limited to just database sources.

Data Movement - Streaming Data Integration

You can have your files, log data, your machine data, your system log files for example, all moving in a real-time fashion. Your cloud sources, your service bus or your messaging systems can be your source. Your sensor data can be moved in real time, in a streaming fashion to multiple targets. Again, not limited to just databases.

You can have cloud databases or other cloud services as your target. You can, in addition to databases, have messaging systems as your target, on-premises or in the cloud, and your big data solutions, on-premises or in the cloud. You can also deliver in file format.

Everything works as it did in a logical replication solution: it is continuous, in real time, and Change Data Capture is still a big component of streaming data integration.

It’s built on top of the Change Data Capture technologies and brings additional data sources and additional data targets. Another important difference, and handling one of the challenges of logical replication, is the transformation piece. As we discussed, a transformation needs to happen and where it happens makes a big difference.

With streaming data integration, it’s happening in-flight. While the data is moving, you can have stream processing without adding more latency to your data. While the data is moving, it can be filtered, it can be aggregated, it can be masked and encrypted, and enriched with reference data, all in flight before it’s delivered to your target, so that it’s available in a consumable format. This streamlines your architecture, simplifies it, and makes all the recovery steps easier. It’s also delivering the data in the format that your users need.
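As a rough illustration of what in-flight processing means, the following Python sketch chains filtering, masking, and enrichment as generator stages, so each event is transformed while it moves rather than being landed and reprocessed later. The field names and reference data are invented for the example.

```python
import hashlib
from typing import Iterable, Iterator

REFERENCE = {"US": "United States", "DE": "Germany"}  # illustrative lookup data

def filter_events(events: Iterable[dict]) -> Iterator[dict]:
    # Keep only the events the target actually needs.
    return (e for e in events if e.get("amount", 0) > 0)

def mask(events: Iterable[dict]) -> Iterator[dict]:
    # Replace a sensitive field with a one-way hash before delivery.
    for e in events:
        e = dict(e)
        e["card_number"] = hashlib.sha256(e["card_number"].encode()).hexdigest()[:12]
        yield e

def enrich(events: Iterable[dict]) -> Iterator[dict]:
    # Join each event with cached reference data while it is moving.
    for e in events:
        e = dict(e)
        e["country_name"] = REFERENCE.get(e["country"], "unknown")
        yield e

if __name__ == "__main__":
    stream = [
        {"card_number": "4111111111111111", "amount": 25.0, "country": "US"},
        {"card_number": "4222222222222222", "amount": 0.0, "country": "DE"},
    ]
    for event in enrich(mask(filter_events(stream))):
        print(event)  # delivered to the target in a consumable format
```

Because each stage hands events directly to the next in memory, the transformations add no extra persistence step to the pipeline.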

Another important thing to highlight is the distributed architecture. This natively clustered environment helps with the single-point-of-failure risk. When one node fails, another one takes over immediately, so you have a highly available data pipeline. This distributed, clustered environment also helps you scale out very easily, adding more servers as you have more data to process and move.

These solutions now come with a monitoring component. The real-time monitoring of the pipelines gives you an understanding of what’s happening with your integration flows. If there is an issue, if there is high data latency or a process issue, you get immediate alerts so you can trust that everything is running.

Data reliability is critical, and so is the reliability of the whole pipeline. To make sure that there is no data loss or duplicates, data delivery validation can be included in some of these solutions. With the right solution, you can also make sure that everything is processed exactly once, and that you are not repeating or dropping data. There are checkpointing mechanisms to be able to do that.
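One common way to get that behavior is to checkpoint the last position that was successfully delivered and, on restart, skip anything at or before it. The sketch below is a simplified, assumed design to show the idea; it is not any specific product’s recovery mechanism.

```python
import json
import os

CHECKPOINT_FILE = "pipeline.checkpoint"  # hypothetical location

def load_checkpoint() -> int:
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["last_delivered_position"]
    return 0

def save_checkpoint(position: int) -> None:
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_delivered_position": position}, f)

def deliver_to_target(event: dict) -> None:
    print("delivered", event)

def process(events: list[dict]) -> None:
    last_done = load_checkpoint()
    for event in events:
        if event["position"] <= last_done:
            continue                      # already delivered before the failure
        deliver_to_target(event)          # must succeed before the checkpoint moves
        save_checkpoint(event["position"])

if __name__ == "__main__":
    process([{"position": 1, "data": "a"}, {"position": 2, "data": "b"}])
```

Moving the checkpoint only after delivery succeeds prevents dropped events, and skipping anything at or below the checkpoint prevents duplicates on restart.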

As you see, the new streaming data integration solutions handle some of the challenges that we have seen in the past with outdated data movement technologies. To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

 

5 Streaming Cloud Integration Use Cases: Whiteboard Wednesdays

 

 

Today we’re going to talk about five streaming cloud integration use cases. Streaming cloud integration moves data continuously in real time between heterogeneous databases, with in-flight data processing. Read on, or watch the 9-minute video:

Let’s focus on how to use streaming data integration in cloud initiatives, and the five common scenarios that we see.

Use Case #1 – Online Migration/Cloud Adoption

Let’s start with the first one. It is basically adopting the cloud, or getting to the cloud. When you want to move your data to the cloud, streaming cloud integration helps you with online database migration. You have your legacy database and you want to move it to the cloud. If this is a critical database, you do not want to pause it during this migration; you want it to stay operational to support your business.

Streaming data integration offers Change Data Capture technology. This captures all new transactions, change transactions, as soon as they happen and delivers them to the target. While you’re doing the initial load to the cloud database, you can start Change Data Capture, keep the system open to transactions, and capture all the new transactions happening with the CDC feature.

Once the initial load is done, you can apply the change data to the target system so that the two are in-sync. Because the source database is open to transactions, you basically have no database downtime. It remains available for users, and once the change data is applied you also have the ability to validate that the two databases are in-sync and that there is no data loss during the migration process. There are tools that provide this validation.
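A toy Python sketch of that sequencing, using plain dictionaries as stand-in “databases”: change capture starts first, the initial load runs while the source stays open, and the buffered changes are applied afterward so the two copies converge. Real migration tools also handle transactional consistency, conflict resolution, and delivery validation; this only shows the ordering.

```python
# Toy model of an online migration: dicts stand in for the two databases.
legacy = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
cloud = {}
captured_changes = []   # changes recorded by CDC while the initial load runs

def capture_change(op, key, row=None):
    captured_changes.append((op, key, row))

def initial_load():
    for key, row in list(legacy.items()):
        cloud[key] = dict(row)

def apply_captured_changes():
    for op, key, row in captured_changes:
        if op == "DELETE":
            cloud.pop(key, None)
        else:                       # INSERT or UPDATE
            cloud[key] = dict(row)

# 1. CDC starts first, so nothing that happens during the load is lost.
# 2. The legacy database stays open: a transaction arrives before the load finishes.
legacy[3] = {"name": "Edsger"}
capture_change("INSERT", 3, legacy[3])

initial_load()
apply_captured_changes()

# 3. Simple validation: both copies now hold the same rows.
assert cloud == legacy
print("in sync:", cloud)
```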

Because this database has production data and is up-to-date and the other one is still functional, you have unlimited time to test the new database. You can control the tests and be comfortable before you point any users or any applications to this cloud database. This unlimited testing minimizes your risks. The time pressure is gone, and you can be comfortable with your move to the cloud.

You also have the ability to perform phased migration. Bi-directional data flow between the legacy system and the cloud system allows you to have users on both sides. You can move some of your users to the cloud database while the rest are still on the legacy system.

The streaming cloud integration solution can apply the changes happening in the cloud to the legacy and the changes happening in the legacy back to the cloud, so they stay in-sync. You can gradually move your users to the cloud database when you feel comfortable. Phased migration is another way to minimize your risk of moving your mission-critical systems to the cloud.

Use Case #2 – Hybrid Cloud Architecture for Analytics

We have discussed how to ease your cloud adoption, but once you’re in the cloud and you have adopted a cloud solution, you also need to treat it as part of your data center and build continuous data movement between your existing data sources and the new cloud solution.

Hybrid cloud architecture

We see quite a few cloud analytics solutions, and that’s our second use case. Many organizations these days offload their analytics to cloud solutions. Modern cloud solutions give them tons of new features to modernize their analytics environment and transform their business.

We help with moving all kinds of enterprise data. This can be your databases, your machine data, all kinds of log files (security files and system log files), your existing cloud sources, your messaging systems, and your sensor data. All of them can be moved in real time continuously to your cloud analytics solution.

I would like to add that some streaming cloud integration solutions give you the ability to do in-flight data processing. Transformations happen in-flight so that you deliver the data, without adding latency, to the target system in a consumable format that it needs. You end up having data flowing in the right format for your cloud analytics solution.

The main value is that you can now run operational workloads, analytics applications that produce high operational value, in your cloud analytics solution. You can influence the operational decisions happening in your business. That will help you achieve faster business transformation throughout your enterprise.

Use Case #3 – Building New Applications in the Cloud

Building applications in the cloud

We talked about the analytics use case; here is another similar one. As part of your hybrid cloud architecture, you might be building new applications in the cloud. You still need data coming from your enterprise data sources to your cloud environment. By moving this diverse set of data in real time to your cloud messaging systems, cloud databases, or storage solutions, you are able to easily build applications in the cloud.

These modern applications move your business forward because the data is available. You can make better use of these cloud applications if you have this real-time bridge between your existing data center and your new cloud environment. Streaming Integration helps you to move your data so you can quickly build new applications for your business, to help it move forward with more modern solutions.

Use Case #4 – Multi-Cloud Integration

Multi-cloud integration

We also see multi-cloud use cases. A lot of companies now have one cloud solution for one purpose, another cloud solution for another purpose, and are working with multiple vendors. You have the option to feed your data to multiple targets. After you capture it once you can feed it to all kinds of different targets, maybe one of them for analytics and one of them for supporting new applications. You have the ability to distribute your data to multiple cloud solutions.
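Conceptually this is a fan-out: each event is captured once and then handed to every configured target. A minimal sketch, with placeholder writer functions standing in for real cloud targets:

```python
from typing import Callable

def to_cloud_warehouse(event: dict) -> None:
    print("warehouse <-", event)      # placeholder for an analytics target

def to_messaging_system(event: dict) -> None:
    print("messaging <-", event)      # placeholder for an app-facing target

TARGETS: list[Callable[[dict], None]] = [to_cloud_warehouse, to_messaging_system]

def fan_out(events: list[dict]) -> None:
    # Capture once, deliver to every configured cloud target.
    for event in events:
        for write in TARGETS:
            write(event)

if __name__ == "__main__":
    fan_out([{"table": "orders", "op": "INSERT", "row": {"id": 7}}])
```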

Use Case #5 – Inter-Cloud Integration

Inter-cloud integration

Similarly, if you’re working with multiple cloud vendors, you will need to connect these solutions with each other. If you have an operational database in one cloud and you have an analytics solution in another cloud, you need to move the data from this cloud solution to the other one in real time, so you can have operational reporting or operational analytics solutions in this cloud.

Streaming cloud integration gives you the agility and the ability to move your data wherever you want. The cloud can easily be part of your data center, seamlessly part of your data infrastructure, once you move your data to that environment.

You can use streaming cloud integration to ease your migration to the cloud and adoption of cloud solutions by minimizing your risk and business disruption. You can also maintain your hybrid cloud architecture and multi-cloud architecture with a continuous data flow from your existing data sources.

To learn more about streaming data integration for your cloud solutions, please visit our Hybrid Cloud Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

 

Evaluating Streaming Data Integration Platforms

Evaluating Streaming Data Integration Platforms: Whiteboard Wednesdays

 

 

In today’s Whiteboard Wednesday video, Steve Wilkes, founder and CTO of Striim, looks at what you need to consider when evaluating streaming data integration platforms. Read on, or watch the 15-minute video:

We’ve already gone through what the components of a streaming integration platform are. Today we’re going to talk about how you go about evaluating streaming data integration platforms based on these components.

Just to reiterate, you need the platform to be able to:

  • Do real-time continuous data collection
  • Move that data continuously from where it’s collected to where it’s going
  • Support delivery to all the different targets that you care about
  • Process the data as it’s moving (stream processing)
  • Be enterprise grade, so that it is scalable and reliable, and covers all those other things that you care about for mission-critical data
  • Get insights and alerts on that data movement

Let’s think about the things that you need to consider in order to actually achieve this when you’re evaluating such platforms.

Data Collection & Delivery

For data collection and delivery, you care about quite a few different things. Firstly, it needs to be low latency. If it’s a streaming data integration platform, then just doing bulk loads or micro batch may not be sufficient. You want to be able to collect the data the instant it’s created, within milliseconds typically. You need low-latency data collection.

Evaluating Streaming Integration Platforms - Data Collection

It needs to be able to support all the sources that you care about. If you’re looking for a streaming integration platform, then you’re thinking of more than just one use case. You’re thinking “what platform is going to support all of the streaming data integration needs within my organization?” Supporting just one data source or a couple of data sources isn’t enough.

You need to be able to support all the sources that you care about now and may care about in the future. That could be databases, files, or messaging systems. It could even be IoT. So think about that when you’re evaluating whether the platform has all the sources that you need. Think about how it can deal with those sources in a number of different ways.

For databases, you may need to be able to do bulk loads into a streaming infrastructure, as well as doing Change Data Capture. This is important for collecting real-time change as it’s happening in a database, the inserts, updates, and deletes. For files, you may need to do bulk files, files that exist already, but also files as they’re created, streaming out the data as it’s being written. Supporting both bulk and change data is equally important.

You also need to consider whether the adapters are actually part of the platform or whether they are third party. If they are part of the platform and the platform is built well, then it means that they will be able to handle all the different requirements of the platform – scalability, reliability, and recoverability. All of those things are integrated end to end because the adapters are part of the platform.

If they’re third party, then that may not be the case. If you have to plug in third party components into your infrastructure, then you can have areas of brittleness where things may not work properly or problematic interfaces when things change. Try to avoid third party adapters wherever you can.

Data collection and data delivery need to be able to support the end to end recovery and reliability that is part of being enterprise grade. That means that from a database perspective, for example, you may need to be able to support maintaining a database transaction context from one end to the other. You need to be able to pick up from where you left off and make sure that data that is collected is delivered to all of the appropriate targets. These could be variable and different.

You might be delivering some data on-premise and some data to the cloud, but you still need to be able to make sure that all the data has made it there. You need to be able to validate that the data is being written to all the different sources and targets that the platform is supporting.

If it’s part of a platform and they’re not third party, you would expect that to be there. If they are third party, then you have to investigate whether all of those things are supported. Data collection and data delivery is the first part of how you evaluate the platform.

Data Movement

The next part is how the platform does data movement. This is crucial to maintaining the kind of high throughput and low latency that you’d expect. Data movement covers a number of different things. It happens between processing steps, and between your source collection and your data delivery.

It could be between source collection, some in-memory processing or enrichment, and data delivery. Or it could be an even more complex pipeline with multiple steps in it. You’re moving data between each step.

It’s also between nodes. You may have a clustered platform that moves data between nodes for different processing steps, or between source and target because the target is closer to one node than to the others. You need to be able to ensure that the data movement happens efficiently, with high throughput and low latency, between nodes.

You also need to be able to support collecting data on-premise and delivering it into cloud environments, or collecting it from cloud environments and delivering it to on-premise, or moving between clouds. Supporting all these different topologies is all part of data movement.

Ideally as much of the data movement as possible should be in memory only. Try to avoid having to write to disk or do any kind of IO in between processing steps. The reason for this is that each processing step needs to perform optimally in order to get high throughput.

If you are persisting data, that can add latency. Ideally when you’re doing multiple processing steps in a pipeline, you’re doing all of that data movement in memory only, between the steps or just between nodes. You’re not persisting to disk.

You should only use persistent data movement or persistent data streams where needed. There are a couple of really good use cases for this. One is if you have data sources that you can’t rewind into for recoverability, you may want to use a persistent data stream as the first step in the process, but everything downstream can be in memory only.

If you’re collecting data in real time, but you have multiple applications all running at their own speeds against that data, you may want to think about having persistent data streams between different steps. Typically, you want to minimize the amount of persistent data streams that you have and use in-memory only data streams wherever possible. That will really aid in reducing your latency and increasing your throughput.

Stream Processing

The next thing that you need to be able to do is stream processing. Stream processing obviously has to be able to support all of the different types of processing that you want to do. For example, it needs to be able to support complex transformations. If it doesn’t support the transformations that you want, you should be able to add in your own components or your own user defined functions to do the transformations.

It needs to be able to combine and enrich data. This requires a lot of different constructs for stream processing. When you are combining data together from multiple data streams, they run at high speed and typically events aren’t going to happen at the same time.

You need a flexible windowing structure that can maintain a set of events from different data streams to combine together, in order to produce a combined output stream that has the last data retained from every other stream along with the current data from the stream that just fired.
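One simple form of such a window keeps the most recent event seen on each stream and emits a combined record whenever any stream produces a new one. The sketch below assumes each event carries a stream name; it is only meant to illustrate the idea, not a full windowing engine.

```python
from typing import Iterable, Iterator

def combine_latest(events: Iterable[dict]) -> Iterator[dict]:
    """Emit a combined record built from the newest event seen on each stream."""
    latest: dict[str, dict] = {}
    for event in events:
        latest[event["stream"]] = event
        # Current event from the stream that fired, plus the last
        # event retained from every other stream.
        yield {name: dict(e) for name, e in latest.items()}

if __name__ == "__main__":
    interleaved = [
        {"stream": "orders", "order_id": 1},
        {"stream": "payments", "order_id": 1, "amount": 25.0},
        {"stream": "orders", "order_id": 2},
    ]
    for combined in combine_latest(interleaved):
        print(combined)
```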

When you’re enriching data, you need to be able to join streaming data with reference data. You can’t go back to a database or go back to the original source of the reference data for every event on a data stream. It’s just too slow. You need to be able to load, cache, and remember the data you are using for enrichment in memory so you can join it really efficiently, in order to keep and maintain the throughput that you’re looking for from the overall system.
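In practice that usually means loading the reference data into memory once (and refreshing it as needed), then doing a lookup per event instead of a database round trip. A rough sketch, with a small invented customer table loaded through sqlite3:

```python
import sqlite3

# Illustrative reference data; a real pipeline would load this from the
# reference database once (or refresh it periodically), then keep it cached.
def load_reference_cache() -> dict[int, str]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "Acme Corp"), (2, "Globex")])
    return {row[0]: row[1] for row in conn.execute("SELECT id, name FROM customers")}

CACHE = load_reference_cache()

def enrich(event: dict) -> dict:
    # In-memory join: no round trip to the reference database per event.
    event = dict(event)
    event["customer_name"] = CACHE.get(event["customer_id"], "unknown")
    return event

if __name__ == "__main__":
    print(enrich({"customer_id": 1, "amount": 99.0}))
```

The per-event work is a dictionary lookup, which is what keeps enrichment from becoming the bottleneck of the stream.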

You want the stream processing to be optimized. It should really run as fast as if you’d written it yourself manually. It also needs to be easy to use. We recommend that you look for SQL-based stream processing because SQL is the language of data. There are very few people that work with data that don’t understand SQL. It allows you to do filtering, transformation, and data enrichment through natural SQL constructs.

Obviously if you want to do more complex things, you should also be allowed to import your own transformations and work with those. SQL-based transformations enable anyone who knows data to build and understand what the transformations are. You also want building pipelines to be as accessible as possible to all the people who want to work with the data.

You need to have a good UI for building the data pipelines and have as much of the process as possible automated through wizards and other UI-based assistance. You need to be able to build multi-step stream processing, not just a single source into a single target, or a single source into a single piece of processing into a single target. It should potentially support fan-in and fan-out: multiple data sources coming in, going into multiple processing steps in a staged environment, where they go step by step, to potentially multiple targets coming out at the other end.

This all needs to be coordinated, well-maintained, and deployable across a cluster in order to be scalable. Your stream processing should be very rich, very capable, and also very high throughput.

Enterprise Grade

You also need to think about the enterprise-grade qualities of the platform. I’ve mentioned before, for it to be enterprise grade it needs to be scalable. You need to be able to handle increasing the throughput, increasing the number of sources, increasing the number of targets, and increasing the volume of data being generated from each one of those.

When you’re evaluating platforms and evaluating for a production scenario, you should test the platform with a reasonable throughput that corresponds to what you’re expecting in order to see how it behaves and how it scales, and measure the throughput and the latency from end to end as you’re evaluating the platform.

You also need it to be reliable. You need to be able to ensure that you have guaranteed delivery from source all the way to target. Even if something fails, if a network fails, if the source or the target goes down, if any of the processing nodes in the cluster go down or the whole cluster goes down, you need to be able to ensure that it picks up from where it left off and doesn’t miss any messages.

It has to be able to recover from failures as well. Guaranteed delivery in the normal “I’m always running” case so you don’t miss any messages, just because they disappeared into the ether somewhere. But also, that if you have a failure, you should recover and not lose any messages, not lose any events that come from the source into the target.

Of course, security is also paramount. You need to be able to secure the data while it’s moving in transit, so it’s encrypted as it goes across the network. But you also need to secure who has access to the data: who can work with individual data streams, who can see the data on individual data streams, who can build applications, and who can view the results of those applications.

You need security that works across the whole end to end and deals with every single component, so that you can secure them and lock them down and make sure that only the people that need to work with data, can.

Insights & Alerts

Finally, you need to make sure that the platform gives you visibility into your data, that you can monitor the data flows and see what’s going on in real time, that you get alerts when anything happens. This could be when CPU or memory usage on any of the nodes goes above certain criteria. It could be when applications crash, or data flows crash. It could be when volume goes above or below what you expect, and doing that in a granular fashion. For example, when an individual database table goes above or below what you expect.
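A stripped-down illustration of that kind of granular check: count events per table over an interval and raise an alert when the count falls outside an expected band. The thresholds and table names here are invented for the example.

```python
from collections import Counter

# Hypothetical expected per-interval volume bands, per table.
EXPECTED = {"orders": (100, 10_000), "audit": (0, 500)}

def check_volumes(events: list[dict]) -> list[str]:
    counts = Counter(e["table"] for e in events)
    alerts = []
    for table, (low, high) in EXPECTED.items():
        seen = counts.get(table, 0)
        if seen < low:
            alerts.append(f"ALERT: {table} volume {seen} below expected minimum {low}")
        elif seen > high:
            alerts.append(f"ALERT: {table} volume {seen} above expected maximum {high}")
    return alerts

if __name__ == "__main__":
    interval_events = [{"table": "audit"}] * 900   # unusually high audit volume
    for alert in check_volumes(interval_events):
        print(alert)
```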

You need to be able to work with insights into the data flows that help you operationalize this and make sure that it’s working full time, 24/7, when you actually put it into production. You may even want to get insights on the data itself, drill down into the actual data that’s flowing, and do some analytics on that. If your streaming integration platform can also give you those valuable insights on the streaming data, then that’s the icing on the cake.

Just to summarize, when you’re evaluating streaming data integration platforms, you need to make sure that the platform can do everything that you need, to get your data from where it’s generated to where it needs to be, in order to get real value out of your data.

 

To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

Use Cases for Streaming Data Integration

The Top 4 Use Cases for Streaming Data Integration: Whiteboard Wednesdays

 

 

Today we are talking about the top four use cases for streaming data integration. If you’re not familiar with streaming data integration, please check out our channel for a deeper dive into the technology. In this 7-minute video, let’s focus on the use cases.

 

Use Case #1 Cloud Adoption – Online Database Migration

The first one is cloud adoption – specifically online database migration. When you have your legacy database and you want to move it to the cloud and modernize your data infrastructure, if it’s a critical database, you don’t want to experience downtime. The streaming data integration solution helps with that. When you’re doing an initial load from the legacy system to the cloud, the Change Data Capture (CDC) feature captures all the new transactions happening in this database as they happen. Once this database is loaded and ready, all the changes that happened in the legacy database can be applied in the cloud. During the migration, your legacy system is open for transactions – you don’t have to pause it.

While the migration is happening, CDC helps you to keep these two databases continuously in-sync by moving the real-time data between the systems. Because the system is open to transactions, there is no business interruption. And if this technology is designed for both validating the delivery and checkpointing the systems, you will also not experience any data loss.

Because this cloud database has production data, is open to transactions, and is continuously updated, you can take your time to test it before you move your users. So you have basically unlimited testing time, which helps you minimize your risks during such a major transition. Once the system is completely in-sync and you have checked it and tested it, you can point your applications and run your cloud database.

This is a single switch-over scenario. But streaming data integration gives you the ability to move the data bi-directionally. You can have both systems open to transactions. Once you test this, you can run some of your users in the cloud and some of your users in the legacy database.

All the changes happening with these users can be moved between databases, synchronized so that they’re constantly in-sync. You can gradually move your users to the cloud database to further minimize your risk. Phased migration is a very popular use case, especially for mission-critical systems that cannot tolerate risk and downtime.

Use Case #2 Hybrid Cloud Architecture

Once you’re in the cloud and you have a hybrid cloud architecture, you need to maintain it. You need to connect it with the rest of your enterprise. It needs to be a natural extension of your data center. Continuous real-time data movement with streaming data integration allows you to have your cloud databases and services as part of your data center.

The important thing is that these workloads in the cloud can be operational workloads because there’s fresh information (i.e., continuously updated information) available. Your databases, your machine data, your log files, your other cloud sources, messaging systems, and sensors can all move data continuously to enable operational workloads.

What do we see in hybrid cloud architectures? Heavy use of cloud analytics solutions. If you want operational reporting or operational intelligence, you want comprehensive data delivered continuously so that you can trust that it’s up-to-date, and gain operational intelligence from your analytics solutions.

You can also connect your data sources with the messaging systems in the cloud to support event distribution for your new apps that you’re running in the cloud so that they are completely part of your data center. If you’re adopting multi-cloud solutions, you can again connect your new cloud systems with existing cloud systems, or send data to multiple cloud destinations.

Use Case #3 Real-Time Modern Applications

A third use case is real-time modern applications. Cloud is a big trend right now, but not everything is necessarily in the cloud. You can have modern applications on-premises. So, if you’re building any real-time app or modern system that needs timely information, you need to have continuous real-time data pipelines. Streaming data integration enables you to run real-time apps with real-time data.

Use Case #4 Hot Cache

Last, but not least, when you have an in-memory data grid to help with your data retrieval performance, you need to make sure it is continuously up-to-date so that you can rely on that data – it’s something that users can depend on. If the source system is updated, but your cache is not updated, it can create business problems. By continuously moving real-time data using CDC technology, streaming data integration helps you to keep your data grid up-to-date. It can serve as your hot cache to support your business with fresh data.
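The mechanics are easy to picture: each change event from the source is applied to the in-memory grid as it arrives, so reads always see current data. A minimal sketch, with a plain dictionary standing in for the data grid:

```python
# A plain dict stands in for the in-memory data grid / hot cache.
hot_cache: dict[int, dict] = {1: {"sku": "A-100", "price": 9.99}}

def apply_change_to_cache(event: dict) -> None:
    """Keep the cache current by applying each CDC event as it arrives."""
    op, key, row = event["op"], event["key"], event.get("row")
    if op == "DELETE":
        hot_cache.pop(key, None)
    else:                           # INSERT or UPDATE
        hot_cache[key] = dict(row)

if __name__ == "__main__":
    apply_change_to_cache({"op": "UPDATE", "key": 1,
                           "row": {"sku": "A-100", "price": 8.49}})
    apply_change_to_cache({"op": "INSERT", "key": 2,
                           "row": {"sku": "B-200", "price": 14.00}})
    print(hot_cache)   # reads now reflect the source system immediately
```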

 

To learn more about streaming data integration use cases, please visit our Solutions section, schedule a demo with a Striim expert, or download the Striim platform to get started.

 

Build vs. Buy for Streaming Data Integration

Build vs. Buy for Streaming Data Integration: Whiteboard Wednesdays

In this Whiteboard Wednesday video, Steve Wilkes, founder and CTO of Striim, asks the question, “Is it better to build or buy a Streaming Data Integration platform?”  Read on, or watch the 10-minute video:

You want to use streaming data integration to move data from existing sources that generate lots of data to targets. But is it better to build a streaming data integration platform from lots of readily available open source components, or buy one and have that work already done for you?

Build

First, you need all the components that we’ve mentioned in previous videos that go into making a streaming data integration platform. You need to build data collectors and manage data delivery – which is not just one single thing. You need many adapters that support databases, files, maybe IoT and cloud targets, essentially a whole bunch of different technologies to get the data from where it is now to where you want it to be.

You need to think about data movement. What messaging system do I use? Do I use multiple ways of doing data movement? How am I going to do stream processing? What is the engine for that? How do I define it? Is there a UI in open source? Not everything has a UI. How do I define the stream processing? Does it support SQL, which in a previous video we mentioned is the best approach for working with streaming data? How do I enrich data? Does the platform include an in-memory data grid or other mechanism by which I can enrich data at high speed? How do I ensure that all of this is enterprise grade? Is it scalable, reliable, and secure? How do I get insights and alerts? How do I view the data flows and understand what’s going on if I’m piecing this together myself? This is essential for a lot of operational reasons.

Build vs. Buy for Streaming Data Integration

So, assume that you are going to try and build this from lots of different open source pieces. For every piece that goes into making up this platform, including all the framework pieces and the adapters that you will need on either end, you need to go through a process. This process involves designing the overall platform – how all the pieces fit together. Then for each piece you need to look at what is available, and evaluate each option, not just on its own, but on how it fits in with the other pieces that you are looking at. Once you have done that for each piece, you need to try and integrate all the pieces together. That integration involves building a lot of glue code on top of the pieces that you have chosen, to abstract them so that they are easier for people to work with, and to ensure that the result is enterprise grade: it scales together, it is reliable, and it is secure. Assuming that you have built all of this, you can then start to build your applications.

Maintenance

Now when it comes to maintenance, open source isn’t always maintained forever. People can stop supporting it. In that case you have to identify another piece of open source that can take its place and will fit in. Then you have to integrate again. Even if it’s just upgraded, that could mean the APIs have changed or the way that it functions has changed. You’re going to have to perform the integration and test everything again to make sure that everything is working. Once that is complete you are going to have to test your applications on the platform that you’ve just built again and make sure that’s working too.

Support

If you have issues, you need to go to support. Some open source platforms offer support, some don’t. There may be pieces that you’re supporting yourself, and because you’ve chosen to put multiple things together, there may be multiple vendors providing support, as in the case of Change Data Capture (CDC). That becomes a headache because different organizations tend to blame each other. You need to have support for all the pieces. If you need to upgrade, then you need to reintegrate again. Maintenance can become a big issue.

Buy

Now the difference between doing all of that and using a prebuilt SDI solution is that you’re not doing any of these pieces in the middle. In going the buy route, you replace all those pieces of open source that you would have put together yourself with the SDI solution straight out of the box. You simply download the solution and start building your data flows. That is going to massively reduce the amount of time it takes for you to start building your data integrations.

Build vs. Buy - Streaming Data Integration

The Differences Between Build vs. Buy

When it comes to summarizing all of this, what are the differences between the costs and risks of build vs. buy?

Development Costs

We went through all the steps of building a solution. Each one of those steps is going to involve engineers. They’re going to involve people that understand different technologies, and there is going to have to be a team around that to manage building the platform. It can be a pretty high development cost, and often that development cost can outweigh the cost of licensing a platform that you purchase.

Time to Value

There’s a long time to market because you have to go through all those steps for each component. If components are upgraded or deprecated, you have to repeat all of those steps. So consider the time it takes from you saying, “I want to move data from my on-premise databases into my cloud data warehouse” or “I want to move data from IoT on-premise into storage so I can do machine learning.” The time it takes for you to go from thinking “I want to do that” to being able to do that is as long as it takes you to build your framework and infrastructure, to build your own SDI platform.

Whereas if you’re buying it, the time it takes is however long it takes to download it, which is much faster. The time to a solution is massively reduced because you’re not spending all that time building the platform instead of focusing on building business value.

A lot of organizations out there are not software development companies, they are companies with a different purpose. They are a bank, or a finance organization, or a healthcare organization. Value to the organization is massively increased if you’re not spending huge amounts of time building something that someone else has already built. Focusing on business value enables you to build out solutions quickly and bring value to your business much more quickly – utilizing a prebuilt solution rather than building it yourself.

Build vs. Buy - Streaming Data Integration

Support

I’ve mentioned that open source can become obsolete. There are cases where organizations invested in open source technologies that people just stopped developing because they were no longer the hot new thing. Those organizations then had to try and maintain the code themselves, and understand someone else’s code, in order to maintain the open source that they’d heavily invested in.

If you are buying a streaming data integration solution from a vendor, then that vendor is going to support you. Building it yourself, by contrast, requires a wide range of specialized skills. Each one of the pieces of open source requires a different skill set in order to understand it and work with it. It may be a different programming language; it may be a different environment that you need to set up. There’s also a meta skill required in understanding how all those components fit together. This is an architectural role that has to understand all the different APIs and different components and work out how to add all the enterprise-grade features so that they all scale, are reliable, and all work together. If you’re buying, the costs really are the license cost and limited visibility into the code, into the engine that makes it work.

Depending on who you are, that may or may not be a big issue. As long as what you’ve built, and the concepts around what you’ve built, are transportable, then your investment in streaming data integration is also transportable.

What are the differences between build vs. buy? It really comes down to this: do you want to get going with building your solution and providing business value straight away? Or do you want to focus on being a software development organization, building a platform on which you can then build the applications that will provide business value?

To learn more about build vs. buy for streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.

Whiteboard Wednesdays - Streaming Data Integration

Streaming Data Integration: Whiteboard Wednesdays

 

 

In this Whiteboard Wednesday video, Steve Wilkes, founder and CTO of Striim, takes a look at Streaming Data Integration – what it is, what it is used for, and most importantly, what is needed to set it up and manage it. Read on, or watch the 10-minute video:

Like all enterprises, you have a lot of data sources that generate data – databases, machine logs, or other files that are being produced in your systems. You may also have messaging systems or IoT sensors generating data. This data, generated all of the time, may not be in the format you need or where you need it. This limits your ability to get value from it or make decisions using it.

In terms of moving it somewhere else, maybe you want to put it into cloud storage or other types of storage. Maybe you want to put it into different databases or different data warehouses. Maybe you want to put it onto messaging systems to build applications for analytics and reporting, so that analysts, data scientists, decision makers, and your customers can get value out of it. Streaming data integration is about being able to collect data in real time from the systems that are generating it, and moving it to where it needs to be to get value out of it.

Data Collection

Continuous data collection means being able to get the data in real time. Each type of data source requires a different way of collecting the data. Databases are quite often associated with records of what has happened in the past. Running queries against a database can provide bulk responses of historical data. In order to capture real-time data – that is, the data as it’s being generated – you need to use a technology called change data capture or CDC.

CDC will allow you to collect the inserts, updates, and deletes that are happening in the database in real time, as they’re being generated. Similarly, you need specific technologies to get the data you want from other sources of data such as files or messaging systems. In the case of files, you may need to be able to tail the file, read at the end of the file, or maybe collect multiple files across multiple systems in parallel. Once you’ve got that data, you need to be able to deliver it to the target systems.
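For the file case, reading at the end of the file is essentially a tail: remember where you left off and keep emitting whatever gets appended after that point. A bare-bones Python sketch (it polls for simplicity rather than using filesystem notifications):

```python
import time
from typing import Iterator

def tail(path: str, poll_seconds: float = 1.0) -> Iterator[str]:
    """Yield lines appended to a file after we start watching it."""
    with open(path, "r") as f:
        f.seek(0, 2)                       # start at the end: only new data streams out
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(poll_seconds)   # wait for the writer to append more

# Usage (assumes some process is appending to the log file):
#   for record in tail("/var/log/app.log"):
#       send_to_stream(record)             # hypothetical delivery step
```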

Streaming Data Integration - Sources

Data Delivery

Data delivery depends on the type of target system. If you are using database change data capture on the source side, the delivery process also needs to be change aware, so that an insert, an update, or a delete is managed correctly in the target system. For example, if you are synchronizing an on-premise database with a cloud database, then you need to initially load the data, but you also need to apply changes to it in order to keep it synchronized. The delivery process needs to understand that, instead of just inserting all the events into the database, it treats an insert as an insert, an update as an update, and a delete as a delete. This needs to be applied in a heterogeneous fashion.
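To show what change aware means in practice, here is a simplified Python sketch that applies CDC events to a target table through sqlite3: an insert becomes an insert, an update becomes an update, and a delete becomes a delete, rather than every event being appended as a new row. The table and event shapes are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # stand-in for the target database
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")

def apply_change(event: dict) -> None:
    """Apply a CDC event to the target with operation-aware SQL."""
    op, row = event["op"], event["row"]
    if op == "INSERT":
        conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)",
                     (row["id"], row["status"]))
    elif op == "UPDATE":
        conn.execute("UPDATE orders SET status = ? WHERE id = ?",
                     (row["status"], row["id"]))
    elif op == "DELETE":
        conn.execute("DELETE FROM orders WHERE id = ?", (row["id"],))
    conn.commit()

for e in [{"op": "INSERT", "row": {"id": 1, "status": "NEW"}},
          {"op": "UPDATE", "row": {"id": 1, "status": "SHIPPED"}}]:
    apply_change(e)

print(conn.execute("SELECT * FROM orders").fetchall())   # [(1, 'SHIPPED')]
```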

You also need to be able to work with lots of different APIs. Again, just as you need to be able to work with databases, with storage, and with messaging, you also need to be able to support different cloud systems. The security mechanisms used by the different cloud vendors are all different, so you need to be able to manage that.

Data Movement

Streaming data integration needs to continuously move the data that’s being collected into the appropriate target. It may be required to collect data from multiple sources, combine them together, and push them out to a single target, or take data from a single source and move it and deliver it continuously to multiple targets.

Data movement has to be smart. It has to be able to move data between processes, between nodes, in a clustered environment. It has to be able to move data from on-premise to cloud, or from cloud to on-premise, or across different networks and even across different clouds. Also, this data movement has to be at really high speeds and low latency. You need to be able to deliver the data from one place to another really quickly.

In a lot of use cases, you’re not just dealing with the raw data. If you’re doing database replication, then you can collect change on one side and apply that change exactly as-is to the target. But in a lot of cases you want to manipulate the data in some way. The format, structure, or content of the data you collect may not be enough, may not be correct, or may not be sufficient for where you want to put it, so you may need to do some kind of stream processing.

Streaming Data Integration - Targets

Data Processing

Data processing is about being able to take the data that’s on the data stream and manipulate it. You may want to filter it. Maybe everything you collect doesn’t need to be written to everything you deliver to. Maybe only a subset is written to storage; maybe only a subset goes into messaging. You need to be able to filter that data and choose what goes where.

You also need to be able to transform it, to give it the structure that you want, and to apply functions to it in order to make it look like you need.

You also need to worry about security. Maybe you need to mask some of the data, obfuscate it in some way. If you’re doing cloud analytics and you have Personally Identifiable Information (PII) coming from your sources, then it’s probably a good idea to anonymize that data before you put it into the cloud for analytics. Being able to do that as the data streams to a target is really important.
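A small illustration of in-flight anonymization: assumed sensitive fields are replaced with an irreversible hash before the event leaves for the cloud, so the data is still usable for joins and analytics but no longer exposes the original values. The field names are hypothetical.

```python
import hashlib

PII_FIELDS = {"email", "ssn"}          # assumed sensitive fields for this example

def anonymize(event: dict) -> dict:
    """Mask PII in the event before it is delivered to a cloud target."""
    masked = dict(event)
    for field in PII_FIELDS:
        if field in masked:
            digest = hashlib.sha256(str(masked[field]).encode()).hexdigest()
            masked[field] = digest[:16]     # irreversible, but consistent across events
    return masked

if __name__ == "__main__":
    raw = {"customer_id": 42, "email": "ada@example.com",
           "ssn": "123-45-6789", "amount": 31.50}
    print(anonymize(raw))    # safe to send on to cloud analytics
```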

Maybe you need to be able to encrypt the data or combine data together. You can take data from multiple sources and combine it into a single data stream that you then apply into a target. Maybe you need to aggregate or summarize the data before you write it into a data warehouse. Maybe it doesn’t contain enough information by itself. Maybe you need to denormalize it or enrich it before it is delivered somewhere else. For streaming integration platforms, you need to be able to do all of these different types of processing on the data in order to get it in the form that you want before you deliver it.

Data Insights

You also need to be able to get insights on the data, to see how the data’s flowing, whether there are any issues in the data flow, maybe even investigate the data itself if there are any issues. This can be done ad hoc, but from an operational perspective, it’s also important that you get alerts if anything is failing or if there is an unusual volume of data coming from one particular place. There suddenly could be an unusual volume coming from your audit table, or maybe there isn’t sufficient volume and it’s fallen behind what you’d expect for that time of day. Being able to get alerts on the data flows themselves can be really, really important.

Streaming Data Integration - Enterprise Grade

Enterprise-Grade Integration

You can play about with streaming data integration components in the lab. But if you’re going to put things into production, streaming data integration needs to be enterprise-grade. Everything needs to be able to scale. Typically, you’re talking about large volumes of data being generated continuously for streaming integration. You need to be able to deal with scalability and adding additional scale over time.

You are probably looking at a clustered, distributed environment where you can add new nodes as needed in order to be able to handle scale. Of course, for mission-critical platforms and mission-critical data flows, it needs to be reliable.

You need to be able to say, if I collect data, if data is generated on this source side, then I have to be able to guarantee that it makes it into my targets. If I am replicating a database that’s on-premise into a data warehouse that is in the cloud, I need to make sure that every single operation makes it from one end to the other, and that I can investigate that and validate that that has actually happened.

For streaming data integration to be enterprise grade and work with mission-critical applications, you need to be able to ensure that it is reliable. At a minimum, there should be at-least-once processing, so that everything that’s generated on the source makes it to the target. In certain cases, you need “exactly-once processing.” This means that not only does it make it to the target, but you are guaranteed that, even in the case of failures, it only ever makes it to the target once.

Security is important. We’ve talked about data security in terms of being able to mask and encrypt data as it flows. As things are moving across networks, all of that data should be encrypted so that people can’t tap into the network flow. From a security perspective, you need to be able to lock down who has access to what in the system that is doing all of these things. If you’re generating a data stream on-premise and it has personally identifiable information in it, you need to be able to lock down which employees can access that data stream, even through the streaming platform. Being able to do authentication and authorization within the streaming platform itself is also essential.

In summary, streaming data integration is about continually collecting data from enterprise and cloud sources, being able to process that data while it’s moving, delivering it into any target that you want, whether it’s on-premise or cloud, being able to get insights into how it’s moving, and doing all of this in an enterprise-grade way that is scalable, reliable, and secure.

To learn more about streaming data integration, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started.