Change Data Capture Methods
In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. Companies use change data capture for several use cases such as cloud adoption and enabling real-time data warehousing. There are multiple common change data capture methods that you can implement depending on your application requirements and tolerance for performance overhead.
In high-velocity data environments where time-sensitive decisions are made, change data capture is an excellent fit to achieve low-latency, reliable, and scalable data integration. With over 80% of companies planning on implementing multi-cloud strategies by 2025, picking the right change data capture method for your business is more critical than ever given the need to replicate data across multiple environments.
The business transactions captured in relational databases are critical to understanding the state of business operations. Traditional batch-based approaches to move data once or several times a day introduce latency and reduce the operational value to the organization. Change Data Capture provides real-time or near real-time movement of data by moving and processing data continuously as new database events occur.
There are several change data capture methods to identify changes that need to be captured and moved. Here are the common methods, how they work, and their advantages as well as shortcomings.
By using existing “LAST_UPDATED” or “DATE_MODIFIED” columns, or by adding one if not available in the application, you can create your own change data capture solution at the application level. This approach retrieves only the rows that have been changed since the data was last extracted.
The CDC logic for the technique would be
- Get the maximum value of both the target (blue) table’s ‘Created_Time’ and ‘Updated_Time’ columns
2. Select all the rows from the data source with ‘Created_Time’ greater than (>) the target table’s ‘Updated_Time’ , which are all the newly created rows since the last CDC process was executed.
3. Select all rows from the source table that have a ‘Updated_Time’ greater than (>) the target table’s ‘Updated_Time’ but less than (<) its ‘Updated_Time’. The reason for the exclusion of rows less than the maximum target create date is that they were included in step 2.
4. Insert new rows from step 2 or modify existing rows from step 3 in the target.
Pros of this method
- It can be built with native application logic
- It doesn’t require any external tooling
Cons of this method
- Adds additional overhead to the database
- DML statements such as deletes will not be propagated to the target without additional scripts to track deletes
- Error prone and likely to cause issues with data consistency
This approach also requires CPU resources to scan the tables for the changed data and maintenance resources to ensure that the DATE_MODIFIED column is applied reliably across all source tables.
You can use table delta or ‘tablediff’ utilities to compare the data in two tables for non-convergence. Then you can use additional scripts to apply the deltas from the source table to the target as another approach to change data capture. There are several examples of SQL scripts that can find the difference of two tables.
Advantages of this approach:
- It provides an accurate view of changed data while only using native SQL scripts
Disadvantage of this approach:
- Demand for storage significantly increases because you need three copies of the data sources that are being used in this technique: the original data, previous snapshot, and current snapshot
- It does not scale well in applications with heavy transactional workloads
Although this works better for managing deleted rows, the CPU resources required to identify the differences is significant and the overhead increases linearly with the volume of data. The diff method also introduces latency and cannot be performed in real time. Some log-based change data capture tools come with the ability to analyze different tables to ensure replication consistency.
Another method for building change data capture at the application level is defining triggers and creating your own change log in shadow tables. Triggers fire before or after INSERT, UPDATE, or DELETE commands (that indicate a change) and are used to create a change log. Operating at the SQL level, some users prefer this approach. Some databases even have native support for triggers.
However, triggers are required for each table in the source database, and they have greater overhead associated with running triggers on operational tables while the changes are being made. In addition to having a significant impact on the performance of the application, maintaining the triggers as the application change leads to management burden.
Advantages of this approach:
- Shadow tables can provide an immutable, detailed log of all transactions
- Directly supported in the SQL API for some databases
Disadvantage of this approach:
- Significantly reduces the performance of the database by requiring multiple writes to a database every time a row is inserted, updated, or deleted
Many application users do not want to risk the application behavior by introducing triggers to operational tables. DBAs and data architects should always heavily test the performance of any triggers added into their environment and decide if they can tolerate the additional overhead.
Log-Based Change Data Capture
Databases contain transaction (sometimes called redo) logs that store all database events allowing for the database to be recovered in the event of a crash. With log-based change data capture, new database transactions – including inserts, updates, and deletes – are read from source databases’ native transaction or redo logs.
The changes are captured without making application level changes and without having to scan operational tables, both of which add additional workload and reduce source systems’ performance.
Advantages of this approach
- Minimal impact on production database system – no additional queries required for each transaction
- Can maintain ACID reliability across multiple systems
- No requirement to change the production database system’s schemas or the need to add additional tables
Challenges of this approach
- Parsing the internal logging format of a database is complex – most databases do not document the format nor do they announce changes to it in new releases. This would potentially require you to change your database log parsing logic with each new database release.
- Would need system to manage the database change events metadata
- Additional log levels required to produce scannable redo logs can add marginal performance overhead
Data integration platforms that natively perform change data capture can handle the complexity mentioned above by automatically mining the database change logs while managing additional metadata to ensure the replication between two or more systems is reliable.