Change Data Capture (CDC)
TapData Cloud offers you cloud services that are suitable for scenarios requiring rapid deployment and low initial investment, helping you focus more on business development rather than infrastructure management. Free trial with TapData Cloud.TapData Enterprise can be deployed in your local data center, making it suitable for scenarios with strict requirements on data sensitivity or network isolation. It can serve to build real-time data warehouses, enable real-time data exchange, data migration, and more.TapData Community is an open-source data integration platform that provides basic data synchronization and transformation capabilities. This helps you quickly explore and implement data integration projects. As your project or business grows, you can seamlessly upgrade to TapData Cloud or TapData Enterprise to access more advanced features and service support.Change Data Capture (CDC) is a method for capturing and tracking data changes in a database. It plays a crucial role in data synchronization and integration, enabling incremental data synchronization. This document provides a detailed overview of the various CDC methods, helping you understand their working principles, advantages, and disadvantages, and offering specific usage instructions.
CDC Methodsβ
- Database Log API
- Database Log File
- Field Polling
- Database Trigger
Database Log API-Based CDC is a commonly used data change capture technique that captures incremental data changes by reading and parsing the database's transaction logs. These logs are a key component used by the database management system to ensure data integrity and recoverability, recording every detailed operation of the database.
For example, in MySQL, administrators can enable Binlog by modifying the database configuration file (mysql.cnf
) to record all data modification operations and capture data change details.
server_id = 223344
log_bin = mysql-bin
expire_logs_days = 7
binlog_format = row
binlog_row_image = full
After completing permission granting and data source connection, you can configure it as a data source in Tapdata's task configuration to achieve full and incremental data synchronization (default).
This solution requires the additional installation of a log parsing plugin. For example, with Oracle, after contacting Tapdata technical support to complete the plugin deployment, you can choose the log plugin as bridge when configuring the Oracle connection. Then, fill in the IP address of the raw log service, with the default service port of 8190.
For example, in MySQL, suppose there is a table orders
where the last_updated
field records the last update time of each record. Tapdata periodically polls the last_updated
field to detect data changes, as shown in the following example SQL:
-- Query records that have changed since the last poll
SELECT * FROM orders WHERE last_updated > '2024-06-01 00:00:00';
After completing permission granting and data source connection, you can set the incremental synchronization method to Polling for the source node and select the target field (last_updated
) in Tapdata when configuring the data transformation task.
For example, in MySQL, we first need to manually create a trigger to record changes in the orders
table:
-- Create a table to record changes
CREATE TABLE orders_changes (
id INT AUTO_INCREMENT PRIMARY KEY,
order_id INT,
change_type VARCHAR(10),
change_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Create triggers to track INSERT, UPDATE, and DELETE
CREATE TRIGGER after_orders_insert
AFTER INSERT ON orders
FOR EACH ROW
BEGIN
INSERT INTO orders_changes (order_id, change_type) VALUES (NEW.id, 'INSERT');
END;
CREATE TRIGGER after_orders_update
AFTER UPDATE ON orders
FOR EACH ROW
BEGIN
INSERT INTO orders_changes (order_id, change_type) VALUES (NEW.id, 'UPDATE');
END;
CREATE TRIGGER after_orders_delete
AFTER DELETE ON orders
FOR EACH ROW
BEGIN
INSERT INTO orders_changes (order_id, change_type) VALUES (OLD.id, 'DELETE');
END;
After completing the setup, you can choose to develop your program to parse and process these changes, or use Debezium and Apache Kafka for more efficient incremental data processing.
This method is not optimal and increases maintenance costs, so Tapdata does not adopt this approach.
CDC Method Comparisonβ
- Feature Comparison
- Pros and Cons Comparison
Category | Database Log API | Database Log File | Field Polling | Database Trigger |
---|---|---|---|---|
Distinguishes Insert/Update Operations | β | β | β | β |
Monitors Delete Operations | β | β | β | β |
Real-time Collection | β | β (Ultra-high performance) | β | β |
Business Intrusion | π’ Low | π’ Low | π΄ High | π‘ Medium |
DBA Maintenance Cost | π‘ Medium | π΄ High (Requires additional components) | π’ Low | π΄ High (Trigger management is complex) |
System Overhead Cost | π’ Low | π’ Low | π΄ High | π΄ High |
CDC Method | Advantages | Disadvantages |
---|---|---|
Database Log API | β Utilizes existing logs with minimal impact on performance β Easily tracks schema changes | β Requires sufficient storage space and expiration time for logs |
Database Log File | β Directly parses native binary logs, reducing intermediates β Higher collection performance, RPS over 20,000 | β Requires additional component deployment and maintenance β Requires broader account permissions |
Field Polling | β Simple implementation, not dependent on logs or triggers β Highly versatile, applicable to various databases | β Lower real-time performance, dependent on polling frequency β Business intrusion, involves table structure changes β Cannot track deletions or schema changes |
Database Trigger | β Operates at SQL level, simple implementation β Reliable and detailed, accurately captures data changes | β Data changes require multiple writes β Multiple triggers can severely impact performance β Requires creating triggers on each table, high maintenance cost β Triggers may be disabled under certain circumstances |
FAQsβ
Q: Which data sources does Tapdata support CDC capture for?
A: Please refer to the tables in Supported Data Sources. If incremental data is supported as a data source, CDC information can be obtained.
Q: If my data source supports CDC, how do I choose the CDC collection method?
A: To maximize compatibility and collection performance, Tapdata supports the following CDC collection methods:
- Database Log API: The default collection method, supported by most databases. If permission restrictions prevent log access or for certain SaaS data sources, choose the Field Polling method.
- Database Log File: Currently supported only for Oracle and Db2 data sources.
- Field Polling: Set the incremental synchronization method for the source node in Tapdata when configuring the data transformation task.