Skip to main content

MongoDB

Applicable EditionsTapData CloudTapData Cloud offers you cloud services that are suitable for scenarios requiring rapid deployment and low initial investment, helping you focus more on business development rather than infrastructure management. Free trial with TapData Cloud.TapData EnterpriseTapData Enterprise can be deployed in your local data center, making it suitable for scenarios with strict requirements on data sensitivity or network isolation. It can serve to build real-time data warehouses, enable real-time data exchange, data migration, and more.TapData CommunityTapData Community is an open-source data integration platform that provides basic data synchronization and transformation capabilities. This helps you quickly explore and implement data integration projects. As your project or business grows, you can seamlessly upgrade to TapData Cloud or TapData Enterprise to access more advanced features and service support.

MongoDB is a popular open-source NoSQL database that stores and retrieves data in a flexible and scalable manner. TapData supports the integration of MongoDB as both the source and target database for building data pipelines. This article provides a comprehensive guide on how to add MongoDB to TapData, enabling you to leverage its scalability, flexibility, querying, and indexing capabilities for your data processing needs.

Supported Versions and Architectures​

CategoryDescription
VersionMongoDB 4.0 and above. For MongoDB 3.4 and below, please choose the data source named MongoDB Below 3.4.
Architecture● As a source: Supports replica set and sharded cluster architectures. Additionally, full and incremental data synchronization can be performed from the secondary nodes of a sharded cluster.
● As a target: Supports single node, replica set, and sharded cluster architectures.

Supported Data Types​

CategoryData Types
Strings and CodeString, JavaScript, Symbol
Numeric TypesDouble, Int32, Int64, Decimal128
Document and ArrayDocument, Array
Binary and ObjectIdBinary, ObjectId
Boolean TypeBoolean
Date and TimestampDate, Timestamp
Special TypesMin Key, Max Key, Null

SQL Operations for Sync​

DML: INSERT, UPDATE, DELETE

tip

When MongoDB is used as a target database, you can select the write strategy through the advanced settings of the task node: in case of insert conflicts, you can choose to convert to an update or discard the record; in case of update failures, you can choose to convert to an insert or just log the issue.

Prerequisites​

As a Source Database​

  1. Make sure that the schema of the source database is a replica set or a sharding cluster. If it is standalone, you can configure it as a single-member replica set to open Oplog. For more information, see Convert a Standalone to a Replica Set.

  2. To ensure sufficient storage space for the Oplog, it is important to configure it to accommodate at least 24 hours' worth of data. For detailed instructions, see Change the Size of the Oplog.

  3. To create an account and grant permissions according to permission management requirements, follow the necessary steps.

    use admin
    db.createUser(
    {
    user: "tapdata",
    pwd: "my_password",
    roles: [
    { role: "read", db: "database_name" },
    { role: "read", db: "local" },
    { role: "read", db: "config" },
    { role: "clusterMonitor", db: "admin" },
    ]
    }
    )
    tip

    In shard cluster architectures, the shard server is unable to retrieve user permissions from the config database. Therefore, it is necessary to create corresponding users and grant permissions on the master nodes of each shard.

  4. When the source database is a cluster, in order to improve data synchronization performance, TapData Cloud will create a thread for each shard and read the data. Before configuring data synchronization/development tasks, you also need to perform the following operations.

  5. For sharded cluster architectures, to improve data synchronization performance, TapData will create a thread for each shard to read data. Before configuring data synchronization/development tasks, you also need to perform the following operations:

    1. Stop the Balancer to avoid data inconsistency caused by chunk migrations.
    2. Clean up orphaned documents to avoid _id conflicts.

As a Target Database​

Grant write role to specified database (e.g. demodata) and clusterMonitor role for data validation, e.g.:

use admin
db.createUser(
{
user: "tapdata",
pwd: "my_password",
roles: [
{ role: "readWrite", db: "demodata" },
{ role: "clusterMonitor", db: "admin" },
]
}
)

Connect to MongoDB​

  1. Log in to TapData platform.

  2. In the left navigation bar, click Connections.

  3. Click Create on the right side of the page.

  4. In the pop-up dialog, search and select MongoDB.

  5. On the redirected page, fill in the MongoDB connection information as described below.

    MongoDB Connection Example

    • Connection Settings
      • Name: Fill in a unique name that has business significance.
      • Type: Supports MongoDB as a source or target database.
      • Connection Mode: Choose how you want to connect:
        • URI Mode: After selecting this mode, you will be required to provide the necessary information for the database URI connection. The connection string should include the username and password, which are concatenated in the format. For example, the connection string may look like: mongodb://admin:password@192.168.0.100:27017/mydb?replicaSet=xxx&authSource=admin.
        • Standard mode: After selecting this mode, you need to fill in the database address, name, account number, password and other connection string parameters.
    • Advanced Settings
      • Use TLS/SSL Connection: Choose according to your business needs:
        • TLS/SSL Connection: TapData will connect to a separate server in the network that provides a TLS/SSL channel to the database. If your database is located in an inaccessible subnet, try this method and upload the client private key file, provide the private key password, and choose whether to validate the server certificate.
        • Direct Connection: TapData Cloud will connect directly to the database and you need to set up security rules to allow access.
      • Number of Sampling Records for Loading Model: Specifies the number of records to sample when loading the schema to ensure that the generated schema structure matches the source data. Default is 1000.
      • Fields Load Limit For Each Collection: Limits the maximum number of fields loaded per collection to avoid slow schema generation due to excessive fields. Default is 1024.
      • CDC Log Caching: Mining the source database's incremental logs, this feature allows multiple tasks to share incremental logs from the source database, avoiding redundant reads and thus significantly reducing the load on the source database during incremental synchronization. Upon enabling this feature, an external storage should be selected to store the incremental log.
      • Contain Table: The default option is All, which includes all tables. Alternatively, you can select Custom and manually specify the desired tables by separating their names with commas (,).
      • Exclude Tables: Once the switch is enabled, you have the option to specify tables to be excluded. You can do this by listing the table names separated by commas (,) in case there are multiple tables to be excluded.
      • Agent Settings: Defaults to Platform automatic allocation, you can also manually specify an agent.
      • Model Load Time: If there are less than 10,000 models in the data source, their schema will be updated every hour. But if the number of models exceeds 10,000, the refresh will take place daily at the time you have specified.
      • Enable Heartbeat Table: When the connection type is Source&Target or Source, you can enable this switch. TapData will create a _tapdata_heartbeat_table heartbeat table in the source database and update it every 10 seconds (requires appropriate permissions) to monitor the health of the data source connection and tasks. The heartbeat task starts automatically after the data replication/development task starts, and you can view the heartbeat task in the data source editing page.
  6. Click Test. Once it passes, click Save.

    tip

    If the connection test fails, please follow the prompts on the page to troubleshoot and resolve the issue.

Node Advanced Features​

When configuring data synchronization or transformation tasks with Mongo

DB as the source or target node, TapData provides more advanced features to better meet complex business needs and maximize performance. You can configure these features based on your requirements:

MongoDB Node Advanced Features

ConfigurationDescription
Document PreimagesDisabled by default. Document Pre-Image refers to the document before it is replaced, updated, or deleted. It is only supported in MongoDB 6.0 and above. When enabled, the UPDATE/DELETE event will record the pre-image, which TapData will use for data synchronization.
Filling Modified DataEnabled by default. When enabled, it automatically fills in complete fields for UPDATE events.
Skip Deleted Events On FillingEnabled by default. Disabling this feature may cause incremental data issues in the following cases: tasks using non-_id fields for business logic; source field type is a sub-document and its attributes are modified; source field type is an embedded array and the PULL operation is executed.
No Cursor TimeoutDisabled by default. When enabled, MongoDB will disable cursor timeout to prevent timeout errors during data synchronization.
Write ConcernSets the write concern level. In a sharded cluster, mongos will pass the write concern to the shards. The default value is w1. Options are as follows:
● w0 / Unacknowledged: Does not wait for write acknowledgment. Fastest, but cannot confirm whether the write was successful.
● w1 / Acknowledged: Acknowledges successful write to the primary node. Suitable for most scenarios, balancing performance and safety.
● w3: Acknowledges successful write to 3 nodes, increasing data safety.
● majority: Acknowledges successful write to most nodes. High security, but may impact performance if there are many nodes.