Skip to main content

Doris

Applicable EditionsTapData CloudTapData Cloud offers you cloud services that are suitable for scenarios requiring rapid deployment and low initial investment, helping you focus more on business development rather than infrastructure management. Free trial with TapData Cloud.TapData EnterpriseTapData Enterprise can be deployed in your local data center, making it suitable for scenarios with strict requirements on data sensitivity or network isolation. It can serve to build real-time data warehouses, enable real-time data exchange, data migration, and more.TapData CommunityTapData Community is an open-source data integration platform that provides basic data synchronization and transformation capabilities. This helps you quickly explore and implement data integration projects. As your project or business grows, you can seamlessly upgrade to TapData Cloud or TapData Enterprise to access more advanced features and service support.

Apache Doris is an MPP-based real-time data warehouse known for its high query speed. It can be used for report analysis, ad-hoc queries, unified data warehouse, and data lake query acceleration. Tapdata supports using Doris as both a source and a target database to build data pipelines, helping you quickly handle data flows for big data analysis scenarios. In this article, we will introduce how to connect to Doris on the Tapdata platform.

Supported Versions and Architectures​

Doris 1.2 to 3.0 (no restrictions on deployment architecture)

Supported Data Types​

CategoryData Types
StringCHAR, VARCHAR, STRING, TEXT
BooleanBOOLEAN
IntegerTINYINT, SMALLINT, INT, BIGINT, LARGEINT
NumericDECIMAL, DECIMALV3, FLOAT, DOUBLE
Date/TimeDATE, DATEV2, DATETIME, DATETIMEV2
AggregationHLL

SQL Operations for Sync​

  • DML: INSERT, UPDATE, DELETE
  • DDL (supported only as a target): ADD COLUMN, CHANGE COLUMN, DROP COLUMN, RENAME COLUMN
tip
  • When used as a source database, incremental data synchronization needs to be implemented through field polling and does not support DDL operations. For more details, see Change Data Capture (CDC).
  • When used as a target database, you can also configure DML write strategies through the advanced settings of the task node, such as whether to convert insert conflicts to updates.

Considerations​

  • Due to the inherent limitations of Doris, frequent transactional operations (such as frequent updates and deletes) should be avoided when using it as a target database to prevent performance issues during data writes.
  • Currently, Tapdata supports data writing via the Stream Load method. Therefore, for tables created using the Duplicate Key Model or Aggregation Key Model, update and delete operations are not fully supported.
  • When using Doris as a target database for large-scale data ingestion, it is recommended to configure batch sizes between 10,000 to 100,000 records, based on the size of individual records, to improve performance. Avoid setting the batch size too large to prevent OOM issues.
  • To reduce the impact on query performance, it is advisable to perform large-scale data ingestion into Doris during off-peak business hours to avoid consuming excessive database I/O resources.

Preparations​

  1. To create an account, log in to the Doris database and run the following commands.

    CREATE USER 'username'@'host' IDENTIFIED BY 'password';
    • username: Enter user name.
    • password: Enter password.
    • host: Which host can be accessed by the account, percent (%) means to allow all host.

    Example: Create an account named tapdata.

    CREATE USER 'tapdata'@'%' IDENTIFIED BY 'Tap@123456';
  2. Grant permissions to the account we just created, we recommend setting more granular permissions control based on business needs.

-- Replace the catalog_name, database_name, and username follow the tips below
GRANT SELECT_PRIV ON catalog_name.database_name.* TO 'username'@'%';
tip

Please replace the username, password, and host in the command above.

  • catalog_name: The name of the data catalog. The default name is internal. You can view the created data catalog through the SHOW CATALOGS command. For more information, see Multi Catalog.
  • database_name: Enter database name.
  • username: Enter user name.

Connect to Doris​

  1. Log in to Tapdata platform.

  2. In the left navigation bar, click Connections.

  3. On the right side of the page, click Create.

  4. In the pop-up dialog, search for and select Doris.

  5. Fill in the Doris connection information as described below.

    Connect to Doris

    • Connection Settings
      • Name: Enter a meaningful and unique name.
      • Type: Supports using Doris as a source or target database.
      • DB Address: The Doris connection address.
      • Port: The Doris query service port. The default port is 9030.
      • Enable HTTPS: Choose whether to enable the HTTPS connection without certificates.
      • HTTP/HTTPS Address: The HTTP/HTTPS protocol access address for the FE service, including address and port information. The default port is 8030.
      • DB Name: One connection corresponds to one database. If there are multiple databases, you need to create multiple connections.
      • User and Password: Enter the database username and password, respectively.
    • Advanced Settings
      • Doris Catalog: The Doris catalog, which is a higher level than the database. If using the default catalog, this can be left empty. For more details, see Multi-Catalog.
      • Additional Connection Parameters: Any additional connection parameters, default is empty.
      • Timezone: The default timezone is 0 UTC. Changing to a different timezone will affect the synchronization of fields without timezone information (such as DATETIME, DATETIMEV2), while fields like DATE and DATE2 are not affected.
      • Agent Settings: Defaults to Platform automatic allocation, you can also manually specify an agent.
      • Model Load Time: If there are less than 10,000 models in the data source, their schema will be updated every hour. But if the number of models exceeds 10,000, the refresh will take place daily at the time you have specified.
      • Enable Heartbeat Table: When the connection type is Source&Target or Source, you can enable this switch. TapData will create a _tapdata_heartbeat_table heartbeat table in the source database and update it every 10 seconds (requires appropriate permissions) to monitor the health of the data source connection and tasks. The heartbeat task starts automatically after the data replication/development task starts, and you can view the heartbeat task in the data source editing page.
  6. Click Test at the bottom of the page. After passing the test, click Save.

    tip

    If the connection test fails, please follow the prompts on the page to resolve the issue.

Node Advanced Features​

When configuring data synchronization or transformation tasks with Doris as the target node, Tapdata provides advanced features to better meet complex business requirements and maximize performance. You can configure these features based on your business needs:

Doris Node Advanced Settings

ConfigurationDescription
Key TypeChoose the table key type: Unique (default, Unique Key Model), Aggregate (Aggregation Key Model), or Duplicate (Duplicate Key Model). For more information, see Data Models.
When the key type is Duplicate and using append mode for writes, since there are no update conditions, you need to specify the sort fields.
Partition KeyRefers to the DISTRIBUTED BY HASH partition key used when creating the table. If a partition key is manually set, it will be prioritized. If not set, the primary key or update condition fields will be used by default. If neither is specified, all fields will be used (not recommended).
Number of BucketsChoose the number of buckets when creating the table, based on actual business needs. For more information, see Bucket Design Recommendations.
Table PropertiesSupports specifying table properties during table creation based on business requirements.
Write Buffer SizeDefault is 10,240 KB. Increasing the size can improve performance for bulk data inserts, while decreasing it can limit memory usage and prevent OOM.
Write FormatSupports JSON (default) and CSV formats. JSON is more secure and suitable for complex data types but offers slightly lower performance. Choose based on data security and performance requirements.