Databricks Writer
Databricks Writer writes to Delta Lake tables in Databricks on AWS or Azure. For more information, see:
Databricks on AWS and the Databricks documentation for Amazon Web Services on databricks.com, and Databricks on AWS on aws.amazon.com
Azure Databricks on databricks.com, and Azure Databricks and the Azure Databricks documentation on microsoft.com
The required JDBC driver is bundled with Striim.
Delta Lake is an open-source tabular storage format. It includes a transaction log that supports features such as ACID transactions and optimistic concurrency control typically associated with relational databases. For more information, see What is Delta Lake? for AWS or What is Delta Lake? for Azure.
Limitations
Writing to Databricks requires a staging area. The native Databricks File System (DBFS) has a 2 GB cap on storage, which can cause file corruption. To work around that limitation, we strongly recommend using an external stage instead: Azure Data Lake Storage (ADLS) Gen2 for Azure Databricks or Amazon S3 for Databricks on AWS. To use an external stage, your Databricks instance must use Databricks Runtime 10.4 or later.
If you will use MERGE mode, we strongly recommend partitioning your target tables, as this significantly improves performance (see Partitions | Databricks on AWS or Learn / Azure / Azure Databricks / Partitions).
Data is written in batch mode. Streaming mode is not supported in this release because it is not supported by Databricks Connect (see Databricks Connect - Limitations).
Known issue DEV-29579: in this release, Databricks Writer cannot be used when Striim is running in Microsoft Windows.
Creating a Databricks target using a template
Note
In this release, Auto Schema Creation is not supported when you are using Databricks' Unity Catalog.
When you create a Databricks Writer target using a wizard template (see Creating apps using templates), you must specify three properties: Connection URL, Hostname, and Personal Access Token. The Tables property value will be set based on your selections in the wizard.
Databricks does not have schemas. When the source database uses schemas, the tables will be mapped as <source_database>.<source_schema>.<table>,<target_database>.<table>, for example, mydb.myschema.%,mydb.%. Each schema in the source will be mapped to a database in the target. If the databases do not exist in the target, Striim will create them.
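For example, a target created from the wizard for this mapping corresponds to TQL along the following lines. This is a minimal sketch: the target name, connection values, and token are placeholders, and only the Tables value illustrates the schema-to-database mapping.

CREATE TARGET MyDatabricksTarget USING DeltaLakeWriter (
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  hostname: 'adb-xxxx.xx.azuredatabricks.net',
  personalAccessToken: '*************************',
  stageLocation: '/StriimStage/',
  tables: 'mydb.myschema.%,mydb.%'
)
INPUT FROM ns1.sourceStream;

Here every table in the source schema mydb.myschema is written to a table of the same name in the target database mydb, which Striim creates if it does not exist.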
Databricks Writer properties
When creating a Databricks Writer target in TQL, you must specify values for the Connection URL, Hostname, Personal Access Token, and Tables properties. If not specified, the other properties will use their default values.
property | type | default value | notes |
---|---|---|---|
CDDL Action | enum | Process | |
Connection Retry Policy | String | initialRetryDelay=10s, retryDelayMultiplier=2, maxRetryDelay=1m, maxAttempts=5, totalTimeout=10m | Do not change unless instructed to by Striim support. |
Connection URL | String | | Provide the JDBC URL from the JDBC/ODBC tab of the Databricks cluster's Advanced options (see Get connection details for a cluster). If the URL starts with |
External Stage Type | enum | DBFSROOT | With the default value DBFSROOT (not recommended), events are staged to DBFS storage at the path specified in Stage Location. To use an external stage, your Databricks instance should be using Databricks Runtime 11.0 or later. If running Databricks on AWS, set to S3. If running Azure Databricks, set to ADLSGen2. |
Hostname | String | | The Server Hostname from the JDBC/ODBC tab of the Databricks cluster's Advanced options (see Get connection details for a cluster). |
Ignorable Exception Code | String | | Set to TABLE_NOT_FOUND to prevent the application from terminating when Striim tries to write to a table that does not exist in the target. See Handling "table not found" errors for more information. Ignored exceptions will be written to the application's exception store (see CREATE EXCEPTIONSTORE). |
Mode | enum | AppendOnly | With the default value AppendOnly, updates and deletes from the source are handled as inserts in the target. Set to Merge to handle updates and deletes as updates and deletes instead. |
Optimized Merge | Boolean | false | In Flow Designer, this property is displayed only when Mode is Merge. Set to True only when Mode is Merge, the target's input stream is the output of an HP NonStop reader, MySQL Reader, or Oracle Reader source, and the source events will include partial records. For example, with Oracle Reader, when supplemental logging has not been enabled for all columns, partial records are sent for updates. When the source events will always include full records, leave this set to false. |
Parallel Threads | Integer | | Not supported when Mode is Merge. |
Personal Access Token | encrypted password | | Used to authenticate with the Databricks cluster (see Generate a personal access token). The user associated with the token must have read and write access to DBFS (see Important information about DBFS permissions). If table access control has been enabled, the user must also have MODIFY and READ_METADATA (see Data object privileges - Data governance model). |
Stage Location | String | | When the External Stage Type is DBFSROOT, the path to the staging area in DBFS, for example, /StriimStage/. |
Tables | String | | The name(s) of the table(s) to write to. The table(s) must exist in the database. If not using Databricks' Unity Catalog, specify target table names in uppercase. When the target's input stream is a user-defined event, specify a single table. The only special character allowed in target table names is underscore (_). When the input stream of the writer is the output of a DatabaseReader, IncrementalBatchReader, or SQL CDC source, it can write to multiple tables. In this case, specify the names of both the source and target tables. You may use the % wildcard for the table names, but not for the schema or database. If the reader uses three-part names, you must use them here as well. SQL Server source table names must be specified in three parts when the source is Database Reader or Incremental Batch Reader (for example, source_database.source_schema.%). Examples: source.emp,target_database.emp; source_schema.%,target_catalog.target_database.%; source_database.source_schema.%,target_database.%; source_database.source_schema.%,target_catalog.target_database.%. MySQL and Oracle names are case-sensitive, SQL Server names are not. See Mapping columns for additional options. |
Upload Policy | String | eventcount:10000, interval:60s | The upload policy may include eventcount and/or interval (see Setting output names and rollover / upload policies for syntax). Buffered data is written to the storage account every time any of the specified values is exceeded. With the default value, data will be written every 60 seconds or sooner if the buffer contains 10,000 events. When the app is quiesced, any data remaining in the buffer is written to the storage account; when the app is undeployed, any data remaining in the buffer is discarded. |
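As a sketch of how several of these properties combine in a single target, the following assumes that the Upload Policy and Ignorable Exception Code properties are specified in TQL as uploadpolicy and ignorableexceptioncode, matching the display names above; the connection values and table names are placeholders.

CREATE TARGET DatabricksMergeSketch USING DeltaLakeWriter (
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  hostname: 'adb-xxxx.xx.azuredatabricks.net',
  personalAccessToken: '*************************',
  stageLocation: '/StriimStage/',
  tables: 'mydb.myschema.%,mydb.%',
  mode: 'MERGE',
  uploadpolicy: 'eventcount:5000,interval:30s',
  ignorableexceptioncode: 'TABLE_NOT_FOUND'
)
INPUT FROM ns1.sourceStream;

With this upload policy, buffered data is written every 30 seconds or as soon as 5,000 events accumulate, and writes to missing target tables are recorded in the exception store instead of terminating the application.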
Azure Data Lake Storage (ADLS) Gen2 properties for Databricks Writer
To use ADLS Gen2, your Databricks instance should be using Databricks Runtime 11.0 or later.
property | type | default value | notes |
---|---|---|---|
Azure Account Access Key | encrypted password | | The account access key from Storage accounts > <account name> > Access keys. |
Azure Account Name | String | | The name of the Azure storage account for the blob container. |
Azure Container Name | String | striim-deltalakewriter-container | The blob container name from Storage accounts > <account name> > Containers. If it does not exist, it will be created. |
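A target that stages through ADLS Gen2 might look like the following sketch. The externalstagetype, azureaccountname, azureaccountaccesskey, and azurecontainername TQL property names are assumed to correspond to the display names above, and all values are placeholders.

CREATE TARGET DatabricksADLSStageSketch USING DeltaLakeWriter (
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  hostname: 'adb-xxxx.xx.azuredatabricks.net',
  personalAccessToken: '*************************',
  tables: 'mydb.myschema.%,mydb.%',
  externalstagetype: 'ADLSGen2',
  azureaccountname: 'mystorageaccount',
  azureaccountaccesskey: '*************************',
  azurecontainername: 'striim-deltalakewriter-container'
)
INPUT FROM ns1.sourceStream;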
Amazon S3 properties for Databricks Writer
To use Amazon S3, your Databricks instance should be using Databricks Runtime 11.0 or later.
property | type | default value | notes |
---|---|---|---|
S3 Access Key | String | | An AWS access key ID (created on the AWS Security Credentials page) for a user with read and write permissions on the bucket. |
S3 Bucket Name | String | striim-deltalake-bucket | Specify the S3 bucket to be used for staging. If it does not exist, it will be created. |
S3 Region | String | us-west-1 | the AWS region of the bucket |
S3 Secret Access Key | encrypted password | | The secret access key for the access key. |
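Similarly, a target that stages through Amazon S3 might look like this sketch, assuming the externalstagetype, s3region, s3bucketname, s3accesskey, and s3secretaccesskey TQL property names correspond to the display names above (all values are placeholders):

CREATE TARGET DatabricksS3StageSketch USING DeltaLakeWriter (
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  hostname: 'dbc-xxxx.cloud.databricks.com',
  personalAccessToken: '*************************',
  tables: 'mydb.myschema.%,mydb.%',
  externalstagetype: 'S3',
  s3region: 'us-west-1',
  s3bucketname: 'striim-deltalake-bucket',
  s3accesskey: 'AKIAXXXXXXXXXXXXXXXX',
  s3secretaccesskey: '*************************'
)
INPUT FROM ns1.sourceStream;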
Sample TQL application using Databricks Writer
Sample TQL in AppendOnly mode:
CREATE TARGET DatabricksAppendOnly USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  hostname: 'adb-xxxx.xx.azuredatabricks.net',
  tables: 'mydb.employee,mydatabase.employee',
  stageLocation: '/StriimStage/',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;
Sample TQL in Merge mode with Optimized Merge set to True:
CREATE TARGET DatabricksMerge USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  hostname: 'adb-xxxx.xx.azuredatabricks.net',
  tables: 'mydb.employee,mydatabase.employee',
  stageLocation: '/StriimStage/',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  mode: 'MERGE',
  optimizedMerge: 'true'
)
INPUT FROM ns1.sourceStream;
Databricks Writer data type support and mapping
TQL type | Delta Lake type |
---|---|
java.lang.Byte | binary |
java.lang.Double | double |
java.lang.Float | float |
java.lang.Integer | int |
java.lang.Long | bigint |
java.lang.Short | smallint |
java.lang.String | string |
org.joda.time.DateTime | timestamp |
For additional data type mappings, see Data type support & mapping for schema conversion & evolution.
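For illustration, a hypothetical user-defined Striim type using these TQL types:

CREATE TYPE OrderEvent (
  orderId java.lang.Integer KEY,
  customerName java.lang.String,
  amount java.lang.Double,
  quantity java.lang.Short,
  orderTime org.joda.time.DateTime
);

Per the mapping above, a Delta Lake table for this type would use int for orderId, string for customerName, double for amount, smallint for quantity, and timestamp for orderTime.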