Create a Databricks Writer application
Prerequisites for creating a Databricks Writer application
Before creating a Databricks target in a Striim application:
Perform the tasks described in Initial setup for Databricks on AWS or Initial setup for Azure Databricks.
Perform any setup tasks required for the application's sources (for example, for Oracle CDC, the tasks described in Configuring Oracle to use Oracle Reader or Configuring Oracle and Striim to use OJet).
Know the property values required for the mode of authentication you plan to use:
If you plan to use a personal access token for authentication, know the personal access token you configured as described in Initial setup for Databricks on AWS or Initial setup for Azure Databricks.
If you plan to use Microsoft Entra (formerly Azure Active Directory) for authentication, know its Client ID, Client Secret, Refresh Token, and Tenant ID.
If you plan to use OAuth for authentication, see Introducing connection profiles.
Choose which writing mode to use: Append Only or Merge. Note that if you choose Merge mode, the Parallel Threads property will be ignored (see Creating multiple writer instances (parallel threads)).
If you plan to use Merge mode, we strongly recommend partitioning your target tables, as this will significantly improve Databricks Writer's performance (for AWS see Documentation > Databricks reference documentation > Language-specific introductions to Databricks > SQL language reference > Partitions; for Azure see Learn / Partitions).
If you plan to Create a Databricks Writer application using TQL and do not plan to use schema evolution (see Handling schema evolution), create the target tables in Databricks. This is not necessary if you use schema evolution or will Create a Databricks Writer application using a wizard with Auto Schema Conversion, as in those cases Striim will create the tables automatically. In this release, Auto Schema Conversion is not supported when you are using Databricks' Unity Catalog.
Create an appropriate staging area (see Initial setup for Azure Databricks or Initial setup for Databricks on AWS).
Choose which writing mode to use
Append Only (default)
In Append Only mode, inserts, updates, and deletes from a Database Reader, Incremental Batch Reader, or SQL CDC source are all handled as inserts in the target. This allows you to use Databricks to query past data that no longer exists in the source database(s), for example, for month-over-month or year-over-year reports.
Primary key updates result in two records in the target, one with the previous value and one with the new value. If the Tables setting has a ColumnMap that includes @METADATA(OperationName), the operation name for the first event will be DELETE and for the second INSERT.
To use this mode, set Mode to APPENDONLY.
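For example, a minimal Append Only target that captures the operation name in an extra target column might look like the following sketch. The target name, the op_name column, and the mycatalog.mydatabase.emp_history table are hypothetical, and the connection values are placeholders in the same style as the samples at the end of this page:

CREATE TARGET DatabricksAppend USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'SCOTT.EMP,mycatalog.mydatabase.emp_history ColumnMap(op_name=@METADATA(OperationName))',
  stageLocation: '/StriimStage/',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  mode: 'APPENDONLY'
)
INPUT FROM ns1.sourceStream;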
Merge
In Merge mode, inserts, updates, and deletes from Database Reader, Incremental Batch Reader, and SQL CDC sources are handled as inserts, updates, and deletes in the target. The data in Databricks thus duplicates the data in the source database(s).
In Merge mode:
Since Delta Lake tables do not have primary keys, you may include the keycolumns option in the Tables property to specify a column in the target table that will contain a unique identifier for each row: for example, Tables:'SCOTT.EMP,mycatalog.mydatabase.employee keycolumns(emp_num)'. (As of August 2023, primary keys for Databricks on AWS are in public preview; see Documentation > What is Delta Lake? > Constraints on Databricks > Declare primary key and foreign key relationships.)
You can specify the source tables using wildcards provided you specify keycolumns for all tables and the target table is specified with its three-part name: for example, Tables:'DEMO.%,mycatalog.mydatabase.% KeyColumns(...)'.
If you do not specify keycolumns, Striim will use the keycolumns specified in the source adapter's Tables property as a unique identifier. If the source has no keycolumns, Striim will concatenate all column values and use that as a unique identifier.
For more information on keycolumns, see Defining relations between source and target using ColumnMap and KeyColumns.
To use this mode:
Set Mode to MERGE.
When Databricks Writer's input stream is the output of an HP NonStop reader, MySQL Reader, or Oracle Reader source and the source events will include partial records, set Optimized Merge to True.
Limitation: in Merge mode, the Parallel Threads property is ignored.
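Putting these settings together, a sketch of a Merge-mode target using wildcards with keycolumns might look like this (the target name and the emp_num key column are hypothetical; connection values are placeholders):

CREATE TARGET DatabricksMergeWildcard USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'DEMO.%,mycatalog.mydatabase.% KeyColumns(emp_num)',
  stageLocation: '/StriimStage/',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  mode: 'MERGE',
  optimizedMerge: 'true'
)
INPUT FROM ns1.sourceStream;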
Create the target tables in Databricks; partitioning tables
Databricks Writer writes only to existing tables. If a source table specified in the Tables property does not exist in the target, the application will halt.
When you create your application using a wizard with Auto Schema Conversion, Striim will create the tables for you automatically (see Using Auto Schema Conversion). However, those tables will not be partitioned.
When using Databricks Writer's Merge mode, specifying partition columns when creating the target tables can significantly improve performance by reducing the need for full-table scans (for AWS see Documentation > Databricks reference documentation > Language-specific introductions to Databricks > SQL language reference > Partitions; for Azure see Learn / Partitions). You cannot partition an existing table.
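For example, a target table partitioned on a date column could be created with Databricks SQL along these lines (the catalog, schema, table, and column names are illustrative, not taken from this page):

CREATE TABLE mycatalog.mydatabase.employee (
  emp_num INT,
  name STRING,
  hire_date DATE
)
PARTITIONED BY (hire_date);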
Choose which staging area to use
Writing to Databricks requires a staging area. See the discussion in Databricks Writer initial setup.
To use the Databricks File System as your staging area
As discussed in Databricks Writer initial setup, DBFS should not be used for production or sensitive data.
Set External Stage Type to DBFSROOT.
Set Stage Location to the path to the staging area in DBFS, for example, /StriimStage/.
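In TQL, these settings might look like the following sketch. This assumes externalStageType is the script-level name of the External Stage Type property, which is not confirmed on this page; stageLocation appears in the samples at the end of this page:

CREATE TARGET DatabricksDBFSStage USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'mydb.employee,mydatabase.employee',
  externalStageType: 'DBFSROOT',
  stageLocation: '/StriimStage/',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;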
To use an Amazon S3 bucket as your staging area
Create an S3 bucket and configure it as described in Create an Amazon S3 bucket to use as a staging location for Databricks on AWS.
Set External Stage Type to S3.
Set the following properties:

S3 Access Key (String): an AWS access key ID (created on the AWS Security Credentials page) for a user with read and write permissions on the bucket. If the Striim host has default credentials stored in the .aws directory, you may leave this blank.

S3 Bucket Name (String, default striim-deltalake-bucket): the S3 bucket to be used for staging. If it does not exist, it will be created.

S3 Region (String, default us-west-1): the AWS region of the bucket.

S3 Secret Access Key (encrypted password): the secret access key for the access key. If the Striim host has default credentials stored in the .aws directory, you may leave this blank.
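As a TQL sketch, assuming the script-level property names mirror the labels above (s3AccessKey, s3BucketName, s3Region, s3SecretAccessKey are assumptions, not confirmed on this page; credential values are placeholders):

CREATE TARGET DatabricksS3Stage USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'mydb.employee,mydatabase.employee',
  externalStageType: 'S3',
  s3AccessKey: 'AKIA................',
  s3SecretAccessKey: '*************************',
  s3BucketName: 'striim-deltalake-bucket',
  s3Region: 'us-west-1',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;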
To use a Databricks on AWS personal staging location
Caution
Support for personal staging locations has been deprecated by Databricks (see (Optional) Create metastore-level storage) and Microsoft (see (Optional) Create metastore-level storage).
To use an Azure Data Lake Storage Gen2 bucket as your staging area
Create an ADLS Gen2 bucket as described in Set up Azure Data Lake Storage Gen2 as the staging location for Azure Databricks.
Set External Stage Type to ADLSGen2.
Set the following properties:

Azure Account Access Key (encrypted password): when Authentication Type is set to ServiceAccountKey, specify the account access key from Storage accounts > <account name> > Access keys. When Authentication Type is set to AzureAD, this property is ignored in TQL and not displayed in the Flow Designer.

Azure Account Name (String): the name of the Azure storage account for the blob container.

Azure Container Name (String, default striim-deltalakewriter-container): the blob container name from Storage accounts > <account name> > Containers. If it does not exist, it will be created.
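As a TQL sketch, again assuming the script-level property names mirror the labels above (azureAccountName, azureAccountAccessKey, azureContainerName are assumptions; the account name and key are placeholders):

CREATE TARGET DatabricksADLSStage USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'mydb.employee,mydatabase.employee',
  externalStageType: 'ADLSGen2',
  azureAccountName: 'mystorageaccount',
  azureAccountAccessKey: '*************************',
  azureContainerName: 'striim-deltalakewriter-container',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;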
Create a Databricks Writer application using a wizard
Caution
Support for personal staging locations has been deprecated by Databricks (see (Optional) Create metastore-level storage) and Microsoft (see (Optional) Create metastore-level storage).
When you create a Databricks Writer target using a wizard (see Creating apps using wizards), you must specify two properties: Connection URL and Personal Access Token. The Tables property value will be set based on your selections in the wizard.
Databricks does not use schemas, so its tables have only two-part names (<database>.<table>). When the source database has three-part names (<database>.<schema>.<table>), the tables will be mapped as <source_database>.<source_schema>.<table>,<target_database>.<table>, for example, mydb.myschema.%,mydb.%. Each database in the source will be mapped to a separate database in the target. If the databases do not exist in the target, Striim will create them.
Create a Databricks Writer application using TQL
Sample TQL in AppendOnly mode:
CREATE TARGET DatabricksAppendOnly USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'mydb.employee,mydatabase.employee',
  stageLocation: '/StriimStage/',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;
Sample TQL in Merge mode with Optimized Merge set to True:
CREATE TARGET DatabricksMerge USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'mydb.employee,mydatabase.employee',
  stageLocation: '/StriimStage/',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  mode: 'MERGE',
  optimizedMerge: 'true'
)
INPUT FROM ns1.sourceStream;