Create a Databricks Writer application
Prerequisites for creating a Databricks Writer application
Before creating a Databricks target in a Striim application:
Perform the tasks described in Initial setup for Databricks on AWS or Initial setup for Azure Databricks.
Perform any setup tasks required for the application's sources (for example, for Oracle CDC, the tasks described in Configuring Oracle to use Oracle Reader or Configuring Oracle and Striim to use OJet).
Know the property values required for the mode of authentication you plan to use:
If you plan to use a personal access token for authentication, know the personal access token you configured as described in Initial setup for Databricks on AWS or Initial setup for Azure Databricks.
If you plan to use Microsoft Entra (formerly Azure Active Directory) for authentication, know its Client ID, Client Secret, Refresh Token, and Tenant ID.
If you plan to use OAuth for authentication, see Introducing connection profiles.
Choose which writing mode to use: Append Only or Merge. Note that if you choose Merge mode, the Parallel Threads property will be ignored (see Creating multiple writer instances (parallel threads)).
If you plan to use Merge mode, we strongly recommend partitioning your target tables, as this will significantly improve Databricks Writer's performance (for AWS see Documentation > Databricks reference documentation > Language-specific introductions to Databricks > SQL language reference > Partitions; for Azure see Learn / Partitions).
If you plan to Create a Databricks Writer application using TQL and do not plan to use schema evolution (see Handling schema evolution), create the target tables in Databricks. This is not necessary if you use schema evolution or will Create a Databricks Writer application using a wizard with Auto Schema Conversion, as in those cases Striim will create the tables automatically. In this release, Auto Schema Conversion is not supported when you are using Databricks' Unity Catalog.
Create an appropriate staging area (see Initial setup for Azure Databricks or Initial setup for Databricks on AWS).
Choose which writing mode to use
Append Only (default)
In Append Only mode, inserts, updates, and deletes from a Database Reader, Incremental Batch Reader, or SQL CDC source are all handled as inserts in the target. This allows you to use Databricks to query past data that no longer exists in the source database(s), for example, for month-over-month or year-over-year reports.
Primary key updates result in two records in the target, one with the previous value and one with the new value. If the Tables setting has a ColumnMap that includes @METADATA(OperationName), the operation name for the first event will be DELETE and for the second INSERT.
To use this mode, set Mode to APPENDONLY.
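For example, a minimal Append Only target that captures the operation name in an extra target column might look like the following sketch. The target name, the op_name column, and the mycatalog.mydatabase.emp_history table are hypothetical, and the connection values are placeholders in the same style as the samples at the end of this page:

CREATE TARGET DatabricksAppend USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'SCOTT.EMP,mycatalog.mydatabase.emp_history ColumnMap(op_name=@METADATA(OperationName))',
  stageLocation: '/StriimStage/',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  mode: 'APPENDONLY'
)
INPUT FROM ns1.sourceStream;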
Merge
In Merge mode, inserts, updates, and deletes from Database Reader, Incremental Batch Reader, and SQL CDC sources are handled as inserts, updates, and deletes in the target. The data in Databricks thus duplicates the data in the source database(s).
In Merge mode:
Since Delta Lake tables do not have primary keys, you may include the keycolumns option in the Tables property to specify a column in the target table that will contain a unique identifier for each row: for example, Tables:'SCOTT.EMP,mycatalog.mydatabase.employee keycolumns(emp_num)'. (As of August 2023, primary keys for Databricks on AWS are in public preview; see Documentation > What is Delta Lake? > Constraints on Databricks > Declare primary key and foreign key relationships.)
You can specify the source tables using wildcards provided you specify keycolumns for all tables and the target table is specified with its three-part name: for example, Tables:'DEMO.%,mycatalog.mydatabase.% KeyColumns(...)'.
If you do not specify keycolumns, Striim will use the keycolumns specified in the source adapter's Tables property as a unique identifier. If the source has no keycolumns, Striim will concatenate all column values and use that as a unique identifier.
For more information on keycolumns, see Defining relations between source and target using ColumnMap and KeyColumns.
To use this mode:
Set Mode to MERGE.
When Databricks Writer's input stream is the output of an HP NonStop reader, MySQL Reader, or Oracle Reader source and the source events will include partial records, set Optimized Merge to True.
Limitation: in Merge mode, the Parallel Threads property is ignored.
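Putting these settings together, a sketch of a Merge-mode target using wildcards with keycolumns might look like this (the target name and the emp_num key column are hypothetical; connection values are placeholders):

CREATE TARGET DatabricksMergeWildcard USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'DEMO.%,mycatalog.mydatabase.% KeyColumns(emp_num)',
  stageLocation: '/StriimStage/',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  mode: 'MERGE',
  optimizedMerge: 'true'
)
INPUT FROM ns1.sourceStream;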
Create the target tables in Databricks; partitioning tables
Databricks Writer writes only to existing tables. If a source table specified in the Tables property does not exist in the target, the application will halt.
When you create your application using a wizard with Auto Schema Conversion, Striim will create the tables for you automatically (see Using Auto Schema Conversion). However, those tables will not be partitioned.
When using Databricks Writer's Merge mode, specifying partition columns when creating the target tables can significantly improve performance by reducing the need for full-table scans (for AWS see Documentation > Databricks reference documentation > Language-specific introductions to Databricks > SQL language reference > Partitions; for Azure see Learn / Partitions). You cannot partition an existing table.
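For example, a target table partitioned on a date column could be created with Databricks SQL along these lines (the catalog, schema, table, and column names are illustrative, not taken from this page):

CREATE TABLE mycatalog.mydatabase.employee (
  emp_num INT,
  name STRING,
  hire_date DATE
)
PARTITIONED BY (hire_date);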
Choose which staging area to use
Writing to Databricks requires a staging area. See the discussion in Databricks Writer initial setup.
To use the Databricks File System as your staging area
As discussed in Databricks Writer initial setup, DBFS should not be used for production or sensitive data.
Set External Stage Type to DBFSROOT.
Set Stage Location to the path to the staging area in DBFS, for example, /StriimStage/.
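In TQL, these settings might look like the following sketch. This assumes externalStageType is the script-level name of the External Stage Type property, which is not confirmed on this page; stageLocation appears in the samples at the end of this page:

CREATE TARGET DatabricksDBFSStage USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'mydb.employee,mydatabase.employee',
  externalStageType: 'DBFSROOT',
  stageLocation: '/StriimStage/',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;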
To use an Amazon S3 bucket as your staging area
Create an S3 bucket and configure it as described in Create an Amazon S3 bucket to use as a staging location for Databricks on AWS.
Set External Stage Type to S3.
Set the following properties:

S3 Access Key (String): an AWS access key ID (created on the AWS Security Credentials page) for a user with read and write permissions on the bucket. If the Striim host has default credentials stored in the .aws directory, you may leave this blank.

S3 Bucket Name (String, default striim-deltalake-bucket): the S3 bucket to be used for staging. If it does not exist, it will be created.

S3 Region (String, default us-west-1): the AWS region of the bucket.

S3 Secret Access Key (encrypted password): the secret access key for the access key. If the Striim host has default credentials stored in the .aws directory, you may leave this blank.
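As a TQL sketch, assuming the script-level property names mirror the labels above (s3AccessKey, s3BucketName, s3Region, s3SecretAccessKey are assumptions, not confirmed on this page; credential values are placeholders):

CREATE TARGET DatabricksS3Stage USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'mydb.employee,mydatabase.employee',
  externalStageType: 'S3',
  s3AccessKey: 'AKIA................',
  s3SecretAccessKey: '*************************',
  s3BucketName: 'striim-deltalake-bucket',
  s3Region: 'us-west-1',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;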
To use a Databricks on AWS personal staging location
Caution
Support for personal staging locations has been deprecated by Databricks (see (Optional) Create metastore-level storage) and Microsoft (see (Optional) Create metastore-level storage).
To use an Azure Data Lake Storage Gen2 bucket as your staging area
Create an ADLS Gen2 bucket as described in Set up Azure Data Lake Storage Gen2 as the staging location for Azure Databricks.
Set External Stage Type to ADLSGen2.
Set the following properties:

Azure Account Access Key (encrypted password): when Authentication Type is set to ServiceAccountKey, specify the account access key from Storage accounts > <account name> > Access keys. When Authentication Type is set to AzureAD, this property is ignored in TQL and not displayed in the Flow Designer.

Azure Account Name (String): the name of the Azure storage account for the blob container.

Azure Container Name (String, default striim-deltalakewriter-container): the blob container name from Storage accounts > <account name> > Containers. If it does not exist, it will be created.
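As a TQL sketch, again assuming the script-level property names mirror the labels above (azureAccountName, azureAccountAccessKey, azureContainerName are assumptions; the account name and key are placeholders):

CREATE TARGET DatabricksADLSStage USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'mydb.employee,mydatabase.employee',
  externalStageType: 'ADLSGen2',
  azureAccountName: 'mystorageaccount',
  azureAccountAccessKey: '*************************',
  azureContainerName: 'striim-deltalakewriter-container',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;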
Create a Databricks Writer application using a wizard
Caution
Support for personal staging locations has been deprecated by Databricks (see (Optional) Create metastore-level storage) and Microsoft (see (Optional) Create metastore-level storage).
When you create a Databricks Writer target using a wizard (see Creating apps using wizards), you must specify two properties: Connection URL and Personal Access Token. The Tables property value will be set based on your selections in the wizard.
Databricks does not use schemas, so its tables have only two-part names (<database>.<table>). When the source database has three-part names (<database>.<schema>.<table>), the tables will be mapped as <source_database>.<source_schema>.<table>,<target_database>.<table>, for example, mydb.myschema.%,mydb.%. Each database in the source will be mapped to a separate database in the target. If the databases do not exist in the target, Striim will create them.
Create a Databricks Writer application using TQL
Sample TQL in AppendOnly mode:
CREATE TARGET DatabricksAppendOnly USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'mydb.employee,mydatabase.employee',
  stageLocation: '/StriimStage/',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;
Sample TQL in Merge mode with Optimized Merge set to True:
CREATE TARGET DatabricksMerge USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  tables: 'mydb.employee,mydatabase.employee',
  stageLocation: '/StriimStage/',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  mode: 'MERGE',
  optimizedMerge: 'true'
)
INPUT FROM ns1.sourceStream;