Create a Databricks Writer application

Prerequisites for creating a Databricks Writer application

Before creating a Databricks target in a Striim application, complete the following tasks:

Choose which writing mode to use

Append Only (default)

In Append Only mode, inserts, updates, and deletes from a Database Reader, Incremental Batch Reader, or SQL CDC source are all handled as inserts in the target. This allows you to use Databricks to query past data that no longer exists in the source database(s), for example, for month-over-month or year-over-year reports.

Primary key updates result in two records in the target, one with the previous value and one with the new value. If the Tables setting has a ColumnMap that includes @METADATA(OperationName), the operation name for the first event will be DELETE and for the second INSERT.

To use this mode, set Mode to APPENDONLY.
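
For example, a minimal sketch of an Append Only target whose ColumnMap writes the operation name to an OPTYPE column. The OPTYPE column name is hypothetical, and the connection values follow the samples later on this page:

-- Sketch only: OPTYPE is a hypothetical target column for @METADATA(OperationName).
CREATE TARGET DatabricksAppend USING DeltaLakeWriter ( 
  personalAccessToken: '*************************', 
  tables: 'mydb.employee,mydatabase.employee ColumnMap(OPTYPE=@METADATA(OperationName))', 
  stageLocation: '/StriimStage/', 
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  mode: 'APPENDONLY'
)
INPUT FROM ns1.sourceStream;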

Merge

In Merge mode, inserts, updates, and deletes from Database Reader, Incremental Batch Reader, and SQL CDC sources are handled as inserts, updates, and deletes in the target. The data in Databricks thus duplicates the data in the source database(s).

In Merge mode:

  • Since Delta Lake tables do not have primary keys, you may include the keycolumns option in the Tables property to specify a column in the target table that will contain a unique identifier for each row: for example, Tables:'SCOTT.EMP,mycatalog.mydatabase.employee keycolumns(emp_num)'. (As of August 2023, primary keys for Databricks in AWS are in public preview; see Documentation > What is Delta Lake? > Constraints on Databricks > Declare primary key and foreign key relationships).

  • You can specify the source tables using wildcards, provided you specify keycolumns for all tables and the target table is specified with its three-part name: for example, Tables:'DEMO.%,mycatalog.mydatabase.% KeyColumns(...)' (see the sketch below).

  • If you do not specify keycolumns, Striim will use the keycolumns specified in the source adapter's Tables property as a unique identifier. If the source has no keycolumns, Striim will concatenate all column values and use that as a unique identifier.

For more information on keycolumns, see Defining relations between source and target using ColumnMap and KeyColumns.
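
For instance, a minimal sketch of a Merge target that maps wildcarded source tables to three-part target names with an explicit key column. It assumes, hypothetically, that every DEMO table has an emp_num column that uniquely identifies each row:

CREATE TARGET DatabricksMergeKeys USING DeltaLakeWriter ( 
  personalAccessToken: '*************************', 
  tables: 'DEMO.%,mycatalog.mydatabase.% KeyColumns(emp_num)', 
  stageLocation: '/StriimStage/', 
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  mode: 'MERGE'
)
INPUT FROM ns1.sourceStream;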

To use this mode:

  • Set Mode to Merge.

  • When Databricks Writer's input stream is the output of an HP NonStop reader, MySQL Reader, or Oracle Reader source and the source events will include partial records, set Optimized Merge to True.

Limitation: in Merge mode, the Parallel Threads property is ignored.

Create the target tables in Databricks; partitioning tables

Databricks Writer writes only to existing tables. If a source table specified in the Tables property does not exist in the target, the application will halt.

When you create your application using a wizard with Auto Schema Conversion, Striim will create the tables for you automatically (see Using Auto Schema Conversion). However, those tables will not be partitioned.

When using Databricks Writer's Merge mode, specifying partition columns when creating the target tables can significantly improve performance by reducing the need for full-table scans (for AWS, see Documentation > Databricks reference documentation > Language-specific introductions to Databricks > SQL language reference > Partitions; for Azure, see Learn > Partitions). You cannot partition an existing table.
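
For example, a minimal Databricks SQL sketch that creates a partitioned target table before the application first runs. The table and column names are hypothetical:

-- Run in Databricks before starting the Striim application.
-- Partitioning by hire_year lets Merge operations prune partitions
-- instead of scanning the full table.
CREATE TABLE mycatalog.mydatabase.employee (
  emp_num   INT,
  name      STRING,
  hire_year INT
)
PARTITIONED BY (hire_year);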

Choose which staging area to use

Writing to Databricks requires a staging area. See the discussion in Databricks Writer initial setup.

To use the Databricks File System as your staging area

As discussed in Databricks Writer initial setup, DBFS should not be used for production or sensitive data.

  • Set External Stage Type to DBFSROOT.

  • Set Stage Location to the path to the staging area in DBFS, for example, /StriimStage/.
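
In TQL, these settings look like the following sketch. The externalStageType property name is an assumption, inferred from the UI label External Stage Type by analogy with the camelCase names in the samples later on this page:

-- Sketch only: externalStageType is an assumed TQL name for External Stage Type.
CREATE TARGET DatabricksDBFSStage USING DeltaLakeWriter ( 
  personalAccessToken: '*************************', 
  tables: 'mydb.employee,mydatabase.employee', 
  externalStageType: 'DBFSROOT',
  stageLocation: '/StriimStage/', 
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;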

To use an Amazon S3 bucket as your staging area

  • Create an S3 bucket and configure it as described in Create an Amazon S3 bucket to use as a staging location for Databricks on AWS.

  • Set External Stage Type to S3.

  • Set the following properties:

    S3 Access Key (String): an AWS access key ID (created on the AWS Security Credentials page) for a user with read and write permissions on the bucket. If the Striim host has default credentials stored in the .aws directory, you may leave this blank.

    S3 Bucket Name (String; default: striim-deltalake-bucket): specify the S3 bucket to be used for staging. If it does not exist, it will be created.

    S3 Region (String; default: us-west-1): the AWS region of the bucket.

    S3 Secret Access Key (encrypted password): the secret access key for the access key. If the Striim host has default credentials stored in the .aws directory, you may leave this blank.
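
In TQL, an S3 staging configuration might look like the following sketch. The camelCase staging property names (externalStageType, s3AccessKey, s3SecretAccessKey, s3BucketName, s3Region) are assumptions inferred from the UI labels above:

-- Sketch only: the staging property names are assumed camelCase forms of the UI labels.
CREATE TARGET DatabricksS3Stage USING DeltaLakeWriter ( 
  personalAccessToken: '*************************', 
  tables: 'mydb.employee,mydatabase.employee', 
  externalStageType: 'S3',
  s3AccessKey: 'AKIA****************',
  s3SecretAccessKey: '*************************',
  s3BucketName: 'striim-deltalake-bucket',
  s3Region: 'us-west-1',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;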

To use a Databricks on AWS personal staging location

Caution

Support for personal staging locations has been deprecated by both Databricks and Microsoft (in each case, see (Optional) Create metastore-level storage).

To use an Azure Data Lake Storage Gen2 bucket as your staging area

  • Create an ADLS Gen2 bucket as described in Set up Azure Data Lake Storage Gen2 as the staging location for Azure Databricks.

  • Set External Stage Type to ADLSGen2.

  • Set the following properties:

    property

    type

    default value

    notes

    Azure Account Access Key (encrypted password): when Authentication Type is set to ServiceAccountKey, specify the account access key from Storage accounts > <account name> > Access keys. When Authentication Type is set to AzureAD, this property is ignored in TQL and not displayed in the Flow Designer.

    Azure Account Name (String): the name of the Azure storage account for the blob container.

    Azure Container Name (String; default: striim-deltalakewriter-container): the blob container name from Storage accounts > <account name> > Containers. If it does not exist, it will be created.
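
In TQL, an ADLS Gen2 staging configuration might look like the following sketch. The camelCase staging property names (externalStageType, azureAccountName, azureAccountAccessKey, azureContainerName) are assumptions inferred from the UI labels above, and mystorageaccount is a placeholder:

-- Sketch only: the staging property names are assumed camelCase forms of the UI labels.
CREATE TARGET DatabricksADLSStage USING DeltaLakeWriter ( 
  personalAccessToken: '*************************', 
  tables: 'mydb.employee,mydatabase.employee', 
  externalStageType: 'ADLSGen2',
  azureAccountName: 'mystorageaccount',
  azureAccountAccessKey: '*************************',
  azureContainerName: 'striim-deltalakewriter-container',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;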

Create a Databricks Writer application using a wizard

Caution

Support for personal staging locations has been deprecated by both Databricks and Microsoft (in each case, see (Optional) Create metastore-level storage).

When you create a Databricks Writer target using a wizard (see Creating apps using wizards), you must specify two properties: Connection URL and Personal Access Token. The Tables property value will be set based on your selections in the wizard.

Databricks does not use schemas, so its tables have only two-part names (<database>.<table>). When the source database has three-part names (<database>.<schema>.<table>), the tables will be mapped as <source_database>.<source_schema>.<table>,<target_database>.<table>, for example, mydb.myschema.%,mydb.%. Each schema in the source will be mapped to a separate database in the target. If the databases do not exist in the target, Striim will create them.

Create a Databricks Writer application using TQL

Sample TQL in AppendOnly mode:

CREATE TARGET DatabricksAppendOnly USING DeltaLakeWriter ( 
  personalAccessToken: '*************************', 
  tables: 'mydb.employee,mydatabase.employee', 
  stageLocation: '/StriimStage/', 
  connectionUrl:'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;'
)
INPUT FROM ns1.sourceStream;

Sample TQL in Merge mode with Optimized Merge set to True:

CREATE TARGET DatabricksMerge USING DeltaLakeWriter ( 
  personalAccessToken: '*************************', 
  tables: 'mydb.employee,mydatabase.employee', 
  stageLocation: '/StriimStage/', 
  connectionUrl:'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  mode: 'MERGE',
  optimizedMerge: 'true'
)
INPUT FROM ns1.sourceStream;