Striim for Databricks Documentation

Connect to Databricks

If you have already created one or more pipelines, you can select an existing Databricks connection to write to the same Databricks database. When you create your first pipeline, or if you want to use a different service account or connect to a different Databricks database, do the following:

  1. In the JDBC URL field, enter the JDBC URL for your cluster. Replace <personal access token> with the token for the Azure Databricks user Striim will use to connect to the target (see Configure Azure Databricks).
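
     The exact URL is shown on your cluster's JDBC/ODBC tab in Azure Databricks. As a representative sketch only (the hostname, workspace ID, and HTTP path below are placeholders, and the exact form depends on your JDBC driver), a cluster URL with the token placeholder looks like this:

     ```
     jdbc:spark://adb-1234567890123456.7.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/1234567890123456/0123-456789-abc123;AuthMech=3;UID=token;PWD=<personal access token>
     ```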

  2. In the Personal Access Token field, enter the token for the Azure Databricks user Striim will use to connect to the target (see Configure Azure Databricks).

  3. If you are using Databricks' Unity Catalog, specify the catalog name.
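
     For example (the names here are illustrative), if tables in your workspace are addressed as main.sales.orders, the catalog name to enter is main.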

  4. Specify a name for this Databricks connection.

  5. Select how you want to write to Databricks:

    • Write continuous changes as audit records (default; also known as APPEND ONLY mode): Databricks retains a record of every operation in the source. For example, if you insert a row, then update it, then delete it, Databricks will have three records, one for each operation in the source (INSERT, UPDATE, and DELETE). This is appropriate when you want to be able to see the state of the data at various points in the past, for example, to compare activity for the current month with activity for the same month last year.

      With this setting, Striim adds two additional columns to each table: STRIIM_OPTIME, a timestamp for the operation, and STRIIM_OPTYPE, the event type (INSERT, UPDATE, or DELETE). Note: during the initial sync from SQL Server, all STRIIM_OPTYPE values are SELECT.

    • Write continuous changes directly (also known as MERGE mode): Databricks tables are synchronized with the source tables. For example, if you insert a row, then update it, Databricks will have only the updated data. If you then delete the row from the source table, Databricks will no longer have any record of that row.
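
     To illustrate the difference with hypothetical values, suppose the source receives an INSERT of a row (id 1, name a), then an UPDATE setting name to b, then a DELETE. In APPEND ONLY mode, the Databricks table would end up with three audit records:

     id  name  STRIIM_OPTIME        STRIIM_OPTYPE
     1   a     2024-01-01 10:00:00  INSERT
     1   b     2024-01-01 10:05:00  UPDATE
     1   b     2024-01-01 10:10:00  DELETE

     In MERGE mode, the table would contain only the single row (1, b) after the update, and no record of the row after the delete.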

  6. Choose where you want to stage your data.

    • Databricks File System (default; not recommended): Events are staged to the native Databricks File System (DBFS), which has a 2 GB cap on storage that can cause file corruption. To work around that limitation, we strongly recommend using Azure Data Lake Storage instead.

    • Azure Data Lake Storage: Select this option to use Azure Data Lake Storage (ADLS) Gen2. Specify the following properties:

      • Azure Account Access Key: the account access key from Storage accounts > <account name> > Access keys

      • Azure Account Name: the name of the Azure storage account for the blob container

      • Azure Container Name (optional): the blob container name from Storage accounts > <account name> > Containers. If it does not exist, it will be created. If you leave this blank, the container name will be striim-deltalakewriter-container.
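
     If you want to sanity-check these values before configuring the connection, the following minimal Python sketch (an assumption on our part, using the azure-storage-blob package; the account name, key, and container name are placeholders) verifies that the account name and access key work together and pre-creates the staging container, mirroring Striim's create-if-missing behavior:

     ```python
     # Minimal sketch, assuming the azure-storage-blob package is installed.
     # The account name, key, and container name are hypothetical placeholders.
     from azure.storage.blob import BlobServiceClient

     account_name = "mystorageaccount"                     # Azure Account Name
     account_key = "<account access key>"                  # Azure Account Access Key
     container_name = "striim-deltalakewriter-container"   # default if left blank

     service = BlobServiceClient(
         account_url=f"https://{account_name}.blob.core.windows.net",
         credential=account_key,
     )

     # Create the container if it does not already exist, as Striim would.
     container = service.get_container_client(container_name)
     if not container.exists():
         container.create_container()
     ```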

  7. Click Next. Striim will verify that all the necessary permissions are available. If it reports that any are missing, reconfigure the Databricks user account associated with the personal access token, then try again.