Striim 4.0.4 documentation

BigQuery Writer

BigQuery is a serverless, highly scalable, and cost-effective cloud data warehouse offered by Google. Striim’s BigQueryWriter writes the data from various supported sources into Google’s BigQuery data warehouse to support real time data warehousing and reporting.

BigQuery Writer can be used to move data from transactional databases such as Oracle, Amazon RDS, Azure SQL DB, PostgreSQL, Microsoft SQL Server, MySQL, Google Spanner, and other supported databases into BigQuery with low latency and in real time. Striim supports complete table loads, change feeds (CDC) , and incremental batch feeds into BigQuery. BigQuery Writer properties can be configured to support authentication, object mappings, batching, performance, and failure handling.

See Release notes and third-party software notices for the google-cloud-bigquery client for Java API version used in this release.

BigQuery upload methods: Streaming vs. Load

You have a choice of two methods, using different parts of this API, for BigQuery Writer to use to write to its target tables. You cannot switch between these methods while an application is running.

  • Streaming: Incoming data is buffered locally as one memory buffer per target table. Once the upload condition is met, BigQuery Writer uses InsertAllResponse to stream the content of each memory buffer into its target table. We recommend using this method when you need low latency.

  • Load: Incoming data is buffered locally as one CSV file per target table. Once the upload condition for a file is met, BigQuery Writer uses TableDataWriteChannel to upload the content of the file to BigQuery, which writes it to the target table. This was the only method supported by BigQuery Writer in Striim 3.9.x and earlier releases. This method may be a good fit if your uploads are infrequent (for example, once an hour).

If you have BigQuery Writer applications using the Load method and are spending a lot of time tuning those applications' batch policies or are running up against BigQuery's quotas, the streaming method may work better for you.

Improving performance by partitioning BigQuery tables

When using BigQuery Writer's MERGE mode, partitioning the target tables can significantly improve performance by reducing the need for full-table scans. Partition columns must be specified when the tables are created; you cannot partition an existing table. See Introduction to partitioned tables for more information.

When BigQuery Writer's input stream is of type WAEvent and a database source column is mapped to a target table's partition column:

  • for INSERT events, the partition column value must be in the WAEvent data array for every INSERT event in the batch

  • for UPDATE events, the partition column value must be in the WAEvent before array for every UPDATE event in the batch

When this is not the case, a full-table scan will be required for the batch, and performance will be reduced accordingly.

If the partition column values will be updated, specify the source table primary key and partition column names in KeyColumns in the Tables property. For example, If the id is the primary key in the source table and purchase_date is the partition column is BigQuery, the Tables property value would be <schema name>.<table name>, <dataset name.<table name> KeyColumns(id, purchase_date). See the notes for Mode in BigQuery Writer properties for additional discussion of KeyColumns.


Typical BigQuery workflow

The most typical workflow when using BigQuery Writer is for streaming integration. Briefly, it works like this:

  1. Create tables in the target system corresponding to those in the source. You may do this using Striim's schema conversion utility for this task (Using the schema conversion utility or any other tool you prefer.

  2. Create a Striim initial load application using Database Reader and BigQuery Writer. This application will use the load method and, to avoid potential BigQuery quota limitations, will use large batches. Run it once to load existing data to BigQuery.

  3. Create a Striim streaming integration application using the appropriate CDC reader and BigQuery Writer. This application will use the streaming method to minimize the time it takes for new data to arrive at the target. Streaming also avoids the quota limitation issues of the load method. Typically you will want to enable recovery for this application (see Recovering applications) and perhaps deploy it on a multi-server cluster configured for failover to ensure high availability (see Continuing operation after server failover). Run this application continuously to stream new data to BigQuery (in other words, to synchronize the source and target.Recovering applications