Iceberg Writer

Apache Iceberg is an open table format that organizes large analytic tables as files in data lakes. It is designed to improve performance and compliance, offering stronger ACID guarantees, efficient recording of transactional data, and more scalable SQL operations. For more information, see the Apache Iceberg documentation.

Iceberg Writer is a Striim target adapter that writes data in Iceberg format to a data lake. The adapter requires a compute engine and a catalog.

In this release, the only supported data lake is Google Cloud Storage (GCS). This requires the Google Dataproc compute engine and a BigQuery Metastore, Nessie, or Polaris catalog. Hadoop in the GCS data lake is also supported as a catalog, but is recommended only for development and testing, as performance may not be adequate for a production environment.

Iceberg Writer summary

Supported sources

Iceberg Writer can write data from all sources supported by Striim.

Authentication

Iceberg Writer authenticates its connection to GCS and Google Dataproc using service account keys.

Supported write modes

Iceberg Writer supports two write modes:

  • Merge: Records inserted, updated, or deleted from the source database(s) are inserted, updated, or deleted in Iceberg, so the data in Iceberg duplicates the data in the source database(s).

  • Append Only: Insert, update, and delete operations in the source database(s) are all treated as inserts in Iceberg. Thus, you can use Iceberg to query old data that no longer exists in the source database(s), for example, for month-over-month or year-over-year reports.
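
The difference between the two modes can be illustrated with a small, self-contained Python sketch. This is not Striim or Iceberg code: the event shape (op, key, row) and the function names apply_merge and apply_append_only are assumptions made purely for illustration.

```python
# Hypothetical change events; the field names are illustrative,
# not Striim's actual event format.
events = [
    {"op": "INSERT", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "UPDATE", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "INSERT", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "DELETE", "key": 2, "row": {"id": 2, "status": "new"}},
]


def apply_merge(events):
    """Merge semantics: the target ends up mirroring the source's current state."""
    table = {}
    for event in events:
        if event["op"] == "DELETE":
            table.pop(event["key"], None)  # the row disappears from the target
        else:
            table[event["key"]] = event["row"]  # insert or overwrite by key
    return table


def apply_append_only(events):
    """Append-only semantics: every operation becomes a new row in the target."""
    return [{**event["row"], "op": event["op"]} for event in events]


merged = apply_merge(events)
appended = apply_append_only(events)

print(merged)         # only row 1 remains, in its latest state
print(len(appended))  # all four operations are retained as rows
```

In this sketch the merged result holds only the rows that still exist in the source, while the append-only result keeps the full history, including the row that was later deleted.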

Additional writing features

  • Supports auto-quiesce after an initial load from Database Reader.

  • Supports schema evolution to detect and propagate DDL changes from supported sources to the Iceberg tables.

Supported staging areas

Iceberg requires a staging area to temporarily hold new data while it is being written to tables. In this release, Iceberg Writer supports only Google Cloud Storage (GCS) as the staging area.

Resilience and recovery

  • Supports connection retry to avoid application halting due to transient connection issues.

  • Supports recovery with at-least-once processing (see Recovering applications).
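
At-least-once processing means that no events are lost on recovery, but some may be delivered to the target more than once. The following Python sketch illustrates that general idea only. It does not model Striim's recovery mechanism, and the function name replay_after_recovery, the checkpoint position, and the event labels are all hypothetical.

```python
def replay_after_recovery(events, checkpoint, crash_at):
    """Return what a target receives when a writer crashes and restarts.

    The writer delivered the first `crash_at` events before crashing, but only
    the first `checkpoint` events were recorded as done, so on restart it
    resumes from the checkpoint and resends everything after it.
    """
    delivered_before_crash = list(events[:crash_at])
    delivered_after_restart = list(events[checkpoint:])
    return delivered_before_crash + delivered_after_restart


events = ["e1", "e2", "e3", "e4", "e5"]
seen = replay_after_recovery(events, checkpoint=2, crash_at=4)

# Events between the checkpoint and the crash point arrive twice.
duplicates = sorted({e for e in seen if seen.count(e) > 1})

# Nothing is lost: removing duplicates recovers the original sequence.
deduplicated = list(dict.fromkeys(seen))

print(seen)
print(duplicates)
print(deduplicated)
```

If duplicates matter for downstream queries, one option is to deduplicate on a key when reading, as the last step of the sketch does.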

Performance

In append-only mode, parallel threads (see Creating multiple writer instances (parallel threads)) can increase throughput to the target in certain situations.

Programmability

  • Flow Designer

  • TQL

  • Wizards in the web UI to create applications from supported sources

Metrics and auditing

Key metrics are available through Striim's monitoring features (see Data warehouse monitoring metrics and Iceberg Writer monitoring metrics).

Drivers and other third-party libraries

Iceberg Writer uses google-cloud-storage version 2.43.1 and google-cloud-dataproc version 4.48.0. For BigQuery Metastore it uses iceberg-bigquery-catalog version 1.5.2-1.0.1-beta.

Key limitations

  • If partitioned tables are required (see Partitioning in the Iceberg documentation), auto schema creation cannot be used; the tables must be created manually in Iceberg.

  • Time data types in the source are written as strings in Iceberg.
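
Because time values arrive as strings, consumers that need a time type must convert them when reading. The Python sketch below shows one way to do that. The string layout HH:MM:SS is an assumption, since the exact format written by Iceberg Writer is not stated here, and parse_time_string is a hypothetical helper; check the actual values in your tables before relying on it.

```python
from datetime import time


def parse_time_string(value: str) -> time:
    """Convert a time stored as a string back into a time object.

    Assumes an ISO 8601 style layout such as "14:30:15"; this format is an
    assumption, not something the writer is documented here to guarantee.
    """
    return time.fromisoformat(value)


parsed = parse_time_string("14:30:15")
print(parsed.hour, parsed.minute, parsed.second)
```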