Iceberg Writer programmer's reference

Iceberg Writer properties

Each property is listed below with its type and, where applicable, its default value, followed by notes.

Catalog Connection Profile Name (enum)

See Introducing connection profiles and Catalog connection profile properties.

Catalog Type (enum, default: Hadoop)

Select the catalog type you are using for Iceberg (see Configure the Iceberg catalog).

CDDL Action (enum, default: Process)

See Handling schema evolution.

Compute Connection Profile Name (enum)

See Introducing connection profiles and Compute connection profile properties.

Compute Type (enum, default: Google Dataproc)

Leave set to Google Dataproc (the only type supported in this release).

Data Lake Connection Profile Name (enum)

See Introducing connection profiles and Data Lake connection profile properties.

Data Lake Type (enum, default: GCS)

Leave set to GCS (the only type supported in this release).

Excluded Tables (string)

When a wildcard is specified for Tables, you may specify here any tables you wish to exclude from the query. Specify the value exactly as for Tables. For example, to include data from all tables in mynamespace except the table named ignore:

Tables:'mynamespace.%',
ExcludedTables:'mynamespace.ignore'

External Stage Connection Profile Name (enum)

See Introducing connection profiles and Setting Google Cloud Storage (GCS) connection profile properties. If the External Staging Location folder is in the same project as the data lake folder, leave this blank to use the data lake connection profile.

External Staging Location (string)

Specify the path to the external stage folder you created as described in Configure Google Cloud Storage (GCS) buckets for Iceberg Writer. The files in this folder will be deleted automatically when Iceberg Writer is stopped.

External Staging Type (enum, default: GCS)

Leave set to GCS (the only type supported in this release).

Iceberg Tables Location (string)

Specify the path to the data lake folder you created as described in Configure Google Cloud Storage (GCS) buckets for Iceberg Writer. This folder must already exist when Iceberg Writer is started for the first time, and it must be in the GCS instance specified by the Data Lake Connection Profile Name.

Mode (enum, default: APPENDONLY)

See Choose which writing mode to use.

Optimized Merge (Boolean, default: False)

Appears in Flow Designer only when Mode is MERGE.

Set to True only when Mode is MERGE and the target's input stream is the output of an HP NonStop reader, MySQL Reader, or Oracle Reader source and the source events will include partial records. For example, with Oracle Reader, when supplemental logging has not been enabled for all columns, partial records are sent for updates. When the source events will always include full records, leave this set to False.
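
For example, a minimal sketch of how these two properties might appear together in a target's property list (the space-free property names follow the pattern of the Excluded Tables example above and are an assumption, not verbatim syntax from this reference):

Mode:'MERGE',
OptimizedMerge:true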

Parallel Threads (integer)

Supported only when Mode is APPENDONLY. See Creating multiple writer instances (parallel threads).

Tables (string)

The name(s) of the table(s) to write to, in the format <source database>.<source table>,<target namespace>.<target table>.

You can provide multiple mappings with a semicolon as a separator. For example: srcdb.emp1,tarns1.emp1; srcdb.emp2,tarns1.emp2

When the input stream of the target is the output of a DatabaseReader, IncrementalBatchReader, or SQL CDC source, it can write to multiple tables. In this case, specify the names of both the source and target tables. You may use the % wildcard only for tables, not for schemas or databases. If the reader uses three-part names, you must use them here as well.

Note that Oracle CDB/PDB source table names must be specified in two parts when the source is Database Reader or Incremental Batch Reader (schema.%,namespace.%) but in three parts when the source is Oracle Reader or OJet (database.schema.%,namespace.%). SQL Server source table names must be specified in three parts when the source is Database Reader or Incremental Batch Reader (database.schema.%,namespace.%) but in two parts when the source is MS SQL Reader or MS Jet (schema.%,namespace.%). Examples:

source.emp,target_namespace.emp
source_schema.%,target_namespace.%
source_database.source_schema.%,target_namespace.%
source_database.source_schema.%,
  target_namespace.%

See Mapping columns and Defining relations between source and target using ColumnMap and KeyColumns for additional options.

Upload Policy (string, default: eventcount:100000,interval:60s)

The upload policy may include eventcount and/or interval (see Setting output names and rollover / upload policies for syntax). Buffered data is written to the storage account every time any of the specified values is exceeded. With the default value, data will be written every 60 seconds or sooner if the buffer contains 100,000 events. When the app is quiesced, any data remaining in the buffer is written to the storage account; when the app is undeployed, any data remaining in the buffer is discarded.
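
To illustrate how these properties fit together, the following is a rough TQL sketch only. The adapter name IcebergWriter, the space-free property names, and all profile, bucket, stream, and table names below are assumptions for illustration, not verbatim values from this reference:

CREATE TARGET IcebergDemoTarget USING IcebergWriter (
  CatalogType: 'BigQuery Metastore',
  CatalogConnectionProfileName: 'my_catalog_profile',
  ComputeType: 'Google Dataproc',
  ComputeConnectionProfileName: 'my_dataproc_profile',
  DataLakeType: 'GCS',
  DataLakeConnectionProfileName: 'my_gcs_profile',
  IcebergTablesLocation: 'my-datalake-bucket/iceberg-tables',
  ExternalStagingType: 'GCS',
  ExternalStagingLocation: 'my-datalake-bucket/iceberg-staging',
  Mode: 'APPENDONLY',
  Tables: 'srcdb.%,tarns1.%',
  UploadPolicy: 'eventcount:100000,interval:60s'
)
INPUT FROM MySourceStream;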

Catalog connection profile properties

In this release, supported catalogs are BigQuery Metastore, Nessie, and Polaris. Hadoop in the same Google Cloud Storage instance as the data lake is also supported as a catalog. This is recommended only for development and testing, as performance may not be adequate for a production environment.

Setting BigQuery Metastore connection profile properties

In the Create or Edit Connection Profile dialog, select BigQuery Metastore as the endpoint type.

  • Project ID: specify the Google Cloud project ID of the BigQuery metastore

  • Region: specify the region of the BigQuery metastore

Setting Nessie connection profile properties

In the Create or Edit Connection Profile dialog, select Nessie as the endpoint type.

  • Authentication Type: Select Google Open ID if the Nessie server uses OAuth. In that case, provide the Host URI and Nessie Branch Name, then click Sign in using Google Open ID.

    If the Nessie server has no authentication mechanism, select None.

    If the Nessie server uses some other authentication mechanism, use Additional Configuration to provide the required properties.

  • Host URI: specify the Nessie REST API endpoint in the format http[s]://<IP address>:<port>/api/v1

  • Nessie Branch Name: specify the Nessie branch to use for the Iceberg catalog

  • Additional Configuration: if your Nessie instance requires additional properties (see Nessie client configuration options), click Add Property to add them. For example, if Nessie is configured for OAuth, you would specify something like this:

    [Screenshot: Nessie Additional Configuration properties]
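
    As an illustration of the kind of key/value pairs that might be entered there (the exact keys depend on your Nessie server and client version; the names below are drawn from the Nessie client's documented OAuth2 settings and are shown only as a sketch, not as values from this reference):

    nessie.authentication.type = OAUTH2
    nessie.authentication.oauth2.issuer-url = <your identity provider's issuer URL>
    nessie.authentication.oauth2.client-id = <your client ID>
    nessie.authentication.oauth2.client-secret = <your client secret>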

Setting Polaris connection profile properties

In the Create or Edit Connection Profile dialog, select Polaris as the endpoint type.

  • Host URI: specify the Polaris REST API endpoint in the format http[s]://<IP address>:<port>/api/catalog

  • Catalog Name: specify the name of the Polaris catalog for Iceberg

  • Client ID: specify the assigned client identifier

  • Client Secret: specify the confidential value used to authenticate the client

  • Additional Configuration: optionally specify Spark configuration properties; for example, to specify the client region:

    [Screenshot: Polaris Additional Configuration properties]
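
    For illustration only: client.region is Iceberg's AWS client-region catalog property and is shown here as an assumption; depending on your Spark and Iceberg versions you may instead need the fully qualified Spark form such as spark.sql.catalog.<catalog name>.client.region.

    client.region = us-east-1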

Compute connection profile properties

In this release, the only compute engine supported is Google Dataproc.

Setting Google Dataproc connection profile properties

In the Create or Edit Connection Profile dialog, select Google Dataproc as the endpoint type.

Data Lake connection profile properties

In this release, the only data lake supported is Google Cloud Storage (GCS).

Setting Google Cloud Storage (GCS) connection profile properties

In the Create or Edit Connection Profile dialog, select GCS as the endpoint type.