
GCS Reader

Google Cloud Storage is a web service for storing and accessing customer data that provides unified file storage. You can use GCS Reader to read data from a Google Cloud Storage bucket.

Summary

APIs used/data supported

  • Google Cloud Storage APIs: List and Get

  • Audit APIs: List

Supported parsers

AAL (Apache access log), Avro, Binary, DSV, Free Form Text, JSON, NVP (name-value pair), Parquet, XML

Supported targets

All targets supported by Striim. See Writers overview.

Security and authentication

GCS Reader supports private endpoints. See Using Private Service Connect with Google Cloud adapters.

Access requirements:

  • Connecting to Google Cloud Storage requires a Google credentials JSON file.

  • Accessing the Google Cloud Audit log service requires a Google credentials JSON file.

Operations / modes supported

  • Initial load and incremental load.

  • Supports multiple modes (Directory Listing, Audit Log) to detect new files.

  • Processes files in order of modified date (only if the audit log is enabled).

Modes for fetching data

  • Streaming mode (default and recommended)

  • Download mode (for Parquet files only)

Resilience / recovery

  • Supports recovery with at-least-once processing (A1P) by recording checkpoints of the names and offset information of processed files. Upon restart, the reader uses the details from the last checkpoint to resume reading from the GCS bucket. In some cases, data may have been read after the checkpoint was taken, in which case there may be duplicates in the target (see Recovering applications).

  • Automatic retries. Any API call to the GCS bucket that fails because of a connection failure is retried according to the Connection Retry Policy property (see GCS Reader properties). If the adapter still cannot connect to the GCS bucket after the configured retries are exhausted, the app halts with an appropriate error message.
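
The checkpoint-based recovery described in the first item above can be illustrated with a minimal Python sketch. It is a simulation under stated assumptions, not Striim's implementation: the `Checkpoint` class, the `run` function, and the `checkpoint_every` and `crash_after` parameters are hypothetical names, and files are modeled as in-memory lists of records. It shows why a record read after the last checkpoint can reach the target twice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical checkpoint: file being read and the next record offset."""
    file_name: str = ""
    offset: int = 0


def run(files, target, start, checkpoint_every=2, crash_after=None):
    """Deliver records to `target`, resuming from the `start` checkpoint.

    Returns the last checkpoint recorded. `crash_after` simulates a crash
    once that many records have been delivered in this run.
    """
    saved = start
    delivered = 0
    for name in sorted(files):
        if name < start.file_name:
            continue  # file was fully processed before the checkpoint
        first = start.offset if name == start.file_name else 0
        for offset in range(first, len(files[name])):
            if delivered == crash_after:
                return saved  # progress made since `saved` is lost
            target.append(files[name][offset])
            delivered += 1
            if delivered % checkpoint_every == 0:
                saved = Checkpoint(name, offset + 1)
    return saved


files = {"a.csv": ["a1", "a2", "a3"], "b.csv": ["b1", "b2"]}
target = []

# First run crashes after three records; the last checkpoint covers only two.
checkpoint = run(files, target, Checkpoint(), crash_after=3)

# Restart from the checkpoint: "a3" is read and delivered a second time.
run(files, target, checkpoint)
```

After both runs, `target` holds `a1, a2, a3, a3, b1, b2`. Every record arrives at least once, and `a3` arrives twice because it was delivered after the last checkpoint was taken.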

Performance

  • You can choose between two file detection modes that determine when a full or incremental metadata fetch occurs. In GCSDirectoryListing mode, a full metadata fetch happens when the adapter starts and on every subsequent poll. In GCSAuditLogNotification mode, a full metadata fetch happens when the reader starts, and subsequent polls fetch only incremental changes from the audit log, minimizing the load on the Striim server. See Performance optimizations.

  • When the bucket contains a very large number of files (on the order of millions), GCSAuditLogNotification mode provides better performance for app recovery after a crash or stop because it fetches only the incremental changes from the audit log.
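
The difference in polling cost between the two detection modes can be sketched with a small simulation. This is an illustration only and does not call any Google API: the bucket is modeled as a set of object names, the audit log as a list of creation events, and `poll_directory_listing` and `poll_audit_log` are hypothetical helper names.

```python
def poll_directory_listing(bucket, seen):
    """Full metadata fetch: examine every object and return the unseen names."""
    examined = len(bucket)
    new_files = sorted(name for name in bucket if name not in seen)
    seen.update(new_files)
    return new_files, examined


def poll_audit_log(audit_log, position):
    """Incremental fetch: examine only the log entries after `position`."""
    entries = list(audit_log[position:])
    return entries, len(entries), len(audit_log)


# A bucket that already holds 1000 objects when the reader starts.
bucket = {f"data/file{i:04d}.json" for i in range(1000)}
audit_log = []  # creation events are appended as objects are added
seen = set()

# Startup: a full metadata fetch happens in either mode.
initial_files, initial_examined = poll_directory_listing(bucket, seen)
position = len(audit_log)

# Two new objects arrive between polls.
for name in ("data/new1.json", "data/new2.json"):
    bucket.add(name)
    audit_log.append(name)

listing_new, listing_examined = poll_directory_listing(bucket, seen)
audit_new, audit_examined, position = poll_audit_log(audit_log, position)
```

Both polls find the same two new files, but the directory listing poll examines all 1002 metadata entries while the audit log poll examines only the 2 new events.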

Programmability

  • Flow Designer

  • TQL

  • Wizards in the web UI to build pipelines to targets such as databases or apps

Metrics and auditing

Key metrics available through Striim monitoring. See Monitoring metrics.

Key limitations

No support for reading encrypted GCS objects.

Typical use case and integration

A typical use case is using Google Cloud Storage as a centralized repository that stores, processes, and secures files of any format. Striim is used to process and enrich the newly ingested data, and send the data to data warehouses for analytics.

GCS Reader overview

Files stored in Google Cloud Storage are grouped into buckets. Buckets are the basic containers that hold your data. Everything that you store in Cloud Storage must be contained in a bucket. You can use buckets to organize and control access to your data.

GCS Reader works by connecting to Google Cloud Storage and fetching file metadata from the bucket that you specify. The reader requires a valid service account key credentials JSON file to access the files in the GCS bucket. GCS Reader supports different file formats such as JSON and CSV, and can read either from a single folder or recursively from a folder and its subfolders.
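
Cloud Storage object names form a flat namespace in which a folder is a name prefix ending in `/`. The following Python sketch shows one way the single-folder and recursive cases could be distinguished by filtering object names. The `select_objects` function and its `recursive` flag are hypothetical and are not GCS Reader properties; how the reader implements this internally is not stated in this document.

```python
def select_objects(names, folder, recursive=False):
    """Return the object names under `folder`, optionally including subfolders."""
    prefix = folder.rstrip("/") + "/" if folder else ""
    selected = []
    for name in names:
        if not name.startswith(prefix):
            continue  # object lives outside the requested folder
        rest = name[len(prefix):]
        if not rest or rest.endswith("/"):
            continue  # skip placeholder objects that only represent a folder
        if recursive or "/" not in rest:
            selected.append(name)
    return selected


names = [
    "logs/",
    "logs/a.json",
    "logs/2024/b.json",
    "logs/2024/01/c.json",
    "other/d.json",
]

single_folder = select_objects(names, "logs")
with_subfolders = select_objects(names, "logs", recursive=True)
```

Here `single_folder` contains only `logs/a.json`, while `with_subfolders` also contains `logs/2024/b.json` and `logs/2024/01/c.json`.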

The reader processes the files in the bucket as an initial load or as incremental loads. There are two modes for file detection. The GCSDirectoryListing mode performs a full metadata fetch when the adapter starts and on every subsequent poll. The GCSAuditLogNotification mode performs a full metadata fetch when the reader starts, and subsequent polls fetch only incremental changes from the audit log generated by the GCS bucket.

GCS Reader supports two ways to fetch the data. In the default streaming mode, GCS Reader fetches the file data directly from Google Cloud Storage by opening a remote InputStream, streaming the bytes from the remote file, and processing the incoming stream. Alternatively, you can use the download mode where GCS Reader first downloads the files to a local folder and then processes them.
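
The contrast between the two fetch modes can be sketched in Python. This is a simplified illustration, not the reader's code: `open_remote` is a hypothetical stand-in that wraps in-memory bytes instead of opening a real Cloud Storage object, and `read_streaming` and `read_download` are assumed names.

```python
import io
import os
import shutil
import tempfile


def open_remote(data):
    """Stand-in for opening a remote object as a readable byte stream."""
    return io.BytesIO(data)


def read_streaming(data):
    """Streaming mode: yield each line as soon as it is read from the stream."""
    with open_remote(data) as stream:
        for line in stream:
            yield line.rstrip(b"\n")


def read_download(data, download_dir):
    """Download mode: copy the whole object to a local file, then read it."""
    path = os.path.join(download_dir, "object.tmp")
    with open_remote(data) as stream, open(path, "wb") as local:
        shutil.copyfileobj(stream, local)
    with open(path, "rb") as local:
        lines = [line.rstrip(b"\n") for line in local]
    os.remove(path)  # the local copy is removed once processing has succeeded
    return lines


payload = b"id,name\n1,alpha\n2,beta\n"
streamed = list(read_streaming(payload))

with tempfile.TemporaryDirectory() as download_dir:
    downloaded = read_download(payload, download_dir)
    leftover = os.listdir(download_dir)
```

Both paths produce the same records. Streaming begins processing before the whole object has been transferred and needs no local disk space, while download mode guarantees the complete file is present before processing starts.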

After a file has been processed successfully in download mode, Striim deletes the downloaded file from local storage. Download mode is recommended only when an entire file must be available before processing begins. Currently this applies only to Parquet files, because they store their schema in the file footer. In the event of a system crash, the whole Striim application halts. When the application restarts, the contents of the local download folder are cleared and unprocessed files are downloaded again.
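
To illustrate why a footer-based format needs the complete file, the sketch below mimics the outer layout of a Parquet file: the magic bytes `PAR1`, the data, the footer metadata, a 4-byte little-endian footer length, and `PAR1` again. The footer content here is a plain placeholder string rather than real Parquet metadata, and `build_fake_parquet` and `read_footer` are hypothetical helpers written for this example.

```python
import struct

MAGIC = b"PAR1"


def build_fake_parquet(row_data, footer):
    """Assemble bytes with Parquet's outer layout (contents are placeholders)."""
    return MAGIC + row_data + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer(file_bytes):
    """Locate the footer using the trailing 8 bytes of the complete file."""
    if (
        len(file_bytes) < 12
        or file_bytes[:4] != MAGIC
        or file_bytes[-4:] != MAGIC
    ):
        raise ValueError("not a complete Parquet-style file")
    (footer_len,) = struct.unpack("<I", file_bytes[-8:-4])
    end = len(file_bytes) - 8
    return file_bytes[end - footer_len:end]


footer = b"schema: id INT64, name STRING"
blob = build_fake_parquet(b"rows" * 100, footer)

# With the whole file available, the footer can be found from the end.
recovered = read_footer(blob)

# A partially received file has no trailer, so the schema cannot be located.
try:
    read_footer(blob[:50])
    truncated_ok = True
except ValueError:
    truncated_ok = False
```

Because the footer position is known only from the last bytes of the file, a reader that has received just the first part of the object cannot find the schema, which is consistent with recommending download mode for Parquet.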