Striim 3.9.8 documentation

Recovering applications

Subject to the following limitations, Striim applications can be recovered after planned downtime or most cluster failures with no loss of data:

  • Recovery must have been enabled when the application was created.


    Enabling recovery will have a modest impact on memory and disk requirements and event processing rates, since additional information required by the recovery process is added to each event.

  • All sources and targets to be recovered, as well as any CQs, windows, and other components connecting them, must be in the same application. Alternatively, they may be divided among multiple applications provided the streams connecting those applications are persisted to Kafka (see Persisting a stream to Kafka and Using the Striim Forwarding Agent).

  • Data from a CDC reader with a Tables property that maps a source table to multiple target tables (for example, Tables:'DB1.SOURCE1,DB2.TARGET1;DB1.SOURCE1,DB2.TARGET2') cannot be recovered.

  • Data from time-based windows that use system time rather the ON <timestamp field name> option cannot be recovered.

  • Data from sources using the DatabaseReader, HTTPReader, MultiFileReader, TCPReader, or UDPReader adapters cannot be recovered unless the source's output is a Kafka stream (see Introducing Kafka streams).

  • Standalone sources and WActionStores (see Loading standalone sources, caches, and WActionStores) are not recoverable unless persisted to Kafka (see Persisting a stream to Kafka).

  • Data from sources using an HP NonStop reader can be recovered provided that the AuditTrails property is set to its default value, merged.

  • Caches are reloaded from their sources. If the data in the source has changed in the meantime, the application's output may be different than it would have been.

  • If objects were added or renamed by DDL replication (see Including DDL operations in OracleReader output, and the Tables properties do not use wildcards, you must edit the application to add all new and renamed objects to both the OracleReader and DatabaseWriter Tables properties before restarting the application after a cluster failure.

  • Known issue DEV-13186: Recovery fails when KinesisWriter target has 250 or more shards. The error will include "Timeout while waiting for a remote call on member ..."

In some situations, after recovery there may be duplicate events.

  • Recovered flows that include WActionStores should have no duplicate events. Recovered flows that do not include WActionStores may have some duplicate events from around the time of failure ("at least once processing," also called A1P), except when a target guarantees no duplicate events ("exactly once processing," also called E1P). See Writers overview for details of A1P and E1P support.

  • ADLSWriterGen1, ADLSWriterGen2, AzureBlobWriter, FileWriter, HDFSWriter, and S3Writer restart rollover from the beginning and depending on rollover settings (see Setting output names and rollover / upload policies) may overwrite existing files. For example, if prior to the crash there were file00, file01, and the current file was file02, after recovery writing would restart from file00, and eventually overwrite all three existing files, so you may wish to back up or move the existing files before initiating recovery. After recovery, the target files may include duplicate events; the number of possible duplicates is limited to the Rollover Policy eventcount value.

  • When KafkaWriter is in sync mode (see Setting KafkaWriter's mode property: sync versus async), if the Kafka topic's retention period is shorter than the time that has passed since the cluster failure, after recovery there may be some duplicate events, and striim.server.log will contain a warning, "Start offset of the topic is different from local checkpoint (Possible reason - Retention period of the messages expired or Messages were deleted manually). Updating the start offset ..."

  • After recovery, RedshiftWriter targets may include some duplicate events.

To enable Striim applications to recover from system failures, you must do two things:

1. Enable persistence of all of the application's WActionStores.

2. Specify the RECOVERY option in the CREATE APPLICATION statement. The syntax is:



With some targets, enabling recovery for an application disables parallel threads. See Creating multiple writer instances for details.

For example:


With this setting, Striim will record a recovery checkpoint every ten seconds, provided it has completed recording the previous checkpoint. When recording a checkpoint takes more than ten seconds, Striim will start recording the next checkpoint immediately.

When the PosApp application is restarted after a system failure, it will resume exactly where it left off.


While recovery is in progress, the application status will be RECOVERING SOURCES, indicated by a red arrow in the web UI. The shorter the recovery interval, the less time it will take for Striim to recover from a failure. Longer recovery intervals require fewer disk writes during normal operation. To see detailed recovery status, enter MON <namespace>.<application name> <node> (see Using the MON command).