Setting output names and rollover / upload policies
ADLS Gen1 Writer, ADLS Gen2 Writer, Azure Blob Writer, File Writer, GCS Writer, HDFS Writer, and S3 Writer support the following options to define the paths and names of output files, when new files are created, and how many files are retained.
Dynamic output names
The blobname string in Azure Blob Writer, the bucketname and objectname strings in GCS Writer and S3 Writer, the directory and filename strings in ADLS Gen1 Writer, ADLS Gen2 Writer, File Writer, and HDFS Writer, and the foldername string in Azure Blob Writer, GCS Writer, and S3 Writer may include field-name tokens that will be replaced with values from the input stream. For example, if the input stream included yearString, monthString, and dayString fields, directory:'%yearString%/%monthString%/%dayString%' would create a new directory for each date, grouped by month and year subdirectories. If desirable, you may filter these fields from the output using the members property in DSVFormatter or JSONFormatter.
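The field-name token substitution can be sketched in Python as follows. This is purely illustrative: the resolve_tokens helper and the dict-based event are hypothetical stand-ins, since Striim performs this substitution internally.

```python
import re

# Sketch of %fieldName% token substitution, assuming the input event's
# fields are available as a dict. resolve_tokens is a hypothetical helper.
def resolve_tokens(template, event):
    def sub(match):
        return str(event[match.group(1)])  # replace token with the field value
    return re.sub(r"%(\w+)%", sub, template)

event = {"yearString": "2016", "monthString": "06", "dayString": "01"}
print(resolve_tokens("%yearString%/%monthString%/%dayString%", event))
# → 2016/06/01
```

With this template, each distinct date in the stream yields a distinct directory path.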
Note: If a bucketname in GCS Writer or S3 Writer, directoryname in ADLS Gen1 Writer or ADLS Gen2 Writer, or foldername in S3 Writer contains two field-name tokens, such as %field1%%field2%, the second creates a subfolder of the first.
When the target's input is the output of a CDC or DatabaseReader source, values from the WAEvent metadata or userdata map or JSONNodeEvent metadata map may be used in these names using the syntax %@metadata(<field name>)% or %@userdata(<field name>)%, for example, %@metadata(TableName)%. You may combine multiple metadata and/or userdata values, for example, %@metadata(name)%/%@userdata(TableName)%. You may also mix field, metadata, and userdata values.
For S3 Writer bucket names, do not include punctuation between values. Hyphens, the only punctuation allowed, will be added automatically. For more information, see Amazon's Rules for Bucket Naming.
Rollover and upload policies
The rolloverpolicy or uploadpolicy property triggers output file or blob rollover in different ways depending on which parameter you specify:
parameter | rollover trigger | example |
---|---|---|
eventcount (or size) | specified number of events have been accumulated (in Kinesis Writer, this is specified as a size in bytes rather than a number of events) | eventcount:10000 |
filesize | specified file size has been reached (the value must be specified in megabytes; the maximum is 2048M) | filesize:1024M |
interval | specified time has elapsed | interval:1h |
You may specify both eventcount and interval, in which case rollover is triggered whenever one of the limits is reached. For example, eventcount:10000,interval:1h will start a new file after one hour has passed since the current file was created or after 10,000 events have been written to it, whichever happens first.
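The combined trigger can be sketched as a minimal check over both limits. RolloverPolicy below is a hypothetical helper written for illustration, not part of any Striim API.

```python
import time

class RolloverPolicy:
    """Sketch of an eventcount/interval rollover check (hypothetical)."""

    def __init__(self, eventcount=None, interval_secs=None):
        self.eventcount = eventcount
        self.interval_secs = interval_secs
        self.events = 0                  # events written to the current file
        self.opened = time.monotonic()   # creation time of the current file

    def record_event(self):
        self.events += 1

    def should_roll(self):
        # Roll over when either limit is reached, whichever happens first.
        if self.eventcount is not None and self.events >= self.eventcount:
            return True
        if (self.interval_secs is not None
                and time.monotonic() - self.opened >= self.interval_secs):
            return True
        return False

policy = RolloverPolicy(eventcount=10000, interval_secs=3600)
for _ in range(10000):
    policy.record_event()
print(policy.should_roll())  # → True (the event-count limit was reached first)
```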
Caution
When the rollover policy includes an interval or eventcount, and the file or blob name includes the creation time, be sure that the writer will never receive events so quickly that it will create a second file with the same name, since that may result in lost data. You may work around this by using nanosecond precision in the creation time or by using a sequence number instead of the creation time.
If file lineage is enabled as described in Enabling file lineage, when the writer is stopped and restarted, the sequence will continue with the next file name in the series.
If you drop and re-create the application and there are existing files in the output directory, start will fail with a "file ... already exists in the current directory" error. To retain the existing files in that directory, add a sequencestart parameter to the rollover policy, or add %<time format>%, %epochtime%, or %epochtimems% to the file name.
If file lineage is not enabled, when the writer is stopped and restarted, the sequence will restart from the beginning, overwriting existing files, so you might want to back up the output directory before restarting.
filelimit
By default, there is no limit to how many files are retained (filelimit:-1
). You may change that by including the filelimit
parameter in the rollover policy:
rolloverpolicy:'interval:1h,filelimit:24'
sequencestart, incrementsequenceby
If you prefer to start with a value greater than 0 or increment by a value greater than 1, add the sequencestart or incrementsequenceby parameters to rolloverpolicy or uploadpolicy, for example:
rolloverpolicy:'eventcount:10000,sequencestart:10,incrementsequenceby:10'
File and blob names
By default, output file or blob names are the base name (specified in filename in ADLS Gen1 Writer, ADLS Gen2 Writer, File Writer, and HDFS Writer, blobname in Azure Blob Writer, or objectname in GCS Writer and S3 Writer) with a counter between the file name and extension. The counter starts at 00 and is incremented by one for each new file. For example, if filename is myfile.txt, the output files will be myfile.00.txt, myfile.01.txt, and so on. If filelimit is set to a positive value, the counter will start over at 00 when it reaches that number, overwriting the existing files. If filelimit is -1, the counter will increment indefinitely (in which case you might want to create a cron job to delete old log files).
You may customize the output file names using the following tokens. They are allowed only in the file name, not in the extension.
%n...%
Use one n for each digit you want in the sequence number. The sequence number may be placed somewhere other than at the end of the file name by including the %<n...>% token in the writer's filename property value:
filename:'batch_%nnn%_data.csv'
With that setting, rolled-over files will be named batch_000_data.csv, batch_001_data.csv, and so on.
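The %n...% expansion, including the optional sequencestart and incrementsequenceby values described earlier, can be sketched like this. sequence_name is a hypothetical helper that only illustrates the documented naming behavior.

```python
import re

# Sketch of %n...% sequence-number expansion; sequence_name is hypothetical.
# start/step stand in for sequencestart and incrementsequenceby.
def sequence_name(template, file_index, start=0, step=1):
    seq = start + file_index * step
    def sub(match):
        width = len(match.group(1))      # one digit per n in the token
        return f"{seq:0{width}d}"
    return re.sub(r"%(n+)%", sub, template)

print([sequence_name("batch_%nnn%_data.csv", i) for i in range(2)])
# → ['batch_000_data.csv', 'batch_001_data.csv']
print(sequence_name("batch_%nnn%_data.csv", 1, start=10, step=10))
# → batch_020_data.csv
```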
%<time format>%, %epochtime%, %epochtimems%
The formatted file creation time may be included in the file name by including the %<time format>% token in the writer's filename property value:
filename:'batch_%yyyy-MM-dd_HH_mm_ss%_data.csv'
With that setting, a file rolled over at 1:00 pm on June 1, 2016, would be named batch_2016-06-01_13_00_00_data.csv, one rolled over five minutes later would be named batch_2016-06-01_13_05_00_data.csv, and so on. You may use any valid Joda DateTimeFormat pattern.
The unformatted epoch file creation time may be included in the file name by using the %epochtime% token instead. For millisecond resolution, use %epochtimems%.
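For illustration, the same creation-time formatting can be reproduced in Python. This is a sketch only: it assumes the creation time is UTC, and it maps the Joda yyyy-MM-dd_HH_mm_ss pattern onto the equivalent strftime pattern.

```python
from datetime import datetime, timezone

# File creation time from the example above (assumed UTC for this sketch).
created = datetime(2016, 6, 1, 13, 0, 0, tzinfo=timezone.utc)

# %<time format>% equivalent: Joda yyyy-MM-dd_HH_mm_ss -> strftime codes.
name = f"batch_{created.strftime('%Y-%m-%d_%H_%M_%S')}_data.csv"
print(name)  # → batch_2016-06-01_13_00_00_data.csv

epoch = int(created.timestamp())            # %epochtime% equivalent
epoch_ms = int(created.timestamp() * 1000)  # %epochtimems% equivalent
print(epoch, epoch_ms)
```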
Examples combining multiple options
The following are a few examples of the many ways these options can be combined.
properties | behavior |
---|---|
rolloverpolicy:'eventcount:10000,interval:1h' | starts a new file after one hour has passed since the current file was created or after 10,000 events have been written to it, whichever happens first |
rolloverpolicy:'interval:1h,filelimit:24' | starts a new file every hour and retains only the 24 most recent files |
filename:'batch_%nnn%_data.csv', rolloverpolicy:'eventcount:10000' | starts a new file after every 10,000 events, naming the files batch_000_data.csv, batch_001_data.csv, and so on |
See also Kafka Writer for discussion of how to write to specific partitions based on field values.