HDFS Reader

Reads files from Hadoop Distributed File System (HDFS) volumes. You can create HDFSReader sources in the web UI using Source Preview.

See Supported reader-parser combinations for parsing options.

The output type is WAEvent except when using JSONParser.

HDFS Reader properties

property

type

default value

notes

Authentication Policy

String

If the HDFS cluster uses Kerberos authentication, provide credentials in the format Kerberos, Principal:<Kerberos principal name>, KeytabPath:<fully qualified keytab file name>. Otherwise, leave blank. For example: authenticationpolicy:'Kerberos, Principal:nn/ironman@EXAMPLE.COM, KeytabPath:/etc/security/keytabs/nn.service.keytab'

Compression Type

String

Set to gzip when wildcard specifies a file or files in gzip format. Otherwise, leave blank.

Directory

String

Optional. The directory from which the files specified by the Wildcard property are read. If not specified, files are read relative to the Hadoop URL.

EOF Delay

Integer

100

Milliseconds to wait after reaching the end of a file before starting the next read operation.

Hadoop Configuration Path

String

If using Kerberos authentication, specify the path to Hadoop configuration files such as core-site.xml and hdfs-site.xml. If this path is incorrect or the configuration changes, authentication may fail.

Hadoop URL

String

The URI for the HDFS cluster NameNode. See below for an example. The default HDFS NameNode IPC port is 8020 or 9000 (depending on the distribution). Port 50070 is for the web UI and should not be specified here.

For an HDFS cluster with high availability, use the value of the dfs.nameservices property from hdfs-site.xml with the syntax hadoopurl:'hdfs://<value>', for example, hadoopurl:'hdfs://mycluster'. When the current NameNode fails, Striim will automatically connect to the next one.

In MapRFSReader, you may start the URL with hdfs:// or maprfs:/// (there is no functional difference).

Include Subdirectories

Boolean

False

Set to True to read files in subdirectories. 

Position by EOF

Boolean

True

If set to True, reading starts at the end of the file, so only new data is acquired. If set to False, reading starts at the beginning of the file and then continues with new data.

Rollover Style

String

Default

Do not change.

Skip BOM

Boolean

True

If set to True, when the wildcard value specifies multiple files, Striim will read the Byte Order Mark (BOM) in the first file and skip the BOM in all other files. If set to False, it will read the BOM in every file.

Wildcard

String

Name of the file, or a wildcard pattern to match multiple files (for example, *.xml).
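
As a hedged illustration of how the Kerberos-related properties fit together, the following sketch configures a reader for a Kerberos-secured, high-availability cluster. It assumes that TQL property names are the names in the table with spaces removed (as with authenticationpolicy and hadoopurl). The source name, the configuration path /etc/hadoop/conf, the nameservice mycluster, the choice of DSVParser with a header row, and the output stream name are hypothetical placeholders. The principal and keytab path are the example values from the Authentication Policy row.

```sql
-- Sketch only: names and paths below are placeholders, not values from a real cluster.
-- hadoopurl uses the dfs.nameservices value (assumed here to be "mycluster").
-- hadoopconfigurationpath is assumed to point at the directory that holds
-- core-site.xml and hdfs-site.xml.
CREATE SOURCE SecureHDFSSource USING HDFSReader (
  hadoopurl:'hdfs://mycluster',
  hadoopconfigurationpath:'/etc/hadoop/conf',
  authenticationpolicy:'Kerberos, Principal:nn/ironman@EXAMPLE.COM, KeytabPath:/etc/security/keytabs/nn.service.keytab',
  wildcard:'posdata.csv',
  positionbyeof:false
)
PARSE USING DSVParser (
  header:Yes
)
OUTPUT TO SecureCsvStream;
```

If authentication fails with a configuration like this, the Hadoop Configuration Path value is the first thing to check, since the notes above say that an incorrect path or a changed configuration may cause failures.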

HDFS Reader example

CREATE SOURCE CSVSource USING HDFSReader (
  hadoopurl:'hdfs://myserver:9000/',
  WildCard:'posdata.csv',
  positionByEOF:false
)
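
The statement above shows only the reader properties. A complete source definition typically also names a parser and an output stream. The following is a minimal sketch under stated assumptions rather than a definitive configuration: it reads gzip-compressed CSV files and assumes the TQL property names are the table names with spaces removed. The source name, the directory /data/pos, the wildcard pattern, the DSVParser settings, and the stream name CsvStream are all hypothetical.

```sql
-- Sketch only: the directory, wildcard, parser options, and stream name are placeholders.
-- compressiontype is set to gzip because the wildcard matches gzip files.
-- positionbyeof:false reads each file from the beginning instead of only new data.
CREATE SOURCE GzipCSVSource USING HDFSReader (
  hadoopurl:'hdfs://myserver:9000/',
  directory:'/data/pos',
  wildcard:'posdata*.csv.gz',
  compressiontype:'gzip',
  includesubdirectories:true,
  positionbyeof:false
)
PARSE USING DSVParser (
  header:Yes
)
OUTPUT TO CsvStream;
```

See Supported reader-parser combinations for the parsers that can be paired with HDFSReader. DSVParser is used here only because the example file is CSV.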