HDFS Reader

Reads files from Hadoop Distributed File System (HDFS) volumes. You can create HDFSReader sources in the web UI using Source Preview.

See Supported reader-parser combinations for parsing options.

The output type is WAevent except when using JSONParser.

HDFS Reader properties

property	type	default value	notes
Authentication Policy	String		If the HDFS cluster uses Kerberos authentication, provide credentials in the format `Kerberos, Principal:<Kerberos principal name>, KeytabPath:<fully qualified keytab file name>`. Otherwise, leave blank. For example: `authenticationpolicy:'Kerberos, Principal:nn/ironman@EXAMPLE.COM, KeytabPath:/etc/security/keytabs/nn.service.keytab'`
Compression Type	String		Set to `gzip` when `wildcard` specifies a file or files in gzip format. Otherwise, leave blank.
Directory	String		Optional. The directory from which the files specified by the Wildcard property are read. If left blank, files are read relative to the Hadoop URL.
EOF Delay	Integer	100	milliseconds to wait after reaching the end of a file before starting the next read operation
Hadoop Configuration Path	String		If using Kerberos authentication, specify the path to Hadoop configuration files such as core-site.xml and hdfs-site.xml. If this path is incorrect or the configuration changes, authentication may fail.
Hadoop URL	String		The URI for the HDFS cluster NameNode. See below for an example. The default HDFS NameNode IPC port is 8020 or 9000, depending on the distribution. Port 50070 is for the web UI and should not be specified here. For an HDFS cluster with high availability, use the value of the dfs.nameservices property from hdfs-site.xml with the syntax `hadoopurl:'hdfs://<value>'`, for example, `hadoopurl:'hdfs://mycluster'`. When the current NameNode fails, Striim will automatically connect to the next one. In MapRFSReader, you may start the URL with `hdfs://` or `maprfs:///` (there is no functional difference).
Include Subdirectories	Boolean	False	Set to True to read files in subdirectories.
Position by EOF	Boolean	True	If set to True, reading starts at the end of the file, so only new data is acquired. If set to False, reading starts at the beginning of the file and then continues with new data.
Rollover Style	String	Default	Do not change.
Skip BOM	Boolean	True	If set to True, when the wildcard value specifies multiple files, Striim will read the Byte Order Mark (BOM) in the first file and skip the BOM in all other files. If set to False, it will read the BOM in every file.
Wildcard	String		Specify the name of a file, or a wildcard pattern to match multiple files (for example, *.xml). Do not modify this property when recovery is enabled for the application.

HDFS Reader example

CREATE SOURCE CSVSource USING HDFSReader (
  hadoopurl:'hdfs://myserver:9000/',
  WildCard:'posdata.csv',
  positionByEOF:false
)
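
The properties described above can be combined for a Kerberos-secured, high-availability cluster. The following is a minimal sketch, not a tested configuration. It assumes that the TQL property name for Hadoop Configuration Path is `hadoopconfigurationpath`, that `/etc/hadoop/conf` holds core-site.xml and hdfs-site.xml, that `mycluster` is the cluster's dfs.nameservices value, and that the files are delimited text suitable for DSVParser with a header row. The source name, stream name, and parser options are illustrative.

```
-- Sketch only: nameservice, paths, principal, and names are placeholders.
CREATE SOURCE SecureCSVSource USING HDFSReader (
  hadoopurl:'hdfs://mycluster',
  hadoopconfigurationpath:'/etc/hadoop/conf',
  authenticationpolicy:'Kerberos, Principal:nn/ironman@EXAMPLE.COM, KeytabPath:/etc/security/keytabs/nn.service.keytab',
  wildcard:'*.csv',
  positionByEOF:false
)
PARSE USING DSVParser (
  header:Yes
)
OUTPUT TO SecureCSVStream;
```

Because `positionByEOF` is false, reading starts at the beginning of each matching file. With a parser other than JSONParser, the output stream is of type WAevent.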
