Striim 3.10.1 documentation

Hive Writer

Writes to one or more tables in Apache Hive.

When the input stream of a HiveWriter target is of a user-defined type, it can write to Hive tables that use Avro, ORC, Parquet, or text file storage formats, and writes use SQL APPEND or INSERT INTO.

When the input stream is the output stream of a DatabaseReader or CDC source:

• and the Mode is initialload, the storage format may be Avro, ORC, Parquet, or text file, and writes use SQL APPEND or INSERT INTO.

• the Mode is incremental, and the storage format is Avro or Parquet, writes use SQL APPEND or INSERT INTO.

• the Mode is incremental, and the storage format is ORC, Hive ACID transactions must be enabled and writes use SQL MERGE (which your version of Hive must support). In this case, there will be no duplicate events written to Hive ("exactly-once processing") after an application crash and recovery (Recovering applications), as may happen when using SQL APPEND or INSERT INTO.

Limitations:

• When the input stream is the output steam of a CDC reader, the reader's Compression property must be False.

• DDL is not supported. If you need to alter the source tables, quiesce the application, change the source and target tables, and restart.

• Columns specified in the Tables property's keycolumns option may not be updated. Any attempted update will be silently discarded.

• Bucketed or partitioned columns may not be updated. This is a limitation of Hive, not HiveWriter.

• Multiple instances of HiveWriter cannot write to the same table. When a HiveWriter target is deployed on multiple Striim servers, partition the input stream or use an environment variable in table mappings to ensure that they do not write to the same tables.

property

type

default value

notes

Authentication Policy

String

If the HDFS cluster uses Kerberos authentication, provide credentials in the format Kerberos, Principal:<Kerberos principal name>, KeytabPath:<fully qualified keytab file name>. Otherwise, leave blank. For example: authenticationpolicy:'Kerberos, Principal:nn/ironman@EXAMPLE.COM, KeytabPath:/etc/security/keytabs/nn.service.keytab'

Connection URL

String

the JDBC connection URL, for example, ConnectionURL= 'jdbc:hive2:@192.0.2.5:10000'

Directory

String

By default, Striim will create an HDFS directory on the Hive server to use as a staging area. If Striim does not have permission to create the necessary directory, HiveWriter will crash with a "File Not Found" exception. To resolve that issue, create a staging directory manually and specify it here..

String

If using Kerberos authentication, specify the path to Hadoop configuration files such as core-site.xml and hdfs-site.xml. If this path is incorrect or the configuration changes, authentication may fail.

String

The URI for the HDFS cluster NameNode. See below for an example. The default HDFS NameNode IPC port is 8020 or 9000 (depending on the distribution). Port 50070 is for the web UI and should not be specified here.

For an HDFS cluster with high availability, use the value of the dfs.nameservices property from hdfs-site.xml with the syntax hadoopurl:'hdfs://<value>', for example, hdfs://'mycluster'.  When the current NameNode fails, Striim will automatically connect to the next one.

Ignorable Exception Code

Striim

Set to TABLE_NOT_FOUND if you do not want the application to crash when Striim tries to write to a table that does not exist in the target database. See Handling "table not found" errors for more information.

Merge Policy

String

eventcount:10000, interval:5m

With the default setting, events are written every five minutes or sooner if there are 10,000 events.

Mode

String

incremental

With an input stream of a user-defined type, do not change the default. See Replicating Oracle data to Hive.

Integer

Tables

String

The name(s) of the table(s) to write to. The table(s) must exist in Hive.

When the input stream of the HiveWriter target is the output of a CDC reader or DatabaseReader source (that is, when replicating data from one database to another), HiveWriter can write to multiple tables. In this case, specify the names of both the source and target tables.

Since HIve does not have primary keys, you must use the keycolumns option to define a unique identifier for each row in the target table: for example, Tables:'DEMO.EMPLOYEE,employee keycolumns(emp_id)'. If necessary to ensure uniqueness, specify multiple columns with the syntax keycolumns(<column 1>,<column 2>,...). You may use wildcards for the source table provided all the tables have the key columns: for example, ables:'DEMO.Ora%,HIVE.Hiv% KeyColumns(...)'.

See Mapping columns for additional options.

String

A Hive user for the server specified in ConnectionURL. The user must have INSERT, UPDATE, DELETE, TRUNCATE, CREATE, DROP, and ALTER privileges on the specified tables.

The following sample code writes data from PosDataPreview.csv  to Hive  (to run this code, you must first create the target table in Hive):

CREATE SOURCE PosSource USING FileReader (
directory:'Samples/PosApp/AppData',
wildcard:'PosDataPreview.csv',
positionByEOF:false
)
PARSE USING DSVParser (
)
OUTPUT TO RawStream;

CREATE CQ CsvToPosData
INSERT INTO PosDataStream
SELECT TO_STRING(data[1]) as merchantId,
TO_DATEF(data[4],'yyyyMMddHHmmss') as dateTime,
TO_DOUBLE(data[7]) as amount,
TO_STRING(data[9]) as zip
FROM RawStream;

CREATE TARGET HiveSample USING HiveWriter (
ConnectionURL:'jdbc:hive2://192.0.2.76:10000',
Tables:'posdata'
)
FORMAT USING DSVFormatter ()
INPUT FROM PosDataStream;
HiveWriter data type support and correspondence

When the input stream is of a user-defined type:

TQL type

Hive type

java.lang.Byte

BINARY

java.lang.Double

DOUBLE

java.lang.Float

FLOAT

java.lang.Integer

INTEGER

java.lang. Long

BIGINT

java.lang.Short

SMALLINT, TINYINT

java.lang.String

CHAR, DECIMAL, INTERVAL, NUMERIC, STRING, VARCHAR

org.joda.time.DateTime

TIMESTAMP

When the input stream of a HiveWriter target is the output of an Oracle source (DatabaseReader or OracleReader):

Oracle type

Hive type

BINARY_DOUBLE

DOUBLE

BINARY_FLOAT

FLOAT

BLOB

BINARY

CHAR

CHAR, STRING, VARCHAR

CLOB

STRING

DATE

TIMESTAMP

DECIMAL

DECIMAL, DOUBLE, FLOAT

FLOAT

FLOAT

INTEGER

BIGINT, INT, SMALLINT TINYINT

LONG

BIGINT

NCHAR

STRING, VARCHAR

NUMBER

• INT when the scale is 0 and the precision is less than 10

• BIGINT when the scale is 0 and the precision is less than 19

• DECIMAL when the scale is greater than 0 or the precision is greater than 19

NVARCHAR2

STRING, VARCHAR

SMALLINT

SMALLINT

TIMESTAMP

TIMESTAMP

TIMESTAMP WITH LOCAL TIME ZONE

TIMESTAMP

TIMESTAMP WITH TIME ZONE

TIMESTAMP

VARCHAR2

STRING, VARCHAR