Hive Writer

Writes to one or more tables in Apache Hive.

When the input stream of a HiveWriter target is of a user-defined type, it can write to Hive tables that use Avro, ORC, Parquet, or text file storage formats, and writes use SQL APPEND or INSERT INTO.

When the input stream is the output stream of a DatabaseReader or CDC source:

  • If the Mode is InitialLoad, the storage format may be Avro, ORC, Parquet, or text file, and writes use SQL APPEND or INSERT INTO.

  • If the Mode is Incremental and the storage format is Avro or Parquet, writes use SQL APPEND or INSERT INTO.

  • If the Mode is Incremental and the storage format is ORC, Hive ACID transactions must be enabled and writes use SQL MERGE (which your version of Hive must support). In this case, no duplicate events will be written to Hive after recovery ("exactly-once processing"; see Recovering applications), as may happen when using SQL APPEND or INSERT INTO.
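To illustrate the third case, the following sketch shows a HiveWriter target in Incremental mode writing to an ORC table with ACID transactions enabled (the stream name, table names, addresses, and credentials are hypothetical; your source and environment will differ):

CREATE TARGET HiveMergeTarget USING HiveWriter (
  ConnectionURL:'jdbc:hive2://192.0.2.5:10000',
  Username:'hiveuser',
  Password:'********',
  hadoopurl:'hdfs://192.0.2.5:8020/',
  Mode:'incremental',
  Tables:'MYSCHEMA.EMP,employee keycolumns(emp_id)'
)
INPUT FROM OracleCDCStream;

Because each target row is identified by emp_id, recovery can use SQL MERGE rather than re-appending events.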

Limitations:

  • When the input stream is the output stream of a CDC reader, the reader's Compression property must be False.

  • DDL is not supported. If you need to alter the source tables, quiesce the application, change the source and target tables, and restart.

  • Columns specified in the Tables property's keycolumns option may not be updated. Any attempted update will be silently discarded.

  • Bucketed or partitioned columns may not be updated. This is a limitation of Hive, not HiveWriter.

  • Multiple instances of HiveWriter cannot write to the same table. When a HiveWriter target is deployed on multiple Striim servers, partition the input stream or use an environment variable in table mappings to ensure that they do not write to the same tables.

Hive Writer properties

Authentication Policy (String)

If the HDFS cluster uses Kerberos authentication, provide credentials in the format Kerberos, Principal:<Kerberos principal name>, KeytabPath:<fully qualified keytab file name>. Otherwise, leave blank. For example: authenticationpolicy:'Kerberos, Principal:nn/ironman@EXAMPLE.COM, KeytabPath:/etc/security/keytabs/nn.service.keytab'

Connection URL (String)

The JDBC connection URL, for example, ConnectionURL:'jdbc:hive2://192.0.2.5:10000'

Directory (String)

By default, Striim will create an HDFS directory on the Hive server to use as a staging area. If Striim does not have permission to create the necessary directory, HiveWriter will terminate with a "File Not Found" exception. To resolve that issue, create a staging directory manually and specify it here.

Hadoop Configuration Path (String)

If using Kerberos authentication, specify the path to Hadoop configuration files such as core-site.xml and hdfs-site.xml. If this path is incorrect or the configuration changes, authentication may fail.

Hadoop URL (String)

The URI for the HDFS cluster NameNode. See the sample application below for an example. The default HDFS NameNode IPC port is 8020 or 9000 (depending on the distribution). Port 50070 is the web UI port and should not be specified here.

For an HDFS cluster with high availability, use the value of the dfs.nameservices property from hdfs-site.xml with the syntax hadoopurl:'hdfs://<value>', for example, hadoopurl:'hdfs://mycluster'. When the current NameNode fails, Striim will automatically connect to the next one.

Ignorable Exception Code (String)

Set to TABLE_NOT_FOUND to prevent the application from terminating when Striim tries to write to a table that does not exist in the target. See Handling "table not found" errors for more information.

Ignored exceptions will be written to the application's exception store (see CREATE EXCEPTIONSTORE).

Merge Policy (String; default: eventcount:10000, interval:5m)

With the default setting, events are written every five minutes, or sooner if 10,000 events have accumulated.
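For example, to stage larger batches and flush less often, you could override the defaults (the values here are illustrative, not recommendations):

MergePolicy:'eventcount:50000, interval:10m'

With this setting, events are written every ten minutes, or sooner if 50,000 events have accumulated.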

Mode (String; default: incremental)

With an input stream of a user-defined type, do not change the default. See Replicating Oracle data to Hive.

Password (encrypted password)

The password for the specified user. See Encrypted passwords.

Parallel Threads (Integer)

See Creating multiple writer instances.

Tables (String)

The name(s) of the table(s) to write to. The table(s) must exist in Hive.

When the input stream of the target is the output of a DatabaseReader, IncrementalBatchReader, or SQL CDC source (that is, when replicating data from one database to another), HiveWriter can write to multiple tables. In this case, specify the names of both the source and target tables. You may use the % wildcard only for table names, not for schemas or databases. If the reader uses three-part names, you must use them here as well.

Note that Oracle CDB/PDB source table names must be specified in two parts when the source is Database Reader or Incremental Batch Reader (schema.%,schema.%) but in three parts when the source is Oracle Reader or OJet (database.schema.%,schema.%). SQL Server source table names must be specified in three parts when the source is Database Reader or Incremental Batch Reader (database.schema.%,schema.%) but in two parts when the source is MS SQL Reader or MS Jet (schema.%,schema.%).

Examples:

source.emp,target.emp
source.db1,target.db1;source.db2,target.db2
source.%,target.%
source.mydatabase.emp%,target.mydb.%
source1.%,target1.%;source2.%,target2.%

Since Hive does not have primary keys, you must use the keycolumns option to define a unique identifier for each row in the target table: for example, Tables:'DEMO.EMPLOYEE,employee keycolumns(emp_id)'. If necessary to ensure uniqueness, specify multiple columns with the syntax keycolumns(<column 1>,<column 2>,...). You may use wildcards for the source table provided all the tables have the key columns: for example, Tables:'DEMO.Ora%,HIVE.Hiv% KeyColumns(...)'.

See Mapping columns for additional options.

Username (String)

A Hive user for the server specified in ConnectionURL. The user must have INSERT, UPDATE, DELETE, TRUNCATE, CREATE, DROP, and ALTER privileges on the specified tables.

Hive Writer sample application

The following sample code writes data from PosDataPreview.csv to Hive (to run this code, you must first create the target table in Hive):
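A minimal target table could be created as follows (a sketch: the column types and the delimited text storage format are assumptions matching the DSVFormatter output and the fields selected by the CsvToPosData CQ, not part of the sample itself):

CREATE TABLE posdata (
  merchantId STRING,
  dateTime TIMESTAMP,
  amount DOUBLE,
  zip STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;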

CREATE SOURCE PosSource USING FileReader (
  directory:'Samples/PosApp/AppData',
  wildcard:'PosDataPreview.csv',
  positionByEOF:false
)
PARSE USING DSVParser (
  header:yes
)
OUTPUT TO RawStream;

CREATE CQ CsvToPosData
INSERT INTO PosDataStream
SELECT TO_STRING(data[1]) as merchantId,
  TO_DATEF(data[4],'yyyyMMddHHmmss') as dateTime,
  TO_DOUBLE(data[7]) as amount,
  TO_STRING(data[9]) as zip
FROM RawStream;

CREATE TARGET HiveSample USING HiveWriter (
  ConnectionURL:'jdbc:hive2://192.0.2.76:10000',
  Username:'hiveuser', 
  Password:'********',
  hadoopurl:'hdfs://192.0.2.76:9000/',
  Tables:'posdata'
)
FORMAT USING DSVFormatter ()
INPUT FROM PosDataStream;

HiveWriter data type support and correspondence

When the input stream is of a user-defined type:

TQL type → Hive type

java.lang.Byte → BINARY
java.lang.Double → DOUBLE
java.lang.Float → FLOAT
java.lang.Integer → INTEGER
java.lang.Long → BIGINT
java.lang.Short → SMALLINT, TINYINT
java.lang.String → CHAR, DECIMAL, INTERVAL, NUMERIC, STRING, VARCHAR
org.joda.time.DateTime → TIMESTAMP

When the input stream of a HiveWriter target is the output of an Oracle source (DatabaseReader or OracleReader):

Oracle type → Hive type

BINARY_DOUBLE → DOUBLE
BINARY_FLOAT → FLOAT
BLOB → BINARY
CHAR → CHAR, STRING, VARCHAR
CLOB → STRING
DATE → TIMESTAMP
DECIMAL → DECIMAL, DOUBLE, FLOAT
FLOAT → FLOAT
INTEGER → BIGINT, INT, SMALLINT, TINYINT
LONG → BIGINT
NCHAR → STRING, VARCHAR
NUMBER →
  • INT when the scale is 0 and the precision is less than 10
  • BIGINT when the scale is 0 and the precision is less than 19
  • DECIMAL when the scale is greater than 0 or the precision is greater than 19
NVARCHAR2 → STRING, VARCHAR
SMALLINT → SMALLINT
TIMESTAMP → TIMESTAMP
TIMESTAMP WITH LOCAL TIME ZONE → TIMESTAMP
TIMESTAMP WITH TIME ZONE → TIMESTAMP
VARCHAR2 → STRING, VARCHAR
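Applying the NUMBER rules above to a hypothetical Oracle table (column names and definitions are illustrative only):

-- Oracle source (hypothetical column definitions)
CREATE TABLE accounts (
  id        NUMBER(8,0),   -- scale 0, precision < 10  → Hive INT
  serial_no NUMBER(18,0),  -- scale 0, precision < 19  → Hive BIGINT
  balance   NUMBER(10,2)   -- scale > 0                → Hive DECIMAL
);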

Cloudera Hive Writer

ClouderaHiveWriter is identical to HiveWriter except that, since Cloudera's Hive distribution does not support SQL MERGE, there is no Mode property. ClouderaHiveWriter is always in InitialLoad mode, and writes always use SQL APPEND or INSERT INTO.

See Hive Writer for further discussion and documentation of the properties.

Hortonworks Hive Writer

Except for the name of the adapter, HortonworksHiveWriter is identical to HiveWriter. See Hive Writer for documentation of the properties.