Hive Writer
Writes to one or more tables in Apache Hive.
When the input stream of a HiveWriter target is of a user-defined type, it can write to Hive tables that use Avro, ORC, Parquet, or text file storage formats, and writes use SQL APPEND or INSERT INTO.
When the input stream is the output stream of a DatabaseReader or CDC source:

- When the Mode is InitialLoad, the storage format may be Avro, ORC, Parquet, or text file, and writes use SQL APPEND or INSERT INTO.
- When the Mode is Incremental and the storage format is Avro or Parquet, writes use SQL APPEND or INSERT INTO.
- When the Mode is Incremental and the storage format is ORC, Hive ACID transactions must be enabled and writes use SQL MERGE (which your version of Hive must support). In this case, no duplicate events will be written to Hive after recovery ("exactly-once processing"; see Recovering applications), as may happen when using SQL APPEND or INSERT INTO.
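The MERGE path described above can be pictured with a statement of the following shape. This is an illustrative sketch only; the table, column, and key names are hypothetical, and the actual statement HiveWriter generates depends on your table mappings and keycolumns.

```sql
-- Illustrative only: a MERGE of the kind used for Incremental mode with
-- ORC storage on an ACID-enabled table. Names are hypothetical.
MERGE INTO target_db.emp AS t
USING staging_db.emp_batch AS s
ON t.emp_id = s.emp_id                 -- the keycolumns from the Tables property
WHEN MATCHED AND s.op_type = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, salary = s.salary
WHEN NOT MATCHED THEN INSERT VALUES (s.emp_id, s.name, s.salary);
```

Because MERGE updates or inserts each row based on its key rather than blindly appending, replaying a batch after recovery does not create duplicates, which is what makes exactly-once processing possible on this path.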
Limitations:
When the input stream is the output stream of a CDC reader, the reader's Compression property must be False.
DDL is not supported. If you need to alter the source tables, quiesce the application, change the source and target tables, and restart.
Columns specified in the Tables property's keycolumns option may not be updated. Any attempted update will be silently discarded.
Bucketed or partitioned columns may not be updated. This is a limitation of Hive, not HiveWriter.
Multiple instances of HiveWriter cannot write to the same table. When a HiveWriter target is deployed on multiple Striim servers, partition the input stream or use an environment variable in table mappings to ensure that they do not write to the same tables.
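As a sketch of the partitioning approach, the input stream of the target could be declared with PARTITION BY so that each server in the deployment group receives a distinct subset of events (the stream, type, and field names here are hypothetical):

```sql
-- Hypothetical: partition the writer's input stream by a field so that
-- events are distributed deterministically across servers.
CREATE STREAM PosDataStream OF PosDataType PARTITION BY merchantId;
```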
Hive Writer properties
property | type | default value | notes |
---|---|---|---|
Authentication Policy | String | | If the HDFS cluster uses Kerberos authentication, provide credentials in the format |
Connection URL | String | | The JDBC connection URL, for example, |
Directory | String | | By default, Striim creates an HDFS directory on the Hive server to use as a staging area. If Striim does not have permission to create that directory, HiveWriter will terminate with a "File Not Found" exception. To resolve the issue, create a staging directory manually and specify it here. |
Hadoop Configuration Path | String | | If using Kerberos authentication, specify the path to Hadoop configuration files such as core-site.xml and hdfs-site.xml. If this path is incorrect or the configuration changes, authentication may fail. |
Hadoop URL | String | | The URI for the HDFS cluster NameNode. See below for an example. The default HDFS NameNode IPC port is 8020 or 9000 (depending on the distribution). Port 50070 is for the web UI and should not be specified here. For an HDFS cluster with high availability, use the value of the dfs.nameservices property from hdfs-site.xml with the syntax |
Ignorable Exception Code | String | | Set to TABLE_NOT_FOUND to prevent the application from terminating when Striim tries to write to a table that does not exist in the target. See Handling "table not found" errors for more information. Ignored exceptions will be written to the application's exception store (see CREATE EXCEPTIONSTORE). |
Merge Policy | String | eventcount:10000, interval:5m | With the default setting, events are written every five minutes or sooner if there are 10,000 events. |
Mode | String | incremental | With an input stream of a user-defined type, do not change the default. See Replicating Oracle data to Hive. |
Password | encrypted password | | The password for the specified user. See Encrypted passwords. |
Parallel Threads | Integer | ||
Tables | String | | The name(s) of the table(s) to write to. The table(s) must exist in Hive. When the input stream of the target is the output of a DatabaseReader, IncrementalBatchReader, or SQL CDC source (that is, when replicating data from one database to another), it can write to multiple tables. In this case, specify the names of both the source and target tables, for example, source.emp,target.emp or source.db1,target.db1;source.db2,target.db2. Wildcards are supported, for example, source.%,target.% or source.mydatabase.emp%,target.mydb.% or source1.%,target1.%;source2.%,target2.%. Since Hive does not have primary keys, you must use the keycolumns option. See Mapping columns for additional options. |
Username | String | | A Hive user for the server specified in ConnectionURL. The user must have INSERT, UPDATE, DELETE, TRUNCATE, CREATE, DROP, and ALTER privileges on the specified tables. |
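For the high-availability case described under Hadoop URL, the NameNode host is replaced by the nameservice ID defined in hdfs-site.xml. A minimal sketch, assuming a hypothetical nameservice named mycluster:

```xml
<!-- hdfs-site.xml (excerpt); "mycluster" is a hypothetical nameservice ID -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
```

The Hadoop URL would then typically be written as hdfs://mycluster rather than pointing at a single NameNode host and port.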
Hive Writer sample application
The following sample code writes data from PosDataPreview.csv
to Hive (to run this code, you must first create the target table in Hive):
```
CREATE SOURCE PosSource USING FileReader (
  directory:'Samples/PosApp/AppData',
  wildcard:'PosDataPreview.csv',
  positionByEOF:false
)
PARSE USING DSVParser (
  header:yes
)
OUTPUT TO RawStream;

CREATE CQ CsvToPosData
INSERT INTO PosDataStream
SELECT TO_STRING(data[1]) as merchantId,
  TO_DATEF(data[4],'yyyyMMddHHmmss') as dateTime,
  TO_DOUBLE(data[7]) as amount,
  TO_STRING(data[9]) as zip
FROM RawStream;

CREATE TARGET HiveSample USING HiveWriter (
  ConnectionURL:'jdbc:hive2://192.0.2.76:10000',
  Username:'hiveuser',
  Password:'********',
  hadoopurl:'hdfs://192.0.2.76:9000/',
  Tables:'posdata'
)
FORMAT USING DSVFormatter ()
INPUT FROM PosDataStream;
```
HiveWriter data type support and correspondence
When the input stream is of a user-defined type:
TQL type | Hive type |
---|---|
java.lang.Byte | BINARY |
java.lang.Double | DOUBLE |
java.lang.Float | FLOAT |
java.lang.Integer | INTEGER |
java.lang.Long | BIGINT |
java.lang.Short | SMALLINT, TINYINT |
java.lang.String | CHAR, DECIMAL, INTERVAL, NUMERIC, STRING, VARCHAR |
org.joda.time.DateTime | TIMESTAMP |
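As a hedged example of this correspondence, the sample application's stream type could be declared as follows, with a Hive table whose columns line up with the mapping table above (the type name, field names, and Hive DDL are hypothetical):

```sql
-- A Striim type whose fields map to Hive types per the table above.
CREATE TYPE PosDataType (
  merchantId java.lang.String,        -- maps to STRING
  dateTime   org.joda.time.DateTime,  -- maps to TIMESTAMP
  amount     java.lang.Double,        -- maps to DOUBLE
  zip        java.lang.String         -- maps to STRING
);
-- A compatible Hive table might be created as:
-- CREATE TABLE posdata (merchantId STRING, dateTime TIMESTAMP,
--                       amount DOUBLE, zip STRING) STORED AS ORC;
```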
When the input stream of a HiveWriter target is the output of an Oracle source (DatabaseReader or OracleReader):
Oracle type | Hive type |
---|---|
BINARY_DOUBLE | DOUBLE |
BINARY_FLOAT | FLOAT |
BLOB | BINARY |
CHAR | CHAR, STRING, VARCHAR |
CLOB | STRING |
DATE | TIMESTAMP |
DECIMAL | DECIMAL, DOUBLE, FLOAT |
FLOAT | FLOAT |
INTEGER | BIGINT, INT, SMALLINT, TINYINT |
LONG | BIGINT |
NCHAR | STRING, VARCHAR |
NUMBER | |
NVARCHAR2 | STRING, VARCHAR |
SMALLINT | SMALLINT |
TIMESTAMP | TIMESTAMP |
TIMESTAMP WITH LOCAL TIME ZONE | TIMESTAMP |
TIMESTAMP WITH TIME ZONE | TIMESTAMP |
VARCHAR2 | STRING, VARCHAR |
Cloudera Hive Writer
ClouderaHiveWriter is identical to HiveWriter except that, since Cloudera's Hive distribution does not support SQL MERGE, there is no Mode property. ClouderaHiveWriter is always in InitialLoad mode, and writes always use SQL APPEND or INSERT INTO.
See Hive Writer for further discussion and documentation of the properties.
Hortonworks Hive Writer
Except for the name of the adapter, HortonworksHiveWriter is identical to HiveWriter. See Hive Writer for documentation of the properties.