Free Form Text Parser
Use with a compatible reader to parse unstructured data, such as log files with events that span multiple lines, using a regular expression.
Free Form Text Parser properties
property | type | default value | notes |
---|---|---|---|
Block as Complete Record | Boolean | False | With JMSReader or UDPReader, if blockascompleterecord is set to True, the end of a block will be considered the end of the last record in the block, even if the row delimiter is missing. This does not change the default value with other readers. |
Charset | String | UTF-8 | |
Ignore Multiple Record Begin | Boolean | True | With the default setting of True, additional occurrences of the RecordBegin string before the next RecordEnd string will be ignored. Set to False to treat each occurrence of RecordBegin as the beginning of a new event. |
Record Begin | String | | Specify a string that defines the beginning of each event. The string may include date expressions and/or NOTE: RecordBegin does not support regex. |
Record End | String | | An optional string that defines the end of each event (see Using regular expressions (regex)). If a RecordEnd pattern starts with the NOTE: RecordEnd does not support regex. |
Regex | String | | A regular expression (regex) defining the beginning (RecordBegin) and end (RecordEnd) of each field to be included in the output. To apply more than one pattern, separate the patterns using the '|' character. For example: regex: '((^([\\w]+)) | ((?<=Author: ).*(\n)) | ((?<=\n\n).*))' NOTE: You cannot specify a regex pattern in a RecordBegin or RecordEnd string. |
Separator | String | ~ | The separator between multiple values in other properties. For example, if the end of the record could be specified by either "millisec" or "processed", then with the default separator ~ the RecordEnd value would be millisec~processed. |
Timestamp | String | | Defines the format of the timestamp in the source data. The values are output to the "EEE, d MMM yyyy HH:mm:ss Z" "EEE, MMM d, ''yy" "h:mm a" "hh 'o''clock' a, zzzz" "K:mm a, z" "yyMMddHHmmssZ" "YYYY-'W'ww-u" "yyyy-MM-dd'T'HH:mm:ss.SSSXXX" "yyyy-MM-dd'T'HH:mm:ss.SSSZ" "yyyy.MM.dd G HH:mm:ss z" "yyyyy.MMMMM.dd GGG hh:mm aaa" For more information, see the documentation for the Java class SimpleDateFormat. |
The output type of a source using FreeFormTextParser is WAEvent.
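The Separator property described in the table can be illustrated with a short Python sketch. This is not Striim code: the helper names `split_alternatives` and `find_record_end` are hypothetical, and the rule that the earliest-ending alternative closes the record is an assumption made only for illustration.

```python
def split_alternatives(value: str, separator: str = "~") -> list[str]:
    """Split a property value such as 'millisec~processed' into its alternatives."""
    return value.split(separator)


def find_record_end(text: str, record_end: str, separator: str = "~") -> int:
    """Return the index just past the earliest-ending alternative, or -1 if none occurs.

    Assumption for this sketch: whichever end string finishes first closes the record.
    """
    best = -1
    for marker in split_alternatives(record_end, separator):
        pos = text.find(marker)
        if pos != -1:
            end = pos + len(marker)
            if best == -1 or end < best:
                best = end
    return best


sample = "request 17 processed in 42 millisec"  # invented sample text
print(split_alternatives("millisec~processed"))
print(find_record_end(sample, "millisec~processed"))
```

With the default separator, `millisec~processed` yields two alternatives, and the sample record ends right after `processed` because that string occurs first.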
Free Form Text Parser example
    CREATE SOURCE fftpSource USING FileReader (
      directory:'Samples/',
      WildCard:'catalina*.log',
      charset:'UTF-8',
      positionByEOF:false
    )
    PARSE USING FreeFormTextParser (
      -- Timestamp format in log is "Aug 21, 2014 8:33:56 AM"
      TimeStamp:'%mon %d, %yyyy %H:%M:%S %p',
      RecordBegin:'%mon %d, %yyyy %H:%M:%S %p',
      regex:'(SEVERE:.*|WARNING:.*)'
    )
    OUTPUT TO fftpInStream;

    CREATE TYPE fftpOutType (
      msg String,
      origTs long
    );
    CREATE STREAM fftpOutStream OF fftpOutType;

    CREATE CQ fftpOutCQ
    INSERT INTO fftpOutStream
    SELECT data[0],
      TO_LONG(META(x,'OriginTimestamp'))
    FROM fftpInStream x;

    CREATE TARGET fftpTarget
    USING SysOut(name:fftpInfo)
    INPUT FROM fftpOutStream;
The RecordBegin value %mon %d, %yyyy %H:%M:%S %p defines the beginning of an event as a timestamp like the one at the beginning of the sample shown below. This is also the event timestamp, as defined by the TimeStamp value.
The regular expression (SEVERE:.*|WARNING:.*) looks in each event for a string starting with SEVERE or WARNING. If one is found, the parser returns everything until the next linefeed. If an event does not include SEVERE or WARNING, it is omitted from the output.
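The behavior described above can be approximated outside Striim with a short Python sketch. It is only an illustration: the `RECORD_BEGIN` regex is an assumed equivalent of the mask %mon %d, %yyyy %H:%M:%S %p, the function `parse_events` is hypothetical, and the log lines are invented sample data.

```python
import re

# Assumed regex equivalent of the RecordBegin mask, matching timestamps
# such as "Aug 21, 2014 8:33:56 AM" at the start of a line.
RECORD_BEGIN = re.compile(
    r"^[A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M", re.MULTILINE
)
# Same pattern as the regex property in the example; '.' stops at a linefeed.
FIELD = re.compile(r"(SEVERE:.*|WARNING:.*)")


def parse_events(log_text):
    """Split the text at each timestamp and keep the SEVERE/WARNING line of each event."""
    starts = [m.start() for m in RECORD_BEGIN.finditer(log_text)]
    results = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(log_text)
        match = FIELD.search(log_text[start:end])
        if match:  # events without SEVERE or WARNING are dropped
            results.append(match.group(1))
    return results


# Invented sample data: three events, only two of which match.
log_text = (
    "Aug 22, 2014 11:17:19 AM org.example.Demo log\n"
    "SEVERE: something failed\n"
    "    at org.example.Demo.run(Demo.java:10)\n"
    "Aug 22, 2014 11:17:20 AM org.example.Demo log\n"
    "INFO: all good\n"
    "Aug 22, 2014 11:17:21 AM org.example.Demo log\n"
    "WARNING: disk almost full\n"
)
print(parse_events(log_text))
```

The INFO event produces no output, and the stack-trace line after the SEVERE line is discarded because the match ends at the linefeed.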
The following is the beginning of one of the potentially very long log messages this application is designed to process:
    Aug 22, 2014 11:17:19 AM org.apache.solr.common.SolrException log
    SEVERE: org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: Cannot parse '((suggest_title:(04) AND suggest_title:(gmc) AND suggest_title: AND suggest_title:(6.6l) AND suggest_title:(lb7) AND suggest_title:(p1094,p0234)) AND NOT (deleted:(true)))': Encountered " <AND> "AND "" at line 1, column 64.
    Was expecting one of:
        <BAREOPER> ...
        "(" ...
        "*" ...
        <QUOTED> ...
        <TERM> ...
        <PREFIXTERM> ...
        <WILDTERM> ...
        <REGEXPTERM> ...
        "[" ...
        "{" ...
        <LPARAMS> ...
        <NUMBER> ...
        at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:147)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:187)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    ...
The parser discards everything except the line beginning with SEVERE, and the TO_LONG function in the CQ converts the log entry's timestamp to the format required by Striim:
    fftpInfo: fftpOutType_1_0{
      msg: "SEVERE: org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: Cannot parse '((suggest_title:(04) AND suggest_title:(gmc) AND suggest_title: AND suggest_title:(6.6l) AND suggest_title:(lb7) AND suggest_title:(p1094,p0234)) AND NOT (deleted:(true)))': Encountered \" <AND> \"AND \"\" at line 1, column 64."
      origTs: 1408731439000
    };
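The origTs value appears to be the log timestamp expressed as milliseconds since the Unix epoch. The Python sketch below checks that arithmetic; it is not Striim code, the function `to_epoch_millis` is hypothetical, and the UTC-7 offset is an assumption inferred from the numbers (the source does not state the server's time zone).

```python
from datetime import datetime, timedelta, timezone

# Assumption: the log was written in a UTC-7 zone (for example, US Pacific daylight time).
ASSUMED_ZONE = timezone(timedelta(hours=-7))


def to_epoch_millis(log_timestamp, zone=ASSUMED_ZONE):
    """Parse a timestamp like 'Aug 22, 2014 11:17:19 AM' and return epoch milliseconds."""
    # %b and %p depend on the locale; this sketch assumes an English locale.
    parsed = datetime.strptime(log_timestamp, "%b %d, %Y %I:%M:%S %p")
    return int(parsed.replace(tzinfo=zone).timestamp() * 1000)


print(to_epoch_millis("Aug 22, 2014 11:17:19 AM"))  # 1408731439000
```

Under the UTC-7 assumption, 11:17:19 AM local time is 18:17:19 UTC, which matches the origTs shown in the output.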