Free Form Text Parser

Use FreeFormTextParser with a compatible reader to parse unstructured data, such as log files with events that span multiple lines, using a regular expression.

Free Form Text Parser properties

property

type

default value

notes

Block as Complete Record

Boolean

False

With JMSReader or UDPReader, if blockascompleterecord is set to True, the end of a block will be considered the end of the last record in the block, even if the row delimiter is missing. Do not change the default value with other readers.

Charset

String

UTF-8

Ignore Multiple Record Begin

Boolean

True

With the default setting of True, additional occurrences of the RecordBegin string before the next RecordEnd string will be ignored. Set to False to treat each occurrence of RecordBegin as the beginning of a new event.

Record Begin

String

Specify a string that defines the beginning of each event. The string may include date expressions and/or %IP_ADDRESS% (which will match any IP address). If a RecordBegin pattern starts with the ^ character, the pattern will be excluded from the data.

NOTE: RecordBegin does not support regex.

Record End

String

An optional string that defines the end of each event (see Using regular expressions (regex)). If a RecordEnd pattern starts with the ^ character, the pattern will be excluded from the data.

NOTE: RecordEnd does not support regex.

Regex

String

A regular expression (regex) defining the beginning (RecordBegin) and end (RecordEnd) of each field to be included in the output. To apply more than one pattern, separate the patterns using the '|' character. For example:

regex: '((^([\\w]+)) |
((?<=Author: ).*(\n)) |
((?<=\n\n).*))'

For more information about regular expressions, refer to the following resources:

NOTE: You cannot specify a regex pattern in a RecordBegin or RecordEnd string.

Separator

String

~

The separator between multiple values in other properties. For example, if the end of the record could be specified by either "millisec" or "processed", then with the default separator ~ the RecordEnd value would be millisec~processed.

Timestamp

String

Defines the format of the timestamp in the source data. The values are output to the originTimeStamp key in WAEvent's metadata map and, as shown in the sample code below, can be retrieved using SELECT META(stream_name,'<originTimeStamp>'). Supported pattern strings are:

"EEE, d MMM yyyy HH:mm:ss Z" 
"EEE, MMM d, ''yy" 
"h:mm a" 
"hh 'o''clock' a, zzzz" 
"K:mm a, z" 
"yyMMddHHmmssZ" 
"YYYY-'W'ww-u"
"yyyy-MM-dd'T'HH:mm:ss.SSSXXX" 
"yyyy-MM-dd'T'HH:mm:ss.SSSZ" 
"yyyy.MM.dd G HH:mm:ss z" 
"yyyyy.MMMMM.dd GGG hh:mm aaa"

For more information, see the documentation for the Java class SimpleDateFormat.
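The Separator property described above can be pictured as a plain string split. The following Python sketch is only an illustration of that idea, not the parser's implementation: the function names and the sample text are hypothetical, and it assumes that whichever alternative appears first in the text ends the record.

```python
def split_alternatives(value, separator="~"):
    """Split a property value such as 'millisec~processed' into alternatives."""
    return value.split(separator)


def find_record_end(text, record_end, separator="~"):
    """Return the index just past the earliest alternative found, or -1.

    Assumption (not stated in the documentation): the alternative that
    starts earliest in the text is the one that ends the record.
    """
    best_start, best_end = -1, -1
    for alternative in split_alternatives(record_end, separator):
        start = text.find(alternative)
        if start != -1 and (best_start == -1 or start < best_start):
            best_start, best_end = start, start + len(alternative)
    return best_end


# Made-up sample text, for illustration only.
text = "request took 12 millisec\nnext line"
alternatives = split_alternatives("millisec~processed")
end = find_record_end(text, "millisec~processed")
record = text[:end]
```

With the default separator, `alternatives` holds the two strings from the example above, and `record` is the part of the sample text up to and including the first alternative that matched.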

The output type of a source using FreeFormTextParser is WAEvent.
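To make the Regex property's use of '|' more concrete, here is a minimal Python sketch of how several alternative patterns can each pick out one field from a multi-line event. It uses Python's re module purely for illustration, so details may differ from the regex engine the parser uses. The pattern is loosely modeled on the Regex example above (simplified, without the trailing newline capture), and the sample event is made up.

```python
import re

# Three alternatives joined with '|':
#   1. a word at the very start of the event
#   2. the rest of the line after "Author: "
#   3. the line that follows a blank line
PATTERN = re.compile(r"(^\w+)|((?<=Author: ).*)|((?<=\n\n).*)")

# Hypothetical sample event, not taken from the documentation.
event = "commit abc123\nAuthor: Jane Doe\nDate: Mon Aug 4\n\nFix parser bug"

# Each match of any alternative becomes one output field, in order.
fields = [match.group(0) for match in PATTERN.finditer(event)]
```

Here `fields` contains one entry per alternative that matched, which is roughly how the values later read as data[0], data[1], and so on can be pictured.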

Free Form Text Parser example

CREATE SOURCE fftpSource USING FileReader (
  directory:'Samples/',
  WildCard:'catalina*.log',
  charset:'UTF-8',
  positionByEOF:false
)
PARSE USING FreeFormTextParser (
  -- Timestamp format in log is "Aug 21, 2014 8:33:56 AM"
  TimeStamp:'%mon %d, %yyyy %H:%M:%S %p',
  RecordBegin:'%mon %d, %yyyy %H:%M:%S %p',
  regex:'(SEVERE:.*|WARNING:.*)'
)
OUTPUT TO fftpInStream;

CREATE TYPE fftpOutType (
  msg String,
  origTs long
);

CREATE STREAM fftpOutStream OF fftpOutType;

CREATE CQ fftpOutCQ
INSERT INTO fftpOutStream
SELECT data[0],
  TO_LONG(META(x,'OriginTimestamp'))
FROM fftpInStream x;

CREATE TARGET fftpTarget
USING SysOut(name:fftpInfo)
INPUT FROM fftpOutStream;

The RecordBegin value %mon %d, %yyyy %H:%M:%S %p defines the beginning of an event as a timestamp like the one at the beginning of the sample shown below. This is also the event timestamp, as defined by the TimeStamp value.

The regular expression (SEVERE:.*|WARNING:.*) looks in each event for a string starting with SEVERE or WARNING. If one is found, the parser returns everything until the next linefeed. If an event does not include SEVERE or WARNING, it is omitted from the output.
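The two steps just described (split the input into events at each RecordBegin timestamp, then keep only events where the regex matches) can be sketched in Python as follows. This is an emulation under stated assumptions, not the parser's code: the RECORD_BEGIN regex is a hand-written stand-in for '%mon %d, %yyyy %H:%M:%S %p', the function names are hypothetical, and the log text is a shortened, made-up sample.

```python
import re

# Hand-written stand-in for the RecordBegin pattern '%mon %d, %yyyy %H:%M:%S %p'
# (an assumption about what that pattern matches, e.g. "Aug 22, 2014 11:17:19 AM").
RECORD_BEGIN = re.compile(
    r"^[A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M", re.MULTILINE
)
# Same expression as the regex property in the example.
FIELD = re.compile(r"(SEVERE:.*|WARNING:.*)")


def split_events(text):
    """Cut the text into events, each starting at a RecordBegin match."""
    starts = [m.start() for m in RECORD_BEGIN.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


def extract_messages(text):
    """Return the matched line of each event; events without a match are dropped."""
    messages = []
    for event in split_events(text):
        match = FIELD.search(event)
        if match:
            messages.append(match.group(1))
    return messages


# Shortened, made-up log with one SEVERE event and one INFO event.
log = (
    "Aug 22, 2014 11:17:19 AM org.apache.solr.common.SolrException log\n"
    "SEVERE: org.apache.solr.common.SolrException: bad query\n"
    "    at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:147)\n"
    "Aug 22, 2014 11:17:20 AM org.apache.solr.core.SolrCore execute\n"
    "INFO: request handled\n"
)
events = split_events(log)
messages = extract_messages(log)
```

Because '.' does not match a linefeed by default, each match stops at the end of the SEVERE or WARNING line, and the INFO event produces no output.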

The following is the beginning of one of the potentially very long log messages this application is designed to process:

Aug 22, 2014 11:17:19 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: 
Cannot parse '((suggest_title:(04) AND suggest_title:(gmc) AND suggest_title: 
AND suggest_title:(6.6l) AND suggest_title:(lb7) AND suggest_title:(p1094,p0234)) 
AND NOT (deleted:(true)))': Encountered " <AND> "AND "" at line 1, column 64.
Was expecting one of:
    <BAREOPER> ...
    "(" ...
    "*" ...
    <QUOTED> ...
    <TERM> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    <REGEXPTERM> ...
    "[" ...
    "{" ...
    <LPARAMS> ...
    <NUMBER> ...
    
    at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:147)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:187)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) ...

The parser discards everything except the line beginning with SEVERE, and the TO_LONG function in the CQ converts the log entry's timestamp to the format required by Striim:

fftpInfo: fftpOutType_1_0{
  msg: "SEVERE: org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError:
Cannot parse '((suggest_title:(04) AND suggest_title:(gmc) AND suggest_title: AND 
suggest_title:(6.6l) AND suggest_title:(lb7) AND suggest_title:(p1094,p0234)) AND NOT
(deleted:(true)))': Encountered \" <AND> \"AND \"\" at line 1, column 64."
  origTs: 1408731439000
};
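The origTs value above is consistent with milliseconds since the Unix epoch. The Python sketch below reproduces it from the sample log entry's timestamp. The UTC-7 offset is an assumption inferred from the sample output rather than something the documentation states, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the log timestamp is local time at UTC-7 (inferred from the
# sample output above, not stated in the documentation).
ASSUMED_OFFSET = timezone(timedelta(hours=-7))


def to_epoch_millis(text, tz=ASSUMED_OFFSET):
    """Parse a timestamp like 'Aug 22, 2014 11:17:19 AM' into epoch milliseconds.

    %b and %p are locale-dependent; this relies on Python's default C locale.
    """
    parsed = datetime.strptime(text, "%b %d, %Y %I:%M:%S %p")
    return int(parsed.replace(tzinfo=tz).timestamp() * 1000)


orig_ts = to_epoch_millis("Aug 22, 2014 11:17:19 AM")
```

Under that offset assumption, `orig_ts` equals the origTs value shown in the output; with a different time zone the result shifts by the corresponding number of hours.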