
Extracting substrings from log entries

The MATCH function matches a string against a regular expression (see Functions for details about this function). The third parameter indicates which capture group to return, which is useful when the regex can produce multiple capture groups with differing values. In the following TQL example, session information is extracted from log data using basic regular expressions:

MATCH(data[5], "(?<=process: )([a-zA-Z0-9//$]*)",1), /* process */ 
MATCH(data[5], "(?<=pathway: )([a-zA-Z0-9//$]*)",1), /* pathway */ 
MATCH(data[5], "(?<=service code: )([a-zA-Z0-9//_]*)",1), /* service code */ 
MATCH(data[5], "(?<=model: )([a-zA-Z0-9]*)",1), /* model */ 
MATCH(data[5], "(?<=user id: )([0-9]*)",1), /* userId */ 
MATCH(data[5], "(?<=session IP: )([a-zA-Z0-9//.]*)",1), /* session IP */ 
MATCH(data[5], "(?<=source: )([a-zA-Z0-9//.//_///]*)",1), /* source */ 
MATCH(data[5], "(?<=detail: )(.*$)",1) /* detail message */
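The way these patterns behave can be tried outside Striim. The following is a minimal Python sketch, not Striim code: the `match` helper is a hypothetical stand-in for MATCH, the sample line and its values are invented, and returning `None` when a label is missing is an assumption of the sketch rather than documented MATCH behavior.

```python
import re

def match(text, pattern, group):
    """Return the requested capture group of the first match, or None if nothing matches."""
    found = re.search(pattern, text)
    return found.group(group) if found else None

# Hypothetical log line: the field labels follow the TQL above, the values are invented.
line = "process: ORD$01 pathway: web/checkout user id: 4711 detail: payment declined"

# The lookbehind (?<=label: ) anchors the match right after the label,
# so the label itself is not part of the returned value.
process = match(line, r"(?<=process: )([a-zA-Z0-9//$]*)", 1)
pathway = match(line, r"(?<=pathway: )([a-zA-Z0-9//$]*)", 1)
user_id = match(line, r"(?<=user id: )([0-9]*)", 1)
detail = match(line, r"(?<=detail: )(.*$)", 1)
model = match(line, r"(?<=model: )([a-zA-Z0-9]*)", 1)  # label absent from this line

print(process, pathway, user_id, detail, model)
```

Each character class stops at the first character it does not list (here, the space before the next label), which is why every field ends where the next one begins. Only the `detail` pattern runs to the end of the line.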

Here is a typical entry from a log that may contain SEVERE or WARNING messages:

Aug 22, 2014 11:17:19 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: Cannot parse '((suggest_title:(04) AND suggest_title:(gmc) AND suggest_title: AND suggest_title:(6.6l) AND suggest_title:(lb7) AND suggest_title:(p1094,p0234)) AND NOT (deleted:(true)))': Encountered " <AND> "AND "" at line 1, column 64.
Was expecting one of:
    <BAREOPER> ...
    "(" ...
    "*" ...
    <QUOTED> ...
    <TERM> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    <REGEXPTERM> ...
    "[" ...
    "{" ...
    <LPARAMS> ...
    <NUMBER> ...
     
    at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:147)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:187)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) ...

We would like to reduce this entry to its message line and its original timestamp:

fftpInfo: fftpOutType_1_0{
  msg: "SEVERE: org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: Cannot parse '((suggest_title:(04) AND suggest_title:(gmc) AND suggest_title: AND suggest_title:(6.6l) AND suggest_title:(lb7) AND suggest_title:(p1094,p0234)) AND NOT (deleted:(true)))': Encountered \" <AND> \"AND \"\" at line 1, column 64."
  origTs: 1408731439000
};

To do this, we will include the following regex in the FreeFormTextParser properties:

regex:'(SEVERE:.*|WARNING:.*)'
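To see why this pattern keeps the message line and drops the stack trace, the sketch below applies the same regex in Python to an abbreviated copy of the log entry. It assumes the parser's regex engine treats `.` as "any character except a newline", as Python and Java do by default. The `record` text is shortened for readability and is not the parser's actual input.

```python
import re

# Abbreviated copy of the log entry above (the message text is shortened).
record = (
    "Aug 22, 2014 11:17:19 AM org.apache.solr.common.SolrException log\n"
    "SEVERE: org.apache.solr.common.SolrException: Cannot parse ...\n"
    "Was expecting one of:\n"
    "    <BAREOPER> ...\n"
    "    at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:147)\n"
)

# Same pattern as the regex property: one capture group with two alternatives.
pattern = re.compile(r"(SEVERE:.*|WARNING:.*)")

found = pattern.search(record)
# Without a DOTALL flag, .* stops at the end of the line, so only the
# SEVERE line is captured and the lines after it are left out.
msg = found.group(1) if found else None
print(msg)
```

The single capture group here corresponds to the value that the complete TQL selects as data[0].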

Here is the complete TQL:

create source fftpSource using FileReader (
  directory:'Samples/',
  WildCard:'catalina*.log',
  charset:'UTF-8',
  positionByEOF:false
)
parse using FreeFormTextParser (
  -- Timestamp format in the log is "Aug 21, 2014 8:33:56 AM"
  TimeStamp:'%mon %d, %yyyy %H:%M:%S %p',
  RecordBegin:'%mon %d, %yyyy %H:%M:%S %p',
  regex:'(SEVERE:.*|WARNING:.*)'
)
OUTPUT TO fftpInStream;

CREATE TYPE fftpOutType (
  msg String,
  origTs long
);

create stream fftpOutStream of fftpOutType;

create cq fftpOutCQ
insert into fftpOutStream
select data[0],
  TO_LONG(META(x,'OriginTimestamp'))
from fftpInStream x;

create Target fftpTarget using SysOut(name:fftpInfo) input from fftpOutStream;
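The origTs value in the sample output (1408731439000) can be checked by hand: it is the log entry's timestamp expressed as milliseconds since the Unix epoch. The Python sketch below reproduces it. The UTC-7 offset is an assumption inferred from the numbers (it is the offset at which "Aug 22, 2014 11:17:19 AM" yields that value), not something stated in the source, and `to_epoch_ms` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

def to_epoch_ms(stamp, utc_offset_hours):
    """Parse a 'Mon DD, YYYY HH:MM:SS AM/PM' stamp and return epoch milliseconds."""
    parsed = datetime.strptime(stamp, "%b %d, %Y %I:%M:%S %p")
    # Attach a fixed offset; the real offset depends on where the log was written.
    zoned = parsed.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return int(zoned.timestamp() * 1000)

# Assumed offset of UTC-7 for the sample entry.
orig_ts = to_epoch_ms("Aug 22, 2014 11:17:19 AM", -7)
print(orig_ts)
```

With a different offset the result shifts by 3,600,000 per hour, so a mismatch of that size against the expected origTs usually points to a time zone difference rather than a parsing problem.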

See FreeFormTextParser for more information.