Striim Cloud 4.1.2 documentation

Managing system alerts

The following alerts are enabled by default and are sent for every server, Forwarding Agent, application, source, and target.

By default, these alerts are visible only to administrators (members of the Global.admin group), in the alerts drop-down in the top right corner of the Striim web UI and in the Message Log at the bottom of the web UI. You can modify them to be sent by email, to Slack, or to Microsoft Teams.

Alert name

Alert condition (default)

Notes

Server_HighCpuUsage

the average per-core CPU time used by the server's or Forwarding Agent's Java process is over 90%

By default, an alert will be sent every four hours until the condition is resolved.

Server_HighMemoryUsage

the server's or Forwarding Agent's JVM free heap size is below 10% of the maximum heap size (Xmx)

By default, an alert will be sent every four hours until the condition is resolved.

Server_NodeUnavailable

the server or Forwarding Agent is no longer connected to the cluster

Application_AutoResumed

the application resumed automatically (see Automatically restarting an application)

Application_Backpressured

one or more streams in the application have been backpressured for over ten minutes (see Understanding and managing backpressure)

By default, an alert will be sent every four hours until the condition is resolved.

Application_CheckpointNotProgressing

it has been over 30 minutes since the recovery checkpoint advanced (see Recovering applications)

By default, an alert will be sent every four hours until the condition is resolved.

Application_Halted

the application has halted (see Application states)

Application_Rebalanced

not applicable to Striim Cloud

Application_RebalanceFailed

not applicable to Striim Cloud

Application_Terminated

the application has terminated (see Application states)

Source_Idle

it has been over 10 minutes since the source read an event

By default, an alert will be sent every four hours until the condition is resolved.

Target_HighLee

one or more events received by the target had an end-to-end lag of over ten minutes (see Monitoring end-to-end lag (LEE))

By default, an alert will be sent every four hours until the condition is resolved.

Target_Idle

it has been over 10 minutes since the target wrote an event

By default, an alert will be sent every four hours until the condition is resolved.

Modifying a system alert

In this release, you must use the console (see Using the console in the web UI) to modify these system alerts.

The properties (which vary depending on the alert) are:

  • alertMessage: the text of the alert

  • alertType: EMAIL, SLACK, TEAMS, or WEB (default); except for WEB, you must also specify the toAddress

  • alertValue:

    • for integer values: the time in seconds before the alert is triggered; for example, for Source_Idle, the number of seconds that must pass with no events before an alert is sent

    • for string values: the string to search for in the error message; for example, for Application_Terminated, Application terminated

  • comparator:

    • for integer values: EQ (equals), GT (greater than), LT (less than)

    • for string values: EQ (equals), LIKE (matches if the specified string occurs anywhere in the value)

  • intervalSec: the number of seconds between alerts (the snooze interval)

  • isEnabled: true (default) or false

  • toAddress: for email, the recipient's address; for Slack or Teams, the channel
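The comparator descriptions above can be read as simple predicates. The following is a minimal Python sketch of that reading, not Striim's implementation: the function name condition_met and the sample values are hypothetical, and how Striim actually evaluates an alert internally is not described here.

```python
def condition_met(comparator, metric_value, alert_value):
    """Return True when metric_value satisfies the alert condition.

    Mirrors the documented comparator descriptions only; this is an
    illustrative assumption, not Striim code.
    """
    if comparator == "LIKE":
        # LIKE: the specified string occurs anywhere in the value
        return str(alert_value) in str(metric_value)
    if comparator == "EQ":
        return metric_value == alert_value
    if comparator in ("GT", "LT"):
        # GT and LT are documented for integer values only
        if isinstance(metric_value, str) or isinstance(alert_value, str):
            raise ValueError("GT and LT apply only to integer values")
        if comparator == "GT":
            return metric_value > alert_value
        return metric_value < alert_value
    raise ValueError(f"unknown comparator: {comparator}")


# Source_Idle style check: 750 seconds without an event, threshold 600
idle_alert = condition_met("GT", 750, 600)

# Application_Terminated style check against a hypothetical error message
terminated_alert = condition_met(
    "LIKE", "Application terminated with an exception", "Application terminated"
)
```

With these sample inputs both checks evaluate to True, since 750 is greater than 600 and the search string occurs inside the hypothetical message.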

To see an alert's properties, use the DESCRIBE command. For example:

DESCRIBE Application_Terminated;
Processing - describe Application_Terminated

SysAlertRule Application_Terminated 
  on .*\.APPLICATION\..*: 
  for LOG_ERROR 
  comparator LIKE 
  with value Application terminated  
  alert type WEB 
  snooze 0 SECOND 
  system-defined and enabled 
  message: Application {{entityName}}: {{metricValue}}.
-> SUCCESS
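If you capture DESCRIBE output as text (for example, to compare alert settings across environments), a small parser can turn it into a dictionary. This is a rough sketch that assumes the output always follows the layout shown above. The function parse_describe and the LABELS tuple are hypothetical helpers, not part of Striim, and alerts that are not system-defined may be formatted differently.

```python
# Line prefixes as they appear in the DESCRIBE output shown in this section
LABELS = ("with value", "alert type", "sending to", "comparator", "snooze", "on", "for")


def parse_describe(text):
    """Collect the fields of one SysAlertRule block into a dict keyed by DESCRIBE label."""
    rule = {}
    message_lines = None  # becomes a list once the "message:" line is seen
    for raw in text.splitlines():
        line = raw.strip()
        if message_lines is not None:
            if line.startswith("->"):  # status line such as "-> SUCCESS" ends the block
                break
            message_lines.append(line)  # the message may continue on following lines
        elif line.startswith("SysAlertRule "):
            rule["name"] = line.split()[1]
        elif line.startswith("message:"):
            message_lines = [line[len("message:"):].strip()]
        elif line.startswith("system-defined and "):
            rule["enabled"] = line.split()[-1] == "enabled"
        else:
            for label in LABELS:
                if line.startswith(label + " "):
                    rule[label] = line[len(label):].strip()  # kept as raw text
                    break
    if message_lines is not None:
        rule["message"] = " ".join(part for part in message_lines if part)
    return rule


SAMPLE = r"""
SysAlertRule Application_Terminated
  on .*\.APPLICATION\..*:
  for LOG_ERROR
  comparator LIKE
  with value Application terminated
  alert type WEB
  snooze 0 SECOND
  system-defined and enabled
  message: Application {{entityName}}: {{metricValue}}.
-> SUCCESS
"""

rule = parse_describe(SAMPLE)
```

The values are kept as raw text (for example, the snooze value stays "0 SECOND") because converting units would require assumptions about every form the output can take.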

The property names in the DESCRIBE output correspond to the following keywords in ALTER SMARTALERT commands:

DESCRIBE output

keyword for ALTER SMARTALERT

on

can't be modified

for

can't be modified

comparator

can't be modified; the comparators are

  • for integer values: EQ (equals), GT (greater than), LT (less than)

  • for string values: EQ (equals), LIKE (matches if the specified string occurs anywhere in the value)

with value

alertValue

alert type

alertType

sending to

toAddress

snooze

intervalSec

message

alertMessage

enabled

isEnabled

The on, for, and comparator properties can't be modified.

Examples of modifying alert properties:

  • To change the alert type for Application_Halted from WEB to EMAIL:

    ALTER SMARTALERT Application_Halted '{"alertType" : "EMAIL", "toAddress" : "somebody@example.com"}';
    Processing - ALTER SMARTALERT Application_Halted 
      '{"alertType" : "EMAIL", "toAddress" : "somebody@example.com"}'
    The modified alert definition is: 
    SysAlertRule Application_Halted 
      on .*\.APPLICATION\..*: 
      for LOG_ERROR 
      comparator LIKE 
      with value Application halted  
      alert type EMAIL 
      sending to somebody@example.com 
      snooze 0 SECOND 
      system-defined and enabled 
      message: Application {{entityName}}: {{metricValue}}.
    -> SUCCESS

  • To change the alert interval (snooze) for Source_Idle to an hour (3600 seconds):

    ALTER SMARTALERT Source_Idle '{"intervalSec" : "3600"}';
    Processing - ALTER SMARTALERT Source_Idle '{"intervalSec" : "3600"}'
    The modified alert definition is: 
    SysAlertRule Source_Idle 
      on .*\.SOURCE\..*: 
      for LAST_READ_AGE 
      comparator GT 
      with value 600 seconds 
      alert type WEB 
      snooze 1 HOUR 
      system-defined and enabled 
      message: Source {{entityName}}:
        No new event received in last {{metricValue}} (>{{alertValue}}) {{metricUnit}}.
    -> SUCCESS

  • To disable Source_Idle:

    ALTER SMARTALERT Source_Idle '{"isEnabled" : "false"}';
    Processing - ALTER SMARTALERT Source_Idle '{"isEnabled" : "false"}'
    The modified alert definition is: 
    SysAlertRule Source_Idle 
      on .*\.SOURCE\..*: 
      for LAST_READ_AGE 
      comparator GT 
      with value 600 seconds 
      alert type WEB 
      snooze 5 MINUTE 
      system-defined and disabled 
      message: Source {{entityName}}:
        No new event received in last {{metricValue}} (>{{alertValue}}) {{metricUnit}}.
    -> SUCCESS
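If you need to change many alerts, you may prefer to generate the ALTER SMARTALERT statements rather than type them. The following Python sketch builds statements in the same shape as the examples above. The helper alter_smartalert is hypothetical, and the sketch assumes (based only on those examples) that every value is passed as a JSON string. It rejects values containing a single quote because escaping rules for the console are not covered here, and it requires toAddress whenever alertType is not WEB, which may be stricter than the console itself.

```python
import json

# Property names that the section lists as modifiable keywords
MODIFIABLE = {"alertMessage", "alertType", "alertValue", "intervalSec", "isEnabled", "toAddress"}
ALERT_TYPES = {"EMAIL", "SLACK", "TEAMS", "WEB"}


def alter_smartalert(name, **props):
    """Build an ALTER SMARTALERT statement as text (hypothetical helper)."""
    unknown = set(props) - MODIFIABLE
    if unknown:
        raise ValueError(f"not modifiable: {sorted(unknown)}")
    alert_type = props.get("alertType")
    if alert_type is not None:
        if alert_type not in ALERT_TYPES:
            raise ValueError(f"unknown alertType: {alert_type}")
        if alert_type != "WEB" and "toAddress" not in props:
            raise ValueError("alertType other than WEB also needs toAddress")
    # Assumption: values are sent as strings, as in the documented examples
    payload = {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in props.items()
    }
    body = json.dumps(payload, separators=(", ", " : "))
    if "'" in body:
        raise ValueError("single quotes in values are not handled by this sketch")
    return f"ALTER SMARTALERT {name} '{body}';"


statements = [
    alter_smartalert("Application_Halted", alertType="EMAIL", toAddress="somebody@example.com"),
    alter_smartalert("Source_Idle", intervalSec=3600),
    alter_smartalert("Source_Idle", isEnabled=False),
]
for statement in statements:
    print(statement)
```

The generated text can then be pasted into the console. The sketch does not send anything to Striim itself.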