Managing system alerts
The following alerts are enabled by default and are sent for every server, Forwarding Agent, application, source, and target.
By default these alerts are visible only to administrators (members of the Global.admin group) in the alerts drop-down in the top right corner of the Striim web UI and in the Message Log at the bottom of the web UI. You may modify them to be sent by email or to Slack or Microsoft Teams.
Alert name | Alert condition (default) | Notes |
---|---|---|
Server_HighCpuUsage | the server or Forwarding Agent average per core CPU time used by its Java process is over 90% | By default, an alert will be sent every four hours until the condition is resolved. |
Server_HighMemoryUsage | the server's or Forwarding Agent's JVM free heap size is below 10% of the maximum heap size (Xmx) | By default, an alert will be sent every four hours until the condition is resolved. |
Server_NodeUnavailable | the server or Forwarding Agent is no longer connected to the cluster | |
Application_AutoResumed | the application resumed automatically (see Automatically restarting an application) | |
Application_Backpressured | one or more streams in the application have been backpressured for over ten minutes (see Understanding and managing backpressure) | By default, an alert will be sent every four hours until the condition is resolved. |
Application_CheckpointNotProgressing | it has been over 30 minutes since the recovery checkpoint advanced (see Recovering applications) | By default, an alert will be sent every four hours until the condition is resolved. |
Application_Halted | the application has halted (see Application states) | |
Application_Rebalanced | not applicable to Striim Cloud | |
Application_RebalanceFailed | not applicable to Striim Cloud | |
Application_Terminated | the application has terminated (see Application states) | |
Source_Idle | it has been over 10 minutes since the source read an event | By default, an alert will be sent every four hours until the condition is resolved. |
Target_HighLee | one or more events received by the target had an end-to-end lag of over ten minutes (see Monitoring end-to-end lag (LEE)) | By default, an alert will be sent every four hours until the condition is resolved. |
Target_Idle | it has been over 10 minutes since the target wrote an event | By default, an alert will be sent every four hours until the condition is resolved. |
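
The threshold conditions in the table reduce to simple comparisons. As an illustration only (this is not Striim's implementation), the default Server_HighMemoryUsage condition can be read as the check below. The function name and the byte-count inputs are hypothetical and chosen just for this sketch.

```python
def high_memory_usage(free_heap_bytes: int, max_heap_bytes: int,
                      threshold: float = 0.10) -> bool:
    """Return True when free heap is below `threshold` of the maximum heap (Xmx)."""
    if max_heap_bytes <= 0:
        raise ValueError("max_heap_bytes must be positive")
    return free_heap_bytes < threshold * max_heap_bytes


GIB = 1024 ** 3
MIB = 1024 ** 2

# With a 4 GiB maximum heap, the 10% threshold is about 409.6 MiB of free heap.
print(high_memory_usage(400 * MIB, 4 * GIB))  # True: less than 10% free
print(high_memory_usage(512 * MIB, 4 * GIB))  # False: 12.5% free
```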
Modifying a system alert
In this release, you must use the console (see Using the console in the web UI) to modify these system alerts.
The properties (which vary depending on the alert) are:
alertMessage
: the text of the alert

alertType
: EMAIL, SLACK, TEAMS, or WEB (default); except for WEB, you must also specify the toAddress
: Before modifying an alert to send via Slack, follow the setup instructions in Sending alerts about servers and applications and Configure Slack to receive alerts from Striim.
: Before modifying an alert to send via Teams, follow the setup instructions in Sending alerts about servers and applications and Configure Teams to receive alerts from Striim.

alertValue
: for integer values: the time in seconds before the alert is triggered; for example, for Source_Idle, the number of seconds with no events that need to pass before an alert is sent
: for string values: the string to search for in the error message; for example, for Application_Terminated, Application terminated

comparator
: for integer values: EQ (equals), GT (greater than), LT (less than)
: for string values: EQ (equals), LIKE (matches if the specified string occurs anywhere in the value)

intervalSec
: the number of seconds between alerts (the snooze interval)

isEnabled
: true (default) or false

toAddress
: for email, the recipient's address; for Slack or Teams, the channel
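
To make the comparator and alertValue descriptions concrete, here is a minimal sketch of how the four comparators could be evaluated against a metric value. It illustrates the descriptions above and is not Striim's code. The function name is hypothetical, and case-sensitive matching for LIKE is an assumption.

```python
def condition_met(comparator: str, metric_value, alert_value) -> bool:
    """Evaluate an alert condition the way the property list describes it."""
    if comparator == "LIKE":
        # Matches if the alert value occurs anywhere in the metric value.
        # Case-sensitive matching is an assumption of this sketch.
        return str(alert_value) in str(metric_value)
    if comparator == "EQ":
        return metric_value == alert_value
    if comparator == "GT":
        return metric_value > alert_value
    if comparator == "LT":
        return metric_value < alert_value
    raise ValueError(f"unknown comparator: {comparator}")


# Integer case: Source_Idle with no events for 720 seconds and alertValue 600.
print(condition_met("GT", 720, 600))  # True

# String case: the search string appears inside a longer error message.
message = "Application terminated due to an error"
print(condition_met("LIKE", message, "Application terminated"))  # True
print(condition_met("EQ", message, "Application terminated"))    # False
```

The last two calls show why a string alert would typically use LIKE rather than EQ: EQ requires the whole value to match, while LIKE only needs the search string to occur somewhere in it.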
To see an alert's properties, use the DESCRIBE command. For example:

    DESCRIBE Application_Terminated;
    Processing - describe Application_Terminated
    SysAlertRule Application_Terminated on .*\.APPLICATION\..*:
      for LOG_ERROR comparator LIKE with value Application terminated
      alert type WEB
      snooze 0 SECOND
      system-defined and enabled
      message: Application {{entityName}}: {{metricValue}}.
    -> SUCCESS

The property names in the DESCRIBE output correspond to the following keywords in ALTER SMARTALERT commands:
DESCRIBE output | keyword for ALTER SMARTALERT |
---|---|
on | can't be modified |
for | can't be modified |
comparator | can't be modified; the comparators are EQ, GT, LT, and LIKE |
with value | alertValue |
alert type | alertType |
sending to | toAddress |
snooze | intervalSec |
message | alertMessage |
enabled | isEnabled |
The on, for, and comparator values cannot be modified.
Examples of modifying alert properties:
To change the alert type for Application_Halted from WEB to EMAIL:

    ALTER SMARTALERT Application_Halted '{"alertType" : "EMAIL", "toAddress" : "somebody@example.com"}';
    Processing - ALTER SMARTALERT Application_Halted '{"alertType" : "EMAIL", "toAddress" : "somebody@example.com"}'
    The modified alert definition is:
    SysAlertRule Application_Halted on .*\.APPLICATION\..*:
      for LOG_ERROR comparator LIKE with value Application halted
      alert type EMAIL sending to somebody@example.com
      snooze 0 SECOND
      system-defined and enabled
      message: Application {{entityName}}: {{metricValue}}.
    -> SUCCESS

To change the alert interval (snooze) for Source_Idle to an hour (3600 seconds):

    ALTER SMARTALERT Source_Idle '{"intervalSec" : "3600"}';
    Processing - ALTER SMARTALERT Source_Idle '{"intervalSec" : "3600"}'
    The modified alert definition is:
    SysAlertRule Source_Idle on .*\.SOURCE\..*:
      for LAST_READ_AGE comparator GT with value 600 seconds
      alert type WEB
      snooze 1 HOUR
      system-defined and enabled
      message: Source {{entityName}}: No new event received in last {{metricValue}} (>{{alertValue}}) {{metricUnit}}.
    -> SUCCESS

To disable Source_Idle:

    ALTER SMARTALERT Source_Idle '{"isEnabled" : "false"}';
    Processing - ALTER SMARTALERT Source_Idle '{"isEnabled" : "false"}'
    The modified alert definition is:
    SysAlertRule Source_Idle on .*\.SOURCE\..*:
      for LAST_READ_AGE comparator GT with value 600 seconds
      alert type WEB
      snooze 5 MINUTE
      system-defined and disabled
      message: Source {{entityName}}: No new event received in last {{metricValue}} (>{{alertValue}}) {{metricUnit}}.
    -> SUCCESS
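
The examples above all follow one shape: the alert name followed by a single-quoted JSON object whose values are strings. If you generate these commands from a script before pasting them into the console, a helper along the following lines may reduce typos. This is a sketch under stated assumptions, not a Striim API: the function name is hypothetical, the check that a non-WEB alertType comes with a toAddress is a simplification of the rule in the property list, and comparator is excluded because the DESCRIBE mapping table says it can't be modified.

```python
import json

# Keywords from the property list in this section (comparator excluded, see above).
MODIFIABLE = {"alertMessage", "alertType", "alertValue",
              "intervalSec", "isEnabled", "toAddress"}
NEEDS_ADDRESS = {"EMAIL", "SLACK", "TEAMS"}


def build_alter_smartalert(alert_name: str, properties: dict) -> str:
    """Return an ALTER SMARTALERT command string in the format shown above."""
    unknown = set(properties) - MODIFIABLE
    if unknown:
        raise ValueError(f"unsupported properties: {sorted(unknown)}")
    if properties.get("alertType") in NEEDS_ADDRESS and "toAddress" not in properties:
        raise ValueError("an alertType other than WEB requires toAddress")

    # The documented examples pass every value as a JSON string, so do the same.
    as_strings = {}
    for key, value in properties.items():
        if isinstance(value, bool):
            as_strings[key] = "true" if value else "false"
        else:
            as_strings[key] = str(value)

    # These separators reproduce the spacing used in the examples: "key" : "value".
    payload = json.dumps(as_strings, separators=(", ", " : "))
    if "'" in payload:
        raise ValueError("a single quote would break the quoted JSON argument")
    return f"ALTER SMARTALERT {alert_name} '{payload}';"


print(build_alter_smartalert("Source_Idle", {"intervalSec": 3600}))
print(build_alter_smartalert("Source_Idle", {"isEnabled": False}))
```

The two calls reproduce the Source_Idle commands shown above. The helper only builds text and does not contact Striim, so the resulting command still has to be run in the console.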