How to send error notifications through td-agent immediately while reporting only daily counts of acknowledged errors
A short while ago, I implemented a Slack notification function in our product's td-agent.
Then I ran into a problem: some errors are already acknowledged by everyone and do not need to be notified immediately.
For those errors, we only want to know how many times they occurred each day, so that we do not miss it if one suddenly starts happening unusually often.
I made this possible by combining the rewrite, grepcounter, and slack plugins.
Here is the configuration I used in the td-agent.conf file.
# Retrieve Error Log
<source>
type tail
path {{ path }}
format multiline
format_firstline /^\d{4}-\d{2}-\d{2}/
format1 /^(?<text>.*)/
tag raw.app.errorlog.{{ hostname }}
pos_file /var/tmp/app_log.pos.slack
</source>
# Filter Acknowledged Error
<match raw.app.errorlog.{{ hostname }}>
type rewrite
add_prefix filtered
<rule>
key text
pattern FileNotFoundException
append_to_tag true
tag FileNotFoundException
</rule>
</match>
# Notify Error to the Slack channel
<match filtered.raw.app.errorlog.{{ hostname }}>
type slack
webhook_url {{ webhook_url }}
channel {{ channel }}
username ERROR_NOTIFIER
message '{{ td_agent_app_errorlog_mention }}```[host] {{ hostname }} [Path] {{ errorlog_path }}``` %s'
message_keys text
color warning
flush_interval 10s
</match>
# Notify the count of Acknowledged Errors filtered above
<match filtered.raw.app.errorlog.{{ hostname }}.FileNotFoundException>
type grepcounter
count_interval 86400 # = 24 hours
input_key text
threshold 1
add_tag_prefix count
</match>
<match count.filtered.raw.app.errorlog.{{ hostname }}.FileNotFoundException>
type slack
webhook_url {{ webhook_url }}
channel {{ channel }}
username EXISTING_ERROR_NOTIFIER
icon_emoji :admission_tickets:
message_keys count
message '```FileNotFoundException occurred %s times within the last 24 hours at {{ hostname }}```'
color #FFB6C1
flush_interval 10
</match>
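The tag flow in the configuration above can be hard to follow at a glance, so here is a minimal Python sketch that imitates it. It is only an illustration of the routing idea, not how td-agent or the plugins are implemented. The hostname `web01` and the function names `rewrite_tag` and `route` are made up for this example. The key point it shows is that a `<match>` pattern without a wildcard matches only that exact tag, so a record whose tag gained the `.FileNotFoundException` suffix skips the immediate Slack notification and goes to the counter instead.

```python
import re

# "web01" is a hypothetical hostname standing in for {{ hostname }}.
BASE_TAG = "raw.app.errorlog.web01"


def rewrite_tag(tag, record):
    """Imitate the rewrite step: prefix every tag, append a suffix on a pattern hit."""
    new_tag = "filtered." + tag
    if re.search(r"FileNotFoundException", record["text"]):
        new_tag += ".FileNotFoundException"
    return new_tag


def route(tag):
    """Imitate the two <match> blocks that consume the rewritten tags (exact match)."""
    if tag == "filtered." + BASE_TAG:
        return "slack_immediate"
    if tag == "filtered." + BASE_TAG + ".FileNotFoundException":
        return "grepcounter_daily"
    return "unmatched"


if __name__ == "__main__":
    for text in ("java.lang.NullPointerException at Foo.java:10",
                 "java.io.FileNotFoundException: /tmp/missing.txt"):
        tag = rewrite_tag(BASE_TAG, {"text": text})
        print(tag, "->", route(tag))
```

Running the script prints one line per sample record, showing the rewritten tag and which output it would reach.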
When an error happens, a notification is sent to the Slack channel immediately.
The count of the acknowledged error is reported once a day.
The daily message shows how many times the error occurred, on which host it occurred, and a link to the Jira ticket that describes it in detail.
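To make the daily count concrete, here is a rough Python sketch of the idea behind the counting step: group matching log lines into 24-hour windows, drop windows below the threshold, and format one message per window. It is a simplification under stated assumptions. Windows here are aligned to multiples of 86400 seconds for simplicity, and the plugin's actual window boundaries are not modeled. The names `count_per_window` and `format_message` and the hostname `web01` are hypothetical.

```python
from collections import Counter

WINDOW_SECONDS = 86400  # 24 hours, same value as count_interval in the config


def count_per_window(events, pattern="FileNotFoundException", threshold=1):
    """Count events whose text contains `pattern`, grouped into 24-hour windows.

    `events` is an iterable of (unix_timestamp, text) pairs.
    Windows with fewer than `threshold` hits are omitted.
    """
    counts = Counter()
    for timestamp, text in events:
        if pattern in text:
            counts[timestamp // WINDOW_SECONDS] += 1
    return {window: n for window, n in sorted(counts.items()) if n >= threshold}


def format_message(count, hostname):
    """Build the daily summary text (hostname is an illustrative placeholder)."""
    return (f"FileNotFoundException occurred {count} times "
            f"within the last 24 hours at {hostname}")


if __name__ == "__main__":
    sample = [
        (10, "java.io.FileNotFoundException: a"),
        (20, "java.lang.NullPointerException"),
        (86400 + 5, "java.io.FileNotFoundException: b"),
        (86400 + 6, "java.io.FileNotFoundException: c"),
    ]
    for window, n in count_per_window(sample).items():
        print(window, format_message(n, "web01"))
```

With `threshold` set to 1, as in the configuration, a day with no occurrences produces no message at all, which keeps the channel quiet when nothing happened.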