Abstract Wikipedia team/Observability/Logging

From mediawiki.org

Currently, we have two separate LogStash dashboards, one for frontend log events and one for backend log events:
WikiLambda logs
Evaluator and Orchestrator logs

Some quick sub-links

Log levels according to the winston logging library
Name    | Level | Usage
error   | 0     | Critical errors that require immediate attention
warn    | 1     | Warnings about potential issues
info    | 2     | General information
http    | 3     | HTTP requests that show HTTP server activity
verbose | 4     | For debugging; detailed information
debug   | 5     | For debugging; generally in development
silly   | 6     | For debugging and/or tracing; detailed information
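
The numeric ordering in the winston table matters when choosing a logger level: in winston, a lower number means a higher priority, and a logger emits a message only if the message's number is less than or equal to the configured level's number. The following is a minimal, self-contained sketch of that rule. It does not import winston, and the helper names `isEmitted` and `emittedLevels` are hypothetical, introduced only for illustration.

```typescript
// Levels as listed in the winston table above (lower number = higher priority).
const WINSTON_LEVELS = {
  error: 0,
  warn: 1,
  info: 2,
  http: 3,
  verbose: 4,
  debug: 5,
  silly: 6,
} as const;

type WinstonLevel = keyof typeof WINSTON_LEVELS;

// Hypothetical helper (not part of winston): would a message at `messageLevel`
// be emitted by a logger configured at `configuredLevel`?
function isEmitted(configuredLevel: WinstonLevel, messageLevel: WinstonLevel): boolean {
  return WINSTON_LEVELS[messageLevel] <= WINSTON_LEVELS[configuredLevel];
}

// Hypothetical helper: every level that a logger at `configuredLevel` would emit.
function emittedLevels(configuredLevel: WinstonLevel): WinstonLevel[] {
  return (Object.keys(WINSTON_LEVELS) as WinstonLevel[]).filter((level) =>
    isEmitted(configuredLevel, level)
  );
}

console.log(emittedLevels("info"));
```

Under this rule, a logger configured at 'info' would emit 'error', 'warn', and 'info' events but drop 'http' and everything more verbose.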
[Deprecated] Log levels according to the bunyan logging library
Name    | Level | Usage
'trace' | 10    | for active debugging
'debug' | 20    | detailed logs in cases where we want to do follow-up on issues
'info'  | 30    | normal events, e.g. 'incoming request'
'warn'  | 40    | potentially worrying events
'error' | 50    | worrying events where we can't return a useful response
'fatal' | 60    | for environmental issues, e.g. 'port cannot be opened'
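
Note that the bunyan numbers run in the opposite direction from the winston ones: in bunyan a higher number is more severe, and a logger emits records at or above its configured level. This may help when reading older log events that still carry numeric bunyan levels. The sketch below is self-contained and does not import bunyan; `bunyanName` and `isEmittedByBunyan` are hypothetical helper names.

```typescript
// Levels as listed in the bunyan table above (higher number = more severe).
const BUNYAN_LEVELS = {
  trace: 10,
  debug: 20,
  info: 30,
  warn: 40,
  error: 50,
  fatal: 60,
} as const;

type BunyanLevel = keyof typeof BUNYAN_LEVELS;

// Hypothetical helper: translate a numeric level found in a log record to its name.
function bunyanName(value: number): BunyanLevel | undefined {
  return (Object.keys(BUNYAN_LEVELS) as BunyanLevel[]).find(
    (name) => BUNYAN_LEVELS[name] === value
  );
}

// Hypothetical helper: would a record at `recordLevel` be emitted by a
// logger configured at `configuredLevel`?
function isEmittedByBunyan(configuredLevel: BunyanLevel, recordLevel: BunyanLevel): boolean {
  return BUNYAN_LEVELS[recordLevel] >= BUNYAN_LEVELS[configuredLevel];
}

console.log(bunyanName(50), isEmittedByBunyan("info", "fatal"));
```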

Where/How Do I Begin?

WikiLambda LogStash
  • Refer to the 'Events By LevelPath' row to look at the overall frequency of log levels. To look at all of the 'error' log events, you can add another filter using levelPath.
Backend LogStash
  • To search by log level, filter via key: log.level
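
As a rough illustration of the level filters described above, the sketch below builds a `field:value` query string of the kind that can be typed into a dashboard search bar. The field names `log.level` and `levelPath` come from this page; the helper name `levelQuery` is hypothetical, and the exact values stored in each field (for example 'error' versus another spelling) are assumptions that should be checked against a real log event.

```typescript
// Hypothetical helper: build a query that matches any of the given values
// for one field, e.g. log.level:"error".
function levelQuery(field: string, levels: string[]): string {
  const terms = levels.map((level) => `${field}:"${level}"`);
  // Join multiple terms with OR and wrap them so the query can be combined with others.
  return terms.length > 1 ? `(${terms.join(" OR ")})` : terms.join("");
}

console.log(levelQuery("log.level", ["error"]));
console.log(levelQuery("log.level", ["error", "warn"]));
```

The same helper could be pointed at `levelPath` for the WikiLambda dashboard, assuming the values there are known.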

What Do I Look for?

  • Get an overview of occurrences using the four main panels above the 'Channel Events' panel
    • Events over time || Events
    • Events by LevelPath || LogLevel
    • Top normalized_message || Events by message
    • LevelPath Types || Events by frequency
  • Click on a message to add it as a filter, then read the matching events in detail in the 'Channel Events' panel below
  • When errors, outages, or timeouts appear at a high frequency in the log events, it is worthwhile to look at the timeline of events in Grafana and check whether and how they correlate.

When Do I Report?

  • Report any ‘fatal’, ‘error’, or critical ‘warn’ events (especially if they’re recurring). Be sure to attach a link to the particular event (expand the log event on LogStash and click on ‘View Single Document’).
    • If the event is new or unknown, create a Phabricator ticket. (More generally, if an event does not yet have a Phabricator ticket, create one.)
  • If there is an incident as defined in the Sev Levels table, create a new Phabricator ticket and declare an incident by pinging our #aw-engineering channel.