Abstract Wikipedia team/Observability/Logging
Currently, we have two separate LogStash dashboards, one for frontend and one for backend log events:
WikiLambda logs
Evaluator and Orchestrator logs
Some quick sub-links
- WikiLambda: All logs in the past hour; any errors or warnings in the past day
- Orchestrator: All logs in the past hour; any errors or warnings in the past day
- Evaluator: All logs in the past hour; any errors or warnings in the past day
Name | Level | Usage |
---|---|---|
error | 0 | Critical errors that require immediate attention |
warn | 1 | Warnings about potential issues |
info | 2 | General information |
http | 3 | HTTP requests that show HTTP server activity |
verbose | 4 | For debugging; detailed information |
debug | 5 | For debugging; generally in development |
silly | 6 | For debugging and/or tracing; detailed information |
Name | Level | Usage |
---|---|---|
'trace' | 10 | for active debugging |
'debug' | 20 | detailed logs in cases where we want to do follow-up on issues |
'info' | 30 | normal events, e.g. 'incoming request' |
'warn' | 40 | potentially worrying events |
'error' | 50 | worrying events where we can't return a useful response |
'fatal' | 60 | for environmental issues, e.g. 'port cannot be opened' |
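Note that the two tables order severity in opposite directions: in the first, a lower number is more severe (error is 0), while in the second, a higher number is more severe ('fatal' is 60). The following is a minimal sketch of how a script might compare levels under either scheme. The dictionary and function names are hypothetical, and the sketch makes no assumption about which service emits which scheme.

```python
# Level numbers copied from the two tables above.
# All names here (LOWER_IS_WORSE, HIGHER_IS_WORSE, ...) are illustrative only.
LOWER_IS_WORSE = {
    "error": 0, "warn": 1, "info": 2, "http": 3,
    "verbose": 4, "debug": 5, "silly": 6,
}
HIGHER_IS_WORSE = {
    "trace": 10, "debug": 20, "info": 30,
    "warn": 40, "error": 50, "fatal": 60,
}


def at_least_as_severe(level, threshold, table, lower_is_worse):
    """Return True if `level` is at least as severe as `threshold` in `table`."""
    if lower_is_worse:
        return table[level] <= table[threshold]
    return table[level] >= table[threshold]


def levels_at_or_above(threshold, table, lower_is_worse):
    """List level names at least as severe as `threshold`, most severe first."""
    names = [
        name for name in table
        if at_least_as_severe(name, threshold, table, lower_is_worse)
    ]
    # Ascending order puts the most severe first only when lower is worse.
    return sorted(names, key=lambda name: table[name], reverse=not lower_is_worse)
```

For example, `levels_at_or_above("warn", HIGHER_IS_WORSE, False)` lists 'fatal', 'error', and 'warn', which are the levels the reporting guidance below is concerned with.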
Where/How Do I Begin?
WikiLambda LogStash
- Refer to the 'Events By LevelPath' row to see the overall frequency of each log level. To look at all of the 'error' log events, you can add another filter using levelPath.
Backend LogStash
- To search by log level, filter via the key log.level.
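As a rough illustration of filtering on log.level, the sketch below builds a query string that matches several levels at once. It assumes the dashboard's search bar accepts Lucene-style query strings and that level values are stored as lowercase names such as "error"; neither is confirmed by this page, so check a real event first. The helper name `level_query` is hypothetical.

```python
def level_query(levels, field="log.level"):
    """Build a Lucene-style query string matching any of the given levels.

    Assumption: level values are plain names with no embedded quotes,
    so no escaping is attempted here.
    """
    if not levels:
        raise ValueError("at least one level is required")
    quoted = " OR ".join(f'"{level}"' for level in levels)
    return f"{field}:({quoted})"


if __name__ == "__main__":
    # Print a query for errors and warnings to paste into the search bar.
    print(level_query(["error", "warn"]))
```

The same helper could target a different key by passing `field`, but the value format of other keys may differ and should be verified separately.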
What Do I Look for?
- Get an overview of occurrences using the four main panels above the 'Channel Events' panel:
- Events over time || Events
- Events by LevelPath || LogLevel
- Top normalized_message || Events by message
- LevelPath Types || Events by frequency
- You can conveniently add a filter by clicking on a message, then read the matching events in detail in the 'Channel Events' panel below.
- When there is a high frequency of errors, outages, or timeouts in the log events, it is worthwhile to look at the timeline of events in Grafana and check whether and how they correlate.
When Do I Report?
- For any ‘fatal’, ‘error’, or critical ‘warn’ events (especially if they’re recurring). Be sure to attach a link to the particular event (expand the log event in LogStash and click on ‘View Single Document’).
- If any such detection is a new or unknown event, create a Phabricator ticket. (Of course, if any event does not yet have a Phabricator ticket, create one.)
- If there is an incident as defined in the Sev Levels table, create a new Phabricator ticket and declare an incident by pinging our #aw-engineering channel.
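The reporting rules above can be read as a small decision procedure. The sketch below is one possible reading, not an official policy: the function name, its parameters, and the action strings are all hypothetical, and judging whether a 'warn' is critical or whether something is an incident remains a human decision.

```python
def reporting_actions(level, is_critical=False, is_new=False,
                      has_ticket=True, is_incident=False):
    """Return the follow-up actions suggested by the reporting rules above.

    level       -- log level name, e.g. "fatal", "error", "warn"
    is_critical -- only meaningful for "warn": a human judged it critical
    is_new      -- the event is new or unknown
    has_ticket  -- a Phabricator ticket already exists for the event
    is_incident -- the event qualifies as an incident per the Sev Levels table
    """
    actions = []
    reportable = level in ("fatal", "error") or (level == "warn" and is_critical)
    if not reportable and not is_incident:
        return actions
    # Every report should link to the single-document view of the event.
    actions.append("attach link to the single-document view of the event")
    if is_new or not has_ticket or is_incident:
        actions.append("create Phabricator ticket")
    if is_incident:
        actions.append("declare incident in #aw-engineering")
    return actions
```

For instance, `reporting_actions("error", is_new=True)` yields the link and ticket steps, while a non-critical 'warn' yields no actions.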