Abstract Wikipedia team/Observability/Metrics

Wikifunctions metrics are fed into our Grafana dashboard.

What Do I Look At?

CPU Usage

  • For App Overview: panel 'Total CPU', row 'Saturation'
  • For Evaluator and Orchestrator: scroll down to their 'Saturation' rows

Heap and Memory Usage

  • For App Overview: panel 'Total Memory', row 'Saturation'
  • For Evaluator and Orchestrator: scroll down to their 'Saturation' rows
  • For NodeJS of the Orchestrator:
    • panel 'NodeJS heap allocation overview', row 'NodeJS'
    • panels 'NodeJS memory usage overview' and 'NodeJS memory usage details'

Garbage Collection

  • For the Orchestrator:
    • panels whose titles begin with 'NodeJS GC time', row 'NodeJS'

Errors and Failures

  • For the App: N/A
  • For the Orchestrator: look at function_orchestrator_function_implementation_error_count under the row 'Function Calls'

What Do I Look For?

  • In the 'Function Calls' row, check for spikes in function_orchestrator_function_duration_milliseconds and function_orchestrator_function_implementation_error_count. Often, when we see spikes in duration and/or implementation errors, we also notice heap/memory-exceeded log events from NodeJS. We are still working out whether these occurrences are directly correlated, which is why we watch for both kinds of events.
  • As mentioned on our Logging page, see how such spikes and the graphs' trajectories correspond with the events shown in LogStash.
  • For the reasons above, also check for spikes in the Orchestrator's 'Saturation' rows, particularly those for CPU and memory usage.
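When looking for such spikes ad hoc, a PromQL query over the error counter can be run directly against Prometheus. The query below is only a sketch: the 5-minute window and the aggregation by the 'error' label are assumptions for illustration, not necessarily what our dashboard panels use:

```promql
# Per-error-type rate of implementation errors over the last 5 minutes
sum(rate(function_orchestrator_function_implementation_error_count[5m])) by (error)
```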

(Additional info on Wikifunctions/Performance_observability)

Custom Metrics

The following metrics were added so that we can track them via our LogStash logging. They are instantiated in lib/util.js in both the Evaluator and Orchestrator repositories.

(Debugging pro tip: track these metrics locally by running curl http://localhost:9100/metrics in your terminal.)
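Piping that output through grep narrows it to the custom metrics. The snippet below uses a hypothetical snapshot of the Prometheus text exposition format, since real values depend on your running service; in practice you would pipe the live output of the curl command instead of the here-document:

```shell
# Hypothetical sample of what the /metrics endpoint returns; with a
# running service, use: curl -s http://localhost:9100/metrics | grep '^function_orchestrator'
cat <<'EOF' | grep '^function_orchestrator'
# HELP function_orchestrator_incomingrequestcount function-orchestrator request count
# TYPE function_orchestrator_incomingrequestcount counter
function_orchestrator_incomingrequestcount 42
# HELP function_orchestrator_outgoingresponsecount function-orchestrator response count
# TYPE function_orchestrator_outgoingresponsecount counter
function_orchestrator_outgoingresponsecount 41
EOF
```

Note that grep drops the `# HELP` and `# TYPE` comment lines, leaving only the sample values.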

function-evaluator:

| Name | Type | Labels | Help (description) |
|------|------|--------|--------------------|
| function_evaluator_wasm_subprocess_count | Gauge | n/a | Number of currently running WASM subprocesses in Evaluator |

function-orchestrator:

| Name | Type | Labels | Help (description) |
|------|------|--------|--------------------|
| function_orchestrator_router_request_duration_seconds | Histogram | 'path', 'method', 'status' | request duration handled by router in seconds |
| function_orchestrator_incomingrequestcount | Counter | n/a | function-orchestrator request count |
| function_orchestrator_nonrequesterror | Counter | n/a | function-orchestrator non-request error count |
| function_orchestrator_outgoingresponsecount | Counter | n/a | function-orchestrator response count |
| function_orchestrator_function_duration_milliseconds | Summary | 'Z7_function_identity', 'isBuiltIn', 'implementationZID', 'requestId' | function call duration in milliseconds |
| function_orchestrator_function_implementation_error_count | Counter | 'error', 'reqId' | function call implementation errors count |
| function_orchestrator_function_execute()_count | Counter | 'reqId' | function execution counter |
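As a rough illustration of what one of these counters looks like when scraped, the sketch below hand-rolls a minimal counter that renders the Prometheus text exposition format. This is not the actual lib/util.js code; the real services presumably use a metrics client library, and the increment call site shown is hypothetical:

```javascript
// Minimal hand-rolled counter illustrating the Prometheus text
// exposition format; hypothetical sketch, not the real lib/util.js.
class Counter {
  constructor(name, help) {
    this.name = name;
    this.help = help;
    this.value = 0;
  }
  inc(by = 1) {
    this.value += by;
  }
  // Render the metric as a scrape of /metrics would show it.
  expose() {
    return [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
      `${this.name} ${this.value}`,
    ].join('\n');
  }
}

const incoming = new Counter(
  'function_orchestrator_incomingrequestcount',
  'function-orchestrator request count'
);
incoming.inc(); // hypothetically called once per handled request
console.log(incoming.expose());
```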

More Resources: