Phlogiston/Data Loading Model

From MediaWiki.org
Jump to navigation Jump to search

A lot of the Phlogiston data model and logic is dedicated to recreating the state of the Phabricator data at a fixed point in time. If this data could be accessed directly from Phabricator, this would be unnecessary. In the model below, the steps that are specific to Phlogiston reporting, and could not be pulled from Phabrictor, are marked with a ✱.

Data Loading[edit]

  1. Download the latest http://dumps.wikimedia.org/other/misc/phabricator_public.dump (updated around 0400 UTC).
  2. Discard all previously loaded Phabricator data (the "Load" tables)
  3. Load the dump file into the database.
    1. All project, column, task, and unparsed transaction information is loaded.
    2. As an optimization, all edge transactions are parsed from the transaction log and a list of edge transactions is generated.
    3. Everything keyed by a PHID is re-keyed to ID.

Data Reconstruction[edit]

Once for each scope:

  1. Generate a list of project IDs relevant to the scope. ✱
    1. All projects listed in the recategorization file are relevant. ✱
    2. The Status Report project in <prefix>_scope.pyis relevant. ✱
    3. The hard-coded IDs for certain keyword tags, e.g., 'category', are relevant. ✱
  2. Determine the range of dates to be processed:
    1. If an incremental run, start the day after the last day in the data.
    2. If a complete run,
      1. wipe reconstruction tables of any data for this scope
      2. Set the start date to the date in <prefix>_scope.py.
  3. For each day since the start date,
    1. For each task in the complete list of tasks in Phabricator,
      1. Get the list of edges from the most recent edge transaction (not later than the working day) associated with the task. For each project in the list of edges,
        1. If the project is also a project relevant to this scope
          1. Make a record in maniphest_edge for this combination of date, task, and project.
          2. Example: In edge transaction data, there is a single record, "on 2018-04-01, Project 300 was added to Task 142.". After reconstruction, there is a one record linking Project 300 and Task 142 for each day from 2018-04-01 to today.
  4. For each day since the start date,
    1. For each task associated with any of the relevant project IDs on the working day,
      1. Reconstruct the state of the task for that day:
        1. Get title from the current (most recent load, not working day) values in maniphest_task.
        2. Get points from the most recent points transaction before or on working day.
          1. If none is available, get points from the current values in maniphest_task.
            1. If none is available, use default_points from <prefix>_scope.py. ✱
        3. Get status from the most recent status transaction before or on working day.
        4. Get priority from the most recent priority transaction before or on working day.
        5. Determine the project ✱
          1. Get all of the edges associated with the task on the working day (from maniphest_edge) ✱
          2. Compare the edges with the list of relevant projects id, looking for the highest-priority project id that matches. ✱
            1. Priority is determined by row order in <prefix>_categorization.csv.
            2. This step is necessary to ensure that the categorization rules can be applied correctly. (If a task kept all of its project edges, it might get counted in a lower-priority category).
        6. With the project determined, get the column on that board, if any, from the most recent core:columns transaction before or on the working day.
      2. Using the reconstructed fields, add a record to task_on_day representing the state of that task on that day within that scope.
    2. For each task that is tagged "Category", and in scope,
      1. For each each descendent of that task,
        1. Make a row in phab_parent_category_edge linking the task to the original ancestor for that day.
      2. The algorithm includes only tasks present in the task_on_day, so if a child is not present in the data (because it doesn't belong to any projects in the source project list), but the grandchild is, the grandchild will not be included. See T115936#1847188 for a more precise algorithm, not implemented in Phlogiston. ✱
    3. Repeat the previous step but for "goal".
  5. For all tasks in scope, use the ancestor relationship, if any, to update the category_title field in the task in task_on_day.
    1. As implemented, this doesn't update the category_title of the ancestor tasks themselves, so another pass updates their category_titles as well
  6. fix_status(). Handle special cases where task status is not properly accessible through the transaction log, and set the task status to whatever the current status is.

Data Reporting[edit]

Once for each scope:

  1. Precalculate a lot of key dates.
  2. Wipe Reporting tables of any data for this scope.
  3. Copy all records in scope in task_on_date to task_on_date_recategorized.
    1. Convert all status=stalled tasks to status=open ✱
    2. Delete all duplicate, invalid, and declined tasks, which will be completely absent from the reports. ✱
    3. For the rest of Reporting, changes apply to the temporary data, not the original reconstructed data.
  4. Reload <scope>_recategorization.csv. This is to allow development of report parameters without having to repeat reconstruction. ✱
  5. Recategorize all tasks in task_on_date_recategorized ✱
    1. For each category rule, in priority order: ✱
      1. Apply the rule to set the category of all tasks in scope that have not already been categorized. ✱
    2. Each task∙day record is recategorized separately, so a task may have one category for some dates and a different category for later dates.
  6. If retroactive categories is specified in the configuration, update the category for each task∙day to be equal to the most recent category for that task. ✱
  7. If retroactive points is specified in the configuration, update the points for each task∙day to be equal to the most recent points for that task. ✱
  8. Prepare data for the recently_closed report ✱
  9. Generate aggregate data ✱
    1. Determine the "backlog_resolved_cutoff". If specified in the scope configuration, this is either the date specified, or the start of the current quarter. Otherwise it is not used. ✱
      1. The purpose of this is to reset the burnup of resolved tasks, so that the burnup for the current period can be clearly seen rising from 0, rather than being an incremental change on top of all previous completed work.
    2. Create three different datasets, all stuffed into task_on_date_agg and aggregated by status, category, maint_type (obsolete), and date.  The datasets aggregating the daily data in three ways: with no cutoff, with the specified cutoff, and with a cutoff three months before specified. ✱
  10. generate_reporting_files(). Generate all of the CSV files necessary for reporting this scope: ✱
    1. Set up a temp dir for this scope, /tmp/<scope>
    2. Execute make_report_csvs.sql to generate csv files in /tmp/phlog ✱
      1. In addition to making many CSV files, this also calls calculate_velocities(), which calculates historical data and forecasts for all categories in this scope. ✱
    3. Rename the files and move them to /tmp/<scope>.
  • Make Tranche Reports ✱
    • Prepare a color palette based on how many categories there are in scope. ✱
    • For each category (Tranche), ✱
      • call make_tranche_chart.R with a bunch of scope-specific parameters, including the color, to generate a set of graphs for that category. ✱
  • Generate a number of charts by querying the database and using python html templates to make html files. ✱
    • Update the dates of the report (date run and date of most recent data) ✱
    • Generate the forecast charts. ✱
    • Generate the open task report. ✱
    • Generate the unpointed tasks report. ✱
    • Generate the recently_closed_tasks report. ✱
    • Generate status reports. ✱
  • Call make_charts.R with data files and parameters, to generate the forecast, burnup charts, velocity, and points histograpm graphs. ✱