Extension:Cargo/Storing data

From MediaWiki.org

The creation of data structures, and storage of data, is done in Cargo exclusively via templates. Any template that makes use of Cargo needs to contain calls to the parser functions #cargo_declare and #cargo_store; or, more rarely, calls to #cargo_attach and #cargo_store. #cargo_declare defines the fields for a table of data, #cargo_store stores data within that table, and #cargo_attach specifies that a template stores its data to a table that has been defined elsewhere.

Declaring a table

A template that stores data in a table needs to also either declare that table, or "attach" itself to a table that is declared elsewhere. Since there is usually one table per template and vice versa, most templates that make use of Cargo will declare their own table. Declaring is done via the parser function #cargo_declare.

This function is called with the following syntax:

{{#cargo_declare:
_table = table_name
|field_1 = field description 1
|field_2 = field description 2
...etc.
}}

First, note that neither the table name nor field names can contain spaces; instead, you can use underscores, CamelCase, etc.

The field description must start with the type of the field, and in many cases it will simply be the type. The following types are predefined in Cargo:

  • Page - holds the name of a page in the wiki
  • String - holds standard, non-wikitext text
  • Text - holds standard, non-wikitext text; intended for longer values
  • Integer - holds an integer
  • Float - holds a real number, i.e. one that may have a fractional part
  • Date - holds a date without time
  • Datetime - holds a date and time
  • Boolean - holds a Boolean value, whose value should be 1 or 0, or 'yes' or 'no'
  • Coordinates - holds geographical coordinates
  • Wikitext - holds text that is meant to be parsed by the MediaWiki parser
  • Searchtext - holds text that can be searched on, using the MATCHES command
  • File - holds the name of an uploaded file or image in the wiki (similar to Page, but does not require specifying the "File:" namespace)
  • URL - holds a URL
  • Email - holds an email address

Any other type specified will simply be treated as type "String".

A field can also hold a list of any such type. To define such a list, the type value needs to look like "List (delimiter) of type". For example, to have a field called "Authors" that holds a list of string values separated by commas, you would have the following parameter in the #cargo_declare call:

|Authors=List (,) of String

The description string can also have additional parameters; these all are enclosed within parentheses after the type identifier, and separated by semicolons. Current allowed parameters are:

  • size= - for fields of type "Page", "String", "Wikitext", "File", "URL" and "Email", sets the size of this field, i.e. the number of characters; the default is 300
  • hierarchy - specifies that the field holds a hierarchy of values, as defined in the "allowed values" parameter (see next item)
  • allowed values= - a set of allowed values that a field can have. (This is usually only done for fields of type "String" or "Page".) If "hierarchy" is not specified, this should simply be a set of comma-separated values. If "hierarchy" is specified, the values should be defined using the syntax of a bulleted list. In brief: every value should be on its own line, each line should start with at least one "*", the first line should start with exactly one "*", and the number of "*" should increase by no more than one at a time.

    For example, to define a field called "Main ingredient" that is a hierarchy, you could have the following declaration:

    |Main_ingredient = String (hierarchy;allowed values=*Fruits
    **Mangoes
    **Apples
    *Vegetables
    **Root vegetables
    ***Carrots
    ***Turnips
    **Peppers)
    
  • link text= - for fields of type "URL", sets text that would be displayed as a link to that URL. By default the entire URL is shown.
  • hidden - takes no value. If set, the field is not listed in Special:Drilldown, although it is still queriable.

    For example, to define a field called "Color" that has three allowed values, you could have the following declaration:

    |Color=String (size=10;allowed values=Red,Blue,Yellow)
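To make the description-string format concrete, here is a minimal Python sketch that splits a description such as the "Color" one above into its type, optional list delimiter and parameters. It is only an illustration of the rules stated in this section, not Cargo's own parser; the function name `parse_field_description`, the returned dictionary layout and the exact-case matching of type names are all assumptions.

```python
import re

# Types listed in the documentation; anything else falls back to "String".
KNOWN_TYPES = {
    "Page", "String", "Text", "Integer", "Float", "Date", "Datetime",
    "Boolean", "Coordinates", "Wikitext", "Searchtext", "File", "URL", "Email",
}
# Types for which the documentation says "size=" applies (default 300).
SIZED_TYPES = {"Page", "String", "Wikitext", "File", "URL", "Email"}


def parse_field_description(description):
    """Split a #cargo_declare field description into its parts (hypothetical helper)."""
    text = description.strip()
    delimiter = None

    # Optional list wrapper: "List (delimiter) of type".
    list_match = re.match(r"List\s*\((.*?)\)\s*of\s+(.*)", text, re.DOTALL)
    if list_match:
        delimiter, text = list_match.group(1), list_match.group(2).strip()

    # Type name, optionally followed by "(param;param=value;...)".
    type_match = re.match(r"(\w+)\s*(?:\((.*)\))?\s*$", text, re.DOTALL)
    if not type_match:
        raise ValueError(f"cannot parse field description: {description!r}")
    type_name, raw_params = type_match.group(1), type_match.group(2)
    if type_name not in KNOWN_TYPES:
        type_name = "String"  # documented fallback for unknown types

    # Parameters are separated by semicolons; a bare name is a flag.
    params = {}
    for part in (raw_params or "").split(";"):
        if not part.strip():
            continue
        key, sep, value = part.partition("=")
        params[key.strip()] = value.strip() if sep else True

    field = {
        "type": type_name,
        "is_list": delimiter is not None,
        "delimiter": delimiter,
        "params": params,
    }
    if type_name in SIZED_TYPES:
        field["size"] = int(params.get("size", 300))  # documented default
    return field
```

Calling `parse_field_description("List (,) of String")` in this sketch yields a list field with delimiter "," and the default size of 300.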
    

    #cargo_declare also displays a link to the Special:CargoTables page for viewing the contents of this database table.
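The bulleted-list rules for a hierarchy's "allowed values", given earlier in this section, can be checked mechanically. The following Python sketch validates such a list and works out each value's parent. It is an illustration under stated assumptions, not Cargo's code: the function names are hypothetical, and skipping blank lines is an assumption.

```python
def parse_hierarchy(allowed_values):
    """Return (depth, value) pairs, raising ValueError if a rule is broken."""
    entries = []
    previous_depth = 0
    for line_number, raw in enumerate(allowed_values.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        depth = len(line) - len(line.lstrip("*"))
        if depth == 0:
            raise ValueError(f"line {line_number}: must start with '*'")
        if not entries and depth != 1:
            raise ValueError("the first line must start with exactly one '*'")
        if depth > previous_depth + 1:
            raise ValueError(f"line {line_number}: depth increases by more than one")
        entries.append((depth, line[depth:].strip()))
        previous_depth = depth
    return entries


def parents_of(entries):
    """Map each value to its parent value (None for top-level values)."""
    parents = {}
    stack = []  # stack[i] is the most recent value seen at depth i + 1
    for depth, value in entries:
        del stack[depth - 1:]
        parents[value] = stack[-1] if stack else None
        stack.append(value)
    return parents


# The "Main ingredient" hierarchy from the example in this section.
MAIN_INGREDIENT = """*Fruits
**Mangoes
**Apples
*Vegetables
**Root vegetables
***Carrots
***Turnips
**Peppers"""
```

With the example hierarchy, `parents_of(parse_hierarchy(MAIN_INGREDIENT))` maps "Carrots" to "Root vegetables" and "Fruits" to `None`.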

Attaching to a table

In some cases, you may want more than one template to store their data to the same Cargo table. In that case, only one of the templates should declare the table, while the others should simply "attach" themselves to that table, using the parser function #cargo_attach.

This function is called with the following syntax:

{{#cargo_attach:
_table = table_name
}}

You do not actually need this call in order for a template to add rows to some table; a #cargo_store call placed anywhere, via a template or otherwise, will add a row to a table (assuming the call is valid). However, #cargo_attach lets you do the "Recreate data" action for that template - see "Creating or recreating a table", below.

Storing data in a table

A template that declares a table or attaches itself to one should also store data in that table. This is done with the parser function #cargo_store. Unlike #cargo_declare and #cargo_attach, which apply to the template page itself and thus should go into the template's <noinclude> section, #cargo_store applies to each page that calls that template, and thus should go into the template's <includeonly> section.

This function is called with the following syntax:

{{#cargo_store:
_table = table_name
|field_1 = value 1
|field_2 = value 2
...etc.
}}

The field names must match those in the #cargo_declare call elsewhere in the template.

The values will usually be template parameters, but in principle they can hold anything.

Storing a recurring event

Special handling exists for storing recurring events, which are events that happen regularly, like birthdays or weekly meetings. For these, the parser function #recurring_event exists. It takes in a set of parameters for a recurring event (representing the start date, frequency etc.), and simply prints out a string holding a list of the dates for that event. It is meant to be called within #cargo_store (for a field defined as holding a list of dates), and #cargo_store will then store the data appropriately. #recurring_event is called with the following syntax:

{{#recurring_event:
start=start date
|end=end date
|unit=day, week, month or year
|period=some number, representing the number of "units" between event instances (default is 1)
|include=list of dates, to be included in the list
|exclude=list of dates to exclude
|delimiter=delimiter for dates (default is ',')
}}

Of these parameters, only "start=" and "unit=" are required.

By default, if no end date is set, or if the end date is too far in the future, #recurring_event stores 50 instances of the event. To change this, you can add a setting for $wgCargoRecurringEventMaxInstances in LocalSettings.php, below the line that loads Cargo. For instance, to set the number to 100, you would add the following:

$wgCargoRecurringEventMaxInstances = 100;
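As a rough model of what #recurring_event produces, the Python sketch below expands a start date, unit and period into a delimited list of dates, capped at a maximum number of instances. It is not Cargo's implementation. The ISO date output, the clamping of month and year steps to the last day of a short month, and the choice to add "include" dates on top of the cap are all assumptions made for illustration.

```python
import calendar
from datetime import date, timedelta


def add_months(d, months):
    """Shift a date by whole months, clamping the day to the month's length."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(month_index, 12)
    day = min(d.day, calendar.monthrange(year, month0 + 1)[1])
    return date(year, month0 + 1, day)


def recurring_event(start, unit, end=None, period=1, include=(), exclude=(),
                    delimiter=",", max_instances=50):
    """Return the event dates as one delimited string (hypothetical model)."""
    if unit not in ("day", "week", "month", "year"):
        raise ValueError("unit must be day, week, month or year")
    if period < 1:
        raise ValueError("period must be a positive integer")

    excluded = set(exclude)
    dates = []
    n = 0
    while len(dates) < max_instances:
        # Always step from the start date so clamped days do not drift.
        if unit == "day":
            current = start + timedelta(days=n * period)
        elif unit == "week":
            current = start + timedelta(weeks=n * period)
        elif unit == "month":
            current = add_months(start, n * period)
        else:
            current = add_months(start, 12 * n * period)
        if end is not None and current > end:
            break
        if current not in excluded:
            dates.append(current)
        n += 1

    # Assumption: explicitly included dates are merged in after the cap.
    all_dates = sorted(set(dates) | set(include))
    return delimiter.join(d.isoformat() for d in all_dates)
```

For example, a weekly event starting on 2024-01-01 and ending on 2024-01-29 gives five dates in this model, and an event with no end date stops at 50 instances, mirroring the default described above.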

Example

You can see two templates that make use of #cargo_declare and #cargo_store here and here.

Creating or recreating a table

No data is actually generated or modified when a template page containing a #cargo_declare call is saved. Instead, the data must be created or recreated in a separate process. There are two ways to do this:

Web-based tab

The form displayed at "?action=recreatedata" when an existing table is being recreated

From the template page, select the tab action called either "Create data" or "Recreate data". This will bring up an interface that may contain a checkbox reading "Recreate data into a replacement table, keeping the old one for querying". That checkbox will only appear if the Cargo table in question already exists.

Once you hit "OK", one of the following will happen:

  1. If the checkbox was selected, a "replacement table" will be created, while the current table remains unaffected. This replacement table can be viewed by anyone, but its data will not be used in queries. (In the database, the actual table will have a name like "cargo__tableName__NEXT".) If/when you think this replacement table is ready to be used, you can click on the "Switch in table" link at Special:CargoTables. This link will delete the current Cargo table and rename the replacement table so that it becomes the official table. Conversely, if you don't want to use the replacement table, you can click on the "Delete" link for it.
  2. If the checkbox was not selected, the current table will be deleted immediately, and a new version will get created.
  3. If the checkbox was not there, it means that this is a new table. In that case, the table will be created.

In all three cases, MediaWiki jobs are used to cycle through all the relevant pages and recreate the data - a separate job is created for each page. This can be a lengthy process for large tables, which is why using the "replacement table" approach is strongly recommended for larger tables.

Depending on your MediaWiki configuration, a call to MediaWiki's runJobs.php script may be useful or even necessary for all these jobs to get run.

If any templates contain #cargo_attach, they too will get a "Create data" or "Recreate data" tab. If this tab is selected and activated, it will not drop and recreate the database table itself; instead, it will only recreate those rows in the table that came from pages that call that template.
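The "replacement table" workflow described above can be pictured with a small database sketch. The Python code below uses an in-memory SQLite database purely for illustration: it models "Switch in table" as dropping the current table and renaming the "__NEXT" table into its place, and "Delete" as dropping the replacement. The helper names are hypothetical, and the real operations are performed by Cargo from Special:CargoTables, not by code like this.

```python
import sqlite3


def switch_in_replacement(conn, table_name):
    """Model of "Switch in table": the replacement becomes the official table."""
    current = f"cargo__{table_name}"
    replacement = f"cargo__{table_name}__NEXT"
    conn.execute(f'DROP TABLE IF EXISTS "{current}"')
    conn.execute(f'ALTER TABLE "{replacement}" RENAME TO "{current}"')
    conn.commit()


def delete_replacement(conn, table_name):
    """Model of "Delete": discard the replacement, keep the current table."""
    conn.execute(f'DROP TABLE IF EXISTS "cargo__{table_name}__NEXT"')
    conn.commit()
```

Until one of these two steps happens, queries in this model keep reading the untouched current table, which is the point of recreating into a replacement table first.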

Permissions

The ability to create/recreate data is available to users with the 'recreatecargodata' permission, which by default is given to sysops. You can give this permission to other users; for instance, to have a new user group, 'cargoadmin', with this ability, you would just need to add the following to LocalSettings.php:

$wgGroupPermissions['cargoadmin']['recreatecargodata'] = true;

Once a table exists for a template, any page that contains one or more calls to that template will have its data in that table refreshed whenever it is resaved; and new pages that contain call(s) to that template will get their data added in when the pages are created.

Command-line script

If you have access to the command line, you can also recreate the data by calling the script cargoRecreateData.php, located in Cargo's /maintenance directory. It can be called in one of two ways:

  • php cargoRecreateData.php - recreates the data for all Cargo tables in the system
  • php cargoRecreateData.php --table tableName - recreates the data for the one specified Cargo table.

In addition, the script can be called with the --quiet flag, which turns off all printouts. For full usage information, call it with --help.

Storing page data

You can create an additional Cargo table that holds "page data": data specific to each page in the wiki, not related to infobox data. This data can then be queried either on its own or joined with one or more "regular" Cargo tables. The table is named "_pageData", and it holds one row for every page in the wiki. You must specify the set of fields you want the table to store; by default it will only hold the five standard Cargo fields, like _pageName (see Database storage details). To include additional fields, add to the array $wgCargoPageDataColumns in LocalSettings.php, below the line that installs Cargo.

Currently there are six more fields that can be added to the _pageData table; here are the six fields, and the call to add each one:

  • _creationDate - the date the page was created:
$wgCargoPageDataColumns[] = 'creationDate';
  • _modificationDate - the date the page was last modified:
$wgCargoPageDataColumns[] = 'modificationDate';
  • _creator - the username of the user who created the page:
$wgCargoPageDataColumns[] = 'creator';
  • _fullText - the (searchable) full text of the page:
$wgCargoPageDataColumns[] = 'fullText';
  • _categories - the categories of the page (a list, queriable using "HOLDS"):
$wgCargoPageDataColumns[] = 'categories';
  • _numRevisions - the number of edits this page has had:
$wgCargoPageDataColumns[] = 'numRevisions';

Once you have specified which fields you want the table to hold, go to the Cargo /maintenance directory, and make the following call to create, or recreate, the _pageData table:

php setCargoPageData.php

If you want to get rid of this table, call the following instead:

php setCargoPageData.php --delete

You do not need to call the "--delete" option if you are planning to recreate the table; simply calling setCargoPageData.php will delete the previous version.

Storing file data

Similarly to page data, you can also automatically store data for each uploaded file. This data gets put in a table called "_fileData", which holds one row for each file. This table again has its own settings array, to specify which columns should be stored, called $wgCargoFileDataColumns.

There are currently five columns that can be set:

  • '_mediaType' - the media type, or MIME type, of each file, like "image/png":
$wgCargoFileDataColumns[] = 'mediaType';
  • '_path' - the directory path of the file on the wiki's server:
$wgCargoFileDataColumns[] = 'path';
  • '_lastUploadDate' - the date at which the file was last uploaded:
$wgCargoFileDataColumns[] = 'lastUploadDate';
  • '_fullText' - the full text of the file; this is only stored for PDF files:
$wgCargoFileDataColumns[] = 'fullText';
  • '_numPages' - the number of pages in the file; this is only stored for PDF files:
$wgCargoFileDataColumns[] = 'numPages';

To store the full text of PDF files, you need to have the pdftotext utility installed on the server, and then add the following to LocalSettings.php:

$wgCargoPDFToText = '...path to file.../pdftotext';

pdftotext is available as part of several different packages. If you have the PdfHandler extension installed (and working), you may have pdftotext installed already.

To store the number of pages, you need to have the pdfinfo utility installed on the server, and then add the following to LocalSettings.php:

$wgCargoPDFInfo = '...path to file.../pdfinfo';

Database storage details

When the data for a template is created or recreated, a database table is created in the Cargo database that (usually) has one column for each specified field. This table will additionally hold the following columns:

  • _pageName - holds the name of the page from which this row of values was stored.
  • _pageTitle - similar to _pageName, but leaves out the namespace, if there is one.
  • _pageNamespace - holds the numerical ID of the namespace of the page from which this row of values was stored.
  • _pageID - holds the internal MediaWiki ID for that page.
  • _ID - holds a unique ID for this row.

Storage of lists

For fields that have lists of values, the handling is more complex: a whole separate database table is created to hold all the individual values for this field. This table will get the name "MainTableName__FieldName" (e.g. "Books__Authors"), and it will have the following fields:

  • _rowID - holds the ID of the row (i.e., _ID) in the main table that this value corresponds to.
  • _value - holds the actual, individual value.

So if an "Authors" field contained three values, the "Books__Authors" table would have three rows corresponding to that one page.

There's one more complication for list fields: the corresponding field for a list field in the main database table will not actually be given that name, but will rather be called "FieldName__full", e.g. "Authors__full". This is to enable the "true" field name to serve as a "virtual" field within the #cargo_query call, to make querying on the field values table easier (see 'The "HOLDS" command').
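The two-table layout for list fields can be sketched with SQLite. The Python code below builds a "Books" table with an "Authors__full" column and a "Books__Authors" helper table with "_rowID" and "_value", then finds pages by joining the two. It only mirrors the column names given in this section; the SQLite types, the helper function names and the trimming of whitespace around each value are assumptions, and the SQL that Cargo itself generates is not shown here.

```python
import sqlite3


def make_books_db():
    """Create an in-memory database with the main table and its list table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Books (_ID INTEGER PRIMARY KEY, _pageName TEXT, Authors__full TEXT)"
    )
    conn.execute("CREATE TABLE Books__Authors (_rowID INTEGER, _value TEXT)")
    return conn


def store_book(conn, page_name, authors_text, delimiter=","):
    """Store one row, splitting the list field into the helper table."""
    cursor = conn.execute(
        "INSERT INTO Books (_pageName, Authors__full) VALUES (?, ?)",
        (page_name, authors_text),
    )
    row_id = cursor.lastrowid
    values = [v.strip() for v in authors_text.split(delimiter) if v.strip()]
    conn.executemany(
        "INSERT INTO Books__Authors (_rowID, _value) VALUES (?, ?)",
        [(row_id, value) for value in values],
    )
    conn.commit()
    return row_id


def pages_with_author(conn, author):
    """Find pages whose Authors list contains the given value."""
    rows = conn.execute(
        "SELECT Books._pageName FROM Books "
        "JOIN Books__Authors ON Books._ID = Books__Authors._rowID "
        "WHERE Books__Authors._value = ? ORDER BY Books._pageName",
        (author,),
    )
    return [row[0] for row in rows]
```

Storing a page with three authors adds one row to "Books" and three rows to "Books__Authors", matching the description above.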

Storage of coordinates

For fields of type 'Coordinates', as with fields that hold a list of values, no database field is created with the actual specified field name. Instead, the following three fields are created:

  • fieldName__full - holds the coordinates as written in the page
  • fieldName__lat - holds the latitude from the coordinates, as a float
  • fieldName__lon - holds the longitude from the coordinates, as a float

If the coordinates cannot be parsed, the "__full" field still gets the value, but the "__lat" and "__lon" fields are set to null.
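The split into three fields, including the fallback for unparsable input, can be sketched as follows. This Python example assumes the simplest possible input, a decimal "latitude, longitude" pair; the coordinate formats that Cargo actually accepts are not described here, and the function name is hypothetical.

```python
def split_coordinates(field_name, raw):
    """Return the three stored values for one Coordinates field."""
    lat = lon = None
    parts = [part.strip() for part in raw.split(",")]
    if len(parts) == 2:
        try:
            lat, lon = float(parts[0]), float(parts[1])
        except ValueError:
            lat = lon = None  # unparsable: only the "__full" value is kept
    return {
        f"{field_name}__full": raw,  # the coordinates as written in the page
        f"{field_name}__lat": lat,
        f"{field_name}__lon": lon,
    }
```

Here `split_coordinates("Location", "near the river")` keeps the original text in "Location__full" and leaves the other two values as `None`.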

Storage of dates

For fields of type 'Date' or 'Datetime', an extra field is created that is named "fieldName__precision". It holds an integer value representing the "precision" of each date value, i.e. whether it holds a full date, only a year, etc. The possible values are:

  • 0 - date and time (can only occur for 'Datetime' fields)
  • 1 - date only
  • 2 - year and month only
  • 3 - year only
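To illustrate how a precision value relates to the stored date, the Python sketch below classifies a date string into one of the four values listed above. It assumes ISO-style input ("2024", "2024-03", "2024-03-15", "2024-03-15 10:30") and assumes that a time given for a plain 'Date' field is simply ignored; the date formats Cargo itself accepts, and the function name, are not taken from the documentation.

```python
from datetime import datetime

# (format, precision) pairs, tried from most to least precise.
_FORMATS = [
    ("%Y-%m-%d %H:%M:%S", 0),
    ("%Y-%m-%d %H:%M", 0),
    ("%Y-%m-%d", 1),
    ("%Y-%m", 2),
    ("%Y", 3),
]


def date_precision(value, field_type="Datetime"):
    """Return 0 (date and time), 1 (date), 2 (year and month) or 3 (year)."""
    text = value.strip()
    for fmt, precision in _FORMATS:
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue
        if precision == 0 and field_type != "Datetime":
            return 1  # assumption: a 'Date' field drops the time part
        return precision
    raise ValueError(f"unrecognized date value: {value!r}")
```

In this model, "2024-03" gets precision 2, so a query could distinguish it from a full date that happens to fall in the same month.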