Manual:PhpWiki conversion

This article describes the steps of a process that was developed for converting PhpWiki pages into MediaWiki pages. (The process was devised by User:KeithTyler for use in a workplace intranet.)

The process entails:
 * Exporting
 * Control script
 * Scrubbing
 * Sed script
 * Database insert

Exporting
The export was performed via the "ZIP Snapshot" administrator function of PhpWiki. The function sends a ZIP file of all pages, in RFC822 format, to the browser for download.

The resulting ZIP file was unzipped into a source directory.

For past revisions
ZIP Snapshot only includes the current versions of files. If historic versions of pages are required, then the ZIP Dump should be used.

This process does not address past revisions. A process to migrate all past revisions of PhpWiki pages will require a separate (or smarter) database insertion process. In MediaWiki, current revisions of pages are stored in the cur table, while previous revisions are stored in the old table.

Control script
The control script consists of a for loop that calls the scrubbing and sed script on each file, and sends the output to a target results directory.

for file in *
do
  # print from line 14 onward (older versions of tail spell this "tail +14")
  tail -n +14 "$file" | sed -f ../phpwikiconvert > "converted/$file"
done

Scrubbing
The "tail +14" step prints each file starting at line 14, which drops the RFC822 header lines that PhpWiki places at the top of every exported page.

These headers contain PhpWiki metadata for the page, such as:

Date: Mon, 18 Oct 2004 16:31:28 -0700
Mime-Version: 1.0 (Produced by PhpWiki 1.3.5pre)
Content-Type: application/x-phpwiki;
  pagename=10-digit%20dialing;
  flags="";
  author=KeithTyler;
  version=2;
  lastmodified=1098142288;
  author_id=192.168.32.112;
  markup=2;
  charset=iso-8859-1
Content-Transfer-Encoding: binary

If this metadata needs to be maintained in the migration, a more complex conversion method will be required that retains this data and uses it in the database insertion step of the process.

Note that using tail +14 is inelegant, but seems to work. A more robust method would strip every line up to and including the first blank line, which is where the RFC822 header ends.
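The more robust approach can be sketched with a sed address range instead of a fixed line count. This is only an illustration: the function name strip_header is hypothetical, and it assumes the header is separated from the page text by a line that is empty or contains only whitespace.

```shell
# Print a PhpWiki dump file without its RFC822 header.
# The range "1,/^[[:space:]]*$/" runs from line 1 through the first blank
# line, so the number of header lines no longer matters.
# (strip_header is a hypothetical name, not part of the original process.)
strip_header() {
  sed '1,/^[[:space:]]*$/d' "$@"
}

# Possible use inside the control script:
#   strip_header "$file" | sed -f ../phpwikiconvert > "converted/$file"
```

Blank lines inside the page text are preserved, because the range closes at the first blank line and is never reopened.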

Sed script
The following sed script covers, in simple fashion:
 * Typeface markup
 * Header markup
 * WikiLink markup
 * Table markup
 * Redirect markup

For the database insertion process, we also need to escape all double-quotes.

Note that this does not perform any implicit conversion or linking of CamelCase links. Note also that any implicit conversion of CamelCase links (e.g. to insert a space) must also be accounted for in the database insertion step, or in the output step of the hook script.

# typeset markup
s/_\([^_]*\)_/''\1''/g # italic -- OK
s/\*\([^\*]*\)\*/'''\1'''/g  # boldface -- OK
s!=\([^=]*\)=!<tt>\1</tt>!g # fixed-width -- OK

# header markup -- OK
s/!!!\(.*\)$/==\1==/g
s/!!\(.*\)$/===\1===/g
s/!\(.*\)$/====\1====/g

# table markup (hopefully)
s!\([^|][^|]*\)|!\1||!g
s!^|!|-\n|!g # convert row start -- OK
s!.*plugin OldStyleTable.*!\{\|! # convert table start -- mostly OK
s!^?>$!\|\}! # convert table end -- mostly OK

# link markup
s!\[\(.*\)|\(http.*\)]![\2 \1]!g # url format -- OK
s!\[\(.*\)|\(.*\)\]![\2|\1]!g  # switch display and link text -- OK
s!\[\([^]]*\)\]![[\1]]!g  # double bracketize -- OK
s!\[\[\(http.*\)\]\]![\1]!g  # undo double-bracketing urls by above -- OK

# redirects
s!<?plugin RedirectTo page=\(.*\)?>!#REDIRECT [[\1]]!

# quotes
s!"!\\"!g

There are almost certainly issues that this script does not take care of. For example, it certainly will not take care of any PhpWikiPlugins (besides OldStyleTable). Of course, there are plenty of plugins that do not have apparent analogues in MediaWiki (like CalendarPlugin), so you will always need to be prepared for more human-oriented conversion and de-functionalisation work. If you've become wholly dependent on PhpWikiPlugins, you're quite likely in for a migration headache.

DlStyleTables
The above sed does not address DlStyleTables. Which is good, because DlStyleTables are a lame excuse for not having more integrated table support in PhpWiki. If you never saw the point of DlStyleTables, good. But if for some reason you eventually decided to find a use for them, you'll have to convert them yourself.

If you don't convert DlStyleTables, they will look like this:

Term 1| Definition 1 Term 2| Definition 2

It will be challenging to find an automated process to convert these. Try perl; sed alone can't do it very easily. You will need to:
 * Convert Term lines to boldface cells
 * Convert Definition lines to cells
 * Place row dividers between each pair
 * Add table start and end markup to blocks of rows.
 * Avoid having DlStyleTables treated as OldStyleTable markup. This means your process will need to recognize the multi-line nature of DlStyleTables and recognize the line sequence as such, and not as part of an OldStyleTable. PhpWiki could do this because OldStyleTables were inside <?plugin ?> blocks, and DlStyleTables were regular markup.
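As a rough illustration of the steps listed above, the following sketch uses awk (rather than perl) wrapped in a shell function. It rests on assumptions that are not confirmed by this article: that a term line starts in the first column and ends with a pipe, that its definition follows on indented lines, and that OldStyleTable rows start with a pipe. The function name convert_dl_tables is hypothetical.

```shell
# Sketch only: convert assumed DlStyleTable markup to MediaWiki table markup.
# Assumed input:   Term 1 |
#                    Definition 1
# A doubled trailing pipe is also accepted, so the filter can be run on the
# output of the sed script, which doubles pipes.
convert_dl_tables() {
  awk -v bold="'''" '
    # Term line: starts in column 1 (not with a pipe) and ends with pipes.
    /^[^ \t|].*[|]+[ \t]*$/ {
      if (in_table) {
        print "|-"                  # row divider between pairs
      } else {
        print "{|"                  # table start
        in_table = 1
      }
      term = $0
      sub(/[ \t]*[|]+[ \t]*$/, "", term)
      print "| " bold term bold     # boldface term cell
      next
    }
    # Definition line: indented, and only meaningful inside a table.
    in_table && /^[ \t]+[^ \t]/ {
      def = $0
      sub(/^[ \t]+/, "", def)
      print "| " def                # definition cell
      next
    }
    # Any other line closes an open table and is passed through.
    {
      if (in_table) {
        print "|}"
        in_table = 0
      }
      print
    }
    END { if (in_table) print "|}" }
  ' "$@"
}
```

Each indented line becomes its own cell, so a definition that spans several lines would need to be joined first. Lines that start with a pipe are never treated as terms, which keeps OldStyleTable rows out of this conversion.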

File renaming
It was deemed desirable to rename the files to remove escaped characters (specifically, turning escaped spaces (%20) into underscores and escaped slashes (%2F) into slashes).

A command line for loop performs this for spaces:

/phpwiki/converted% for file in *; do mv "$file" "`echo "$file" | sed 's!%20!_!g'`"; done 2>/dev/null

For slashes, we need to perform this conversion during the database insert step. Most systems will not allow slashes in filenames as slashes are directory name delimiters.

Database insert
The most efficient means determined for inserting a collection of pages into the MediaWiki DB was to use SQL INSERT statements (as part of a for loop using the mysql client). The minimum set of fields needed for the insert is:
 * cur_title
 * cur_text
 * cur_user_text
 * cur_timestamp

If the migration should be attributed to a special user, that user should be created first, and then cur_user must be added to the insert. With only the fields above, cur_user will be the column default of 0. cur_user_text is required, however, as the page history will display incorrectly without a name to display. Likewise for cur_timestamp.

We need to make sure of the following:
 * No spaces appear in the filenames
 * Escaped slashes (%2F) are converted to slashes
 * All titles begin with a capital letter

Otherwise, MediaWiki will never line up the requested name with the title in the database. The first issue has been solved by the for-loop above; the second and third will be addressed in the insertion loop.

The shell script to do the insertion should look something like this:

for file in *; do
  # decode %2F to a slash and capitalize the first letter of the title
  title=`echo "$file" | sed 's!%2F!/!g' | perl -n -e "print ucfirst;"`
  cat <<END | mysql --password="(your db password)" wikidb
insert into cur (cur_title,cur_user_text,cur_timestamp,cur_text)
values("$title","PhpWikiMigration",now()+0,"`cat "$file"`");
END
done

We use a quick sed to convert %2F to a slash, and a quick perl to capitalize the filename. Then we use cat with a heredoc to throw in a SQL statement with our fixed title, and the contents of the associated file.

Caveats
Further caveats beyond those already stated above.

This process does not handle conversion of PhpWiki users' pages into MediaWiki User: namespace pages. It is possible by using a trick with the PhpWiki metadata: if the title name is the same as the author name, then put the page into namespace=2. Unfortunately, among other things, this assumes that the user was the last person to edit their user page. Depending on the size of your user base, you may opt to manually move these pages.
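The metadata trick could look roughly like the sketch below. It has to read the original exported files, since the converted files no longer carry the header. The function names header_param and page_namespace are hypothetical, and the sketch assumes the header parameters appear as key=value; pairs as in the example header shown earlier, with a page name that needs no URL decoding to match the author name.

```shell
# Print the value of one "key=value;" parameter from the RFC822 header.
# Reading stops at the first empty line, so page text is never inspected.
# usage: header_param KEY FILE   (hypothetical helper)
header_param() {
  sed -n -e '/^$/q' -e "s/^.*[ ;]$1=\([^;]*\).*\$/\1/p" "$2" | head -n 1
}

# Print 2 (the User: namespace) when the page name equals the author name,
# otherwise 0 (the main namespace).
# usage: page_namespace FILE     (hypothetical helper)
page_namespace() {
  pagename=$(header_param pagename "$1")
  author=$(header_param author "$1")
  if [ -n "$pagename" ] && [ "$pagename" = "$author" ]; then
    echo 2
  else
    echo 0
  fi
}
```

The printed value would then have to be carried into the insert statement, which this process does not do as written.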

Plugin markup besides OldStyleTable and RedirectTo will remain in the converted output. This will need to be changed by hand. Since PhpWikiPlugins are essentially open-ended modules, there is really no way for the process to know exactly what your plugin is supposed to do (in our case, some of the plugins had been hacked at to add or change features, and some new plugins were created).

Even if PhpWiki provided a dump method that implicitly called the plugins and spat out their output, this could cause a mess when it comes to RedirectToPlugin and probably other convertible plugins (we'd get HTML tables instead of being able to convert OldStyleTable markup to MediaWiki table wikicode, for example).

If you were to opt to pull out your PhpWiki content in processed HTML, you'd have to invent a messy process for converting the generated HTML links back into WikiLinks.

This process does not treat underscores within words differently than underscores surrounding words. You will get errant italicization. The sed script could probably be fixed to address this.
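One possible fix, offered as a sketch rather than a tested replacement, is to italicize only when the underscores sit at word boundaries. The function name italicize is hypothetical. It uses sed -E (extended regular expressions), so it is meant to run as a separate filter instead of being pasted into the script above, which uses basic syntax.

```shell
# Convert _text_ to ''text'' only when the opening underscore is not
# preceded, and the closing underscore is not followed, by a letter, digit
# or underscore. Names such as my_file_name are therefore left alone.
# The substitution is repeated via the :a / ta loop because one match
# consumes the separator that the next match would need.
# (italicize is a hypothetical name.)
italicize() {
  sed -E -e ':a' \
         -e "s/(^|[^[:alnum:]_])_([^_]+)_([^[:alnum:]_]|\$)/\1''\2''\3/" \
         -e 'ta' "$@"
}
```

If this filter is used, the italic rule in the sed script should be removed so that underscores are not converted twice.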

Other things lacking from this process that could be added:
 * Conversion of ~ to
 * Conversion of %%% to
 * Conversion of  to