Convert Socialtextwiki to Mediawiki

This page describes how to convert a Socialtext wiki to MediaWiki using Linux. It is based on a single conversion and is by no means exhaustive; it was tested with a wiki comprising only a few hundred pages and files, and the procedure could be improved a lot.

Socialtext wiki is similar to Kwiki.

The procedure described below can:

  • convert pages, retaining the essential syntax
  • convert files
  • convert histories of pages and files

and cannot:

  • convert tables (re-edit them manually; mostly this only means adding the table start and end syntax)
  • convert most other features of Socialtext wiki
  • convert the user association of edits
  • and much more

Introduction[edit]

Our Socialtext wiki is stored as files located in a directory named

data

The tree contains one directory per page (below data/{WORKSPACE}), with an index.txt containing the current version of the page and several {date}.txt files containing older revisions. (Workspaces are separate branches of a Socialtext wiki.)
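For orientation, a page directory might look roughly like this (the page name and dates are made up; the exact layout may differ between Socialtext versions):

data/zsi-fe/meeting_notes/20090112093055.txt
data/zsi-fe/meeting_notes/20090409161200.txt
data/zsi-fe/meeting_notes/index.txt

Each revision file consists of a header, a blank line and the page body. The scripts below rely only on the Subject:, Date: and Category: header fields, which look roughly like this (values are illustrative):

Subject: Meeting notes
Date: 2009-04-09 16:12:00 GMT
Category: Meetings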

The uploaded files (attachments) are located within a directory named

plugin
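Judging from the paths used by the file migration script further below, each attachment consists of a metadata .txt file (with Subject:, Date: and Control: headers) plus a directory of the same name containing the actual file, roughly like this (the ID and file name are made up):

plugin/zsi-fe/attachments/20090409161500.txt
plugin/zsi-fe/attachments/20090409161500/report.pdf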

Put all the following files and dirs (except for the new wiki) into one working directory and proceed as follows.

Install MediaWiki[edit]

  • install a current MediaWiki
  • allow uploads of all file types in LocalSettings.php:
$wgEnableUploads = true;
$wgStrictFileExtensions = false;
$wgCheckFileExtensions = false;
  • modify php.ini and reload apache2 (to be able to upload bigger files)
post_max_size = 32M
upload_max_filesize = 32M

Copy the original files to the new host[edit]

Copy these directories (use scp, not rsync, since we don't want symlinks preserved; the index.txt files are symlinks); an example command follows the list:

  • data
  • plugin
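A minimal sketch (host name and paths are placeholders):

scp -r user@old.socialtext.server:/path/to/data .
scp -r user@old.socialtext.server:/path/to/plugin .

scp -r follows symlinks, so index.txt arrives as a regular file containing the current page revision.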

Script to convert a single page[edit]

Create a script conv.py to convert a single page. It takes the file name of the page (index.txt or a revision file) as its first argument.

#!/usr/bin/python

import re
import sys

filename = sys.argv[1]
 
f = open(filename, "r")
text = f.read()
(header,content) = text.split('\n\n',1)

# trim content lines
lines = content.split('\n')
lines2 = [line.strip() for line in lines]
content = '\n'.join(lines2)

# headings
p = re.compile('^\^\^\^\^(.*)$', re.M)
content = p.sub('====\\1 ====', content)
p = re.compile('^\^\^\^(.*)$', re.M)
content = p.sub('===\\1 ===', content)
p = re.compile('^\^\^(.*)$', re.M)
content = p.sub('==\\1 ==', content)
p = re.compile('^\^(.*)$', re.M)
content = p.sub('=\\1 =', content)

# bold
p = re.compile('([^\*]+)\*([^\*]+)\*', re.M)
content = p.sub('\\1\'\'\'\\2\'\'\'', content)

# link
p = re.compile('\[([^\]]+)\]', re.M)
content = p.sub('[[\\1]]', content)

# file
p = re.compile('{file: ([^}]+)}', re.M)
content = p.sub('[[Media:\\1]]', content)

# image ("Bild:" is the image namespace of our German-language wiki; use "Image:" on an English-language wiki)
p = re.compile('{image: ([^}]+)}', re.M)
content = p.sub('[[Bild:\\1]]', content)

# item level 1
p = re.compile('\342\200\242\011', re.M)
content = p.sub('* ', content)

# table, only partially, do the rest manually!
# you have to add {|... , |} , and check for errors due to empty cells
p = re.compile('([^\n])\|', re.M)
content = p.sub('\\1\n|', content)
p = re.compile('\|\s*\|', re.M)
content = p.sub('|-\n|', content)

# lines with many / * + symbols were used as separator lines...
p = re.compile('[\/]{15,200}', re.M)
content = p.sub('----', content)
p = re.compile('[\*]{15,200}', re.M)
content = p.sub('----', content)
p = re.compile('[\+]{15,200}', re.M)
content = p.sub('----', content)

# external links
p = re.compile('\"([^\"]+)\"<http(.*)>\s*\n', re.M)
content = p.sub('[http\\2 \\1]\n\n', content)
p = re.compile('\"([^\"]+)\"<http(.*)>', re.M)
content = p.sub('[http\\2 \\1]', content)


# add categories
content += '\n'
header_lines = header.split('\n')
for line in header_lines:
    if re.match('^[Cc]ategory: ', line):
        category = re.sub('^[Cc]ategory: (.*)$', '\\1', line)
        content += '[[Category:' + category + ']]\n'

# departments / workspaces
if re.match('data/zsi-fe', filename):
    content += '[[Category:FE]]\n'
if re.match('data/zsi-ac', filename):
    content += '[[Category:AC]]\n'
if re.match('data/zsi-tw', filename):
    content += '[[Category:TW]]\n'

print content

Test it like this:

./conv.py data/{WORKSPACE}/{PAGENAME}/{REVISION}

Just copy the resulting wiki text into a page of the new MediaWiki and use the preview function.

Adapt the Python script to your needs until most pages are converted correctly.
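As a rough illustration of the rules above, a hypothetical Socialtext snippet such as

^^ Project status

This is *important*.
See [Meeting notes] and {file: report.pdf}.
"Our homepage"<http://www.example.org/>

should come out approximately as

== Project status ==

This is '''important'''.
See [[Meeting notes]] and [[Media:report.pdf]].
[http://www.example.org/ Our homepage]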

Script to upload a single file[edit]

The MediaWiki API does not yet have action=upload. Get upload.pl.

The file has to be modified to use our new server instead of mediawiki.blender.org. Also edit the username and password. Create a directory called 'upload', put content there and test uploading, for example as sketched below.
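A minimal test might look like this (file name, date and description are placeholders; the four-line files.txt format simply mirrors what the file migration script below writes for upload.pl):

mkdir upload
cp /path/to/some/testfile.pdf upload/
echo -e ">testfile.pdf\ntestfile.pdf\n2009-01-01\n(test upload)" > upload/files.txt
./upload.pl upload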

Script to migrate pages[edit]

Use this script (which calls ./conv.py) to migrate pages. Revisions are uploaded in chronological order, because the revision files are named by date and index.txt (the current version) sorts last within each page directory. Page directories listed in a file named excludes are skipped (an example follows the script):

#!/bin/sh

wikiurl="http://NAME.OF.NEW.SERVER/mediawiki/api.php"
lgname="WikiSysop"
lgpassword="*************"

# login
login=$(wget -q -O - --no-check-certificate --keep-session-cookies --save-cookies=/tmp/converter-cookies.txt \
             --post-data "action=login&lgname=$lgname&lgpassword=$lgpassword&format=json" \
             $wikiurl)
#echo $login 

# get edittoken
edittoken=$(wget -q -O - --no-check-certificate --load-cookies=/tmp/converter-cookies.txt \
             --post-data "action=query&prop=info|revisions&intoken=edit&titles=Main%20Page&format=json" \
             $wikiurl)
#echo $edittoken
token=$(echo $edittoken | sed -e 's/.*edittoken.:.\([^\"]*\)...\".*/\1/')
token="$token""%2B%5C"
#echo $token

# test editing with a test page
#cmd="action=edit&title=test1&summary=autoconverted&format=json&text=test1&token=$token&recreate=1&notminor=1&bot=1"
#editpage=$(wget -q -O - --no-check-certificate --load-cookies=/tmp/converter-cookies.txt --post-data $cmd $wikiurl)
#echo $editpage
#exit 

# loop over all pages except for dirs in the list of excludes
find data -not -path "data/help*" -type f -and -not -name ".*" | sort |
while read n; do
    pagedir=$(echo $n | sed -e 's/.*\/\(.*\)\/index.txt/\1/')
    if [[ "`grep -q $pagedir excludes; echo $?`" == "0" ]]; then
        echo "omitting  $pagedir"
    else
        echo "parsing   $pagedir"
        workspace=$(echo $n | sed -e 's/.*\/\(.*\)\/[^\/]\+\/index.txt/\1/')
        pagename=$(egrep '^Subject:' $n | head -n 1 | sed -e 's/^Subject: \(.*\)/\1/')
        pagedate=$(egrep '^Date:' $n | head -n 1 | sed -e 's/^Date: \(.*\)/\1/')
        echo "$workspace $pagedir -------------- $pagename";
        text=$(./conv.py $n)
        text1=$(php -r 'print urlencode($argv[1]);' "$text")
        pagename1=$(php -r 'print urlencode($argv[1]);' "$pagename")
        pagedate1=$(php -r 'print urlencode($argv[1]);' "$pagedate")
        cmd="action=edit&title=$pagename1&summary=$pagedate1+autoconverted+from+socialtextwiki&format=json&text=$text1&token=$token&recreate=1&notminor=1&bot=1"
        editpage=$(wget -q -O - --no-check-certificate --load-cookies=/tmp/converter-cookies.txt --post-data $cmd $wikiurl)
        #echo $editpage    
    fi
done
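The excludes file referenced above is a plain-text file in the working directory listing, one per line, the page directories that should not be migrated; a hypothetical example:

old_test_page
scratchpad
template_playground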

Script to migrate files[edit]

Use this script (which calls ./upload.pl) to migrate files. The files will be uploaded in chronological order:

#!/bin/sh

find plugin -path 'plugin/zsi*/attachments/*.txt' | sort |
while read f; do
    if [[ "`grep -q 'Control: Deleted' $f; echo $?`" != "0" ]]; then
        d=${f/.txt}
        filenameNew=$(egrep '^Subject:' $f | sed -e 's/Subject: \(.*\)/\1/')
        filenameOrig=$(ls -1 $d | head -n 1)
        version=$(egrep '^Date: ' $f | sed -e 's/Date: \(.*\)/\1/')
        #echo "---------------------------"
        #echo $filenameOrig
        #echo "$filenameNew"
        rm upload/*
        cp $d/$filenameOrig "upload/$filenameNew"
        # prepare upload
        echo -e ">$filenameNew\n$filenameNew\n$version\n(autoconverted from socialtext wiki)" > upload/files.txt
        # upload
        ./upload.pl upload
    fi
done

Notes[edit]