Talk:Parsoid



How to use thenets/parsoid in Docker in Windows 10?

DungLe94 (talkcontribs)

I've installed thenets/parsoid on Docker on Windows 10. I want to convert the text file F:\zim\pomme.txt to html. I tried

docker run --name myparsoid -d -t -i -v /f/zim:/zim thenets/parsoid:latest sh

type /zim/pomme.txt | docker exec myparsoid php bin/parse.php --wt2html --offline

but it returns an error

Microsoft Windows [Version 10.0.19042.928]

(c) Microsoft Corporation. All rights reserved.

C:\Users\Akira>docker run --name myparsoid -d -t -i -v /f/zim:/zim thenets/parsoid:latest sh

7912b0cef8fba4244b2519f4f9603ec8e278b67bcc4fe08f4658721b98f941f3

C:\Users\Akira>type /zim/pomme.txt | docker exec myparsoid node bin/parse.js --wt2html --offline

The syntax of the command is incorrect.

internal/modules/cjs/loader.js:638

   throw err;
   ^

Error: Cannot find module '/bin/parse.js'

   at Function.Module._resolveFilename (internal/modules/cjs/loader.js:636:15)
   at Function.Module._load (internal/modules/cjs/loader.js:562:25)
   at Function.Module.runMain (internal/modules/cjs/loader.js:831:12)
   at startup (internal/bootstrap/node.js:283:19)
   at bootstrapNodeJSCore (internal/bootstrap/node.js:622:3)

Could you please shed some light on how to fix the error?


Whitespace in headings?

Summary by Arlolra
RoySmith (talkcontribs)

What is the intended behavior when parsing:

== Foo ==

I would have expected the whitespace around Foo to be preserved, but it's not. The example at Parsoid/API#POST 2 implies that it is, but when I try it, the whitespace is gone:


wget -q -O -  'http://en.wikipedia.org/api/rest_v1/page/html/User:RoySmith%2Fsandbox%2Fparsoid-whitespace-example'


<!DOCTYPE html>

<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://en.wikipedia.org/wiki/Special:Redirect/revision/1005513872"><head prefix="mwr: https://en.wikipedia.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="b56a5f60-69b0-11eb-876b-49aa12313550"/><meta charset="utf-8"/><meta property="mw:pageId" content="66664679"/><meta property="mw:pageNamespace" content="2"/><link rel="dc:replaces" resource="mwr:revision/0"/><meta property="mw:revisionSHA1" content="9fa2ea02674418d1bab8d09bd0c639bcf220a57b"/><meta property="dc:modified" content="2021-02-08T01:54:45.000Z"/><meta property="mw:html:version" content="2.2.0"/><link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/User%3ARoySmith/sandbox/parsoid-whitespace-example"/><title>User:RoySmith/sandbox/parsoid-whitespace-example</title><base href="//en.wikipedia.org/wiki/"/><link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=mediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Csite.styles&amp;only=styles&amp;skin=vector"/><meta http-equiv="content-language" content="en"/><meta http-equiv="vary" content="Accept"/></head><body id="mwAA" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><section data-mw-section-id="0" id="mwAQ"></section><section data-mw-section-id="1" id="mwAg"><h2 id="Foo">Foo</h2></section></body></html>

SSastry (WMF) (talkcontribs)

Parsoid starts but fails to connect with curl

Summary by Arlolra

User disappeared

Johnjin216326 (talkcontribs)

OS is Fedora 31

I downloaded Parsoid from the BlueSpice wiki

ii. Created the service under /etc/systemd/system/parsoid.service


[Unit]

Description=Mediawiki Parsoid web service on node.js

Documentation=

Wants=local-fs.target network.target

After=local-fs.target network.target

[Install]

WantedBy=multi-user.target

[Service]

Type=simple

User=nobody

Group=nobody

WorkingDirectory=/opt/parsoid

#EnvironmentFile=-/etc/parsoid/parsoid.env

ExecStart=/usr/bin/nodejs /opt/parsoid /bin/server.js

KillMode=process

Restart=on-success

PrivateTmp=true

StandardOutput=syslog


iii. Under /opt/parsoid/config.yaml

worker_heartbeat_timeout: 300000

logging:

    level: info

services:

  - module: lib/index.js

    entrypoint: apiServiceWorker

    conf:

        localsettings: ./localsettings.js

iv. Under /opt/parsoid/localsettings.js

/*

 * This is an example configuration for a BlueSpiceWikiFarm setup

 * In this case 'httpd' is used as wiki webserver machine name as it is in our

 * docker environment.

 */

'use strict';

exports.setup = function(parsoidConfig) {

    parsoidConfig.dynamicConfig = function(domain) {

        var baseUrl = Buffer.from( domain, 'base64').toString();

        parsoidConfig.setMwApi({

            uri: baseUrl + '/api.php',

            domain: domain,

            strictSSL: false

        });

    }

};

The nodejs is at version 10 and parsoid is v0.10

Here's the output of curl

[root@wiki-server BlueSpice3]# curl http://127.0.0.1:8000

<!DOCTYPE html>

<#html lang="en">

<#head>

<#meta charset="utf-8">

<#title>Error<#/title>

<#/head>

<#body>

<#pre>Internal Server Error<#/pre>

<#/body>

<#/html>

(I've added a # in the bracket to show more info)

SELinux is disabled and the firewall is open on port 8000. Something is listening on that port, although netstat doesn't show that it is the Parsoid service:

[root@wiki-server BlueSpice3]# netstat -aon | grep 8000

tcp 0 0 127.0.0.1:8000 0.0.0.0:* LISTEN off (0.00/0/0)

httpd is configured with SSL domain certificate and https enabled.

Why does this fail?

Arlolra (talkcontribs)

Try restarting the Parsoid service and see what information is logged to syslog?
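One way to follow this suggestion on a systemd host is sketched below. It assumes the unit is named parsoid.service, as in the unit file quoted above, and that it is run by a user allowed to restart services and read the journal. The function name is made up for illustration.

```shell
# Restart a systemd unit and print its most recent log lines.
# Assumes systemd; the default unit name "parsoid" matches the
# parsoid.service file quoted earlier in this thread.
restart_and_show_logs() {
  unit="${1:-parsoid}"
  lines="${2:-50}"
  systemctl restart "$unit" || echo "restart of $unit failed" >&2
  systemctl status "$unit" --no-pager
  journalctl -u "$unit" -n "$lines" --no-pager
}
```

Calling restart_and_show_logs parsoid 100 as root would restart the service and show its last 100 log lines, which is usually where a startup error such as a bad ExecStart path or a missing module shows up.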

Parsoid is not working in a non-English language

Summary by Arlolra

User disappeared

186.151.92.120 (talkcontribs)

I tried several times to set up my wiki in Spanish with MediaWiki 1.35.1, and it always shows a Parsoid/REST error (curl error 7). However, when I choose English as the wiki language, everything works fine.

I hope this can be fixed, because my users don't speak English.

Arlolra (talkcontribs)

How did you install Parsoid?

Parsoid - Memory exhaustion on a big page

Summary by Arlolra

Something in the user's setup is enforcing the limit.

189.9.10.125 (talkcontribs)

Hello, I'm having some problems with mediawiki parsoid regarding memory exhaustion, can someone help me?

On a very big page (I can't tell the exact size, but the original, written in MS Word, has more than 70 pages) I get the following error:

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 135168 bytes) in /var/www/html/mediawiki-1.35.1/vendor/wikimedia/parsoid/src/Html2Wt/WikitextSerializer.php on line 1683

Line 1683 is a call to the explode function.


As you can see, it says the memory limit is 128M, but my phpinfo says 750M, configured via php.ini in several places to make sure (php.ini, php-fpm.conf)

from my phpinfo

memory_limit 750M 750M

here's a grep -r memory_limit on my /etc

php-fpm.d/www.conf:php_admin_value[memory_limit] = 750M

php.ini:memory_limit = 750M

so, both php.ini and fpm are configured with 750M


I already tried setting memory_limit in LocalSettings.php, but that did not help either.


PHP 7.4.14 (fpm-fcgi)

MediaWiki 1.35.1

Lua 5.1.5

ICU 65.1

MySQL 5.6.35-80.0-log

wikimedia/parsoid 0.12.1


Can someone help me? This is preventing me and my team from creating long and important documents.

Thank you!

189.9.10.125 (talkcontribs)

Oh, and I did stop/restart php, phpfpm and httpd. Even restarted the OS.

189.9.10.125 (talkcontribs)

I also tried calling ini_set( 'memory_limit', '750M' ); in wikitext2html and in WikitextSerializer.php, inside serializeDOM, but it raises the same error.

Arlolra (talkcontribs)

This isn't an inherent problem with Parsoid. On the WMF cluster, Parsoid runs with an ~1.4G memory limit, which it occasionally hits, but is certainly not limited to 128M

https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L18558-L18560

Something in your setup is enforcing that limit. Maybe it's the OS, maybe it's the HTTP server, or PHP configurations you're mentioning.

Try isolating it. You can run Parsoid on the command line with bin/parse.php

Pass it your large page and see if you run up against the memory limit there
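A rough way to carry out this isolation from a shell is sketched below. The byte count comes from the error message above. The file name large-page.wikitext is a placeholder for a saved copy of the page's wikitext, and the --wt2html flag is the one shown elsewhere on this page. The command-line PHP may read a different php.ini than php-fpm, so the limit is passed explicitly with -d. Note that the reported error occurred while serializing HTML back to wikitext, so this only checks whether a limit is enforced outside the web server.

```shell
# Convert a byte count from a PHP "Allowed memory size" error into MiB.
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

# The limit reported in the fatal error above.
echo "limit in error: $(bytes_to_mib 134217728)M"

# Show the limit the command-line PHP would use by default.
if command -v php >/dev/null 2>&1; then
  php -r 'echo "cli memory_limit: ", ini_get("memory_limit"), PHP_EOL;'
fi

# Parse the page outside the web server with an explicit limit.
# large-page.wikitext is a hypothetical file; run from the Parsoid directory.
if [ -f bin/parse.php ] && [ -f large-page.wikitext ]; then
  php -d memory_limit=750M bin/parse.php --wt2html < large-page.wikitext > /dev/null \
    && echo "parsed within 750M"
fi
```

If the command-line run succeeds with 750M while the web request still dies at 128M, the limit is probably being set somewhere in the web-serving path rather than by Parsoid itself.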

189.9.10.125 (talkcontribs)

Ok, i'll try that,

thank you

How to convert a Wiktionary dump to HTML?

DungLe94 (talkcontribs)

I've just found from this link that Parsoid is a perfect tool for converting a Wiktionary dump to HTML. I've downloaded the latest dump from here. However, I could not find any instructions for using Parsoid on this offline dump. Could you please elaborate on this?


Thank you so much for your help!

Dung Le.

Arlolra (talkcontribs)

There hasn't been any effort to make Parsoid usable with those dumps.

There are often questions about how to use Parsoid offline. See past discussions, https://lists.wikimedia.org/pipermail/wikitext-l/2020-February/000994.html https://lists.wikimedia.org/pipermail/wikitext-l/2020-April/000999.html https://www.mediawiki.org/wiki/Topic:Uko9gbijtxv2nh19

But so far Parsoid is mostly useful when it has access to a MediaWiki API to fetch configuration and resolve templates.

If you wanted to use the source from the dump, you could do something like cat "text from source" | php bin/parse.php --domain fr.wiktionary.org --wt2html and that will output some html. Alternatively, you can use the titles from the dump and fetch the html from the REST API, https://fr.wiktionary.org/api/rest_v1/page/html/bonjour
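The second option can be sketched in shell as follows. The URL pattern is the one shown above. The helper names are made up for this example, and the title escaping is deliberately minimal (spaces and slashes only), so titles with other special characters would need proper percent-encoding.

```shell
# Build the REST API URL that returns Parsoid HTML for a page title.
rest_html_url() {
  domain="$1"
  # Minimal escaping: spaces become underscores, slashes become %2F.
  title="$(printf '%s' "$2" | sed -e 's/ /_/g' -e 's,/,%2F,g')"
  printf 'https://%s/api/rest_v1/page/html/%s\n' "$domain" "$title"
}

# Fetch the HTML for one title (needs curl and network access).
fetch_html() {
  curl -sSf "$(rest_html_url "$1" "$2")"
}

rest_html_url fr.wiktionary.org bonjour
```

With a file of titles extracted from the dump, something like fetch_html fr.wiktionary.org "$title" > "$title.html" in a loop would save one HTML file per entry. Titles containing slashes would need a different output file name.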


Parsoid with Kerberos and Auth_Remoteuser

Wikweng (talkcontribs)

Hello all,

I'm running into some problems with Parsoid and Auth_remoteuser authentication with Kerberos.

First my setup:

Ubuntu 20.04.

MediaWiki 1.31.12

PHP 7.4.3

BlueSpice 3.2.0

Parsoid 0.10.0

Kerberos SSO is working fine. The problem is that when I edit an existing article with VisualEditor, the page turns white, and when I try to save, an "HTTP 500" error appears. In the syslog I see a "401 Unauthorized" error. My configuration is below. Parsoid is running on the same server as my Apache web server and is reachable on port 8000 (via CLI, curl, and browser). VisualEditor does work when creating a new section or page.

___________________________________________

Apache vhost:

<VirtualHost *:443>

  ServerName mywiki.mydomain.com

  ServerAlias mywiki

   DocumentRoot /path/to/mediawiki

   <Directory /path/to/mediawiki>

       Options Indexes FollowSymLinks MultiViews

       AllowOverride None

       

   <RequireAny>

   AuthType Kerberos

   AuthName "Kerberos Login"

   KrbAuthRealms mydomain.COM

   KrbServiceName HTTP/serviceusr.mydomain.com

   Krb5KeyTab /etc/apache2/kerberos/mykeytab.keytab

   # Strips @REALM

   KrbLocalUserMapping On

   KrbMethodNegotiate on

   KrbMethodK5Passwd off

   Require valid-user

   Require ip 127.0.0.0/255.0.0.0

   </RequireAny>

   

   </Directory>

</VirtualHost>

___________________________________________

LocalSettings.php:

$wgAuthRemoteuserUserName = function() {

   global $wgDBname;

   $user = '';

   if( isset( $_SERVER[ 'REMOTE_USER' ] ) ) {

       $user = $_SERVER[ 'REMOTE_USER' ] . '@mydomain.com';

   }

   if( isset( $_SERVER[ 'REMOTE_ADDR' ] ) && substr( $_SERVER[ 'REMOTE_ADDR' ], 0, 4 ) == '127.' ) {

       if( empty( $user ) ) {

           $user = $_COOKIE[$wgDBname.'304f3058RemoteToken'] . '@mydomain.com';

       }

   }

   return $user;

  };

___________________________________________

settings.d\020-VisualEditor.php:

// Creating base64 encoded path

$fullPath = $GLOBALS['wgServer'] . $GLOBALS['wgScriptPath'];

$encFullPath = base64_encode( $fullPath );

// Linking with Parsoid

$wgVirtualRestConfig['modules']['parsoid'] = array(

   // URL to the Parsoid instance

   // Use port 8142 if you use the Debian package

   'url' => 'http://127.0.0.1:8000',

   'domain' => $encFullPath,

   'forwardCookies' => true

);

$wgVisualEditorEnableWikitext = true;

___________________________________________

Parsoid config.yaml:

worker_heartbeat_timeout: 300000

logging:

  level: info

services:

  - module: lib/index.js

    entrypoint: apiServiceWorker

    conf:

        localsettings: ./localsettings.js

___________________________________________

Parsoid localsettings.js:

'use strict';

exports.setup = function(parsoidConfig) {

   parsoidConfig.dynamicConfig = function(domain) {

       var baseUrl = Buffer.from( domain, 'base64').toString();

       parsoidConfig.setMwApi({

           uri: baseUrl + '/api.php',

           domain: domain,

           strictSSL: false

       });

   }

};

___________________________________________
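For debugging a setup like this, it can help to reproduce by hand the value that VisualEditor passes as the Parsoid domain. The sketch below mirrors the base64_encode() call in 020-VisualEditor.php and the Buffer.from( domain, 'base64') call in localsettings.js. The function names and the example URL are illustrative, and it assumes a base64 utility that accepts -d (as GNU coreutils does).

```shell
# Encode a wiki base URL the way the VisualEditor settings file does.
encode_domain() { printf '%s' "$1" | base64 | tr -d '\n'; }

# Decode it again, as Parsoid's localsettings.js does.
decode_domain() { printf '%s' "$1" | base64 -d; }

# Example base URL; replace with $wgServer . $wgScriptPath of the wiki.
encoded="$(encode_domain 'https://mywiki.mydomain.com')"
echo "domain sent to Parsoid: $encoded"
echo "decoded base URL:       $(decode_domain "$encoded")"
echo "API endpoint Parsoid will call: $(decode_domain "$encoded")/api.php"
```

Requesting the printed api.php URL with curl from the Parsoid host is one way to see whether Apache answers such a request without Kerberos credentials or replies with 401.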


Maybe I'm missing the obvious here, but I've been struggling with this issue for a few days now, and I think it is time to ask the community for help ;)

S0ring (talkcontribs)

I used to install Parsoid for MW 1.31 (supported until June 2021) with:

# git clone --branch v0.10.0 https://gerrit.wikimedia.org/r/p/mediawiki/services/parsoid /usr/lib/parsoid

but the URL is no longer valid; it returns 404. How do I install it now?

Arlolra (talkcontribs)
S0ring (talkcontribs)
S0ring (talkcontribs)

This works: # git clone --branch v0.10.0 https://github.com/wikimedia/parsoid.git

Arlolra (talkcontribs)
S0ring (talkcontribs)

This URL returns the following error:

# git clone --branch v0.10.0 https://gerrit.wikimedia.org/r/admin/repos/mediawiki/services/parsoid /usr/lib/parsoid

Cloning into '/usr/lib/parsoid'...

fatal: https://gerrit.wikimedia.org/r/admin/repos/mediawiki/services/parsoid/info/refs not valid: is this a git repository?

Arlolra (talkcontribs)

TagWhiteList in PHP Parsoid

Summary by Arlolra

See AllowedLiteralTags

Dueni.f (talkcontribs)

Is there a TagWhiteList in PHP parsoid? I used it in JS parsoid for the Draw.io Extension (added 'A' and 'IMG' to WikitextConstants.js) to show the images in VisualEditor.

Arlolra (talkcontribs)
Dueni.f (talkcontribs)

Exactly right. Thank you very much.

Couldn't resolve host name (curl error: 6) but able to curl MediaWiki fine

Summary last edited by Babajkov 11:26, 27 January 2021

Couldn't resolve host name (curl error: 6) but able to curl MediaWiki fine


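Fix (Ubuntu): edit /etc/hosts, for example with

nano /etc/hosts

and add a line that maps the wiki's host name to a local address:

127.0.1.1 your.wiki.domain.name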
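Whether such an entry is present can be checked with a small script. This is only a sketch: the function name is made up, wiki.example.org stands in for the wiki's real host name, and it only inspects the hosts file rather than performing a real DNS lookup.

```shell
# Succeed if the given host name appears in a hosts file.
# $1 = host name, $2 = hosts file (defaults to /etc/hosts)
has_hosts_entry() {
  awk -v name="$1" '
    { sub(/#.*/, "") }    # ignore comments
    { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
    END { exit (found ? 0 : 1) }
  ' "${2:-/etc/hosts}"
}

# wiki.example.org is a placeholder for the wiki host name.
if has_hosts_entry wiki.example.org; then
  echo "wiki.example.org is listed in /etc/hosts"
else
  echo "wiki.example.org is not listed in /etc/hosts"
fi
```

The check has to be run on the machine (or inside the container) that reports the curl error, since each has its own hosts file.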

FireAmpersand (talkcontribs)

Not sure if I should be posting here or on the VisualEditor page, but I can't for the life of me get Parsoid/VisualEditor working.

Here is some info about my setup:

- Host: Centos 8

- Docker Installed and MediaWiki is running on that with 80->80

- MediaWiki is a private wiki

- Parsoid is installed on the host and mapped to port 8001 (the default port conflicted with a container already in use)

- VisualEditor is installed and throws no error other than the one in the title.


Now I have gone through the following pages to try to troubleshoot my issue:

Parsoid/Troubleshooting

Extension:VisualEditor#Linking with Parsoid in private wikis


Currently I can curl the API through Parsoid and get values from the private wiki, but within MediaWiki I cannot edit with VisualEditor.


LocalSettings.php


<?php

# This file was automatically generated by the MediaWiki 1.34.0

# installer. If you make manual changes, please keep track in case you

# need to recreate them later.

#

# See includes/DefaultSettings.php for all configurable settings

# and their default values, but don't forget to make changes in _this_

# file, not there.

#

# Further documentation for configuration settings may be found at:

# https://www.mediawiki.org/wiki/Manual:Configuration_settings

# Protect against web entry

if ( !defined( 'MEDIAWIKI' ) ) {

   exit;

}

## Uncomment this to disable output compression

# $wgDisableOutputCompression = true;

$wgSitename = "FireAmpersand Documentation";

$wgMetaNamespace = "FireAmpersand_Documentation";

## The URL base path to the directory containing the wiki;

## defaults for all runtime URL paths are based off of this.

## For more information on customizing the URLs

## (like /w/index.php/Page_title to /wiki/Page_title) please see:

## https://www.mediawiki.org/wiki/Manual:Short_URL

$wgScriptPath = "";

## The protocol and server name to use in fully-qualified URLs

$wgServer = "http://bumblebee.fireampersand.ca";

## The URL path to static resources (images, scripts, etc.)

$wgResourceBasePath = $wgScriptPath;

## The URL path to the logo.  Make sure you change this from the default,

## or else you'll overwrite your logo when you upgrade!

$wgLogo = "$wgResourceBasePath/resources/assets/wiki.png";

## UPO means: this is also a user preference option

$wgEnableEmail = false;

$wgEnableUserEmail = true; # UPO

$wgEmergencyContact = "apache@🌻.invalid";

$wgPasswordSender = "apache@🌻.invalid";

$wgEnotifUserTalk = false; # UPO

$wgEnotifWatchlist = false; # UPO

$wgEmailAuthentication = true;

## Database settings

$wgDBtype = "sqlite";

$wgDBserver = "";

$wgDBname = "my_wiki";

$wgDBuser = "";

$wgDBpassword = "";

# SQLite-specific settings

$wgSQLiteDataDir = "/var/www/data";

$wgObjectCaches[CACHE_DB] = [

   'class' => SqlBagOStuff::class,

   'loggroup' => 'SQLBagOStuff',

   'server' => [

       'type' => 'sqlite',

       'dbname' => 'wikicache',

       'tablePrefix' => '',

       'variables' => [ 'synchronous' => 'NORMAL' ],

       'dbDirectory' => $wgSQLiteDataDir,

       'trxMode' => 'IMMEDIATE',

       'flags' => 0

   ]

];

$wgLocalisationCacheConf['storeServer'] = [

   'type' => 'sqlite',

   'dbname' => "{$wgDBname}_l10n_cache",

   'tablePrefix' => '',

   'variables' => [ 'synchronous' => 'NORMAL' ],

   'dbDirectory' => $wgSQLiteDataDir,

   'trxMode' => 'IMMEDIATE',

   'flags' => 0

];

$wgJobTypeConf['default'] = [

   'class' => 'JobQueueDB',

   'claimTTL' => 3600,

   'server' => [

       'type' => 'sqlite',

       'dbname' => "{$wgDBname}_jobqueue",

       'tablePrefix' => '',

       'variables' => [ 'synchronous' => 'NORMAL' ],

       'dbDirectory' => $wgSQLiteDataDir,

       'trxMode' => 'IMMEDIATE',

       'flags' => 0

   ]

];

## Shared memory settings

$wgMainCacheType = CACHE_ACCEL;

$wgMemCachedServers = [];

## To enable image uploads, make sure the 'images' directory

## is writable, then set this to true:

$wgEnableUploads = true;

$wgUseImageMagick = true;

$wgImageMagickConvertCommand = "/usr/bin/convert";

# InstantCommons allows wiki to use images from https://commons.wikimedia.org

$wgUseInstantCommons = false;

# Periodically send a pingback to https://www.mediawiki.org/ with basic data

# about this MediaWiki instance. The Wikimedia Foundation shares this data

# with MediaWiki developers to help guide future development efforts.

$wgPingback = false;

## If you use ImageMagick (or any other shell command) on a

## Linux server, this will need to be set to the name of an

## available UTF-8 locale

$wgShellLocale = "C.UTF-8";

## Set $wgCacheDirectory to a writable directory on the web server

## to make your wiki go slightly faster. The directory should not

## be publicly accessible from the web.

#$wgCacheDirectory = "$IP/cache";

# Site language code, should be one of the list in ./languages/data/Names.php

$wgLanguageCode = "en-ca";

$wgSecretKey = "d8f1a050a167350916861d1c65c428ec2e15ef1ddde8c3d305533350dd83f480";

# Changing this will log out all existing sessions.

$wgAuthenticationTokenVersion = "1";

# Site upgrade key. Must be set to a string (default provided) to turn on the

# web installer while LocalSettings.php is in place

$wgUpgradeKey = "364ffa43b8516e75";

## For attaching licensing metadata to pages, and displaying an

## appropriate copyright notice / icon. GNU Free Documentation

## License and Creative Commons licenses are supported so far.

$wgRightsPage = ""; # Set to the title of a wiki page that describes your license/copyright

$wgRightsUrl = "";

$wgRightsText = "";

$wgRightsIcon = "";

# Path to the GNU diff3 utility. Used for conflict resolution.

$wgDiff3 = "/usr/bin/diff3";

# The following permissions were set based on your choice in the installer

$wgGroupPermissions['*']['createaccount'] = false;

$wgGroupPermissions['*']['edit'] = false;

$wgGroupPermissions['*']['read'] = false;

## Default skin: you can change the default skin. Use the internal symbolic

## names, ie 'vector', 'monobook':

$wgDefaultSkin = "vector";

# Enabled skins.

# The following skins were automatically enabled:

wfLoadSkin( 'MonoBook' );

wfLoadSkin( 'Timeless' );

wfLoadSkin( 'Vector' );

# End of automatically generated settings.

# Add more configuration options below.

#Extensions

# VisualEditor

wfLoadExtension( 'VisualEditor' );

$wgDefaultUserOptions['visualeditor-enable'] = 1;

$wgHiddenPrefs[] = 'visualeditor-enable';

#Parsoid Connection

$wgVirtualRestConfig['modules']['parsoid'] = array(

   #URL to the Parsoid instance

   'url' => 'http://bumblebee.fireampersand.ca:8001',

   'domain' => 'localhost',

   'prefix' => 'localhost'

   );

//$wgSessionsInObjectCache = true;

//$wgVirtualRestConfig['modules']['parsoid']['forwardCookies'] = true;

if ( $_SERVER['REMOTE_ADDR'] == '172.17.0.1'){

   $wgGroupPermissions['*']['read'] = true;

   $wgGroupPermissions['*']['edit'] = true;

};

ini_set('error_log','/tmp/php-error.log');

error_log($_SERVER['REMOTE_ADDR']);



Any help would be appreciated; I have been working on this since last night.

Arlolra (talkcontribs)