Initial setup

Log in to stat1007:

$ ssh stat1007

Setup proxy

$ export https_proxy=https://webproxy.eqiad.wmnet:8080

$ export http_proxy=http://webproxy.eqiad.wmnet:8080

Create virtualenv

$ virtualenv --python=/usr/bin/python3 python3

Activate the virtual environment:

$ source python3/bin/activate

Now install Jupyter Notebook:

$ pip install jupyter

Next, add the following lines to your .profile file:

export PYSPARK_PYTHON=/usr/bin/python3.5
export PYSPARK_PYTHON=/srv/home/USER/python3/bin/python

You can also add these two aliases for convenience:

alias venvspark="source python3/bin/activate; source ~/.profile"
alias startspark="pyspark2 --master yarn --deploy-mode client --executor-memory 8g --driver-memory 8g --conf spark.dynamicAllocation.maxExecutors=128"

Close the session; the configuration takes effect at your next login.
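
If you re-run the setup, the same lines can end up in .profile more than once. A minimal sketch of one way to avoid that, assuming a bash login shell; the helper name `add_line_once` is hypothetical and not part of the original instructions:

```bash
#!/usr/bin/env bash
# Append a line to a file only if that exact line is not already present.
# add_line_once is a hypothetical helper name used for illustration.
add_line_once() {
    local file="$1" line="$2"
    touch "$file"    # make sure the file exists before grepping it
    # -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (value copied from the lines above):
#   add_line_once ~/.profile 'export PYSPARK_PYTHON=/usr/bin/python3.5'
```

Calling the helper twice with the same line leaves a single copy in the file.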

Starting notebook

Check when your Kerberos ticket expires first. The default ticket lifetime is currently 48 hours.

$ klist

If the ticket has expired or is about to, get a new one by running kinit:

$ kinit
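
The check and the renewal above can be combined. This is a sketch only, assuming the MIT Kerberos `klist`, whose `-s` flag prints nothing and exits non-zero when the credentials cache is missing or expired; the function name `ensure_ticket` is hypothetical:

```bash
#!/usr/bin/env bash
# Run kinit only when no valid Kerberos ticket is cached.
# ensure_ticket is a hypothetical helper name used for illustration.
ensure_ticket() {
    if klist -s; then
        echo "Kerberos ticket is still valid"
    else
        echo "No valid Kerberos ticket, running kinit"
        kinit    # prompts for your Kerberos password
    fi
}

# Usage:
#   ensure_ticket
```

Note that this only detects an expired ticket; a ticket that expires partway through a long job still needs a manual `klist` check.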

Now you can log in again; you only need to run:

$ venvspark

$ startspark

Press ESC, then check which port the Jupyter notebook is running on (usually 8888 or 8889; in this example it is 8889).
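
If the startup output has scrolled away, the port can also be recovered from a second shell on the same host. The sketch below assumes the classic `jupyter notebook list` subcommand is available in the virtualenv and prints one `http://host:port/... :: directory` line per running server; `extract_ports` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Extract port numbers from `jupyter notebook list`-style output on stdin.
# Expected input lines look like: http://localhost:8889/?token=... :: /some/dir
# extract_ports is a hypothetical helper name used for illustration.
extract_ports() {
    # -n with the p flag prints only lines where the substitution matched
    sed -nE 's#^https?://[^:/]+:([0-9]+)/.*#\1#p'
}

# Usage (with the virtualenv activated):
#   jupyter notebook list | extract_ports
```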


Then, on your local machine, create a tunnel by running:

$ ssh -N stat1007.eqiad.wmnet -L 8889:

Then open the notebook in your browser at:


Running scripts

1. Run all notebooks in order.

2. 00ExtractNamedTempates.ipynb overwrites its existing output when run again, so save the JSON files it produces somewhere else to avoid regenerating them.

Using Python

1. Convert ipynb to Python files:

```bash
jupyter nbconvert --to python nb.ipynb
```

2. Update config.json, remove unneeded pair.

3. Put the Wikipedia dumps only under `templatesAlignment/../../dumps/%swiki/latest/`.

4. Rename the dump files so that the dump date reads `latest`, which simplifies running the scripts.

5. Run all scripts in order.
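
Steps 1, 4 and 5 can be scripted. The following is a rough sketch, not the project's own tooling. It assumes the scripts carry the same numeric prefixes as the notebooks (`00...`, `01...`), that dump files follow the usual `xxwiki-YYYYMMDD-...` naming, and that `python3` is the interpreter; all function names are hypothetical:

```bash
#!/usr/bin/env bash
# Interpreter used to run the converted scripts (python3 is an assumption).
PYTHON_BIN="${PYTHON_BIN:-python3}"

# Step 1: convert every notebook in a directory to a .py script.
convert_notebooks() {
    local dir="${1:-.}" nb
    for nb in "$dir"/*.ipynb; do
        [ -e "$nb" ] || continue    # skip the unexpanded glob if nothing matches
        jupyter nbconvert --to python "$nb"
    done
}

# Step 4: print a dump file name with its 8-digit date replaced by "latest",
# e.g. enwiki-20200101-pages-articles.xml.bz2 -> enwiki-latest-pages-articles.xml.bz2
latest_name() {
    printf '%s\n' "$1" | sed -E 's/-[0-9]{8}-/-latest-/'
}

# Step 5: run numbered scripts in glob (lexicographic) order and
# stop at the first one that fails.
run_in_order() {
    local dir="${1:-.}" script
    for script in "$dir"/[0-9]*.py; do
        [ -e "$script" ] || continue
        echo "Running $(basename "$script")"
        "$PYTHON_BIN" "$script" || return 1
    done
}

# Usage:
#   convert_notebooks .
#   mv "$dump" "$(latest_name "$dump")"
#   run_in_order .
```

Stopping at the first failure matters here because later scripts presumably consume the output of earlier ones.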


1. can not be run on local machine.

2. If running locally, `01Download` needs to be run with ipython; alternatively, just download the needed models.

3. fastText_multilingual is available at: Apply the patch given at #23 to fix the ModuleNotFound error when running the script.

4. requires instead of version provided by pip.

Also see