User:KartikMistry/TPA

From mediawiki.org

Initial setup

Log in to stat1007:

ssh stat1007

Setup proxy

export https_proxy=https://webproxy.eqiad.wmnet:8080
export http_proxy=http://webproxy.eqiad.wmnet:8080
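Before cloning, it may be worth confirming that both proxy variables are actually set in the current shell. The following is only a sketch; the function name show_proxy is made up for illustration and is not part of any Wikimedia tooling.

```shell
# Hypothetical helper: print the proxy variables, or report which are missing.
# Returns 0 when both are set, 1 otherwise.
show_proxy() {
    _rc=0
    for _name in http_proxy https_proxy; do
        # Indirect lookup of the variable whose name is stored in $_name.
        eval "_value=\${$_name:-}"
        if [ -n "$_value" ]; then
            printf '%s=%s\n' "$_name" "$_value"
        else
            printf '%s is not set\n' "$_name"
            _rc=1
        fi
    done
    return "$_rc"
}
```

Running show_proxy right after the two export lines should print both values; a "not set" line means the export did not take effect in this shell.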

Setup repository

git clone https://github.com/digitalTranshumant/templatesAlignment.git
cd templatesAlignment

Create virtualenv

virtualenv --python=/usr/bin/python3 python3

Activate the virtual environment:

source python3/bin/activate

Now install Jupyter Notebook:

pip install jupyter

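Next, add the following lines to your .profile file, replacing USER with your username. Note that PYSPARK_PYTHON is assigned twice, so the second value (the virtualenv interpreter) is the one that takes effect: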

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=/usr/bin/python3.7
export PYSPARK_PYTHON=/srv/home/USER/python3/bin/python

You can additionally add these two lines to make your life easier:

alias venvspark="source python3/bin/activate; source ~/.profile"
alias startspark="pyspark2 --master yarn --deploy-mode client --executor-memory 8g --driver-memory 8g --conf spark.dynamicAllocation.maxExecutors=128"
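If you set this up more than once, appending the same lines to .profile repeatedly leaves duplicates. A minimal sketch of an idempotent append follows; append_once is a hypothetical helper name, and the usage lines are commented out so that nothing touches your real .profile until you choose to run them.

```shell
# Hypothetical helper: append a line to a file unless that exact line
# is already present.
append_once() {
    _file=$1
    _line=$2
    touch "$_file"
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    if ! grep -qxF -- "$_line" "$_file"; then
        printf '%s\n' "$_line" >> "$_file"
    fi
}

# Example usage (uncomment to modify your real ~/.profile):
# append_once ~/.profile "export PYSPARK_DRIVER_PYTHON=jupyter"
# append_once ~/.profile "export PYSPARK_DRIVER_PYTHON_OPTS='notebook'"
```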

Close the session, and you will have everything configured.

Starting notebook

First, check when your Kerberos ticket expires. The default lifetime is currently 48 hours.

klist

Extend it by running kinit:

kinit
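The two steps above can be combined so that kinit only runs when it is needed. This is a sketch that assumes the MIT Kerberos klist, whose -s option runs silently and reports ticket validity through its exit status; ensure_ticket is a hypothetical name.

```shell
# Hypothetical helper: renew the Kerberos ticket only when no valid one exists.
ensure_ticket() {
    # `klist -s` prints nothing and exits non-zero when the credential
    # cache cannot be read or has expired (MIT Kerberos behaviour).
    if klist -s; then
        echo "Kerberos ticket is still valid"
    else
        echo "No valid Kerberos ticket, running kinit"
        kinit
    fi
}
```

Calling ensure_ticket before starting Spark avoids being prompted for a password when the ticket is still good.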

Now you can log in again; you only need to run:

venvspark

startspark

Press ESC, then check which port the Jupyter notebook is running on (usually 8888 or 8889). In this example it is 8889:

http://localhost:8889

Then, on your local machine, create an SSH tunnel by running:

ssh -N stat1007 -L 8889:127.0.0.1:8889

You can then open the notebook in your browser at:

http://localhost:8889
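Because the port can differ between sessions, the local and remote port in the tunnel command have to be changed together. A small sketch that builds the command from the port Jupyter reported is shown here; tunnel_cmd is a hypothetical helper, and the defaults simply mirror the values used on this page.

```shell
# Hypothetical helper: print the SSH tunnel command for a given port and host.
tunnel_cmd() {
    _port=${1:-8889}       # port reported by Jupyter on the stat machine
    _host=${2:-stat1007}   # SSH host name, as used on this page
    printf 'ssh -N %s -L %s:127.0.0.1:%s\n' "$_host" "$_port" "$_port"
}

tunnel_cmd 8888
# prints: ssh -N stat1007 -L 8888:127.0.0.1:8888
```

The function only prints the command, so it can be inspected before being run on the local machine.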

Running scripts

1. Run all notebooks in order.

2. 00ExtractNamedTempates.ipynb overwrites its existing output if it is run again, so it is better to save the JSON files it produces somewhere else to save time.
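One way to keep earlier output safe is to copy the JSON files into a timestamped directory before re-running the notebook. The sketch below assumes nothing about where the notebook writes its files, so both the source and backup directories are passed as arguments; backup_json is a hypothetical helper name.

```shell
# Hypothetical helper: copy every .json file from a source directory into a
# new timestamped directory under a backup root, then print that directory.
backup_json() {
    _src=$1
    _dest=$2/json-backup-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$_dest"
    for _f in "$_src"/*.json; do
        # Skip the unexpanded pattern when there are no .json files.
        [ -e "$_f" ] || continue
        cp -p "$_f" "$_dest/"
    done
    printf '%s\n' "$_dest"
}
```

For example, backup_json . ~/tpa-backups would copy the JSON files from the current directory; both paths here are placeholders.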

Notes

1. 02alignmentsSpark.py cannot be run on a local machine.

2. If running locally, `01Download Models.py` needs to be run with ipython; alternatively, just download the needed models manually.

3. fastText_multilingual module is available at: https://github.com/babylonhealth/fastText_multilingual

Apply the patch given at #23 to fix the ModuleNotFound error when running the script.

4. 03ProduceAlignments.py requires https://github.com/facebookresearch/fastText/tree/master/python instead of the version provided by pip.

Also see