Toolserver:SGE for beginners

The toolserver uses SGE (previously Sun Grid Engine) to distribute (long-running) jobs (like bots) over the cluster. SGE is very flexible, but it can also be confusing from time to time. So this page gives just simple recipes; for more details see SGE.

SGE thinks in terms of jobs and resources. SGE is like a big company with different employees. You can call the firm and say "I need you to build me a house" (that's the job) and the company will check when its employees (those are the resources) have free time to build you a house. The only difference is that SGE is not as smart as a human manager, so you need to tell SGE what you need (how many bricklayers, carpenters, plumbers, etc.). But it is smart enough to execute your job only when all needed craftsmen are available (human managers are not so smart from time to time…). If not enough free craftsmen are around, the job is stored in a queue and waits there until all needed craftsmen are available again.

Please do not think "I just won't tell them that I need plumbers, so my job is executed faster", because
 * there may be no plumbers around when you need them (so you get a house without pipes)
 * it's unfair to other, honest people.

The beginning
You can only use SGE if your job can run without you (that's called non-interactive). If you already run your program from cron, it is non-interactive. If you are unsure, just start your program on the command-line. Does it ask any questions? Do you have to type something in? Then it is interactive and you cannot use SGE.

If your program is non-interactive (because you know or you tested), then the next test is: start it on the command-line and see if it runs. If not: fix it (maybe it is missing parameters, or required software or libraries are missing; fixing this is not the focus of this how-to, sorry).

If your program runs, then you need to create a so-called wrapper-script next. A wrapper-script (wrapper for short) does nothing other than execute another program or script. You will see later why we need it. Let's assume that your program is in the sub-directory "bot" in your home, it is called fixit.py and it needs the parameter "-login" to run. Let's also assume that your login-name is smith. So far your home contains the sub-directory bot, which contains fixit.py; the complete path to your fixit.py is /home/smith/bot/fixit.py.

Now create a wrapper-file in your bot-subdirectory; its name should end with ".sh". In this example it's called "start_fixit.sh", and for now it contains nothing but the command that starts fixit.py with its parameter. Not very spectacular yet. Don't forget to make it executable with chmod u+x "/home/smith/bot/start_fixit.sh"
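A minimal sketch of what start_fixit.sh could contain at this point. The interpreter line and the call through python are assumptions (this how-to only gives the path of fixit.py and its "-login" parameter), so adjust them to however you start the bot by hand.

```shell
#!/bin/bash
# start_fixit.sh: wrapper that does nothing but start the bot.
# Assumption: fixit.py is started through the python interpreter.
python /home/smith/bot/fixit.py -login
```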

You should now be able to start your bot by just calling /home/smith/bot/start_fixit.sh on the command-line. Until now SGE is not involved in any way. That will change in the next chapter.

The very basics
For every job you need to tell SGE about 3 resources: runtime, memory-usage and architecture.

The runtime is how long your job will run. This can be a few minutes (a bot that puts the flower-of-the-day on a main-page of a wiki), some hours (a bot that cleans up talk-pages), days (a bot that maintains a working-site on a wiki) or forever (like an IRC-bot). Normally your task should NOT run forever; let it stop from time to time (to free memory and give other people a chance to run jobs). A good runtime for a forever bot is 3 to 7 days; it is no problem to start the bot again shortly after.

To tell SGE about the runtime (and every other resource too) you have to set a parameter in the script. The parameter for the runtime is called h_rt and expects its value in the format hh:mm:ss. That looks more complicated than it is, see the following examples:
 * 00:03:00 → 3 minutes
 * 01:00:00 → 1 hour
 * 96:00:00 → 4 days
 * 44:01:00 → 1 day, 20 hours and 1 minute.

Every resource you need is inserted in the wrapper in the following way:
 #$ -l resource_name=value

Let's assume that your fix-it-bot normally runs 15 minutes. You should always add a safety buffer to every resource, so we will tell SGE that your job will run 20 minutes. So you add the following line to the beginning of your wrapper:
 #$ -l h_rt=00:20:00
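If you prefer not to do the hh:mm:ss arithmetic in your head, a small shell function can do it. This is only a sketch; minutes_to_h_rt is a hypothetical helper name and not part of SGE.

```shell
# Convert a runtime given in whole minutes into the hh:mm:ss
# format that the h_rt parameter expects.
# minutes_to_h_rt is a hypothetical helper, not an SGE command.
minutes_to_h_rt() {
    minutes=$1
    printf '%02d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
}

minutes_to_h_rt 20      # prints 00:20:00
minutes_to_h_rt 5760    # prints 96:00:00 (4 days)
```

The printed value is what goes after "h_rt=" in the wrapper.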

Now we have to think about the second important parameter: memory usage. Your program (whatever it is) will use memory during its run. SGE needs to know how much, so it will not start your job on a server where not enough memory is left. The value to give is the maximum amount of memory needed (called peak usage). Getting this value is a little bit tricky. The simplest way is to ask somebody who knows (because he/she runs a similar program), the dirtiest way is to guess, and the surest is to measure. One way to see the memory needed by a program is to start the program in one terminal, then open a second terminal (like another putty) to the same server and run ps -e -o user,comm,rss|grep smith (replace "smith" with your login-name). The command outputs the name and the memory-usage of all your running programs. The value you need is in the last column, in KB. No matter how you got the memory-usage, the SGE-parameter is called virtual_free and the value it needs is in MB (KB/1000). Let's assume that your fix-it-bot needs 250MB (plus 50MB to be safe); then you add the following line to the wrapper-script:
 #$ -l virtual_free=300M

The architecture is the easiest of the three. Most times your job can run on every architecture we have (that's Linux and Solaris at the moment). So you can set the SGE-parameter arch to '*' (please note the single quotes). Your script is no special case, so you add the following line to your wrapper-script:
 #$ -l arch='*'
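The step from the measured KB value to the virtual_free value can be sketched as a shell function. rss_kb_to_virtual_free is a hypothetical helper name, and the default buffer of 50 MB is simply the one used in the example above; pick whatever safety margin fits your program.

```shell
# Turn a peak memory reading in KB (last column of the ps output)
# into a value for virtual_free: round up to whole MB (KB/1000)
# and add a safety buffer in MB (default 50, an assumption).
# rss_kb_to_virtual_free is a hypothetical helper, not an SGE command.
rss_kb_to_virtual_free() {
    rss_kb=$1
    buffer_mb=${2:-50}
    echo "$(( (rss_kb + 999) / 1000 + buffer_mb ))M"
}

rss_kb_to_virtual_free 250000    # prints 300M
```

Keep in mind that one ps reading is only a snapshot; check it several times during a run to get close to the peak usage.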

Databases
If your program does not need database-access in any way, you can skip this chapter. This how-to handles only read-access to the replicated wmf-databases (sql-sX-rr). There are several other types of sql-resources, but all are handled in the same way. For details see SGE. SQL-resources are really simple: just tell SGE how many you need (most times: 1, maximum 4). Let's assume that your fix-it-bot needs read-access to the database of the English Wikipedia (enwp for short). enwp is located on cluster 1 (use the sql-table toolserver.wiki to get the cluster-number), so the resource is called sql-s1-rr. If you needed read-access to a database on cluster 2, the resource would be sql-s2-rr, and so on. Like most programs you only need 1 connection, so you add the following line to the wrapper-script:
 #$ -l sql-s1-rr=1

Notification and logging
We are done with the resources. But we have to tell SGE two last things: when it should notify you and what to do with the program-output. Both things are not resources, so no "-l" is used here; they have their own SGE-parameters. First notification: SGE can send you emails about many things (see SGE for details). The parameter is "-m". It needs a value that tells SGE when you would like to get an email. Most times you are interested in only 1 thing: "Was my job aborted?", and that's the value "a". In the testing phase you might also be interested in "When did my job start?" ("b") and "When did my job end?" ("e"). So long story short: set "-m abe" in the testing phase, and when everything works switch it to "-m a". Your job is in the testing phase, so the wrapper gets the line "#$ -m abe".

We are nearly done now. Your script will (most likely) produce output. You have to tell SGE what to do with this output. Because this is a beginner-howto we will just put the output in a file. The parameter here is "-o" and the needed value is a path+filename. We also need to set the "-j"-parameter to "y" (if you would like to learn why, see SGE). In this example we will put the output in /home/smith/bot/fixit.log, so the wrapper gets the lines "#$ -o /home/smith/bot/fixit.log" and "#$ -j y".

And now we are done with the wrapper-script.
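Putting all settings from the previous chapters together, the finished start_fixit.sh could look like the sketch below. The values are the example values of this how-to; the interpreter line and the call through python are assumptions, and the sql-s1-rr line is only needed if your program uses the database.

```shell
#!/bin/bash
# Resources: runtime 20 minutes, 300 MB memory, any architecture,
# one read connection to the replicated databases of cluster 1.
#$ -l h_rt=00:20:00
#$ -l virtual_free=300M
#$ -l arch='*'
#$ -l sql-s1-rr=1
# Notification: mail on abort, begin and end (testing phase).
#$ -m abe
# Output: write to a log file; -j y merges error output into it.
#$ -o /home/smith/bot/fixit.log
#$ -j y

# Assumption: fixit.py is started through the python interpreter.
python /home/smith/bot/fixit.py -login
```

When the testing phase is over, change "-m abe" to "-m a".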

Submitting
Until now we just did preparations. Now we will do the real thing: executing the job in SGE. You just have to think about one last thing: a name. I will explain later what the name is for; for now just make sure it is meaningful and not too long (<10 chars is great). For this how-to we will choose "fixIt" as the name.

To execute a job in SGE you have to submit it. Submitting is like the phone call in our house-example from the beginning. There are different methods to do this, but this how-to will only explain the most common one: submitting a job by cron. How cron works is not the focus of this how-to. The command-part of a cron-line for your SGE-job looks just like the following:
 qcronsub -N fixIt /home/smith/bot/start_fixit.sh

Easy, isn't it? Now I will tell you about the name. The name is good for 2 things:
 * for identification,
 * to prevent SGE from running your job more than 1 time in parallel.

The latter is an often overlooked advantage of SGE. Why is it so useful? Imagine your job is a bot that should run for 3 days. After 12h the bot crashes because of a software-bug, but you are in bed at that time. Normally your bot would be offline until you wake up and restart it, but with SGE you can simply submit your job every hour. If a job with that name is already running, SGE does nothing; if no job with that name is running, SGE will execute the job. So in our example the bot would be down for 1 hour at maximum. A word of warning: do NOT submit your job too often (like every minute); that's a waste of resources.

That's it, we are done. You should get an email as soon as your job is started (normally that will happen shortly after your cron submitted the job). There are many great things left to learn about SGE, but if you successfully mastered this how-to, you know the important basics of SGE. Congratulations.
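For readers who have not written a cron-line before, the hourly submission described in this chapter could look like the following crontab entry. The schedule (minute 0 of every hour) is an assumption chosen for illustration; any once-per-hour schedule works the same way.

```shell
# crontab entry (edit your crontab with: crontab -e)
# minute hour day-of-month month day-of-week  command
0 * * * * qcronsub -N fixIt /home/smith/bot/start_fixit.sh
```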