Toolserver:SGE for beginners

The toolserver uses SGE (formerly Sun Grid Engine) to distribute long-running jobs (like bots) over the cluster. SGE is very flexible, but it can also be confusing at times. This page therefore gives only simple recipes; for more details see SGE.

SGE thinks in terms of jobs and resources. Think of SGE as a big company with many employees. You call the company and say "I need you to build me a house" (that's the job), and the company checks when its employees (that's the resources) have free time and then builds your house. The only difference is that SGE is not as smart as a manager, so you need to tell SGE what you need (how many bricklayers, carpenters, plumbers, etc.). It is smart enough, though, to execute your job only when all the craftsmen you asked for are available (human managers do not always manage that…). If not enough craftsmen are free, the job is put in a queue and waits there until all of them are available again.

Please do not think "I just won't tell them that I need plumbers, so my job gets executed sooner", because
 * there may be no plumbers around when you need them (so you get a house without pipes)
 * it's unfair to other, honest people.

The beginning
You can only use SGE if your job can run without you (that's called non-interactive). If you already run your program from cron, it is non-interactive. If you are unsure, just start your program on the command line. Does it ask any questions? Do you have to type something in? Then it is interactive and you cannot use SGE.

If your program is non-interactive (because you know it or you tested it), the next test is: start it on the command line and see if it runs. If it does not, fix it (maybe parameters, required software or libraries are missing; fixing that is not the focus of this how-to, sorry).

If your program runs, you next need to create a so-called wrapper script. A wrapper script (wrapper for short) does nothing other than execute another program or script. You will see later why we need it.

Let's assume that your program is in the sub-directory "bot" in your home, that it is called fixit.py and that it needs the parameter "-login" to run. Let's also assume that your login name is smith. So far your home directory /home/smith contains the sub-directory bot, which in turn contains fixit.py. The complete path to your fixit.py is /home/smith/bot/fixit.py.

Now create a wrapper file in your bot sub-directory; its name should end with ".sh". In this example it is called "start_fixit.sh", and all it contains is the call to your program with its parameter: /home/smith/bot/fixit.py -login. Not very spectacular yet. Don't forget to make it executable with chmod u+x "/home/smith/bot/start_fixit.sh"

You should now be able to start your bot by just calling /home/smith/bot/start_fixit.sh on the command line. So far SGE is not involved in any way. That will change in the next chapter.
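The steps of this chapter can be tried out safely in a scratch directory. The following is only a sketch: the directory created by mktemp stands in for /home/smith, and the file named fixit.py here is a hypothetical stand-in that merely prints its arguments, not a real bot. On the toolserver the wrapper would instead contain the real path, /home/smith/bot/fixit.py -login.

```shell
# Scratch directory standing in for /home/smith, so nothing real is touched.
demo_home=$(mktemp -d)
mkdir -p "$demo_home/bot"

# Hypothetical stand-in for fixit.py: it only reports how it was called.
cat > "$demo_home/bot/fixit.py" <<'EOF'
#!/bin/sh
echo "fixit.py started with: $*"
EOF

# The wrapper itself: it does nothing except start the program.
cat > "$demo_home/bot/start_fixit.sh" <<EOF
#!/bin/sh
"$demo_home/bot/fixit.py" -login
EOF

# Both files must be executable.
chmod u+x "$demo_home/bot/fixit.py" "$demo_home/bot/start_fixit.sh"

# Start the bot through the wrapper, just as you would by hand.
wrapper_output=$("$demo_home/bot/start_fixit.sh")
echo "$wrapper_output"
```

If the wrapper prints the line from the stand-in program, the wrapper works and the parameter "-login" was passed on.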

The very basics
For every job you need to tell SGE about two resources: runtime and memory usage.

The runtime is how long your job will run. This can be a few minutes (a bot that puts the flower of the day on the main page of a wiki), some hours (a bot that cleans up talk pages), days (a bot that maintains a work page on a wiki) or forever (like an IRC bot). Normally your task should NOT run forever; let it stop from time to time (to free memory and to give other people a chance to run jobs). A good runtime for a "forever" bot is 3 to 7 days; it is no problem to start the bot again shortly afterwards.

To tell SGE about the runtime (and every other resource too) you set a parameter in the wrapper script. The parameter for the runtime is called h_rt and expects its value in the format [dd:]hh:mm:ss. That looks more complicated than it is; see the following examples:
 * 00:03:00 → 3 minutes
 * 01:00:00 → 1 hour
 * 04:00:00:00 → 4 days
 * 01:20:01:00 → 1 day, 20 hours and 1 minute

Every resource you need is inserted into the wrapper as a line of the form #$ -l resource_name=value

Let's assume that your fix-it bot normally runs for 15 minutes. You should always add a safety buffer to every resource, so we will tell SGE that your job will run for 20 minutes. To do so, add the line #$ -l h_rt=00:20:00 to the beginning of your wrapper. The wrapper now contains this line, followed by the call to your program.
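To double-check a runtime value before putting it into the wrapper, the [dd:]hh:mm:ss notation described above can be converted to seconds. This is a rough sketch: the helper name h_rt_to_seconds is made up for illustration, and it only understands the notation as described on this page, which may not cover every format SGE itself accepts.

```shell
# Hypothetical helper (not part of SGE): convert a runtime written as
# [dd:]hh:mm:ss into seconds. Returns a non-zero status for other input.
h_rt_to_seconds() {
    printf '%s\n' "$1" | awk -F: '
        # hh:mm:ss
        NF == 3 { print ($1 * 60 + $2) * 60 + $3; next }
        # dd:hh:mm:ss
        NF == 4 { print (($1 * 24 + $2) * 60 + $3) * 60 + $4; next }
        # anything else is rejected
        { exit 1 }'
}

h_rt_to_seconds 00:20:00      # the 20 minutes used for the fix-it bot
h_rt_to_seconds 04:00:00:00   # 4 days
```

For the fix-it bot this gives 1200 seconds, which is the 15 minutes of normal runtime plus the 5 minutes of safety buffer.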

Now we have to think about the second important parameter: memory usage. Your program (whatever it is) will use memory while it runs. SGE needs to know how much, so that it does not start your job on a server where not enough memory is left. The value to give is the maximum amount of memory needed (called peak usage).

Getting this value is a little bit tricky. The simplest way is to ask somebody who knows (because he or she runs a similar program), the dirtiest way is to guess and the surest way is to measure. One way to see the memory needed by a program is to start the program in one terminal, then open a second terminal (like another PuTTY session) to the same server and run ps -e -o user,comm,rss|grep smith (replace "smith" with your login name). The command outputs the name and the memory usage of all your running programs. The value you need is in the last column, in KB.

No matter how you got the memory usage, the SGE parameter is called virtual_free and the value it needs is in MB (KB/1000). Let's assume that your fix-it bot needs 250 MB (plus 50 MB to be safe). You would then add the line #$ -l virtual_free=300M to the wrapper script, right next to the runtime line.
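The arithmetic above can be wrapped in a small helper. This is only a sketch: the function name virtual_free_from_kb and the default buffer of 50 MB are assumptions made for illustration, and it uses the KB/1000 rule of thumb from this page rather than an exact conversion.

```shell
# Hypothetical helper (not part of SGE): turn a peak memory reading in KB,
# as shown in the rss column of ps, into a value for virtual_free.
#   $1 = peak usage in KB
#   $2 = safety buffer in MB (assumed default: 50)
virtual_free_from_kb() {
    # Round up to whole MB using the KB/1000 rule of thumb, then add the buffer.
    echo "$(( ($1 + 999) / 1000 + ${2:-50} ))M"
}

# The fix-it bot from the example: 250 MB measured, 50 MB buffer.
virtual_free_from_kb 250000
```

For the example values this prints 300M, the value used in the wrapper line above.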

Databases
If your program does not need database access in any way, you can skip this chapter. This how-to covers only read access to the replicated WMF databases (sql-sX-rr). There are several other types of SQL resources, but all of them are handled in the same way. For details see SGE.

SQL resources are really simple: just tell SGE how many connections you need (most of the time 1, at most 4). Let's assume that your fix-it bot needs read access to the database of the English Wikipedia (enwp for short). enwp is located on cluster 1 (use the SQL table toolserver.wiki to look up the cluster number), so the resource is called sql-s1-rr. If you needed read access to a database on cluster 2, the resource would be sql-s2-rr, and so on. Like most programs you only need 1 connection, so you add the line #$ -l sql-s1-rr=1 to the wrapper script, next to the other resource lines.
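Putting the resource requests for runtime, memory and database together, the wrapper for the fix-it bot might look like the file written below. This is a sketch under assumptions: the "#!/bin/bash" first line and the order of the lines are guesses, and the file is written to a temporary location only so that its resource lines can be listed without touching a real home directory.

```shell
# Write an example wrapper to a scratch file (assumed layout, see above).
wrapper=$(mktemp)
cat > "$wrapper" <<'EOF'
#!/bin/bash
#$ -l h_rt=00:20:00
#$ -l virtual_free=300M
#$ -l sql-s1-rr=1
/home/smith/bot/fixit.py -login
EOF

# Lines starting with "#$" are meant for SGE; to the shell they are
# ordinary comments, so the wrapper can still be started by hand.
directives=$(grep '^#\$' "$wrapper")
echo "$directives"
```

Listing the "#$" lines like this is a quick way to check which resources a wrapper asks for before handing it to SGE.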

Notification and logging
We are done with the resources. But we have to tell SGE two last things: when it should notify you and what to do with the program output. Because neither of these is a resource, they are not set with "-l" but with SGE parameters of their own. First, notification: SGE can send you e-mails about many things (see here for details). Most people care about just 2 things: when did my job start, and was it aborted?