Wikimedia Labs/Tool Labs/Needed Toolserver features

= Toolserver features needed in Tool Labs =

Database

 * MySQL access to the WMF project databases (at minimum the same table/row/field scope as on the Toolserver).
 * A separate MySQL server (cluster) for user and project databases.
 * One MySQL server per cluster (s1-s7) where user and project databases can be created.
 * At least one MySQL server per cluster (s1-s7) running a live copy of Commons in parallel.
 * At least one MySQL server per cluster (s1-s7) running a live copy of Wikidata in parallel.
 * Support for short- and long-running queries (maximum run time depending on load, up to a few days).
 * CatScan and CatScan 2.0 are popular GUIs to the current databases. I used them extensively on Commons. Unfortunately both suffer from frequent timeouts and occasional "too many users using this tool" errors. Some standard maintenance queries (saved for easy access) that worked several years ago stopped working as number of images on Commons grew. --Jarekt (talk) 18:56, 7 March 2013 (UTC)
 * Mail notification system for killed queries.
 * If the labs system is not as underpowered as toolserver, in terms of the database servers, a query killer may not be necessary. CBM (talk) 21:54, 20 December 2012 (UTC)
 * DNS aliases for clusters (sql-s1.toolserver.org) and wikis (enwiki-p.rrdb.toolserver.org).
 * A replication lag (replag) reporting system.
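
As a rough illustration of the DNS alias scheme requested above, the two example hostnames follow a simple pattern. The sketch below is only an assumption inferred from those two examples (sql-s1.toolserver.org and enwiki-p.rrdb.toolserver.org); the function names, the mapping from an underscore in a database name to a hyphen in the hostname, and the default domain are hypothetical and do not describe how Labs actually names its hosts.

```python
# Hypothetical helpers; the naming pattern is inferred from the two
# example aliases in the list above, not from any official specification.

CLUSTERS = range(1, 8)  # s1 .. s7, as listed above


def cluster_alias(cluster: int, domain: str = "toolserver.org") -> str:
    """Return the per-cluster alias, e.g. sql-s1.toolserver.org."""
    if cluster not in CLUSTERS:
        raise ValueError(f"unknown cluster: s{cluster}")
    return f"sql-s{cluster}.{domain}"


def wiki_alias(dbname: str, domain: str = "toolserver.org") -> str:
    """Return the per-wiki alias, e.g. enwiki-p.rrdb.toolserver.org.

    Assumes (hypothetically) that the database name uses an underscore
    where the hostname uses a hyphen, since hostnames cannot contain "_".
    """
    return f"{dbname.replace('_', '-')}.rrdb.{domain}"


if __name__ == "__main__":
    print(cluster_alias(1))
    print(wiki_alias("enwiki_p"))
```

With aliases like these, a tool only needs to know the wiki's database name or its cluster number, and the hosts behind the alias can be moved without changing tool code.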

Filesystem

 * Home directories. ✅
 * Project directories for working together. ✅
 * A common directory for Wikimedia dumps, updated automatically. ✅
 * A common directory for page stats, updated automatically.
 * Quota limits on filesystem usage / method for users to check their quota usage

Languages

 * i18n for all (available) languages of this world.
 * Can you clarify this? i18n for what specifically?--Ryan lane (talk) 21:06, 20 December 2012 (UTC)
 * Maybe this is a reference to the [//wiki.toolserver.org/view/Toolserver_Intuition Toolserver Intuition] framework? – Minh Nguyễn (talk, contribs) 06:38, 21 December 2012 (UTC)
 * Support for Perl, C, Python, C++, PHP, Mono, Java, Tcl, and bash, ksh and zsh scripts (for (Fast)CGI and CLI). ✅
 * Anything that comes with Ubuntu, or someone is willing to package (that is open source) is available.--Ryan lane (talk) 21:06, 20 December 2012 (UTC)
 * Tons of libraries for each programming-language.
 * Same as previous.--Ryan lane (talk) 21:06, 20 December 2012 (UTC)

Web

 * Bug-reporting site (an issue and bug tracker like JIRA or Bugzilla) with the possibility to create (sub-)projects that can be administered by a user or project.
 * preferably with conversion/migration from TS JIRA to the new one on labs
 * SVN/Git repositories ✅ (people can already request Git repositories in Gerrit for labs stuff)
 * Why SVN? -- Krenair (talk • contribs) 21:05, 20 December 2012 (UTC)
 * Agreed. Let's switch everything to git and be done with svn once and for all.--Ryan lane (talk) 21:07, 20 December 2012 (UTC)
 * To clarify, I'm fine with people hosting their code in SVN. What I'm saying is that WMF won't provide a SVN hosting solution. If volunteers want to create an svn project in Labs, and run svn.wmflabs.org, that's fine.--Ryan lane (talk) 21:52, 20 December 2012 (UTC)
 * Access to (anonymized) web- and web-error-logs.
 * Server statistics, workload, status and the like
 * E.g. Munin or something similar; compare TS Munin
 * General server (service) status page on an external, independent system with its own password system (human- and machine-readable, e.g. XML, JSON, CSV); compare TS Status
 * Maybe web server page view/visit statistics too
 * https://graphite.wikimedia.org/ (requires labs account)
 * https://gdash.wikimedia.org/ (open to the world)
 * (more at http://old.nabble.com/What-is-slow-td34077918.html)
 * Server/project for web tools and pages. ✅

Job-system

 * Job system for executing non-web tasks, e.g. a stable cron.
 * crontab or similar without the "(CRON) CAN'T FORK (child_process): Not enough space." issue; compare Cron#Timing and work load
 * maybe add a queue system like SGE
 * Various schedulable resources such as run time, required databases, free memory and user slots.
 * Automatic switch-over and relocation of jobs in case of a server failure.
 * Possibility to let a non-root user maintain the job system (adding resources, killing jobs, etc.).
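
To make the resource idea above more concrete, here is a minimal sketch of how a tool might describe a job's run time and memory needs when submitting it to a Grid Engine style queue. It only builds the command line and does not run anything. The helper name `build_qsub_command` and its defaults are hypothetical; `-N`, `-l` and `-b y` are standard `qsub` options, and `h_rt` and `virtual_free` are commonly used Grid Engine resource names, but which resources a given installation actually defines is site-specific and assumed here.

```python
import shlex
from typing import List, Sequence


def build_qsub_command(
    name: str,
    command: Sequence[str],
    runtime: str = "0:30:00",   # assumed default: 30 minutes of run time
    memory: str = "256M",       # assumed default memory request
) -> List[str]:
    """Build (but do not execute) a qsub command line for one job."""
    argv = [
        "qsub",
        "-N", name,                       # job name shown in the queue
        "-l", f"h_rt={runtime}",          # requested run-time limit
        "-l", f"virtual_free={memory}",   # requested free memory
        "-b", "y",                        # treat the command as a binary, not a job script
    ]
    return argv + list(command)


if __name__ == "__main__":
    argv = build_qsub_command("update-stats", ["python", "update_stats.py"])
    # Print a shell-quoted version for inspection instead of running it.
    print(shlex.join(argv))
```

Declaring resources up front is what lets the scheduler defer or relocate a job instead of failing when a host is short on memory.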

OSM

 * A PostgreSQL server for OSM.
 * A render server with enough space to host several layouts.


 * OSM is being added to WMF production. We'll have a test/dev version of OSM in Labs, but we have no plans on having a quasi-production like version of OSM in Labs.--Ryan lane (talk) 21:08, 20 December 2012 (UTC)
 * Without a live OSM-planet-db the labs fails the criteria to replace toolserver. --Kolossos 15:20, 23 December 2012 (UTC)


 * We have special projects on toolserver like WIWOSM that combine Wikipedia- and OSM-data in one script. This is only possible if a script can connect to both databases. Without such a feature we will lose this service in Wikipedia. --Kolossos 18:02, 1 February 2013 (UTC)


 * Other users created tools for translating places or to maintain street-lists. --Kolossos 19:12, 14 February 2013 (UTC)

Backup

 * Daily backups of user and project directories, retained for at least 1 week.
 * including crontab and all other necessary configuration files
 * Daily backups of user and project databases, retained for at least 1 week.
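
The retention rule above (daily backups kept for at least one week) can be sketched as a small pruning function. This is only an illustration under stated assumptions: the function name `split_backups`, the `keep_days` parameter and the idea of identifying each backup by its calendar date are all hypothetical, and nothing here reflects an actual backup system on the Toolserver or in Labs.

```python
from datetime import date, timedelta
from typing import Iterable, List, Tuple


def split_backups(
    backup_dates: Iterable[date],
    today: date,
    keep_days: int = 7,  # "minimum 1 week" from the list above
) -> Tuple[List[date], List[date]]:
    """Split daily backups into those to keep and those that may expire.

    A backup is kept if it was taken within the last `keep_days` days,
    counting today's backup as the first of them.
    """
    dates = list(backup_dates)
    cutoff = today - timedelta(days=keep_days)
    keep = sorted(d for d in dates if d > cutoff)
    expire = sorted(d for d in dates if d <= cutoff)
    return keep, expire


if __name__ == "__main__":
    today = date(2013, 1, 10)
    backups = [date(2013, 1, day) for day in range(1, 11)]
    keep, expire = split_backups(backups, today)
    print("keep:", keep)
    print("expire:", expire)
```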

Various

 * Mail address for users.
 * Mail address for projects.
 * Configurable mail forwarding for both.
 * Nagios-system. ✅ (see: http://nagios.wmflabs.org/nagios3/)
 * We have a ganglia system too; see http://ganglia.wikimedia.org/latest/ --Ryan lane (talk) 21:15, 20 December 2012 (UTC)
 * Nothing to see, password protected. – Giftpflanze (talk) 15:50, 23 December 2012 (UTC)
 * It's temporarily private - see wikitech-l/2012-December/065184.html. -- Krenair (talk • contribs) 16:11, 23 December 2012 (UTC)
 * Free choice of license by users for their tools.
 * This is of course the case, as long as the license is an OSI approved license.--Ryan lane (talk) 21:09, 20 December 2012 (UTC)

End-user-support

 * One or more roots with an understanding of user problems.
 * We have 3 on-staff and many volunteers.--Ryan lane (talk) 21:10, 20 December 2012 (UTC)
 * Having a root who is *understanding* is particularly important. Many users are not experts in what they are doing, and need friendly help in solving problems or cleaning up messes they have created
 * Documentation for beginners and non-technical users.

Render
Things available to Render on the Toolserver which are needed for a transition to WikiLabs/ToolLabs
 * Database access like currently on the Toolserver
 * Possibility to build and run C and C++ apps
 * Possibility to build external C and C++ libraries (app dependencies) from source and install in user directory, without packaging
 * 64-Bit OS for sufficient address space
 * ~28GB physical RAM (comparable to ortelius)
 * CPU with at least 4 cores with at least 3 GHz, which can actually be used in parallel (comparable to ortelius)
 * We need to keep large amounts of data available in RAM. Regenerating it is expensive so we need a server with good uptimes. A VM or server which reboots or needs to be re-setup every month or so is not acceptable.
 * Ideally, we need to use a physical machine like we currently can. It is not yet clear whether a VM will do.

= Toolserver features wanted in Tool Labs =

Below is a list of features of toolserver which would be cool to have available on the projects that are supposed to replace toolserver:

Labs wide (not only bots / tools), but available for all projects
 * Access to production db (read-only / replicated)
 * joining of user databases with wiki databases
 * will the commons database be replicated to all clusters, like the toolserver?
 * allow queries like the one this researcher was desperate for
 * Unlikely, that logic should be handled within the application. It's impossible to shared data otherwise, as well as the extra overhead on the database servers which are effectively a shared component. DamianZaremba (talk) 11:13, 27 September 2012 (UTC)
 * Explanation by Carl (CBM): «[[mailarchive:toolserver-l/2012-September/005382.html|plans for WMF Labs don't seem to include database replication in a form that makes it a useful, direct replacement for toolserver]]».
 * Home directories (done?)
 * What is meant by this? Labs is meant to be a collaborative, community maintained environment. I specifically want to avoid the Toolserver way of individualizing everything. If a user leaves, their bot, or tool, should very easily be transferable to another user. We tend to opt for using project storage (/data/project) rather than home directories. It's also preferable to run things as service users rather than individualized users, though that isn't a requirement.--Ryan lane (talk) 22:17, 25 September 2012 (UTC)
 * I should also mention that we already have per-project home directories, where the directories are accessible to every instance in a project. It's possible to do things the toolserver way, but it's a very limiting way of doing things.--Ryan lane (talk) 22:17, 25 September 2012 (UTC)
 * Let's cross this, it isn't really relevant in labs. Sure every ssh user has a home directory, that's standard Linux. But for storing applications and databases, this must be stored elsewhere (on the mounted project storage, not in a home directory, not on the instance itself, the instance itself needs to be recyclable from puppet). Krinkle (talk) 02:42, 26 September 2012 (UTC)
 * Looks like a bad argument. You wrote in mailing list that things can be abstracted etc., so let's not strike things before the users actually get what they expect and need. I guess this point means "provide a way to make stuff available which is as easy as placing a file (executable or not) in public_html"? --Nemo 05:15, 26 September 2012 (UTC)
 * Right, I meant to mark done, not strike. This is done. Krinkle (talk) 06:46, 26 September 2012 (UTC)
 * Technically home directories are done. They've been available since Labs launched. We discourage their use fairly heavily, though. Ideally, the only thing that should go into a user's home directory is their environment settings. Home directories are personal, and therefore they explicitly stop collaboration; people are generally unwilling to go into another user's home directory, even if the user retires or disappears. Rather than using home directories, users should use project storage, which is shared to all instances in a project, just like home directories. We encourage project storage to be used in a fairly open way (not per-user, but per-bot or per-tool, or per-subproject).--Ryan lane (talk) 06:09, 26 September 2012 (UTC)
 * Central per-user directory mounted on all instances within a project ✅
 * Again. Let's try to avoid per-user things. We have shared storage at /data/project that is accessible by everyone in a project. It's possible to lock this down by file permissions, even to per-user, but it's way better to make things owned by a service user (and control access to that user), or to have things owned by the project group.--Ryan lane (talk) 22:18, 25 September 2012 (UTC)
 * Can someone explain the difference between this and home directories?--Ryan lane (talk) 22:18, 25 September 2012 (UTC)
 * If the previous point means what above, this is probably about being able to access user data? Is there a way on Labs to easily share data across all projects? --Nemo 05:15, 26 September 2012 (UTC)
 * As mentioned above, yes, there is per-project storage that is shared to all instances in a project. It's accessible at /data/project.--Ryan lane (talk) 06:09, 26 September 2012 (UTC)
 * Backup of home directories and user databases
 * Backups of databases should likely be handled by users, and saved in project storage--Ryan lane (talk) 06:11, 26 September 2012 (UTC)
 * It's possible that we'll do automatic backups of the user databases as well, assuming they are provided by a database service.--Ryan lane (talk) 01:29, 2 October 2012 (UTC)
 * Create generic project for web tools (Bug 40510) ✅
 * Create generic project for periodic/long-running bots ✅
 * It is not clear yet whether people should share instances or create their own. As they are unlikely to interfere with each other, the overhead of N linuxes may not be worth it. Instead it may be more useful to have 1 big instance, or a grid of instances but control distribution with SGE instead of manually.
 * 'production' tool labs should be locked down access wise and run with some form of scheduler, development should be shared instances for collaboration. DamianZaremba (talk) 11:13, 27 September 2012 (UTC)
 * Agreed. We could use SGE for this, but it really seems like overkill...--Ryan lane (talk) 01:29, 2 October 2012 (UTC)
 * Is there a "standard" way of job scheduling in Ubuntu shops? --Tim Landscheidt 12:27, 8 October 2012 (UTC)
 * Local and auto-updated copies of:
 * Wikimedia XML Dumps ✅
 * This is accessible on every instance at /public/datasets
 * visits per page (pagecounts) and project (projectcounts)
 * same visits stats, in database format to allow direct querying
 * Simple setup to allow HTTP access to projects/instances (reverse proxy, port forwarding, public ip)
 * A http proxy is on the todo list for labs. Some projects have shared web instances currently. DamianZaremba (talk) 11:13, 27 September 2012 (UTC)
 * Misc. Toolserver features:
 * Support SGE to automatically defer starting of expensive processes based on current capacity and usage (qcronsub, qsub) https://wiki.toolserver.org/view/Job_scheduling#arguments_to_qsub/qcronsub
 * If the presented alternative has the same user interface, it shouldn't be a problem. For instance, people don't have an opinion about which of the SGE forks would be preferable.
 * WikiMiniAtlas depends on:
 * the OSM database mirror being available
 * Once we have a copy in production, we could arrange replication from there. Max Semenik (talk) 12:53, 4 February 2013 (UTC)
 * the WIWOSM project (although dschwen could proxy that from the TS)
 * Proxying is not allowed per our terms of use. DamianZaremba (talk) 11:13, 27 September 2012 (UTC)
 * If we're looking to replace TS, then we need to make this available.--Ryan lane (talk) 01:29, 2 October 2012 (UTC)
 * Dispenser's coordinate extraction database (GHEL)
 * This should be obsoleted with GeoData. Max Semenik (talk) 12:53, 4 February 2013 (UTC)
 * Replicate or transfer MMP (multimaintainer projects) from Toolserver
 * Ryan says: "They can be created in LDAP by making a labsconsole account. Additionally, unless the account needs to directly log in via ssh, there's no request process needed for the user to be used. Alternatively, the user could be created as a system account via puppet."
 * What's that supposed to mean? The migration of (multi-maintainer) projects is the goal of this list, not one of its means. --Tim Landscheidt 12:27, 8 October 2012 (UTC)
 * servlets
 * missing support blockers for migration
 * no support for new users not familiar with unix based systems
 * Can someone explain what this means, or how TS is currently doing this?--Ryan lane (talk) 01:29, 2 October 2012 (UTC)
 * no transparent updating of packages with security problems/bug ✅
 * All instances by default are configured for unattended-upgrades. A general overview/management system has been roughly discussed and would be nice to implement though. DamianZaremba (talk) 11:13, 27 September 2012 (UTC)
 * So are you sure that everything requested under this point is done? --Nemo 11:25, 27 September 2012 (UTC)
 * permanent blockers for migration
 * license problems ("i wrote code at work for my company and reuse parts for my bot framework. I have not the right to declare this code as open source which is needed by labs policy") ❌
 * This isn't going to change, we should be open and collaborative. Restricting access to code goes against that. DamianZaremba (talk) 11:13, 27 September 2012 (UTC)
 * +1. We will only ever allow open source and open content. I honestly wonder how you are legally using proprietary software on TS.--Ryan lane (talk) 01:29, 2 October 2012 (UTC)
 * You cannot simplify that much. E.g. I am operating a bot using a lot of CV (computer vision) algorithms. Some of them WE (wikipedia/wikimedia community) are just ALLOWED to use the code because I kindly asked the owners. But you cannot assume they will agree to change their license just because of TS to labs migration! So this could indeed become a blocker and the "irony" is it affects mainly the newer algorithms (the interesting ones since they are state of the art - which is a pity). To summarize; it is a good thing to force open source but we should not be too strict here - as a few people mentioned this was a strength of the TS as I see it. Greetings --DrTrigon (talk) 11:21, 1 January 2013 (UTC)
 * You're allowed to use the software for now. Is there a license agreement? What happens if you no longer want to support the bot and someone else needs to maintain it? Do they have permission? What if someone copied the software and publishes it? Who's liable? We do have a guideline for exceptions when using closed source software, but it would require proper paperwork (like license agreements from the owner, etc). There's a very good reason we have this rule, and it's insane that TS doesn't.--Ryan lane (talk) 22:49, 4 January 2013 (UTC)
 * no DaB.
 * DaB isn't a feature but a community member, he can't be technically implemented (though his help would be appreciated!). DamianZaremba (talk) 11:13, 27 September 2012 (UTC)
 * I don't see anything about this page being about technical implementable stuff only. --Nemo 11:25, 27 September 2012 (UTC)
 * "never change a running system"
 * This is a joke/troll, right?--Ryan lane (talk) 22:41, 4 January 2013 (UTC)
 * SFTP access to instances ✅
 * This is currently available if you have SSH access. DamianZaremba (talk) 11:13, 27 September 2012 (UTC)
 * Email addresses/forwarding
 * Localisation for tools on translatewiki.net
 * Per-project optional svn and/or git repo
 * For "Tool Labs" that is, since in "WikiDev Labs" this is mandatory workflow
 * Should versioning really be optional? Even for tools, is it ever a good idea not to have a repo? I'd rather improve Git/Gerrit usability and integration. Eloquence (talk) 21:07, 25 September 2012 (UTC)
 * I think it would be hard to enforce the use of source control, unless someone was policing things, or if we required it to deploy tools (maybe via git-deploy?). Using the deployment system for this actually may be the easiest way to enforce this.--Ryan lane (talk) 22:25, 25 September 2012 (UTC)
 * Do we want to have a bare git server for wmflabs (like there is svn.toolserver.org), or do it in Gerrit? Or maybe allow any git url so that users can store it where they like (be it gerrit.wikimedia.org, github.com etc.). Krinkle (talk) 02:42, 26 September 2012 (UTC)
 * No, not another server that someone has to maintain. Gerrit or GitHub are fine. --Tim Landscheidt 12:27, 8 October 2012 (UTC)
 * Mysql query killer (especially for queries to the replicated wmf wiki databases)
 * On the roadmap - relatively easy change.--Ryan lane (talk) 01:29, 2 October 2012 (UTC)
 * Need a clear story for access to bot credentials - should ownership of bot accounts be shared between all users of the server, or should each bot account be owned by an individual user? If the latter, how will projects get access to the credentials? Note that there are some bad practices going on that might affect this: some tools contain hardcoded bot credentials, and some people share passwords between their user accounts and bot accounts.
 * Issue and bug tracker like JIRA (or bugzilla) ✅
 * preferably with conversion/migration from TS JIRA to the new one on labs
 * We have a Wikimedia Labs product in bugzilla--Ryan lane (talk) 22:53, 4 January 2013 (UTC)
 * Can this bugzilla instance also host user projects or "products"? E.g. like "DrTrigon's tools"? What about migration from JIRA? --DrTrigon (talk) 19:36, 24 January 2013 (UTC)
 * Stable cron
 * crontab or similar without the "(CRON) CAN'T FORK (child_process): Not enough space." issue; compare Cron#Timing and work load
 * maybe add a queue system like SGE
 * Server statistics, workload, status and the like
 * e.g. Munin or something similar; compare TS Munin
 * general server (service) status; compare TS Status
 * maybe web server page view/visit statistics too
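
One reply in the list above suggests that joins between user databases and wiki databases should be handled within the application rather than on the database server. A minimal sketch of what that could look like follows. It uses two separate in-memory SQLite databases as stand-ins for a replicated wiki database and a user database, so it is runnable without any server; a real tool would use MySQL connections instead. The `page` table with `page_id` and `page_title` mirrors the MediaWiki schema, while `tool_notes`, `make_demo_databases` and `application_side_join` are hypothetical names invented for this example.

```python
import sqlite3
from typing import List, Tuple


def make_demo_databases() -> Tuple[sqlite3.Connection, sqlite3.Connection]:
    """Create two unrelated in-memory databases for demonstration only."""
    wiki_db = sqlite3.connect(":memory:")   # stand-in for a wiki replica
    wiki_db.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
    wiki_db.executemany(
        "INSERT INTO page VALUES (?, ?)", [(1, "Foo"), (2, "Bar"), (3, "Baz")]
    )
    user_db = sqlite3.connect(":memory:")   # stand-in for a user database
    user_db.execute("CREATE TABLE tool_notes (page_id INTEGER PRIMARY KEY, note TEXT)")
    user_db.executemany(
        "INSERT INTO tool_notes VALUES (?, ?)", [(1, "needs review"), (3, "checked")]
    )
    return wiki_db, user_db


def application_side_join(
    wiki_db: sqlite3.Connection, user_db: sqlite3.Connection
) -> List[Tuple[str, str]]:
    """Combine rows from both databases without a cross-database SQL join."""
    # Step 1: read the tool's own data from the user database.
    notes = dict(user_db.execute("SELECT page_id, note FROM tool_notes"))
    if not notes:
        return []
    # Step 2: fetch only the matching pages from the wiki database.
    placeholders = ",".join("?" for _ in notes)
    rows = wiki_db.execute(
        f"SELECT page_id, page_title FROM page "
        f"WHERE page_id IN ({placeholders}) ORDER BY page_id",
        list(notes),
    )
    # Step 3: merge the two result sets in application code.
    return [(title, notes[page_id]) for page_id, title in rows]


if __name__ == "__main__":
    wiki_db, user_db = make_demo_databases()
    for title, note in application_side_join(wiki_db, user_db):
        print(title, "-", note)
```

The trade-off discussed in the thread still applies: this works for modest ID lists, but large lookups would have to be split into batches, which is part of why users asked for server-side joins in the first place.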

Bots project

 * Public webserver for logs ✅
 * CGI (perl, python, ...) for helper tools ✅
 * Some packages to install
 * Support for mono (c#), python, svn, git ✅
 * PyWikipediaBot requires Python v2.7.2 or greater but less than v3, whereas Python is currently v2.6.5 on bots-1
 * Is this resolved by using Ubuntu precise? If so, bots that need the newer version can be on Ubuntu precise instances.--Ryan lane (talk) 01:29, 2 October 2012 (UTC)
 * resolved by new distro -- use bots-nr1 and later for these bots Petrb (talk) 07:36, 18 October 2012 (UTC)
 * PyWikipediaBot and all prerequisites for Perl MediaWiki::Bot and ::API should be installed system-wide on Bots instances
 * Yes. We should puppetize this, so that we can track dependencies properly.--Ryan lane (talk) 01:29, 2 October 2012 (UTC)
 * Text editors: vim (probably some others like nano etc.) ✅
 * Shells - bash, ksh, csh, tclsh ✅
 * Libraries - libtcl ✅ (needs puppetize)
 * PHP, Java ✅ (needs puppetize)
 * Perl (needs full cpan upgrade and MediaWiki:: packages) ✅ (needs puppetize)
 * Why does it need cpan at all? I understand that on Toolserver with its limited admin capacity it's hard to wait for proper installation, but I assume that in Labs there will probably always be a root around who can build a proper package on the fly. --Tim Landscheidt 12:27, 8 October 2012 (UTC)
 * Flup (Python WSGI)
 * imagemagick
 * All the packages requested by TS users and now taken as a given; see e.g. tswiki:User:Dab/Debian-Packages and install requests on JIRA – someone make a list
 * Per-project optional custom MySQL databases ✅ (bots-sql1, -sql2 ...) Petrb (talk) 07:36, 18 October 2012 (UTC)
 * This is on the roadmap. Will likely come some time after replicated databases.
 * Basically similar to the mounted project-wide storage. Not on any individual instance, accessible from within each project instance.
 * Toolserver also has the principle of public databases that can be read from other projects. This is probably something we'd want too, so that projects can build on top of each other.
 * The current concept behind this in Labs is that all databases will be accessible from all instances. Creation/modification/grants/etc. will be handled by sysadmins in the project that owns the database.--Ryan lane (talk) 06:09, 26 September 2012 (UTC)
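
The PyWikipediaBot version constraint mentioned in the package list above (Python v2.7.2 or greater, but less than v3) is easy to check before starting a bot on a given instance. The following is a minimal sketch; the function name `supports_pywikipediabot` and the constant names are hypothetical, and the bounds are taken only from the statement in this list, not from PyWikipediaBot's own documentation.

```python
import sys
from typing import Optional, Tuple

# Bounds as stated in the list above: >= 2.7.2 and < 3.
MIN_VERSION = (2, 7, 2)
MAX_VERSION_EXCLUSIVE = (3, 0, 0)


def supports_pywikipediabot(version: Optional[Tuple[int, int, int]] = None) -> bool:
    """Return True if the given (or the running) Python version is in range."""
    if version is None:
        # Use the interpreter that is running this script.
        version = tuple(sys.version_info[:3])
    return MIN_VERSION <= tuple(version) < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    # v2.6.5 is the version reported for bots-1 in the list above.
    for candidate in [(2, 6, 5), (2, 7, 2), (3, 0, 0)]:
        print(candidate, supports_pywikipediabot(candidate))
```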

= Links =
 * Toolserver/List_of_Tools