Talk:Wikimedia technical search

Original comments from Nike's blog

 * Copied here for future reference; originally from laxstrom.name/blag

@Nemo, you should place the list of URLs for the custom search engine somewhere public, preferably in source control (github?) so people could contribute additions / tweaks.

Specifically, I’d love if the search included blog posts from Planet Wikimedia and Open Planet Wikimedia. Technical village pumps and perhaps template talk pages would also be good places to search.

Finally, it seems obvious that code itself should be searchable (e.g. single-line comments that don’t make it to doxygen, or even class/function/variable names, etc.).

Waldir 14:22, 11 February 2013 (UTC)


 * @Waldir: I’ve just placed the list on the wiki: there isn’t any way to sync it automatically, is there? Planets can’t be searched, they’re ephemeral resources without archives so you’d be searching only the last few days, is this ok? Technical village pumps and template_talk namespaces are quite a challenge but I guess they may be included if someone made a list. Or you can add them yourself, I’ve added you as admin.
 * Nemo 17:27, 11 February 2013 (UTC)


 * I never tried creating a custom Google search engine before, but I doubt there's an automatic way to keep it up to date, which is a shame. I've heard that Yahoo's BOSS is quite powerful, but I don't know whether it allows automatic updating either (e.g. from a text file somewhere public, a git repo, etc.). It might be interesting to try out nevertheless.
 * As for the planets, that's quite a shame. I wonder if we should then include the urls of the blogs included in the planet(s). The blogs are quite an important resource.
 * I see that code search is already included. So that leaves, from my suggestions:
 * blog posts from Planet Wikimedia (needs to be regularly updated, from here)
 * Technical village pumps
 * Template talk pages
 * And a new one I just thought of:
 * Signpost Technology Reports (including possible previous URL formats using the BRION nomenclature)
 * What else? --Waldir (talk) 21:25, 11 February 2013 (UTC)
 * Maybe there's a way to create a mirror of the planet which is easier to update. Could be just a feedburner or Google reader or whatever, maybe: do you know some such services? Updating hundreds of entries by hand doesn't look fun, unless you think it's worth adding only a smallish subset of tech-oriented blogs. --Nemo 06:38, 12 February 2013 (UTC)
 * I do think that a subset would make more sense. Not only are most of the blogs not exclusively about wiki stuff (even less so wikitech stuff), but many of those that do cover it aren't exclusively so (i.e. they use tags or categories). Unfortunately tags/categories can't be filtered through the URLs, so we'll have to add only blogs whose main focus is wikitech, which I believe is a reasonable subset. Then there are those which can be filtered because they publish posts whose titles follow a pattern, such as the Wikimedia blog's technical reports. There used to be a blog only for tech staff; is the domain still active? We could add it to search the older posts that were published there. In any case, it seems to be manageable. --Waldir (talk) 13:01, 12 February 2013 (UTC)

Gitweb not indexable
https://www.google.com/search?q=site:https://gerrit.wikimedia.org/r/gitweb returns a single result, with the following non-description: "A description for this result is not available because of this site's robots.txt". What purpose does that setting serve? --Waldir (talk) 21:29, 11 February 2013 (UTC)
 * I asked on IRC. Apparently this was for performance reasons (gitweb couldn't cope with crawlers). Possibly the upcoming update to gitblit will make things better. Logs of that conversation should be available here sometime from now; timestamps approx 21:32 --> 21:42. --Waldir (talk) 21:44, 11 February 2013 (UTC)
 * It's enough to look in ^demon's talk, where I asked the same question. ;) I had added it "just in case". --Nemo 06:38, 12 February 2013 (UTC)
 * Indeed I re-added it on the list here, so it doesn't raise any eyebrows, but not in the actual search engine definition, out of laziness. We can certainly have it there too, as it's inconsequential. --Waldir (talk) 13:01, 12 February 2013 (UTC)
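
For anyone who wants to check how a robots.txt rule like the one discussed above affects crawlers, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content below is hypothetical (the real file on gerrit.wikimedia.org may well differ), and the `crawl_allowed` helper is just an illustrative name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only; this is NOT
# copied from gerrit.wikimedia.org and the real rules may differ.
HYPOTHETICAL_ROBOTS = """\
User-agent: *
Disallow: /r/gitweb
"""

parser = RobotFileParser()
parser.parse(HYPOTHETICAL_ROBOTS.splitlines())


def crawl_allowed(url, agent="Googlebot"):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


# Under the assumed rule, gitweb pages are off limits to crawlers,
# while other paths on the same host are not affected.
gitweb_ok = crawl_allowed("https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git")
root_ok = crawl_allowed("https://gerrit.wikimedia.org/r/")
print(gitweb_ok, root_ok)
```

To test against the live file instead, `RobotFileParser` also offers `set_url()` and `read()`, which fetch the rules over the network.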

Bugzilla not indexed
See http://www.google.com/search?q=site:bugzilla.wikimedia.org -- it says HTTPS has to be used. http://www.google.com/search?q=site:https://bugzilla.wikimedia.org doesn't do any better. --Waldir (talk) 22:52, 11 February 2013 (UTC)
 * This is weird, it used to work. If Gmane works, however, it would be a duplicate of wikibugs. --Nemo 06:38, 12 February 2013 (UTC)
 * Yes, the wikibugs list was only added because bugzilla wouldn't. If we manage to make it work the ideal would be to remove wikibugs-l. --Waldir (talk) 13:01, 12 February 2013 (UTC)

Sandbox
For testing purposes, a less user-friendly but immediate option is to use URLs such as https://www.google.com/search?q=site:mediawiki.org+OR+site:github.com/wikimedia/ --Waldir (talk) 22:52, 11 February 2013 (UTC)
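
A sandbox URL like the one above can also be generated from a list of sites, which may be handy when trying out different combinations. This is only a rough sketch: `sandbox_url` is a made-up helper name, and it simply joins `site:` filters with `OR` the same way the hand-written URL does.

```python
from urllib.parse import urlencode


def sandbox_url(sites):
    """Build a Google search URL restricted to the given sites.

    Hypothetical helper: joins "site:" filters with OR, as in the
    hand-written example URL.
    """
    query = " OR ".join(f"site:{site}" for site in sites)
    # Keep ":" and "/" unescaped so the result stays readable;
    # spaces are encoded as "+".
    return "https://www.google.com/search?" + urlencode({"q": query}, safe=":/")


url = sandbox_url(["mediawiki.org", "github.com/wikimedia/"])
print(url)
```

With those two sites it reproduces the URL given above; more entries from the list can be passed in the same way.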

Duplicates

 * 1) Why were duplicates added, like   and its subset  ? I think the wildcard crosses directories and everything.
 * 2) Gmane is not crawled, but some of its subdomains indeed are (comments.gmane.org, perhaps permalink, what else?). I'd use only Gmane.
 * 3) mediawiki-cvs/mediawiki-commits was added to search commit messages and code review comments, doesn't github add duplicates? --Nemo 06:38, 12 February 2013 (UTC)


 * Those aren't really duplicates. Both will be searched in the generic search (and duplicate results are, I believe, filtered by google itself), but the second one is exclusive to the "commits" category. That is, searching in that category won't return entries such as, etc.
 * What do you mean, "only use gmane"? There's only one category (Bugs) where a specific list (wikibugs-l) was added through two archivers (gmane and mail-archive), but I did so because I'm not sure google crawls all gmane's archives or presents them in the best way (the page title, for instance, in many cases doesn't seem to be the message's title but the list's description). In this case, if we are to remove one entry, I'd probably vote to keep mail-archive, but I don't think having both is necessarily detrimental. Let's wait and see what people say.
 * Yes, duplicates are added, but as I mentioned in the previous point, that's not necessarily a bad thing. I would, for instance, prefer clicking github links, while others might be more comfortable with the commit format that is sent to the mailing lists (doubtful, but you never know). Again, I think actual usage and feedback are key to making decisions here. In any case, too many results is better than too few.
 * --Waldir (talk) 13:01, 12 February 2013 (UTC)

Page name
Should we think about moving this page? It seems to be stable enough for announcement in mailing lists, signpost, blogs, whatever, and given the permanent nature of these, it would be good to have a better URL going into the archives. I renamed the search engine to "Mediawiki Unified Search", but we can simply call it MediaWiki Search or something like that. What do you think? --Waldir (talk) 13:01, 12 February 2013 (UTC)