User:TJones (WMF)/Notes/Survey of Regular Expression Searches

From MediaWiki.org
Jump to navigation Jump to search

May 2018 — See TJones_(WMF)/Notes for other projects. See also T195491

Background[edit]

As part of T195491, Erik has been looking into the details of our regex processing and ways to handle ridiculously long-running regex queries. He pulled all the regex queries over the last 90 days to get a sense of what features people are using and what impact certain changes he was considering would have on users. Turns out there are a lot more users than I would have thought—which is good news! And a lot of them look like bots.

He also made the mistake of pointing me to the data and highlighting a common pattern—searches for interwiki links. I couldn't help myself—I started digging around found that the majority of the searches are looking for those interwiki links, and the vast majority of regex searches fall into three types.

Overall, there are 5,613,506 regexes total across all projects and all languages, over a 90-day period. That comes out to ~62K/day—which is a lot more than I'd expected, though I hadn't thought about bots using regexes.

The Big Three[edit]

Of the 5.6M regexes, 93.8% fall into three categories:

  • Interwiki links: 3,552,889 (63.3%) are interwiki links of the form insource:/\[\[en:[^#\]]+\]\]/ where 'en' can be replaced with the language code for any language. I only included those with 10 or more occurrences because I was then able to verify them all pretty easily. The ones that occur 50K times or more are:
     50673 insource:/\[\[nl:[^#\]]+\]\]/
     51428 insource:/\[\[it:[^#\]]+\]\]/
     52223 insource:/\[\[fa:[^#\]]+\]\]/
     55392 insource:/\[\[de:[^#\]]+\]\]/
     56798 insource:/\[\[fr:[^#\]]+\]\]/
     58155 insource:/\[\[es:[^#\]]+\]\]/
     64192 insource:/\[\[ru:[^#\]]+\]\]/
     67595 insource:/\[\[id:[^#\]]+\]\]/
    166646 insource:/\[\[en:[^#\]]+\]\]/

Bot! Based on the frequency and regularity, I think this looks like something at least partially automatically generated; a bot, an app, or a link used by a lot of people.

  • URLs: 904,545 (16.1%) are URLs, specifically ones formatted more or less like this: insource:"/www.example.org" insource:/\/www[.]example[.]org[ \|\]}$\\n]/i. No individual query occurs more than 12 times.

Bot! The regularity of the formatting all but requires this to be done by an automated bot or app. This looks like a tool for looking up whether specific URLs occur on-wiki.

  • Library of Congress: 808,224 (14.4%) appear to be Library of Congress (LOC) collection ID queries. The majority end with LOC or +LOC, while some start with loc.gov LOC. The rest of the query is of the generally of the form insource:/<COLLECTION>\.<ID>/, many with an additional character class ([\s\<\}\n\.\,\;]] that looks like it's trying to find a word boundary. The most common collections, after stripping out the LOC and word boundary parts of the query and replacing the ID with #, are:
      17454 insource:/nclc.#/
      19248 insource:/cwpb.#/
      40943 insource:/highsm.#/
      42445 insource:/gsc.#/
      50781 insource:/ppmsca.#/
      55346 insource:/pga.#/
      59482 insource:/npcc.#/
      98915 insource:/cph.#/
      99561 insource:/matpc.#/
     126837 insource:/hec.#/
     139778 insource:/ggbain.#/

Bot! The regularity of the formatting, especially the escaping and the word boundary character classes, makes this look like a bot or an app.

Some Stragglers[edit]

After filtering out the three groups above, a few more things popped up.

  • Missing links: 44,578 instances (0.8%; 16,754 unique) of the schema insource:/<X>/ -linksto:<X>. Looks like this is intended to find missing links, presumably so someone can add them. More of these in Hebrew than anything else, though also lots in Arabic and English.
  • Number + apostrophe: One oddly common pattern was a number followed by a straight or curly apostrophe, possibly followed by another character or two. No individual query was super common, but overall there were 11,781 (0.2%) that fit the pattern. The most common individual queries are:
    120 insource:/ 1’/
    124 insource:/7'0/
    128 insource:/2'\,/
    133 insource:/0' \(/
    135 insource:/5'\,/
    135 insource:/9'\,/
    138 insource:/9'\|/
    149 insource:/0'0/
    154 insource:/8'0/
    155 insource:/0'\,/
    159 insource:/0’ /
    202 insource:/6'\,/
    689 insource:/0's/i

I don't know if this is a bot, or just people typing out this relatively straightforward query. Not sure what it's trying to do.

  • Bio Templates: There are 3,452 distinct queries (10,886 (0.2%) total) looking for a "Bio" template and what appear to be human names: hastemplate:Bio insource:/Nome *= *<name>[^a-z]/.
  • Number One with a Bullet: There are 6,519 distinct queries (6,596 total) of the form linksto:"Wesley Moodie" insource:/\* *\[\[Wesley\ Moodie/, which seem to be looking for names that are in a bulleted list.
  • Listed Buildings: There are 3,799 distinct queries (5,268 total) looking for "listed buildings" in German by Object ID: insource:/ObjektID *= *<#>/ intitle:"Liste der denkmalgeschützten Objekte in". where # is an ID number.
  • Randomly Popular: No other obvious large-scale patterns jumped out at me in the remaining queries—though lots of little patterns of a few hundred queries or less showed up. However, some of the remaining individual queries occur over a thousand times (i.e., averaging more than 10 times a day for 90 days)—which is also interesting. на России means "in Russia" in Russian, překlad is Czech for "translation", and the Chinese regex includes words for "quick delete" and the like. The three [Cc]ite patterns stick out a bit—but there aren't any others of note like that—maybe a bot or a really useful link?
     1125 insource:"на России" insource:/на России,/
     1641 -insource:/\{\{[Pp]řeklad/
     3433 insource:/\<ref?\>\{\{[Cc]ite/ hastemplate:no_footnotes
     3434 insource:/\<ref?\>\[http/ hastemplate:no_footnotes
     3435 insource:/\{\{[Cc]ite/ hastemplate:blpunsourced
     3436 insource:/\{\{[Cc]ite/ hastemplate:unreferenced
     3437 insource:/\<ref\>\[http/ hastemplate:blpunsourced
     3438 insource:/\<ref\>\[http/ hastemplate:unreferenced
     7204 insource:/\{\{(D|d)elete\|/
     7224 insource:/\{\{(D|d)\|/
     7226 insource:/\{\{(A|a)fd\|/
     7227 insource:/\{\{(V|v)fd\|/
     9701 insource:/gate\/big5/
     9706 insource:/\[\[\[/
    12682 insource:/\{\{\s*((db|d|sd|csd|speedy|delete|速刪|速删|快刪|快删|有爭議|有争议|[vaictumr]fd|delrev|存廢覆核|存废复核)\s*(\||}})|(db|vfd)-)/
    19362 insource:/zh\.wikipedia\.org/

Bots Are Friends[edit]

I've pointed out that most of these look like bots, apps, tools, or in some cases maybe useful links that get used frequently. None of that is bad! As long as a bot or bot-like thing that puts up the really big numbers identifies itself as such (set your User Agent, folks!) and doesn't abuse the service it's connecting to, then more power to them!

We do often try to identify bots because they can skew our A/B tests and other research because they have such extreme behavior. A bot that makes it into an A/B test and then has a session that lasts for 72 hours and sends over 200,000 queries will skew the results by more than a little.

I also want to point out that a lot of these automatic-looking bot-like queries are very well formulated. Regexes are expensive because they need to scan over the text to find what they need. Using a non-regex part of the query to limit the number of articles helps a lot.

In the case of the URL queries—like insource:"/www.example.org" insource:/\/www[.]example[.]org[ \|\]}$\\n]/i—the insource:"/www.example.org" part of the query gets the set of documents to consider down from over five million to 74 on English Wikipedia. Scanning just those with the exact regex insource:/\/www[.]example[.]org[ \|\]}$\\n]/i is much much faster than scanning all five million.

Similarly, the query insource:"на России" insource:/на России,/ is well written because anything that has "на России," with a comma will of course also have the phrase "на России" in it. On Russian Wikipedia, there are only three instances with a comma, but you only have to search the 80 articles that match insource:"на России" to find them. It also decreases the number of false postive hits on other words that end in "-на" that are followed by "России,".

I wrote a blog post, "So -happy to meet you," about how to make similar queries more efficient.

Well-formulated queries like that are good for the services they use—because the queries are less expensive to run—and good for the users sending the queries—because the queries are less likely to time out and give partial results.