User talk:TJones (WMF)/Notes/Survey of Regular Expression Searches

From MediaWiki.org

Interwiki links

The regex search for interwiki links seems rather innocuous (why are they even using a regex and not a literal insource search?). However, if it turned out to be very expensive, we could always add another special page or API query from which to get such lists. I believe someone has already filed such a feature request. --Nemo 05:26, 31 May 2018 (UTC)

I don't think the interwiki links are necessarily a problem, though there are an awful lot of them! Regex search is "trigram accelerated": it first finds pages that contain the obvious trigrams from the regex, and only then runs the more costly regex itself on those pages. So for insource:/\[\[fr:[^#\]]+\]\]/ it finds pages with "[[f", "[fr", and "fr:" before applying the regex. These "nice" searches work well and don't time out the way more pattern-heavy regexes with no obvious trigrams do. I think they use the regex feature because some literal characters are still very hard to search for, even with insource. The query insource:"[[fr:" gives the same results as insource:"fr" because tokenization still happens, so the brackets and colon are not matched literally. The original query also limits results to links that do not have a # anchor, which would be impossible with a non-regex insource search. We didn't track down the source of these queries because this analysis wasn't the goal of the original data gathering, and for now it's not high enough on the list to go back and find out. Trey Jones (WMF) (talk) 19:04, 5 June 2018 (UTC)
Thanks! --Nemo 19:19, 5 June 2018 (UTC)
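
For readers who want to see the idea concretely, the trigram acceleration described above can be sketched in a few lines of Python. This is only an illustration of the two-phase approach, not the actual search implementation: the names trigrams and search and the sample pages are hypothetical, and the literal "[[fr:" is supplied by hand, whereas a real engine would derive the required trigrams from the regex and look them up in an index rather than scanning each page.

```python
import re


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def search(pages, literal, pattern):
    """Two-phase search: cheap trigram prefilter, then the full regex.

    `literal` is a string every match must contain. It is passed in by
    hand here (an assumption of this sketch); a real engine extracts it
    from the regex and consults a trigram index instead of rescanning.
    """
    required = trigrams(literal)
    regex = re.compile(pattern)
    hits = []
    for title, source in pages.items():
        # Phase 1: skip pages that lack any of the required trigrams.
        if not required <= trigrams(source):
            continue
        # Phase 2: run the costly regex only on the surviving candidates.
        if regex.search(source):
            hits.append(title)
    return hits


# Hypothetical sample pages, made up for this illustration.
pages = {
    "A": "Text [[fr:Paris]] more",
    "B": "Text [[fr:Paris#Histoire]] more",
    "C": "No French link; just fr and brackets [[de:Berlin]]",
}

print(sorted(trigrams("[[fr:")))  # ['[[f', '[fr', 'fr:']
print(search(pages, "[[fr:", r"\[\[fr:[^#\]]+\]\]"))  # ['A']
```

Page C is rejected by the trigram check alone, so the regex never runs on it. Page B passes the trigram check but fails the regex because of the # anchor, which is the part a plain insource search could not express.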