Topic on Help talk:CirrusSearch

Is it possible to search by section?

Karlpoppery (talkcontribs)

For example, if I have a section called "Etymology" on many of my pages, would it be possible to search only the text that appears in the etymology sections?

If there's no option to do that, is it something that could be done with regex in a reasonable time?

197.235.75.234 (talkcontribs)

No, it isn't possible to do it directly.

Yes, it is theoretically possible to use regex, but in many cases it will time out.

Karlpoppery (talkcontribs)

I see, that's a bummer. There must be ways to hack around this, though. For example, I could automatically save those sections in their own articles under the same category, then do a search by category and redirect the result to the original article. Or maybe I could use Cargo.

197.235.219.81 (talkcontribs)

Such hacks will work, but they seem like a lot of effort. The way to make a timeout less likely is to use an efficient regex, along with a simplified plain-text term. For example:

"personal life" insource:/\=\=\s*Personal life\s*\=\=.*?he attended.*?\=\=\w*?\=\=.*?/

https://en.wikipedia.org/w/index.php?search=insource%3A%22personal+life%22+insource%3A%2F%5C%3D%5C%3D%5Cs*Personal+life%5Cs*%5C%3D%5C%3D.*%3Fhe+attended.*%3F%5C%3D%5C%3D%5Cw*%3F%5C%3D%5C%3D.*%3F%2F&title=Special%3ASearch&profile=default&fulltext=1

The search above attempts to find "he attended" within a section. Because it is simplified, it will fail in many cases, for example if an article contains only one section or if a section is created by a template. Sections are simply too complicated to deal with, because people think of them as sub-documents but in reality they are just pieces of the same document. A related feature request is https://phabricator.wikimedia.org/T27062.

Note that the query above results in a timeout.

TJones (WMF) (talkcontribs)

Regexes in general are tricky, and making them work efficiently in on-wiki search is hard, too. So, a few notes:

  • Put as many required words outside the regex as possible. "personal life" gets about 150K results. "personal life" attended only gets 50K, which cuts the number the regex has to scan to a third. "personal life" "he attended" gets only 13K. The regex matches both "he attended" and "she attended", but I would suggest searching them separately, since "personal life" "she attended" only gets 5K results, and splitting your search into two queries against a combined ~18K results is better than one query against 50K.
    • Any other non-regex info, like categories, also helps. Anything to allow the search index to give the regex fewer documents to scan is a plus.
  • In general, put as much relevant plain text in your regex as possible. We use plain text trigrams to accelerate the regex search, so we're limiting the regexes to scan only documents with "Per", "ers", "rso", etc. In this case it doesn't help much because the plain text in the regex is almost the same as the non-regex search terms, though the regex is case sensitive, so the "Per" will filter the list down a bit more.
  • Regexes are case sensitive unless you tell them not to be with /i at the end, so the regex suggested above won't match sentences that start with "He attended", like Jamie How. Rather than make everything case insensitive, I'd use [Hh] to allow that one character to match upper or lower case, since here it still leaves a lot of plain text trigrams in place. (I don't recall whether the trigram processing looks into simple character classes like [Hh] and makes trigrams across them, so don't count on it.)
  • The regex suggested above doesn't quite guarantee what you want, only that "he attended" appears somewhere after the Personal life section title. It could be in a different section, as in the case of Jeff Grub, where it's in the section after the Personal life section.
    • You can instead match "not equal sign" with [^=]* though it is more expensive. It will also fail to match if there is an equal sign in the Personal life section before the "he attended" part. Unlikely, but possible. It will also exclude results with a sub-section under Personal life that contains "he attended", which is probably not desirable.
  • Also, the extra \=\=... at the end requires another same-level section after the Personal life section, so it won't match if Personal life is the very last section, which may be unlikely, but is not true for all section titles. I'd drop it.
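To make the section-boundary and trigram points above a bit more concrete, here is a small offline sketch in Python. It uses Python's re module, not the on-wiki search engine, so the regex dialects may differ; the two wikitext samples and the trigrams helper are invented for illustration. re.DOTALL is an assumption made here so that . can cross line breaks in Python.

```python
import re

# Hypothetical wikitext samples, invented for illustration only.
SAME_SECTION = (
    "== Personal life ==\n"
    "As a teenager he attended a local school.\n"
    "== Career ==\n"
    "He turned professional later.\n"
)
NEXT_SECTION = (
    "== Personal life ==\n"
    "He lives abroad.\n"
    "== Education ==\n"
    "Later he attended university.\n"
)

# Lazy wildcard: nothing stops it from running past the next heading.
LOOSE = re.compile(r"==\s*Personal life\s*==.*?he attended", re.DOTALL)

# "Not an equals sign": the match cannot cross any "=" character,
# so it stops at the next heading (or at a stray "=" in the text).
STRICT = re.compile(r"==\s*Personal life\s*==[^=]*he attended")


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


if __name__ == "__main__":
    for name, sample in (("same", SAME_SECTION), ("next", NEXT_SECTION)):
        print(name, bool(LOOSE.search(sample)), bool(STRICT.search(sample)))
    print(sorted(trigrams("Personal life")))
```

On these samples the loose pattern matches both texts, including the one where the phrase sits in the following section, while the [^=]* version matches only the first. The trigram set for "Personal life" includes "Per", "ers" and "rso", which is the kind of plain text the index can use to narrow the candidate documents.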

If you don't need a perfect list, but just a short list of ~100 articles you could review manually to find what you need, I'd recommend these two for this use case (link for ..he.. and ..she.. queries):

"personal life" "he attended" insource:/\=\=\s*Personal life\s*\=\=.*?[hH]e attended/

and..

"personal life" "she attended" insource:/\=\=\s*Personal life\s*\=\=.*?[sS]he attended/

Both of these queries finish on English Wikipedia, and return ~750 to ~1600 results. Still a lot, almost certainly with false positives, but potentially manageable.
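If you want to generate the two queries and their search links rather than typing them, something like the following Python sketch should work. The URL parameters are copied from the link earlier in this thread; the function names attended_query and search_url are made up for this example and are not part of any MediaWiki tooling.

```python
from urllib.parse import urlencode

# Base URL and parameters taken from the search link quoted earlier in the thread.
BASE = "https://en.wikipedia.org/w/index.php"


def attended_query(pronoun):
    """Build the plain-text-plus-regex query for 'he' or 'she' (hypothetical helper)."""
    first = pronoun[0]
    # Allow only the first letter to vary in case, e.g. "[hH]e" or "[sS]he".
    char_class = "[" + first.lower() + first.upper() + "]"
    regex = (r"\=\=\s*Personal life\s*\=\=.*?"
             + char_class + pronoun[1:] + " attended")
    return '"personal life" "%s attended" insource:/%s/' % (pronoun, regex)


def search_url(query):
    """Return a Special:Search URL for the given query string (hypothetical helper)."""
    params = {
        "search": query,
        "title": "Special:Search",
        "profile": "default",
        "fulltext": "1",
    }
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    for pronoun in ("he", "she"):
        query = attended_query(pronoun)
        print(query)
        print(search_url(query))
```

Running the two queries separately keeps the required plain-text phrase as specific as possible, which is the main thing that reduces how many documents the regex has to scan.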

EDIT: I'm sure there is still some way to further improve the regex. There always is! Hopefully this helps, though.
