Regexes in general are tricky, and making them work efficiently in on-wiki search is hard, too. So, a few notes:
- Put as many required words outside the regex as possible. "personal life" gets about 150K results. "personal life" attended gets only 50K, which cuts the number of documents the regex has to scan to a third. "personal life" "he attended" gets only 13K. The regex matches both "he attended" and "she attended", but I would suggest searching them separately, since "personal life" "she attended" only gets 5K results, and splitting your search into two queries against ~18K results in total is better than one query against 50K.
- Any other non-regex info, like categories, also helps. Anything to allow the search index to give the regex fewer documents to scan is a plus.
- In general, put as much relevant plain text in your regex as possible. We use plain text trigrams to accelerate the regex search, so we're limiting the regex to scanning only documents that contain "Per", "ers", "rso", etc. In this case it doesn't help much because the plain text in the regex is almost the same as the non-regex search terms, though the regex is case sensitive, so the "Per" will filter the list down a bit more.
- Regexes are case sensitive unless you tell them not to be with /i at the end, so this won't match sentences that start with "He attended", as in Jamie How. Rather than make everything case insensitive, I'd use [Hh] to allow that one character to match upper or lower case, since that still leaves a lot of plain text trigrams in place. (I don't recall whether the trigram processing looks into simple character classes like [Hh] and makes trigrams across them, so don't count on it.)
- The regex suggested above doesn't quite guarantee what you want. It only guarantees that "he attended" appears somewhere after the Personal life section title, so the match could be in a different section, as in the case of Jeff Grub, where it's in the section after the Personal life section.
- You can instead match "not equal sign" with [^=]* (in place of .*?), though it is more expensive. It will also fail to match if there is an equal sign in the Personal life section before the "he attended" part. Unlikely, but possible. It will also exclude results with a sub-section under Personal life that contains "he attended", which is probably not desirable.
- Also, the extra \=\= ... at the end requires another same-level section after the Personal life section, so it won't match if Personal life is the very last section. That may be unlikely here, but it isn't unlikely for every section title. I'd drop it.
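To make the last few points concrete, here is a small sketch using Python's re module as a stand-in. It is not the engine that on-wiki search uses, so syntax and performance will differ; the toy wikitext is made up, and re.DOTALL is my assumption so that "." can cross line breaks the way a scan over a whole page would.

```python
import re

# Made-up wikitext snippets, purely for illustration.
WRONG_SECTION = "== Personal life ==\nShe lives in Ohio.\n\n== Education ==\nHe attended a state college.\n"
SUBSECTION = "== Personal life ==\n=== Education ===\nHe attended a state college.\n"
LAST_SECTION = "== Personal life ==\nHe attended a state college.\n"

FLAGS = re.DOTALL  # assumption: let "." match newlines too

# Lowercase-only "he", lazy gap between the title and the phrase.
lazy_lower = re.compile(r"==\s*Personal life\s*==.*?he attended", FLAGS)
# [Hh] lets just that one character match either case.
lazy_either = re.compile(r"==\s*Personal life\s*==.*?[Hh]e attended", FLAGS)
# [^=]* cannot cross an equals sign, so it stops at the next heading.
bounded = re.compile(r"==\s*Personal life\s*==[^=]*[Hh]e attended", FLAGS)
# Trailing == demands more heading markup after the phrase.
with_trailing = re.compile(r"==\s*Personal life\s*==.*?[Hh]e attended.*?==", FLAGS)


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    samples = {
        "wrong section": WRONG_SECTION,
        "sub-section": SUBSECTION,
        "last section": LAST_SECTION,
    }
    patterns = {
        "lazy_lower": lazy_lower,
        "lazy_either": lazy_either,
        "bounded": bounded,
        "with_trailing": with_trailing,
    }
    for sample_name, text in samples.items():
        for pattern_name, pattern in patterns.items():
            print(f"{sample_name:14} {pattern_name:14} {matches(pattern, text)}")
```

On these samples, the lazy [Hh] version matches the phrase in the later Education section (a false positive), the [^=]* version rejects both that case and the sub-section case, and the version with the trailing == misses the page where Personal life is the last section.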
If you don't need a perfect list, but just a short list of ~100 articles you could review manually to find what you need, I'd recommend these two for this use case (link for ..he.. and ..she.. queries):
"personal life" "he attended" insource:/\=\=\s*Personal life\s*\=\=.*?[hH]e attended/
and..
"personal life" "she attended" insource:/\=\=\s*Personal life\s*\=\=.*?[sS]he attended/
Both of these queries finish on English Wikipedia, and return ~750 to ~1600 results. Still a lot, almost certainly with false positives, but potentially manageable.
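If you'd rather pull the two result lists with a script than through the search box, something like the following sketch could build the request URLs. It assumes the standard MediaWiki search API (action=query, list=search) on English Wikipedia; the helper name build_search_url and the limit of 100 are my own choices, and the code only builds the URL without sending any request.

```python
from urllib.parse import urlencode

# Assumption: the public English Wikipedia API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_search_url(pronoun, limit=100):
    """Build the search string and API URL for "he" or "she" (hypothetical helper)."""
    first, rest = pronoun[0], pronoun[1:]
    # One-character class, e.g. [hH], so most of the phrase stays plain text.
    char_class = f"[{first.lower()}{first.upper()}]"
    regex = rf"\=\=\s*Personal life\s*\=\=.*?{char_class}{rest} attended"
    # Plain-text terms first, so the index narrows what the regex must scan.
    query = f'"personal life" "{pronoun} attended" insource:/{regex}/'
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return query, f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    for pronoun in ("he", "she"):
        query, url = build_search_url(pronoun)
        print(query)
        print(url)
```

Keeping the two pronouns as separate requests follows the same reasoning as above: each regex scans a smaller candidate set.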
EDIT: I'm sure there is still some way to further improve the regex. There always is! Hopefully this helps, though.