Topic on Help talk:CirrusSearch

several corrections/improvements in description of regexps

9 comments • 18:45, 3 July 2021 2 years ago

Lustiger seth (talkcontribs)

hi!
i'm very used to regular expressions, but it's very hard for me to understand the corresponding paragraph here. in particular:

1. in the sentence "These return much much faster when [...]" it's not clear to me what "These" refers to.
2. "All regexp searches also require that the user develop a simple filter to generate the search domain for the regex engine to search:"
- it should be "the users develop" or "the user develops".
- the examples following that sentence should make clearer that one part creates the search domain (if i understood it correctly).
- in the first example: what is the difference between searching via insource:"debian.reproducible.net" insource:/debian\.reproducible\.net/ and searching via insource:"debian.reproducible.net"? if there is no difference, then the example is not good.
3. after the examples there's some text about an example with "FULLPAGENAME". it's not clear to me whether FULLPAGENAME is meant as a metasyntactic variable or literally.
4. what is an "HTML timeout"? does it mean http/server timeout?
5. the given link in section "Metacharacters" should be updated to https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html, right?
6. the example
- /"literal back\slash"/ is as good as /literal back\/slash/
seems wrong to me. shouldn't it be
- /"literal back\slash"/ is as good as /literal back\\slash/ ?
7. the typical line break characters "are not reserved for matching a newline".
- so how do i search for a string that does not contain a newline?
- what happens if i use \r or \n? are they treated as literal r and n respectively?
8. "The number # sign means something": ok, but what does it mean?

-- seth (talk) 09:34, 12 December 2020 (UTC)

Reply 09:34, 12 December 2020 3 years ago

Speravir (talkcontribs)

Ad 1: "These" refers to regexp searches. I think the preceding paragraph was rephrased, but nobody noticed how the next paragraph starts.

Ad 2: Compare with other parts of the help text. I think the singular form "the user develops" would be right here. Everyone can edit the text. We just need a translation admin afterwards.
Someone thought it would be clear from the context that the search domain is the first part of every example. Marking the search domain is a good idea, but how? Both italic and bold are already used.
insource:"debian.reproducible.net" is an index-based search, which has two consequences: it is case insensitive, and the period is a grey-space character, meaning that occurrences with e.g. a space in between would also be found. Cf. the section Words, phrases, and modifiers.
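
The difference described above can be sketched roughly in Python. This is only an illustration: `tokens`, `phrase_match` and `regex_match` are hypothetical helpers, and the tokenizer is a crude stand-in that assumes lowercasing plus splitting on non-alphanumeric characters, not the analyzer CirrusSearch actually uses.

```python
import re

def tokens(text):
    # Crude stand-in for an indexing analyzer (assumption):
    # lowercase the text and split on anything that is not an ASCII letter or digit.
    return re.findall(r"[0-9a-z]+", text.lower())

def phrase_match(phrase, text):
    # Emulates insource:"..." as a case-insensitive token-sequence match.
    needle, hay = tokens(phrase), tokens(text)
    n = len(needle)
    return any(hay[i:i + n] == needle for i in range(len(hay) - n + 1))

def regex_match(pattern, text):
    # Emulates insource:/.../ as a case-sensitive, character-exact search.
    return re.search(pattern, text) is not None

samples = [
    "see debian.reproducible.net for details",
    "see Debian Reproducible Net for details",
]
for s in samples:
    print(phrase_match("debian.reproducible.net", s),
          regex_match(r"debian\.reproducible\.net", s))
```

Under these assumptions the phrase match accepts both samples, while the regexp accepts only the first, which is why combining both parts in one query is not redundant.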

Ad 3: I thought it would be clear from the example (I did not add this) that a literal {{FULLPAGENAME}} was meant. And the following note tells you that this does not work in the search bar, but in links only (I assume using templates like en:Template:Search link).

Ad 4: I guess you are right. It’s the timeout after about 20 seconds that the warning box slightly above this text also warns about.

Ad 6: Yes, this is wrong.

Ad 7: The developers decided on a reduced feature set. (In German I always say: “Unfortunately, the regex search is neutered.”) This means that you cannot search for newlines. There is this one example, but it works only under certain circumstances. If you add \r or \n to your search query, they will, as far as I understand, be searched for literally.

Reply 01:13, 13 December 2020 3 years ago

Speravir (talkcontribs)

I have now made some fixes regarding items 2, 4 and 6, and I also marked the search domains in the one example block (item 2; I also fixed another mistake I noticed there).

@Shirayuki: Because you are in most cases the translation admin who marks new versions for translation: In my opinion the text block 286 (the one referred to in item 1) should be inserted into text block 288. What is the best way to do it? Splitting T:288? It has 2 sentences now, T:286 should be inserted between them, and afterwards the text has to be slightly adjusted (note that I added another sentence). What I have in mind is:

An "exact string" regexp search is a basic search; it will simply "quote" the entire regexp, or "backslash-escape" all non-alphanumeric characters in the string. Regexp searches return ''much much'' faster when you limit the regexp search-domain to the results of one or more index-based searches. All regexp searches block some server capacity for the duration of the search query. Therefore, all regexp searches also require that the user develops a simple filter to generate the search domain for the regex engine to search (in the examples, the index-based search domain is marked bold and the regexp part italic):

A slight issue is that a bit later the information about adding an index-based search domain is repeated in other words (292, 293, 565), but I think this important information is already repeated several times throughout the help text anyway.
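
As a minimal sketch of the "backslash-escape all non-alphanumeric characters" approach combined with an index-based filter, the query from the first example could be generated like this. `escape_exact` and `build_query` are hypothetical helper names, not part of any CirrusSearch tooling, and the sketch assumes the search text contains no double quote (which would break the quoted filter part).

```python
import re

def escape_exact(text):
    # Backslash-escape every character that is not an ASCII letter or digit,
    # so that the regexp matches the string literally.
    return re.sub(r"([^0-9A-Za-z])", r"\\\1", text)

def build_query(text):
    # The index-based filter comes first (it defines the search domain),
    # followed by the regexp part that does the exact matching.
    return 'insource:"{}" insource:/{}/'.format(text, escape_exact(text))

print(build_query("debian.reproducible.net"))
```

Escaping everything non-alphanumeric is deliberately blunt: it also escapes characters such as `#` or `\` that have a special meaning in the regexp syntax, so the caller does not need to know which ones are reserved.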

Reply 22:50, 14 December 2020 3 years ago

This post was hidden by Speravir (history)

Lustiger seth (talkcontribs)

thanks for your answer and some corrections! ad 1: i tried to solve that now. your hidden(?) solution is also ok for me. ad 2--6: yes, that's better now. :-) ad 7: so it is not possible to search for two strings that have to be written on the same line (in a given order, but with arbitrary chars between them), right? (in perl syntax: /foo.*bar/ or /foo(?-s:.*)bar/)

Reply 11:19, 19 December 2020 3 years ago

Speravir (talkcontribs)

/foo(?-s:.*)bar/ will not work, because modifiers are not supported (with the exception of i after the closing slash).

/foo.*bar/ will work, but will match on

foo bar (and an almost unlimited number of spaces in between, limited only by the maximum length of the wiki text)
foo (who the heck had the idea to use this as example placeholder) bar
foo\n
bar (note: not a literal \n, just marks the line break here; matches also an almost unlimited number of line breaks in between).

That’s what I meant by saying that you cannot search for newlines.
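
The behaviour listed above can be imitated in Python, assuming (as described in this thread) that the dot in insource:/.../ matches any character including line breaks. `insource_like` is a hypothetical helper that uses `re.DOTALL` to emulate this; it is not the engine CirrusSearch runs.

```python
import re

def insource_like(pattern, text):
    # Assumption: '.' also matches line breaks, which re.DOTALL emulates.
    return re.search(pattern, text, flags=re.DOTALL) is not None

same_line = "foo (placeholder) bar"
two_lines = "foo\nbar"

print(insource_like(r"foo.*bar", same_line))   # same line: matches
print(insource_like(r"foo.*bar", two_lines))   # across a line break: also matches
print(insource_like(r"foo *bar", two_lines))   # only spaces allowed in between: no match

# For comparison, Python's default '.' stops at a line break,
# which is the Perl-like behaviour asked about above.
print(re.search(r"foo.*bar", two_lines) is not None)
```

Restricting the gap to specific characters, as in /foo *bar/, is therefore the only way in this sketch to keep a match from spanning several lines.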

BTW, /foo *bar/ will also work, but not /foo\s*bar/, at least not reliably. A real example from Dewiki I had been asked for:

(The ping for Shirayuki you can see above did not work. In the hidden contribution I tried it with an explicit signature, but this did not work either. Because that message does not contain any other information, I have hidden it.)

Reply Edited 02:03, 20 December 2020 3 years ago

Speravir (talkcontribs)

Oh, I almost forgot: a better example showing that \s does not work, also from Dewiki. The central sandbox (named “Spielwiese”) contains by default the template {{Bitte erst NACH dieser Zeile schreiben! (Begrüßungskasten)}}.

This search will not find any occurrence: wikipedia: spielwiese insource:/\{\{Bitte\s*erst\s*NACH/.
But this search will: wikipedia: spielwiese insource:/\{\{Bitte *erst *NACH/.
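
One possible explanation for the two results above can be sketched in Python, under the assumption stated earlier in this thread that an escaped letter such as \s (or \r, \n) is searched for as the letter itself. `literalize_letter_escapes` is a hypothetical helper that rewrites a pattern accordingly so that Python's `re` module can imitate that reading; the real engine may behave differently.

```python
import re

def literalize_letter_escapes(pattern):
    # Assumption: a backslash followed by a letter stands for that letter,
    # so \s becomes a plain "s". Other escapes (e.g. \{ or \\) are kept.
    def repl(match):
        ch = match.group(1)
        return ch if ch.isalpha() else match.group(0)
    return re.sub(r"\\(.)", repl, pattern, flags=re.DOTALL)

template = "{{Bitte erst NACH dieser Zeile schreiben! (Begrüßungskasten)}}"

with_s = literalize_letter_escapes(r"\{\{Bitte\s*erst\s*NACH")
with_space = literalize_letter_escapes(r"\{\{Bitte *erst *NACH")

print(with_s)                                         # \s has become a literal s
print(re.search(with_s, template) is not None)        # no match on the template
print(re.search(with_space, template) is not None)    # matches
```

Under this reading, \s* means "zero or more letters s", which would also explain why /foo\s*bar/ appears to work only sometimes: it can still match text such as "foobar".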

Reply 01:51, 20 December 2020 3 years ago

DCausse (WMF) (talkcontribs)

Ad 5: updated, thanks

Ad 8: The number sign is part of the Lucene RegExp syntax, it denotes the empty language which is not useful for the purpose of insource://.

Reply Edited 20:43, 14 December 2020 3 years ago

Lustiger seth (talkcontribs)

ad 8: ok, thanks, i added the reference there. but i'm not sure whether i used the right syntax (with all that translate stuff). so it would be great if somebody would check.

Reply 11:24, 19 December 2020 3 years ago
