
Topic on Help talk:CirrusSearch

Automatik (talkcontribs)

Hi. Is there any way to exclude redirects from search results? For example, I want to find entries that are not redirects and that contain a particular character in their title. How can I do that?

TJones (WMF) (talkcontribs)

Unfortunately, there's no easy way to exclude redirects from search results.

However, depending on the scope of the task you are trying to complete and your technical ability you could try to use the Search API to semi-automatically do what you need.

This query will give you back the top results with "English Wikipedia" in the title or a redirect:

https://en.wikipedia.org/w/api.php?action=query&list=search&srlimit=50&srsearch=intitle:%22english%20wikipedia%22

The default format is JSON converted to HTML so it's easy to read for a human, but hard to read for a computer. If you only have a small number of queries to deal with, and only need a limited number of results from each (up to 500—set by srlimit), you might be able to get what you need by getting these results and looking through the titles by hand.

If you need a computer to process the results for you, say, because you have many queries, you can get real JSON by adding &format=json:

https://en.wikipedia.org/w/api.php?action=query&format=json&list=search&srlimit=50&srsearch=intitle:%22english%20wikipedia%22

On a Unix-like command line (I'm working in Terminal on OS X) you can use curl to fetch the JSON, python to make it pretty, and grep to pull out the titles, and grep again to find the specific ones you want:

curl -s "https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srlimit=50&srsearch=intitle:%22english%20wikipedia%22" | python -m json.tool | grep "\"title\":" | grep -i "english wikipedia"

Note that the API URL is URL-encoded (spaces become %20, quotes become %22, etc.).

Results:

   "title": "English Wikipedia",
   "title": "Simple English Wikipedia",
   "title": "Notability in the English Wikipedia",

The results aren't pretty, and in this case there are only 8 results total and 3 that are not redirects. If you are searching for specific characters, you may need to do some more pre-processing before the final grep. (If you are searching for "e", everything will match, because "title" has an "e" in it, for example.) If you need to go through more than the top 500 results, you'll have to figure out how to get the API to give you additional results, etc.
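For anyone who would rather avoid the grep pitfall just described, here is a minimal Python sketch of the same pipeline. It parses the JSON and compares only the title values, so a search for "e" does not match the word "title" itself. The function names and the User-Agent string are made up for illustration; only the API parameters come from the URLs above.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(query, limit=50):
    """Build a list=search URL; urlencode takes care of the URL-encoding."""
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srlimit": limit,
        "srsearch": query,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Fetch a URL and parse the body as JSON."""
    # The User-Agent value is a placeholder (assumption); use your own.
    request = urllib.request.Request(
        url, headers={"User-Agent": "redirect-filter-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def titles_from_response(data):
    """Pull the page titles out of a parsed search response."""
    return [hit["title"] for hit in data.get("query", {}).get("search", [])]


def titles_containing(titles, needle):
    """Keep titles that contain needle, ignoring case (like grep -i)."""
    needle = needle.lower()
    return [title for title in titles if needle in title.lower()]
```

Something like `titles_containing(titles_from_response(fetch_json(build_search_url('intitle:"english wikipedia"'))), "english wikipedia")` should then play the role of the curl and grep chain. A page that was found only through a redirect is reported under its own title, which is why the final filter drops it.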

It's not pretty and it's not easy, but it's a start.

Automatik (talkcontribs)

Thanks for this answer. It is clearly not easy or convenient, and it is pretty similar to running the query manually, then filtering visually with CTRL+F "(redirection" and picking only the results without the "(redirection" text highlighted. Developers should add an option "do not follow redirects", to avoid tedious work for all users using this functionality. I guess it is not so difficult, as this option already exists in some use cases (e.g. when displaying a page with &redirect=no).

TJones (WMF) (talkcontribs)

It is very similar to the ctrl-F solution, just more automatic! For me, somewhere around 25 to 50 queries it would be faster (or at least less boring and thus less error-prone) to go for a hacked-together semi-automatic solution.

Adding a title-only index is probably not a trivial change to make from our current state. We have a search index for intitle:, with the text from titles and redirects in it. There's no differentiation between the title and redirect text once it's in the index. I think we'd have to create another field that was title-only (and maybe a redirect-only field would be equally useful—which together would be bigger than the size of the current title index).

It's not clear to me how many people would need such an index. I'm really curious what your use case is—both to get a sense of how useful title-only search would be, and to see if there's a better clever way to get what you need.

You could open a Phabricator ticket and ask for this feature, but that certainly doesn't guarantee that it would be implemented any time soon.

Automatik (talkcontribs)

On the French Wiktionary, we use the typographic apostrophe in titles, instead of the typewriter/vertical apostrophe. I was looking for titles that use the vertical apostrophe and are not redirects.

Moreover, I am using Windows, which is less convenient than a Unix-like system for command-line tools (the documentation is unclear, there is no unified way to run commands in Windows, etc.).

TJones (WMF) (talkcontribs)

Ah, that's a sensible use case. No other obvious solution comes to mind, but I'll think about it more and if I think of anything useful I'll let you know.

If you are already familiar with Unix-like commands (or want to learn), but just don't have them available because you are on Windows, you could look at Cygwin (English WP, French WP, website)—it's not an emulator or virtual machine, it just gives you versions of standard Unix commands that work on Windows. I used it about 15 years ago when I had a Windows machine for my job. I found it very useful back then, but haven't used it since.

Automatik (talkcontribs)

Thanks for the advice. However, the bash terminal from Cygwin does not work (and the solution suggested in https://superuser.com/questions/1172759/cygwin-error-failed-to-run-bin-bash-no-such-file-or-directory does not work either). Moreover, now that I have installed the program, I cannot uninstall it easily: it does not appear in "Programs and features", and when I click "Uninstall" after right-clicking the program icon, it just opens the "Programs and features" window anyway.

TJones (WMF) (talkcontribs)

Oh no! I should have known better than to suggest software I haven't used in so long—but it was so nice back in the day. I haven't used Windows in almost 15 years either, so I don't really have any helpful advice. Crap, I'm sorry!

Automatik (talkcontribs)

No worries: I "uninstalled" it by removing its folders, and re-installed it using another repository, and now it works! Thanks for the tip then. To look for more than 500 results, I added the &sroffset=500 parameter (then 1000, 1500, and so on, until no results are found).
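The sroffset loop can also be sketched in Python. This is only an illustration: collect_all_results and fetch_page are hypothetical names, and fetch_page stands for whatever code fetches one page of parsed JSON with the given &sroffset value. The 10000 cap is an assumption, based on the rough upper limit on search results mentioned elsewhere in this thread.

```python
def collect_all_results(fetch_page, page_size=500, max_results=10000):
    """Gather search hits page by page until an empty page comes back.

    fetch_page(offset) is a caller-supplied function (hypothetical) that
    returns the parsed JSON for one request made with sroffset=offset.
    """
    results = []
    offset = 0
    while offset < max_results:
        data = fetch_page(offset)
        page = data.get("query", {}).get("search", [])
        if not page:
            # No results at this offset: we have walked past the end.
            break
        results.extend(page)
        offset += page_size  # 0, 500, 1000, 1500, ...
    return results
```

The API response also carries a continue block containing the next sroffset value, so reading the offset from there instead of incrementing by hand would be an alternative design.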

Speravir (talkcontribs)

Oh, slightly funny: Unaware of this thread I recently opened a ticket on Phabricator: phab:T204089.

197.235.98.211 (talkcontribs)

It seems that it used to be possible to filter redirects at some point, and that this was removed: see https://phabricator.wikimedia.org/T5174 and https://phabricator.wikimedia.org/rMW52e699441edf2958701cea692a5dc3243ec3b064.

It seems developers are confused and going back and forth between removing and readding redirects to search. As the old saying goes, "clients don't know what they want". Anyway, a more sensible approach would be a degree of faceting, where it returns all results but aggregates similar properties, e.g. many pages will be in the same category, or many pages will be redirects, disambiguations, poor quality stubs, etc...

It is probably simpler to resolve this using the API, since it already has options for redirect titles. There are also at most about 10000 results, so it would probably be less challenging to filter through those. Anyway, if there aren't too many search results, it is easier to include the redirect title in the API search results and use your favorite replace tool to clean up all those that don't match, e.g. https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=shakespeare&srlimit=500&srprop=redirecttitle. This would be easier if CSV were a valid API output format.
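A rough Python sketch of that idea follows. It assumes that, with srprop=redirecttitle, the API adds a "redirecttitle" key only to hits that matched through a redirect; that is worth checking against a real response before relying on it. The function names are invented for illustration, and the CSV step just stands in for the missing CSV output format.

```python
import csv
import io


def drop_redirect_matches(results):
    """Keep only hits that carry no "redirecttitle" key.

    Assumption: the key is present only when the hit matched via a redirect.
    """
    return [hit for hit in results if "redirecttitle" not in hit]


def results_to_csv(results):
    """Render hits as CSV text with title and redirecttitle columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "redirecttitle"])
    for hit in results:
        # Hits without a redirect match get an empty second column.
        writer.writerow([hit["title"], hit.get("redirecttitle", "")])
    return buffer.getvalue()
```

Here results would be the list found under query → search in the parsed JSON response. The CSV text can be saved to a file and opened in a spreadsheet for manual cleanup.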

Speravir (talkcontribs)

(Nitpicking) @IP, apparently not: User/developer debt closed phab:T90807 as declined, but with the words “If there is more of a use case than what is in this ticket, please reopen and show examples / steps to reproduce.” Well, I did not reopen it, because I did not find this ticket when searching for older tickets, but the same user/dev debt did not close the ticket I opened. It seems I showed some valid use cases.

197.235.98.211 (talkcontribs)

Well, it seems more sensible to formulate it as "restore ability to remove redirects from search results". This was explicitly and deliberately removed for specific reasons.

The general problem with wikis is that they attempt to cater to two sometimes conflicting groups: pure readers and editors. The average reader wants the best results, and doesn't even know about the existence of redirects. An editor sometimes wants worse results because they want to address a specific problem.

There are several orders of magnitude more readers than editors, and that's likely the reason it was removed. There is no doubt that such filters have their uses, although the question is whether that justifies restoring the older functionality. Also, chances are that "debt" forgot about the older ticket; otherwise they would likely have reopened it and closed the new task as a duplicate.

Speravir (talkcontribs)

Fair enough.

Reply to "How to exclude redirects from search results?"