Topic on Extension talk:CirrusSearch

Suggestion: Support searching for external links

197.218.80.203 (talkcontribs)

Problem

As a reader, I want to find articles that contain a specific link (e.g. a news story, a hoax or an untrustworthy site) to verify its validity.

As an editor, I want to find articles that mention a specific link and some keyword, to eliminate spam, certain vandalism or hoaxes.

Background

Currently, CirrusSearch allows searching for internal links, but not for external links. This means that one has to use a page such as Special:LinkSearch, or a complicated regex with "insource" that may not always find the link, because links can be constructed by templates in hard-to-find ways, e.g. "{{{mainsite}}}.com/{{stringsub}}".

Proposed solution

A new search "keyword" or predicate that indexes external links, e.g.:

banana cures aids extlinksto:/*.hoaxysite.com/
-extlinksto:/*.hoaxysite.com/
197.218.80.203 (talkcontribs)

LinkSearch also doesn't resolve internationalized domain names (https://phabricator.wikimedia.org/T130482), so while these should be equivalent, they aren't:

Form 1            Form 2      Form 3
xn--bcher-kva.ch  buecher.de  Bücher.de
this http://www.sina.com.hk/news/article/20170330/5/45/49/各款時尚耳機送給愛音樂又好動的他-7164476.html

That makes things way worse.
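
To illustrate the kind of equivalence being asked for, here is a minimal Python sketch, assuming links are compared by the ASCII (Punycode) form of their host. The helper name `normalize_host` is hypothetical, and Python's built-in `idna` codec follows the older IDNA 2003 rules, so this is not necessarily how LinkSearch or CirrusSearch would handle it.

```python
def normalize_host(host: str) -> str:
    """Return the lowercase ASCII (Punycode) form of a host name.

    Hypothetical helper, for illustration only.
    """
    # The built-in "idna" codec applies nameprep (including case folding)
    # and Punycode-encodes every label that contains non-ASCII characters.
    return host.encode("idna").decode("ascii").lower()


# The Unicode and Punycode spellings of one host map to the same key.
print(normalize_host("Bücher.de"))
print(normalize_host("xn--bcher-kva.de"))
```

If both the indexed link and the query were normalized this way, the Unicode and Punycode spellings of a host would end up matching each other.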

EBernhardson (WMF) (talkcontribs)

This isn't impossible, but it would take a little time to get going. Basically, we already have the external links in the search index, but they are not processed in a way that is useful for this type of search. If you had to guess, what is the relative usefulness of searching tokenized full URLs versus, say, a suffix search on domains?

By tokenized URLs I mean we would break up http://www.sina.com.hk/news/article/20170330/5/45/49/各款時尚耳機送給愛音樂又好動的他-7164476.html into [www, sina, com, hk, news, article, 20170330, 5, 45, 49, 各款, 時尚, 耳機, 送給, 愛音樂, 又, 好動, 的, 他, 7164476, html] and allow matching individual pieces of the URL.
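
A rough Python sketch of that kind of tokenization, assuming a plain split on non-word characters. The function name `tokenize_url` is made up for illustration, and this is not the analyzer the search index uses: in particular, a run of Chinese text stays one token here, whereas the list above assumes a language-aware segmenter.

```python
import re
from urllib.parse import urlsplit


def tokenize_url(url: str) -> list[str]:
    """Split a URL's host and path into lowercase word-like tokens.

    Hypothetical helper; the scheme is dropped, as in the list above.
    """
    parts = urlsplit(url)
    text = f"{parts.hostname or ''} {parts.path}"
    # \w+ matches runs of letters, digits and underscores (Unicode-aware),
    # so dots, slashes and hyphens all act as separators.
    return [token.lower() for token in re.findall(r"\w+", text)]


print(tokenize_url("http://unesdoc.unesco.org/images/0023/002325/232555e.pdf"))
```

A query could then match any subset of those tokens, for example the host pieces plus a file name.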

CKoerner (WMF) (talkcontribs)
197.218.90.120 (talkcontribs)

> If you had to guess, what is the relative usefulness of searching tokenized full URLs versus, say, a suffix search on domains?

Tokens are likely to be far more sensible and useful, and they have use cases beyond simple validity checking or vandalism fighting. For example, academics could use them to find links to specific resources on domains registered to one country. Right now one has to figure out the exact syntax of LinkSearch, and it isn't all that intuitive.

However, it would really depend on the syntax implemented, which should support at least wildcards if regex is not feasible because of performance or other technical issues.

> Thank you for the suggestion. I've created a task to track the request.

You're welcome.

197.218.90.120 (talkcontribs)

One other concrete use case that I recently saw:

https://en.wikipedia.org/w/index.php?title=Wikipedia%3AVillage_pump_%28technical%29&type=revision&diff=772883257&oldid=772883163

"Free-content attribution" insource:http://unesdoc.unesco.org/images/0023/002325/232555e.pdf or http://unesdoc.unesco.org/images/0024/002446/244676e.pdf

This would have been trivial, e.g.: "extlinksto:unesdoc.unesco.org*232555e.pdf" OR "extlinksto:unesdoc.unesco.org*244676e.pdf"

Or using regex magic:

unesdoc\.unesco\.org.*(232555e|244676e)\.pdf
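
As a quick sanity check of that pattern, here is a small Python sketch that runs it over a list of links. The two UNESCO URLs are the ones quoted above; the third is a made-up link added only to show a non-match, and a real search would of course run against the index rather than a Python list.

```python
import re

# The pattern from above, compiled as-is.
pattern = re.compile(r"unesdoc\.unesco\.org.*(232555e|244676e)\.pdf")

links = [
    "http://unesdoc.unesco.org/images/0023/002325/232555e.pdf",
    "http://unesdoc.unesco.org/images/0024/002446/244676e.pdf",
    "http://example.org/images/0023/002325/232555e.pdf",  # hypothetical non-match
]

# Keep only the links in which the pattern occurs somewhere.
matches = [link for link in links if pattern.search(link)]
print(matches)
```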

It is possible to get those using insource, but it requires a lot of hoop jumping, and there is no guarantee that the link isn't built by a template somewhere or combined differently.

The greatest benefit would be the ability to select specific namespaces. LinkSearch is just everything jumbled together, presumably due to performance concerns.
