Topic on Help talk:CirrusSearch

Easy way to identify articles with/without images?

6
Astinson (WMF) (talkcontribs)

So one of the larger theories about reader experience is that illustrated content is more catchy and engaging -- and that we will want to connect content from Commons with those potential articles.

Moreover, when community groups want to organize events like meta:VisibleWikiWomen or meta:Wikipedia_Pages_Wanting_Photos, it would be super useful to be able to identify which articles don't yet have Commons media on them.

Is there a way to surface whether or not articles have image in search? Right now the closest thing I can find to this, is whether or not a page has a pageimage indexed in https://en.wikipedia.org/wiki/Mother_Teresa?action=info . Magnus's petscan surfaces that element: but it's not reliable -- sometimes a page will have an image, but not a high quality one -- or it will be a logo or something that doesn't meet whatever the criteria is being used for that filter.

@DTankersley (WMF) @DCausse (WMF) & @EBernhardson (WMF) -- would love your thoughts.

TJones (WMF) (talkcontribs)

I don't think there is currently a good tool for this. You can do something with insource: regular expressions, but regexes can be very expensive queries and they aren't necessarily scalable. (We only allow so many regex queries at once, and if you have no other search terms to narrow the scope of the regex, it will always return incomplete results on large wikis because it times out.)

Here's a fairly generic regex that finds File: links with image suffixes (you may want to add other suffixes):

insource:/\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]/

So, this query on enwiki currently returns about 100K results, but it times out, so the list is not complete.

The negation ( -insource:/\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]/ ) returns 440K documents, and also timed out.

However, if you can limit your search to a particular category or title match or even fairly rare keyword, it should complete. For example: deepcat:"Film stubs" -insource:/\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]/ finishes and gives 632 results (deepcat:"Film stubs" only gets 641 results, so it is easy for the regular expression to run over that limited set).

Note that insource: looks at the actual source of the page, so images included by templates, transclusion, etc, will not be detected.

So, as a once-in-a-while query or set of queries to generate lists for an editathon or other event, this would work. As a widely deployed user-facing tool, it probably would not—though maybe if there are always focused additional search terms.

If you are open to non-search approaches, you could also look at the dumps and write a tool to scan the latest dump for articles without images. It wouldn't be up-to-the-minute, but you could process 100% of a wiki if you wanted to, which would never be possible with insource: searches on larger wikis.

TJones (WMF) (talkcontribs)

As @DCausse (WMF) pointed out, not every wiki uses File: so another regex may work, or may not. Infoboxes and templates may have other syntaxes. I suppose just looking for things that look like image file names might work, with a few false positives where an article discusses images without actually having one—which seems rare. Parsing dumps sounds better and better.

Astinson (WMF) (talkcontribs)

@TJones (WMF) that is a very interesting solution, but regex would not be what I would want to provide to provide to organizers for regular use.

Also, just tried your query with a small set and get one image almost immediately: https://en.wikipedia.org/wiki/Lyra_McKee ). I have tried multiple examples, and its seems to be retrieving a not-insignificant number of false positives. I tried something else without regex and it seems to produce a better result. That solves my short term question.

In the long term, I would think a filter like this would be super useful in the search interface itself. I think the challenge with dumps, is that you create a huge barrier to novel use cases by folks who are wiki literate, but not necessarily technically literate. There are some tools that kindof do this kind of search live (i.e. FIST: https://tools.wmflabs.org/fist -- but that tool is kindof overwhelming, and breaks from the typical workflows (i.e. leveraging Petscan for categories because deepcat seems to break everytime I use it (as too many categories)). But search makes a lot more sense in a tool like petscan (or any other end-user tool). Theoretically you would want to be able to share a tool links that generates a query and share it around with others, and then the updates are going to be consistent from search.

Astinson (WMF) (talkcontribs)
197.218.85.218 (talkcontribs)

It seems like the most sensible place to add this functionality would be as a way to fetch page properties (https://phabricator.wikimedia.org/T200860). Of course there would need to be a generic property added to a page that contains any image at all, e.g. "hasimages" property. Currently it only gets added when an image fulfills certain criteria. Then it would work in a similar manner to Special:PagesWithProp .

Anyway, just knowing whether a page has an image can probably be done by Wikidata Query Service. For instance in a sparql query (https://w.wiki/MrK). That example simply searches for cats within the category:cats on English wikipedia.

I just cobbled that together using a few sparql examples. People more proficient with this would probably be able to make an image appear there, and make it possible to filter out pages without images. It would also be possible to generate those using templates for various use cases.


Reply to "Easy way to identify articles with/without images?"