@John Broughton @Zoozaz1 @Sdkb -- you each raised questions about how the algorithm works and how to prioritize its three main inputs:
- Look at the Wikidata item for the article. If it has an image (P18), choose that image.
- Look at the Wikidata item for the article. If it has a Commons category associated (P373), choose an image from the category.
- Look at the articles about the same topic in other language Wikipedias. Choose a lead image from those articles.
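To make the three inputs concrete, here is a minimal sketch of how candidates could be gathered for one unillustrated article. It assumes the Wikidata and cross-wiki lookups have already happened; the `ArticleInputs` container, its field names, and the source labels are hypothetical names for illustration, not part of the actual algorithm. Taking the first file from the Commons category is also just a placeholder assumption, since how an image is picked from the category is not specified here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArticleInputs:
    """Hypothetical container for data already fetched for one article."""
    # Image set on the article's Wikidata item (P18), if any.
    p18_image: Optional[str] = None
    # Files found in the Commons category linked from the item (P373).
    commons_category_images: list[str] = field(default_factory=list)
    # Lead images of articles on the same topic in other language Wikipedias.
    other_language_lead_images: list[str] = field(default_factory=list)


def candidate_images(inputs: ArticleInputs) -> list[tuple[str, str]]:
    """Return (source, filename) pairs, one per input that yields an image.

    The list order follows the order of the bullets above; it is not a
    claim about which input is more reliable.
    """
    candidates = []
    if inputs.p18_image:
        candidates.append(("wikidata_p18", inputs.p18_image))
    if inputs.commons_category_images:
        # Assumption: take the first file in the category.
        candidates.append(("commons_category", inputs.commons_category_images[0]))
    if inputs.other_language_lead_images:
        # Assumption: take the first lead image found.
        candidates.append(("other_language", inputs.other_language_lead_images[0]))
    return candidates
```

Keeping the source label next to each filename means that later evaluation can be broken down by input.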
Ideally, we would be able to generate enough data on images from each input to calculate a rank ordering of the inputs, from best to worst. Imagine if people went through hundreds of potential matches and labeled them "Yes" or "No", and then we could see which inputs have the highest share of "Yes" answers. Perhaps volunteers might help us do something like that. Or further down the line, if the feature is in the wikis, we could look at revert rates for images from different inputs.
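As a rough sketch of that kind of evaluation, the snippet below ranks inputs by their share of "Yes" labels. The function name, the source labels, and the sample data are all hypothetical; the labels are invented purely to show the shape of the calculation and are not real results.

```python
from collections import defaultdict


def rank_inputs_by_yes_rate(labels):
    """Rank inputs by the fraction of matches labeled "Yes".

    `labels` is an iterable of (source, answer) pairs, where answer is
    "Yes" or "No". Returns (source, yes_rate) pairs, best input first.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for source, answer in labels:
        total[source] += 1
        if answer == "Yes":
            yes[source] += 1
    rates = {source: yes[source] / total[source] for source in total}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Invented labels, for illustration only.
sample_labels = [
    ("wikidata_p18", "Yes"), ("wikidata_p18", "Yes"), ("wikidata_p18", "No"),
    ("commons_category", "Yes"), ("commons_category", "No"), ("commons_category", "No"),
    ("other_language", "Yes"), ("other_language", "Yes"),
]

for source, rate in rank_inputs_by_yes_rate(sample_labels):
    print(f"{source}: {rate:.0%} Yes")
```

The same function would work for revert data if "reverted" were mapped to "No" and "kept" to "Yes", though with only a few labels per input the ordering would be noisy.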
But absent that quantitative approach, it's good to have some expectations. It sounds like all three of you expect the Commons category input to be the least reliable. That makes sense to me, and aligns with what we saw when we evaluated a couple hundred of these matches. Because many images can be in a Commons category, there can be plenty of images that are only peripherally relevant.
I agree that the other two inputs both seem pretty strong. Usually a Wikidata item only has one P18 image, and so I would imagine the person applying it to the Wikidata item believes that it is an appropriate image to illustrate the concept as a whole. Is that your expectation as well?
Images from articles in other languages usually seem to work well, but this input has some wrinkles. The main issue shows up when the article the image is drawn from is a lot more extensive than the unillustrated article. For example, in English Wikipedia, the article "Economy of North Rhine-Westphalia" has no image. It's a pretty short article. Its counterpart in German Wikipedia is a long article with lots of images, which makes sense because that region is in Germany. Which image would we recommend? One idea is to choose the first or "lead" image in the article, which in this case is a photo of a factory. Perhaps that would work as the single image in the English article, but only with a good caption explaining why it belongs in the article.
But overall, I do agree that the "images from articles in other languages" approach has the great benefit of another Wikipedian having consciously decided that the image is appropriate for a Wikipedia article on the same topic.