Topic on Talk:Citoid

Improving citation quality

Folly Mox (talkcontribs)

A few of us at en-wiki are about 55% through a massive cleanup project that resulted from the careless use of a user script that used Citoid to overwrite existing references with automatically generated ones. The current phase involves manually checking about 2400 diffs from the period January through April 2023. We haven't yet identified the full scope of the cleanup.

Through the course of this cleanup, we've determined that only a small percentage of sources yield Citoid references that need no improvement by human editors. As one of the involved editors with a more technical background (although decades ago), I'm trying to understand how the whole thing works, so we can improve the quality of references on en-wiki and avoid repeats of this sort of cleanup. I've personally invested probably over 100 hours in this over the past few weeks.

Folly Mox (talkcontribs)

From what I understand of the documentation available here, Citoid uses a fork of the Zotero translators. Is that correct? If so, how recently was it forked, and/or how often is it forked? Citoid/Determining if a URL has a translator in Zotero states: "In production for wikimedia, we've enabled three translators."

Mvolz (WMF) (talkcontribs)

Sorry, that's outdated. We recently reactivated the fork, but not to make changes to whether it's supported in translation-server or not. I've now fixed that page. (In the past we enabled 3 translators that Zotero didn't have enabled, not 3 total)

Folly Mox (talkcontribs)

The repository at GitHub shows a great number of JavaScript files, so many that I can't figure out how to get to the end of "A" in the alphabetical listing. How does this align with having "enabled three translators"?

Talk:Citoid/Determining if a URL has a translator in Zotero has a comment from 2015 stating: "Citoid uses its own HTML meta-data scraper as a fall-back when Zotero doesn't return any result." Is there any way to record / indicate this? Like a hidden comment before the closing ref tag along the lines of "citation created by generic translator", or a warning message to the editor along the lines of "automated citations to this site may contain errors, please double check"?

Citoid is a very powerful library, and during the course of my cleanup efforts I've dropped into the visual editor a couple times to make use of it (in cases where the reference had been generated from a URL where Citoid's behaviour is suboptimal, but which contained a DOI that could be used to create a complete citation). However, at en-wiki at least, there's a culture of trusting code to function perfectly in all cases where it doesn't generate any warnings or errors. Effecting cultural change is difficult, and creating references manually is time-consuming, so I'm exploring all avenues. I don't think my technical skills are high enough to start writing Zotero translators, and I'm not sure how to get Citoid to incorporate those translators in its dependencies.

Also, citations created from Google Books never include editors, misattributing their contribution as authorship, and I'm not sure if that's something that can only be addressed by improving the translator or if it's something that is going on within Citoid. Thanks in advance for your answers. Kindly,

Mvolz (WMF) (talkcontribs)

There's a "powered by Zotero" message in the citation picker if it's from Zotero, but Zotero also has generic translators now, so that is probably not super useful to you (historically it did not; a purely Citoid response is very rare now).

The GitHub repo you've found is a good place to look to see what's available. (I note the AWS tests haven't run in a while!) Note that some of the poor citation quality is going to come from JavaScript-loaded pages, so things that work well with Zotero the browser extension, which deals with the loaded page, won't necessarily give the same results after being scraped using translation-server.

Folly Mox (talkcontribs)

(I tried to post the above in one go, but kept tripping the abuse filter for "linkspam". I couldn't even get the third paragraph to post as a single comment. Maybe the filter settings are too strict?)

Whatamidoing (WMF) (talkcontribs)

I think the problem of mistaking a book's editors for its authors predates citoid and the visual editor. Part of the problem is that reality is complicated. If you look at page xix in https://books.google.com/books?id=bIIeBQAAQBAJ, you'll see that there is a main author, an editor, and a long list of people who wrote specific entries. The correct author's name depends on which bit you're actually citing, and you're supposed to notice the presence or absence of the author's initials at the end. In https://www.google.com/books/edition/The_Routledge_Encyclopedia_of_Mark_Twain/8BhUuxcKNPkC, however, Google correctly names the editors as being editors, and it would be nice if citoid/Zotero could figure that out.

Folly Mox (talkcontribs)

I definitely never expect automated referencing to identify chapter contributors. Sometimes the table of contents is not even available for preview. I've had some luck going directly to publishers for the info, but in a couple of cases I've had to leave the author attribution empty. Identifying editors as editors seems like pretty low-hanging fruit, by which I mean it's clearly stated at the bottom of the page, right by the publisher and ISBN information already being correctly scraped.

Folly Mox (talkcontribs)

Now that I think of it, something that's entirely within Citoid's remit would be, when creating a book citation, to use the "authorn-first" and "authorn-last" aliases instead of "firstn" and "lastn", since they'd be considerably easier to change into "editorn-" form, without needing to erase and retype the full parameter names as currently.

Pppery (talkcontribs)
Folly Mox (talkcontribs)

In the interim, the CS1 templates have been updated to support "editor-lastn", "author-firstn" etc forms, so this particular suggestion is no longer relevant. Citoid still appears to suffer from the same issues, although I understand the maintainer has been tied up working on compatibility with other Wikimedia projects. A few lists of regexes for dealing with common failure states would go a long way.

Folly Mox (talkcontribs)

Citoid knows if it's using a Zotero translator or not. Does it know which one? If it does, and citation templates were updated to hold an appropriate hidden parameter, could the translator in use be surfaced and passed to the template? That could facilitate identifying which translators are consistently inaccurate, which seems like a good first step in trying to improve them or track them for manual correction.

Mvolz (WMF) (talkcontribs)

Zotero reports translator use in its logs. Unfortunately, the logging is not compatible with our infrastructure, so we have it turned off. But if you run a version locally and try the URL, it will tell you in the console.

Folly Mox (talkcontribs)

I should also mention I was informed a few days ago at en:Module talk:Wd#References mapping that |website= (in citation templates) "should only get the domain name when the source is best known by that name". Citoid always chooses to fill this parameter, even when it can't discern a human readable website name and falls back on the first part of the URL (which is often). Apparently this behaviour is not desirable in general.
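The fallback described above could in principle be detected downstream. What follows is only a minimal sketch, not Citoid's actual logic: `website_is_bare_domain` is a hypothetical helper that reports whether a proposed |website= value is nothing more than the host name of the cited URL, in which case a script could leave the parameter empty or flag it for review.

```python
from urllib.parse import urlparse


def _strip_www(host: str) -> str:
    """Drop a leading 'www.' so 'www.example.com' and 'example.com' compare equal."""
    return host[4:] if host.startswith("www.") else host


def website_is_bare_domain(website: str, url: str) -> bool:
    """Return True if `website` is just the host name of `url`.

    Hypothetical helper for illustration; the comparison ignores case
    and a leading 'www.' on either side.
    """
    host = (urlparse(url).hostname or "").lower()
    value = website.strip().lower()
    return bool(host) and _strip_www(value) == _strip_www(host)
```

A value such as "The Guardian" would pass through unchanged, while "www.example.com" for a page on that same host would be flagged.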

Folly Mox (talkcontribs)

Does Citoid do any error checking on its values? It's not immediately clear where to find the source code, but it's pretty clear the user scripts downstream of Citoid don't double check it, so we get silly things like a perfectly formatted citation to a 404 page, or numeric data in an author name field. I understand that the parsing issues themselves stem from Zotero, but if basic error checking could be performed in-house, it could cut down on the number of bogus citations added by good faith editors not cautious enough to double check script output.
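As an illustration of the kind of basic check meant here, a minimal sketch follows. It is an assumption about what a downstream script could do, not a description of Citoid; `author_name_is_suspect` is a hypothetical name, and the rule (empty value or any digit) is deliberately crude.

```python
def author_name_is_suspect(name: str) -> bool:
    """Flag author strings that are empty or contain any digit.

    Hypothetical check for illustration only. It is meant to mark a
    citation for human review, not to reject it automatically.
    """
    stripped = name.strip()
    return not stripped or any(ch.isdigit() for ch in stripped)
```

A rule this simple will occasionally flag a legitimate name, so it suits a warning to the editor better than silent removal of the field.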

Mvolz (WMF) (talkcontribs)

There is various error checking; for instance, we check whether a website sends a 404 (page not found) status code. Unfortunately, websites don't always comply with HTTP standards and do silly things like report a 200 (OK) status code and then write "404" in the page text.

Folly Mox (talkcontribs)

Well, I guess it's fair that websites should probably follow standards and return 404 codes for their 404 pages, but since many of them don't, do you think it would be possible to check for "404", "page not found", "page does not exist", "we're sorry" etc. in the |title= parameter? An ounce of prevention saving a pound of cure, and all that.

I'm minded to return to this subtopic specifically because I thought another title nearly universally indicating a failed reference is "is for sale", which is what typically shows up when a site has been usurped by a domain squatter.
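The title check proposed above can be sketched in a few lines. This is a hedged illustration, not Citoid's behaviour: the phrase list is taken from this comment, and `title_looks_like_error_page` is a hypothetical helper name.

```python
import re

# Phrases suggested in the discussion above; the list is illustrative,
# not exhaustive, and would need tuning against real failures.
SUSPECT_TITLE_PATTERNS = [
    r"\b404\b",
    r"page not found",
    r"page does not exist",
    r"we['’]?re sorry",
    r"is for sale",
]
_SUSPECT_RE = re.compile("|".join(SUSPECT_TITLE_PATTERNS), re.IGNORECASE)


def title_looks_like_error_page(title: str) -> bool:
    """Return True if a scraped title matches a known failure phrase."""
    return bool(_SUSPECT_RE.search(title or ""))
```

Because a genuine title can contain "404" or "is for sale", a match is better treated as a prompt for the editor to double check than as proof that the reference failed.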
