Talk:Citoid/2023
This page used the Structured Discussions extension to give structured discussions. It has since been converted to wikitext, so the content and history here are only an approximation of what was actually displayed at the time these comments were made.
Previous archives are at /Archive 1
Sometimes the visual editor citation tool changes tags from <ref> to <references>
It seems to happen most frequently when a large number of citations are added in a single edit.
This behaviour is undesired and confusing due to <references> generally being used for the reflist. Nathanielcwm (talk) 14:04, 8 June 2023 (UTC)
- Could you give me a diff in which this happened? Whatamidoing (WMF) (talk) 16:22, 13 June 2023 (UTC)
- @ESanders (WMF), I wonder if this is a recent problem. Whatamidoing (WMF) (talk) 16:22, 13 June 2023 (UTC)
- [1] Nathanielcwm (talk) 05:47, 14 June 2023 (UTC)
- I was just able to reproduce it with this edit https://en.wikipedia.org/w/index.php?title=User:Nathanielcwm/sandbox/referencestest&diff=prev&oldid=1160061664
- To reproduce it I did the following:
- Copied the contents of an old revision of the Vuze page into my sandbox
- Used the citation tool to automatically generate a citation using https://torrentfreak.com/former-vuze-developers-launch-biglybt-a-new-open-source-torrent-client-170803/
- Observed that after the citation was added the text of the citation was left highlighted in my editor
- Used the citation tool again to generate a citation for https://www.neowin.net/news/former-vuze-developers-launch-new-open-source-torrent-client-without-any-ads-or-bloat/
- Observed that the citation tool wrapped the second citation in a <references> tag
- This was done under Windows 11 and Firefox 114.0.1 with the modern Vector skin
- I have the following non default gadgets enabled:
- Focus the cursor in the search bar on loading the Main Page
- Twinkle
- Prosesize
- find-archived-section
- Display pages on your watchlist that have changed since your last visit in bold
- HotCat
- ProveIt (not used for this edit)
- MoreMenu
- Replace the "new section" tab text with "+"
- Change UTC-based times and dates, such as those used in signatures, to be relative to local time
- Display an assessment of an article's quality in its page header
- Dark mode toggle
- Display links to disambiguation pages in orange
- Strike out usernames that have been blocked
- XTools
- Make headers of tables display as long as the table is in view, i.e. "sticky"
- Dark mode styling
- And the following default gadgets disabled:
- Reference Tooltips
- refToolbar
- and https://en.wikipedia.org/wiki/User:Nathanielcwm/common.js Nathanielcwm (talk) 05:54, 14 June 2023 (UTC)
Improving citation quality
A few of us at en-wiki are about 55% through a massive cleanup project that resulted from the careless use of a user script that used Citoid to overwrite existing references with automatically generated ones. The current phase involves manually checking about 2400 diffs from the period January through April 2023. We haven't yet identified the full scope of the cleanup.
Through the course of this cleanup, we've determined that Citoid's references are usable without improvement by human editors for only a small percentage of sources. As one of the involved editors with a more technical background (although decades ago), I'm trying to understand how the whole thing works, so we can improve the quality of references on en-wiki and avoid repeats of this sort of cleanup. I've personally invested probably over 100 hours in this over the past few weeks. Folly Mox (talk) 08:49, 11 June 2023 (UTC)
- What I'm understanding from the documentation available here, Citoid uses a fork of the Zotero translators. Is that correct? If so, how recently was it forked, and/or how often is it forked? Citoid/Determining if a URL has a translator in Zotero states "In production for wikimedia, we've enabled three translators." Folly Mox (talk) 08:51, 11 June 2023 (UTC)
- Sorry, that's outdated. We recently reactivated the fork, but not to make changes to whether it's supported in translation-server or not. I've now fixed that page. (In the past we enabled 3 translators that Zotero didn't have enabled, not 3 total) Mvolz (WMF) (talk) 18:41, 11 June 2023 (UTC)
- The repository at GitHub shows a great number of javascript files, such that I can't figure out how to get to the end of "A" in the alphabetical listing. How does this align with having "enabled three translators"?
- Talk:Citoid/Determining if a URL has a translator in Zotero has a comment from 2015 stating "Citoid uses its own HTML meta-data scraper as a fall-back when Zotero doesn't return any result." Is there any way to record / indicate this? Like a hidden comment before the closing ref tag along the lines of "citation created by generic translator", or a warning message to the editor along the lines of "automated citations to this site may contain errors, please double check"?
- Citoid is a very powerful library, and during the course of my cleanup efforts I've dropped into the visual editor a couple times to make use of it (in cases where the reference had been generated from a URL where Citoid's behaviour is suboptimal, but which contained a DOI that could be used to create a complete citation). However, at en-wiki at least, there's a culture of trusting code to function perfectly in all cases where it doesn't generate any warnings or errors. Effecting cultural change is difficult, and creating references manually is time-consuming, so I'm exploring all avenues. I don't think my technical skills are high enough to start writing Zotero translators, and I'm not sure how to get Citoid to incorporate those translators in its dependencies.
- Also, citations created from Google Books never include editors, misattributing their contribution as authorship, and I'm not sure if that's something that can only be addressed by improving the translator or if it's something that is going on within Citoid. Thanks in advance for your answers. Kindly, Folly Mox (talk) 08:51, 11 June 2023 (UTC)
- There's a "powered by Zotero" message in the citation picker if it's from Zotero, but Zotero also has generic translators now, so that is probably not super useful to you (historically it did not; it's very rare to have a purely Citoid response now).
- The github repo you've found is a good place to look to see what's available (I note the aws tests haven't run in a while!) Note some of the poor citation quality is going to be from javascript loaded pages, so things that work well with Zotero the browser extension, which deals with the loaded page, won't necessarily give the same results after being scraped using translation-server. Mvolz (WMF) (talk) 18:44, 11 June 2023 (UTC)
- (I tried to post the above in one go, but kept tripping the abuse filter for "linkspam". I couldn't even get the third paragraph to post as a single comment. Maybe the filter settings are too strict?) Folly Mox (talk) 08:54, 11 June 2023 (UTC)
- I think the problem of mistaking a book's editors for its authors predates citoid and the visual editor. Part of the problem is that reality is complicated. If you look at page xix in https://books.google.com/books?id=bIIeBQAAQBAJ, you'll see that there is a main author, an editor, and a long list of people who wrote specific entries. The correct author's name depends on which bit you're actually citing, and you're supposed to notice the presence or absence of the author's initials at the end. In https://www.google.com/books/edition/The_Routledge_Encyclopedia_of_Mark_Twain/8BhUuxcKNPkC, however, Google correctly names the editors as being editors, and it would be nice if citoid/Zotero could figure that out. Whatamidoing (WMF) (talk) 16:32, 13 June 2023 (UTC)
- I definitely never expect automated referencing to identify chapter contributors. Sometimes the table of contents is not even available for preview. I've had some luck going directly to publishers for the info, but in a couple cases I've had to leave the author attribution empty. Identifying editors as editors seems like pretty low hanging fruit, by which I mean it's clearly stated at the bottom of the page, right by the publisher and isbn information already being correctly scraped. Folly Mox (talk) 16:54, 13 June 2023 (UTC)
- Now that I think of it, something that's entirely within Citoid's remit would be, when creating a book citation, to use the "authorn-first" and "authorn-last" aliases instead of "firstn" and "lastn", since they'd be considerably easier to change into "editorn-" form, without needing to erase and retype the full parameter names as currently. Folly Mox (talk) 19:14, 13 June 2023 (UTC)
- The specific names of the parameters to use are controlled on-wiki via the "maps" parameter in the TemplateData of the relevant citation template, i.e. https://en.wikipedia.org/wiki/Template:Cite_book/TemplateData. I'm not convinced this is a good idea, but it is easy to implement. * Pppery * it has begun 02:29, 23 December 2023 (UTC)
- In the interim, the CS1 templates have been updated to support "editor-lastn", "author-firstn" etc forms, so this particular suggestion is no longer relevant. Citoid still appears to suffer from the same issues, although I understand the maintainer has been tied up working on compatibility with other Wikimedia projects. A few lists of regexes for dealing with common failure states would go a long way. Folly Mox (talk) 22:09, 23 December 2023 (UTC)
- Citoid knows if it's using a Zotero translator or not. Does it know which one? If it does, and citation templates were updated to hold an appropriate hidden parameter, could the translator in use be surfaced and passed to the template? That could facilitate identifying which translators are consistently inaccurate, which seems like a good first step in trying to improve them or track them for manual correction. Folly Mox (talk) 16:57, 13 June 2023 (UTC)
- Zotero reports translator use in its logs; unfortunately, the logging is not compatible with our infrastructure, so we have those turned off. But if you run a version locally and try the url, it will tell you in the console. Mvolz (WMF) (talk) 11:35, 3 July 2023 (UTC)
- I should also mention I was informed a few days ago at en:Module talk:Wd#References mapping that |website= (in citation templates) "should only get the domain name when the source is best known by that name". Citoid always chooses to fill this parameter, even when it can't discern a human readable website name and falls back on the first part of the URL (which is often). Apparently this behaviour is not desirable in general. Folly Mox (talk) 17:09, 13 June 2023 (UTC)
- I've written a projectspace page about this at en:Wikipedia:WikiProject Citation cleanup/Repairing algorithmically generated citations. Corrections and additions are welcome. I really don't want to misinform anyone, and I have an incomplete understanding of the architecture and stack. Folly Mox (talk) 19:20, 15 June 2023 (UTC)
- Does Citoid do any error checking on its values? It's not immediately clear where to find the source code, but it's pretty clear the user scripts downstream of Citoid don't double check it, so we get silly things like a perfectly formatted citation to a 404 page, or numeric data in an author name field. I understand that the parsing issues themselves stem from Zotero, but if basic error checking could be performed in-house, it could cut down on the number of bogus citations added by good faith editors not cautious enough to double check script output. Folly Mox (talk) 16:22, 19 June 2023 (UTC)
- There is various error checking; for instance, we check if a website sends a 404 page not found status code. Unfortunately, websites don't always comply with web standards and do silly things like return a 200 OK status code and then write 404 in the page text. Mvolz (WMF) (talk) 11:28, 3 July 2023 (UTC)
- Well, I guess it's fair that websites should probably follow standards and return 404 codes for their 404 pages, but since many of them don't, do you think it would be possible to check for "404", "page not found", "page does not exist", "we're sorry" etc. in the |title= parameter? An ounce of prevention saving a pound of cure, and all that.
- I'm minded to return to this subtopic specifically because I thought another title nearly universally indicating a failed reference is "is for sale", which is what typically shows up when a site has been usurped by a domain squatter. Folly Mox (talk) 18:46, 21 July 2023 (UTC)
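The title check suggested above could be sketched roughly as follows. This is a minimal illustration in Python, not Citoid code: the function name `title_looks_like_error_page` and the phrase list are hypothetical, and the phrases are simply the examples mentioned in this thread.

```python
import re

# Hypothetical phrase list built from the examples in this thread.
# It is an assumption for illustration, not something Citoid ships.
SUSPECT_TITLE_PATTERNS = [
    r"\b404\b",
    r"page not found",
    r"page does not exist",
    r"we['’]?re sorry",
    r"is for sale",
]
_SUSPECT_RE = re.compile("|".join(SUSPECT_TITLE_PATTERNS), re.IGNORECASE)


def title_looks_like_error_page(title: str) -> bool:
    """Return True if the title contains a phrase typical of a soft 404 or a parked domain."""
    return bool(_SUSPECT_RE.search(title))


if __name__ == "__main__":
    for sample in ["404 - Page Not Found", "example.com is for sale", "An ordinary article title"]:
        print(sample, "->", title_looks_like_error_page(sample))
```

A check like this can only flag a citation for human review. A legitimate title that happens to contain "404" would also match, so silently discarding the result would be too aggressive.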
Populating Cite_report for relevant reference types
Currently the visual editor's Citoid formats reports using cite_journal. It'd be highly useful to wrap them instead in cite_report (especially when a Wikidata QID is provided, since such Wikidata items will state that instance of (P31) = report). It'd also be ideal in those cases to include location data, since the location of the publisher / commissioning organisation / authoring organisation is often highly relevant (indeed usually more relevant than a book's publisher's city!). Either drawing from the country (P17) of the publisher (P123) or maybe the location (P276) of the cited item itself?
See here for an example where cite_report would be helpful in formatting. T.Shafee(Evo﹠Evo)talk 06:08, 27 June 2023 (UTC)
- Hello, this is configurable.
- See: Citoid/Enabling_Citoid_on_your_wiki#Step_2:_Configure_Citoid
- To change this you need to add template data to Cite report, and then change report -> Cite report in the config (wiki:en:MediaWiki:Citoid-template-type-map.json). Mvolz (WMF) (talk) 11:20, 3 July 2023 (UTC)
- @Mvolz (WMF) Thanks! I think it already had the necessary templatedata, so I've put in an edit request at MediaWiki_talk:Citoid-template-type-map.json. T.Shafee(Evo﹠Evo)talk 04:57, 17 July 2023 (UTC)
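For illustration, the "report -> Cite report" change described above might look like the following Python sketch. It assumes the config page is a flat JSON object mapping Citoid item types to template names. The three entries shown are an illustrative excerpt, not a copy of the live page, and `remap_type` is a hypothetical helper.

```python
import json

# Illustrative excerpt only (assumption): the real config page has more
# entries, and its current values may differ from these.
TYPE_MAP_JSON = """
{
    "book": "Cite book",
    "journalArticle": "Cite journal",
    "report": "Cite journal"
}
"""


def remap_type(type_map: dict, item_type: str, template: str) -> dict:
    """Return a copy of the type map with one item type pointed at another template."""
    updated = dict(type_map)
    updated[item_type] = template
    return updated


if __name__ == "__main__":
    type_map = json.loads(TYPE_MAP_JSON)
    new_map = remap_type(type_map, "report", "Cite report")
    print(json.dumps(new_map, indent=4))
```

On the wiki itself this is a one-line edit to the JSON page. The sketch only shows which key changes and that the other mappings stay as they are.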
Jstor citations
On English Wikipedia the {{Cite journal}} template has a jstor parameter. Can Citoid be changed to extract the relevant stable link for the Jstor URL instead of copying the provided URL into the URL field? Ifly6 (talk) 03:14, 4 July 2023 (UTC)
- If given a JSTOR link, it gives the stable JSTOR url.
- For most other links, it doesn't typically know the JSTOR identifier, so it can't use that to then get the JSTOR link. Most links to journal articles, if they include extra identifiers will include the DOI, but not typically JSTOR. Mvolz (WMF) (talk) 13:02, 5 July 2023 (UTC)
- If I wasn't clear, I'm discussing how it treats Jstor input only: eg input like https://www.jstor.org/stable/45019299. Ifly6 (talk) 13:43, 5 July 2023 (UTC)
- Ah, I misinterpreted you - you want the jstor url to go into the jstor field instead of into the url field?
- That's a little tricky. We could definitely return a JSTOR parameter in the api; the problem is that TemplateData and the Citoid extension only do really basic mapping, so then the jstor link would end up in the url as well and so it'd be linked in both. In the API we return a url no matter what because it's a required parameter (api guarantees return of a url in the url field) and for other language wikis that don't have separate parameters, they need it. We've had this issue as well with people not liking that we return both the doi and the resolved doi link in the url field, though personally it doesn't bother me.
- That kind of per-wiki customisation might have to be per-wiki user script / common.js kind of solution rather than something that goes in the back-end or the extension, which is designed to be fairly agnostic about the citation templates being used. Mvolz (WMF) (talk) 11:36, 17 July 2023 (UTC)
- Yea, it would pretty much be just changing where the JSTOR parameter ends up. On English Wikipedia there are basically three interrelated issues.
- First, the citation bot adds Jstor parameters given the stable Jstor URL but this causes unnecessary duplication... which some people want retained just in case someone meant the URL to be there (even though nobody means much of anything when using these citation generators).
- Second, the Internet Archive bot then can be run to "archive" the live Jstor URLs (but not the parameters) because the URL is there... even though, because Jstor is paywalled, the "archive" is just a landing page. Naturally some people don't want these useless archive links removed either.
- Third, Jstor, because it's paywalled, isn't always the best free full-text source, and putting a URL there would on first glance seem misleading.
- Anyway, I understand the technical issues involved, though I think the real solution in this instance is the root cause, which is the unthinking addition of Jstor URLs to templates that end up triggering all of the downstream clutter. A user script would have insufficient adoption to go much of anywhere in nipping the issue. Ifly6 (talk) 14:45, 17 July 2023 (UTC)
- Just noting here that it's been my practice to remove url parameters when they point to jstor, and put the stable jstor identifier in the jstor parameter instead, to avoid the unnecessary archive and access-date cruft that follow-on scripts produce. I understand if it's not possible not to return a url parameter though.
- en-wp's own in-house tools could be a vector for correction here, although the maintainers have been too busy to maintain them for a long time. Honestly given how popular automated referencing has become, we could use about four times as much staffing at every point in the stack. Folly Mox (talk) 20:56, 20 July 2023 (UTC)
- Yea, when I was reading that Village Pump discussion about people claiming that an editor might have placed the Jstor URL there on purpose, my first thought was "lmao nobody formats citations manually anymore; there's no purpose involved". Ifly6 (talk) 21:11, 20 July 2023 (UTC)
- I missed that discussion, but there's no reason to duplicate a link (to jstor content) in the cruft-inducing url field when it can be safely tucked into the parameter specifically included to hold it.
- These days if I'm citing a journal article, I'll usually swap into Visual Editor to generate the citation with Citoid, but I swap back into source editing to touch it up afterwards.
- I do find it worrisome how widespread automated referencing has become when weighed against the accuracy of its output. I spend probably eighty per cent of my time on wiki cleaning up after thoughtless automatic references, but even with a team of fifteen or twenty the references would be flowing in faster than we could handle them, given the huge backlog currently present. Folly Mox (talk) 21:34, 20 July 2023 (UTC)
- I agree, which is why I was thinking to get to (at least) one of the sources of those automatic reference generators. Is it possible, Mvolz, to add some kind of post-processing to trigger with Jstor? Or is that actually technically infeasible? Ifly6 (talk) 02:59, 21 July 2023 (UTC)
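The kind of post-processing asked about in this thread could be sketched as below, for instance in a per-wiki user script or bot. It is a minimal Python illustration under stated assumptions: citation parameters are modelled as a plain dict, `move_jstor_url_to_param` is a hypothetical helper, and nothing here is part of Citoid or its API.

```python
import re
from typing import Dict

# Matches stable JSTOR links such as https://www.jstor.org/stable/45019299
# and captures everything after /stable/ up to a query string or fragment.
_JSTOR_STABLE_RE = re.compile(
    r"^https?://(?:www\.)?jstor\.org/stable/([^?#\s]+)", re.IGNORECASE
)


def move_jstor_url_to_param(params: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the citation parameters with a stable JSTOR URL moved into |jstor=.

    Citations whose |url= is not a stable JSTOR link are returned unchanged.
    """
    result = dict(params)
    match = _JSTOR_STABLE_RE.match(result.get("url", ""))
    if match is None:
        return result
    stable_id = match.group(1).rstrip("/")
    if not stable_id:
        return result
    # Keep an existing |jstor= value if one is already present.
    result.setdefault("jstor", stable_id)
    del result["url"]
    # Assumption: with no |url= left, |access-date= has nothing to refer to.
    result.pop("access-date", None)
    return result


if __name__ == "__main__":
    citation = {
        "title": "Example article",
        "url": "https://www.jstor.org/stable/45019299",
        "access-date": "2023-07-05",
    }
    print(move_jstor_url_to_param(citation))
```

Dropping |url= this way is only appropriate on wikis whose templates have a dedicated jstor parameter, which is why the replies above suggest handling it per wiki rather than in the back-end.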
A few questions
1. Would you be at all willing to maintain a brief list of known paywalled sources such that Citoid can apply a "url-access" parameter to citations to such domains? I'm thinking places like nytimes.com, ft.com, forbes.com, stltoday.com, latimes.com, etc. At present url-access always needs to be added manually, usually after a failed attempt to verify a claim.
2. I've noticed that citations to The Guardian consistently render the website / work parameter as "the Guardian". Would you be willing to uppercase the first letter in the website / work parameter for all sources that don't equal the first bit of the domain name? There may be sources that prefer a different case styling, but it looks weird in the rendered template. Alternatively, could you uppercase the first letter of the website / work parameter when the first word is "the"?
3. An astute unregistered editor noticed at en:Help talk:Citation Style 1#Unix epoch that many sources using the date "1970-01-01" (the unix epoch) are doing so in error. Would you be willing to discard this date as bogus for sources that are not books, journals, or periodicals?
4. Is this a good place to discuss improvements to Citoid, or would Phabricator work better? I've recently registered an account there. Folly Mox (talk) 20:49, 20 July 2023 (UTC)
- 3. So I realised I'm dumb, and web sources should not report a date prior to c. 1995 in any case. So the unix epoch should probably just be discarded regardless of source type. Folly Mox (talk) 18:42, 21 July 2023 (UTC)
- 3. The CS1 templates will all reject an |access-date= before Wikipedia's inception regardless of type (see this discussion), so we must be talking about publication date. The bigger problem is, unlike a physical-media citation type, if a website shows a timestamp of 1970-01-01 on some page (which I cannot prove, but believe with near-certainty, happens somewhere in the wild), then that's the only date we have for that source. IOW, it's arguably "correct" to use it in the citation, despite its obvious impossibility. FeRDNYC (talk) 12:01, 22 July 2023 (UTC)
- Yeah I am talking about |date=, not access-date=
- The Zotero translators seem to lean pretty heavily into HTML metadata, so it's possible the hypertext document could have a date listed as the unix epoch, with an actual publication date somewhere in the byline or footer, but the more common scenario is probably like this one I fixed yesterday at en:Yuan Dynasty: https://www.academia.edu/2439642
- Here, the service hosting the source (academia) reports a bogus unix epoch date, which any parser will pick up, but inspecting the actual source document reveals a publication date in 2010. Folly Mox (talk) 17:15, 22 July 2023 (UTC)
- I'd say that if a genuine web based source has the only available publication date set prior to the deployment of the world wide web in the early 1990s, it's safest to ignore the date rather than use a known incorrect value.
- The nice thing about book and journal sources is that they'll have more than one service documenting their existence, so if one site is erroneously reporting a unix epoch date for the source, it can be cross-checked and corrected. Folly Mox (talk) 18:32, 22 July 2023 (UTC)
- I don't think you would be happy with that. Google Books returns publication dates, such as |date=1982 for https://www.google.com/books/edition/Chocolate_the_Consuming_Passion/egLRDF36ayoC, rather than webpage dates. PubMed and doi entries also get their proper publication dates. Whatamidoing (WMF) (talk) 00:02, 23 August 2023 (UTC)
- Rereading my comment now a month later, I definitely wasn't clear about what constitutes "a genuine web based source". I miscommunicated similarly in a completely different discussion about overlinking, also by employing the term "genuine" as if I hadn't put a lot of assumptions behind it. Probably time to choose my words more carefully.
- In any case, as regards the topic I was initially trying to discuss, the unix epoch date "1970-01-01", it makes more sense to have citation templates add it to a tracking category rather than never return a date from Citoid, purely for visibility reasons. It's easy (although time-consuming) to run through a maintenance category full of likely bad data and fix it; it's much more difficult to find every citation without a publication date and ensure there actually is none provided. The second set is probably three or four orders of magnitude larger than the first, so my initial idea was probably uh ill-considered 🙃 Folly Mox (talk) 19:48, 24 August 2023 (UTC)
- That sounds like a good idea to suggest at w:en:Help talk:Citation Style 1. Whatamidoing (WMF) (talk) 23:48, 28 August 2023 (UTC)
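The flag-rather-than-discard approach this thread settles on could be sketched as follows. It is a minimal Python illustration: `epoch_date_is_suspect` is a hypothetical helper, and the set of item types treated as plausibly dated 1970 is an assumption based on the "books, journals, or periodicals" wording in question 3 above.

```python
from typing import Optional

UNIX_EPOCH_DATE = "1970-01-01"

# Assumption: item types for which a 1 January 1970 publication date is at
# least plausible, following the "books, journals, or periodicals" wording.
TYPES_WITH_PLAUSIBLE_1970_DATES = frozenset(
    {"book", "journalArticle", "magazineArticle", "newspaperArticle"}
)


def epoch_date_is_suspect(item_type: str, date: Optional[str]) -> bool:
    """Return True when the date is the unix epoch on a type unlikely to date from 1970.

    The date is kept, not discarded; the caller could use the flag to add
    the citation to a tracking category for human review.
    """
    if date is None:
        return False
    is_epoch = date.strip().startswith(UNIX_EPOCH_DATE)
    return is_epoch and item_type not in TYPES_WITH_PLAUSIBLE_1970_DATES


if __name__ == "__main__":
    print(epoch_date_is_suspect("webpage", "1970-01-01"))  # flagged for review
    print(epoch_date_is_suspect("book", "1970-01-01"))     # left alone
```

Returning a flag instead of dropping the date keeps the citation visible in a maintenance list, which matches the reasoning above that a small tracking category is easier to work through than every citation lacking a date.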