Talk:Citoid/2016

From mediawiki.org

Previous archives are at /Archive 1

Problems with JAMA Network urls

I'm not sure if the problem lies in Citoid or not, so here's the url: http://archfaci.jamanetwork.com/article.aspx?articleid=479760

The data returned by Citoid has two problems:

The capitalization problem doesn't seem limited to that link. Here's another example: http://archinte.jamanetwork.com/article.aspx?articleid=2204033 (the data returned)

A third one: http://archderm.jamanetwork.com/article.aspx?articleid=2214161 (the data) The RedBurn (talk) 22:40, 5 March 2016 (UTC)Reply

Created a bug report for this: https://phabricator.wikimedia.org/T134698 Mvolz (WMF) (talk) 21:43, 8 May 2016 (UTC)Reply

List of translators available in Citoid

The only URL with a list of translators that I could find was updated almost a year ago. Is there any way to find out which translators are available in Wikimedia? Lsanabria (talk) 13:43, 6 April 2016 (UTC)Reply

Ping Mvolz. I'm also curious about how often the translators are updated from https://github.com/zotero/translators/ since I just submitted my first translator there :) Danmichaelo (talk) 21:21, 8 May 2016 (UTC)Reply
Very intermittently. Once your translator gets merged feel free to ping me somewhere and I'll get it merged upstream :). Mvolz (WMF) (talk) 21:37, 8 May 2016 (UTC)Reply
Thanks, Mvolz, it was merged now :) Danmichaelo (talk) 07:09, 20 May 2016 (UTC)Reply
https://github.com/wikimedia/mediawiki-services-zotero-translators is the list of our translators. Unfortunately, it also includes translators that don't work with Citoid (the ones that don't have the 'v' under the browser support flag, i.e. this one works and this one doesn't). There used to be tests run daily, with the results online, which showed more clearly which ones work and which ones don't, but those have been broken for a while now. Mvolz (WMF) (talk) 21:37, 8 May 2016 (UTC)Reply
It would be great to have that translator results tool back up now that Refill is using Citoid. Even a quick and dirty indicator would be good. Otherwise it's harder to diagnose why links aren't expanding properly. czar 17:00, 9 June 2016 (UTC)Reply
Tracking in phabricator here: https://phabricator.wikimedia.org/T137440 Mvolz (WMF) (talk) 15:01, 29 July 2016 (UTC)Reply
For posterity, here's the link that tests the Citoid server for site/translator compatibility:
https://zotero-translator-tests.s3.amazonaws.com/index.html czar 14:15, 12 July 2017 (UTC)Reply
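
Until the results tool is back, the 'v' check described above can be approximated locally. The following is a minimal Python sketch, assuming each translator file starts with a JSON metadata header that carries a `browserSupport` string; the helper name `works_with_citoid` and the two sample headers are made up for illustration and are not copied from real translators.

```python
import json


def works_with_citoid(translator_source: str) -> bool:
    """Return True if the metadata header lists 'v' under browserSupport.

    Assumption: the file begins with a JSON object (the metadata header),
    followed by the translator's JavaScript code.
    """
    decoder = json.JSONDecoder()
    # raw_decode parses the leading JSON object and ignores whatever follows it.
    metadata, _end = decoder.raw_decode(translator_source.lstrip())
    return "v" in metadata.get("browserSupport", "")


# Hypothetical sample headers, for illustration only.
server_ok = '{"label": "Example A", "browserSupport": "gcsibv"}\nfunction doWeb() {}'
browser_only = '{"label": "Example B", "browserSupport": "gcs"}\nfunction doWeb() {}'

print(works_with_citoid(server_ok))
print(works_with_citoid(browser_only))
```

This only inspects the flag; it says nothing about whether the translator actually succeeds on a given URL, which is what the test results page reports.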

PDF support

Any idea whether PDF support ([https://citoid.wikimedia.org/api?format=mediawiki&search=http%3A%2F%2Fwww.imf.org%2Fexternal%2Fpubs%2Fft%2Ffandd%2F2016%2F06%2Fpdf%2Fostry.pdf example]) is planned and if so when it may be expected? ~ 97.118.191.34 (talk) 18:10, 1 June 2016 (UTC)Reply

It's not currently on the board anywhere. I created a task for it: https://phabricator.wikimedia.org/T136722
But I don't think it's going to happen anytime soon unless someone really wants to take a crack at it. It involves either modifying translation-server to somehow access this functionality in Zotero, or building it from scratch inside Citoid using the same approach Zotero did. (ContentMine can process PDFs for metadata, I believe, but I don't think we're going to support it due to dependency issues, i.e. PhantomJS.) Mvolz (WMF) (talk) 18:29, 1 June 2016 (UTC)Reply
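
For anyone who wants to retest PDF handling later, the example link in the question can be rebuilt for any URL. A small Python sketch follows; the endpoint and the `format` and `search` parameters are taken from that example link, while the helper name `citoid_query_url` is only illustrative.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the example link in this thread.
CITOID_API = "https://citoid.wikimedia.org/api"


def citoid_query_url(source_url: str, fmt: str = "mediawiki") -> str:
    """Build a Citoid API request URL for the given source URL."""
    return CITOID_API + "?" + urlencode({"format": fmt, "search": source_url})


pdf_url = "http://www.imf.org/external/pubs/ft/fandd/2016/06/pdf/ostry.pdf"
print(citoid_query_url(pdf_url))
```

The sketch only builds the request URL; it does not send the request, so it makes no claim about what the service returns for a PDF.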

Embedded Metadata translator in Citoid

Does anyone know what is blocking Zotero's EM from being used in Citoid? Can you point me to the relevant discussions? I have an enwp talk page request to make a translator for HighBeam (a database that is frequently used on enwp) but EM already handles HighBeam nicely if Citoid will support it. (Also see GitHub thread.) czar 23:38, 27 June 2016 (UTC)Reply

During WikiCite I talked about this with a Zotero developer, Sebastian, and he didn't know either, so he sent a query to the listserv about it: https://groups.google.com/forum/#!topic/zotero-dev/fJdp0NK_2ec
I think what is most likely is that it's something that needs to be done in https://github.com/zotero/translation-server. Maybe we should try submitting a bug report there? Mvolz (talk) 10:47, 28 June 2016 (UTC)Reply
Thanks—I didn't know about that exchange. I know this thread is in a few different places now, but it looks like dstillman refreshed translation-server (as announced) to update from the 2013 to the 2016 version of Zotero. Has Citoid pulled those updates? That might resolve the EM issue, but if it doesn't and it would help, I can open a ticket czar 16:02, 28 June 2016 (UTC)Reply
Thanks for following up on this! I didn't think updating would work (since translation-server has very specific behaviour for no translators being available), so I just checked right now with the updated version and sadly it doesn't work. It's something that will have to be fixed in translation-server so I opened up a ticket there: https://github.com/zotero/translation-server/issues/31 Mvolz (talk) 17:31, 28 June 2016 (UTC)Reply

Citoid → Wikidata reference

Is there a tool for creating references for Wikidata items via Citoid (e.g., URL to formatted reference on Wikidata entry)? czar 21:34, 11 July 2016 (UTC)Reply

That would be Phab:T131661 Izno (talk) 23:00, 11 July 2016 (UTC)Reply

The real solution for generating references in one click

Citoid generates incomplete references: it doesn't fill in the author name or publication date, and it doesn't remove the trash in the title (like the " - BBC News" suffix). The solution is to create an extremely simple web standard and to convince the publications to use it. Something like <span class="publication date">January 25, 2000</span> and <span class="author name">Joe Sixpack</span>. Then Citoid will be able to fill in all the data needed for a well-formatted reference. In time, slowly but steadily, the publications will implement this standard, because it will benefit them: Wikipedians will cite the newspapers that implement the standard and will avoid those that don't. I created a bookmarklet script (en:User:Ark25/RefScript) that generates references in one click in one second (not in 10 seconds like Citoid) and fills in all the data, but it's huge, bulky, and only knows some 30 newspaper websites. With a standard like this, my script (and Citoid too) would be super-short and would work for any website that implements the standard. First, the blogs of Wikipedians can implement the standard (by the way, does anyone know Wikipedians who have blogs?), and I'm quite sure the people on the WMF board know people in the media and can convince some of them to implement the standard. I launched this proposal some time ago at meta:Talk:Community_Liaisons/Process_ideas#Make life easier for the editors - generate references in one click. We can stop wasting gazillions of hours painstakingly filling in the data in references; it all depends on the WMF board. Please, WMF board, we want to write reliable articles by providing plenty of references, but you need to get us out of the Stone Age! Thanks. —  Ark25  (talk) 19:43, 23 July 2016 (UTC)Reply

w:Zotero is the standard that you want. I wish that you would add your list of sources to it. Whatamidoing (WMF) (talk) 17:28, 29 July 2016 (UTC)Reply
Continued at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#Generate_references_in_one_click czar 20:42, 29 July 2016 (UTC)Reply
Zotero is not a standard. Zotero is a bulky script, just like RefScript, that requires an enormous amount of energy to maintain. The more translators (pieces of code that teach it to handle a new website) you add, the harder Zotero becomes to maintain and update. The "better" Zotero (and RefScript too) is, the more you have to update it every single day to keep up with the changes newspapers make to the formatting of the articles they publish. If you have 10,000 translators, then you have to update about 30 translators every day, because every day some 0.3% of the websites Zotero knows change their formatting style. This is not a solution, this is a nightmare. —  Ark25  (talk) 22:07, 29 July 2016 (UTC)Reply
Let's keep the discussion on the other page czar 01:21, 30 July 2016 (UTC)Reply
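
To make the markup proposed at the top of this thread concrete, here is a minimal Python sketch of what a consumer of such a standard might look like. The class values "publication date" and "author name" are taken literally from the proposal; the parser class, the field names, and the sample page are hypothetical, and this is not how Citoid or Zotero work today.

```python
from html.parser import HTMLParser


class ProposedMarkupParser(HTMLParser):
    """Collect text from <span> elements that use the class values proposed in this thread."""

    # Literal class attribute value -> hypothetical citation field name.
    WANTED = {"publication date": "date", "author name": "author"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field currently being captured, if any

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._current = self.WANTED.get(dict(attrs).get("class"))

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = self.fields.get(self._current, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None


# Made-up page fragment using the proposed markup.
page = (
    '<h1>Some headline - BBC News</h1>'
    '<span class="publication date">January 25, 2000</span> by '
    '<span class="author name">Joe Sixpack</span>'
)
parser = ProposedMarkupParser()
parser.feed(page)
print(parser.fields)
```

The sketch ignores nested spans and does nothing about the title suffix; it only shows how little code is needed once a page labels its own metadata.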

Confusing publisher and via parameters?

Someone mentioned to me that this tool is incorrectly outputting values like |publisher=Google Books, at least for en.wikipedia citation templates. If it's still doing that, it needs to be fixed ASAP to correctly use the |via= parameter for such intermediary distributors as Google Books, YouTube, Project Gutenberg, PubMed, JSTOR, etc. (if it hasn't been fixed in this regard already). I don't use VE, so I'm not even sure how to test this.  — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  04:55, 24 July 2016 (UTC)Reply

Last I checked, Project Gutenberg retypes the text, they don't just scan it. So their books are new editions and they are the publisher. Citation templates don't provide any mechanism to show that an earlier edition was published by a different publisher. Jc3s5h (talk) 11:38, 25 July 2016 (UTC)Reply
Wikipedia would still not treat them as a publisher, and having tools automatically do so is misleading and wrong for our implementation of source citations. Project G. is a republisher, and that is what |via= is for, even if they did some hand cleanup of their OCR (and, yes, they do use OCR). It's no different from converting a book to PDF and then eBook format. That doesn't make you magically a new publisher, it just means you've done the work (including any after-automation cleanup) to format-shift something.  — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  18:50, 27 July 2016 (UTC)Reply
I think the via parameter would imply that the republication is page-for-page and line-for-line identical to the original publication. Frequently in the past republications would be repaginated, so a passage that appeared on page 100 in the original might be on page 95 in the republication. I believe this was the case in the early days of Project G., although maybe not the more recent publications. Certainly an edition with different page numbering than the original should be treated as a new edition. If in doubt, the presumption should be that it is a new edition, to avoid making a false claim about what page a passage occurs on in the original (which the citing editor has never seen). Jc3s5h (talk) 19:50, 27 July 2016 (UTC)Reply
Well, it doesn't imply that. Electronic versions of documents are very often not "page-for-page and line-for-line identical" to the paper version, unless painstakingly made that way, usually in PDF form. If I write a book and release in PDF form through O'Reilly by special arrangement with them, and (within our licensing parameters) you use some tool to convert it to Kindle format, and this changes the layout in some ways, you don't get to claim to be my book's publisher. Doing so would actually reduce the apparent reliability of the source, since you're just some random person, not a well-known publisher. Per en:w:WP:SAYWHEREYOUGOTIT we do want a |via= parameter identifying that this is a copy from some intermediary source and not straight from the actual publisher.
Citing specific page numbers in e-documents is generally pointless unless they are in fact exact PDF scans; we have the |at= parameter to identify where in an electronic document the material can be found. E.g., I would use this to cite the online edition of the Chicago Manual of Style by section number, since it doesn't even have page numbers. Intelligent use of |at= allows people to find the same part in a paper edition, too.  — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  22:31, 28 July 2016 (UTC)Reply
SMcCandlish wrote " If I write a book and release in PDF form through O'Reilly by special arrangement with them, and (within our licensing parameters) you use some tool to convert it to Kindle format, and this changes the layout in some ways, you don't get to claim to be my book's publisher."
Yes, I do. Whatever contractual agreements got put in place among you, me, and O'Reilly allows me to. Of course, if I'm violating copyright, my version shouldn't be cited at all. Or if it happens in the year 2200 and your copyright has expired, then I don't need anyone's permission to create a new edition.
Really no different than Bloomsbury being the original publisher of the Harry Potter books but Scholastic being the publisher for the English North American editions. Jc3s5h (talk) 15:04, 31 July 2016 (UTC)Reply
Taking a file and running a conversion program on it is nothing at all like Scholastic typesetting, designing a new cover for, creating new frontmatter for, printing, and distributing a NAm edition of a book originally by Bloomsbury. I repeat: What you are talking about is nothing but format-shifting. It is no different from you posting a piece of digital art at DeviantArt, and me (pursuant to permissive licensing terms) putting a copy of it on my Facebook feed; which entails a new copy there, and a re-encoding, i.e. a format shift, and me and Facebook distributing the work to new people. Neither I nor Facebook become the publisher; DeviantArt remains the publisher, Facebook is the |via=. I suppose a philosophical argument can be made that they are two different kinds of publishing really, but who cares? The format-shifting and additional distribution isn't "publishing" for WP citation purposes.
This distinction is the very reason that the |via= parameter was created, to stop mis-attributing format-shifted and other repostings by random pseudo-publishers and content aggregators as the |publisher=, but retain the name of the actual publisher as such, and the name of the online distributor, so that people can find the work in the original form, not just on some possibly short-lived website, but can also use that website for convenience, and not be confused about the difference. For all we know, Google Books or Project Gutenberg could disappear tomorrow forever. The distinction is especially important for any entity that both reformats and distributes (|via=) material on behalf of external, traditional publishers, and also acts as the publisher itself, for new (generally amateur) content. Amazon is already doing this, and this kind of business model shift can happen at any time (e.g. HBO, Netflix, and Amazon are all publishers of original television and e-TV series, when formerly they were, respectively, a cable redistributor, a by-mail and later online stream redistributor, and an e-tailer, of previously published content). So, already, any such entity could appear as a |publisher= or a |via=, for different sources in the same article, and the distinction in each case would matter.
When it comes to historical sources, the original publisher information is also often of pertinent, even crucial, value, since significant differences can exist between the 1645 version of something from a London publisher, and a 1672 edition produced in Dublin, without any intermediary e-distributor like Project Gutenberg even being aware of it. Or – and this is telling – they often are aware of it, and so is Google Books, and take pains to note the actual publisher. Neither service claims to be the publisher of such works, and it is a weird form of original research for WP to insist that they are.
With that, I'm kind of tired of arguing round in circles on this stuff, and don't need to keep at it. We have separate parameters for these things for both a citation accuracy and utility reason (helping readers find and use sources) and a policy reason, en:w:WP:SAYWHEREYOUGOTIT, and neither the separation of these parameters nor the rationales for the separation are going to go away just because you don't see it the same way. I could even be totally wrong about every single thing I've said other than the last sentence and it wouldn't make any difference, since there's already a consensus to keep them separate, and it is not necessary for my analysis of why to be correct (though it is).  — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  10:40, 1 August 2016 (UTC)Reply
PS: I posted a cross-reference to this discussion, at en:w:Help talk:Citation Style 1, and it also turns out there is already an active thread about this over there:
  https://en.wikipedia.org/wiki/Help_talk:Citation_Style_1#A_Meta_discussion_on_the_difference_between_via_and_publisher
 — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  14:23, 1 August 2016 (UTC)Reply
Perhaps the point has been missed here in the back-and-forth. Wikipedia citations use "publisher" to mean the original publisher of an edition of a work. Some of our information providers use the same keyword for a different meaning, the most recent content provider. We should not mix up these two meanings merely because they use the same keyword. If information providers are using "publisher" to mean something different than what we want it to mean, Citoid should not be blindly copying them. David Eppstein (talk) 18:46, 12 July 2017 (UTC)Reply
Agreed, entirely. If the most recent content provider isn't the real publisher of the content, the former should be in the |via parameter. I don't know if there's a practical way to make Citoid aware of a big list of journal aggregators, news aggregators, book scanning sites, etc., to code them as |via instead of |publisher, but I hope so. If WP can maintain a URL blacklist that includes virtually all known URL redirectors (tinyurl.com, etc.), I would think that it could maintain a list of content aggregators (pseudo-republishers).  — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  20:10, 13 July 2017 (UTC)Reply
It isn't quite as simple as original publisher vs. republisher. A republisher that simply copies images of the original publication and makes them available online should probably be named with the via parameter, or similar. But a publisher who re-typesets, and perhaps repaginates, an older work should be regarded as a full-fledged publisher. Some citation styles call for naming the original publisher in this situation, but the Wikipedia citation templates do not have a parameter for this purpose. Jc3s5h (talk) 20:49, 13 July 2017 (UTC)Reply
We already covered this above; one of the hazards of "necroposting" on a year-old thread. WP cites sources to help readers identify and find them and to help editors verify our content. We do not do so as a bibliographic database service; the purpose is not to track the history of a work. So, WP has no need of being able to identify a previous publisher's details. If you have a genuinely republished version with new typesetting and pagination, or even just a new foreword/introduction, this is the work you are citing, by that particular publisher. We don't care who published the first edition that had different font, page numbers, or lack of a "50th anniversary" foreword or whatever. It's just not relevant.
[Conceptual aside: It's really no different from a quote being in a New York Times article, perhaps with an "[editorial tweak]" in it, and a reporter's introduction ("According to X. Y. Zounds in The Zounds Method,"). We cite the newspaper article we found the quote in, not the original primary source of the statement (unless we also have that, and have checked it, and it's appropriate to "double-up" the citation for some reason, e.g. because another source misquoted it and caused a controversy). A new edition, a real republication, of a work is a similar matter; the original material being included is essentially a giant quotation, may have been editorially altered in the course of republication, and may have new lead-in material, a big "Foreword" or "Introduction to the Nth Edition" version of a journalist prefacing a quoted statement from a speech or document.]
By contrast, |via is important, for actual WP purposes and in addition to |publisher, to use for cases of pseudo-republishing, i.e. redistribution or format-shifting, such as if you got something via a scanning site or a content aggregator:
  1. That intermediary is incidental and has no effect we care about on the content itself (e.g., we DGaF if it has an aggregator's watermark on it; that isn't substantive and does not constitute an "edition" or a new "publishing" for WP purposes).
  2. The URL or the entire aggregator itself might not be there tomorrow. I have no insider info on the budgets of Project Gutenberg, Internet Archive, Google Books, or the journal aggregators, but these things cost money to operate. We do know that at least the first two of these have had funding struggles in the past, and still publicly seek donations to keep them going. The latter two are things a profit-minded business entity could axe at any moment, or start paywalling, as a simple business decision. The only consequence of such a failure is a dead URL. The actual citation is to the original work and remains valid; the work still exists and can be found. The dead link info is removed from the citation; we do not remove from citations the names of actual publishers who have ceased operation.
  3. It may not be the most convenient or effective way for a particular reader to get the work. Examples: if someone has taken a print-out of the WP article to a public library and all its Internet access kiosks, if there are any, are in use, but the library may have the original work on its shelves; or in a place where Internet access is costly and schlepping down a huge PDF is not practical, but looking at a paper copy you got via inter-library loan is free; or when a journal aggregator is not free for full text, with that only accessible for pay or at institutions with a subscription; and ... insert numerous other scenarios.
[Second conceptual aside: If I have a blog that I publish, and someone cites it, and the site goes down permanently, and it wasn't archived by the Wayback Machine or something equivalent, then that site is gone; i.e., it cannot be used by readers/editors for verification, ergo it is no longer a valid source citation. A conduit for a copy of a publication (e.g. Wayback.Archive.org), and the publication itself (McCandlishWorldNews.com or whatever): a big and clear difference. People seem to have unreasonable difficulty with the distinction, just "because Internet", i.e. because "a website is a website" in many minds; they're confusing the medium for the message, the delivery format for the content.]  — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  22:50, 13 July 2017 (UTC)Reply
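
The list-based approach floated in this thread (treat known aggregators as |via= rather than |publisher=) could look roughly like the Python sketch below. The function name and the dictionary shape are hypothetical, and the set contains only the intermediaries named in this discussion; a maintained list would be far longer.

```python
# Intermediaries named in this thread; illustrative, not exhaustive.
KNOWN_INTERMEDIARIES = {"Google Books", "YouTube", "Project Gutenberg", "PubMed", "JSTOR"}


def move_intermediary_to_via(params: dict) -> dict:
    """Return a copy of the citation parameters with a known intermediary
    moved from 'publisher' to 'via'. Other parameters are left untouched."""
    fixed = dict(params)
    publisher = fixed.get("publisher")
    if publisher in KNOWN_INTERMEDIARIES and "via" not in fixed:
        fixed["via"] = fixed.pop("publisher")
    return fixed


print(move_intermediary_to_via({"title": "Example", "publisher": "Google Books"}))
print(move_intermediary_to_via({"title": "Example", "publisher": "Bloomsbury"}))
```

Note what the sketch cannot do: it leaves |publisher= empty rather than guessing the actual publisher, and it does not settle the disputed case of re-typeset editions, which a simple name list cannot distinguish.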

Examples of different itemType URLs

Hello, I'm testing Citoid in a Wikipedia and I'm trying to gather URL samples covering as many different itemTypes as possible (thesis, report, email, etc.) for testing template integration (ref.: MediaWiki:Citoid-template-type-map.json). Is there a reference list anywhere? Thanks! Toniher (talk) 12:06, 11 August 2016 (UTC)Reply

I had asked the same thing once and got no luck with that. @Mvolz (WMF) or others, are you aware of such a resource? Elitre (WMF) (talk) 12:13, 11 August 2016 (UTC)Reply
I just found where I could get that information: https://github.com/zotero/translators grepping itemType we can get a list of cases. I would create a manual example list in Citoid/itemTypes. Toniher (talk) 15:16, 11 August 2016 (UTC)Reply
This is simply great. Thank you so much. Elitre (WMF) (talk) 09:16, 12 August 2016 (UTC)Reply
That's great! Be aware that some of these examples may not work in Citoid because some of them require you to be logged in (as Zotero was/is originally a browser extension).
Edit: See that you've already done a great job :D Mvolz (WMF) (talk) 13:47, 12 August 2016 (UTC)Reply
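
The grep over the translators repository mentioned above can be reproduced with a few lines of Python. This sketch assumes that translator files are `.js` files whose embedded test cases contain JSON entries of the form `"itemType": "..."`; the function names and the sample text are illustrative only.

```python
import re
from collections import Counter
from pathlib import Path

# Matches JSON-style entries such as: "itemType": "newspaperArticle"
ITEM_TYPE_RE = re.compile(r'"itemType"\s*:\s*"([A-Za-z]+)"')


def count_item_types(translator_source: str) -> Counter:
    """Count itemType values found in one translator's source text."""
    return Counter(ITEM_TYPE_RE.findall(translator_source))


def count_in_checkout(directory: str) -> Counter:
    """Sum the counts over every .js file in a local checkout (path is up to the caller)."""
    totals = Counter()
    for path in Path(directory).glob("*.js"):
        totals += count_item_types(path.read_text(encoding="utf-8", errors="replace"))
    return totals


# Made-up fragment standing in for a translator's test cases.
sample = '{"itemType": "thesis", "title": "A"}, {"itemType": "report"}, {"itemType": "thesis"}'
print(count_item_types(sample))
```

As noted in the reply above, a URL found this way may still fail in Citoid if the site requires a login.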

Different cite template behaviour of websites

Different websites get different citation templates from Citoid. E.g. nytimes.com gets Template:Cite news and spiegel.de gets Template:Cite web.

In enwp all the templates from the cite* family are somewhat similar, but in German Wikipedia there is a huge difference between the template for online sources and for offline sources. Unfortunately "Cite news" gets translated to the template for offline sources. (More details at my German userpage)

Could you please tell me where Citoid determines the "target template"? Tibbe Tibbe (talk) 13:13, 19 September 2016 (UTC)Reply

You probably want to see https://de.wikipedia.org/wiki/Wikipedia:Technik/Text/Edit/VisualEditor/Rückmeldungen/Archiv/1#Welt_Online . Elitre (WMF) (talk) 15:17, 19 September 2016 (UTC)Reply
We use the Zotero itemType (full list of types here: http://aurimasv.github.io/z2csl/typeMap.xml) and then map these to templates.
If there is a translator in Zotero, it can often correctly detect that a website is a news site. However, if there is no support for it in Zotero, we can't tell that it's a news site from the metadata alone, so as a fallback it is treated as a website.
You can see what the "type" is for any given link in citoid by going to https://citoid.wikimedia.org/ and putting the link in. The Zotero type is in the field "itemType".
Zotero itemTypes are further mapped to Templates in this message: https://de.wikipedia.org/wiki/MediaWiki:Citoid-template-type-map.json
If you want to know if there is a translator for a particular newspaper in Zotero, it may be helpful to look for tests here: http://zotero-translator-tests.s3-website-us-east-1.amazonaws.com/testTranslators.html#date=2016-09-17&browser=v&version=4.0.SOURCE (we only get results from the enabled ones)
(http://zotero-translator-tests.s3-website-us-east-1.amazonaws.com/ has the most updated tests)
Related bugs on phabricator:
https://phabricator.wikimedia.org/T94170 (Poor support for non english newspapers) Mvolz (WMF) (talk) 16:54, 19 September 2016 (UTC)Reply
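
The two-step lookup described in this reply (Zotero itemType first, then the wiki's type map) can be pictured with a short Python sketch. The map below is a hypothetical two-entry excerpt, not the content of any wiki's MediaWiki:Citoid-template-type-map.json, and falling back to the "webpage" entry for unmapped types is an assumption made for the example.

```python
# Hypothetical excerpt of a type map; each wiki defines its own in
# MediaWiki:Citoid-template-type-map.json.
TEMPLATE_TYPE_MAP = {
    "newspaperArticle": "Cite news",
    "webpage": "Cite web",
}


def template_for(citation: dict) -> str:
    """Pick a template name from the citation's itemType.

    Assumption: a missing or unmapped itemType is treated like "webpage".
    """
    item_type = citation.get("itemType", "webpage")
    return TEMPLATE_TYPE_MAP.get(item_type, TEMPLATE_TYPE_MAP["webpage"])


# A site with a translator that reports a news article, and one without.
print(template_for({"itemType": "newspaperArticle", "url": "http://nytimes.com/"}))
print(template_for({"itemType": "webpage", "url": "http://spiegel.de/"}))
```

On a wiki where "Cite news" maps to an offline-source template, the practical fix is therefore in the local type map rather than in Citoid itself.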

Handling language codes

For those interested in this topic, please weigh in at https://phabricator.wikimedia.org/T115326 . Thanks! Elitre (WMF) (talk) 13:30, 13 October 2016 (UTC)Reply

Date formatting

Is there a way to change how Citoid locally handles date formatting when you automatically add a citation from a URL? Currently it adds ISO 8601 dates (YYYY-MM-DD), but the Portuguese Manual of Style recommends that dates should follow "dia de mês de ano" (16 de dezembro de 2016). ArgonSim (talk) 07:37, 16 December 2016 (UTC)Reply

You could change the template logic as suggested by Citoid/Enabling_Citoid_on_your_wiki#Access_date_is_formatted_differently_on_my_wiki. Right now I can't find a task or example with actual instructions about that, sorry. Elitre (WMF) (talk) 11:57, 16 December 2016 (UTC)Reply
w:en:Module:Citation/CS1 at the English Wikipedia supports this. If you've got current versions of these templates, then use |df= to set your preferred date format. You can adjust the templates to do this automatically/in all cases except when locally overridden. Whatamidoing (WMF) (talk) 17:54, 16 December 2016 (UTC)Reply
The current pt-version does seem to support |df=, but I'm afraid if I try to correct it myself, I'll end up doing it wrong and breaking everything. ArgonSim (talk) 17:58, 23 December 2016 (UTC)Reply
I wonder if User:He7d3r or User:Dbastro would like to look into this. Since the code exists, it'd probably be good for the Portuguese Wikipedia to make the most desirable format be the default anyway. Whatamidoing (WMF) (talk) 18:28, 23 December 2016 (UTC)Reply
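
For reference, the conversion being asked for is small. The Python sketch below turns a full ISO 8601 date into the "dia de mês de ano" form; it is only an illustration of the target format, not the template or module code, and the function name `format_pt` is made up.

```python
from datetime import date

# Portuguese month names, lowercase as in the example in this thread.
MESES = [
    "janeiro", "fevereiro", "março", "abril", "maio", "junho",
    "julho", "agosto", "setembro", "outubro", "novembro", "dezembro",
]


def format_pt(iso_date: str) -> str:
    """Convert 'YYYY-MM-DD' to 'D de <mês> de YYYY'."""
    d = date.fromisoformat(iso_date)
    return f"{d.day} de {MESES[d.month - 1]} de {d.year}"


print(format_pt("2016-12-16"))
```

Partial dates such as a bare year or year and month would raise a `ValueError` here and would need separate handling in any real implementation.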