Topic on Talk:Citoid

website specifications for citations

Bluerasberry (talkcontribs)

Where are the specifications for what websites need to do to enable citoid to generate citations from URLs?

I presume these are W3C specifications and have to do with HTML tags, but what documentation do we use to explain what website operators need to do to align themselves with citoid? Is there anything in the Wikimedia platform?

I am imagining tags like author, title, date, etc. Where is the full list of what citoid takes, along with instructions?

PerfektesChaos (talkcontribs)

I suggest you read en:Zotero as an introduction and continue by crawling zotero.org. Since you ask for a full list: the official documentation is somewhere there.

There are known websites where information can be retrieved individually, and if unknown then general methods like en:Dublin Core will be tried.

Bluerasberry (talkcontribs)

I might be lost. I was expecting Wikimedia documentation, and I think you are suggesting that compliance with Zotero is equivalent to compliance with Wikipedia. Is this the case, and will this be the case for the foreseeable future?

I looked in Zotero's FAQ. They seem to be speaking to users, not to webmasters. I am looking for advice for webmasters.

I was looking for the specifications which Wikipedia citation lookup services use to generate citations. For example, if I input a New York Times URL, then somehow the tool knows that NYT has specified title, author, date, and other fields. Many other less-developed websites do not give this information to the wiki lookup tool. I want to know what NYT is doing, or rather, where the authoritative web recommendations are which specify what websites should do to be machine readable. I want there to be a Wikipedia recommendation page for webmasters telling them what to do.

If the answer is "there seems to be no discussion of this for MediaWiki / Wikimedia", then that is useful information. Is Dublin Core the recommendation which Wikipedia editors should give to the world's web developers in Wikimedia documentation for how they should maximize their compatibility with this platform and receive their best citations?

thanks

Mvolz (WMF) (talkcontribs)

This sort of thing used to be handled by citoid; now it's handled by Zotero's translation-server directly. If a website has no specific translator (the New York Times has one; the full list is at https://github.com/zotero/translators), it is handled by the embedded metadata translator (https://github.com/zotero/translators/blob/master/Embedded%20Metadata.js), which has support for Highwire metadata as well as the following ontologies:

bib: "http://purl.org/net/biblio#",
bibo: "http://purl.org/ontology/bibo/",
dc: "http://purl.org/dc/elements/1.1/",
dcterms: "http://purl.org/dc/terms/",
prism: "http://prismstandard.org/namespaces/1.2/basic/",
foaf: "http://xmlns.com/foaf/0.1/",
vcard: "http://nwalsh.com/rdf/vCard#",
link: "http://purl.org/rss/1.0/modules/link/",
z: "http://www.zotero.org/namespaces/export#",
eprint: "http://purl.org/eprint/terms/",
eprints: "http://purl.org/eprint/terms/",
og: "http://ogp.me/ns#", // Used for Facebook's OpenGraph Protocol
article: "http://ogp.me/ns/article#",
book: "http://ogp.me/ns/book#",
music: "http://ogp.me/ns/music#",
video: "http://ogp.me/ns/video#",
so: "http://schema.org/",
codemeta: "https://codemeta.github.io/terms/",
rdf: "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


And then as a fallback some lower quality metadata.

Of the available ones, I think that from a webmaster's perspective eprints would give the highest quality results, because this standard is really designed for citations and most closely matches Zotero's internal standard. It would be good for journal articles, newspaper articles, and websites. By contrast, Dublin Core is more common but doesn't always map that nicely. For music and video, Facebook's Open Graph metadata standard might be better, but I'm not really sure.

As for how the metadata should be included in the page, putting the metadata tags in the html itself is safest. Zotero has support for rdf but citoid doesn't; citoid has support for json-ld but Zotero doesn't.
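For illustration, here is a minimal sketch of what "metadata tags in the html itself" can look like, and how a scraper might read them. The sample page, its values, and the `collect_meta` helper are assumptions made up for this example; this is not citoid's or Zotero's actual code, only a rough picture of the idea.

```python
from html.parser import HTMLParser

# Hypothetical page offering Highwire, Dublin Core, and Open Graph tags together.
SAMPLE_PAGE = """<html><head>
<title>Example article</title>
<meta name="citation_title" content="Example article">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_publication_date" content="2021/01/15">
<meta name="DC.title" content="Example article">
<meta name="DC.creator" content="Doe, Jane">
<meta name="DC.date" content="2021-01-15">
<meta property="og:title" content="Example article">
<meta property="og:type" content="article">
</head><body></body></html>"""


class MetaCollector(HTMLParser):
    """Collects <meta> tags keyed by their name or property attribute."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if key and content:
            # A key may repeat (e.g. several authors), so keep a list.
            self.tags.setdefault(key, []).append(content)


def collect_meta(html):
    """Return a dict mapping each meta key to the list of its values."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.tags


meta = collect_meta(SAMPLE_PAGE)
```

The different schemes use separate key names, so a page can carry several of them side by side and each consumer picks the ones it understands.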

Diegodlh (talkcontribs)

Hi, @Mvolz (WMF)!

You say that "citoid has support for json-ld but Zotero doesn't". I'd appreciate it if you could elaborate on this, please.

I see that citoid's Scraper uses (here) html-metadata lib's parseAll function which does support JSON-LD; it returns a promise that resolves to a metadata object with a jsonLd property.

However, this metadata object is passed to the matchIDs function (here), which does not seem to use this jsonLd property.

The metadata object is then passed to the addMetadata function (here), and inside it to the addItemType function (here), none of which seem to use its jsonLd property either.

Finally, the data in the jsonLd property doesn't show in the final citoid response (see T270816).

Am I missing something? Thanks!

Mvolz (WMF) (talkcontribs)

For a while we used citoid's native translator for a lot of websites, but at some point we switched to using zotero for everything unless zotero fails/goes down. So since the zotero translator doesn't support json-ld it won't show up unless zotero fails (which happens rarely). The issue is tracked in zotero here: https://github.com/zotero/translators/issues/917
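The behaviour described above (Zotero first, citoid's native scraping only when Zotero fails) can be pictured roughly as follows. The names `cite`, `zotero_translate`, `zotero_down`, `native_scrape`, and `TranslationError` are hypothetical stand-ins for this sketch, not citoid's real API.

```python
class TranslationError(Exception):
    """Raised by the hypothetical primary translator when it cannot respond."""


def cite(url, primary, fallback):
    """Try the primary translator; use the fallback only if it fails."""
    try:
        return primary(url)
    except TranslationError:
        return fallback(url)


def zotero_translate(url):
    # Stand-in for a successful translation-server response.
    return {"source": "zotero", "url": url}


def zotero_down(url):
    # Stand-in for the rare case where the server fails or is down.
    raise TranslationError("translation-server unreachable")


def native_scrape(url):
    # Stand-in for the native scraper, the only path that would see JSON-LD.
    return {"source": "citoid", "url": url}


normal = cite("https://example.org/a", zotero_translate, native_scrape)
degraded = cite("https://example.org/a", zotero_down, native_scrape)
```

Under this arrangement, anything only the fallback understands stays invisible as long as the primary path succeeds.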

Diegodlh (talkcontribs)

Thanks, Marielle. I switched Zotero translation off in my local citoid instance (setting `zotero` to `false` in the `config.yaml` file), but although the output citation changed (as expected) the JSON-LD is still not present.

I reviewed the `addMetadata` function in the `Scraper` module and I still don't see where `html-metadata`'s `jsonLd` is being used. I see there are custom (citoid) translators for highwire, bepress, opengraph, etc., but I don't see a translator for JSON-LD.

Am I missing something?

Mvolz (WMF) (talkcontribs)

Which website are you scraping? We only parse json-ld if it's in the html, if it's in a linked file it won't get scraped.
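To make the distinction concrete, here is a rough sketch of reading only inline JSON-LD while ignoring a linked file. The sample page, its values, and the `JsonLdCollector` class are assumptions for illustration; this is not how html-metadata or citoid implement it.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: one inline JSON-LD block and one linked JSON-LD file.
SAMPLE_PAGE = """<html><head>
<link rel="alternate" type="application/ld+json" href="/meta.jsonld">
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Book",
 "name": "Example Book", "author": {"@type": "Person", "name": "Jane Doe"}}
</script>
</head><body></body></html>"""


class JsonLdCollector(HTMLParser):
    """Collects JSON-LD objects embedded in <script> tags of the page itself."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.objects = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.objects.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip empty or malformed blocks.


collector = JsonLdCollector()
collector.feed(SAMPLE_PAGE)
inline_objects = collector.objects
```

The `<link>` element is never fetched here, so whatever `/meta.jsonld` contains would not reach the citation.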

Diegodlh (talkcontribs)

Hi, @Mvolz (WMF). Sorry for the delay.

See for example https://www.perlego.com/book/1431388/qualitative-research-practice-a-guide-for-social-science-students-and-researchers-pdf?queryID=8d25693afbbc254b9927e5d0f7dac19f&searchIndexType=books.

The correct item type (book) and author names are available in one of the JSON-LD objects, and are in fact available in the metadata object returned by html-metadata's parseAll (see my original comment above).

However, Citoid (with Zotero turned off) returns a wrong item type (webpage) and no author names.

PerfektesChaos (talkcontribs)

As a provider of HTML documents I would offer multiple general metadata simultaneously.

and more.

  • They do not cause conflicts since they have separate naming schemes.
  • Leave it to the audience and let them pick up what they understand.
  • Zotero etc. have some heuristics and will make their choice.

Bluerasberry (talkcontribs)

Wow, thanks, this is what I wanted. This is a bit heavy for me, so I will read and think for a while. Thanks a lot for the answers, opinions, and links.
