Talk:Citoid/2014
This page used the Structured Discussions extension to give structured discussions. It has since been converted to wikitext, so the content and history here are only an approximation of what was actually displayed at the time these comments were made.
Previous archives are at /Archive 1
Handling multidimensional field mappings
- cite-from-id fields need to be able to be mapped to citation fields using TemplateData. However, not all fields have a one-to-one relationship.
- The best example of this issue is with authors. Zotero, for instance, returns the param "creators" which is an array of objects ([{"firstName":"Jessica","lastName":"Trinoskey","creatorType":"author"},{"firstName":"Frances A.","lastName":"S.","creatorType":"author"}])
- Citation templates can handle author data in a variety of ways. For instance, CS1 can accept multiple authors with the format first1, last1, first2, last2, etc. It can accept a string of authors as a single param; last1 is a synonym for last, author, and authors. However, returning a flat "author" string is not ideal. Mvolz (talk) 07:25, 10 June 2014 (UTC)
- Mvolz: One problem that is popping up is that Zotero uses the same field name across item types that have different field names in, say, Template:Citation. So the fields "journal", "newspaper", and "magazine" in Template:Citation are each "publicationTitle" in Zotero (which makes sense). The way I've been doing things so far is listing a set of Template:Citation fields and seeing which ones are available in the Zotero citation, but now it seems doing the reverse might be the best option. That carries its own problems too, because Zotero has a very large number of fields, and that approach doesn't lend itself to the TemplateData solution very well. This is less problematic for "smaller" templates that don't contain potentially overlapping fields like the master Citation template does. Mvolz (talk) 22:30, 30 June 2014 (UTC)
- Would it be reasonable to accept flattened strings as a first approximately-good attempt, storing the structured data somehow so we can do it better later? Jdforrester (WMF) (talk) 16:38, 18 June 2014 (UTC)
- Jdforrester (WMF): Sure! The current iteration doesn't have the structured data in it and only returns flattened data, but there's no reason we can't keep the structured data in (involves commenting out one line of code at present!)
- The more complicated/worrisome case is having to split or join fields but I haven't come across that use case yet. Mvolz (talk) 18:34, 18 June 2014 (UTC)
- Mvolz: Yeah; I think coming up with a sufficiently expressive mapping system that's not too confusing to map things will be hard, though:
- <creatorType:author, position:1st, firstName> → "first"
- <creatorType:author, position:1st, lastName> → "last"
- <creatorType:author, position:Nth-not-1st, firstName> → "first<N-1>"
- <creatorType:author, position:Nth-not-1st, lastName> → "last<N-1>"
- …
- Fun. Jdforrester (WMF) (talk) 21:03, 18 June 2014 (UTC)
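The two mapping problems raised in this thread (numbered author fields, and "publicationTitle" fanning out to different template parameters) can be sketched roughly as below. This is only an illustration under assumptions, not how citoid does it: the function names `mapAuthors` and `mapItem` and the lookup table are hypothetical, the numbering follows the first1/last1, first2/last2 convention mentioned above, and only creators of type "author" are handled.

```javascript
// Hypothetical lookup: which Template:Citation parameter should receive
// Zotero's shared "publicationTitle" field, keyed by Zotero itemType.
const PUBLICATION_TITLE_PARAM = {
  journalArticle: "journal",
  newspaperArticle: "newspaper",
  magazineArticle: "magazine",
};

// Flatten a Zotero-style creators array into numbered parameters
// (first1, last1, first2, last2, ...). Non-author creators are skipped.
function mapAuthors(creators) {
  const params = {};
  let n = 0;
  for (const creator of creators || []) {
    if (creator.creatorType !== "author") continue;
    n += 1;
    if (creator.firstName) params["first" + n] = creator.firstName;
    if (creator.lastName) params["last" + n] = creator.lastName;
  }
  return params;
}

// Combine both mappings for a single item (assumed shape, see lead-in).
function mapItem(item) {
  const params = mapAuthors(item.creators);
  const target = PUBLICATION_TITLE_PARAM[item.itemType];
  if (target && item.publicationTitle) {
    params[target] = item.publicationTitle;
  }
  return params;
}

const example = mapItem({
  itemType: "journalArticle",
  publicationTitle: "Example Journal", // made-up value for illustration
  creators: [
    { firstName: "Jessica", lastName: "Trinoskey", creatorType: "author" },
    { firstName: "Frances A.", lastName: "S.", creatorType: "author" },
  ],
});
console.log(example);
```

A table keyed by item type keeps the one-to-many cases in data rather than code, which may or may not fit a TemplateData-based approach.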
Standard format for cite-from-id fields
- There are a number of options for standardizing the kind of fields that cite-from-id could return:
- Zotero field types - Used by Zotero
- CSL (Citation Style Language) - Translations available in Zotero
- MARC XML - Available as export from OCLC
- Dublin Core - Available as export from OCLC
- Those used by CrossRef
- COinS?
- Use CS1 fields as a basis on which to have our own Mvolz (talk) 07:34, 10 June 2014 (UTC)
- Mvolz: Having just spent 30 minutes reading about all of these... I have no clue. If I was going to choose any from what I've seen, it would probably be CSL as it seems to be well defined.
- Gwicke did find a description of all fields in Zotero, but the best documentation is to "read the sql" [1]. That seems undesirable to me.
- Dublin Core is supposed to also solve this problem, but to me it looked very focused on libraries and print media (which makes sense because it's an OCLC initiative).
- MARC doesn't seem to have widespread adoption anywhere, and I was unable to tell what differentiated it (in semantic terms) from Dublin Core.
- [1] http://zomark.github.io/zotero-marc/schema/trunk/ Mwalker (WMF) (talk) 08:17, 17 June 2014 (UTC)
- So there's a generated list of the fields and their CSL equivalents here:
- http://aurimasv.github.io/z2csl/typeMap.xml
- I think I generally agree with you that it's the best standard; except that we may lose some information by translating Zotero to CSL because not all Zotero fields have CSL equivalents. So I guess the question is, is it better to use a well-defined standard, or to lose a few fields here and there? I guess if the fields aren't discoverable it doesn't matter if they get discarded because they won't be used anyway. Mvolz (talk) 09:29, 17 June 2014 (UTC)
- Mvolz: Do you still need help with this? I could ask around to see if anyone at the English Wikipedia knows anything about this. Whatamidoing (WMF) (talk) 17:07, 14 July 2014 (UTC)
- I think for the time being we're sticking with the zotero format for expediency's sake. We can pass those back through Zotero to get the CSL style instead anyway if it becomes desirable at some point. Mvolz (talk) 22:42, 14 July 2014 (UTC)
Title not being picked up
Using citoid to retrieve a cite for this URL, the resulting title param is just a repeat of the URL. Looking at the page source, there is a <title> element in the <head> (with a value of "HISTORY"). Shouldn't it retrieve this as the title instead? ~ —[AlanM1(talk)]— 10:18, 10 October 2014 (UTC)
- Hi,
- I've looked into this and it looks like that particular website automatically bans scrapers. After trying it a few times, I get a "403: Access Forbidden. Your IP has been blacklisted. Repeat offender (Autobanned)" error on the website. I think that's likely the cause! Thanks for reporting this! Mvolz (talk) 14:08, 10 October 2014 (UTC)
- I see. My client script should probably better indicate the status. Thanks. —[AlanM1(talk)]— 21:02, 10 October 2014 (UTC)
- Unfortunately there's currently no way to indicate this: when it's not possible to get the actual title, the default behaviour is to set the URL as the title, and citoid will return a 200 OK with that info. The 403 is what I'm getting from the page itself.
- Changing how this is done is something to consider (i.e. whether it might be better to return with no title). Mvolz (talk) 21:36, 10 October 2014 (UTC)
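Since the service answers 200 OK with the URL in the title field when no real title can be fetched, a client script could in principle detect that case itself. A minimal sketch, assuming the returned citation object has "title" and "url" string fields; the function name `titleIsUrlFallback` is hypothetical and this is not part of citoid:

```javascript
// Returns true when the title looks like the URL fallback rather than a
// real page title. Trailing slashes and surrounding whitespace are ignored.
function titleIsUrlFallback(citation) {
  if (!citation || !citation.title || !citation.url) return false;
  const normalize = (s) => s.trim().replace(/\/+$/, "");
  return normalize(citation.title) === normalize(citation.url);
}

// Made-up example data for illustration only.
const fallback = { title: "http://example.com/history/", url: "http://example.com/history" };
const realTitle = { title: "HISTORY", url: "http://example.com/history" };
console.log(titleIsUrlFallback(fallback), titleIsUrlFallback(realTitle));
```

A client could use such a check to warn the user that the title needs to be filled in by hand.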
Pipes in values need to be encoded
On many sites, the title of all pages ends with a pipe character, followed by the site name. This currently goes into the value of the title parameter unchanged, which results in a complaint by the ref parser (at least) about ignoring the text after the pipe. The solution is that any values containing a pipe need to translate it to the HTML entity &#124; instead.
When I do these manually, though, I trim off this pipe and site name, since they really are not a useful part of the title. This would be even better, though a bit of logic may be necessary to trim the left side of the sitename and the URL down to the domain name alone.
Example site: B2B Marketplace Kinnek Secures $10M In Series A Funding | PYMNTS.com. —[AlanM1(talk)]— 09:45, 11 October 2014 (UTC)
- The responses are JSON so I'm not sure we should do any additional encoding server side.
- Roan recently added the ability to get requests in HTML, so it's possible that if people prefer we can offer the option of urlencoded values as well.
- Related bug: bugzilla:69482.
- There it suggests using the magic word {{!}} instead of <nowiki>
- Dropping content after a pipe automatically to get a best-guess of the title though is certainly an option too, but not a solution to the general case. Mvolz (talk) 10:38, 11 October 2014 (UTC)
- Generally any wikitext escaping should be handled in Parsoid. Currently it uses <nowiki> escaping, but {{!}} could indeed be a nicer alternative in templates. Gabriel Wicke (GWicke) (talk) 15:12, 17 October 2014 (UTC)
- I modified the client script, using the HTML entities:
var wt = value.wt.replace(/\|/g, "&#124;").replace(/}}/g, "&#125;&#125;"); —[AlanM1(talk)]— 12:16, 21 October 2014 (UTC)
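For comparison, the two other options raised in this thread (escaping with the {{!}} magic word, and dropping the site name after the last pipe) might look roughly like this on the client side. This is a sketch under assumptions: the helper names `escapePipes` and `trimSiteSuffix` are hypothetical, and the trimming heuristic assumes the site name follows a pipe surrounded by spaces, which will not hold in general.

```javascript
// Replace every pipe with the {{!}} magic word so the value can sit
// inside a template parameter without being split.
function escapePipes(value) {
  return value.replace(/\|/g, "{{!}}");
}

// Best-guess title: drop everything from the last " | " onward.
// Titles without that separator are returned unchanged.
function trimSiteSuffix(title) {
  const idx = title.lastIndexOf(" | ");
  return idx > 0 ? title.slice(0, idx) : title;
}

const rawTitle = "B2B Marketplace Kinnek Secures $10M In Series A Funding | PYMNTS.com";
console.log(escapePipes(rawTitle));
console.log(trimSiteSuffix(rawTitle));
```

Escaping keeps the full title, while trimming loses information whenever the pipe is a genuine part of the title, so the two are better treated as separate, optional steps.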
Author name sometimes includes date
Some of the times that an author is actually returned, the name includes extraneous information – often part of the date. It would be nice to try to filter some of these a bit more.
Example: this site results in first1="Michael Carney On September" and last1="15". It seems that, before chopping into first and last, if any of the "words" are numbers, you can remove them and also look for a month or abbreviation and remove it if the result is at least one word. If you don't already have the date, this could also be a source for it. —[AlanM1(talk)]— 10:18, 11 October 2014 (UTC)
- If there weren't about a million English-speaking women whose names were months (about a quarter million American women are named "April"), then this might be feasible. There are more than a hundred thousand people whose last names are months, too. Whatamidoing (WMF) (talk) 19:11, 30 October 2014 (UTC)
Like Quora?
This sounds like a possible (at least partial) solution to enhancement request 65540, where I suggest a Quora-like DWIM approach to pasting URLs - guessing what the target is and auto-converting it to an appropriate link, which may or may not be a citation template. Amir E. Aharoni {{🌎🌍🌏}} 10:27, 29 October 2014 (UTC)
- This could potentially inform the VE UI for citoid;
- I.e. we could have one reference dialog (the basic reference dialog), and then if a user pastes a link *into* the reference dialog, that inserts the citation guessed from the link. And then keep the drop-down menu for individual templates in the ref dialog.
- This does take some user discovery; however, I see this commonly being done with references already- i.e. a ref tag containing only an unformatted url. So it seems people are already just pasting links into the ref field and this is a natural enhancement of that work flow.
- This, by the way, is what both G+ and Facebook do for links as well, not just Quora. So there may be already some user comfort with a pasted link becoming formatted in some fashion. (x-posted this to bug 65540 as well) Mvolz (talk) 11:02, 29 October 2014 (UTC)
- My impression is that Quora is smarter than Facebook, but yes, it's indeed similar. Amir E. Aharoni {{🌎🌍🌏}} 12:58, 29 October 2014 (UTC)
Issue tracker
[moved from article]
Where is the issue tracker for this project? There should be a component in bugzilla, "MediaWiki extensions" product, given the location of its repo. See Bug management/Project maintainers for how to request a component. Jdforrester (WMF) (talk) 17:00, 29 October 2014 (UTC)
- Right now it's being managed as part of the VisualEditor product in Bugzilla; in time we'll spin it out. Note that it's a service, not an extension. Jdforrester (WMF) (talk) 17:01, 29 October 2014 (UTC)
- Yes, there should!
- The issue is that there's mediawiki/extension/Citoid, and also mediawiki/services/citoid, so I'm not sure where they should live? Theoretically reporters won't really know which one anyway, unless the service is being used as a standalone... Mvolz (talk) 17:28, 29 October 2014 (UTC)
- It's probably worth waiting for Phabricator at this point… Jdforrester (WMF) (talk) 22:26, 29 October 2014 (UTC)
- (Wait, where was this moved from?) Mvolz (talk) 17:30, 29 October 2014 (UTC)
- It was added to the cite-from-URL page here. Jdforrester (WMF) (talk) 22:25, 29 October 2014 (UTC)
- @Nemo bis: , @AlanM1: , @Jdforrester (WMF): , @Whatamidoing (WMF): , there's now an issue-tracker on phabricator: https://phabricator.wikimedia.org/project/view/62/ Mvolz (WMF) (talk) 09:42, 4 November 2014 (UTC)