Talk:Page Previews/API Specification

Why is PP stripping parenthetical statements?

Nux (talkcontribs)

The problem is that this removes the dates (born, died). It is often important to know when a person was born: whether they lived in the modern era, or how old a sportswoman is, should be visible in the preview. Popups seems to be doing a better job here (showing users crucial information).

If this is important for some articles, then maybe you could get the date of birth (and maybe calculate the age) from Wikidata (for items in which P569 is available)?
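A minimal sketch of the age calculation suggested above, assuming the birth date has already been fetched as a Wikidata time value (a string such as `+1952-03-11T00:00:00Z` plus a numeric precision, where 11 means day precision). The function name and the choice to skip anything less precise than a day are illustrative assumptions, not part of the spec.

```python
from datetime import date
from typing import Optional


def age_from_p569(time_value: str, precision: int, today: date) -> Optional[int]:
    """Return the age in whole years, or None if it cannot be computed safely."""
    # Skip BCE dates and values that are only precise to the month or year.
    if precision < 11 or not time_value.startswith("+"):
        return None
    date_part = time_value[1:].split("T")[0]  # e.g. "1952-03-11"
    year, month, day = (int(part) for part in date_part.split("-"))
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (month, day)
    return today.year - year - (0 if had_birthday else 1)


print(age_from_p569("+1952-03-11T00:00:00Z", 11, date(2020, 3, 10)))
```

Passing `today` explicitly keeps the function deterministic, which matters if previews are cached.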

Phuedx (WMF) (talkcontribs)

The decision to strip all content inside of balanced parentheses was formalised in https://phabricator.wikimedia.org/T91344#3327008. Rereading that task, some participants support complete removal with no exceptions, others support no removal at all, and the current situation is somewhere in the middle.


@OVasileva (WMF): Is there a decision record detailing why we took that position?

Jdlrobson (talkcontribs)

While the wikidata summaries are not implemented I ask that we remove these from the specification, or at least add to a draft section. I'd like to talk about them some more. I'm not convinced they will be needed, right now.

Phuedx (WMF) (talkcontribs)
Jdlrobson (talkcontribs)

Draft section sounds good to me!

Summary by Phuedx (WMF)

Moved to Page Previews/API Specification but might redirect to Specs/Summary/1.3.0 as that'll be pointed to by other documentation.

CKoerner (WMF) (talkcontribs)

Is this documentation ready to be moved to the main namespace? I had some feedback from the last monthly Readers update that we didn't link to anything when referencing our work. This item being one of the two without links. That's fair and I'd love to be able to link to this in a more 'official' manner than a user subpage. :)

MHolloway (WMF) (talkcontribs)

Moving it to mainspace seems appropriate to me.

Phuedx (WMF) (talkcontribs)
MHolloway (WMF) (talkcontribs)
CKoerner (WMF) (talkcontribs)

I'm being bold.

Redirects are cheap if y'all think of something better. :)

BSitzmann (WMF) (talkcontribs)

I've linked to it from Specs/Summary/1.3.0 but maybe we should merge these somehow? This URL is used in the Content-Type of the responses and therefore would be the more advertised URL.
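For context, a sketch of how a client could read the spec URL out of a response's Content-Type header. It assumes the URL is carried in a `profile` parameter and that the header looks like the sample below; both the header shape and the full URL are assumptions for illustration.

```python
from email.message import Message
from typing import Optional


def spec_profile(content_type: str) -> Optional[str]:
    """Return the `profile` parameter of a Content-Type header, if present."""
    msg = Message()
    msg["Content-Type"] = content_type
    profile = msg.get_param("profile")
    return profile if isinstance(profile, str) else None


# Assumed header shape; the value served in production may differ.
header = (
    "application/json; charset=utf-8; "
    'profile="https://www.mediawiki.org/wiki/Specs/Summary/1.3.0"'
)
print(spec_profile(header))
```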

MHolloway (WMF) (talkcontribs)

Good point, @BSitzmann (WMF). Maybe it makes more sense to have the spec live there (and the page just created can redirect to it).

Phuedx (WMF) (talkcontribs)

@MHolloway (WMF) @BSitzmann (WMF): Feels like the spec is the most likely to see traffic as it'll be automatically linked to by the auto-generated RESTBase API documentation, right?

I think that this topic can be closed as the documentation has been moved into mainspace.

Disambiguation links are not always links

Jdlrobson (talkcontribs)

Consider the links in the article https://en.m.wikipedia.org/wiki/Colombo_Airport

Shipping links for disambiguation pages gives the impression that the links are all possible pages, but in that page Colombo and Sri Lanka would be unrelated.

OVasileva (WMF) (talkcontribs)

I'm not sure I understand what the issue is here. We should not distinguish links within previews. Would the preview display links as links?

Phuedx (WMF) (talkcontribs)

@CFloyd (WMF): The disambiguation links were something that we talked about at the beginning of writing this spec. I think @Jdlrobson's point is that while the links on that disambiguation page are valid links, they lose meaning when taken out of context.

Jdlrobson (talkcontribs)

@kaldari I know a while back there were attempts to fix disambiguation pages... maybe this could be an opportunity to fix it.

E.g. This links to Page title but this term is ambiguous and may refer to one of many things. You can fix this by editing the page.

Jdlrobson (talkcontribs)

Is Sentence Boundary Detection (SBD) required?

Phuedx (WMF) (talkcontribs)

Given the current definition of an intro, it's not clear that we'll need to return a number of sentences as before. AFAIK the apps request 5 sentences and Page Previews requests 525 characters.

Should the intro be limited to 5 sentences if the first paragraph of the lead section is longer?

Phuedx (WMF) (talkcontribs)
Jdlrobson (talkcontribs)

I was working on the basis that we will only consider the first paragraph which means sentence detection is not necessary.

I'm fairly confident first paragraph is enough for a summary and I would hate and push back strongly against introducing this kind of technical complexity.
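The first-paragraph approach can be sketched as follows: take the text of the first `<p>` element of the lead section and stop there, so no sentence splitting is involved. This is an illustration using Python's standard `html.parser`, not the service's implementation; the class and function names are made up, and it does not handle edge cases such as empty leading paragraphs.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collects the text content of the first <p> element only."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while we are within the first <p>
        self.done = False    # True once the first <p> has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def first_paragraph_text(html: str) -> str:
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


lead = "<div>Infobox</div><p>First <b>para</b>. Second sentence.</p><p>Next.</p>"
print(first_paragraph_text(lead))
```

Whether the result is short enough is then a property of how editors write the first paragraph rather than of a sentence or character limit.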

Phuedx (WMF) (talkcontribs)

@Jdlrobson:

I would hate and push back strongly against introducing this kind of technical complexity.

I hope that your push back would involve weighing up the pros and cons of the approach (e.g. an obvious pro is minimising the amount of data that we're sending and that clients are receiving, which is a genuine concern) and an investigation into how complex existing solutions actually are.

Jdlrobson (talkcontribs)

The examples on http://jdlrobson.com/summaries show that relying on the first paragraph only appears, for the most part, to generate shorter summaries than the existing approach across the examples.

At worst they may be double the length.

Summary by Phuedx (WMF)

Rightly or wrongly, we're looking to make the new service's response consistent with the existing summary API.

Jdlrobson (talkcontribs)

Shouldn't responses be consistent with the existing summary API?

This way no client-side changes are needed and the new endpoint will be interchangeable with the RESTBase summary API...

Phuedx (WMF) (talkcontribs)

Not necessarily, no. This spec revolved around the idea that RESTBase's ability to version APIs was easy and cheap. We now know that the former is true but the latter isn't. It'd be foolish to believe that the APIs are actually the same (both semantically and in terms of their content) but it's what we have to do to minimise the cost of getting the new service exposed in production…

Add markup for chemical formulas

Summary by Jdlrobson

TextExtracts removes `span` from the output whereas the new endpoint retains them. This means Template:chem works fine with the new endpoint extracts.

OVasileva (WMF) (talkcontribs)
Phuedx (WMF) (talkcontribs)

@OVasileva (WMF): Just to note that those examples are from enwiki. HTML previews are only enabled on the Beta Cluster wikifarm. This may explain the behavior in your last two examples.

Now, the chem template produces spans with inline styles, not super- or subscript tags (sup and sub respectively). For the former case, TextExtracts will strip the inline styles from the span, thus losing its formatting; for the latter case, TextExtracts will preserve the tags.

The reason that the chem template produces spans with inline styles is so that it can correctly format things like charge. Consider the example in the template's documentation: if the template were to produce sup and sub tags, then the 4 and 2- in the example wouldn't be above and below one another.

I see two ways of solving this.

  1. The chem template does wrap its output in a span class="chemf" element. We could disable the inline style stripping behavior for this wrapper element and its children.
  2. Create a page-preview-preserve or pp-preserve class, which, like the above, would disable all stripping behavior in the API for this element. We'd then ask the chem template author(s) to use the class in their template.

I'm leaning towards something like #2 as it'd also allow editors to mark, say, parentheticals that they want preserved in the preview… OTOH #1 would be quicker to implement but puts the burden of finding and adding exceptions to the whitelist firmly on the maintainers of Popups.

Also, I've updated the spec to add an exception for sup and sub tags.
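The two options above amount to the same mechanism with a different whitelist: strip inline `style` attributes everywhere except inside elements carrying a preserved class. A minimal sketch, assuming a whitelist of `chemf` (option 1) and `pp-preserve` (option 2) and well-formed input; the class `StyleStripper` and the helper `strip_styles` are hypothetical names, and this is not how TextExtracts or the new endpoint is implemented.

```python
from html import escape
from html.parser import HTMLParser

PRESERVE_CLASSES = {"chemf", "pp-preserve"}  # assumed whitelist
VOID_TAGS = {"br", "hr", "img", "wbr"}       # elements without an end tag


class StyleStripper(HTMLParser):
    """Re-serialises HTML, dropping style attributes outside preserved elements."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.preserve_depth = 0  # >0 while inside a preserved element

    def _open(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        preserved = self.preserve_depth > 0 or bool(classes & PRESERVE_CLASSES)
        kept = [(k, v) for k, v in attrs if preserved or k != "style"]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        self.out.append(f"<{tag}{rendered}>")
        return preserved

    def handle_starttag(self, tag, attrs):
        if self._open(tag, attrs) and tag not in VOID_TAGS:
            self.preserve_depth += 1

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs)  # self-closing tags never change the depth

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if self.preserve_depth and tag not in VOID_TAGS:
            self.preserve_depth -= 1

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def strip_styles(html: str) -> str:
    parser = StyleStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    '<p style="font-size:100%"><span class="chemf nowrap">H'
    '<span style="vertical-align:-0.4em"><br>2</span>O</span>'
    ' is <span style="color:red">red</span></p>'
)
print(strip_styles(sample))
```

With option 2 the whitelist stays a single class that editors opt into; with option 1 every new exception means another entry in `PRESERVE_CLASSES`.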

Phuedx (WMF) (talkcontribs)

@OVasileva (WMF): As we discussed today, there is a third alternative:

We could conditionally disable the inline style stripping and span flattening behaviour on one or more wikis – these are the two processing steps that break the output of the chemf template for HTML previews – and test whether it causes more harm than good.


P.S. As a Sandman fan, you may be thinking of the same scene as I am when I say "there is a third alternative".

Jdlrobson (talkcontribs)

If we're not using TextExtracts this example works fine. The new endpoint generates a great summary for this example page.

Phuedx (WMF) (talkcontribs)

@Jdlrobson: For reference, please could you link a paste or dump the summary?

Jdlrobson (talkcontribs)

Summaries look like so

http://0.0.0.0:6927/en.wikipedia.org/v1/page/preview-html/Hydrogen_peroxide

<p style="font-size:100%; line-height:1;"><b>Hydrogen peroxide</b> is a <a href="/wiki/Chemical_compound" title="Chemical compound">chemical compound</a> with the formula <span class="chemf nowrap">H<span style="display:inline-block;margin-bottom:-0.3em;vertical-align:-0.4em;line-height:1em;font-size:80%;text-align:left"><br>
2</span>O<span style="display:inline-block;margin-bottom:-0.3em;vertical-align:-0.4em;line-height:1em;font-size:80%;text-align:left"><br>
2</span></span>. In its pure form, it is a colourless <a href="/wiki/Liquid" title="Liquid">liquid</a>, slightly more <a href="/wiki/Viscosity" title="Viscosity">viscous</a> than <a href="/wiki/Properties_of_water" title="Properties of water">water</a>. Hydrogen peroxide is the simplest <a href="/wiki/Peroxide" title="Peroxide">peroxide</a> (a compound with an oxygen–oxygen <a href="/wiki/Single_bond" title="Single bond">single bond</a>). It is used as an <a href="/wiki/Oxidizer" class="mw-redirect" title="Oxidizer">oxidizer</a>, <a href="/wiki/Bleach" title="Bleach">bleaching</a> agent and <a href="/wiki/Disinfectant" title="Disinfectant">disinfectant</a>. Concentrated hydrogen peroxide, or "<a href="/wiki/High-test_peroxide" title="High-test peroxide">high-test peroxide</a>", is a <a href="/wiki/Reactive_oxygen_species" title="Reactive oxygen species">reactive oxygen species</a> and has been used as a <a href="/wiki/Propellant" title="Propellant">propellant</a> in <a href="/wiki/Rocket" title="Rocket">rocketry</a>.<sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup> Its chemistry is dominated by the nature of its unstable <a href="/wiki/Peroxide" title="Peroxide">peroxide</a> bond.</p>

http://0.0.0.0:6927/en.wikipedia.org/v1/page/preview-html/Dioxygenyl

<p>The <b>dioxygenyl</b> <span>ion</span>, <span class="chemf nowrap">O<span style="display:inline-block;margin-bottom:-0.3em;vertical-align:-0.4em;line-height:1em;font-size:80%;text-align:left">+<br>2</span></span>, is a rarely-encountered <span>oxycation</span> in which both <span>oxygen atoms</span> have a formal <span>oxidation state</span> of +<span class="nowrap"><span> </span></span><span class="frac nowrap"><sup>1</sup><span>⁄</span><sub>2</sub></span>. It is formally derived from <span>oxygen</span> by the removal of an <span>electron</span>:</p>

Wiping out brackets is problematic

Summary by Jdlrobson

We decided to only wipe brackets out which contain at least one space.

Jdlrobson (talkcontribs)

Consider the following example:

A googolplex is the number 10^googol, or equivalently, 10^(10^100).

Wiping out brackets gives us the very confusing summary:

A googolplex is the number 10^googol, or equivalently, 10.

Jdlrobson (talkcontribs)

We can look for spaces inside each parenthetical and only remove the parenthetical if it contains at least one. That avoids this issue.
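A sketch of that rule on plain text, assuming ASCII parentheses only and treating nested groups independently. The real service works on HTML, so this only illustrates the decision logic; the function name is made up.

```python
def strip_parentheticals(text: str) -> str:
    """Remove balanced (...) groups that contain at least one space."""
    out = []          # characters kept so far
    open_groups = []  # one [start_index_in_out, has_space] pair per open "("
    for ch in text:
        if ch == "(":
            open_groups.append([len(out), False])
            out.append(ch)
        elif ch == ")" and open_groups:
            start, has_space = open_groups.pop()
            if has_space:
                del out[start:]               # drop the whole group
                if out and out[-1] == " ":
                    out.pop()                 # and the space that preceded it
            else:
                out.append(ch)                # e.g. "(10^100)" is kept
        else:
            if ch == " ":
                for group in open_groups:
                    group[1] = True           # every enclosing group has a space
            out.append(ch)
    return "".join(out)


print(strip_parentheticals("Jane Doe (born 1 May 1950) is a runner."))
print(strip_parentheticals("or equivalently, 10^(10^100)."))
```

Unbalanced brackets are left untouched, which errs on the side of keeping content.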

Why is response 'type' necessary?

Summary by Phuedx (WMF)

Following the principle of moving as much logic to the server as possible so that the client(s) can be as dumb as possible, we move the "type" of a preview from the Page Previews client to the server (whether we should show the disambiguation preview, for example).

Jdlrobson (talkcontribs)

Why do we care?

Phuedx (WMF) (talkcontribs)

Following the principle of moving as much logic to the server as possible so that the client(s) can be as dumb as possible, we move the "type" of a preview from the Page Previews client to the server (whether we should show the disambiguation preview, for example).

Currently, the client tests for properties of the response to figure out what it should do. Those checks may be safely removed if it accepts the server's notion of the type of the preview.
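A sketch of what accepting the server's notion of the type could look like on the client. Only the `disambiguation` case is taken from this discussion; the `standard` value, the `title` field, and the renderer functions are assumptions for illustration.

```python
from typing import Callable, Dict


def render_standard(summary: dict) -> str:
    return f"[page] {summary['title']}"


def render_disambiguation(summary: dict) -> str:
    return f"[disambiguation] {summary['title']}"


# The client keeps a plain lookup table and makes no decisions of its own.
RENDERERS: Dict[str, Callable[[dict], str]] = {
    "standard": render_standard,              # assumed default type name
    "disambiguation": render_disambiguation,
}


def render_preview(summary: dict) -> str:
    # Unknown or missing types fall back to the standard preview.
    renderer = RENDERERS.get(summary.get("type"), render_standard)
    return renderer(summary)


print(render_preview({"type": "disambiguation", "title": "Colombo Airport"}))
```

Adding a new preview type then means a new table entry on the client rather than another property check.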

Descope lang and dir fields

Jdlrobson (talkcontribs)

These seem to be related to the page - not the summary. Doing them inside the summary endpoint doesn't make much sense unless we expect the summary language to differ from the actual page language.

I'd suspect doing this at a higher level e.g. on all endpoints would make more sense.

What is the goal with them?
