Talk:Reading/Web/PDF Functionality

About this board

About giving feedback

Please read Reading/Web/PDF Functionality and comment on the plans we lay out there, to tell us what you need from the PDF service. We're especially interested in what you need in the future that isn't covered by those plans. If there's a bug with something that should work right now (e.g. you get an error message when you try to create a PDF), we need to fix it, of course, but that would already be on the agenda.

Update: (23 April 2018) PediaPress will take over the development of the books-to-PDF functionality. See Reading/Web/PDF Functionality for more information.

Updates: (24 February 2018)

- Kerning and spacing issues (https://phabricator.wikimedia.org/T178665): there have been a few reports of spacing issues in PDF rendering. The readers web team is currently looking into a solution. We will first be updating the fonts for PDFs (https://phabricator.wikimedia.org/T181200) over the week of November 27. This will resolve some but not all of the spacing issues. We'll be looking further into the remaining issues after the initial fix.

173.56.103.114 (talkcontribs)

The page "King's Indian Defense" cannot be downloaded as a PDF.

85.180.249.213 (talkcontribs)

Opera 12.18 (works mostly well, but is out of date) doesn't download PDF files. In Vivaldi 1.14 (in my opinion, possibly the successor to Opera), PDF download is OK.

Reply to "Cannot download PDF"
138.246.2.199 (talkcontribs)

No PDF Available

67.82.116.198 (talkcontribs)

I had no trouble with the PDF download. It was complete and of good quality.

93.30.137.24 (talkcontribs)

Hello, 10 March 2018, 10:10

VIDIANI Fontaine les Dijon

I also tried, because the wiki page as it stands refuses to PRINT.

Steelpillow (talkcontribs)

When posting to this topic, please specify whether you mean no PDF for a single article or no PDF for a whole book. You should be able to download an individual article. You cannot download a whole book at the moment because the software is disabled. This is expected. Please only post here if your experience differs. Steelpillow (talk) 11:59, 11 March 2018 (UTC)

Reply to "No PDF Available"

WHEN the hell will it be possible to use the book generator again ??

Marcus-wilke (talkcontribs)

You are highly impertinent, you beg for donations at all times but are not even able to repair such a tiny function ??

TheDJ (talkcontribs)

The tiny function was a completely separate parser running on old servers that hadn't seen maintenance in 5 years. It's gonna take a bit more time to replace it completely, and your patience is highly appreciated.

Marcus-wilke (talkcontribs)

well, still NOT running yet, huh ?? So, how much patience do you expect from your users? 1 year ? 2 ? or even more ?

It was the main reason for any donation, and now, after I finally paid some bucks, it doesn't work any more, and hasn't for well over a year !!!

That's almost kinda like fraud ... at least I, personally and subjectively, perceive it that way.

Dirk Hünniger (talkcontribs)

In the meantime you can use http://mediawiki2latex.wmflabs.org/. You can use it to create PDF files as well as other formats from articles or collections. The only problem is that the capacity of the server is very limited, so it will only work for single articles or collections of very few articles. If you want more, you will have to install the current version of Ubuntu and use the mediawiki2latex package from the command line. Good luck.

196.21.98.134 (talkcontribs)

Hey Marcus,

I'm sure they are trying their best, and you are coming off as being a bit rude about it. Not sure if you meant it that way, just saying your tone is a bit harsh.

Also, a donation means giving freely without expecting something back; it was not a purchase of the option to generate books. They might also be using the money to keep the servers running, pay employees, generate and write up information, contract people to fix the book-generating function, etc.

"Update: (23 April 2018) PediaPress will take over the development of the books-to-PDF functionality. See Reading/Web/PDF Functionality for more information."

"Updates: (24 February 2018) [....] - Update on the book creator. We're still in the process of performance testing the new renderer (https://phabricator.wikimedia.org/T178278). Once this stage is complete, we will be able to provide more details on its capacity to render books."

If you would like to see what the donations are used for, for example, check out:

1) https://donate.wikimedia.org/w/index.php?title=Special:LandingPage&country=ZA&uselang=en&utm_medium=sidebar&utm_source=donate&utm_campaign=C13_en.wikipedia.org

"Where your donation goes

Technology: Servers, bandwidth, maintenance, development. Wikipedia is one of the top 10 websites in the world, and it runs on a fraction of what other top websites spend.

People and Projects: The other top websites have thousands of employees. We have about 300 staff to support a wide variety of projects, making your donation a great investment in a highly-efficient not-for-profit organization."

2) https://wikimediafoundation.org/wiki/FAQ/en

"Why should I donate and where does my money go?

Donations to the Wikimedia Foundation help sustain free knowledge through Wikipedia and our ecosystem of projects. Your contributions support technology to keep the sites fast, secure, and accessible; for Wikimedia programs and initiatives to expand access and support free knowledge globally; and for grants to volunteer contributors to improve and enrich the knowledge on Wikipedia and the Wikimedia sites. Your donations support this work, and so much more, to ensure Wikipedia remains accessible and valuable for many generations to come."

3) On the same page linked at 2, just a bit further down:

"Where can I find more financial information?

The Wikimedia Foundation's 2016 - 2017 annual report covers the fiscal year from 1 July 2016 to 30 June 2017. The Foundation's annual report shares some of the voices of the hundreds of thousands of people who make the Wikimedia movement possible.

The Wikimedia Foundation 2017-2018 Annual Plan describes our budget for the current fiscal year. It contains a summary of our strategic goals as an organization, financial details on spending and revenue, and detailed explanations and risk analysis."

Reply to "WHEN the hell will it be possible to use the book generator again ??"
212.185.85.78 (talkcontribs)

PDF export: Please include the widows and orphans rule.

TheDJ (talkcontribs)

As far as I know, that rule is included (as it is in normal print). However, the behaviour of widows and orphans only works WITHIN a single paragraph. As Wikipedia pages often have lots of paragraphs and headers on the page as well as many floating images and other elements, the behaviour you expect might not be possible to achieve for the renderer.

If you have specific examples that go wrong, you are welcome to upload and share them with us, for evaluation.

Dirk Hünniger (talkcontribs)

mediawiki2latex does respect the widows and orphans rule.

Reply to "Widows and orphans rule"
Steelpillow (talkcontribs)

The old OCG could output in a variety of formats. Is it correct to assume that headless Chrome must first have all the wikitext pages, copyright small print, etc. pre-processed into HTML+CSS before rendering? If so, is it possible to intercept the intermediate HTML/CSS format and offer an HTML download option? That might be say CHTM or ePub or just a raw zip. Or if the book is assembled in the DOM or whatever, could that be persuaded to spit out the HTML? This would then allow client-side conversion to other formats, which is next to impossible from PDF.

Bert Niehaus (talkcontribs)

Yes, any format that can be parsed and post-processed would be a great help for generating derived products such as w:Open Educational Resources. Spencer Kelly is currently doing a great job in developing wtf_wikipedia.js further. It allows the generation of JSON from a MediaWiki article (see demo https://niebert.github.io/Wiki2Reveal/wtf_wiki2html.html ) and conversion to plain text, and maybe other formats will follow. The conversion can be done on the client side, even in a browser, just by contacting the MediaWiki API, downloading the wiki source of the article and parsing it into whatever content product is needed. Nevertheless, the PDF generation is great and very much appreciated.
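
The "contact the MediaWiki API and download the wiki source" step could look roughly like the Python sketch below. It is only an illustration, not how wtf_wikipedia.js does it: the function names and the User-Agent string are made up for this example, and the endpoint defaults to English Wikipedia as an assumption. The query parameters are those of the standard Action API revisions query.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption for this sketch: English Wikipedia's Action API endpoint.
DEFAULT_API = "https://en.wikipedia.org/w/api.php"


def build_wikitext_url(title, api=DEFAULT_API):
    """Build an Action API URL asking for the latest wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return api + "?" + urlencode(params)


def extract_wikitext(response):
    """Pull the wikitext out of a decoded formatversion=2 response."""
    page = response["query"]["pages"][0]
    if page.get("missing"):
        raise KeyError("page not found: " + page.get("title", ""))
    return page["revisions"][0]["slots"]["main"]["content"]


def fetch_wikitext(title, api=DEFAULT_API):
    """Download the wiki source of a page (needs network access)."""
    # Hypothetical User-Agent; replace it with one identifying your tool.
    request = Request(build_wikitext_url(title, api),
                      headers={"User-Agent": "wikitext-sketch/0.1 (example)"})
    with urlopen(request) as reply:
        return extract_wikitext(json.load(reply))
```

The returned wikitext would then be handed to whatever parser produces the desired output format.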

Reply to "HTML output"

out of the frying pan and into the fire

BerlinSight (talkcontribs)

While I appreciate that WP got rid of the two-column layout and the lack of tables in PDF output, I miss the quality of the LaTeX-generated PDFs. The typographical problems have already been discussed, but there is another issue. Even vector images (like the one used on the page "Spiral model"; inserting the link here doesn't work: "page does not exist") are included only in a rasterized version. The resolution is adequate only for screen display and generally too low to, e.g., read the text they contain in a printout. Thus they are often useless.

TheDJ (talkcontribs)

This is a known problem, tracked as T178664

BerlinSight (talkcontribs)

OK, I see. Yet I think WP is wasting time reinventing the wheel. TeX/LaTeX is a professional-quality typesetting system, which is Free / Open Source Software and contains decades of work. My prediction is that Electron will never reach the point where TeX already is WRT typographic quality. IMHO leaving the user with two options for generating PDFs (one with high-quality typesetting and one with tables) would be a better choice than the current status.

TheDJ (talkcontribs)

I think the problem here is that the systems are fundamentally different. HTML is made for flexible and dynamic layout, adapting to any situation where it is asked to render, and (these days) for a lot of interactivity.

LaTeX is fundamentally designed for very reproducible and specific layout in controlled circumstances, mostly for non-interactive situations. You can't make websites with LaTeX (it's hard to put jello into a straitjacket), and therefore you cannot print them with it either. And HTML cannot do what LaTeX can do.

BUT HTML is catching up. There are specs for adding print-specific context (page size, page-break info, etc.) to HTML, for instance, but they are not yet supported. It's also a technology that is closer to what we are used to within our own ecosystem, making it easier to support for the engineers who have to do the incidental work on it, and we have to duplicate less work across both stacks, since most of the time the easy stuff will just work.

Neither is perfect, neither will be perfect, but one is sustainable for us, and the other is not.

BerlinSight (talkcontribs)

Sorry, but it looks like you are missing the point completely. I did not ask to rewrite WP in LaTeX. The former PDF engine used LaTeX as a backend to create high-quality PDFs, alas lacking tables. As the new engine has tables but awful typographic and image quality, and quite likely will never match the output quality of the old PDF engine, I would prefer to have the choice of which one to use (or better, the old one with tables and a single-column layout, but that does not seem possible).

TheDJ (talkcontribs)

I was just talking about the technology stack:

  • normal: wikicode -> html
  • old engine: wikicode -> LaTeX -> PDF
  • new engine: wikicode -> html -> PDF

We removed one very expensive translation step from the system, that had no maintainers and no experts available that were able to keep it online.

THAT is the only thing that matters. It's a resourcing decision. If you want to quit your existing job and for free improve the old system, then that's fine.
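
The resourcing argument can be made concrete with a small Python sketch that treats each pipeline from the list above as a chain of conversion steps and asks which steps an engine needs beyond those the normal page rendering already requires. The names `PIPELINES`, `steps` and `extra_steps` are invented for this illustration; only the three chains themselves come from the list.

```python
# The three pipelines from the list above, as ordered lists of formats.
PIPELINES = {
    "normal": ["wikicode", "html"],
    "old engine": ["wikicode", "LaTeX", "PDF"],
    "new engine": ["wikicode", "html", "PDF"],
}


def steps(stages):
    """Turn a list of formats into a set of (source, target) conversion steps."""
    return set(zip(stages, stages[1:]))


def extra_steps(name, baseline="normal"):
    """Conversion steps `name` needs that `baseline` does not already maintain."""
    return steps(PIPELINES[name]) - steps(PIPELINES[baseline])


for engine in ("old engine", "new engine"):
    print(engine, "->", sorted(extra_steps(engine)))
```

In this model the old engine adds two conversion steps of its own (wikicode to LaTeX, LaTeX to PDF), while the new engine reuses the existing wikicode to html step and adds only html to PDF.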

Dirk Hünniger (talkcontribs)
Debenben (talkcontribs)

@Dirk Hünniger great work!

I am also disappointed by the typographic quality of the Chromium rendering engine. Mathematical formulas especially look horrible. I did not know about mediawiki2latex; why don't we mention it as an alternative and let users decide what they prefer?

Debenben (talkcontribs)

I tested the claim that it can handle tables on the Schwarzschild metric article, which was mentioned somewhere below:

mediawiki2latex -m -g -u https://de.wikipedia.org/wiki/Schwarzschild-Metrik -o "schwarzschild.pdf"

Result: all tables are rendered perfectly. Mathematical formulas look perfect. Only one drawback: some URLs don't get any line breaks, so they sometimes extend beyond the page margins.

Quiddity (WMF) (talkcontribs)

Posting to bump cache, and hopefully fix missing comments.

Reply to "out of the frying pan and into the fire"
Kaartic (talkcontribs)

The PDF output for the w:en:Glossary of Sudoku page is badly formatted in a few ways:

  • There are several instances of overlapping text in the whole article and they are more pronounced in the 'Other terminology' section.
  • There are instances of the 'Spaces between words missing (visually)' issue in some places, e.g. there should be a space between 'givens' and 'for' in the definition for 'Minimum number of clues' in the 'Other terminology' section. I say (visually) because copying and pasting the text into a text editor does reveal the space between the two words.
  • In the 'Notes' section the text is linkified when there are links, but the URL is also displayed, which seems redundant and spoils the readability of the text. It might be better to avoid showing the URL when the text is linkified. This seems to happen when the links are created using the {{cite web}} template.
TheDJ (talkcontribs)

The first two are the same issue, which is tracked as phab:T178665. The latter is intentional: in print you usually don't have links, and since they are so critical for the sourcing, they are always added in the print version. Since there is no special CSS PDF medium, and also because many of the PDFs are actually used later on for printing, it's probable that the links in the referencing will remain this way for the foreseeable future.

Kaartic (talkcontribs)

Thanks. Regarding the links, in that case why aren't the URLs of the links created using [URL label] revealed but just linkified?

PMiazga (WMF) (talkcontribs)

The link is added by using [ link link ] and it's rendered correctly. The best/easiest way is to edit the page and change the link name.

There is an unconditional rule in the CSS (https://github.com/wikimedia/mediawiki/blob/master/resources/src/mediawiki.legacy/commonPrint.css#L69). Each time the browser renders a link in print view, it adds (href) after it. There is no way to do conditional checking of link content in CSS, so we cannot check whether the link href and content are identical. This would require additional parsing (or JS processing before we print the document).
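
As a rough idea of what that additional parsing could look like, here is a minimal Python sketch using only the standard library. It is not part of MediaWiki; `LinkCollector` and `redundant_links` are names invented for this example. It walks an HTML fragment and reports the links whose visible text is already identical to their href, which are the ones where printing the URL again adds nothing.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, visible text) for every <a href=...> in a fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def redundant_links(html):
    """Return the hrefs whose visible link text already equals the URL."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [href for href, text in collector.links if href == text]
```

A pre-processing step could use this result to mark such links (for instance with a class that the print stylesheet excludes) before the document is handed to the renderer; that marking step is left out here.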

Kaartic (talkcontribs)

I wasn't referring to the links added using [link link] in the "Notes" section. I was referring to the "MAA Math Games – Sudoku Variations" link in the References section of the "Glossary of Sudoku" article. It's actually created using [URL label], but the style you point to doesn't seem to be applied to it, while it should have been. (BTW, thanks for the link)

PMiazga (WMF) (talkcontribs)

All external links will have the label and the URL. That's how desktop print has worked for a pretty long time. The new print mode (for mobile; please try printing an article using the mobile site) will not display the URLs, as those PDFs are designed to be read on mobile devices. The desktop PDF mode (using the Vector skin) is used both for reading on a computer (where you can click a link) and for printing (where we need to show the full URL in case the user wants to visit that website).

Kaartic (talkcontribs)

OK, didn't expect there was confusion about which printing I was referring to. Sorry about that. I was stating the issues I was facing with the PDF generated using the "Download as PDF" feature in desktop. I was elaborating on that as I found that behaviour odd. I guess it's time to speak with some screen shots.

I initially downloaded the PDF version of the "Glossary of Sudoku" using the "Download as PDF" feature. I noticed that the links in the "Notes" section of the PDF were linkified and that the corresponding URLs were also shown (the reason for which has been explained, fine). I also noticed that a link in the "References" section of the same article (the "MAA Math Games – Sudoku Variations" link) was styled differently: linkified, but with the URL not shown. I find this odd because of the inconsistency in styling links between the two sections.

Reply to "Bad formatting of text in PDF output"
2601:248:527F:1B40:55B7:8051:4527:D98A (talkcontribs)

The page I was reading I needed translated into English, which worked well until I went to download it. Because of the way you have it set up, it didn't download it in English, it downloaded the original page in German. That is no help whatsoever.

79.236.79.170 (talkcontribs)

Hello everyone,

Empty pages in printout: page 1 with text, page 2 empty, page 3 with little text, etc.

Regards,

Frank

Johan (WMF) (talkcontribs)

You mean that you used a machine translation function in your browser? I'm afraid it'll continue to be the case that you can only download PDFs in the language of the page, as the function gets its information from our article as it looks on our server and not any changes you've made to it on your computer.

Reply to "autodownloading"
Archimedic (talkcontribs)

For a very, very long time we have been seeing Wiki's warning about the still-unsolved problems with PDFs. Although I have never had problems with these documents created by Wiki, these problems seem to have persisted for a very, very long time. In situations like this, it is better to crush the old system (because you have been trying to make it work for years) and to set up an entirely new PDF creator. Be happy with it!

Johan (WMF) (talkcontribs)

We're actually doing exactly that.

89.204.139.173 (talkcontribs)

Sorry, no, you aren't. What was obviously meant here is to crush the newer old system, which solved one issue while creating equally bad new problems. Alas, that's not going to improve the situation in the short run. I'd rather recommend just giving the user the choice between the current ugly typographic and image quality with tables and the older high-quality output without tables, i.e. reactivate the old PDF export extension and offer the option to use either that or Electron, just like it was when Electron was introduced. Then WM should in fact look for a brand-new solution, but without rushing it.

Steelpillow (talkcontribs)

The original system is not an option any more. The hosting hardware has gone, the software is no longer supportable. It is not worth obtaining old-style hardware from somewhere and commissioning it just as a temporary measure, even if the software were supportable.

Reply to "change the system"
Steelpillow (talkcontribs)

Back in April, you and PediaPress were "working on the details and schedule". Has anything happened since then? Is there a public repo or somewhere else that this is being tracked?

Johan (WMF) (talkcontribs)

There's been some work – over the last week we've been making sure the design details make sense, for example – but the short answer is "not really", as PediaPress are working on this on the side, and have other full-time work. The goal is to have something presentable by Wikimania in a couple of weeks.

(We understand the delay is unfortunate. The problem, mainly, is that this isn't something we wanted to do or properly had all the resources for: the PDF renderer was breaking down, and then our solution didn't perform well for books.)

Steelpillow (talkcontribs)

Thank you. Best of luck at Wikimania and I look forward to your conference report. :)

Reply to "Progress?"