User:Sohom data/The Final Report




The pagelist syntax is a custom syntax used by the ProofreadPage extension (deployed at Wikisource) to record page number metadata for individual books. It combines ranges, labels and individual page number assignments into a compact description of the page number of every page of a book (which may run to a few hundred pages). For example, a 200-page book with 1 cover page followed by 10 blank or non-numbered pages (such as the title, pictures and publisher pages) at the start can be represented using this efficient one-liner: <pagelist 1="Cover" 2to11="-" />. Pretty neat!
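To make the one-liner concrete, here is a minimal Python sketch that expands such a tag into one label per scan. It is an illustration only, not the code ProofreadPage runs: the function name `expand_pagelist` is made up for this example, it handles just the two attribute forms shown above (`N="label"` and `MtoN="label"`), and it assumes that a scan without an explicit label is shown under its own scan number and that later attributes override earlier ones.

```python
import re

# Matches attributes such as 1="Cover" or 2to11="-" (the only forms used in this report).
ATTRIBUTE = re.compile(r'(\d+)(?:to(\d+))?="([^"]*)"')


def expand_pagelist(tag: str, total_pages: int) -> list[str]:
    """Return one label per scan; index 0 holds the label of scan 1."""
    # Assumption: a scan with no explicit label is shown under its own scan number.
    labels = [str(n) for n in range(1, total_pages + 1)]
    for start, end, label in ATTRIBUTE.findall(tag):
        first = int(start)
        last = int(end) if end else first
        # Assumption: attributes are applied in order, so later ones win.
        for scan in range(first, min(last, total_pages) + 1):
            labels[scan - 1] = label
    return labels


labels = expand_pagelist('<pagelist 1="Cover" 2to11="-" />', total_pages=200)
print(labels[0], labels[1], labels[10], labels[11])  # Cover - - 12
```

Three short attributes are enough here to describe all 200 scans, which is where the compactness of the syntax comes from.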

Around the end of 2019, the Wikisource community participated in the Community Wishlist Survey 2020 and voted to replace the old interface with a new editing widget that wouldn't require the user to rely on external software. The task received 51 votes supporting its implementation and ranked 6th among all wishes, which is how it became a project at Google Summer of Code 2020 :)

Identifying problems

The pagelist syntax, however, posed a problem. While it was efficient, it required a significant amount of mental calculation to get the pagelist to display exactly what the editor wanted. This was especially pronounced when a book had a complex page-numbering system (for example, a book where the numbering restarts after every chapter, or one with interleaved images).

Additionally, the old interface gave an editor no way to look at the individual pages of the book (something that is essential for looking up the page numbering). This meant the user had to download the book separately and go back and forth between the browser and the downloaded book quite a few times to verify that the numbering was as required.

Coming up with solutions - Image display

From the start, it was pretty clear how the second problem would be solved. We needed some way to display the image of any or all of the pages alongside the pagelist itself. However, displaying 300 or so images on one screen would defeat the purpose of the images in the first place (each image would be about 30x30px, which would make reading the page number incredibly difficult), so we went for displaying one image at a time in a separate pane beside the pagelist input field. This left us with one decision to make: how would we allow the user to choose the page to be displayed?

Commons allows the user to go to a different page by using a dropdown menu containing a list of all the page numbers. Internet Archive, on the other hand, has a slider that determines which page the user wants to go to. However, neither of those two really fit the brief. We needed a way to tell the user explicitly what scan number they were on and what page number had currently been set for that page. We thus settled on parsing the pagelist syntax's HTML output and creating a representation of that output using buttons.
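As a rough illustration of that idea, the Python sketch below reads a rendered list of page links and turns each one into a (scan number, label) pair that a button could display. The markup in `sample_html` is invented for this example and is only an assumption about what the rendered output might look like; the class name `PageLinkParser` is also hypothetical, and the actual widget is not written this way.

```python
from html.parser import HTMLParser


class PageLinkParser(HTMLParser):
    """Collect (scan number, displayed label) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.pages = []    # (scan number, label) pairs in document order
        self._scan = None  # scan number of the link currently being read
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Assumption: the scan number is the last path segment of the link.
            tail = href.rsplit("/", 1)[-1]
            if tail.isdigit():
                self._scan = int(tail)
                self._text = []

    def handle_data(self, data):
        if self._scan is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._scan is not None:
            self.pages.append((self._scan, "".join(self._text).strip()))
            self._scan = None


# Hypothetical rendered output, invented for this example.
sample_html = (
    '<a href="/wiki/Page:Book.djvu/1">Cover</a> '
    '<a href="/wiki/Page:Book.djvu/2">-</a> '
    '<a href="/wiki/Page:Book.djvu/3">1</a>'
)
parser = PageLinkParser()
parser.feed(sample_html)
parser.close()

# Each pair carries what a button needs to show: the scan and its current label.
buttons = [f"scan {scan}: {label}" for scan, label in parser.pages]
print(buttons)  # ['scan 1: Cover', 'scan 2: -', 'scan 3: 1']
```

Keeping the scan number and the label together in one pair is what lets each button tell the user both where they are in the scan and what page number is currently set for it.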

Coming up with solutions - Visual mode

Coming up with a proper way to solve the first problem turned out to be tougher. The solution was obviously to implement a visual mode, but how? For one, we only had space to display one image at a time, which meant that we had to somehow let the user enter a range from the current page to some other page without knowing where that ending page was.

Looking at the pagelist syntax, we realized that the same pagelist could be written in slightly different ways. For example, a book with 10 non-numbered pages at the start and images on pages 3, 4 and 5 could be represented as <pagelist 1to10="-" 3to5="image" /> as well as <pagelist 1to2="-" 3to5="image" 6to10="-" />. Both are valid ways of representing the page numbers, but they structure the data quite differently. This meant that we could build a pagelist just from the user telling us where the numbering changes. Given the user's input, we make an informed decision about how the new pagelist should look based on the old one. Doing this every time the editor updates the pagelist lets us merge the ranges in O(n) time and produce a (more or less) optimal pagelist, while also providing a live-updating preview of how the pagelist looks.
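The single-pass merging can be sketched in a few lines of Python. This is only an illustration of the idea under simplifying assumptions, not the widget's actual code: the function name `build_pagelist` is made up here, the input is taken to be one label per scan, and a scan labelled with its own scan number is assumed to need no attribute at all.

```python
def build_pagelist(labels: list[str]) -> str:
    """Collapse per-scan labels into a compact <pagelist /> tag in one pass."""
    attributes = []
    run_start = 0  # index of the first scan in the current run of equal labels
    for i in range(1, len(labels) + 1):
        # Close the current run when the label changes or the list ends.
        if i == len(labels) or labels[i] != labels[run_start]:
            first, last, label = run_start + 1, i, labels[run_start]
            # Assumption: a scan labelled with its own number needs no attribute.
            if not (first == last and label == str(first)):
                key = str(first) if first == last else f"{first}to{last}"
                attributes.append(f'{key}="{label}"')
            run_start = i
    if not attributes:
        return "<pagelist />"
    return "<pagelist " + " ".join(attributes) + " />"


labels = ["-", "-", "image", "image", "image", "-", "-", "-", "-", "-"]
print(build_pagelist(labels))  # <pagelist 1to2="-" 3to5="image" 6to10="-" />
```

Each scan is visited once, so the cost grows linearly with the number of scans, which is cheap enough to rerun on every edit and keep a preview up to date. This sketch always produces the second of the two equivalent forms mentioned above (non-overlapping ranges).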


It's been a really fun journey this summer. My mentors Satdeep Gill and Sam Wilson helped me every step of the way, helping me figure stuff out and putting me back on track when I went down dead ends. Special thanks to Pavithra Eswamoorthy and Srishti Sethi, both of whom were very patient during the initial period when I was still trying to figure out how Wikimedia worked.

What's next

I intend to work exclusively on the ProofreadPage extension and the Pagelist Widget for quite a while, responding to community feedback and fixing bugs that may surface. Additionally, I do want to keep contributing to other areas of Wikimedia and help out with various technical issues there.

Code Commits

Further information