Moved from page during rewrite
To get e.g. the current wikitext of a given page in the main namespace:
SELECT LEFT(old_text,1024) FROM page INNER JOIN revision ON page_latest = rev_id INNER JOIN text ON rev_text_id = old_id WHERE page_title = 'Your page title' AND page_namespace=0;
Note: The use of LEFT(old_text,1024) prints only the first 1024 characters. Because old_text is a blob, the query would otherwise probably display just "[BLOB - <n>.<n> KiB]", where <n>.<n> is the size of your page content in kilobytes. If you are using this query in a program and need the entire content, be sure to reference old_text directly and not LEFT(old_text,1024).
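For completeness, the untruncated form the note above describes (the one a program would use) is simply the same join without LEFT():

```sql
-- Same join as above, but returning the full blob instead of the
-- first 1024 characters:
SELECT old_text
FROM page
INNER JOIN revision ON page_latest = rev_id
INNER JOIN text ON rev_text_id = old_id
WHERE page_title = 'Your page title' AND page_namespace = 0;
```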
"varchar(255)" vs. "varchar(255) binary"
Is it really necessary to have "varchar(255) binary" as the column type for page_title (and some other columns of this and other tables) instead of "varchar(255)"? I just spent three evenings debugging, trying to find out why sqlalchemy (a Python ORM framework) no longer gave me Unicode strings (after an update). I finally found out that the "binary" after the varchar(255) caused the problem. I tried to find the "varchar(255) binary" datatype in the MySQL datatype documentation, but it is not described there (as if it were not a supported datatype): http://dev.mysql.com/doc/refman/5.5/en/create-table.html. Since MySQL does not complain when creating a table with it, I guess this datatype is either a hidden feature of MySQL or a relic.
I think the motivation for making the page_title column "varchar(255) binary" was to avoid encoding problems with latin1 databases having to store UTF-8. Wouldn't it be better to use CHARACTER SET utf8 in the CREATE TABLE statement?
- I am also investigating that. I found the following discussion.
- MySQL 5.0 and 5.1 do not support all UTF-8 characters (four-byte UTF-8 sequences). MySQL 5.5 introduced the utf8mb4 character set, which solves the issue.
- But it seems that for 5.0 and 5.1, a workaround is to add "binary". --Kipmaster 15:46, 30 November 2011 (UTC)
- I couldn't get "varchar(255) binary" on my system either, but it seems to be the same as varbinary(255). --Kipmaster 21:09, 30 November 2011 (UTC)
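One way to check what the server actually created from the schema (the type and collation MySQL reports back may differ from what was written in the CREATE TABLE statement):

```sql
-- Show the definition MySQL actually stored for the whole table ...
SHOW CREATE TABLE page;
-- ... or for the title column alone, including its collation:
SHOW FULL COLUMNS FROM page LIKE 'page_title';
```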
When I import this I get titles in the form of '41666768616e697374616e50656f706c65'. How do I fix it so it is imported as UTF-8?
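That string looks like the raw bytes of a binary-collated title being displayed as hex by the client (a common symptom with varbinary columns). As a quick sanity check, the example title above can be decoded in MySQL; if this yields a readable title, the stored data is fine and only the client's charset handling is at fault:

```sql
-- Decode the hex bytes back into a readable string:
SELECT CONVERT(UNHEX('41666768616e697374616e50656f706c65') USING utf8);
-- → 'AfghanistanPeople'
```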
page_is_redirect vs redirects table
The documentation for the page table states:
to find pages that are not redirects add "p.page_id not in (select rd_from from wikidb.redirect)"
Yet, the page_is_redirect column in the page table is supposed to be 1 if the page is a redirect, and 0 otherwise.
Running some queries on the database, one can see that there are big discrepancies between the redirect table and the page_is_redirect column.
For example, using the English Wikipedia dump from April 2011, if we query the number of pages whose page_is_redirect attribute is set to 1, but whose page_id does not appear in the redirect table, we get 2303 such pages:
mysql> SELECT COUNT(page_id) FROM page WHERE page_is_redirect = 1 AND NOT page_id IN (SELECT rd_from FROM redirect);
+----------------+
| COUNT(page_id) |
+----------------+
|           2303 |
+----------------+
1 row in set (16.97 sec)
If we then search for all pages whose page_is_redirect attribute is set to 0, but whose page_id does appear in the redirect table, we get 368269 such pages:
mysql> SELECT COUNT(page_id) FROM page WHERE page_is_redirect = 0 AND page_id IN (SELECT rd_from FROM redirect);
+----------------+
| COUNT(page_id) |
+----------------+
|         368269 |
+----------------+
1 row in set (19.78 sec)
It would seem, then, that the only reliable way to determine all redirect pages would be to scan the text table, look at the wikitext of the latest revision of every page, and determine whether or not it contains the string "#REDIRECT". Due to the size of this table, this will take quite a while, so I'm hoping for a better / quicker way to determine whether or not a page is a redirect. Can I trust the redirect table, even though some pages are not listed in it but have their page_is_redirect attribute set to 1? Please advise. Thank you! 22.214.171.124 21:52, 9 June 2011 (UTC)
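The text-table scan described above could be sketched roughly like this. This is only a sketch: it checks just the beginning of the latest revision and assumes the text is stored uncompressed (it will not match rows compressed via $wgCompressRevisions or held in External Storage):

```sql
-- Pages whose latest revision text starts with a redirect marker:
SELECT p.page_id, p.page_title
FROM page p
INNER JOIN revision r ON p.page_latest = r.rev_id
INNER JOIN text t ON r.rev_text_id = t.old_id
WHERE UPPER(LEFT(t.old_text, 9)) = '#REDIRECT';
```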
page table = "core of the wiki"?
- Indeed. And, if anything, page is the one of the three you need the least, too - all it has that the others don't is titles. -— Isarra ༆ 08:14, 9 October 2013 (UTC)
- Could someone please fix the link includes/Defines.php in section page_namespace which redirects to phabricator's general search page.
- Could someone please add a short definition of new to the section page_is_new.
- How's that? – Robin Hood (talk) 21:04, 27 May 2016 (UTC)
No Text Table
The text of the page itself is stored in the text table.
- Well, that's impossible. Even if you have configured External Storage, there's still a text table. --Ciencia Al Poder (talk) 09:36, 10 June 2018 (UTC)
- Those column names have their history, but anyway, documentation exists so you don't have to infer where or how things are stored by looking at how they're named. --Ciencia Al Poder (talk) 09:25, 12 June 2018 (UTC)
How to query live pages?
Is this the correct selection to get only live pages?
This query returns results, but seems to be missing some pages (based on my wiki content):
SELECT page.page_title FROM `text`, `page` WHERE `old_text` LIKE '%test%' AND page.page_latest = text.old_id
This also works. Wondering how to check if index is up-to-date.
SELECT page_id, page_namespace, page_title FROM `page`,`searchindex` WHERE page_id=si_page AND MATCH(si_text) AGAINST('+''gross''' IN BOOLEAN MODE) AND page_is_redirect=0
See https://www.mediawiki.org/wiki/Manual:Searchindex_table
- This works only if page text is not compressed (Manual:$wgCompressRevisions). --Ciencia Al Poder (talk) 09:17, 13 June 2018 (UTC)
- Text table will be compressed if the setting is set to true. --Ciencia Al Poder (talk) 19:39, 14 June 2018 (UTC)
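To check whether your text rows are actually compressed, the old_flags column records how each blob is stored (for example 'utf-8' for plain text, or flags containing 'gzip' when $wgCompressRevisions is in effect):

```sql
-- Summarize the storage flags used in the text table:
SELECT old_flags, COUNT(*) AS rows_with_flag
FROM text
GROUP BY old_flags;
```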
Is there a way to get the most viewed pages list?
In former MediaWiki versions I used to get the most viewed pages list with this MySQL query:
SELECT DISTINCT page_title FROM page WHERE page_namespace=0 AND page_counter >= 0 ORDER BY page_counter DESC;
The page_counter field is not there any more. Is there a way to accomplish the task I was used to?
How exactly does MediaWiki store the contents of a page inside a blob in the database? What would be the best technique to search through those contents when searching the site?
Hello everyone. I am trying to create my own website for a specific goal, and one of the big parts of the site involves a "wiki", pretty much the same as MediaWiki/Wikipedia except with differences to fulfill my goal. I came to this page to ask a couple of questions about how pages on MediaWiki are stored. I tried asking these kinds of questions on discussion forums, but they didn't really help much. Since this is MediaWiki, I thought asking my questions here would give me much better answers.
My first question is: how exactly are MediaWiki/Wikipedia articles/pages stored in a database? I downloaded MediaWiki and searched the internet, and found that the page contents are stored inside a JSON file which is then stored in a blob in the database. But when I asked on reddit how I should search through these without any kind of problems or delays, I was just told that I should not do it, and should instead store the JSON contents in the table and use JSON commands to search through them. That made me wonder: if this method of storing was so "ineffective", then why would the MediaWiki developers, who seem very knowledgeable about this subject, use such a method? I would also like to point out that I am going to develop my own kind of visual editor instead of having two different ones, just to make it much easier and faster, and store the contents as HTML instead of XML, which is what MediaWiki does (I am saying this just to make sure that what I am doing could work).
My second question is: how, with all these layers to go through to get to the page content (the database, the blob, etc.), do you actually search through that data without any delays or trouble while searching the website? Someone told me that Wikipedia uses its own search indexing system, which makes me ask: how exactly does it work, and how can I create one of my own?
p.s. do not try to convince me to "just use MediaWiki", because I have had trouble getting some things to work, and again, the wiki part of my website is just one portion, while the other is beyond MediaWiki's use.
- MediaWiki is used by Wikipedia, a very large wiki site, with lots of load-balancing requirements, etc. Storing a single blob per page would only be more efficient if you did not have a history of revisions (where you can see previous versions of a page), pages with hundreds of thousands of revisions, and a lot of users both reading and editing the website.
- I don't know where you read that contents are stored in a JSON blob in the database. That's plainly false. There's a thing called a page (this table), a revision, slots for a revision (currently only a main one, but in the future there may be more), and the page text in the Manual:text table. There's no JSON involved.
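The chain named above can be followed with a join. This is only a sketch, assuming the modern multi-content-revision schema (slots and content tables), where content_address points into the text table as 'tt:<old_id>':

```sql
-- page -> revision -> slots -> content -> text
SELECT t.old_text
FROM page p
JOIN revision r ON r.rev_id = p.page_latest
JOIN slots s    ON s.slot_revision_id = r.rev_id
JOIN content c  ON c.content_id = s.slot_content_id
JOIN text t     ON c.content_address = CONCAT('tt:', t.old_id)
WHERE p.page_namespace = 0 AND p.page_title = 'Your page title';
```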
- Also, MediaWiki doesn't store contents in XML (that's a contradictory statement in your post). XML is, however, used for importing and exporting pages.
- What Wikipedia uses for search is Extension:CirrusSearch, which uses ElasticSearch as its engine. --Ciencia Al Poder (talk) 12:49, 15 July 2020 (UTC)