Manual talk:Page table

The following discussion has been transferred from Meta-Wiki.
Any user names refer to users of that site, who are not necessarily users of (even if they share the same username).

Sample MySQL code is all incorrect starting on Mediawiki 1.35[edit]

Every query currently on the Sample MySQL code uses `rev_text_id` which was deprecated on Mediawiki 1.31 and removed on Mediawiki 1.35. You have to link `page`.`page_latest` = `slots`.`slot_revision_id` then `slots`.`slot_content_id` = `content`.`content_id` then SUBSTR(`content`.`content_address`, 4) = `text`.`old_id`

--AniLeo (talk) 19:08, 21 October 2020 (UTC)Reply[reply]

Cannot execute the query: Unknown column 'Mrs._Harford'[edit]

I'm trying to access the information but I get an error when I place something in the Page_Title spot

wiki_page_title = 'Mrs._Harford' Cannot execute the query: Unknown column 'Mrs._Harford' in 'where clause'

What could be causing this?

if (! $thiscontent = mysql_query('SELECT old_text from wiki_page, wiki_revision, wiki_text where page_title = "Mrs._Harford" and page_latest = rev_id and rev_text_id = old_id')) { die("Cannot execute the query: ".mysqlerror()); } else {echo $thiscontent;}

I figured that the quotes were causing problems, but now I only get Resource Id #3


when you ask for »"Mrs._Harford"«, you're actually asking for a column named literally »Mrs._Harford«. this is the effect of double quoting: protecting the content from further interpretation.

what you want is the value »Mrs._Harford«, a string typed value. the way to tell this to the sql parser is by using single quotes: »'Mrs._Harford'«.

obviously, since you are writing this in a string being delimited by single quotes itself, you must escape them.

an other way (a good practice) would be to use double quotes to delimit sql commands.

your text would then look like this:

if (! $thiscontent = mysql_query("SELECT old_text from wiki_page, wiki_revision, wiki_text where page_title = 'Mrs._Harford' and page_latest = rev_id and rev_text_id = old_id")) { die("Cannot execute the query: ".mysqlerror()); } else {echo $thiscontent;}

Moved from page during rewrite[edit]

I moved this from Page table: Titoxd(?!?) 08:47, 5 May 2007 (UTC)Reply[reply]

To get e.g. the current wikitext of a given page in the main namespace:

   INNER JOIN revision ON page_latest = rev_id 
   INNER JOIN text ON rev_text_id = old_id
   page_title = 'Your page title' 

Note: The use of LEFT(old_text,1024) allows the first 1024 characters to be printed because, old_text being a blob, your query would probably only display "[BLOB - <n>.<n> KiB]" where <n>.<n> is the size in kilobytes of your page content. If you are using this query in a program and need the entire content in old_text, be sure to only reference old_text and not LEFT(old_text,1024).

"varchar(255)" vs. "varchar(255) binary"[edit]

Is it really necessary to have "varchar(255) binary" as column type for page_title (and some other columns of this and other tables) instead of "varchar(255)"? I just spent three evenings debugging trying to find out why sqlalchemy (one python ORM framework) does not give me unicode strings anymore (after an update). I finally found out that the "binary" after the varchar(255) caused the problem. I tried to find the "varchar(255) binary" datatype in the MySQL datatype definition documentation but it is not described (as it was no supported datatype): I guess that this datatype, as MySQL does not complain when creating a table using it, is either a hidden feature of MySQL or a relict.

I think that the motivation making the page_title column "varchar(255) binary" was to avoid encoding problems with lating1 databases having to store UTF8. Wouldn't it be better to use CHARACTER SET UTF8 in the CREATE TABLE statement?

I am also investigating that. I found the following discussion [1].
MySQL 5.0 or 5.1 does not support all UTF-8 characters (called the four-byte utf-8). MySQL 5.5 introduced a character set utf8mb4 which solves the issue [2].
But it seems that for 5.0 and 5.1, a workaround is to add "binary". --Kipmaster 15:46, 30 November 2011 (UTC)Reply[reply]
... more on this...
I couldn't get "varchar(255) binary" on my system either, but it seems that it is the same as varbinary(255); --Kipmaster 21:09, 30 November 2011 (UTC)Reply[reply]

When I import this I get titles in the form of '41666768616e697374616e50656f706c65'. How do i fix it so it is UTF-8 imported?

page_is_redirect vs redirects table[edit]

The documentation for the page table states:

 to find pages that are not redirects add "p.page_id not in (select rd_from from wikidb.redirect)"

Yet, the page_is_redirect column in the page table is supposed to be 1 if the page is a redirect, and 0 otherwise.

Running some queries on the database, one can see that there are big discrepancies between the redirect table and the page_is_redirect column.

For example, using the English Wikipedia dump from April 2011, if we query the number of pages whose page_is_redirect attribute is set to 1, but whose page_id does not appear in the redirect table, we get 2303 such pages:

 mysql> SELECT COUNT(page_id) FROM page WHERE page_is_redirect = 1 AND NOT page_id IN (SELECT rd_from FROM redirect);
 | COUNT(page_id) |
 |           2303 |
 1 row in set (16.97 sec)

If we then search for all pages who page_is_redirect attribute is set to 0, but whose page_id does appear in the redirect table, we get 368269 such pages:

 mysql> SELECT COUNT(page_id) FROM page WHERE page_is_redirect = 0 AND page_id IN (SELECT rd_from FROM redirect);
 | COUNT(page_id) |
 |         368269 |
 1 row in set (19.78 sec)

It would seem, then, that the only reliable way to determine all redirect pages would be to scan the text table, look at the Wikitext for the latest revisions of all pages, and determine whether or not they contain the string "#REDIRECT". Due to the size of this table, this will take quite awhile, so I'm hoping for a better / quicker way to determine whether or not a page is a redirect. Can I trust the redirect table, even though some pages are not listed in it but have their page_is_redirect attribute set to 1? Please advise. Thank you! 21:52, 9 June 2011 (UTC)Reply[reply]

This was a bug in MWDumper (bugzilla:38919). --Tgr (talk) 10:04, 31 October 2012 (UTC)Reply[reply]

page table = "core of the wiki"?[edit]

I would think that the text, page, and revision tables together are the core of the wiki. Leucosticte (talk) 02:08, 7 October 2013 (UTC)Reply[reply]

Indeed. And, if anything, page is the one of the three you need the least, too - all it has that the others don't is titles. -— Isarra 08:14, 9 October 2013 (UTC)Reply[reply]

Two requests[edit]

  • Could someone please fix the link includes/Defines.php in section page_namespace which redirects to phabricator's general search page.
  • Could someone please add a short definition of new to the section page_is_new.

Thank you very much. --Achim (talk) 18:20, 27 May 2016 (UTC)Reply[reply]

How's that? Robin Hood  (talk) 21:04, 27 May 2016 (UTC)Reply[reply]
Robin Hood, fine, thank you. --Achim (talk) 07:58, 28 May 2016 (UTC)Reply[reply]

No Text Table[edit]

The text of the page itself is stored in the text table.

My wiki does not contain a text table. Did this architecture change? Johnywhy (talk) 16:13, 9 June 2018 (UTC)Reply[reply]

Well, that's impossible. Even if you have configured External Storage, there's still a text table. --Ciencia Al Poder (talk) 09:36, 10 June 2018 (UTC)Reply[reply]

Found it. But the only fields i see are old_id, old_text, and old_flags. That doesn't sound like live data. — Preceding unsigned comment added by Johnywhy (talkcontribs)

Those column names have its history, but anyway, documentation exists so you don't have to infer where or how things are stored by looking at how they're named. --Ciencia Al Poder (talk) 09:25, 12 June 2018 (UTC)Reply[reply]

How to query live pages?[edit]

Is this correct selection to get only live pages?

This query returns results, but seems to be missing some pages (based on my wiki content):

SELECT page.page_title FROM `text`, `page` WHERE `old_text`LIKE '%test%' AND 
page.page_latest = text.old_id

This also works. Wondering how to check if index is up-to-date.

SELECT page_id, page_namespace, page_title 
FROM `page`,`searchindex`
WHERE page_id=si_page
AND MATCH(si_text) AGAINST('+''gross''' IN BOOLEAN MODE)
AND page_is_redirect=0

— Preceding unsigned comment added by Johnywhy (talkcontribs)

This works only if page text is not compressed (Manual:$wgCompressRevisions). --Ciencia Al Poder (talk) 09:17, 13 June 2018 (UTC)Reply[reply]

Which query? Which table? searchindex is the table being searched, and that's never compressed. Correct? Johnywhy (talk) 12:15, 13 June 2018 (UTC)Reply[reply]

Text table will be compressed if the setting is set to true. --Ciencia Al Poder (talk) 19:39, 14 June 2018 (UTC)Reply[reply]

Is there a way to get the most viewed pages list?[edit]

In former mediawiki versions I used to get the most viewed pages list with the mysql query:

SELECT DISTINCT page_title FROM page WHERE page_namespace=0 AND page_counter >= 0 ORDER BY page_counter DESC;

The page_counter field is not there any more. Is there a way to accomplish the task I were used to?

DonPaolo (talk) 14:17, 30 October 2018 (UTC)Reply[reply]

How does MediaWiki exactly store the contents of a page inside of a blob in the database? What would be the best technique to search through those contents for searching the site?[edit]

Hello everyone, so I am trying to create my own website for a specific goal and one of the big parts of the site involves a "wiki" pretty much the same as MediaWiki/Wikipedia except with differences to obviously fulfill my goal. I came to this page to ask a couple questions about how pages on MediaWiki are stored. I tied asking these kinds of questions on discussion forums but they didn't really help that much. So since this is MediaWiki I thought that asking my questions here would give me much better answers. My first question is, How are MediaWiki/Wikipedia articles/pages exactly stored in a database. I downloaded MediaWiki and searched some on the internet and found that the page contents are stored inside a JSON file which is then stored into a blob on the database. But when asking on reddit about how I should search through these without any kinds of problems or delays I just got the answer that I should just not do it and store the JSON contents into the table and use JSON commands to search through them. That made me question though, if this method of storing was so "ineffective", then why would the MediaWiki developers, who seem to be very knowledgeable about this subject, use such a method? I would also like to point out that I am just going to develop my own kind of visual editor instead of having two different ones just to make it much easier and faster to do, and store the contents as HTML instead of XML which is what MediaWiki does (I am saying this just to make sure that what I am doing could work).

My second question is, How, with all these layers to go through to get to the page content (the database, the blob, etc), do you actually search through that data without any kinds of delays or trouble, while searching the website? someone told me that Wikipedia uses its own search indexing system which makes me ask, how does it exactly work and how can I create one of my own?

Thanks, Splat --TheSplatGuy (talk) 20:48, 14 July 2020 (UTC)Reply[reply]

p.s. do not try to convince me to "just use MediaWiki" because I have had trouble with getting some things to work and again, the wiki part of my website is just one portion while the other is beyond MediaWiki's use.

MediaWiki is used by Wikipedia, a very large wiki site, with lots of requirements of load balancing, etc. Storing a blob for each page would be more efficient if you don't have a history of revisions where you can see previous versions of the page, pages with hundreds of thousands of revisions, and a lot of users both reading and editing the website.
I don't know where did you read that contents are stored in a json blob in the database. That's plainly false. There's a thing called page (this table), a revision, slots for a revision (currently only a main one, but in the future there may be more), and the page text in the Manual:text table. There's no json involved.
Also MediaWiki doesn't store contents in XML (that's a contradictory statement in your post). XML however is used for importing and exporting pages.
What wikipedia uses for search is Extension:CirrusSearch that is using ElasticSearch as its engine. --Ciencia Al Poder (talk) 12:49, 15 July 2020 (UTC)Reply[reply]

Hello! I greatly thank you for your response, I got the info that Wikipedia stores it's pages in JSON files from a YouTube video of someone playing around with the Wikipedia API using JavaScript. If it doesn't store everything in a JSON file, then what exactly is stored inside the blob? I presume that its XML (seeing your previous message), but how would I exactly do that when I am just using an HTML editor? Would it act the same and be just fine if I just store it as HTML (since I am not using a formatting language like MediaWiki)? I also would like to thank you for telling me about the search extension, I am looking into Elastic search and am thinking of using it for my site. Last thing, is there any resources that I could look more into to get more information about this subject besides the Manual:Database_layout and Manual:MediaWiki_architecture?
Thanks, --TheSplatGuy (talk) 14:06, 15 July 2020 (UTC)Reply[reply]
Manual:Database layout offers the complete scheme (and hence also the format of each table column). --Malyacko (talk) 20:10, 15 July 2020 (UTC)Reply[reply]