Topic on Project:Support desk

Deleted data still in wiki_text and shrink Mariadb database

5
Beardedfool (talkcontribs)

Hi,

I've been trying to work out why our wiki_text table is 600+Mb with only 461 pages but am past my level of expertise, help please!


I did have an over-zealous script running through editpage.php that made a ton of edits - I've deleted that page but I can still see data from it through a mysql > select *


Pages with the most revisions shows the largest having 235 revisions. That page had thousands (I think) but maybe it's hidden as it was through editpage.php, or a template?


Is there some way to get rid of this old data and shrink the database?

Problem: Database larger than expected, or is it

Assuming that I'm right and the mediawiki text table only has text in it, it seems large for what we have in it

<root:/var/lib/mysql/wiki# ls -alS | head -2
total 775M
-rw-rw---- 1 mysql mysql 664M Sep 21 21:10 wiki_text.ibd
Database changed
MariaDB [wiki]> SELECT
    ->      table_schema as `Database`,
    ->      table_name AS `Table`,
    ->      round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB`
    -> FROM information_schema.TABLES
    -> ORDER BY (data_length + index_length) DESC;
+--------------------+----------------------------------------------------+------------+
| Database           | Table                                              | Size in MB |
+--------------------+----------------------------------------------------+------------+
| wiki               | wiki_text                                          |     659.55 |
| wiki               | wiki_revision                                      |       7.78 |

This seems... large

MariaDB [wiki]> select COUNT(*) from wiki_text;
+----------+
| COUNT(*) |
+----------+
|    15530 |
+----------+
1 row in set (0.06 sec)

Page Statistics

Page statistics
Content pages	461
Pages
(All pages in the wiki, including talk pages, redirects, etc.)	697
Uploaded files	512
Edit statistics
Page edits since XXXX was set up	15,577
Average edits per page	22.35

sqldump

is a fair bit smaller but still large

mysqldump wiki > wiki.sql
-rw-r--r-- 1 root root 417M Sep 21 22:04 wiki.sql
dumpBackup.php --full

is only about 41Mb

Old data

It does seem to have to data in the table that has been deleted

select COUNT(*) from wiki_text where old_text like '%Areca%';.
 +----------+
| COUNT(*) |
+----------+
|       13 |
+----------+
1 row in set (3.06 sec)

decribe wiki_text

MariaDB [wiki]> describe wiki_text;
+-----------+------------------+------+-----+---------+----------------+
| Field     | Type             | Null | Key | Default | Extra          |
+-----------+------------------+------+-----+---------+----------------+
| old_id    | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| old_text  | mediumblob       | NO   |     | NULL    |                |
| old_flags | tinyblob         | NO   |     | NULL    |                |
+-----------+------------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

Maintenance tools tried

deleteOldRevisions.php

root:/var/lib/mediawiki/maintenance# php deleteOldRevisions.php
Delete old revisions

Searching for active revisions...done.
Searching for inactive revisions...done.
3552 old revisions found.

purgeOldText.php

root:/var/lib/mediawiki/maintenance# php purgeOldText.php
Searching for active text records via contents table...done.
done.
Searching for inactive text records...done.
0 inactive items found.

deleteArchivedRevisions.php

This should be another table anyway but just in case someone comes across it as a solution

root:/var/lib/mediawiki/maintenance# php deleteArchivedRevisions.php
Found 0 revisions to delete.

Shrinking Maridb

Not convinced this is actually doing anything. It went up at one point after deleting old revisions

MariaDB [wiki]> OPTIMIZE TABLE wiki_text;
+----------------+----------+----------+-------------------------------------------------------------------+
| Table          | Op       | Msg_type | Msg_text                                                          |
+----------------+----------+----------+-------------------------------------------------------------------+
| wiki.wiki_text | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| wiki.wiki_text | optimize | status   | OK                                                                |
+----------------+----------+----------+-------------------------------------------------------------------+
2 rows in set (18.67 sec)
Bawolff (talkcontribs)

so the page in question that you deleted - is it totally gone (from the maintenance script you ran) or are there still references to it in revision or achive. The text table is used by both the revision and the archive table, so if still in archive table the purge script wont delete.

> Pages with the most revisions shows the largest having 235 revisions. That page had thousands (I think) but maybe it's hidden as it was through editpage.php, or a template?

That special page wont include deleted pages stored in archive table, and i think not all namespaces.

Beardedfool (talkcontribs)

Thanks for coming back to me!


That deleted data was not in the archive table but still hanging round in the main wiki_text table, despite not showing up through the webpage in any way. I'm saying that from the archive table being small in size.

deleteArchivedRevisions.php showing nothing also confirms that it wasn't in the archive table.. I think ??


In the end I found a search term for that page and did a delete straight from mariadb. That solved it and optimize database got it all down to 40Mb!


But what can cause data to 'be deleted' but still be in the table and not show up on purgeOldtext or deleteOldRevisions.php ?


I'm happy to do some restoring and testing if it's helpful to devs.

Honestly, as is often the way, I was trying a few things (compressing old revisions etc) along this so I can't be 100% on my route to solve it all.

Bawolff (talkcontribs)

certain types of compression (compressOld.php) can mess with the ability of purgeOldText.php to find records (because the records become mixed).

The archive table need not be big for there to be dangling references to large parts of the text table.

Sergtk (talkcontribs)

I was not able to create scenario so that some records deleted from the table `text' or `content`

I tried to create new page for testing with uncompressed revisions.

Then I delete it, run purgeOldText.php and run deleteArchivedRevisions.php --delete.

I tried different combinations, experiments, and I never get the number of records reduced in the table text or content. I checked the number with SQL queries (SELECT COUNT(*) FROM `text`, SELECT COUNT(*) FROM `content`).

And mentioned scripts output all the time the following

0 inactive items found.


Could you or somebody else describe the scenario as an example to demonstrate that number of records in table text and content reduced?


Thank you

UPDATE:

After investigating the source code of MediaWiki, I didn't realise how records in table `text` may be deleted.

deleteArchivedRevisions.php deletes archived records only:

https://github.com/wikimedia/mediawiki/blob/master/maintenance/deleteArchivedRevisions.php#L54

Then `text` table should be cleaned up with the following line after this:

$this->purgeRedundantText( true );

Link: https://github.com/wikimedia/mediawiki/blob/master/maintenance/deleteArchivedRevisions.php#L59


Note, this is the same if we run the file purgeOldText.php, if we want to delete records directly: https://github.com/wikimedia/mediawiki/blob/master/maintenance/purgeOldText.php#L40


But if we check the function `purgeRedundantText` here: https://github.com/wikimedia/mediawiki/blob/master/maintenance/includes/Maintenance.php#L1311 , we may see that records in the `text` table are found based on `content` table. But `content` table is not modified.

So according to the implementation `text` and `content` tables should not be modified when we just run 'deleteArchivedRevisions.php --delete'.

But according to the documentation they should be trimmed.


I missed something.

Could you please clarify what is wrong with this?

Thank you.


IMHO, it looks like the file deleteArchivedRevisions.php is outdated. Last significant update was in 2014.

But the following document which describes changes in database structure was created in 2016: Multi-Content Revisions/Content Meta-Data


FYI. I created bug here https://phabricator.wikimedia.org/T271635

Reply to "Deleted data still in wiki_text and shrink Mariadb database"