Extension:SphinxSearch/Page rank

From MediaWiki.org
Jump to: navigation, search

This is to document the work on better sorting for SphinxSearch results. Current default is to sort by sphinx internal weight, which is calculated based on number of matches within the text, whether they are in the title or in the body, etc. This is how it would be possible to also sort by number of incoming links to the article and article popularity.

Default SearchSphinx results[edit]

Here is an example of the default SphinxSearch search results pulled via SphinxQL. Notice that in this wiki many of the pages ended up with the same weight. The examples below show how to improve the ordering of the results.

mysql> SELECT weight(), * FROM wiki_main,wiki_incremental WHERE MATCH('msis') AND page_namespace=0 AND page_is_redirect=0;
+----------+-------+----------------+------------------+--------+----------+
| weight() | id    | page_namespace | page_is_redirect | old_id | category |
+----------+-------+----------------+------------------+--------+----------+
|     2632 | 47671 |              0 |                0 | 100135 |          |
|     2631 | 44830 |              0 |                0 |  97294 |          |
|     2631 | 44831 |              0 |                0 |  97295 |          |
|     2631 | 44852 |              0 |                0 |  97316 |          |
|     2631 | 44857 |              0 |                0 |  97321 |          |
|     2631 | 44867 |              0 |                0 |  97331 |          |
|     2631 | 44875 |              0 |                0 |  97339 |          |
|     2631 | 44876 |              0 |                0 |  97340 |          |
|     2631 | 44877 |              0 |                0 |  97341 |          |
|     2631 | 44878 |              0 |                0 |  97342 |          |
|     2631 | 44879 |              0 |                0 |  97343 |          |
|     2631 | 44880 |              0 |                0 |  97344 |          |
|     2631 | 44881 |              0 |                0 |  97345 |          |
|     2631 | 44882 |              0 |                0 |  97346 |          |
|     2631 | 45106 |              0 |                0 |  97570 |          |
|     2631 | 45107 |              0 |                0 |  97571 |          |
|     2631 | 45108 |              0 |                0 |  97572 |          |
|     2631 | 45109 |              0 |                0 |  97573 |          |
|     2631 | 45110 |              0 |                0 |  97574 |          |
|     2631 | 45111 |              0 |                0 |  97575 |          |
+----------+-------+----------------+------------------+--------+----------+
20 rows in set (0.01 sec)

Basic example - Adding sort parameters[edit]

The example below will sort results first by Sphinx weight, but after that by number of links TO that article and by number of article views. Feel free to experiment with the order of these arguments, or to add additional ones.

In sphinx.conf file, make the following changes:

Description Original Updated
In the src_wiki_main section, ADD another query that creates a temporary table with incoming link counts.
    # pre-query, executed before the main fetch query
    sql_query_pre = SET NAMES utf8
    # pre-query, executed before the main fetch query
    sql_query_pre	= SET NAMES utf8

    # temporary table for storing link counts
    sql_query_pre = CREATE TEMPORARY TABLE link_count AS \
                        SELECT \
                            page_id AS link_count_id, \
                            COUNT(*) AS link_cnt \
                        FROM page \
                        INNER JOIN pagelinks \
                            ON page_title = pl_title \
                            AND page_namespace = pl_namespace \
                        GROUP BY page_id
In the src_wiki_main section, CHANGE sql_query attribute to pick this data up, and also to get page_counter (aliased to page_cnt) field which stores numbers of article hits.

Note that the original line from the file has been reformatted to better fit this page.

    # main document fetch query - change the table names if you are using a prefix
    sql_query = SELECT \
                    page_id, \
                    page_title, \
                    page_namespace, \
                    page_is_redirect, \
                    old_id, \
                    old_text \
                FROM page, revision, text \
                WHERE rev_id=page_latest \
                AND old_id=rev_text_id
    # main document fetch query - change the table names if you are using a prefix
    sql_query = SELECT \
                    page_id, \
                    page_title, \
                    page_namespace, \
                    page_is_redirect, \
                    old_id, \
                    old_text, \
                    page_counter AS view_cnt, \
                    link_cnt \
                FROM page \
                INNER JOIN revision \
                    ON rev_id = page_latest \
                INNER JOIN text \
                    ON old_id = rev_text_id \
                LEFT JOIN link_count \
                    ON page_id = link_count_id
In the src_wiki_main section, ADD the two new attributes that were added to the query to the attribute columns list below the query.
    # attribute columns
    sql_attr_uint = page_namespace
    sql_attr_uint = page_is_redirect
    sql_attr_uint = old_id
    # attribute columns
    sql_attr_uint = page_namespace
    sql_attr_uint = page_is_redirect
    sql_attr_uint = old_id
    sql_attr_uint = view_cnt
    sql_attr_uint = link_cnt
In the src_wiki_incremental section, CHANGE the sql_query to match the main query (your query page_touched... part may vary, and note that we are moving rev_id = page_latest and old_id = rev_text_id from WHERE to INNER JOINs).
    # adjust this query based on the time you run the full index
    # in this case, full index runs at 7 AM UTC
    sql_query = SELECT \
                    page_id, \
                    page_title, \
                    page_namespace, \
                    page_is_redirect, \
                    old_id, \
                    old_text \
                FROM page, revision, text \
                WHERE rev_id = page_latest \
                AND old_id = rev_text_id \
                AND page_touched >= DATE_FORMAT(CURDATE(), '%Y%m%d070000')
    # adjust this query based on the time you run the full index
    # in this case, full index runs at 7 AM UTC
    sql_query = SELECT \
                    page_id, \
                    page_title, \
                    page_namespace, \
                    page_is_redirect, \
                    page_counter, \
                    old_id, \
                    old_text, \
                    page_counter AS view_cnt, \
                    link_cnt \
                FROM page \
                INNER JOIN revision \
                    ON rev_id = page_latest \
                INNER JOIN text \
                    ON old_id = rev_text_id \
                LEFT JOIN link_count \
                    ON page_id = link_count_id \
                WHERE page_touched >= DATE_FORMAT(CURDATE(), '%Y%m%d070000')

In LocalSettings.php file, make the following changes after the SphinxSearch.php inclusion:

Description Original Updated
ADD two new variables for controlling the Sphinx sort options.
$wgSearchType = 'SphinxMWSearch';
require_once( "$IP/extensions/SphinxSearch/SphinxSearch.php" );
$wgSearchType = 'SphinxMWSearch';
require_once( "$IP/extensions/SphinxSearch/SphinxSearch.php" );
$wgSphinxSearch_sortmode = SPH_SORT_EXTENDED;
$wgSphinxSearch_sortby = '@weight DESC, pl_count DESC, page_counter DESC';

And finally:

  • Reindex the wiki content
  • Restart the Sphinx search service
  • Restart the Apache service

Basic results[edit]

Verify the query results using Extension:SphinxSearch/SphinxQL. Notice that the results have changed slightly and have aligned to the ORDER BY parameters in LocalSettings.php.

mysql> SELECT @weight, * FROM wiki_main,wiki_incremental WHERE MATCH('msis') AND page_namespace=0 AND page_is_redirect=0  ORDER BY @weight desc, view_cnt desc, link_cnt desc;
+---------+-------+----------------+------------------+--------+----------+----------+----------+---------+
| @weight | id    | page_namespace | page_is_redirect | old_id | view_cnt | link_cnt | category | @weight |
+---------+-------+----------------+------------------+--------+----------+----------+----------+---------+
|    2632 | 47671 |              0 |                0 | 100135 |        3 |        0 |          |    2632 |
|    2631 | 46557 |              0 |                0 |  99021 |       60 |        2 |          |    2631 |
|    2631 | 44879 |              0 |                0 |  97343 |       20 |        0 |          |    2631 |
|    2631 | 57023 |              0 |                0 | 126664 |       12 |        0 |          |    2631 |
|    2631 | 44830 |              0 |                0 |  97294 |       12 |        0 |          |    2631 |
|    2631 | 45555 |              0 |                0 |  98019 |       11 |        1 |          |    2631 |
|    2631 | 44876 |              0 |                0 |  97340 |       11 |        0 |          |    2631 |
|    2631 | 45487 |              0 |                0 |  97951 |       11 |        0 |          |    2631 |
|    2631 | 45564 |              0 |                0 |  98028 |       10 |        1 |          |    2631 |
|    2631 | 44867 |              0 |                0 |  97331 |        9 |        0 |          |    2631 |
|    2631 | 44831 |              0 |                0 |  97295 |        8 |        0 |          |    2631 |
|    2631 | 45114 |              0 |                0 |  97578 |        8 |        0 |          |    2631 |
|    2631 | 44877 |              0 |                0 |  97341 |        7 |        0 |          |    2631 |
|    2631 | 46532 |              0 |                0 |  98996 |        6 |        2 |          |    2631 |
|    2631 | 47438 |              0 |                0 |  99902 |        6 |        2 |          |    2631 |
|    2631 | 44875 |              0 |                0 |  97339 |        6 |        0 |          |    2631 |
|    2631 | 44878 |              0 |                0 |  97342 |        5 |        0 |          |    2631 |
|    2631 | 44881 |              0 |                0 |  97345 |        5 |        0 |          |    2631 |
|    2631 | 45108 |              0 |                0 |  97572 |        4 |        0 |          |    2631 |
|    2631 | 45155 |              0 |                0 |  97619 |        4 |        0 |          |    2631 |
+---------+-------+----------------+------------------+--------+----------+----------+----------+---------+
20 rows in set (0.01 sec)

Advanced example[edit]

The basic example works fairly well and provides some additional sorting options but it still relies heavily on the Sphinx @weight. For some wikis, @weight might not be representative of the true value of the page. Changing the order of the fields doesn't work well either. For instance, the examples on this page are from a wiki that had a large number of technical documents imported from an external system. The results that were listed at the top of the default and basic searches are actually of very little use to the end user. A better formula for calculating a page rank might be:

weight
+ # of views (rewards popularity of the page)
+ # of inbound links * multiplier (reward for being linked to)
+ # of edits (rewards effort put into the page)
+ # of categories * multiplier (reward for being categorized)
- # of days since last update (penalty for pages that haven't been edited in a while)

In sphinx.conf file, make the following changes:

Description Original Updated
In the src_wiki_main section, ADD additional queries to create temporary tables to calculate link count, edit count and category count.
    # pre-query, executed before the main fetch query
    sql_query_pre = SET NAMES utf8
    # pre-query, executed before the main fetch query
    sql_query_pre   = SET NAMES utf8
    
    # temporary table for storing link counts
    sql_query_pre = CREATE TEMPORARY TABLE link_count AS \
                        SELECT \
                            page_id AS link_count_id, \
                            COUNT(*) AS link_cnt \
                        FROM page \
                        INNER JOIN pagelinks \
                            ON page_title = pl_title \
                            AND page_namespace = pl_namespace \
                        GROUP BY page_id
    
    # temporary table for storing edit counts
    sql_query_pre = CREATE TEMPORARY TABLE revision_count AS \
                        SELECT \
                            rev_page AS revision_count_id, \
                            COUNT(*) AS edit_cnt \
                        FROM page \
                        INNER JOIN revision \
                            ON page_id = rev_page \
                        GROUP BY page_id
    
    # temporary table for storing category counts
    sql_query_pre = CREATE TEMPORARY TABLE category_count AS \
                        SELECT \
                            cl_from AS category_count_id, \
                            COUNT(*) AS category_cnt \
                        FROM page \
                        INNER JOIN categorylinks \
                            ON page_id = cl_from \
                        GROUP BY page_id
In the src_wiki_main section, CHANGE sql_query attribute to pick this data up, and also to get page_counter (aliased to page_cnt) field which stores numbers of article hits.

Note that the original line from the file has been reformatted to better fit this page.

    # main document fetch query - change the table names if you are using a prefix
    sql_query = SELECT \
                    page_id, \
                    page_title, \
                    page_namespace, \
                    page_is_redirect, \
                    old_id, \
                    old_text \
                FROM page, revision, text \
                WHERE rev_id=page_latest \
                AND old_id=rev_text_id
    # main document fetch query - change the table names if you are using a prefix
    sql_query = SELECT \
                    page_id, \
                    page_title, \
                    page_namespace, \
                    page_is_redirect, \
                    old_id, \
                    old_text, \
                    page_counter AS view_cnt, \
                    link_cnt, \
                    edit_cnt, \
                    DATEDIFF( CURDATE(), STR_TO_DATE ( rev_timestamp, '%Y%m%d%H%i%s') ) + 1 AS edit_age,  \
                    category_cnt \
                FROM page \
                INNER JOIN revision \
                    ON rev_id = page_latest \
                INNER JOIN text \
                    ON old_id = rev_text_id \
                LEFT JOIN link_count \
                    ON page_id = link_count_id \
                LEFT JOIN revision_count \
                    ON page_id = revision_count_id \
                LEFT JOIN category_count \
                    ON page_id = category_count_id
In the src_wiki_main section, ADD the two new attributes that were added to the query to the attribute columns list below the query.
    # attribute columns
    sql_attr_uint = page_namespace
    sql_attr_uint = page_is_redirect
    sql_attr_uint = old_id
    # attribute columns
    sql_attr_uint   = page_namespace
    sql_attr_uint   = page_is_redirect
    sql_attr_uint   = old_id
    sql_attr_uint   = view_cnt
    sql_attr_uint   = link_cnt
    sql_attr_uint   = edit_cnt
    sql_attr_uint   = edit_age
    sql_attr_uint   = category_cnt
In the src_wiki_incremental section, CHANGE the sql_query to match the main query (your query page_touched... part may vary, and note that we are moving rev_id = page_latest and old_id = rev_text_id from WHERE to INNER JOINs).
    # adjust this query based on the time you run the full index
    # in this case, full index runs at 7 AM UTC
    sql_query = SELECT \
                    page_id, \
                    page_title, \
                    page_namespace, \
                    page_is_redirect, \
                    old_id, \
                    old_text \
                FROM page, revision, text \
                WHERE rev_id = page_latest \
                AND old_id = rev_text_id \
                AND page_touched >= DATE_FORMAT(CURDATE(), '%Y%m%d070000')
    # adjust this query based on the time you run the full index
    # in this case, full index runs at 7 AM UTC
    sql_query = SELECT \
                    page_id, \
                    page_title, \
                    page_namespace, \
                    page_is_redirect, \
                    old_id, \
                    old_text, \
                    page_counter AS view_cnt, \
                    link_cnt, \
                    edit_cnt, \
                    DATEDIFF( CURDATE(), STR_TO_DATE ( rev_timestamp, '%Y%m%d%H%i%s') ) + 1 AS edit_age,  \
                    category_cnt \
                FROM page \
                INNER JOIN revision \
                    ON rev_id = page_latest \
                INNER JOIN text \
                    ON old_id = rev_text_id \
                LEFT JOIN link_count \
                    ON page_id = link_count_id \
                LEFT JOIN revision_count \
                    ON page_id = revision_count_id \
                LEFT JOIN category_count \
                    ON page_id = category_count_id \
                WHERE page_touched >= DATE_FORMAT(CURDATE(), '%Y%m%d070000')

In LocalSettings.php file, make the following changes after the SphinxSearch.php inclusion:

Description Original Updated
ADD two new variables for controlling the Sphinx sort options. $wgSphinxSearch_sortby defines the algorithm for calculating page rank. Given that @weight was coming in around ~2600, a multiplier was given to bump up certain parameters. view_cnt and edit_age are actually bumped down to decrease the dramatic affect of top level pages that had +10K views and also pages that hadn't been updated in 6-7 years. Feel free to modify the formula as needed.
$wgSearchType = 'SphinxMWSearch';
require_once( "$IP/extensions/SphinxSearch/SphinxSearch.php" );
$wgSearchType = 'SphinxMWSearch';
require_once( "$IP/extensions/SphinxSearch/SphinxSearch.php" );
$wgSphinxSearch_sortmode = SPH_SORT_EXPR;
$wgSphinxSearch_sortby = '@weight + ((view_cnt + 1) / 10) + (link_cnt * 10) + edit_cnt + (category_cnt * 10) - (edit_age / 10)';

And finally:

  • Reindex the wiki content
  • Restart the Sphinx search service
  • Restart the Apache service

Advanced results[edit]

Verify the query results using Extension:SphinxSearch/SphinxQL. While it is hard to tell by just looking at the page_ids in the results, the new search results are filtered out high scoring but older irrelevant pages, and prioritized slightly lower scoring pages that are actively being updated and maintained by the users.

mysql> SELECT @weight AS page_weight, page_weight + ((view_cnt + 1) / 10) + (link_cnt * 10) + edit_cnt + (category_cnt * 10) - (edit_age / 10) AS page_rank, * FROM wiki_main,wiki_incremental WHERE MATCH('msis') AND page_namespace=0 AND page_is_redirect=0 ORDER BY page_rank DESC;
+-------------+-------------+-------+----------------+------------------+--------+----------+----------+----------+----------+--------------+-------------------------+
| page_weight | page_rank   | id    | page_namespace | page_is_redirect | old_id | view_cnt | link_cnt | edit_cnt | edit_age | category_cnt | category                |
+-------------+-------------+-------+----------------+------------------+--------+----------+----------+----------+----------+--------------+-------------------------+
|        2627 | 2847.100098 | 57881 |              0 |                0 | 110326 |      914 |        3 |      106 |      474 |            4 | 60885,62470,62528,62686 |
|        2602 | 2687.800049 | 44838 |              0 |                0 | 126927 |      409 |        2 |       32 |      372 |            3 | 62470,62515,62817       |
|        2624 | 2678.800049 | 62468 |              0 |                0 | 125750 |      204 |        1 |       32 |      377 |            3 | 60889,62470,62686       |
|        2615 | 2671.000000 | 57617 |              0 |                0 | 126389 |      235 |        5 |       20 |      376 |            0 |                         |
|        2619 | 2669.500000 | 57624 |              0 |                0 | 126391 |      100 |        7 |        8 |      376 |            0 |                         |
|        2624 | 2634.399902 | 57623 |              0 |                0 | 126390 |       99 |        2 |        8 |      376 |            1 | 62470                   |
|        2624 | 2625.500000 | 44856 |              0 |                0 | 126319 |       51 |        2 |        4 |      377 |            1 | 62470                   |
|        2622 | 2616.399902 | 45359 |              0 |                0 |  97823 |      129 |        2 |        7 |      556 |            1 | 62470                   |
|        2627 | 2612.000000 | 45366 |              0 |                0 | 126320 |       46 |        1 |        8 |      377 |            0 |                         |
|        2617 | 2605.600098 | 62472 |              0 |                0 | 128757 |       15 |        1 |        3 |      360 |            1 | 62470                   |
|        2619 | 2604.699951 | 45374 |              0 |                0 | 126321 |       53 |        1 |        8 |      377 |            0 |                         |
|        2631 | 2600.699951 | 57023 |              0 |                0 | 126664 |       12 |        0 |        6 |      376 |            0 |                         |
|        2613 | 2595.500000 | 45403 |              0 |                0 |  97867 |       30 |        2 |        3 |      536 |            1 | 62543                   |
|        2594 | 2592.200195 | 44948 |              0 |                0 |  97412 |       55 |        3 |        4 |      514 |            1 | 62562                   |
|        2594 | 2586.399902 | 46248 |              0 |                0 |  98712 |      229 |        0 |       15 |      556 |            1 | 62372                   |
|        2613 | 2585.199951 | 56360 |              0 |                0 | 108823 |       17 |        1 |        4 |      536 |            1 | 62543                   |
|        2613 | 2576.399902 | 56361 |              0 |                0 | 108824 |       29 |        0 |        4 |      536 |            1 | 62543                   |
|        2582 | 2573.100098 | 61031 |              0 |                0 | 113240 |      159 |        0 |       14 |      489 |            1 | 62372                   |
|        2582 | 2565.600098 | 45649 |              0 |                0 |  98113 |        5 |        3 |        2 |      490 |            0 |                         |
|        2582 | 2563.399902 | 58565 |              0 |                0 | 110979 |        8 |        2 |        2 |      515 |            1 | 62652                   |
+-------------+-------------+-------+----------------+------------------+--------+----------+----------+----------+----------+--------------+-------------------------+
20 rows in set (0.01 sec)

Additional parameters[edit]

Column weights[edit]

The weights assigned to individual columns in SphinxSearch.php may need to be tweaked to account for the content in the wiki. The defaults are:

# Weights of individual indexed columns. This gives page titles extra weight
$wgSphinxSearch_weights = array(
	'old_text' => 1,
	'page_title' => 100
);

To update the weights, update LocalSettings.php with the variables and modify the values:

# Weights of individual indexed columns. This gives page titles extra weight
$wgSphinxSearch_weights = array(
	'old_text' => 1,
	'page_title' => 10
);

Use the following query to replicate the column weights from the settings file:

SELECT @weight AS page_weight, page_weight + ((view_cnt + 1) / 10) + (link_cnt * 10) + edit_cnt + (category_cnt * 10) - (edit_age / 10) AS page_rank, * 
FROM wiki_main,wiki_incremental 
WHERE MATCH('msis') 
AND page_namespace = 0 
AND page_is_redirect = 0 
ORDER BY page_rank DESC 
OPTION field_weights = (page_title = 10, old_text = 1);

Indexing performance[edit]

Adding new columns and JOINs will definitely impact indexing performance and care must be taken to ensure that the new parameters do not have a negative impact. Listed below are the indexing times for a wiki with ~57,000 articles:

  • Default: ~2 minutes
  • Basic example (1 join): ~14 minutes
  • Advanced example (3 joins): ~43 minutes

See also[edit]