Extension:SphinxSearch/Page rank

This page documents work on better sorting for SphinxSearch results. By default, results are sorted by Sphinx's internal weight, which is calculated from factors such as the number of matches in the text and whether they occur in the title or in the body. The steps below make it possible to also sort by the number of incoming links to an article and by article popularity (page views).

In the sphinx.conf file, in the source src_wiki_main section, after this line:

sql_query_pre = SET NAMES utf8

ADD another query that creates a temporary table holding the incoming link count for each page:

sql_query_pre = CREATE TEMPORARY TABLE pagelink_count AS \
        SELECT page_id AS pl_id, COUNT(*) AS pl_count FROM page \
        INNER JOIN pagelinks ON page_title=pl_title AND page_namespace=pl_namespace GROUP BY page_id
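
To see what the temporary table ends up containing, the same query can be tried against a tiny in-memory SQLite database. This is only a sketch: SQLite stands in for MySQL, the two tables carry just the columns the query touches, the sample rows are invented, and the helper name incoming_link_counts is hypothetical.

```python
import sqlite3

def incoming_link_counts(conn):
    """Build pagelink_count as in sql_query_pre and return {page_id: count}."""
    conn.execute(
        "CREATE TEMPORARY TABLE pagelink_count AS "
        "SELECT page_id AS pl_id, COUNT(*) AS pl_count FROM page "
        "INNER JOIN pagelinks ON page_title=pl_title AND page_namespace=pl_namespace "
        "GROUP BY page_id"
    )
    return dict(conn.execute("SELECT pl_id, pl_count FROM pagelink_count"))

# Invented sample data: three pages, three links between them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
    CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT);
    INSERT INTO page VALUES (1, 0, 'Main_Page'), (2, 0, 'Sphinx'), (3, 0, 'Orphan');
    INSERT INTO pagelinks VALUES (1, 0, 'Sphinx'), (3, 0, 'Sphinx'), (2, 0, 'Main_Page');
""")
counts = incoming_link_counts(conn)
print(counts)
```

In this sample, page 3 has no incoming links, so it has no row in pagelink_count at all. That is why the main query below uses a LEFT JOIN rather than an INNER JOIN: pages nobody links to must still be indexed.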

CHANGE the sql_query setting to pick up this data, and also to fetch the page_counter field, which stores the number of hits for each article:

sql_query = SELECT page_id, page_title, page_namespace, page_is_redirect, page_counter, old_id, old_text, pl_count \
           FROM page INNER JOIN revision ON rev_id=page_latest INNER JOIN text ON old_id=rev_text_id \
           LEFT JOIN pagelink_count ON page_id=pl_id

ADD two new attributes to the list below the query:

sql_attr_uint   = page_counter
sql_attr_uint   = pl_count

CHANGE sql_query in the source src_wiki_incremental section to match the main query. The page_touched... part of your query may vary. Note also that the rev_id=page_latest and old_id=rev_text_id conditions move from the WHERE clause into INNER JOINs:

sql_query = SELECT page_id, page_title, page_namespace, page_is_redirect, page_counter, old_id, old_text, pl_count \
           FROM page INNER JOIN revision ON rev_id=page_latest INNER JOIN text ON old_id=rev_text_id \
           LEFT JOIN pagelink_count ON page_id=pl_id WHERE page_touched>=DATE_FORMAT(CURDATE(), '%Y%m%d070000')
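
The page_touched condition compares MediaWiki's 14-digit timestamps (YYYYMMDDHHMMSS) against today's date with a fixed time of 07:00:00 appended. A minimal sketch of that cutoff, using a hypothetical helper incremental_cutoff and made-up dates:

```python
from datetime import date

def incremental_cutoff(day):
    """Mimic DATE_FORMAT(day, '%Y%m%d070000'): the date plus a fixed 07:00:00."""
    return day.strftime("%Y%m%d") + "070000"

cutoff = incremental_cutoff(date(2012, 3, 5))
print(cutoff)  # 20120305070000

# page_touched values have the same fixed width, so plain string
# comparison orders them chronologically.
print("20120305081500" >= cutoff)  # True: touched after 07:00 that day
print("20120304235959" >= cutoff)  # False: touched the day before
```

The 07:00 value is simply the one used in this example query. It would normally be chosen to line up with whenever the main index is rebuilt.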

Reindex your wikis and add this to LocalSettings.php after the line that includes SphinxSearch.php:

$wgSphinxSearch_sortmode = SPH_SORT_EXTENDED;
$wgSphinxSearch_sortby = '@weight DESC, pl_count DESC, page_counter DESC';

This sorts results first by Sphinx weight, then by the number of links TO the article, and finally by the number of article views. Feel free to experiment with the order of these sort keys, or to add more.
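
The effect of this sort clause can be illustrated outside Sphinx. The sketch below is not how Sphinx implements sorting. It only reproduces the ordering rule on invented matches, with a hypothetical helper extended_sort:

```python
def extended_sort(matches):
    """Order matches like '@weight DESC, pl_count DESC, page_counter DESC'."""
    return sorted(
        matches,
        key=lambda m: (-m["weight"], -m["pl_count"], -m["page_counter"]),
    )

# Invented matches: the weights and counts are made up for illustration.
matches = [
    {"title": "A", "weight": 2, "pl_count": 5, "page_counter": 10},
    {"title": "B", "weight": 3, "pl_count": 0, "page_counter": 1},
    {"title": "C", "weight": 2, "pl_count": 5, "page_counter": 40},
    {"title": "D", "weight": 2, "pl_count": 9, "page_counter": 0},
]
order = [m["title"] for m in extended_sort(matches)]
print(order)  # ['B', 'D', 'C', 'A']
```

Here B wins on weight alone despite having no incoming links. The link count and view count only break ties between matches of equal weight, which is worth keeping in mind when reordering the sort keys.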
