Extension talk:SphinxSearch/Page rank

Combine pageweight + viewcount + pagelinks?
I am experimenting with SphinxSearch on an internal wiki used for technical documentation. I have it up and running and have made the modifications documented on this SphinxSearch/Page rank page. I have also experimented with changing the order of the $wgSphinxSearch_sortby parameters. However, no matter which order I try, I can't quite get results that I am completely comfortable with: a combination that works well for one search might not work well for another. What might work nicely is being able to combine pageweight + viewcount + pagelinks (maybe with a multiplier) into one value that is used for sorting.

Pageweight is obviously the most valuable metric gained from using Sphinx, but sorting on it first doesn't always work out the way I want. Sorting by viewcount or pagelinks first, on the other hand, tends to be pretty harsh and negates the value of having pageweight. In my mind, combining all three values might produce an interesting score.

For example, here are the results for the same search criteria using different combinations for $wgSphinxSearch_sortby.

Top 5 By Weight (Weight, Views, Links)

    PageTitle           : Weight + Views + (Links * 10) = New Weight
    1. 4684-BKUP7000    : 2632 +  4 + (2 * 10) = 2656
    2. ProcessModel     : 2632 +  1 + (0 * 10) = 2633
    3. 205-PAP          : 2631 + 59 + (2 * 10) = 2710
    4. 1174-PAP         : 2631 + 20 + (0 * 10) = 2651
    5. AcuteCareIPandLT : 2631 + 12 + (0 * 10) = 2643

The page that we really want (MSIS) doesn't even show up in the top 5 (or the top 20, because so many pages scored 2631 for weight).

Top 5 By Views (Views, Links, Weight)

    PageTitle      : Weight + Views + (Links * 10) = New Weight
    1. Encounters  : 1607 + 2109 + (9 * 10)  = 3806
    2. Needy       : 1582 + 1261 + (2 * 10)  = 2863
    3. CaseTrakker : 1594 + 1017 + (5 * 10)  = 2661
    4. MSIS        : 2627 +  909 + (2 * 10)  = 3556 (This is the target page)
    5. Portal      : 1560 +  880 + (13 * 10) = 2570

The target page (MSIS) cracks the top 5, but only because it had the right number of views.

Top 5 By PageLinks (Links, Weight, Views)

    PageTitle     : Weight + Views + (Links * 10) = New Weight
    1. 1228-EMS   : 1560 + 0 + (27 * 10) = 1830
    2. 1232-R2000 : 1560 + 0 + (27 * 10) = 1830
    3. 1232-R2001 : 2582 + 5 + (26 * 10) = 2847
    4. 1234-R3000 : 1560 + 3 + (26 * 10) = 1823
    5. 1234-R3001 : 1560 + 1 + (26 * 10) = 1821

Again, the target page (MSIS) is nowhere to be found, because these pages have diagrams and lots of links but are really low on the value scale (hence the low number of views).

Top 5 By New Weight

    PageTitle     : Weight + Views + (Links * 10) = New Weight
    1. Encounters : 1607 + 2109 + (9 * 10)  = 3806
    2. MSIS       : 2627 +  909 + (2 * 10)  = 3556 (This is the target page)
    3. Needy      : 1582 + 1261 + (2 * 10)  = 2863
    4. 1232-R2001 : 2582 +    5 + (26 * 10) = 2847
    5. 205-PAP    : 2631 +   59 + (2 * 10)  = 2710

The New Weight brought the target page (MSIS) up to its highest position (#2) and helped keep other noisy pages lower in the rankings. That little bump from combining weight + views pushed the page past all of the noisy pages that had slightly better weight (+4 points) but fewer views.
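As a sanity check on the arithmetic in the listings above, here is a minimal Python sketch that recomputes the New Weight for the fifteen pages that appear in them and sorts the result. The figures are copied from this post; the names `PAGES`, `new_weight` and `top_pages` are made up for illustration and are not part of the extension. On a real wiki Sphinx would do this ranking itself.

```python
# (weight, views, links) per page, copied from the three listings in this post.
PAGES = {
    "4684-BKUP7000": (2632, 4, 2),
    "ProcessModel": (2632, 1, 0),
    "205-PAP": (2631, 59, 2),
    "1174-PAP": (2631, 20, 0),
    "AcuteCareIPandLT": (2631, 12, 0),
    "Encounters": (1607, 2109, 9),
    "Needy": (1582, 1261, 2),
    "CaseTrakker": (1594, 1017, 5),
    "MSIS": (2627, 909, 2),
    "Portal": (1560, 880, 13),
    "1228-EMS": (1560, 0, 27),
    "1232-R2000": (1560, 0, 27),
    "1232-R2001": (2582, 5, 26),
    "1234-R3000": (1560, 3, 26),
    "1234-R3001": (1560, 1, 26),
}

LINK_MULTIPLIER = 10  # the multiplier used in the rows of the listings


def new_weight(weight, views, links, multiplier=LINK_MULTIPLIER):
    """Combined score: weight + views + (links * multiplier)."""
    return weight + views + links * multiplier


def top_pages(pages, n=5):
    """Return the n highest-scoring (title, score) pairs, best first."""
    scored = [(title, new_weight(*stats)) for title, stats in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    for position, (title, score) in enumerate(top_pages(PAGES), start=1):
        print(f"{position}. {title}: {score}")
```

Run against these fifteen pages only, the sketch reproduces the "Top 5 By New Weight" order, with MSIS in second place.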

Combining the three is a little subjective and could obviously get out of scale pretty easily depending on a number of factors, but I would like to try it out. How do I go about making modifications to how the extension is configured to try to come up with a score that combines all three values (weight + views + (links * multiplier))?
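To get a feel for how quickly the link term can get out of scale, this rough Python sketch recomputes the rank of MSIS among five of the pages listed above as the link multiplier grows. The page figures come from this post; the function name `rank_of`, the five-page sample and the choice of multipliers are only illustrative assumptions.

```python
# (weight, views, links) for a few of the pages listed in this post.
SAMPLE = {
    "MSIS": (2627, 909, 2),
    "Encounters": (1607, 2109, 9),
    "1232-R2001": (2582, 5, 26),
    "1228-EMS": (1560, 0, 27),
    "Portal": (1560, 880, 13),
}


def rank_of(target, pages, multiplier):
    """1-based rank of `target` when sorting by weight + views + links * multiplier."""

    def score(stats):
        weight, views, links = stats
        return weight + views + links * multiplier

    target_score = score(pages[target])
    # Rank is 1 plus the number of pages that score strictly higher.
    return 1 + sum(1 for stats in pages.values() if score(stats) > target_score)


if __name__ == "__main__":
    for multiplier in (0, 10, 40, 100):
        print(multiplier, rank_of("MSIS", SAMPLE, multiplier))
```

With these figures MSIS stays at #2 for multipliers of 0 and 10, drops to #3 at 40, and falls to last of the five at 100, as the link-heavy pages overtake it.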

Any insight is appreciated.

Thanks, Matt

Found it!
It was in the Sphinx doco for the sorting modes:

SPH_SORT_EXPR mode

Expression sorting mode lets you sort the matches by an arbitrary arithmetic expression, involving attribute values, internal attributes (@id and @weight), arithmetic operations, and a number of built-in functions. Here's an example:

$cl->SetSortMode ( SPH_SORT_EXPR,   "@weight + ( user_karma + ln(pageviews) )*0.1" );

So my LocalSettings.php parameters are:

    $wgSphinxSearch_sortmode = SPH_SORT_EXPR;
    $wgSphinxSearch_sortby = '@weight + page_counter + (pl_count * 10)';
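When trying out different multipliers, it can be handy to generate the sort expression rather than edit it by hand. The sketch below is a hypothetical helper (`build_sort_expr` is not part of SphinxSearch or Sphinx). It only assembles the string to paste into $wgSphinxSearch_sortby, assuming the attribute names `page_counter` and `pl_count` from the settings above.

```python
def build_sort_expr(link_multiplier=10, views_attr="page_counter", links_attr="pl_count"):
    """Return an SPH_SORT_EXPR string of the form @weight + views + (links * multiplier)."""
    return f"@weight + {views_attr} + ({links_attr} * {link_multiplier})"


if __name__ == "__main__":
    # Print the expression for a few candidate multipliers.
    for multiplier in (5, 10, 20):
        print(build_sort_expr(multiplier))
```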

My search results are coming back better than ever now. Still experimenting but I like what I see.

Thanks, Matt