Extension talk:SphinxSearch/Page rank

Combine pageweight + viewcount + pagelinks?
I am experimenting with SphinxSearch on an internal wiki used for technical documentation. I have it up and running and have made the modifications documented on this SphinxSearch/Page rank page. I have also experimented with changing the order of the $wgSphinxSearch_sortby parameters. However, no matter which order I try, I can't quite get results that I am completely comfortable with: a combination that works well for one search might not work well for another. What might work nicely is being able to combine pageweight + viewcount + pagelinks (maybe with a multiplier) into one value that is used for sorting.

Pageweight is obviously the most valuable metric gained from using Sphinx, but sorting on it first doesn't always work out the way I want. Sorting by viewcount or pagelinks first, on the other hand, tends to be pretty harsh and negates the value of having pageweight. In my mind, combining all three values might produce a more useful score.

For example, here are the results for the same search criteria using different combinations for $wgSphinxSearch_sortby.

Top 5 By Weight (Weight, Views, Links)
PageTitle           : Weight + Views + (Links * 10) = New Weight
1. 4684-BKUP7000    : 2632 +  4 + (2 * 10) = 2656
2. ProcessModel     : 2632 +  1 + (0 * 10) = 2633
3. 205-PAP          : 2631 + 59 + (2 * 10) = 2710
4. 1174-PAP         : 2631 + 20 + (0 * 10) = 2651
5. AcuteCareIPandLT : 2631 + 12 + (0 * 10) = 2643
The page that we really want (MSIS) doesn't even show up in the top 5 (or the top 20, because so many pages scored 2631 for weight).

Top 5 By Views (Views, Links, Weight)
PageTitle      : Weight + Views + (Links * 10) = New Weight
1. Encounters  : 1607 + 2109 + (9 * 10)  = 3806
2. Needy       : 1582 + 1261 + (2 * 10)  = 2863
3. CaseTrakker : 1594 + 1017 + (5 * 10)  = 2661
4. MSIS        : 2627 +  909 + (2 * 10)  = 3556 (This is the target page)
5. Portal      : 1560 +  880 + (13 * 10) = 2570
The target page (MSIS) cracks the top 5, but only because it happened to have enough views.

Top 5 By PageLinks (Links, Weight, Views)
PageTitle     : Weight + Views + (Links * 10) = New Weight
1. 1228-EMS   : 1560 + 0 + (27 * 10) = 1830
2. 1232-R2000 : 1560 + 0 + (27 * 10) = 1830
3. 1232-R2001 : 2582 + 5 + (26 * 10) = 2847
4. 1234-R3000 : 1560 + 3 + (26 * 10) = 1823
5. 1234-R3001 : 1560 + 1 + (26 * 10) = 1821
Again, the target page (MSIS) is nowhere to be found. These pages have diagrams and lots of links, but they are really low on the value scale (hence the low number of views).

Top 5 By New Weight
1. Encounters : 1607 + 2109 + (9 * 10)  = 3806
2. MSIS       : 2627 +  909 + (2 * 10)  = 3556 (This is the target page)
3. Needy      : 1582 + 1261 + (2 * 10)  = 2863
4. 1232-R2001 : 2582 +    5 + (26 * 10) = 2847
5. 205-PAP    : 2631 +   59 + (2 * 10)  = 2710
The New Weight brought the target page (MSIS) up to its highest position yet (#2) and helped keep other noisy pages lower in the rankings. That little bump from combining weight + views pushed the page past all of the noisy pages that had slightly better weight (+4 points) but fewer views.
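The arithmetic in the tables above can be double-checked with a short script. This is only a sketch for verifying the numbers, not part of the extension: the PAGES list copies the weight, views, and links values quoted above, and the names new_weight, rank, and LINK_MULTIPLIER are made up for illustration.

```python
# Hypothetical helper for sanity-checking the combined score.
# Not part of SphinxSearch; names are invented for this sketch.
LINK_MULTIPLIER = 10  # the multiplier used in the rows of the tables above

# (title, weight, views, links), copied from the tables above
PAGES = [
    ("4684-BKUP7000", 2632, 4, 2),
    ("ProcessModel", 2632, 1, 0),
    ("205-PAP", 2631, 59, 2),
    ("1174-PAP", 2631, 20, 0),
    ("AcuteCareIPandLT", 2631, 12, 0),
    ("Encounters", 1607, 2109, 9),
    ("Needy", 1582, 1261, 2),
    ("CaseTrakker", 1594, 1017, 5),
    ("MSIS", 2627, 909, 2),
    ("Portal", 1560, 880, 13),
    ("1228-EMS", 1560, 0, 27),
    ("1232-R2000", 1560, 0, 27),
    ("1232-R2001", 2582, 5, 26),
    ("1234-R3000", 1560, 3, 26),
    ("1234-R3001", 1560, 1, 26),
]


def new_weight(weight, views, links, multiplier=LINK_MULTIPLIER):
    """Combine the three metrics into one score: weight + views + links * multiplier."""
    return weight + views + links * multiplier


def rank(pages, multiplier=LINK_MULTIPLIER):
    """Return (title, score) pairs sorted by combined score, highest first."""
    scored = [(title, new_weight(w, v, l, multiplier)) for title, w, v, l in pages]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for position, (title, score) in enumerate(rank(PAGES)[:5], start=1):
        print(f"{position}. {title}: {score}")
```

Changing LINK_MULTIPLIER (or passing a different multiplier to rank) is a quick way to see how sensitive the ordering is before touching the wiki configuration.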

Combining the three is a little subjective and could obviously get out of scale pretty easily depending on a number of factors, but I would like to try it out. How do I go about making modifications to how the extension is configured to try to come up with a score that combines all three values (weight + views + (links * multiplier))?

Any insight is appreciated.

Thanks, Matt

Found it!
It was in the Sphinx doco for the sorting modes:

SPH_SORT_EXPR mode

Expression sorting mode lets you sort the matches by an arbitrary arithmetic expression, involving attribute values, internal attributes (@id and @weight), arithmetic operations, and a number of built-in functions. Here's an example:

$cl->SetSortMode ( SPH_SORT_EXPR,   "@weight + ( user_karma + ln(pageviews) )*0.1" );

So my LocalSettings.php parameters are:

$wgSphinxSearch_sortmode = SPH_SORT_EXPR;
$wgSphinxSearch_sortby = '@weight + page_counter + (pl_count * 10)';

My search results are coming back better than ever now. Still experimenting but I like what I see.

Thanks, Matt

Spoke too soon
I am able to get the results I want from the command line by passing --sortexpr "@weight + page_counter + (pl_count * 10)", but I am not getting the same results through the wiki extension. Digging around, it looks like the extension is not set up to use the SPH_SORT_EXPR mode.

Ideal page rank
I think my ideal page rank formula using SPH_SORT_EXPR would be the following:

weight
+ views (rewards popularity of the page)
+ # of edits (rewards effort put into the page)
+ inbound links * multiplier (reward for being linked to)
+ outbound links * multiplier (reward for linking to other pages)
+ # of categories * multiplier (reward for being categorized)
- # of days since last update (penalty for pages that haven't been edited in a while)
- 500 for To Be Deleted or Archived categories (deduction for known pages that are old or no longer needed)
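As a rough sketch, the formula above can be written out as a plain function to play with the numbers offline. Everything here is an assumption for illustration: the PageStats structure, the field names, and the three multiplier values are hypothetical. Only the overall shape (rewards added, staleness subtracted, a 500-point deduction for To Be Deleted or Archived pages) comes from the formula above.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical container for the per-page inputs to the formula."""
    weight: int             # Sphinx match weight
    views: int              # page view count
    edits: int              # number of edits
    inbound_links: int      # pages linking to this page
    outbound_links: int     # pages this page links to
    categories: int         # number of categories
    days_since_update: int  # days since the last edit
    flagged: bool           # in a To Be Deleted or Archived category


# Assumed multipliers; the formula above does not fix these values.
INBOUND_MULTIPLIER = 10
OUTBOUND_MULTIPLIER = 5
CATEGORY_MULTIPLIER = 5
FLAGGED_PENALTY = 500  # deduction taken from the formula above


def ideal_rank(page):
    """Add the rewards, then subtract the staleness and category penalties."""
    score = (
        page.weight
        + page.views
        + page.edits
        + page.inbound_links * INBOUND_MULTIPLIER
        + page.outbound_links * OUTBOUND_MULTIPLIER
        + page.categories * CATEGORY_MULTIPLIER
        - page.days_since_update
    )
    if page.flagged:
        score -= FLAGGED_PENALTY
    return score
```

Note that days since last update is unbounded, so a page untouched for years could lose far more than the 500-point category deduction; capping or scaling that term may be worth trying.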

Based on the existing example, we already have weight, views, inbound links and categories. Outbound links is just another temp table on a different field. # of edits is probably another temp table. I think the only one that scares me is trying to get # of days since last update since it requires pulling the MediaWiki custom timestamp and converting it into a value that can be used in calculations. Might be easiest to grab the year and just do years since last update * multiplier.
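For the days-since-last-update term, one possible approach is to do the conversion outside of Sphinx. MediaWiki stores timestamps as 14-digit strings in YYYYMMDDHHMMSS form (UTC), so parsing them is straightforward. The function names below are hypothetical, and this is a sketch rather than something the extension provides.

```python
from datetime import datetime, timezone

# MediaWiki timestamps are 14-digit UTC strings, e.g. "20130328011300".
MW_TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def days_since_update(mw_timestamp, now=None):
    """Whole days between a MediaWiki timestamp and now (never negative)."""
    if now is None:
        now = datetime.now(timezone.utc)
    updated = datetime.strptime(mw_timestamp, MW_TIMESTAMP_FORMAT)
    updated = updated.replace(tzinfo=timezone.utc)
    return max((now - updated).days, 0)


def years_since_update(mw_timestamp, now=None):
    """Cruder alternative: compare only the year portion of the timestamp."""
    if now is None:
        now = datetime.now(timezone.utc)
    return max(now.year - int(mw_timestamp[:4]), 0)
```

The year-only variant matches the "grab the year" idea above and avoids date parsing entirely, at the cost of treating a page edited in December the same as one edited the previous January.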

Assuming that things play out the way I think they should, getting this formula in place would help significantly. In our case, we have a wiki that was stood up in 2005 and 50,000-60,000 pages were initially imported from a previous repository. Unfortunately, the majority of that content is stale, out of date, not useful, etc. but it overwhelms the search currently with all the noise and false hits. With this approach, we can reward the active and popular pages and try to significantly lower the score on those older pages. Going for that many factors is probably overkill, but at the same time, it hits all of the key points.

I need to take a break from this for a bit, but I should be back on it in early April 2013.

See you soon!

--Mattsmith321 (talk) 01:13, 28 March 2013 (UTC)