User:Jeblad/Missing link detection

Missing link detection is a somewhat pesky problem, as it creates huge amounts of data for what seems like a minor increase in usability. Given that it is a useful tool, it is nevertheless interesting to see whether it is solvable.

The only thing we want is a limited list of pages that readers seem to use a lot before they visit a specific page. We want to identify these pages because there might be some hidden dependency between the two pages. The problem is that the number of possible candidates can be very large and nearly unbounded.

First we keep a short history with K entries in the browser of previous visits to the wiki project. The entries in this list are weighted key-value pairs; more recent entries have a bigger weight than older ones. How long the list should be is uncertain, as is how the weighting should be done.
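A minimal sketch of such a history could look like the following. The names `record_visit` and `weighted_history`, the default K of 5, and the exponential weighting with `decay=0.5` are assumptions made for illustration only; the text above leaves both the length and the weighting open.

```python
from collections import OrderedDict


def record_visit(history, page_id, k=5):
    """Move page_id to the most recent position and keep at most k entries."""
    history.pop(page_id, None)       # forget any older visit to the same page
    history[page_id] = True          # most recent entry goes last
    while len(history) > k:
        history.popitem(last=False)  # drop the oldest entry
    return history


def weighted_history(history, decay=0.5):
    """Return (page_id, weight) pairs, newest first.

    Assumed weighting: decay ** age, so the newest entry has weight 1.0.
    """
    newest_first = list(reversed(history))
    return [(page, decay ** age) for age, page in enumerate(newest_first)]


history = OrderedDict()
for page in [1, 2, 3, 1, 4]:
    record_visit(history, page, k=3)
pairs = weighted_history(history)
```

Here `pairs` holds the three most recently visited distinct pages, with the weight halving for each step back in time.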

As the user visits new pages the history is transferred to a logging server. Either the history is filtered in the browser for ordinary inbound links, or the filtering is done at the logging server. There each key is hashed into a small set of M bins and the weights are accumulated for those bins. All bins not hit are decremented by a fraction, such that the decrements add up to the same amount as was added.
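The accumulation step might be sketched as below. This is one reading of the description, not a definitive implementation: the CRC32 hash, the function names, and the choice to spread the decrement evenly over the bins that were not hit are all assumptions.

```python
import zlib


def bin_index(page_id, m):
    """Fold a page id into one of m bins (CRC32 is an arbitrary choice here)."""
    return zlib.crc32(str(page_id).encode("utf-8")) % m


def accumulate(bins, weighted_entries):
    """Add the weights to the bins that are hit and decrement all other bins.

    The total decrement equals the total weight added, so the sum over
    all bins stays constant as long as at least one bin is missed.
    """
    m = len(bins)
    hit = {}
    for page_id, weight in weighted_entries:
        i = bin_index(page_id, m)
        hit[i] = hit.get(i, 0.0) + weight
    missed = m - len(hit)
    penalty = sum(hit.values()) / missed if missed else 0.0
    for i in range(m):
        bins[i] += hit[i] if i in hit else -penalty
    return bins


bins = accumulate([0.0] * 8, [(42, 1.0), (7, 0.5)])
```

With this choice, bins that are hit often drift upwards while the rest drift downwards, which is what makes the hot bins stand out over time.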

What we now have is a hash table that identifies the hash keys of the often-visited previous pages, because those bins have associated accumulators that grow very large compared to the others. We do not know the names of the pages, only that they fold into such-and-such bins.
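Picking out the bins whose accumulators have grown largest can be sketched in a few lines. The function name `top_bins` and the example values are hypothetical.

```python
import heapq


def top_bins(bins, n):
    """Return the indices of the n bins with the largest accumulators."""
    return set(heapq.nlargest(n, range(len(bins)), key=bins.__getitem__))


hot = top_bins([0.1, 5.0, -1.0, 3.0], 2)  # bins 1 and 3 are the hottest
```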

If a history entry is about to be inserted into a bin that is among the topmost N, then the entry is either pushed onto an LRU chain or, if it is already there, reordered in the chain. This chain is specific to each destination article, and the identifiers used on the chain are the article numbers.
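One way to sketch the per-destination chain is with an ordered dictionary. The class `LruChain`, its `capacity` parameter, and the helper `offer` are names invented for this illustration; the set of hot bins is assumed to be computed elsewhere.

```python
from collections import OrderedDict


class LruChain:
    """Bounded LRU chain of article numbers for one destination article."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # article number -> None, most recent last

    def touch(self, article_id):
        """Push a new entry, or move an existing one to the front."""
        if article_id in self.entries:
            self.entries.move_to_end(article_id)
        else:
            self.entries[article_id] = None
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recent

    def hottest(self):
        """Return the article numbers, most recently touched first."""
        return list(reversed(self.entries))


def offer(chain, article_id, bin_idx, hot_bins):
    """Touch the chain only if the entry folds into one of the hot bins."""
    if bin_idx in hot_bins:
        chain.touch(article_id)
        return True
    return False


chain = LruChain(capacity=3)
for article in [10, 20, 30, 10, 40]:
    offer(chain, article, bin_idx=1, hot_bins={1})
```

After these calls article 20 has been evicted, and `chain.hottest()` lists 40, 10 and 30 in that order.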

The topmost entries on the LRU chain are the hottest articles that do not already have an ordinary inbound link to the destination. Articles at the tail of the LRU chain are more uncertain, and the overall length somehow describes how many of the entries have a usable confidence.

The entries in the plain LRU chain can be changed to keep an additional weight. This can be the same weight as the one made available from the history lists.
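A weighted variant could be sketched as follows. Summing the weights on each hit is an assumption, since the text only says that a weight is kept; the function names and the default capacity are likewise hypothetical.

```python
from collections import OrderedDict


def touch_weighted(chain, article_id, weight, capacity=100):
    """Move article_id to the front and add weight to its running total."""
    total = chain.pop(article_id, 0.0) + weight
    chain[article_id] = total        # re-inserted last, so most recent
    while len(chain) > capacity:
        chain.popitem(last=False)    # evict the least recently touched
    return chain


def ranked(chain):
    """Return (article_id, weight) pairs, heaviest first."""
    return sorted(chain.items(), key=lambda item: item[1], reverse=True)


chain = OrderedDict()
touch_weighted(chain, 1, 1.0)
touch_weighted(chain, 2, 0.5)
touch_weighted(chain, 1, 0.25)
```

The chain order still reflects recency, while `ranked(chain)` gives an ordering by accumulated weight.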

It is possible to drop the filtering of already existing inbound links. If so, the LRU chain will identify the pages from which most of the traffic originates. In this case any pages that aren't linked should be marked accordingly. Note that the numeric references must be checked for redirects.
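The redirect check might look roughly like this, assuming a hypothetical `redirects` mapping from a redirect page's number to its target's number. How such a mapping would be obtained from the wiki is not covered here.

```python
def resolve_redirect(page_id, redirects, max_hops=5):
    """Follow redirects until a non-redirect page, a loop, or max_hops."""
    seen = set()
    while page_id in redirects and page_id not in seen and len(seen) < max_hops:
        seen.add(page_id)
        page_id = redirects[page_id]
    return page_id


redirects = {1: 2, 2: 3}            # page 1 redirects to 2, which redirects to 3
target = resolve_redirect(1, redirects)
```

The `seen` set and the hop limit are there only to guard against redirect loops and overly long chains.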

After the missing links have been detected they can be inserted either manually, by inspecting a special page that lists the links (the special page related pages), or by using the parser function related pages. The function will list the most probable links but can be overridden. The special page can be replaced by a dynamic gadget.

The lists for the source pages can be generated either as batch jobs or continuously. It is probably better to run the generation as a batch job each day, possibly with an additional limit on how much statistics there must be before the lists are generated.

Because the lists are finite in size, and the number of pages is also finite, it is possible to store the data in ordinary databases. Because of the inversion step involved, possibly implemented as a MapReduce step, it is likely that the algorithm is best implemented with a NoSQL database like MongoDB.
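The inversion step is read here as turning the per-destination chains into per-source lists, so that each source page gets its own list of suggested links. That reading, the function name `invert`, and the data layout are assumptions; the sketch runs in memory and only mimics the map and reduce phases.

```python
from collections import defaultdict


def invert(chains):
    """Turn destination -> [(source, weight)] into source -> [(destination, weight)].

    Each source's list is sorted with the heaviest destination first.
    """
    by_source = defaultdict(list)
    for destination, candidates in chains.items():  # "map": emit one pair per source
        for source, weight in candidates:
            by_source[source].append((destination, weight))
    return {                                        # "reduce": order each group
        source: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for source, pairs in by_source.items()
    }


chains = {100: [(1, 0.9), (2, 0.4)], 200: [(1, 0.3)]}
suggestions = invert(chains)
```

In this example page 1 would be offered links to pages 100 and 200, and page 2 a link to page 100.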