User:Jeblad/Missing link detection

Missing link detection is a bit of a pesky problem, as it creates huge amounts of data for what seems like a minor increase in usability. Given that it is a useful tool, it is nevertheless interesting to see whether it is solvable.

What we want is a limited list of pages that readers seem to use a lot before they visit a specific page. We want to identify those pages because there might be some hidden dependency between the two pages. The problem is that the number of possible candidates can be very large and nearly unbounded.

In the history leading up to a page there are other pages, and each of them has a weight depending on how far back in the history it is. The further back in the history, the lower the weight; the closer to the present time, the higher the weight.

The full log needed to produce the missing link table (or the related link table, whatever it ends up being called) will be extremely large. To limit its size we introduce a few small tricks.

Algorithm

History

First we keep a short history with K entries in the browser of previous visits to the wiki project. The entries in this list are weighted key-value pairs; more recent entries have a higher weight than older ones. How long the list should be is uncertain, as is how the weighting should be done, but a good guess is more than a handful of entries, with the weight tapering from a set value down to nearly zero at the K-th (oldest) entry.
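As a sketch only, since the exact weighting is left open above, a linear taper over the K entries could look like

    w_i = w_0 · (1 − i / K),   i = 0 … K−1

where i = 0 is the most recent visit and w_0 is the starting weight.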

Collection for history
This is an example structure designed for LocalStorage in the browser.
{
    chain : [ /* typically 8-16 chain items (a FIFO chain) */
        { /* chain item, one of several */
            id : /* wgArticleID of a previously visited page */,
            ns : /* namespace id of that page (used for filtering out unwanted items) */
        }
    ]
}

Only a subset of the items is used; those not used are either excluded from the transfer or marked as unused. Typically all entries from special pages will be removed.
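Below is a minimal sketch of the browser-side part, assuming LocalStorage and the structure above. The names HISTORY_KEY, MAX_ENTRIES, recordVisit and weights are illustrative only, and the linear taper is just one possible weighting; in a gadget the id and ns values would come from mw.config.get('wgArticleID') and mw.config.get('wgNamespaceNumber').

interface ChainItem {
    id : number;  // wgArticleID of a previously visited page
    ns : number;  // namespace id, used for filtering out unwanted items
}

interface History {
    chain : ChainItem[];
}

const HISTORY_KEY = 'missingLinkHistory';  // illustrative LocalStorage key
const MAX_ENTRIES = 16;                    // K, typically 8-16
const NS_SPECIAL = -1;                     // special pages are filtered out

function loadHistory(): History {
    const raw = localStorage.getItem(HISTORY_KEY);
    return raw ? (JSON.parse(raw) as History) : { chain: [] };
}

// Called on each page view; the history behaves as a FIFO chain with the
// newest entry first and at most MAX_ENTRIES items.
function recordVisit(id: number, ns: number): void {
    const history = loadHistory();
    if (ns !== NS_SPECIAL && id !== 0) {
        history.chain.unshift({ id, ns });
        history.chain = history.chain.slice(0, MAX_ENTRIES);
    }
    localStorage.setItem(HISTORY_KEY, JSON.stringify(history));
}

// Tapered weights by position: the most recent entry gets the full weight
// and the oldest approaches zero.
function weights(history: History, w0: number = 1.0): number[] {
    return history.chain.map((_, i) => w0 * (1 - i / MAX_ENTRIES));
}

Filtering out special pages in the browser keeps the transferred history small, but as noted below the filtering can also be done at the logging server.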

Logging

As the user visits new pages the history is transferred to a logging server. Either the history is filtered in the browser for ordinary inbound links, or the filtering is done at the logging server. There each entry is hashed into one of a small set of M bins and its weight is accumulated in that bin. All bins not hit are decremented by a small fraction.

What we now have is a hash table that identifies the hash keys of the often visited previous pages: because the bins have associated accumulators, some of them grow very large compared to the others. We do not know the names of the pages, only which bins they fold into.

If a history entry is about to be inserted into a bin that is among the topmost N, then the entry is either pushed onto an LRU chain or, if already present, reordered within the chain. This chain is specific to each destination article, and the identifiers used on the chain are the article numbers.

Collection for logging
This is an example structure designed for MongoDB at the server.
{
    _id : /* wgArticleID for the article analyzed */,
    bins : [ /* typically 256 hash buckets, each with a single float */ ],
    chain : [ /* typically 128 wgArticleID for previously visited items */ ]
}
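A minimal sketch of the server-side accumulation described above, assuming M = 256 bins, a decay fraction, and an in-memory stand-in for the MongoDB document; the names logHistory, bin, N and DECAY are illustrative, and the topmost-N test is deliberately naive.

interface LogDocument {
    _id : number;     // wgArticleID of the destination article
    bins : number[];  // M accumulators, here M = 256
    chain : number[]; // LRU chain of previously visited wgArticleIDs
}

const M = 256;            // number of hash bins
const N = 32;             // topmost bins that feed the LRU chain
const DECAY = 0.01;       // fraction removed from bins that were not hit
const CHAIN_LENGTH = 128;

// Hash an article id into one of the M bins (Knuth multiplicative hash).
function bin(id: number): number {
    return (Math.imul(id, 2654435761) >>> 0) % M;
}

// Accumulate one transferred history for a destination article.
// ids and weights are parallel arrays from the transferred history,
// and doc is the logging document for that article.
function logHistory(doc: LogDocument, ids: number[], weights: number[]): void {
    const hit = new Set<number>();
    ids.forEach((id, i) => {
        const b = bin(id);
        doc.bins[b] += weights[i];
        hit.add(b);

        // If the bin is among the topmost N, push the id onto the LRU
        // chain, or move it to the front if it is already there.
        // Re-sorting per entry is naive but keeps the sketch short.
        const threshold = [...doc.bins].sort((a, c) => c - a)[N - 1];
        if (doc.bins[b] >= threshold) {
            const pos = doc.chain.indexOf(id);
            if (pos >= 0) {
                doc.chain.splice(pos, 1);
            }
            doc.chain.unshift(id);
            doc.chain = doc.chain.slice(0, CHAIN_LENGTH);
        }
    });

    // Bins that were not hit decay by a small fraction.
    for (let b = 0; b < M; b++) {
        if (!hit.has(b)) {
            doc.bins[b] *= 1 - DECAY;
        }
    }
}

In practice the document would be read from and written back to the logging collection, and the topmost-N threshold would be maintained incrementally rather than re-sorting the bins for every entry.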

The topmost entries on the LRU chain, that is those at the head of the list, are the most used articles in the history. Articles at the tail of the LRU chain are more uncertain. The overall length roughly describes how many of the entries have a usable confidence. The chain must be long enough that often referenced articles remain on the chain. When analyzing the articles on the chain, each of them can be assigned to its bin, thereby assigning the weight from that bin to the article.

This will be unbounded and not something that can be normalized.

The entries in the plain LRU chain can be extended to keep an additional count or weight. This can be the same weight as the one made available from the history lists. These counts or weights are later used for sorting the entries, or recalculated to get a relevance ranking. Note that the weight must be adjusted according to an expected value given by an overall count.

This can be normalized over the items in the chain.
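A small sketch of such a ranking, assuming each chain entry has been extended with an accumulated weight as described above; the type WeightedEntry and the function rankChain are illustrative names, and the adjustment against an expected value is left out.

interface WeightedEntry {
    id : number;      // wgArticleID of a previously visited page
    weight : number;  // accumulated count or weight for this entry
}

// Normalize the weights over the items in the chain and sort them,
// giving a simple relevance ranking for the destination article.
// An adjustment against an expected value from an overall count
// would go here as well, but is omitted in this sketch.
function rankChain(entries: WeightedEntry[]): WeightedEntry[] {
    const total = entries.reduce((sum, e) => sum + e.weight, 0);
    return entries
        .map((e) => ({ id: e.id, weight: total > 0 ? e.weight / total : 0 }))
        .sort((a, b) => b.weight - a.weight);
}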

Relations

After the missing links have been detected they can be inserted either manually, by inspecting a special page that lists the links (the special page related pages), or by using the parser function related pages. The function will list the most probable links but can be overridden. The special page can be replaced by a dynamic gadget.

Collection for relations
This is an example structure designed for MongoDB at the server.
{
    _id: { /* the inverted id */
        id: /* */,
        count: /* */
    }
}

The lists for the source pages can be generated either as batch jobs or continuously. It is probably better to run the generation as a batch job each day, possibly with an additional limit on how much statistics there must be before the lists are generated.

Because the lists are finite in size, and the number of pages is also finite, it is possible to store the data in ordinary databases. Because of the inversion step involved, possibly implemented as a MapReduce step, the algorithm is likely best implemented with a NoSQL database like MongoDB.
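As an illustration only, the inversion step could be sketched as below, reusing the WeightedEntry type from the ranking sketch above and inverting the per-destination rankings into per-source lists of related pages; in practice this would run as a MapReduce or aggregation job over the logging collection rather than in memory.

interface Relation {
    id : number;     // related (destination) article
    count : number;  // accumulated weight or count for the relation
}

// Invert "destination -> ranked previous pages" into
// "source -> destination pages it seems to be related to".
function invert(rankings: Map<number, WeightedEntry[]>): Map<number, Relation[]> {
    const related = new Map<number, Relation[]>();
    for (const [destination, entries] of rankings) {
        for (const entry of entries) {
            const list = related.get(entry.id) ?? [];
            list.push({ id: destination, count: entry.weight });
            related.set(entry.id, list);
        }
    }
    // Sort each source page's list so the most probable links come first.
    for (const list of related.values()) {
        list.sort((a, b) => b.count - a.count);
    }
    return related;
}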