User:Jeblad/Detect text segments with a convolving hash

Detecting text segments with a convolving hash is an attempt to describe an algorithm for text synchronization.

[Should add something about the hashing at each index leading to invariance on editing.]

It is easier to see what is going on in the basic algorithm than in the more convolved approaches.

Basic algorithm
Assume there is a text $$A$$ and a smaller segment $$B$$. Let there be an index $$i$$ into $$B$$, and an offset $$k$$ into $$A$$. Assume the wanted needle digest comes from the smaller segment and is $$\mathbf{X}$$, and the haystack has the digest $$\mathbf{Y}$$. These digests are vectors, but can be reprocessed into binary numbers. For simplicity, call the unprocessed digests the needle vector and the haystack vector, and call the truncated version of either one a digest.

Calculate the needle digest as

$$\mathbf{foreach}~\mbox{char}~i~\mbox{in}~B~\mathbf{do}~X_{\operatorname{h}(B_{i})} \leftarrow X_{\operatorname{h}(B_{i})} + 1 $$

Here the hash function $$\operatorname{h}(\cdot)$$ takes a short string of chars (or codepoints) and folds it into an alternate range. Something like Pearson hashing is used, as speed is crucial, and a precomputed lookup table can be used instead of a modulus operation. The size of the alternate range is the length of the wanted digest, or of the vector in this case.
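The needle vector can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the digest length `DIGEST_SIZE`, the seeded permutation `TABLE`, the folding table `FOLD`, and the choice to hash one character at a time are all made up for the example and are not prescribed by the description above.

```python
import random

DIGEST_SIZE = 16  # assumed length of the needle and haystack vectors

# Assumed Pearson-style permutation of 0..255 (seeded so it is reproducible).
TABLE = list(range(256))
random.Random(0).shuffle(TABLE)

# Precomputed fold from the 8-bit hash into the digest range,
# so no modulus operation is needed while hashing.
FOLD = [value % DIGEST_SIZE for value in range(256)]


def h(chunk: str) -> int:
    """Pearson-style hash of a short string, folded into range(DIGEST_SIZE)."""
    value = 0
    for byte in chunk.encode("utf-8"):
        value = TABLE[value ^ byte]
    return FOLD[value]


def needle_vector(segment: str) -> list[int]:
    """Count how many chars of the segment fall into each hash bucket."""
    x = [0] * DIGEST_SIZE
    for ch in segment:
        x[h(ch)] += 1
    return x
```

Because the vector only counts bucket hits, it does not depend on the order of the chars inside the segment; `needle_vector("abc")` and `needle_vector("cab")` are equal.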

Calculate the haystack digest for a window as

$$\mathbf{foreach}~\mbox{char}~i+k~\mbox{in}~A~\mathbf{do}~Y_{\operatorname{h}(A_{i+k})} \leftarrow Y_{\operatorname{h}(A_{i+k})} + 1 $$

Compare $$\mathbf{X}$$ and $$\mathbf{Y}$$ with some distance metric and sum the distances

$$\mathbf{foreach}~\mbox{element}~j~\mbox{in}~\mathbf{X}~\mathbf{do}~z_{k} \leftarrow z_{k} + \operatorname{d} \left ( X_{j} - Y_{j} \right ) $$

Each of the elements in $$\mathbf{z}$$ will now hold the accumulated distance for a specific offset $$k$$ into the text $$A$$.
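The whole basic algorithm might look like the following sketch. The stand-in hash `h` (codepoint modulo the digest length), the name `DIGEST_SIZE`, and the absolute difference as the default distance `d` are assumptions made for the example; any fast hash and any distance metric could be substituted.

```python
DIGEST_SIZE = 16  # assumed length of the vectors


def h(ch: str) -> int:
    """Stand-in hash: fold one codepoint into range(DIGEST_SIZE)."""
    return ord(ch) % DIGEST_SIZE


def count_vector(text: str) -> list[int]:
    """Bucket counts for a piece of text (needle or haystack window)."""
    v = [0] * DIGEST_SIZE
    for ch in text:
        v[h(ch)] += 1
    return v


def distances(a: str, b: str, d=abs) -> list[int]:
    """Accumulated distance z[k] for every offset k of B inside A."""
    x = count_vector(b)
    z = []
    for k in range(len(a) - len(b) + 1):
        # The window is rehashed from scratch at every offset.
        y = count_vector(a[k:k + len(b)])
        z.append(sum(d(xj - yj) for xj, yj in zip(x, y)))
    return z
```

An offset where the segment occurs verbatim gets the distance zero, so `distances("the quick brown fox", "brown")` has a zero at the offset of `"brown"`. Other offsets can also reach zero, since the vectors ignore char order and different chars may share a bucket.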

Adjust the distance metric
There are a lot of kernels for doing correlation, and as the digest should be a binary string, the score could simply sum the elements of the haystack $$\mathbf{Y}$$, with the sign inverted where the needle $$\mathbf{X}$$ has a zero

$$\mathbf{foreach}~\mbox{element}~j~\mbox{in}~\mathbf{X}~\mathbf{do}~z_{k} \leftarrow z_{k} + \begin{cases} Y_{j}, & \mbox{if}~X_{j} = 1 \\ -Y_{j}, & \mbox{if}~X_{j} = 0 \end{cases} $$

This can be calculated a lot more efficiently as a summation over two masked vectors, although even more efficient solutions exist.
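As an illustration, the signed sum and the two-mask variant can be compared side by side. The function names are hypothetical, and the needle digest is simply taken as a list of bits; how those bits are derived from the needle vector is left open here.

```python
def signed_score(x_bits: list[int], y: list[int]) -> int:
    """Element-by-element sum with the sign flipped where the needle bit is 0."""
    score = 0
    for xj, yj in zip(x_bits, y):
        score += yj if xj == 1 else -yj
    return score


def masked_score(x_bits: list[int], y: list[int]) -> int:
    """Same score as the difference of two masked sums."""
    positive = sum(yj for xj, yj in zip(x_bits, y) if xj == 1)
    negative = sum(yj for xj, yj in zip(x_bits, y) if xj == 0)
    return positive - negative
```

Since the two masked sums together cover all of $$\mathbf{Y}$$, the score also equals twice the positive sum minus the total of $$\mathbf{Y}$$, which is one possible shortcut when that total is already known.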

Avoid repeating calculations
During the convolution a lot of hashing operations are repeated for every offset, which creates unnecessary load. Instead, the hashed values can be updated in place. To do this there is a leading action and a trailing action. Those do in-place updates according to the needle digest:
 * to be continued…
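One possible reading of the leading and trailing actions is a sliding-window update of the haystack vector: the char entering the window increments its bucket and the char leaving the window decrements its bucket. The sketch below makes that assumption, and it also assumes a stand-in hash `h` and a digest length `DIGEST_SIZE`; it updates only the haystack vector and does not cover updates that depend on the needle digest.

```python
DIGEST_SIZE = 16  # assumed length of the vectors


def h(ch: str) -> int:
    """Stand-in hash: fold one codepoint into range(DIGEST_SIZE)."""
    return ord(ch) % DIGEST_SIZE


def count_vector(text: str) -> list[int]:
    """Bucket counts computed from scratch, used here only as a reference."""
    v = [0] * DIGEST_SIZE
    for ch in text:
        v[h(ch)] += 1
    return v


def sliding_vectors(a: str, n: int):
    """Yield (k, Y) for every window of length n, hashing each char at most twice."""
    if n <= 0 or n > len(a):
        return
    y = count_vector(a[:n])
    yield 0, list(y)
    for k in range(1, len(a) - n + 1):
        y[h(a[k - 1])] -= 1      # trailing action: char that left the window
        y[h(a[k + n - 1])] += 1  # leading action: char that entered the window
        yield k, list(y)
```

Each step then costs two hash lookups instead of rehashing the whole window, and every yielded vector equals the one computed from scratch for the same offset.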