User:Jeblad/Detect text segments with a convolving hash

Detecting text segments with a convolving hash is an attempt to describe an algorithm for text synchronization.

[Should add something about the hashing at each index leading to invariance on editing.]

It is easier to see what is going on in the basic algorithm than in the more convolved approaches.

Basic algorithm
Assume there is a text represented as a vector of chars (or codepoints) $$\mathbf{a}$$, the haystack string, and a shorter segment $$\mathbf{b}$$, the needle string. Also assume the wanted needle digest comes from the smaller segment and is $$\mathbf{\hat{b}}$$, and that the haystack has the digest $$\mathbf{\hat{a}}$$. These digests are vectors, but can be reprocessed into binary numbers. For simplicity, call the unprocessed digests the needle vector and the haystack vector, and the truncated versions the needle digest and the haystack digest. It is easiest to assume that there is some function that truncates the elements of a vector into a binary digest, so the distinction can mostly be ignored.

Calculate the needle digest as

$$ \begin{array}{lcl} \mathbf{foreach}~\mbox{char}~i~\mbox{in}~\mathbf{b}~\mathbf{do} \\ \quad j \leftarrow \operatorname{hash} \left ( \mathbf{b} \left [ i : i+\delta \right ] \right ) \\ \quad \hat{b}_{j} \leftarrow \hat{b}_{j} + 1 \end{array} $$

The result is the needle vector $$\mathbf{\hat{b}}$$.

The calculation above uses a hash function $$\operatorname{hash}(\cdot)$$ that takes a short string of chars (or codepoints) and folds it into an alternate range. Something like Pearson hashing is often used, because speed is crucial while ideal hash properties are unimportant; a precomputed lookup table can then be used instead of a modulus operation. The alternate range is the length of the wanted digest, or of the vector in this case.
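The needle calculation can be sketched in Python as follows. This is only an illustration under stated assumptions: the shingle length `DELTA`, the digest length `DIGEST_SIZE`, the seeded permutation table and the rule used in `trunc` (any non-zero count becomes 1) are not fixed by the text, and only shingles that fit entirely inside the string are hashed.

```python
import random

# Hypothetical parameters; the text does not fix them.
DELTA = 3          # shingle length (delta in the pseudocode)
DIGEST_SIZE = 256  # length of the digest vector

# Pearson-style lookup table: a fixed permutation of 0..255.
TABLE = list(range(256))
random.Random(0).shuffle(TABLE)

def pearson_hash(data: bytes) -> int:
    """Fold a short byte string into 0..255 using table lookups only."""
    h = 0
    for byte in data:
        h = TABLE[h ^ byte]
    return h

def digest_vector(text: bytes, delta: int = DELTA) -> list:
    """Count how often each hash bucket is hit by the shingles of text."""
    vec = [0] * DIGEST_SIZE
    for i in range(len(text) - delta + 1):
        vec[pearson_hash(text[i:i + delta])] += 1
    return vec

def trunc(count: int) -> int:
    """Assumed truncation rule: 1 if the bucket was hit at all, else 0."""
    return 1 if count > 0 else 0

needle_vector = digest_vector(b"brown fox")
needle_digest = [trunc(c) for c in needle_vector]
```

Because the table has 256 entries, the hash already lands in the range of the digest and no modulus is needed.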

Calculate the haystack digest for a window of length $$\gamma$$ starting at offset $$k$$ as

$$ \begin{array}{lcl} \mathbf{foreach}~\mbox{char}~i~\mbox{in}~\mathbf{a} \left [ k : k+\gamma \right ] ~\mathbf{do} \\ \quad j \leftarrow \operatorname{hash} \left ( \mathbf{a} \left [ i : i+\delta \right ] \right ) \\ \quad \hat{a}_{j} \leftarrow \hat{a}_{j} + 1 \end{array} $$

As part of the previous step, compare $$\mathbf{\hat{a}}$$ and $$\mathbf{\hat{b}}$$ element by element with some distance metric and sum the differences. The overall difference is a measure of how well the needle matches a particular location in the haystack.

$$ \begin{array}{lcl} \quad \mathbf{foreach}~\mbox{element}~j~\mbox{in}~\mathbf{\hat{b}}~\mathbf{do} \\ \qquad z_{k} \leftarrow z_{k} + \operatorname{dist} \left ( \hat{a}_{j}, \hat{b}_{j} \right ) \end{array} $$

Each of the elements in $$\mathbf{z}$$ will now hold the accumulated distance for a specific offset $$k$$ into the text $$\mathbf{a}$$.
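A minimal Python sketch of the whole basic algorithm may make the roles of $$k$$, $$\gamma$$ and $$\delta$$ clearer. Several choices here are assumptions rather than part of the description: `zlib.crc32` reduced by a modulus stands in for the fast hash, the window length is taken to be the needle length, the distance is the absolute difference of counts, and only shingles that lie fully inside a window are hashed.

```python
import zlib

def bucket(shingle: bytes, size: int) -> int:
    """Stand-in hash: folds a shingle into the range 0..size-1."""
    return zlib.crc32(shingle) % size

def digest_vector(text: bytes, delta: int, size: int) -> list:
    """Count bucket hits for every shingle of length delta in text."""
    vec = [0] * size
    for i in range(len(text) - delta + 1):
        vec[bucket(text[i:i + delta], size)] += 1
    return vec

def basic_scan(haystack: bytes, needle: bytes,
               delta: int = 3, size: int = 64) -> list:
    """Return z, where z[k] is the summed distance for the window at k."""
    gamma = len(needle)  # assumed window length
    b_hat = digest_vector(needle, delta, size)
    z = []
    for k in range(len(haystack) - gamma + 1):
        # The window digest is rebuilt from scratch for every offset.
        a_hat = digest_vector(haystack[k:k + gamma], delta, size)
        z.append(sum(abs(a - b) for a, b in zip(a_hat, b_hat)))
    return z

haystack = b"the quick brown fox jumps over the lazy dog"
needle = b"brown fox"
z = basic_scan(haystack, needle)
best = min(range(len(z)), key=z.__getitem__)  # an offset with the smallest distance
```

An exact occurrence of the needle gives a distance of zero at its offset. Hash collisions can in principle give other windows a small distance too, so a low value is a candidate match and not a proof.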

Adjust the distance metric
There are many kernels for doing correlation. As the needle digest should be a binary string, the kernel can simply sum the elements of the haystack vector $$\mathbf{\hat{a}}$$, with the sign inverted where the needle digest $$\mathbf{\hat{b}}$$ has a zero.

$$ \begin{array}{lcl} \quad \mathbf{foreach}~\mbox{element}~j~\mbox{in}~\mathbf{\hat{b}}~\mathbf{do} \\ \qquad z_{k} \leftarrow z_{k} + \begin{cases} \hat{a}_{j}, & \mbox{if}~\operatorname{trunc} \left ( \hat{b}_{j} \right ) == 1 \\ -\hat{a}_{j}, & \mbox{if}~\operatorname{trunc} \left ( \hat{b}_{j} \right ) == 0 \end{cases} \end{array} $$

This can be calculated a lot more efficiently as a summation over two masked vectors, although even more efficient solutions exist.
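The equivalence between the per-element form and the two masked sums can be checked with a small Python sketch. The function names are hypothetical, and truncation is assumed to map any non-zero needle count to 1.

```python
def signed_score(a_hat, b_hat):
    """Per-element form: add a_j where the needle bit is 1, else subtract."""
    score = 0
    for a, b in zip(a_hat, b_hat):
        score += a if b > 0 else -a
    return score

def masked_score(a_hat, b_hat):
    """Same value as the difference of two sums over complementary masks."""
    mask = [b > 0 for b in b_hat]  # assumed trunc(b_j) == 1
    hits = sum(a for a, m in zip(a_hat, mask) if m)
    misses = sum(a for a, m in zip(a_hat, mask) if not m)
    return hits - misses

a_hat = [2, 0, 1, 3]
b_hat = [1, 0, 0, 2]
assert signed_score(a_hat, b_hat) == masked_score(a_hat, b_hat)
```

Since the hits and misses add up to the total of `a_hat`, the score also equals twice the hits minus that total. When every window holds the same number of shingles, only the hits need to be summed.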

Avoid repeating calculations
During the convolution many hashing operations are repeated for every window, which creates a lot of unnecessary load. Instead, the values can be updated in place. To do this there is a leading action and a trailing action, which do in-place updates according to the needle digest.

$$ \begin{array}{lcl} \mathbf{foreach}~\mbox{char}~i~\mbox{in}~\mathbf{a}~\mathbf{do} \\ \quad j \leftarrow \operatorname{hash} \left ( \mathbf{a} \left [ i : i+\delta \right ] \right ) \\ \quad z_{j} \leftarrow z_{j} + \begin{cases} +1, & \mbox{if}~\operatorname{trunc} \left ( \hat{b}_{j} \right ) == 1 \\ -1, & \mbox{if}~\operatorname{trunc} \left ( \hat{b}_{j} \right ) == 0 \end{cases} \\ \quad j \leftarrow \operatorname{hash} \left ( \mathbf{a} \left [ i+l : i+l+\delta \right ] \right ) \\ \quad z_{j} \leftarrow z_{j} + \begin{cases} +1, & \mbox{if}~\operatorname{trunc} \left ( \hat{b}_{j} \right ) == 0 \\ -1, & \mbox{if}~\operatorname{trunc} \left ( \hat{b}_{j} \right ) == 1 \end{cases} \end{array} $$

[Not same z]

This assumes the $$\mathbf{\hat{b}}$$ digest is precomputed.

Out of this there will be a vector with correlation coefficients for each code point in the text.
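One possible reading of the leading and trailing actions is a running score per window, sketched below in Python. This is an interpretation, not a transcription of the pseudocode: it keeps one score per offset rather than updating a $$z_{j}$$ indexed by hash bucket, it uses `zlib.crc32` with a modulus as a stand-in hash, and it assumes the window holds as many shingles as the needle.

```python
import zlib

def bucket(shingle: bytes, size: int) -> int:
    """Stand-in hash: folds a shingle into the range 0..size-1."""
    return zlib.crc32(shingle) % size

def rolling_scores(haystack: bytes, needle: bytes,
                   delta: int = 3, size: int = 64) -> list:
    """Signed score for every window offset, hashing each shingle once."""
    n = len(needle) - delta + 1  # shingles per window
    if n <= 0:
        return []
    # Precomputed needle digest as a mask of buckets the needle hits.
    mask = [False] * size
    for i in range(n):
        mask[bucket(needle[i:i + delta], size)] = True
    # +1 for a shingle whose bucket is set in the needle digest, else -1.
    signs = [1 if mask[bucket(haystack[i:i + delta], size)] else -1
             for i in range(len(haystack) - delta + 1)]
    if len(signs) < n:
        return []
    score = sum(signs[:n])
    scores = [score]
    for k in range(1, len(signs) - n + 1):
        # Leading action adds the entering shingle,
        # trailing action removes the leaving one.
        score += signs[k + n - 1] - signs[k - 1]
        scores.append(score)
    return scores

haystack = b"the quick brown fox jumps over the lazy dog"
needle = b"brown fox"
scores = rolling_scores(haystack, needle)
```

With this reading each offset costs two additions after the one-time hashing, and the score reaches its maximum, the number of shingles in the needle, wherever the needle occurs exactly.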