User:Jeblad/Detect text segments with a convolving hash

Detect text segments with a convolving hash is an attempt to describe an $$ O(n) $$ algorithm for aligning text fragments. Although the algorithm is linear in the length of the text, each step is fairly expensive.

The basic idea is to convolve a structure over a text. The haystack is the text and can be heavily edited. The needle is a special structure and represents the original fragment. The needle structure is built so that it works like locality-sensitive hashing: it tolerates many changes to the text while still being able to find the target.

During convolution the values will slowly accumulate until the whole needle is inside the candidate range. They will then stay high until the needle starts to move out of the range. If the candidate range is unmodified, this will give a clean triangular shape. With moderate editing the top will flatten out, and with heavy editing it might form an irregular, plateau-like landscape.

It is easier to see what is going on in the basic algorithm than in the more convolved approaches.

Basic algorithm
Assume there is a text represented as a vector of chars (or codepoints) $$\mathbf{a}$$, the haystack string, and a shorter segment $$\mathbf{b}$$, the needle string. Also assume the wanted needle digest comes from the smaller segment and is $$\mathbf{\hat{b}}$$, and the haystack has the digest $$\mathbf{\hat{a}}$$. These digests are vectors, but can be reprocessed into binary numbers. For simplicity, call the unprocessed digests a needle vector and a haystack vector, and likewise the truncated versions a needle digest and a haystack digest. It is easiest to assume that some function truncates the elements of the vector into a binary digest, so that the distinction can mostly be ignored.

Calculate the needle digest as

$$ \begin{array}{lcl} \mathbf{foreach}~\mbox{char}~i~\mbox{in}~\mathbf{b}~\mathbf{do} \\ \quad j \leftarrow \operatorname{hash} \left ( \mathbf{b} \left [ i : i+\delta \right ] \right ) \\ \quad \hat{b}_{j} \leftarrow \hat{b}_{j} + 1 \end{array} $$

The result is the needle vector $$\mathbf{\hat{b}}$$.

The pseudocode above uses a hash function $$\operatorname{hash}(\cdot)$$ that takes a short string of chars (or codepoints) and folds it into an alternate range. Often something like Pearson hashing is used, as speed is crucial while ideal hash properties are unimportant, and a precomputed lookup table can then be used instead of a modulus operation. The alternate range is the length of the wanted digest, or of the vector in this case.
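The needle vector can be sketched in Python as follows. This is only an illustration under stated assumptions: the digest length `BINS`, the n-gram length `DELTA`, the shuffle seed, and the choice to hash only full-length n-grams (skipping the shorter slices at the very end of the string) are not given in the description above.

```python
import random

BINS = 256   # assumed digest length; Pearson hashing naturally yields 0..255
DELTA = 3    # assumed n-gram length (delta in the pseudocode)

# Pearson hashing needs a fixed permutation of 0..255; the seed is arbitrary.
TABLE = list(range(256))
random.Random(0).shuffle(TABLE)

def pearson_hash(chunk):
    """Fold a short string into 0..255 using table lookups only."""
    h = 0
    for byte in chunk.encode("utf-8"):
        h = TABLE[h ^ byte]
    return h

def digest_vector(text, delta=DELTA):
    """Count how many n-grams of `text` fall into each hash bin."""
    vec = [0] * BINS
    # Assumption: only full-length n-grams are hashed.
    for i in range(len(text) - delta + 1):
        vec[pearson_hash(text[i:i + delta])] += 1
    return vec

needle_vector = digest_vector("the quick brown fox")
```

The table lookup replaces any modulus operation, which is the speed argument made above.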

Calculate the haystack digest for a window as

$$ \begin{array}{lcl} \mathbf{foreach}~\mbox{char}~i~\mbox{in}~\mathbf{a} \left [ k : k+\gamma \right ] ~\mathbf{do} \\ \quad j \leftarrow \operatorname{hash} \left ( \mathbf{a} \left [ i : i+\delta \right ] \right ) \\ \quad \hat{a}_{j} \leftarrow \hat{a}_{j} + 1 \end{array} $$

For each window, compare $$\mathbf{\hat{a}}$$ and $$\mathbf{\hat{b}}$$ with some distance metric and sum the differences. The overall difference is a measure of how well the needle matches a particular location in the haystack.

$$ \begin{array}{lcl} \quad \mathbf{foreach}~\mbox{element}~j~\mbox{in}~\mathbf{\hat{b}}~\mathbf{do} \\ \qquad z_{k} \leftarrow z_{k} + \operatorname{dist} \left ( \hat{a}_{j}, \hat{b}_{j} \right ) \end{array} $$

Each of the elements in $$\mathbf{z}$$ will now hold the accumulated distance for a specific offset $$k$$ into the text $$\mathbf{a}$$.
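The windowed comparison can be sketched as below. Several choices here are assumptions rather than part of the description: CRC32 folded into `BINS` buckets stands in for the hash function (it is not Pearson hashing), the absolute difference is used as the distance, and the window length $$\gamma$$ is set to the needle length. The digest is recomputed from scratch for every offset, so this naive form is not yet linear.

```python
import zlib

BINS = 64    # assumed digest length
DELTA = 3    # assumed n-gram length

def bucket(chunk):
    # Stand-in for hash(): CRC32 folded into BINS buckets (assumption).
    return zlib.crc32(chunk.encode("utf-8")) % BINS

def digest_vector(text, delta=DELTA):
    """Count how many full-length n-grams of `text` fall into each bin."""
    vec = [0] * BINS
    for i in range(len(text) - delta + 1):
        vec[bucket(text[i:i + delta])] += 1
    return vec

def window_distances(haystack, needle, delta=DELTA):
    """z[k] = sum over bins j of |a_hat[j] - b_hat[j]| for the window at k."""
    b_hat = digest_vector(needle, delta)
    gamma = len(needle)  # assumed window length
    z = []
    for k in range(len(haystack) - gamma + 1):
        a_hat = digest_vector(haystack[k:k + gamma], delta)
        z.append(sum(abs(a - b) for a, b in zip(a_hat, b_hat)))
    return z

haystack = "some filler text, the quick brown fox, more filler"
needle = "the quick brown fox"
z = window_distances(haystack, needle)
```

With this distance a lower value means a better match, and the window that holds an unmodified copy of the needle has distance zero.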

Adjust the distance metric
There are many kernels for doing correlation. As the digest should be a binary string, the kernel could simply sum the elements of the haystack vector $$\mathbf{\hat{a}}$$, inverting the sign where the needle digest $$\mathbf{\hat{b}}$$ has a zero.

$$ \begin{array}{lcl} \quad \mathbf{foreach}~\mbox{element}~j~\mbox{in}~\mathbf{\hat{b}}~\mathbf{do} \\ \qquad z_{k} \leftarrow z_{k} + \begin{cases} \hat{a}_{j}, & \mbox{if}~\operatorname{trunc} \left ( \hat{b}_{j} \right ) == 1 \\ -\hat{a}_{j}, & \mbox{if}~\operatorname{trunc} \left ( \hat{b}_{j} \right ) == 0 \end{cases} \end{array} $$

This can be calculated much more efficiently as a summation over two masked vectors, although even more efficient solutions exist.
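A small sketch of the equivalence between the case expression and the two masked sums follows. The definition of `trunc` used here (a bin counts as set when at least one n-gram landed in it) is an assumption, since the truncation function is not specified above.

```python
def trunc(count):
    # Assumed truncation: a bin is "set" when at least one n-gram hit it.
    return 1 if count > 0 else 0

def signed_score(a_hat, b_hat):
    """Direct form: +a_j where trunc(b_j) == 1, otherwise -a_j."""
    return sum(a if trunc(b) else -a for a, b in zip(a_hat, b_hat))

def masked_score(a_hat, b_hat):
    """Same value as (sum over set bins) - (sum over unset bins)."""
    mask = [trunc(b) for b in b_hat]
    inside = sum(a * m for a, m in zip(a_hat, mask))
    outside = sum(a * (1 - m) for a, m in zip(a_hat, mask))
    return inside - outside

a_hat = [2, 0, 1, 3]
b_hat = [1, 0, 0, 2]
print(signed_score(a_hat, b_hat), masked_score(a_hat, b_hat))  # prints "4 4"
```

The mask depends only on the needle, so it can be computed once and reused for every window.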

Avoid repeating calculations
During the convolution many hashing operations are repeated again and again, which creates unnecessary load. Instead the hashed values can be updated in place. To do this there is a leading action and a trailing action, both of which do in-place updates according to the needle digest.

$$ \begin{array}{lcl} \mathbf{foreach}~\mbox{char}~i~\mbox{in}~\mathbf{a}~\mathbf{do} \\ \quad j \leftarrow \operatorname{hash} \left ( \mathbf{a} \left [ i : i+\delta \right ] \right ) \\ \quad \hat{z}_{j} \leftarrow \hat{z}_{j} + \begin{cases} +1, & \mbox{if}~\operatorname{trunc} \left ( \hat{b}_{j} \right ) == 1 \\ -1, & \mbox{if}~\operatorname{trunc} \left ( \hat{b}_{j} \right ) == 0 \end{cases} \\ \quad j \leftarrow \operatorname{hash} \left ( \mathbf{a} \left [ i+l : i+l+\delta \right ] \right ) \\ \quad \hat{z}_{j} \leftarrow \hat{z}_{j} - \begin{cases} +1, & \mbox{if}~\operatorname{trunc} \left ( \hat{b}_{j} \right ) == 1 \\ -1, & \mbox{if}~\operatorname{trunc} \left ( \hat{b}_{j} \right ) == 0 \end{cases} \\ z_{k} \leftarrow {\sum}_{i} {\hat{z}_{i}} \end{array} $$

This assumes the $$\mathbf{\hat{b}}$$ digest is precomputed.

The result is a vector with a correlation coefficient for each code point in the text. To be a proper correlation coefficient it should be normalized, but for this kind of use the algorithm can be sped up by dropping the normalization.
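The in-place update can be sketched as a rolling sum. This is a sketch under assumptions, not the exact procedure in the pseudocode: CRC32 folded into `BINS` buckets stands in for the hash, `z[k]` is indexed by the window start, the window length equals the needle length, and each n-gram is hashed once into a cached sign instead of being hashed in both the leading and the trailing action.

```python
import zlib

BINS = 64    # assumed digest length
DELTA = 3    # assumed n-gram length

def bucket(chunk):
    # Stand-in for hash(): CRC32 folded into BINS buckets (assumption).
    return zlib.crc32(chunk.encode("utf-8")) % BINS

def needle_signs(needle, delta=DELTA):
    """+1 for bins hit by some needle n-gram, -1 for every other bin."""
    signs = [-1] * BINS
    for i in range(len(needle) - delta + 1):
        signs[bucket(needle[i:i + delta])] = 1
    return signs

def rolling_scores(haystack, needle, delta=DELTA):
    signs = needle_signs(needle, delta)
    per_window = len(needle) - delta + 1      # n-grams inside one window
    # Sign contributed by the n-gram starting at each haystack position.
    contrib = [signs[bucket(haystack[i:i + delta])]
               for i in range(len(haystack) - delta + 1)]
    score = sum(contrib[:per_window])         # first window, computed once
    z = [score]
    for k in range(1, len(contrib) - per_window + 1):
        score += contrib[k + per_window - 1]  # leading action: n-gram enters
        score -= contrib[k - 1]               # trailing action: n-gram leaves
        z.append(score)
    return z

haystack = "some filler text, the quick brown fox, more filler"
needle = "the quick brown fox"
z = rolling_scores(haystack, needle)
```

Each step costs one addition and one subtraction regardless of the needle length, and the unnormalized score at an unmodified copy of the needle equals the number of n-grams in the needle.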

Realign sequences
Even if the approximate starting and ending spots for the needle can be found, they are still only approximations. What we do know is that the convolution will increase as soon as the needle starts to enter the target range. This means the increase can be detected, and after a short ramp-up the starting point can be detected. Likewise, the point where the decrease ends can be detected, and this is the ending point.

To properly detect the increase and decrease the overall noise level must be detected. Ideally the noise level outside of the target fragment is -1, and the ideal signal level inside the target fragment is equal to the length of the needle. Thus, we can expect a pretty high signal-to-noise ratio for sequences of some length. A simple method to find the most likely target is to find the maximum correlation coefficient, and then scan forward for a distance up to the length of the needle, or perhaps even twice that. Find the slope at a level about one half of the maximum value, and estimate where it should pass zero. Find the word at that text location, and find the start of that particular word. That should be a good estimate for the beginning of the needle in the haystack. Repeat for a backward scan, again finding the word, but this time find the end of the word.
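The slope extrapolation and the snapping to word boundaries might look like the sketch below. The helper names `ramp_zero`, `word_start` and `word_end` are hypothetical, the slope is estimated with a simple central difference, words are assumed to be separated by single spaces, and the noise floor is taken as zero. How the two extrapolated positions map onto the start and end of the needle depends on how the scores are indexed relative to the window, which the sketch leaves open.

```python
def ramp_zero(z, peak, step):
    """Walk from `peak` in direction `step` (+1 or -1) down to the
    half-maximum level, then extend the local slope linearly to zero."""
    half = z[peak] / 2
    i = peak
    while 0 < i < len(z) - 1 and z[i] > half:
        i += step
    lo, hi = max(i - 1, 0), min(i + 1, len(z) - 1)
    if hi == lo:
        return i                       # single sample: nothing to extrapolate
    slope = (z[hi] - z[lo]) / (hi - lo)
    if slope == 0:
        return i                       # flat segment: no extrapolation possible
    pos = round(i - z[i] / slope)
    return min(max(pos, 0), len(z) - 1)

def word_start(text, pos):
    """Index of the first character of the word that contains `pos`."""
    return text.rfind(" ", 0, pos + 1) + 1

def word_end(text, pos):
    """Index just past the last character of the word that contains `pos`."""
    end = text.find(" ", pos)
    return len(text) if end == -1 else end

# Synthetic, noise-free triangular response for illustration only.
scores = [0, 0, 0, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0, 0]
peak = max(range(len(scores)), key=scores.__getitem__)
left = ramp_zero(scores, peak, -1)
right = ramp_zero(scores, peak, +1)
```

On real scores the ramps are noisy, so a slope fitted over several samples around the half-maximum level would likely be more robust than the two-point difference used here.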

Variation
A variation is to use a hash over N words instead of a hash over N characters. This turns it into a kind of BLEU metric instead of a correlation kernel from spread-spectrum communications. The difference is that the short strings are hashed over bins, so it creates a digest-like structure. A standard BLEU metric builds only a single metric, but here a metric is built for each bin, although it will not be normalized.
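The word-based variation only changes what is hashed. A minimal sketch follows, assuming whitespace tokenization, `n = 2` words per shingle, and CRC32 folded into `BINS` buckets as a stand-in hash; none of these choices come from the description above.

```python
import zlib

BINS = 64  # assumed digest length

def word_digest_vector(text, n=2):
    """Count how many n-word shingles of `text` fall into each hash bin."""
    words = text.split()          # assumed tokenization: split on whitespace
    vec = [0] * BINS
    for i in range(len(words) - n + 1):
        shingle = " ".join(words[i:i + n])
        vec[zlib.crc32(shingle.encode("utf-8")) % BINS] += 1
    return vec

vec = word_digest_vector("the quick brown fox")
```

Because the text is tokenized first, changes in spacing no longer affect the digest, while a single changed word disturbs up to `n` shingles.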