User:Jeblad/Detect text segments with a convolving hash

Detecting text segments with a convolving hash is an attempt to describe an algorithm for text synchronization.

It is easier to see what is going on in the simple algorithm than in the more involved approaches.

Simple version
Assume there is a text $$A$$ and a smaller segment $$B$$. Let there be an index $$i$$ into $$B$$, and an offset $$k$$ into $$A$$. Assume there is a hash function $$\operatorname{h}(\cdot)$$ that takes a character (or code point) and folds it into a smaller range. Let the size of this range be the length of the wanted digest. Let the needle digest, computed from the smaller segment, be $$\mathbf{X}$$, and let the haystack digest be $$\mathbf{Y}$$. These digests are vectors, but can be reprocessed into binary numbers. For simplicity, call the unprocessed digests the needle vector and the haystack vector, and call their truncated versions digests.

Calculate the needle digest as

$$\mathbf{foreach}~\mbox{char}~i~\mbox{in}~B~\mathbf{do}~X_{\operatorname{h}(B_{i})} \leftarrow X_{\operatorname{h}(B_{i})} + 1 $$
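
The needle vector step can be sketched in Python as below. This is only a minimal illustration: the name `fold` and its modulo-based folding are assumptions standing in for $$\operatorname{h}(\cdot)$$, which the text does not specify, and `size` is the assumed digest length.

```python
def fold(ch: str, size: int) -> int:
    """Hypothetical stand-in for h(.): fold a code point into range(size)."""
    return ord(ch) % size


def needle_vector(segment: str, size: int) -> list[int]:
    """Count how many characters of the segment B fall into each hash bucket."""
    vec = [0] * size
    for ch in segment:
        vec[fold(ch, size)] += 1
    return vec


# Example: 'a', 'b', 'c' fold to buckets 1, 2, 3 when size is 4.
x = needle_vector("abca", 4)
```

With this folding, `x` counts two characters in bucket 1 and one each in buckets 2 and 3.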

Calculate the haystack digest for a window as

$$\mathbf{foreach}~\mbox{char}~i+k~\mbox{in}~A~\mathbf{do}~Y_{\operatorname{h}(A_{i+k})} \leftarrow Y_{\operatorname{h}(A_{i+k})} + 1 $$
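
The window step might look like the following sketch, where the window starts at offset $$k$$ in $$A$$ and has the same length as $$B$$. The function name `window_vector`, the parameter `width`, and the modulo folding used in place of $$\operatorname{h}(\cdot)$$ are all assumptions made for illustration.

```python
def window_vector(text: str, k: int, width: int, size: int) -> list[int]:
    """Count hash buckets for the window text[k : k + width] of the text A."""
    vec = [0] * size
    for i in range(width):
        # ord(...) % size is an assumed stand-in for the hash function h(.)
        vec[ord(text[k + i]) % size] += 1
    return vec


# Example: the window of width 4 at offset 2 covers "abca".
y = window_vector("xxabca", 2, 4, 4)
```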

Compare $$\mathbf{X}$$ and $$\mathbf{Y}$$ with some distance metric and sum the distances

$$\mathbf{foreach}~\mbox{element}~j~\mbox{in}~\mathbf{X}~\mathbf{do}~z_{k} \leftarrow z_{k} + \operatorname{d} \left ( X_{j} - Y_{j} \right ) $$

Each of the elements in $$\mathbf{z}$$ will now hold the accumulated distance for a specific offset $$k$$ into the text $$A$$.
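
Putting the three steps together, a naive scan over all offsets could be sketched as follows. The absolute difference is used as the distance $$\operatorname{d}$$ and a modulo fold is used for $$\operatorname{h}(\cdot)$$; both are assumptions, since the text leaves them open, and the helper names are hypothetical. The sketch recomputes every window from scratch, which is simple but not efficient.

```python
def bucket_counts(chars: str, size: int) -> list[int]:
    """Count characters per hash bucket; ord(ch) % size is an assumed h(.)."""
    vec = [0] * size
    for ch in chars:
        vec[ord(ch) % size] += 1
    return vec


def offset_distances(text: str, segment: str, size: int) -> list[int]:
    """Return z, where z[k] is the summed distance between X and Y at offset k."""
    x = bucket_counts(segment, size)           # needle vector X
    width = len(segment)
    z = []
    for k in range(len(text) - width + 1):
        y = bucket_counts(text[k:k + width], size)  # haystack vector Y for this window
        # Assumed distance d: absolute difference, accumulated over all elements j.
        z.append(sum(abs(xj - yj) for xj, yj in zip(x, y)))
    return z


z = offset_distances("xxabcaxx", "abca", 4)
best = min(range(len(z)), key=z.__getitem__)  # offset with the smallest distance
```

Here `z` is `[4, 2, 0, 2, 4]` and `best` is 2, the offset where the segment occurs. A distance of zero only means the window and the segment have the same bucket counts, so a candidate offset would still need a direct comparison to rule out reordered characters or hash collisions.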