User:TJones (WMF)/Notes/HebMorph Analyzer Analysis

May 2017 — See TJones_(WMF)/Notes for other projects. See also T162741. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

HebMorph
HebMorph seems to be the only game in town when it comes to Hebrew analysis (see T162739), so it's the one I've been looking at.

HebMorph deals with the complexities of Hebrew (lack of vowels which leads to lots of ambiguity, occasional use of vowel diacritics for pronunciation which leads to lack of matching, lots of prefixes and suffixes which add to complexity and ambiguity, spelling inconsistencies, etc., etc.), but it also has some unexpected issues that are potential areas for concern or confusion.

To provide roughly comparable examples in English, consider bt, which—following simple rules of dropping [aeiou] (and y when it acts like a vowel)—could possibly be any of bat, bet, bit, but, bot, boat, beat, bait, beet, boot, about, bite, abet, abut, obit, byte, bute, abate, or, my favorite, ubiety. (Now you know why Hebrew roots are mostly three consonants and not two—so fewer words overlap!)
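The collision effect from dropped vowels is easy to sketch in a few lines of Python (a toy illustration of the ambiguity, not anything HebMorph actually does): strip [aeiouy] from each word and group words by the resulting consonant skeleton.

```python
import re
from collections import defaultdict

def skeleton(word):
    """Drop vowels (and y) to get a consonant skeleton, e.g. 'bat' -> 'bt'."""
    return re.sub(r'[aeiouy]', '', word.lower())

# toy word list: the b-t words all collide, the c-t words collide separately
words = ['bat', 'bet', 'bit', 'but', 'boat', 'beat', 'bait', 'byte', 'cat', 'cut']
groups = defaultdict(list)
for w in words:
    groups[skeleton(w)].append(w)

print(groups['bt'])  # every b-vowel-t word lands on the same skeleton
print(groups['ct'])
```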

The situation with prefixes means that some prepositions, conjunctions, and determiners attach to the word that follows them, so andiron could be andiron or and iron. Similarly, totherow could be the rare surname Totherow, or to Therow (Therow is also a rare surname), or plain old boring to the row.

None of these examples are great because English doesn't really have this situation, but hopefully you get the gist.

Processing Speed
The first thing that I noticed is that HebMorph is kind of slow. I ran a 5K article sample with the default prod settings as a baseline, and it took 1m20s. With the default HebMorph config, 3m52s—so about 3x slower. This was on my laptop in vagrant, so the problem may not be relevant to production, where there is a lot more memory available and beefier processors.

Passing text to be analyzed from outside Elastic via the API using curl is also a somewhat unnatural task (the curl overhead used to dwarf all other time; I recently got a general 30x speedup by minimizing curl calls and passing a lot more text per call). It's a potential concern to keep an eye on.
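The batching fix was simply to send much more text per call instead of one small snippet at a time. A hypothetical sketch of the chunking logic (the names and the 50K-character limit are my own, not from the actual test harness):

```python
def chunk_texts(texts, max_chars=50_000):
    """Group small texts into large batches so each API call carries as much
    text as possible, amortizing the per-call curl overhead."""
    batch, size = [], 0
    for t in texts:
        if batch and size + len(t) > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(t)
        size += len(t)
    if batch:
        yield batch

# 300 hundred-character texts fit in one call instead of 300 separate calls
batches = list(chunk_texts(['x' * 100] * 300))
print(len(batches))  # 1
```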

I ran some re-indexing tests on RelForge, which is certainly beefier than my laptop. Re-indexing ~205K articles with the production settings took 12m27s. Re-indexing with the default Hebrew analyzer took 23m22s. The unpacked analyzer (see below) took 21m17s.

So, roughly 2x to re-index on RelForge is significant, but Hebrew Wikipedia is only at ~205K articles. Again, something to keep an eye on, but not a disaster. Moore's law will probably keep us ahead of likely growth in Hebrew wikis.

"Previous Bugs"
The Analysis Config Builder code claims we didn't use HebMorph in the past because of bugs, but there are no specifics or pointers to more info, so I'm going to ignore that for now.

If anyone knows anything about previous bugs that might still be relevant, let me know!

Multiple Analyzers
The HebMorph plugin comes with several analyzers (hebrew, hebrew_query, hebrew_query_light, and hebrew_exact), plus individual filters and a couple of char filters that allow you to unpack the analyzer and customize it. Oddly, there's no difference in output between hebrew and hebrew_query (on 5K articles), despite a minor config diff between them in the source code.

Multiple Analyzed Terms
An interesting feature of Hebrew is that without vowels, and with the various affixes, words without niqqud can be very ambiguous. Apparently there is also some spelling variation and/or common mistakes that the plugin tries to account for. As a result, there are a lot of words that generate multiple analyzed tokens—much more so than other language analyzers.

For comparison, the Chinese, English, French, Polish, Russian, and Swedish analyzers each generated only one analyzed token per word across samples of 100K+ tokens. Ukrainian had up to four, but only for 13 out of almost 200K tokens, and 97.6% had only 1.

On a 5K sample of Hebrew with 200K+ tokens, words generated up to 14 tokens each! The mean was closer to 2 (1.38 to 2.26, depending on the analyzer, other than hebrew_exact, which always returns one).
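The expansion stats above are straightforward to compute from analyzer output; here's a sketch, assuming a mapping from each input token to its list of analyzed tokens (the data structure is my own, not the actual measurement script):

```python
from collections import Counter

def expansion_stats(analyses):
    """analyses: dict mapping each input token to its analyzed tokens.
    Returns (mean analyzed tokens per input, histogram of counts)."""
    counts = [len(v) for v in analyses.values()]
    histogram = Counter(counts)
    mean = sum(counts) / len(counts)
    return mean, histogram

# toy data: most words get two analyses, one gets one, one gets three
mean, hist = expansion_stats({
    'a': ['a1', 'a2'],
    'b': ['b1'],
    'c': ['c1', 'c2', 'c3'],
    'd': ['d1', 'd2'],
})
print(mean, hist[2])  # 2.0 2
```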

$
An idiosyncratic feature of HebMorph is that exact forms are analyzed with a $ at the end. This applies to both Hebrew and non-Hebrew words. The hebrew_exact analyzer returns only this $-suffixed form.

The hebrew/hebrew_query analyzers return the $-suffixed form, and can give two terms that differ only by the $.

The hebrew_query_light analyzer drops the $ from "Hebrew words" (which includes actual Hebrew words, but also seems to include any token that starts with a Hebrew letter). However, it doesn't drop the $ from non-Hebrew words.

Unpacking the analyzer gives us the option to drop the final $. For Hebrew words, the $-form is deduped, but for non-Hebrew words, two identical copies are returned. For words with appropriate multiple forms (André → andre / andré), two of each are returned. When unpacking the Hebrew analyzer, I'm using the niqqud filter, hebrew_lemmatizer filter, and the icu_normalizer filter (which also lowercases Latin text).
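For concreteness, the unpacked analyzer settings look roughly like this (a sketch based on the filter names mentioned above; the exact tokenizer and filter names come from the HebMorph plugin and may differ slightly by version):

```json
{
  "analysis": {
    "analyzer": {
      "hebrew_unpacked": {
        "type": "custom",
        "tokenizer": "hebrew",
        "filter": [
          "niqqud",
          "hebrew_lemmatizer",
          "icu_normalizer"
        ]
      }
    }
  }
}
```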

Example Results by Analyzer
Below are some examples of different tokens (as defined by the Hebrew tokenizer) and their analyzed forms.

$ are supposed to be at the ends of words, but difficulties with your browser, difficulties with my editor, and the phase of the moon may affect the way it displays below.

There's a very interesting token, ו"Lonely, in the table. ו is a prefix meaning and, which can be attached to the beginning of nouns. In this case, it was attached to the Paul Anka song title, "Lonely Boy". Hebrew abbreviations use special punctuation marks that look sort of like single and double quotes, so—as with every other character that looks like a single or double quote—people sometimes use single and double quote characters instead, which is why the tokenizer allowed the token with a double quote in it. The Hebrew analyzers all analyze it correctly, and the non-Hebrew analyzers re-tokenize it into two parts, splitting on the double quote.

Production vs HebMorph hebrew

 * production tokens: 2,514,279
 * HebMorph tokens: 6,121,267

That's a lot more tokens—but it seems unavoidable with the ambiguity of Hebrew, plus all those $-final tokens.


 * 1.2% of input tokens have 1 analyzed token
 * 78.8% of input tokens have 2 analyzed tokens
 * 14.6% of input tokens have 3 analyzed tokens
 * 3.8% of input tokens have 4 analyzed tokens
 * 1.1% of input tokens have 5 analyzed tokens

New Collision Stats
 * types: 193,714 (86.803%) [post-analysis types]
 * tokens: 2,355,995 (93.705%)

So the vast majority of tokens, being Hebrew words that tend to be ambiguous, are indexed with some other word now.

A small number of splits occurred as well. All are due to bi-directional Unicode characters. The production config leaves the bidi characters as part of the token, but removes them for the analyzed form. HebMorph just ignores them from the beginning.

Tokenization is different, with words split on periods and commas (including within numbers), underscores, colons, and other Unicode characters, particularly combining characters like stress marks in Cyrillic, IPA diacritics, and Devanagari and Thai combining characters. There's also some Unicode normalization (e.g., ɾ → r):
 * 0.01M → 0 / 01M
 * 1.6ºC → 1 / 6ºc
 * 1.9891x10 → 1 / 9891x10
 * foo_bar → foo / bar
 * foo:bar → foo / bar
 * foo.bar → foo / bar
 * 1,200 → 1 / 200
 * N·s → N / s
 * Григо́рий → григо / рий
 * कमान → कम / न
 * א.ה → א / ה
 * mo̞ˈɾẽnɐ → mo / rena
 * ˈt͡ʃipriˈan → t / ʃipriˈan
 * เกาะช้าง → เกาะช / าง

In general, these are uncommon, not Hebrew, and should still work reasonably well with the plain field.

Of course, the Hebrew analyzer emits a lot of tokens with an additional final $, as discussed above.

HebMorph is also smart about Hebrew affixes on non-Hebrew words, as discussed above.

I'm leaving the bulk of the changes—the actual Hebrew analysis—for later, after we compare the different Hebrew options.

HebMorph hebrew vs HebMorph hebrew_query_light

 * hebrew tokens: 6,121,267
 * hebrew_query_light tokens: 3,794,234

That's a lot fewer tokens, with most of the $-final tokens gone (though they are still there on Latin-character words).


 * 69.8% of input tokens have 1 analyzed token
 * 24.8% of input tokens have 2 analyzed tokens
 * 3.8% of input tokens have 3 analyzed tokens
 * 1.1% of input tokens have 4 analyzed tokens
 * 0.3% of input tokens have 5 analyzed tokens

Since I don't think I agree with the rationale behind the $-final terms (making exact matches possible—a task for which we have the plain field), this is great, since the number of tokens has gone down by 38.0%.
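The 38.0% figure checks out against the token counts above:

```python
hebrew_tokens = 6_121_267        # HebMorph hebrew
query_light_tokens = 3_794_234   # HebMorph hebrew_query_light

# relative reduction in indexed tokens
reduction = (hebrew_tokens - query_light_tokens) / hebrew_tokens
print(f'{reduction:.1%}')  # 38.0%
```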

HebMorph hebrew_query_light vs Unpacked HebMorph w/ lowercase
I was hoping to get rid of all the $-final tokens, so I unpacked the analyzer, and skipped the filter that adds the $-final form. I still used the niqqud and hebrew_lemmatizer filters, and the lowercase/icu_normalizer filter. (The lowercase filter is replaced with the icu_normalizer filter if it's available. It does a bit more that's generally useful. Without it, Foo and foo would index separately, which seems silly.)
 * hebrew_query_light tokens: 3,794,234
 * unpacked/lowercase tokens: 5,063,634

Hmm. That's a lot more tokens, which I did not expect.

There were no differences in pre-analysis tokens, so the tokenization seems to be the same.

Analyzed token differences include:
 * $-final tokens are gone.
 * Greek final ς is converted to σ, as it darn well should be!
 * Latin accented characters are preserved
 * A few Unicode characters are converted to plain characters (dʲ → dj)
 * Some IPA is preserved
 * Raised o (º) is used as a degree sign, and gets converted to o, so 45º → 45o.

But the bulk of the differences (129,723 types, 1,250,642 tokens) are Hebrew tokens. These seem to be the exact tokens for Hebrew words, without the final $ added.

There were very few new collisions or splits, mostly splits related to the lack of ASCII folding, which is built into HebMorph.

Unpacked HebMorph: lowercase vs lowercase/folding
Both have the same number of tokens (5,063,634) and all changes are the expected collisions from ICU folding, with no effect on Hebrew words.

I think folding foreign languages is usually good, so we should keep the folding enabled.

Unpacked HebMorph: lowercase/folding vs lowercase/folding/preserve
Another option is to enable ICU folding while indexing both the original form of the word and the folded form. In this case it doesn't have much impact on Hebrew, because HebMorph has already done the Hebrew-related folding, but it affects some Latin, Greek, Cyrillic, and CJK characters. There were no new collisions or splits, just additional tokens indexed.
 * Lowercase/folding tokens: 5,063,634
 * Lowercase/folding/preserve tokens: 5,065,585

As expected, a few tokens popped back up, mostly with accented characters, a few with variants that are important in the source, but less so to a non-speaker, and some phonetic characters.

The plain field, which usually does exact matching, has ICU folding with "preserve original" enabled for Hebrew, which is helpful because it removes niqqud (the vowel diacritics). I don't think we need to preserve the originals for non-Hebrew words in the text field.

What to do?
After talking with Matanya and Stas about exact matching and stemming, and with David about the final $ and search internals, I don't think we need it, and David has also convinced me that it could mess up Did You Mean suggestions and regular expression matching.

I think it's best to go with the unpacked version, since it gives us the most flexibility and avoids the final $. I think we should enable folding, too, but we don't need to preserve originals in the text field.

One odd side effect of using HebMorph is that searching for Hebrew words with a final $ will prevent stemming in the analyzer. There doesn't seem to be any way to turn this off. I don't think it comes up much, and we shouldn't tout it as a feature, since it only works for Hebrew and could go away in the future. So, it's an acceptable quirk of the analyzer.

Of course, the next step is to test all this with native speakers looking at HebMorph output and using it on real data in labs.

HebMorph Output Examples
Below are some examples of output from HebMorph, for native speaker review. These don't have to all be perfect, but they shouldn't all be horrible, either.

Groupings show words that HebMorph has indexed to a common stem. In English, this would be group and groups being indexed together, so that searching for one will find all the others.

Analyzed Terms show the (often multiple) terms that are indexed for a given search term. These are generally the possible root forms of a term. In English, this would be does being analyzed as both a form of do and a form of doe (does is the plural of doe, which is a deer, a female deer).

Random Groupings
Here are 50 randomly selected groupings. The analyzed term they share is bolded. Each term is shown with its frequency, so "[1 foo][2 bar]" means that foo occurred once in the sample, and bar occurred twice. The relative frequency is important, since lower-frequency errors matter less.
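The groupings were generated by inverting the analyzer output: map each analyzed (stem) form back to the original words that produce it, with frequencies. A sketch of the bracketed formatting (my own reimplementation, not the actual report code):

```python
from collections import Counter, defaultdict

def format_groupings(word_freqs, stem_of):
    """word_freqs: Counter of surface words; stem_of: word -> analyzed stem.
    Returns {stem: '[freq word][freq word]...'} with members sorted by word."""
    groups = defaultdict(list)
    for word, freq in word_freqs.items():
        groups[stem_of(word)].append((word, freq))
    return {stem: ''.join(f'[{f} {w}]' for w, f in sorted(members))
            for stem, members in groups.items()}

# toy English example: everything stems to 'group'
freqs = Counter({'group': 3, 'groups': 1, 'grouped': 2})
out = format_groupings(freqs, lambda w: 'group')
print(out['group'])  # [3 group][2 grouped][1 groups]
```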


 * פרמצבטי: [2 פרמצבטיות][1 פרמצבטיים]
 * הכרתי: [1 הַכְּרֵתִי][10 הכרתי][3 הכרתיות][1 הכרתיים][1 הכרתית][1 שהכרתים]
 * שעל: [1 ושעל][1 ושעלו][1 ושעלי][1 ושעליהם][11 כשעל][12 כשעלה][3 כשעלו][1 כשעליה][1 כשעליו][35 משעל][2 משעלה][2 משעלי][1 שֶׁעָלוּ][237 שעל][51 שעלה][39 שעלו][10 שעלי][41 שעליה][30 שעליהם][12 שעליהן][72 שעליו][1 שעליך][4 שעלינו]
 * קינא: [2 המקנא][1 וקינאו][1 יקנא][2 לקנא][2 מקנא][1 מקנאות][2 מקנאת][1 קִנֵּא][3 קינא][2 קינאה][2 קנא][10 קנאה][13 קנאי][1 קנאנה][1 שמקנא]
 * אג'נדה: [6 אג'נדה][3 אג'נדת][1 באג'נדה][5 האג'נדה]
 * גנטריקס: [1 גֶנֶטְריקְס][1 גנטריקס]
 * מונופול: [1 במונופול][2 במונופולים][5 המונופול][3 המונופולים][1 ובמונופולים][2 והמונופולים][1 ומונופולים][1 למונופול][6 מונופול][6 מונופולים]
 * גופיף: [1 וגופיף][1 וגופיפי][1 מהגופיף]
 * קוסטוב: [1 קוֹסְטוֹב][40 קוסטוב]
 * שיפוד: [2 השיפוד][1 שיפודם]
 * מסוים: [14 המסוים][1 המסוימים][1 המסוימת][323 מסוים][142 מסוימות][288 מסוימים][256 מסוימת][1 מסיוויים]
 * בלום: [5 בילום][37 בלום][3 בלומה][1 בלמית][2 הבילום][1 ובלום][1 ולבלום][40 לבלום]
 * מחבלת: [1 במחבלת][4 המחבלת][1 והמחבלת][1 מחבלות][3 מחבלת]
 * הנצלה: [4 ההנצלה][3 הנצלה][3 להנצלה][1 להנצלת]
 * נזילה: [6 הנזילות][1 והנזילות][1 ונזילות][1 לנזילות][1 נזילה][9 נזילות]
 * כדורגלנית: [2 הכדורגלניות][2 הכדורגלנית][15 כדורגלנית][8 לכדורגלנית]
 * אתרוג: [8 אתרוג][1 אתרוגי][2 אתרוגים][1 האתרוג][1 כאתרוג]
 * תוקן: [3 המתוקן][1 ומתוקנות][4 ומתוקנת][1 ושתוקנו][1 ותוקן][1 יתוקן][1 יתוקנו][5 מתוקן][1 מתוקנות][12 מתוקנת][2 שיתוקנו][4 שתוקן][14 תוקן][6 תוקנה][7 תוקנו][1 תתוקן]
 * השביע: [1 השביע][1 השביעה][167 השביעי][16 השבע][4 השבעת][1 השיבעים][2 והשביעי][1 והשבעת][1 ומשביע][2 להשביע][1 מלהשביע][4 משביע][1 משביעה][1 משביעי][1 משביעת]
 * ישראלי: [618 ישראלי][1 כ"ישראלי][1 ל"ישראלי]
 * ישוב: [2 בישוב][1 בישובים][1 הַיֹּשְׁבִים][62 הישוב][1 ובישובי][1 ובישובים][1 והישוב][1 וישוב][3 וישובה][1 וישובי][2 יָשׁוּב][1 יֹשְׁבִים][41 ישוב][1 ישובה][9 ישובי][2 ישובים][1 כישוב][1 לישוב][1 לישובי][1 מהישובים][1 מישובי][1 שבישוב]
 * מלוכן: [2 המלוכני][4 המלוכנים][1 המלוכנית][2 מלוכני][2 מלוכניות][2 מלוכנים][1 מלוכנית][1 ממלוכנים]
 * נעלה: [5 בנעלי][1 בנעלים][3 ההנעלה][3 הנעלה][1 וְנַעֲלֶה][2 והנעלה][4 ונעלי][1 ונעלים][4 לנעלי][10 נעלה][4 נעלות][23 נעלי][5 נעלים][1 נעלית][1 שנעלה]
 * אירלנד: [53 אירי][2 איריות][5 איריים][10 אירים][22 אירית][213 אירלנד][1 ארלנד][1 באירים][5 באירית][44 באירלנד][4 בארים][1 בארית][3 בארת][46 האירי][14 האיריות][4 האיריים][11 האירים][26 האירית][10 הבארים][26 המאירי][1 המאירים][2 המאירית][1 המשאירים][6 ואירי][1 ואירים][1 ואירית][23 ואירלנד][4 ובאירלנד][1 ומאירי][2 ומשאירים][1 לאירי][1 לאירים][12 לאירלנד][1 לארית][5 מאירי][1 מאירים][6 מאירלנד][8 משאירים][1 שאירלנד][9 שבאירלנד][1 שהאירי]
 * אסור: [1 איסורין][71 אסור][30 אסורה][5 אסורות][1 אסוריו][32 אסורים][13 האסור][8 האסורה][10 האסורות][12 האסורים][10 ואסור][1 ואסורה][1 ואסורים][1 ולאסור][1 ושאסור][1 כאסורות][28 לאסור][1 מאסוריו][16 שאסור][3 שאסורות][3 שאסורים]
 * ציפי: [2 וציפי][30 ציפי]
 * קשירה: [8 בקשירת][5 הקשירה][1 וקשירה][1 וקשירות][6 וקשירת][4 לקשירת][6 קשירה][3 קשירות][7 קשירת][1 קשירתו]
 * תקנון: [6 בתקנון][3 בתקנונים][6 התקנון][4 התקנונים][1 והתקנון][5 לתקנון][2 לתקנונים][2 מהתקנון][1 מהתקנונים][19 תקנון]
 * עיר: [2 ב"עיר][1 בְּעִיר][1 בְּעָרֵיהֶם][1339 בעיר][15 בעירה][28 בעירו][2 בעירם][2 בעירנו][27 בערי][1 בעריה][1 בעריכים][104 בערים][3 בערכין][1 בערניו][2 הָעִיר][16 הבעירה][2 הבעירו][5 המעריך][12 המשערים][2614 העיר][5 העירה][3 העירו][1 העיריה][1 העריה][48 העריך][2 העריכים][201 הערים][1 ו"ערי][1 ו"שערי][1 וְעָרֵיכֶם][9 ובעיר][1 ובערי][8 ובערים][58 והעיר][2 והעריך][7 והערים][1 וכעיר][2 ולעיר][1 ולעיריה][2 ולערים][1 ומהעיר][1 ומעיר][1 ומעריך][1 ומערים][2 ומשערים][12 ועיר][3 וערי][12 וערים][1 ושערי][1 ושערייה][1 ושעריים][3 ושערים][3 כ"עיר][1 כ"שעיר][40 כעיר][2 כערי][2 כשהעיר][3 ל"עיר][424 לעיר][3 לעירו][7 לערי][36 לערים][2 מ"עיר][1 מ"ערי][1 מבעיר][1 מבעירה][1 מבערים][118 מהעיר][15 מהערים][33 מעיר][1 מעירו][12 מערי][1 מעריה][14 מעריך][5 מערים][4 משערי][1 משעריה][1 משעריו][24 משערים][2 עִיר][1 עִירֶךָ][1 עָרִים][2 עָרֵי][1 עָרָיו][431 עיר][2 עירה][14 עירו][1 עירך][3 עירם][1 עירן][2 עירנו][93 ערי][1 עריה][2 עריהם][1 עריו][2 עריך][2 עריכים][131 ערים][2 ערכין][1 ש"שערי][45 שבעיר][1 שבעירו][2 שבערי][2 שבערים][29 שהעיר][4 שהערים][1 שמשערים][2 שעיר][2 שעירה][8 שעירי][133 שערי][5 שעריה][1 שעריהם][16 שעריו][7 שעריים][422 שערים]
 * וריד: [1 בוורידים][1 בורידי][1 בורידים][4 הוורידים][11 הוריד][3 הורידה][8 הורידו][1 הורידיה][1 המוריד][4 והוריד][2 והורידו][1 וורידי][3 ומוריד][4 וריד][3 ורידי][1 ורידים][2 וריידן][5 כלוריד][3 כלורידים][1 כמורידי][3 לווריד][10 מוריד][4 מורידה][4 מורידים][2 שהוריד][1 שמוריד][1 שמורידים]
 * טלסקופ: [6 בטלסקופ][6 הטלסקופ][1 הטלסקופים][1 ובטלסקופ][1 ובטלסקופים][1 וטלסקופ][17 טלסקופ][4 טלסקופי][6 טלסקופים][1 לטלסקופ][1 לטלסקופים]
 * שאטו: [2 שאטו][1 שאטוֹ]
 * תגובה: [151 בתגובה][7 בתגובות][2 בתגובותיהם][1 בתגוביות][4 בתגובת][1 בתגובתה][1 בתגובתו][28 התגובה][16 התגובות][10 ובתגובה][1 ובתגובות][3 והתגובה][3 והתגובות][1 וכתגובה][2 ולתגובה][7 ותגובה][5 ותגובות][1 ותגובותיהם][1 ותגובותיו][2 ותגובת][1 ותגובתו][38 כתגובה][1 כתגובות][2 כתגובת][9 לתגובה][11 לתגובות][1 לתגובותיו][5 לתגובת][1 לתגובתו][1 לתגובתם][5 מתגובה][1 מתגובות][2 מתגובת][2 מתגובתו][2 שבתגובה][1 שהתגובה][1 שתגובה][1 שתגובת][67 תגובה][45 תגובות][5 תגובותיו][33 תגובת][3 תגובתה][7 תגובתו][2 תגובתם][1 תגובתן]
 * קושיה: [1 הקושיה][6 קושה][1 קושיה][1 קושית]
 * ח\"ן: [1 הח"ן][2 ח"ן]
 * שכלול: [3 השכלולים][1 ושכלול][1 לשכלול][7 שכלול][7 שכלולים][1 ששיכללה]
 * קפץ: [2 הקופץ][2 הקופצים][9 הקופצת][4 הקפצה][1 הקפצת][1 ולקפוץ][2 וקופץ][1 וקופצים][1 וקפצה][3 יקפוץ][11 לקפוץ][7 קופץ][5 קופצים][1 קופצת][12 קפץ][2 קפצה][1 קפצו][1 קפצן][1 שקופצים][2 שקפץ][1 שקפצה][1 תקפוץ][1 תקפצו]
 * לעוס: [1 ללעוס][1 לעוסת]
 * דיכא: [1 בדכאו][4 דיכא][1 דיכאה][6 דכאו][3 המדכא][1 המדכאות][1 המדכאים][3 ודיכא][1 ודיכאה][1 והמדכאת][1 ומדכא][1 ומדכאה][1 ומדכאת][1 ונדכא][29 לדכא][1 לדכאה][1 לדכאו][2 מדכא][1 מדכאו][4 מדכאות][1 מדכאים][1 מדכאת][3 שדיכאה][2 שדיכאו][1 שדכא]
 * כספר: [22 כספר][1 כספרו][2 כספרי][1 כספריה][2 כספרים][2 כספרן]
 * מחוכם: [1 מחוכם][2 מחוכמת]
 * רפורמת: [9 רפורמת][1 ש"רפורמת]
 * מפולת: [1 במפולת][1 ומפולות][1 ומפולת][1 למפולות][1 למפולת][1 מפולות][2 מפולת]
 * פרימוורה: [2 ה"פרימוורה][1 פרימוורה]
 * פסגה: [1 ב"פסגת][1 בפיסגת][2 בפסגה][2 בפסגות][7 בפסגת][2 בפסגתו][9 הפסגה][5 הפסגות][2 והפסגה][1 ופסגות][1 ופסגת][2 כפסגת][6 לפסגה][9 לפסגת][1 לפסגתו][1 מפיסגתו][3 מפסגות][5 מפסגת][6 פסגה][8 פסגות][2 פסגותיו][32 פסגת][1 פסגתו][1 שבפסגתו][1 שפסגתו]
 * ביסם: [2 בסם][1 בסמית][1 בסמך][1 בסמני]
 * טימור: [1 ומטמור][2 טיימור][19 טימור][4 מטמור]
 * גחלילית: [2 גחליליות][1 הגחלילית]
 * העברית: [590 העברית][1 ל"העברית]
 * טובי: [1 טוֹבִּי][89 טובי]

Largest Groupings
These are the 5 largest groupings—and they are pretty big! The analyzed form each group shares is listed under "stem". The number of distinct words is listed under "types" and the total number of words involved is listed under "tokens". The original words and their frequencies are listed under "forms".

Random Analyzed Terms
Here is a sample of 50 random analyzed terms. The bold term is the original word, the second column has the frequency of the term in the corpus, and the last column lists out all the analyzed terms generated by HebMorph.

Largest Number of Analyzed Terms
Here are the forms with the most analyzed terms. Note that they are all fairly rare—all but one only occurring once or twice in the corpus.

Live Demo
There's a live demo of Hebrew Wikipedia with unpacked HebMorph (with ICU folding) in labs.

Note that it contains the index of Hebrew Wikipedia, so it can show results and snippets, but all the links are red and none of the pages are available in labs.