Morphological Dictionary

There are several reasons why you might want to create your own morphological dictionary:

  • There could be entries missing in the dictionary or different spellings that you would like to cover.

  • You might want to encode other kinds of information, such as translations or specific codes.

  • You want to add support for a new language.

With the Wowool SDK you can create a dictionary that is consulted before or after the standard dictionaries, or a completely different dictionary built to fit your own needs.

Let’s create a dictionary that covers unknown words in the Jabberwocky text.

Evaluating wowool output to find unknown tokens

Let’s create a text file called jabberwocky.txt with the following content:

It was brillig, and the slithy toves did gyre and gimble in the wabe. All mimsy were the borogoves, and the mome raths outgrabe.

And now let’s run this file using the raw tool (--tool raw):

wow -p english -f jabberwocky.txt --tool raw

We get something like:

t(0,2) "It" (init-cap, init-token)['it':Pron-Std]
t(3,6) "was" ['be':V-Past-Sg-be]
t(7,14) "brillig" (nf)['bill':V-PrPart, +guess, +gmchar, +gskip]
t(14,15) "," [',':Punct-Comma]
t(16,19) "and" ['and':Conj-Coord]
t(20,23) "the" ['the':Det-Def]
t(24,30) "slithy" (nf)['slithy':Nn-Std]
t(31,36) "toves" (nf)['toe':Nn-Pl, +guess, +gmchar], ['trove':Nn-Pl, +guess, +gskip]
t(37,40) "did" ['do':Aux-Std]
t(41,45) "gyre" (nf)['gyro':Nn-Sg, +guess, +gskip]
t(46,49) "and" ['and':Conj-Coord]
t(50,56) "gimble" (nf)['gimlet':Nn-Sg, +guess, +gmchar, +gmEnd]
t(57,59) "in" ['in':Prep-Std, +Adv-Part]
t(60,63) "the" ['the':Det-Def]
t(64,68) "wabe" (nf)['wabe':Nn-Std]
t(68,69) "." ['.':Punct-Sent]

When a word is not found in the morphological dictionary, the token gets the nf property:

t(7,14) "brillig" (nf)['bill':V-PrPart, +guess, +gmchar, +gskip]

If you have a guesser in your configuration file, the engine will try to find a spelling alternative (here “billing”). This can be annoying when the word is not a typo.
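To collect the unknown words in bulk, the raw output can be scanned for tokens that carry the nf property. The following is a minimal Python sketch, not part of the Wowool SDK: it assumes the output has been captured as text and that every token line has the shape shown in the sample above. The function name unknown_tokens is hypothetical.

```python
import re

# Assumed line shape, taken from the sample output above:
#   t(begin,end) "literal" (prop, prop)[analyses...]
TOKEN_RE = re.compile(r'^t\((\d+),(\d+)\) "(.*?)" (?:\(([^)]*)\))?\[')


def unknown_tokens(raw_output: str) -> list[tuple[int, int, str]]:
    """Return (begin, end, literal) for every token carrying the nf property."""
    found = []
    for line in raw_output.splitlines():
        match = TOKEN_RE.match(line.strip())
        if not match:
            continue  # not a token line
        begin, end, literal, props = match.groups()
        properties = [p.strip() for p in (props or "").split(",")]
        if "nf" in properties:
            found.append((int(begin), int(end), literal))
    return found


sample = '''t(3,6) "was" ['be':V-Past-Sg-be]
t(7,14) "brillig" (nf)['bill':V-PrPart, +guess, +gmchar, +gskip]
t(24,30) "slithy" (nf)['slithy':Nn-Std]'''

print(unknown_tokens(sample))
```

The literals returned this way are the candidates for entries in your own dictionary.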

We can create our own dictionary. Entries use one of the following formats:

  • Simple format type:

    literal:lemma[+Tag]([+Property-Tag])*
    ex: Allison:Allison[+Prop-Std][+giv]
  • Compounds format type:

    literal:#component1[+Tag1]#component2[+Tag2]
    ex: basisschool:#basis[+Nn-Sg]#school[+Nn-Sg]
  • Compounds with a custom stem:

    literal:#([+Property-Tag])*#$stem_cs[+Tag_cs]#component1[+Tag1]#component2[+Tag2]
    ex: pruimentaart:#[+good]#$pruimentaart+Nn-pl#pruim+Nn-Pl#taart+Nn-Pl
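To make the simple format concrete, here is a small Python sketch that splits such an entry into its parts. It is an illustration only, not how the compiler parses entries. It assumes the literal contains no colon or whitespace and that the first bracketed item is the tag, followed by any property tags. The name parse_simple_entry is hypothetical.

```python
import re

# Simple format: literal:lemma[+Tag]([+Property-Tag])*
ENTRY_RE = re.compile(
    r'^(?P<literal>[^:\s]+):(?P<lemma>[^\[\s#]+)(?P<tags>(?:\[\+[^\]\s]+\])+)$'
)


def parse_simple_entry(line: str) -> dict:
    """Split a simple-format entry into literal, lemma, tag and properties."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a simple-format entry: {line!r}")
    tags = re.findall(r'\[\+([^\]]+)\]', match["tags"])
    return {
        "literal": match["literal"],
        "lemma": match["lemma"],
        "tag": tags[0],          # first bracket: the tag
        "properties": tags[1:],  # remaining brackets: property tags
    }


print(parse_simple_entry("Allison:Allison[+Prop-Std][+giv]"))
```

Compound entries, whose right-hand side starts with #, are deliberately rejected by this sketch.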

In lxware, create jabberwocky.dic:

borogove:borogove[+Nn-Sg][+animal]
borogoves:borogove[+Nn-Pl][+animal]
brillig:brillig[+Adj-Std]
gyre:gyre[+V-Pres]
gimble:gimble[+V-Pres]
mimsy:mimsy[+Adj-Std]
mome:mome[+Nn-Sg]
momes:mome[+Nn-Pl]
outgribe:outgribe[+V-Pres]
outgrabe:outgribe[+V-Past]
rath:rath[+Nn-Sg][+animal]
raths:rath[+Nn-Pl][+animal]
slithy:slithy[+Adj-Std]
tove:tove[+Nn-Sg][+animal]
toves:tove[+Nn-Pl][+animal]
wabe:wabe[+Nn-Sg]

Note

No spaces are allowed: an entry such as mitsy wipsy:mitsy wipsy[+Nickname] is invalid.

Now we need to compile it:

afst -c "import jabberwocky.dic; quit" -o jabberwocky.morph
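If you run the compilation from a script, passing the arguments as a list avoids shell quoting problems with the quoted -c argument. The sketch below only wraps the afst command shown above. The helper name afst_compile_command is hypothetical, and the call is skipped when afst or the .dic file is not available.

```python
import shutil
import subprocess
from pathlib import Path


def afst_compile_command(dic_file: str, morph_file: str) -> list[str]:
    """Build the argument list for the afst call shown above."""
    return ["afst", "-c", f"import {dic_file}; quit", "-o", morph_file]


command = afst_compile_command("jabberwocky.dic", "jabberwocky.morph")
print(command)

# Only attempt the compilation when afst is on PATH and the source file exists.
if shutil.which("afst") and Path("jabberwocky.dic").exists():
    subprocess.run(command, check=True)
```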

Now we can edit the english.config file and add our new file:

"morph_chain" : [
        { "file": "english.morph" , "lookup" : "plain" , "transform" : "none" },
        { "file": "jabberwocky.morph" , "lookup" : "plain" , "transform" : "none" }
],
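The same change can be scripted. The following Python sketch appends an entry to a morph_chain list held in a dictionary. It assumes the configuration can be loaded as plain JSON with the keys shown in the fragment above, and the helper name add_morph_file is hypothetical.

```python
import json


def add_morph_file(config: dict, morph_file: str) -> dict:
    """Append morph_file to the end of morph_chain unless it is already listed."""
    chain = config.setdefault("morph_chain", [])
    if not any(entry.get("file") == morph_file for entry in chain):
        chain.append({"file": morph_file, "lookup": "plain", "transform": "none"})
    return config


# Stand-in for the relevant part of english.config (assumed to be JSON).
config = {
    "morph_chain": [
        {"file": "english.morph", "lookup": "plain", "transform": "none"}
    ]
}

add_morph_file(config, "jabberwocky.morph")
print(json.dumps(config, indent=2))
```

Appending keeps the new dictionary after english.morph, matching the order in the fragment above.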

Now words that are not found in the main dictionary are looked up in our new dictionary:

t(0,2) "It" (init-cap, init-token)['it':Pron-Std]
t(3,6) "was" ['be':V-Past-Sg-be]
t(7,14) "brillig" ['brillig':Adj-Std]
t(14,15) "," [',':Punct-Comma]
t(16,19) "and" ['and':Conj-Coord]
t(20,23) "the" ['the':Det-Def]
t(24,30) "slithy" ['slithy':Adj-Std]
t(31,36) "toves" ['tove':Nn-Pl, +animal]
t(37,40) "did" ['do':Aux-Std]
t(41,45) "gyre" ['gyre':V-Pres]
t(46,49) "and" ['and':Conj-Coord]
t(50,56) "gimble" ['gimble':V-Pres]
t(57,59) "in" ['in':Prep-Std, +Adv-Part]
t(60,63) "the" ['the':Det-Def]
t(64,68) "wabe" ['wabe':Nn-Sg]
t(68,69) "." ['.':Punct-Sent]

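Morphological dictionaries are an efficient way of adding lemmas and tags to words. You can use them creatively to help with your text mining. For example, the entry below gives the Russian word друг the lemma friend, so an expression that matches on that lemma finds it: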

друг:friend[+Positive]

wow -p english -i "друг" -e "<'friend'>"

1  друг