Language Schema for JSON#

In the lxware directory you will find .language files that you can adapt to your needs.

If you open english.language you will see:

{
        "tokenizer" : "english.tkz" ,
        "hmm" : "english.hmm" ,
        "morph_chain" : [
                { "file": "english.morph" , "lookup" : "plain" , "transform" : "none" }
        ],
        "not_found" : [
                { "file": "english.morph" , "lookup" : "cleaner" , "transform" : "none" },
                { "guesser":"english-guesser.morph" , "file": "english.morph" , "lookup" : "guesser" , "transform" : "none" }
        ]
}

It is JSON format, so it is important to respect the syntax. These are the keys that you can specify:

tokenizer#

Specify which tokenizer file (.tkz) to use. This file takes care of breaking sentences up into tokens.

postagger#

Specify the rule-based disambiguation data file.

hmm#

It is a hidden Markov model for POS disambiguation.

These two parameters can be changed and adapted, but it is best to contact us directly before doing so.
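
For instance, a minimal .language file that sets all three of these keys might look like the sketch below. The postagger filename used here is only a placeholder; check the files shipped in your lxware directory for the actual name.

{
        "tokenizer" : "english.tkz" ,
        "postagger" : "english.postagger" ,
        "hmm" : "english.hmm" ,
        "morph_chain" : [
                { "file": "english.morph" , "lookup" : "plain" , "transform" : "none" }
        ]
}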

morph_chain#

Location of the morphological dictionaries and how you want your lookup to be done.

You can specify a list of dictionaries and create a chain: words are looked up in the first listed dictionary and, if they are not found, lookup continues with the next dictionary, and so on:

{
     "morph_chain" : [
            { "file": "medical.morph" , "lookup" : "plain" , "transform" : "none" },
            { "file": "english.morph" , "lookup" : "plain" , "transform" : "none" }
        ]
}

The parameters are the following:

  • file : morph filename

  • lookup : the lookup type; it can be ‘plain’, ‘compound_v2’, ‘compound_v3’, ‘compound_number’, ‘guesser’, ‘tag’ or ‘cleaner’ (see below)

  • transform : specifies the transformation applied to the input; supported values are ‘none’ and ‘tolower’

lookup type

  • plain: the default. It simply looks up words as they are, without further processing.

  • compound_v2: it looks up the lexicons several times to form compounds. The stem of the compound consists of the stems of the components separated by #. E.g.: “pruimentaarten” -> ‘pruim#taart’.

  • compound_v3: it creates the stem based on the input string: “pruimentaarten” -> ‘pruimentaart’. You can set the maximum number of components a compound can have using the option max_components (see the sketch after this list).

  • compound_number: looks up written-out number compositions (used to resolve numbers, e.g. tweeëntwintig).

  • guesser: it tries to guess the original word if there are spelling mistakes.

  • tag: parses tags and decompounds them. It assumes the token starts with # or @, e.g. #BikiniBody.

  • cleaner: the cleaner first removes character repetitions that occur more than 2 times and then performs different lookups, so that something like ‘cooool’ is looked up as ‘cool’.
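
As an illustration of the compound lookups above, a chain entry using compound_v3 with max_components might look like the following sketch. The dictionary filename is only illustrative, and placing max_components directly on the chain entry is an assumption; verify it against the .language files shipped with your installation.

{
        "morph_chain" : [
                { "file": "dutch.morph" , "lookup" : "compound_v3" , "transform" : "none" , "max_components" : 3 }
        ]
}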

transform

  • none: the default; the case remains unchanged.

  • tolower: it transforms the input to lowercase before looking it up in the dictionary. Useful for social media text.

properties

  • a string property to add when the word is found in the given dictionary, or

  • an array of properties to add when the word is found in the given dictionary, e.g. "properties" : [ "+EN", "+Slang" ]
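
For instance, a chain entry that adds properties to every word found in a slang dictionary could look like this sketch (the filename and property values are only illustrative, and placing properties on the chain entry itself follows from the description above):

{
        "morph_chain" : [
                { "file": "slang.morph" , "lookup" : "plain" , "transform" : "tolower" , "properties" : [ "+EN", "+Slang" ] }
        ]
}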

not_found#

This lookup section is applied if nothing else has matched.

{
    "not_found" : [
        { "file": "english.morph" , "lookup" : "cleaner" , "transform" : "none" },
        { "guesser":"english-guesser.morph" , "file": "english.morph" , "lookup" : "guesser" , "transform" : "tolower" }
    ]
}

tokenizer_mapper#

This section is used to map tokenizer values from the *.tkz file to token properties or POS values.

  • With no prefix, a property will be added to the token.

  • To add a POS you need to prefix with a ‘+’.

  • To remove a POS you need to prefix with a ‘-‘.

Example: for the value 500 in the .tkz file, map a property separator, add the POS Nn and remove the current POS Punct:

# english.tkz
$Tir_Separators {500};

# english.config
"tokenizer_mapper" : [  [500,"separator","+Nn","-Punct"] ],

prop_chains#

The prop_chains section is used to map ICU values to a lookup_type. So every word that gets the property tag from the tokenizer will be looked up with the ‘tag’ lookup_type.

$Tir_Tag {2000};
The value 2000 is mapped internally to a tag value.
{
        "prop_chains" : {
                 "tag" : [ { "file": "english.morph" , "lookup" : "tag" , "transform" : "tolower" } ]
        }
}

As an example, #LikeJapan will be parsed like this:

"#LikeJapan" (tag)['like#Japan':Prop-Std
['like':V-Pres, +inf, +Positive]
['Japan':Prop-Std]]

conditions#

The conditions specify how the chain of dictionaries needs to be concatenated.

You can set three keys:

  • type: the type of value you want to look at; it can be a pos, a property or all

  • value: the specific value of the words we are going to select. If the type is a pos it could be Prop, for instance: we would then select all the Props from the first dictionary in the chain and look them up in the next dictionary. A property is, for instance, +nf (not found).

  • action: what to do with the words that we have selected: merge puts together whatever is found in both dictionaries, while overwrite means that the entries from the second dictionary will overwrite the previous ones if they exist.

In the next example, only the words that are not found in the normal dictionary will be looked up in the medical one:

"conditions" : [ { "type" : "property" , "value" : "nf", "action" : "overwrite" } ],
 "morph_chain" : [
        { "file": "english.morph" , "lookup" : "plain" , "transform" : "none" }
        { "file": "medical.morph" , "lookup" : "plain" , "transform" : "none" },
    ]

Check whether the adjectives are also in the slang dictionary and merge both readings:

"conditions" : [ { "type" : "pos" , "value" : "adj", "action" : "merge" } ],
 "morph_chain" : [
        { "file": "english.morph" , "lookup" : "plain" , "transform" : "none" }
        { "file": "slang.morph" , "lookup" : "plain" , "transform" : "none" },
    ]

Look up all words in both the French and English dictionaries and keep all the readings:

"conditions" : [ { "type" : "all" , "action" : "merge" } ],
 "morph_chain" : [
        { "file": "english.morph" , "lookup" : "plain" , "transform" : "none" }
        { "file": "french.morph" , "lookup" : "plain" , "transform" : "none" },
    ]
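
Putting the pieces together, a complete .language file that uses a conditions block could look like the following sketch. It only combines the keys shown above; adapt the filenames and values to your own setup.

{
        "tokenizer" : "english.tkz" ,
        "hmm" : "english.hmm" ,
        "conditions" : [ { "type" : "property" , "value" : "nf", "action" : "overwrite" } ],
        "morph_chain" : [
                { "file": "english.morph" , "lookup" : "plain" , "transform" : "none" },
                { "file": "medical.morph" , "lookup" : "plain" , "transform" : "none" }
        ],
        "not_found" : [
                { "file": "english.morph" , "lookup" : "cleaner" , "transform" : "none" }
        ]
}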