.. _json_language:

Language Schema for JSON
************************

In the lxware directory you will find `.language` files that you can adapt to your needs. If you open english.language you will see:

.. code-block:: json

    {
      "tokenizer" : "english.tkz",
      "hmm" : "english.hmm",
      "morph_chain" : [
        { "file": "english.morph", "lookup" : "plain", "transform" : "none" }
      ],
      "not_found" : [
        { "file": "english.morph", "lookup" : "cleaner", "transform" : "none" },
        { "guesser": "english-guesser.morph", "file": "english.morph", "lookup" : "guesser", "transform" : "none" }
      ]
    }

The file is in JSON format, so it is important to respect the syntax. These are the values that you can specify:

tokenizer
---------

Specifies which tokenizer file (.tkz) to use. This file takes care of breaking up sentences into tokens.

postagger
---------

Specifies the rule-based disambiguation data file.

hmm
---

Specifies the hidden Markov model used for POS disambiguation.

The ``postagger`` and ``hmm`` parameters can be changed and adapted, but for that it is better to contact us directly.

morph_chain
-----------

Location of the morphological dictionaries and how you want the lookup to be done. You can specify a list of dictionaries to create a chain: each word is looked up in the first listed dictionary and, if it is not found there, in the next one, and so on:

.. code-block:: json

    {
      "morph_chain" : [
        { "file": "medical.morph", "lookup" : "plain", "transform" : "none" },
        { "file": "english.morph", "lookup" : "plain", "transform" : "none" }
      ]
    }

The parameters are the following:

* **file**: the morph filename.
* **lookup**: the lookup type, for example 'plain', 'compound_v2', 'cleaner' or 'guesser'. All lookup types are described below.
* **transform**: the transformation applied to the input. Supported values are 'none' and 'tolower'.

**lookup type**

* **plain**: the default. It just looks up words without further ado.
* **compound_v2**: it looks up the lexicons several times to form compounds. The stem of the compound consists of the stems of the components separated by #. E.g.: "pruimentaarten" -> 'pruim#taart'.
* **compound_v3**: it creates the stem based on the input string: "pruimentaarten" -> 'pruimentaart'. You can set the maximum number of components a compound can have using the option `max_components`.
* **compound_number**: it looks up written number compositions. This is used to resolve numbers, e.g. tweeëntwintig.
* **guesser**: it tries to guess the original word if there are spelling mistakes.
* **tag**: it parses tags and decompounds them. It assumes the tag starts with # or @, e.g. #BikiniBody.
* **cleaner**: the cleaner first removes repetitions that occur more than 2 times and then performs different lookups, so things like `cooool` are looked up as `cool`.

**transform**

* **none**: the default. Case remains unchanged.
* **tolower**: it transforms the input to lowercase before looking it up in the dictionary. Useful with social media.

**properties**

* a string: a single property to add when the word is found in the given dictionary.
* an array: several properties to add when the word is found in the given dictionary, e.g. "properties": ["+EN", "+Slang"].

not_found
---------

The lookups in this section are applied if nothing else has matched.

.. code-block:: json

    {
      "not_found" : [
        { "file": "english.morph", "lookup" : "cleaner", "transform" : "none" },
        { "guesser": "english-guesser.morph", "file": "english.morph", "lookup" : "guesser", "transform" : "tolower" }
      ]
    }

tokenizer_mapper
----------------

This section maps tokenizer values from the `*.tkz` file to token properties or POS values.

* With no prefix, the value is added to the token as a property.
* To add a POS, prefix it with a '+'.
* To remove a POS, prefix it with a '-'.

For example, for the value 500 in the tkz file, add the property **separator**, add the POS **Nn** and remove the current POS **Punct**:

.. code-block:: console

    # english.tkz
    $Tir_Separators {500};

    # english.config
    "tokenizer_mapper" : [ [500,"separator","+Nn","-Punct"] ],

prop_chains
-----------

The **prop_chains** section (property chains) maps ICU values to a lookup type. Every word that gets the property **tag** from the tokenizer will be looked up with the 'tag' lookup type.

.. code-block:: console

    $Tir_Tag {2000};

The value 2000 is mapped internally to a tag value.

.. code-block:: json

    {
      "prop_chains" : {
        "tag" : [
          { "file": "english.morph", "lookup" : "tag", "transform" : "tolower" }
        ]
      }
    }

As an example, #LikeJapan will be parsed like this:

.. code-block:: console

    "#LikeJapan" (tag)['like#Japan':Prop-Std ['like':V-Pres, +inf, +Positive] ['Japan':Prop-Std]]

conditions
----------

The **conditions** section specifies how the dictionaries in the chain are combined. Each condition has three fields:

* **type**: the type of value you want to look at. It can be a **pos**, a **property** or **all**.
* **value**: the specific value of the words to select. If the type is a **pos** it could be ``Prop`` for instance, so all the props from the first dictionary in the chain are selected and looked up in the next dictionary. A property is for instance ``+nf``, not found.
* **action**: what to do with the selected words. **merge** puts together whatever is found in both dictionaries; **overwrite** means that the second dictionary's entries replace the previous ones if they exist.

In the next example, only the words that are not found in the normal dictionary will be looked up in the medical one.

.. code-block:: json

    {
      "conditions" : [
        { "type" : "property", "value" : "nf", "action" : "overwrite" }
      ],
      "morph_chain" : [
        { "file": "english.morph", "lookup" : "plain", "transform" : "none" },
        { "file": "medical.morph", "lookup" : "plain", "transform" : "none" }
      ]
    }

Check whether the adjectives are in the slang dictionary and merge both readings.

.. code-block:: json

    {
      "conditions" : [
        { "type" : "pos", "value" : "adj", "action" : "merge" }
      ],
      "morph_chain" : [
        { "file": "english.morph", "lookup" : "plain", "transform" : "none" },
        { "file": "slang.morph", "lookup" : "plain", "transform" : "none" }
      ]
    }

Look up all words in both the French and English dictionaries and keep all the readings:

.. code-block:: json

    {
      "conditions" : [
        { "type" : "all", "action" : "merge" }
      ],
      "morph_chain" : [
        { "file": "english.morph", "lookup" : "plain", "transform" : "none" },
        { "file": "french.morph", "lookup" : "plain", "transform" : "none" }
      ]
    }
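
To make the interaction between **type**, **value** and **action** more concrete, the following is a toy model of the conditions described above. It is only a sketch and not the actual implementation: the dictionaries are plain Python dicts, the ``nf`` marker, the ``(stem, tag)`` reading format and the function names ``lookup``, ``selected`` and ``chain_lookup`` are all assumptions made for illustration.

.. code-block:: python

    # Toy model of the "conditions" semantics; NOT the real lookup engine.
    # A dictionary maps a word to a list of (stem, tag) readings (assumed format).

    NOT_FOUND = "nf"  # hypothetical marker for a word missing from a dictionary


    def lookup(dictionary, word):
        """Return the readings for word, or a single not-found reading."""
        return dictionary.get(word, [(word, NOT_FOUND)])


    def selected(readings, condition):
        """Return True if the condition selects a word with these readings."""
        if condition["type"] == "all":
            return True
        # In this toy model "pos" and "property" both compare against the tag.
        return any(tag == condition["value"] for _, tag in readings)


    def chain_lookup(word, first, second, condition):
        """Look up word in first, then combine with second per the condition."""
        readings = lookup(first, word)
        if not selected(readings, condition):
            return readings
        extra = second.get(word)
        if extra is None:
            return readings  # nothing to merge or overwrite with
        if condition["action"] == "merge":
            # Keep real readings from the first dictionary, add the new ones.
            found = [r for r in readings if r[1] != NOT_FOUND]
            return found + extra
        return extra  # "overwrite": the second dictionary replaces the first


    # Made-up sample data, for illustration only.
    english = {"sick": [("sick", "adj")]}
    medical = {"stent": [("stent", "Nn")], "sick": [("sick", "Nn")]}

    only_not_found = {"type": "property", "value": "nf", "action": "overwrite"}
    keep_everything = {"type": "all", "action": "merge"}

    print(chain_lookup("stent", english, medical, only_not_found))
    print(chain_lookup("sick", english, medical, only_not_found))
    print(chain_lookup("sick", english, medical, keep_everything))

With the ``only_not_found`` condition, "sick" keeps its reading from the first dictionary because it was found there, while "stent" falls through to the second one. With ``keep_everything``, the readings from both dictionaries are returned together.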