Lexicons#

The easiest way to use Wowool is to encode a list of words or regular expressions in a lexicon. A lexicon is a list of words or expressions separated by commas.

Lexicon Syntax#

lexicon :            -> Identifier for lexicons
{                    -> Opening of group
       Word1,        -> Words or expressions
       Word2,
       ...
} = Annotation;     -> } closes the group
                    -> = assignment operator
                    -> Annotation: name of annotation
                    -> “;” closes the rule

Note

It is important to put a comma after each entry. If you don’t, you will get a warning from the compiler:

rules/test.wow(6,0): Lexicon entry has carriage returns, did you forget the ‘,’ ?

(6,0) tells you which line to look at, in this case line 6, but the problem usually comes from the line above (here, line 5).

The last entry of a lexicon may end with a comma, but this is not mandatory.

Example:

lexicon :
{
     apple,
     kiwi,
     passion fruit,
     watermelon
} = fruit;

You will find your fruit:

“I bought an [apple], a [kiwi] and a [passion fruit]”

This kind of lexicon will match the word ‘apple’ literally, but not variants like ‘apples’ or ‘APPLE’. To match those, modify the lexicon identifier to specify the kind of input you wish to match:

the literal, or word as it appears in the text,
the stem, or dictionary form,
the ‘normalized’ stem or literal, independent of capitalization or accent patterns,
the head or a part of a compound,
the canonical attribute of a concept:

lexicon: { … } = Annotation;
lexicon: (input="stem") { … } = Annotation;
lexicon: (input="normalized_literal") { … } = Annotation;
lexicon: (input="normalized_stem") { … } = Annotation;
lexicon: (input="component") { … } = Annotation;
lexicon: (input="head") { … } = Annotation;
lexicon: (input="canonical") { … } = Annotation;

Let’s see how this works:

Type of Matches#

literal matching#

literal matching: matches the words as they appear in the text

lexicon : { kiwi } = Fruit;

This will only match once in the following sentence:

“I ate one [kiwi]. He ate two kiwis. Kiwi is a fruit. KIWI LOVE”
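The single match can be reproduced with a small Python sketch. It is only an emulation of the idea of literal matching, not Wowool code: the tokenizer (`re.findall` on word characters) and the function name `literal_matches` are assumptions made for this illustration.

```python
import re

SENTENCE = "I ate one kiwi. He ate two kiwis. Kiwi is a fruit. KIWI LOVE"


def literal_matches(entry: str, text: str) -> list[str]:
    """Return the tokens that are character-for-character equal to the entry."""
    # Naive tokenizer, assumed for this sketch only.
    tokens = re.findall(r"\w+", text)
    return [token for token in tokens if token == entry]


print(literal_matches("kiwi", SENTENCE))  # ['kiwi']
```

Only the first occurrence is returned: ‘kiwis’, ‘Kiwi’ and ‘KIWI’ differ from the entry in at least one character.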

stem matching#

Stem matching: matches the stem of the token

lexicon: (input="stem") { kiwi } = Fruit;

This will match all the morphological variants that have “kiwi” as a stem.

“I ate one [kiwi]. He ate two [kiwis]. [Kiwi] is a fruit. KIWI LOVE”

In the previous sentence, “Kiwi” is matched: although it is capitalized, it stands at the beginning of a sentence, so the word is normalized to lowercase. “KIWI”, however, has ‘KIWI’ as its stem, so it cannot be matched with this input option. This is often desirable: there are cases where we do not want to match words with a different capitalization pattern, such as ‘WHO’ (the World Health Organization) or ‘UPS’.

When matching the stem, we need to take into account phenomena like compounding:

“Finanzredakteur” [‘Finanzredakteur’:Nn-Sg [‘Finanz’:Nn-Std][‘Redakteur’:Nn-Sg]]

“Kabinettsmitglieder” [‘Kabinettsmitglied’:Nn-Std [‘Kabinett’:Nn-Sg, +neut][‘Mitglied’:Nn-Std, +neut, +nom-gen-acc]]

“Aktienpreis” [‘Aktienpreis’:Nn-Sg [‘Aktie’:Nn-Pl][‘Preis’:Nn-Sg]]

To match the above cases using the stem, we would need to make entries like:

lexicon: (input="stem")
{
    Finanzredakteur,
    Kabinettsmitglied,
    Aktienpreis,
} = WordsOfInterest;

So if the first parts of the compound are in the plural or have a linking ‘s’, they remain as they are; only the head of the compound is reduced to its stem (‘Mitglieder’ -> ‘Mitglied’).

In the case of contractions (e.g. French ‘aux’, Italian ‘della’, Spanish ‘del’, German ‘im’, Portuguese ‘da’), for simplicity's sake, the stem is the first part of the contraction:

“aux” -> ‘à’

“del” -> ‘de’

“della” -> ‘di’

“im” -> ‘in’

For verbal contractions in English (“don't”, “gonna”, “can't”), the literals are split in two, which can produce odd forms:

“don't” -> “do” “n't” -> 'do' 'not'

“can't” -> “ca” “n't” -> 'can' 'not'
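The two examples above can be traced with a short Python sketch. It is a toy model, not Wowool's tokenizer: the splitting rule, the `STEMS` table and both function names are hypothetical and cover only the forms shown in this section.

```python
# Hypothetical stem table, limited to the parts shown in the examples above.
STEMS = {"ca": "can", "n't": "not"}


def split_contraction(literal: str) -> list[str]:
    """Split a negative contraction into its two literal parts."""
    if literal.endswith("n't") and len(literal) > 3:
        return [literal[:-3], "n't"]
    return [literal]


def stems(literal: str) -> list[str]:
    """Map each literal part to its stem, falling back to the part itself."""
    return [STEMS.get(part, part) for part in split_contraction(literal)]


print(split_contraction("can't"))  # ['ca', "n't"]
print(stems("can't"))              # ['can', 'not']
```

The literal parts (‘ca’, ‘n't’) are awkward to write in a lexicon, whereas the stems (‘can’, ‘not’) read naturally.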

We recommend using the stem instead of the literal in rules and lexicons:

// Not recommended
lexicon: { ca n't } = Negation;
rule: { "ca" "n't" } = Negation;

// Recommended
lexicon: (input="stem") { can not } = Negation;
rule: { 'can' 'not' } = Negation;

Note

Using a morphologically derived form instead of the stem will cause the lexicon entry not to match:

Example:

lexicon: (input="stem") { kiwis } = Fruit;

Output:

“I love kiwis” -> won’t match

Normalized Literal#

Normalized literal: matches the literal token regardless of capitalization or any accents or special characters:

lexicon: (input="normalized_literal") { Eyeontext } = MyCompany;

This will match all our variants:

“I work for [Eyeontext]. I work for [EyeOnText]. I work for [EYEONTEXT].”

Another example with accents:

lexicon: (input="normalized_literal") { agnes, john } = Friend;

Output:

“[Agnès] and [John] are my dear friends”

As for special characters:

lexicon: (input="normalized_literal") { suker, zewlakow } = Friend;

…

Output:

“My Polish friends are called [Šuker] and [Żewłakow]”

For Spanish ‘ñ’:

lexicon: (input="normalized_literal") { banco popular espanol } = Bank;

…

“[Banco Popular Español]”

Note

Normalization follows the ICU (International Components for Unicode) guidelines. Some characters do not have a normalized equivalent, for instance:

Character	Example
ß	“straßburg”
ø	“hjørring”
æ	“holbæk”
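The effect of this kind of normalization can be approximated with Python's standard `unicodedata` module. This is a sketch under assumptions, not Wowool's or ICU's implementation: it lowercases the text, applies NFD decomposition and drops combining marks. The function name `normalize` is hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase the text and strip accents that decompose under NFD."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Combining marks (accents, tildes, carons) are dropped;
    # base characters are kept as they are.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["EyeOnText", "Agnès", "Español", "straßburg", "hjørring", "holbæk"]:
    print(word, "->", normalize(word))
```

In this approximation, ‘è’ and ‘ñ’ lose their diacritics because they decompose into a base letter plus a combining mark, while ‘ß’, ‘ø’ and ‘æ’ have no such decomposition and are left unchanged, which is consistent with the table above.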

Normalized Stem#

Normalized stem: matches the stem of the token regardless of capitalization or any accents or special characters:

lexicon: (input="normalized_stem") { peche } = Fruit;

Output:

“Confiture de [pêches]”

In the case of German, we can lowercase the nouns:

lexicon: (input="normalized_stem")
{
    finanzredakteur
} = Profession;

Note

Using a morphologically derived form instead of the stem will cause the lexicon entry not to match:

lexicon: (input="normalized_stem") { peches } = Fruit;

Output:

“Confiture de pêches” -> won't match

Note

Normalization follows the ICU (International Components for Unicode) guidelines. Some characters do not have a normalized equivalent, for instance:

Character	Example
ß	“straßburg”
ø	“hjørring”
æ	“holbæk”

Component#

Component: this is a special match made for languages with compounding (Norwegian Bokmål, German, Russian, Danish, Dutch and Swedish). It matches the normalized stem of the word, either on its own or as a part of a compound. Because the normalized form is used, the lexicon entries need to be listed in lowercase and without accents or special characters:

Example:

lexicon: (input="component") { erdbeere } = Fruit;

Output:

Erdbeeren

Erdbeerkuchen

Erdbeertorten

Note

The annotation will extend to the full token, not to part of it.

Head#

Head: this is also made for languages with compounding. This will only match the normalized stem of the word independently or as the last element of the compound. The lexicon words need to be listed in lowercase and without accents or special characters:

Example:

lexicon: (input="head") { kuchen } = Food;

Output:

Kuchen

Erdbeerkuchen

Marmorkuchen

Kuchenrezepte

“Kuchenrezepte” is not matched, as “Kuchen” is not the last element of the compound in that word.

Note

The annotation will extend to the full token, not to part of it.
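The difference between the component and head inputs can be sketched in Python. The compound decompositions in `COMPOUNDS` are hypothetical, hand-written values for this illustration (a real analysis would come from the language model), and both function names are assumptions.

```python
# Hypothetical decompositions into normalized stems, written by hand.
COMPOUNDS = {
    "Kuchen": ["kuchen"],
    "Erdbeeren": ["erdbeere"],
    "Erdbeerkuchen": ["erdbeere", "kuchen"],
    "Marmorkuchen": ["marmor", "kuchen"],
    "Kuchenrezepte": ["kuchen", "rezept"],
}


def component_match(entry: str, token: str) -> bool:
    """The entry may be any part of the compound."""
    return entry in COMPOUNDS[token]


def head_match(entry: str, token: str) -> bool:
    """The entry must be the last part of the compound."""
    return COMPOUNDS[token][-1] == entry


for token in COMPOUNDS:
    print(token, component_match("kuchen", token), head_match("kuchen", token))
```

Under these assumed decompositions, ‘Kuchenrezepte’ matches ‘kuchen’ as a component but not as a head.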

Canonical#

Canonical: this is to match on concepts that have a canonical attribute.

Example:

lexicon: (input="canonical") { Joe Biden } = President;

Output:

Joe Biden is the president. He lives in the White House.

Joe Biden

He

Note

When running this domain, the canonical attribute should already have been set in a previous domain or in a lexicon.

Lexicons: Regular Expressions#

Wowool allows the use of regular expressions in its lexicons:

lexicon : { (ha)+ } = laughter;

This matches one or more repetitions of the character sequence ‘ha’:

hahahahaha

Another example:

lexicon: (input="normalized_literal")
{
    (.)+straße,
    (.)+straat,
    street
} = StreetWord;

This will match words whose literal ends in ‘straße’ or ‘straat’, as well as the word ‘street’ itself:

Lange Leemstraat

Regular Expression Operators#

The regular expression operators that you can use in strings (literals and lemmas) in your lexicons or rules are the following:

regex	Meaning	Example	Matches
.	Any character	c(.)t	“cat”, “cot”, “cut”
*	zero or more	house(.)*	“house”, “housewife”
+	one or more	(ha)+	“ha” “hahaha” “hahahahaha”
?	zero or one	book(s)?	“book”, “books”
{n}	‘n’ elements	(6){3}	“666”
{n,m}	‘n’ to ‘m’ elements	(la){3,6}	“lalalalalala”, “lalala”
|	or operator	work(ed|ing)	“worked”, “working”
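The example patterns in the table are also valid in Python's `re` module, so they can be checked there. This is only a cross-check with standard regular expression semantics, on the assumption that the operators behave the same way; it does not run the Wowool matcher. The names `EXAMPLES` and `all_examples_match` are hypothetical.

```python
import re

# Patterns and matching words copied from the table above.
EXAMPLES = {
    r"c(.)t": ["cat", "cot", "cut"],
    r"house(.)*": ["house", "housewife"],
    r"(ha)+": ["ha", "hahaha", "hahahahaha"],
    r"book(s)?": ["book", "books"],
    r"(6){3}": ["666"],
    r"(la){3,6}": ["lalalalalala", "lalala"],
    r"work(ed|ing)": ["worked", "working"],
}


def all_examples_match() -> bool:
    """Check that every word is matched in full by its pattern."""
    return all(
        re.fullmatch(pattern, word) is not None
        for pattern, words in EXAMPLES.items()
        for word in words
    )


print(all_examples_match())  # True
```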

For all the repetition operators (*, +, ?, {n} and {n,m}), parentheses are required around the pattern that is to be repeated:

e.g.:

so* -> This is not allowed

s(o)* -> This is the correct form

The same applies to the ‘or’ operator: the parentheses need to delimit the extent of the strings to be 'ored':

In lexicons:

(colon|small intestine) cancer

In rules:

('colon'|'small' 'intestine') 'cancer'

You can use several operators in an expression, always respecting the proper use of parentheses:

(colon|(small )?intestine) cancer

To use regular expressions in a lexicon string, it is important to understand that the space ' ' in a lexicon is a regular character. For instance, to match ‘single white female’ or ‘single female’, where ‘white’ is optional, you will have to write:

lexicon:
{
    single (white )?female
} = CandidateRenter;

Note

The space is included inside the optional expression (white )? and there is no space between the closing parenthesis and female.

Warning: As spaces are meaningful, be careful not to insert two spaces between the tokens: ‘single  white female’. Such mistakes are difficult to spot and will result in your expression not being matched.
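Because the space is an ordinary character, its position relative to the optional group matters. The sketch below checks this with Python's `re` module, assuming standard regular expression semantics for the literal parts of the pattern; it is not the Wowool matcher, and the constant and function names are hypothetical.

```python
import re

SPACE_INSIDE = r"single (white )?female"   # space belongs to the optional group
SPACE_OUTSIDE = r"single (white)? female"  # space is always required


def matches(pattern: str, text: str) -> bool:
    """True when the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches(SPACE_INSIDE, "single white female"))   # True
print(matches(SPACE_INSIDE, "single female"))         # True
print(matches(SPACE_OUTSIDE, "single female"))        # False
print(matches(SPACE_INSIDE, "single  white female"))  # False
```

With the space outside the group, ‘single female’ would need two spaces to match, because both the space after ‘single’ and the one before ‘female’ are mandatory. The last line shows the accidental double space described in the warning.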

The ‘any token’ expression (.)* behaves differently: the space that follows it does not count as a character to be matched:

lexicon:
{
    single (.)* female
} = CandidateRenter;

Output:

‘single Asian female’ or ‘single white female’

Always with a token between ‘single’ and ‘female’.

The third option, matching any token or no token at all in between, is expressed with ‘((.)* )?’:

lexicon:
{
    single ((.)* )?female
} = CandidateRenter;

Output:

‘single Asian female’ or ‘single white female’ as well as ‘single female’.

Any Token#

<>

You can use the any token replacement <> to capture any token.

lexicon:
{
    chief (<>)? officer
} = Chiefs;

Output:

The ‘chief officer’ or ‘chief executive officer’.

Character Ranges#

[..]

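Square brackets define a set of characters. A set can combine several ranges and individual characters, for example [A-Za-z123]: the ‘-’ indicates a range; without it, the characters simply form a set.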

lexicon:
{
    [A-Z]([a-z])+son
} = SwedishFamilyName;

Output:

Bjorn ‘Andersson’ or Anna ‘Karlsson’.

Character Classes#

If you want to refer to classes of characters, like digits or alphabetic characters, you can use the built-in character classes:

class	Meaning	Example	Matches
[:alpha:]	alphabetic character	d([:alpha:])	de, dl, dé
[:alnum:]	alphanumeric	pw([:alnum:])+	pw123, pwd3a5y
[:digit:]	digit	([:digit:]){6}	392742, 123456
[:xdigit:]	hexadecimal	([:xdigit:]){12}	fd8ae9581f65
[:lower:]	lowercase character	([:lower:])+	a, niño, сейчас
[:upper:]	uppercase character	([:upper:])+	IBM, ABC
[:range:]	character range	([:range(0-9):])+	13434987

Note

As you can see, character classes can be very useful, particularly for alphanumeric patterns (dates, number plates, social security numbers) or when the language you are dealing with has special characters in its alphabet (å, й). You can combine these classes with the other operators, as shown in the examples, always enclosed in parentheses:

//----------------------
//Spain - telephone
//91 111 11 11 -> Madrid
//954 111 111 -> Sevilla
//----------------------
lexicon:
{
    91 ([:digit:]){3} ([:digit:]){2} ([:digit:]){2},
    954 ([:digit:]){3} ([:digit:]){3}
} = SpanishPhone;
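The two phone patterns can be tried out in Python as a rough cross-check. Python's `re` module does not support the `[:digit:]` class, so this sketch substitutes `[0-9]`, which is an assumption about equivalence rather than a statement about Wowool's engine. The names `SPANISH_PHONE` and `is_spanish_phone` are hypothetical.

```python
import re

# [0-9] stands in for [:digit:], which Python's re module does not provide.
SPANISH_PHONE = re.compile(
    r"91 ([0-9]){3} ([0-9]){2} ([0-9]){2}"  # Madrid: 91 111 11 11
    r"|954 ([0-9]){3} ([0-9]){3}"           # Sevilla: 954 111 111
)


def is_spanish_phone(text: str) -> bool:
    """True when the whole text has one of the two phone layouts."""
    return SPANISH_PHONE.fullmatch(text) is not None


print(is_spanish_phone("91 111 11 11"))  # True
print(is_spanish_phone("954 111 111"))   # True
print(is_spanish_phone("91 111 111"))    # False
```

The last number is rejected because it combines the Madrid prefix with the Sevilla digit grouping.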

Attributes#

You can attach attributes to any annotation, and thus also to lexicons. This is particularly useful when you need more fine-grained information in your rules. You can think of attributes as columns in a database table, so you could query your data with something like:

rule: { Company@(country="Germany", sector="automotive") } = GermanCarIndustry;

This would be the equivalent of:

SELECT * FROM Company WHERE country LIKE 'Germany' AND sector LIKE 'automotive';
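The database analogy can be made concrete with a small Python sketch that filters annotations by attribute values. The `Annotation` class, the `select` function and the sample data are hypothetical and exist only for this illustration; they are not part of any Wowool API.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    name: str
    text: str
    attributes: dict[str, str] = field(default_factory=dict)


def select(annotations: list[Annotation], name: str, **wanted: str) -> list[Annotation]:
    """Keep annotations with the given name whose attributes all match."""
    return [
        a for a in annotations
        if a.name == name
        and all(a.attributes.get(key) == value for key, value in wanted.items())
    ]


# Sample data invented for this sketch.
ANNOTATIONS = [
    Annotation("Company", "Volkswagen", {"country": "Germany", "sector": "automotive"}),
    Annotation("Company", "Renault", {"country": "France", "sector": "automotive"}),
    Annotation("Company", "Siemens", {"country": "Germany", "sector": "engineering"}),
]

german_cars = select(ANNOTATIONS, "Company", country="Germany", sector="automotive")
print([a.text for a in german_cars])  # ['Volkswagen']
```

Each attribute in the query narrows the selection, just as each condition in the WHERE clause does.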

Attributes appear after the annotation, preceded by @ and enclosed in parentheses. Attributes are expressed as attribute-value pairs.

You can choose any attribute name, as long as it consists only of alphanumeric characters [a-zA-Z0-9] or underscores _ and its first character is alphabetic.

The value is a string, so there are fewer restrictions: it can contain several words and any character except quotation marks.

lexicon:
{
    Calais,
    Lyon,
    Nantes,
    Paris
} = City@(country = "France");

You can specify more than one attribute, separated by commas:

lexicon:
{
    spaghetti,
    lasagna,
    cannelloni
} = Food@(type="pasta dish", region="Italy");

You can only refer to annotations, and thus also annotations with attributes, from rules:

rule:
{
    {(Prop)+} = ItalianRestaurant
    'serve'
    Food@(region="Italy")
};

Canonicals#

A canonical is a preferred name for something. For instance, when referring to ‘IBM’, we might say ‘IBM’, ‘International Business Machines’ or ‘Big Blue’. To simplify, we are going to use ‘IBM’ as the canonical (a normalized name). We can express this easily in lexicons, using the ‘:’ operator. Canonicals are a special kind of attribute in Wowool.

lexicon: (input="stem")
{
    VW:Volkswagen,
    (IBM|International Business Machines|Big Blue):IBM
} = Company;

Output:

International Business Machines is an American company

Company@(canonical = "IBM")