Rules#

Rules give you a more expressive mechanism for capturing linguistic patterns than lexicons. Lexicons are preferred for building dictionaries or gazetteers, while rules are meant for larger and more sophisticated patterns:

With rules you can:

  • address stems, literals and part of speech, as done in lexicons

  • use other annotations (from lexicons or other rules)

  • address the context of the expression

  • make sub-annotations

  • use the relative position of the sentence

  • make coreference rules

  • use filters

  • call Python

Rule Syntax#

rule :                         -> rule identifier
{                              -> beginning of capture group
    …                          -> linguistic pattern
}                              -> end of capture group
= Annotation@(att="value")     -> Annotation and attributes (optional)
;                              -> end of rule

Example:

rule :
{
    "Hello"
    "World"
} = Greeting ;

Output

“I said [Hello World]”
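The annotation part can also carry attributes, as shown in the syntax template above. As a minimal sketch (the attribute name lang and its value are invented for illustration):

rule :
{
    "Hello"
    "World"
} = Greeting@(lang="english");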

Capture group#

Every rule needs at least one capture group covering the whole body of the rule, but nested capture groups are possible:

rule :
{
    "Mr"                        → Context
    {
        { Prop } = First
        { Prop } = Last
    } = Person
};                            -> No annotation needed

Output:

“I saw Mr Johann Gambolputty”
           First    Last
            ------------
                Person

In the case above, the main rule does not have an associated annotation; the tokens outside the annotation (“Mr”) just mark the context. There is no limit on the nesting level.
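For instance, a minimal sketch nesting one level deeper (the annotation names Name, First and Last are illustrative):

rule :
{
    {
        { Prop } = First
        { Prop } = Last
    } = Name
} = Person;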

Units: Token#

The basic unit at rule level is the token. A token can be defined as a word plus all its attributes. Tokens are produced by a module called the tokenizer, which decides what is a sentence and what is a token. In general, a token is a string surrounded by blanks and stripped of punctuation:

“what? Are you sure? Yes, I am.”

Here we have 3 sentences and 11 tokens (punctuation also counts as a separate token):

{Sentence
  t(0,4) "what" (init-token)['what':Det-Int]
  t(4,5) "?" ['?':Punct-Sent]
}Sentence
{Sentence
  t(6,9) "Are" (init-cap, init-token)['be':V-Pres-Pl-be]
  t(10,13) "you" ['you':Pron-Pers]
  t(14,18) "sure" ['sure':Adj-Std]
  t(18,19) "?" ['?':Punct-Sent]
}Sentence
{Sentence
  t(20,23) "Yes" (init-cap, init-token)['yes':Interj]
  t(23,24) "," [',':Punct-Comma]
  t(25,26) "I" ['I':Pron-Pers]
  t(27,29) "am" ['be':V-Pres-Sg-be]
  t(29,30) "." ['.':Punct-Sent]
}Sentence

In some cases the results from tokenization are not that obvious, like in the case of the hyphen. If you are doubtful, just run the sentence. Every line represents a token and all its attributes:

./wow -i "3-year-old"

t(0,1) "3" (init-token)['3':Num]
t(1,2) "-" ['-':Punct]
t(2,6) "year" ['year':Nn-Sg]
t(6,7) "-" ['-':Punct]
t(7,10) "old" ['old':Adj-Std]

That sentence has 5 tokens: “3”, “-”, “year”, “-” and “old”. So if you were to write a rule to capture that expression, you would have to match 5 tokens:

rule:
{
    Num "-" "year" "-" "old"
} = Age;

Similarly, if you code it in a lexicon, you have to add extra blanks:

lexicon:
{
    ([:digit:]){1,2} - year - old
}= Age;

Tokens can be represented on their own or enclosed within angular brackets < >. In our rules we can refer to different token dimensions: the literal, the stem, the POS (Part Of Speech), the token properties or a combination of those:

rule: { 'back',Nn 'pain' } = Symptom;

or:

rule: { <'back',Nn> <'pain'> } = Symptom;

If you are representing several token dimensions (like <'back',Nn>), it is better practice to use angular brackets.

From version 1.7 onwards, the angular brackets are no longer mandatory. For example, to address a token literal you can simply use the double-quoted form on its own.

Literal#

A “literal” is the word as it appears in the text surrounded by double quotes:

Example:

“left”

Matches:

“He [left] and went [left] but not LEFT.”

Stem#

The ‘stem’ or lemma is the word as it is found in the dictionary. It needs to be enclosed by single quotes:

Example:

‘leave’

Matches:

“He [left] and went left [leaving] a trace of [leaves].”

In the above sentence, we see examples of ambiguity:

  • Both the verb ‘to leave’ and the noun ‘leave’, as in the ‘leaves’ of a tree, have the same stem or lemma.

  • We see the same with the literal ‘left’ as a direction and ‘left’ as the past tense of the verb ‘to leave’.

In text analysis it is important to be able to distinguish the concepts the strings refer to, so that we can annotate accurately. The process of determining which meaning is the right one is called disambiguation. We will later give some ideas about how to disambiguate tokens that look the same.
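As a preview, a sketch that uses the combination of stem and part of speech (covered later in this section) to pick out only the verbal reading of ‘left’ (the annotation name is invented):

rule :
{
    <'leave',V>
} = LeaveVerb;

The intention is that only the verbal ‘left’ in “He [left] and went left” would be annotated, not the direction.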

Compound Component#

c'stem', or component stem, is a construct that allows you to match the stem of any part of a compound, or a non-compound token.

rule: { c'Finanz' } = FinanzWord;

Output:

“Finanzen”
“Bundesfinanzhof”
“Bundesfinanzminister”
“Finanzgericht”

In lexicons, we saw something similar when we looked for:

lexicon : (input="component") { finanz } = FinanceWord;

There, the strings could be written in lowercase and without special characters. This is not the case in rule tokens: as shown in the example above, both capitals and special characters are maintained.

Compound Head#

h'head'

This will only match the stem of the last part of a compound (the head), or a non-compound word:

rule: { h'Leben' } = Leben;

Output:

“Arbeitsleben”
“Leben”
“Studentenleben”

This won't match words like: “Lebensmotto” or “Lebenslauf”

Part of speech#

  • Pos

A part of speech can be the long tag (e.g. V-Pres) or the umbrella tag (e.g. V, the tag up to the first hyphen):

rule: { Adj } = Adjective;

Output:

“the [quick] [brown] fox”

This is the list of umbrella tags:

| Umbrella Tag | Meaning | Example |
| --- | --- | --- |
| Adj | Adjective | “big”, “incendiary” |
| Adv | Adverb | “incredibly” |
| Aux | Auxiliary | “do”, “will” |
| Conj | Conjunction | “if”, “and” |
| Det | Determiner | “the”, “my” |
| Interj | Interjection | “yes” |
| Nn | Noun | “car”, “imagination” |
| Num | Number | “1”, “forty” |
| Part | Particle | “to”, “not” |
| Prep | Preposition | “in”, “for” |
| Pron | Pronoun | “he”, “anybody” |
| Prop | Proper Noun | “England”, “Michelle” |
| Punct | Punctuation | “,”, “:” |
| V | Verb | “is”, “came” |
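Long tags can be used in a rule in the same way as umbrella tags. As an illustrative sketch, assuming V-Pres is the long tag in your tagset (as in the tokenizer output shown earlier) and with an invented annotation name:

rule: { <V-Pres> } = PresentTense;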

Token property#

  • +tokenproperty

Token properties are properties assigned to the token by the tokenizer or the engine, e.g. word with an initial capital, digits, word is known to the lexicon, etc.

+init-cap

“Beware of the [Jabberwocky]”

You can see the properties in the raw output; they appear after the literal within round parentheses:

t(60,63) "The" (init-cap, init-token)['the':Det-Def]

Properties:

Property

Meaning

Example

+init-token

First token of a sentence

“The”

+init-cap

Initial capital

“Jabberwocky”

+all-cap

All capitals

“USA”

+nt

word not found in dictionary

“porduct”

+nf-lex

proper noun not in wowool lexicon

“Dingeldong”

+abbrev

abbreviation found by tokenizer

“Mr.”

+e-mail

e-mail address

info@eyeontext.com

+internet

url

https://www.rhs.org.uk/advice/advice

+markup

xml, html tags or similar

“<markup>”

+currency

token expressing currency

“$2m, £”

+apostrophe

's in English

“Labor ‘s”

+alphanum

alpha and numbers

AD34_aB

+compound

word is a compound

“Nachrichtendienst”

Token properties must be preceded by a ‘+’ when used in a rule:

rule: { +currency } = MoneyAmount;
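Properties also combine with the other token dimensions. As an illustrative sketch (the annotation name is invented), matching an abbreviation followed by a proper noun, as in “Mr. Smith”:

rule: { <+abbrev> Prop } = TitledName;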

Length#

The length range is used to filter tokens by size without running the expression, which can speed up some time-consuming expressions. An example would be ([a-z]){128}: the engine has to walk through the whole token just to fail or succeed at the last character. By using the length keyword, we avoid going through every token and only start matching if the token is 128 characters long.

rule: { <length(128),([a-z])*> } = Word128;

Syntax: length(min[,max])

length(10) = exactly 10
length(10,) = 10 or more
length(10,12) = between 10 and 12
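As another sketch, mirroring the Word128 example above (the annotation name is invented), only tokens of at least 6 characters are considered before the digit pattern is run:

rule: { <length(6,),([:digit:])*> } = LongNumber;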

Combining matching patterns#

  • <"literal",'stem',Pos,+property>

Here, too, the angular brackets are no longer mandatory, but when combining different dimensions they are visually clearer and safer: typing a space where a comma was intended produces a different pattern.

For example:

"left",Adj = <"left",Adj> = 1 token
"left" Adj = <"left"><Adj> = 2 tokens

Token attributes can be combined: you can use literal, stem, POS and properties to refer to the same token. They should be separated by commas:

<"left",Adj>

“My [left] neighbor left and went left”

We have mentioned before the problem of ambiguity: the fact that many words look similar, though they have different meanings. We can use the combination of stems and POS (Part Of Speech) to differentiate between them:

rule:
{
    (
        <'stem',Nn>
        | <'leave',Nn>
    )
} = PartOfPlant;

"The [stem] and the [leaves] are part of a plant"

Token Regular Expressions#

The same regular expressions used for lexicons can be used within literals and stems:

| regex | Meaning | Example | Matches |
| --- | --- | --- | --- |
| . | Any character | “c(.)t” | “cat”, “cot”, “cut” |
| * | zero or more | “house(.)*” | “house”, “housewife” |
| + | one or more | “(ha)+” | “ha”, “hahaha”, “hahahahaha” |
| ? | zero or one | “book(s)?” | “book”, “books” |
| {n} | ‘n’ elements | “(6){3}” | “666” |
| {n,m} | ‘n’ to ‘m’ elements | “(la){3,6}” | “lalala”, “lalalalalala” |
| \| | ‘or’ operator | “work(ed\|ing)” | “worked”, “working” |

This also applies to the character classes:

| class | Meaning | Example | Matches |
| --- | --- | --- | --- |
| [:alpha:] | alphabetic character | “d([:alpha:])” | “de”, “dl”, “dé” |
| [:alnum:] | alphanumeric | “pw([:alnum:])+” | “pw123”, “pwd3a5y” |
| [:digit:] | digit | “([:digit:]){6}” | “392742”, “123456” |
| [:xdigit:] | hexadecimal | “([:xdigit:]){12}” | “fd8ae9581f65” |
| [:lower:] | lowercase character | “([:lower:])+” | “a”, “niño”, “сейчас” |
| [:upper:] | uppercase character | “([:upper:])+” | “IBM”, “ABC” |
| [:range:] | character range | “([:range(0-9):])+” | “13434987” |
| [range] | character range | “([0-9])+” | “13434987” |

Some examples of regular expressions within rule tokens:

  • Literals:

For literals, remember to use double quotes around the expression:

rule :
{
    "[:upper:]([:lower:])+"
} = InitCaps;

Output:

“[Call] me [Ishmael]”

rule :
{
    <"un(.)+ed",Adj>
} = NegatedAdjectives;

Output:

“unspecified”
“undisclosed”
“understated”
“unlimited”
“unemployed”
“unnamed”
“unwanted”
“unpublished”
“unexpected”
“unprecedented”
  • Stems:

Remember to use single quotes around the expression:

rule :
{
    '(.)*wife'
} = wifeword;

Output:

“The parson’s [wife] visited some [housewives]”

Escape characters#

Not all characters can be used freely in lexicons, stems or literals; some of them are ‘reserved’ characters. A reserved character is one that has another meaning in Wowool: it is used as a separator or an operator or something else, so to match it as it is, you need to escape it. To escape a character, it has to be preceded by a backslash.

This is the list of characters that need to be escaped in Wowool lexicons:

| Character | Name | Example |
| --- | --- | --- |
| . | period | “U.S.” |
| , | comma | “,” |
| : | colon | “:” |
| ' or " | quotes | “'s” |
| ( ) { } [ ] | paired parentheses and brackets | “(” and “)” |
| \| | pipe | “\|” |
| * + ? | Kleene operators | “?” |
| \ | backslash | “\” |

In tokens, the list of characters to escape is longer, as other characters, like the single and double quote (' and "), are ambiguous in that context:

'John' '\'s'
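For instance, a sketch (with an invented annotation name) matching the literal “U.S.” in a rule, with the periods escaped:

rule :
{
    "U\.S\."
} = CountryMention;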

Token level regular expressions#

You can also apply regular expressions to the tokens themselves.

| regex | Meaning | Example | Matches |
| --- | --- | --- | --- |
| <> | Any token | ‘eat’ { <> } = Food | “I [ate sushi]” |
| ()* | zero or more | (Adj)* Nn | “brown quick fox” |
| ()+ | one or more | ‘Mr’ (Prop)+ | “Mr John Wiston Doe” |
| ()? | zero or one | (Det)? Nn | “the book”, “books” |
| (){n} | ‘n’ elements | (‘la’){2} ‘land’ | “la la land” |
| (){n,m} | ‘n’ to ‘m’ elements | V (<>){0,3} Nn | “I [saw a beautiful horse]” |
| ( \| ) | ‘or’ operator | (‘buy’\|‘purchase’) Det Nn | “I bought a bicycle” |

These are the same operators we have seen for characters, plus the ‘any’ token (<>) and two extra kinds of matching: the shortest and longest match, and the unordered operators.

‘<>’ Any token#

<> represents any token, without further specification. It is useful for finding collocations, also in combination with the frequency operators (e.g. (<>){1,6}), and for delimiting the matching scope, so that the items you are matching are not so far apart that they become unrelated.

For collocations, find the 4 tokens following ‘kill’:

'kill' (<>){4}

Output:

“kill bill : vol .”
“kill her and her unborn”
“kill only danes ; then”
“kill an über - target”
“killed while trying to summon”

To delimit the scope, we can require a noun after the expression:

'kill' (<>){0,4} <Nn>

Output:

“killed his wife”
“kill her and her unborn child”
“kill only danes”
“kill his younger brother”
“kill notorious taliban leader”

‘..’ Shortest match#

Regular expressions are by their very nature ‘greedy’. This means that they will try to find the longest match:

V (<>)* Prop

“He [visited Japan and Donald Trump avoided China]”

However, in text mining the shortest match is more useful, as words that are closer together have a stronger relationship, like in verb-object relationships:

V .. Prop

“He [visited Japan] and Donald Trump avoided China”

In the above example, the first Prop we find after the verb ‘visit’ is ‘Japan’; that is the shortest match:

rule :
{
    'meet'
    ..
    Prop
} = meeting;

Output:

“I would have [met her friend John] if Rose would have called me”

Note

You can never start or end a capture group with the shortest match operator:

rule :
{
    'eat'
    {
        ..            -> This is not allowed
        'potato'
    } = Food
} = meeting;

‘.?’ Shortest Span#

We have seen the shortest match operator: from a given point it tries to find the following expression in a non-greedy way:

rule :
{
    { Company } = Buyer
    ..
    'buy'
    ..
    { Company } = Acquired
} = CompanyAcquisition ;

Output:

“[Cardinal Health is planning to buy Medtronic] and Monsanto wants to merge with Bayer”

But also:

“According to [Bloomberg, Cardinal Health is planning to buy Medtronic]”

There could be an expression (or several) of the same kind as the first one in between. To avoid this, and really choose the closest neighbors, you can use the shortest span operator, written .?:

rule :
{
    { Company } = Buyer
    .?
    'buy'
    .?
    { Company } = Acquired
} = CompanyAcquisition;

Output:

“According to Bloomberg, [Cardinal Health is planning to buy Medtronic]”
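The same pattern works for other annotation pairs. As an illustrative sketch with invented names, relating the two closest Person annotations around ‘meet’:

rule :
{
    { Person } = Host
    .?
    'meet'
    .?
    { Person } = Guest
} = Meeting;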

‘…’ Longest match#

When you use the longest match operator, there can be zero or many tokens between the two patterns of the expression.

Let’s check the same rule as above with the longest match:

rule :
{
    'meet'
    ...
    Prop
} = meeting;

Output:

“I would have [met her friend John if Rose] would have called me”

We see that the relation between the verb and proper noun is lost in this case.

Nevertheless, the longest match is sometimes useful if you are trying to find a longer pattern:

rule :
{
    'as' 'a' 'conclusion'
    <"\,">
    {
        ...
    } = ConclusionSentence
};

Output:

“As a conclusion, [reindeers are better than people]”

‘%%’ Unordered shortest match#

The ‘unordered’ operator is used to match expressions that can appear in a sentence in no particular order. For instance, we are interested in a date and an event, like someone’s birth, but they can appear as:

“On September 8th 1964, John Murphy was born”

or:

“John Murphy was born on September 8th 1964”

rule:
{
    (Date %% Person 'be' "born")
} = DateBirthRelation;

Parentheses around the expression are mandatory. This rule matches both of the sentences above.

‘%%’ will capture the closest match of the expressions it encounters:

rule:
{
    (WeekDay %% Event %% Country)
} = DateEventCountry;

This time, the closest elements will be chosen:

"On Friday there was a tsunami in Indonesia, Thailand was spared"
        |                  |          |
    -----------------------------------------
         DateEventCountry

‘%’ Unordered adjacent#

The same principles apply to the unordered adjacent operator (‘%’), except that the elements need to be adjacent:

rule:
{
    ('gebruiken' % 'worden')
} = Usage;

Output:

“mijn gegevens moeten niet [gebruikt worden] voor direct marketing”
“mijn gegevens [werden gebruikt] voor direct marketing”

(Dutch for “my data must not be [used] for direct marketing” and “my data [was used] for direct marketing”)

‘%%%’ Unordered longest match#

The same principles apply to the unordered longest match (‘%%%’), only now the furthest elements will be chosen:

rule:
{
    (WeekDay %%% Event %%% Country)
} = DateEventCountry;

"On Friday there was a tsunami in Indonesia, Thailand was spared"
        |                  |                     |
    -------------------------------------------------
         DateEventCountry

Units: Annotations#

Not only tokens, but also annotations, can appear as elements in the rules:

rule:
{
    Person 'visit' (City | Person)
} = PersonVisitFact;

Output:

“Angela Merkel visited Washington”

The same regular expressions as for tokens can be applied.

All annotations, whether from lexicons or from rules, are reusable in other rules:

rule:
{
    PersonVisitFact .. Date
} = PersonVisitDate;

Output:

“[Angela Merkel visited Washington on the 18th of February 2017]”

Annotations as containers#

Annotations can be viewed as containers, which means that you can match the elements inside them. To match an element inside a container, you use the [ ] notation.

rule:
{
    Person[( "Merkel" | "Putin" | "Rihanna" | "Beyoncé")]
    'visit'
    (City | Person)
} = VIP_Visit;

Output:

“[Angela Merkel visited Washington]”

‘^’ anchors#

In the rule above, the listed names “Merkel” or “Beyoncé” can appear anywhere inside the Person annotation, so both “Angela Merkel” and “Beyoncé Knowles” will match. We might want to match only annotations that start or end with a certain token; in that case we anchor the token to the beginning or end of the container with the ^ symbol.

rule:
{
    Person[( "Merkel" | "Putin" | "Rihanna" | "Beyoncé") ^]
    'visit'
    (City | Person)
} = VIP_Visit;

This will match “Merkel” as well as “Angela Merkel”; “Beyoncé” but not “Beyoncé Knowles”.

We can use the anchors on both sides as well:

rule:
{
    Company[^ <> ^]
} = OneWordCompanyName;

Then you will match companies with just one word in their name, like ‘Samsung’ and ‘Apple’, and not others like ‘British Airways’.

Sentence Container#

A sentence container is a special kind of out-of-the-box annotation that is useful as a container:

rule:
{
    Sentence
    [
    {
        (<>)*
         <"\?">
    } = QuestionSentence
    ]
};

Output:
"Will East Coast rail services remain nationalised ?"
"Why is no other developed country content trading on WTO rules alone ?"
"What is the EU withdrawal bill ?"
"What did I do ?"
"Was this helpful ?"

Using annotation attributes#

Attributes allow you to add features to annotations that can help you make more interesting queries later.

You can address annotation attributes coming from rules or lexicons in other rules:

rule: {
    Event
    'in'
    City@(country="Spain")
} = SpanishEvent;


“Last year’s [summit in Madrid]”

Formatted attributes#

You can add attributes at runtime using the literals, stems, or attributes of the annotations in the rule.

For instance, you can aggregate the position of a person as an attribute using a rule:

// Person Aggregator
// twitter CEO Jack Dorsey
rule:{
    Position
    {
        Person
    } = Person@( position=f"{rule.Position.literal()}" )
};


//Output
- Person -> Jack Dorsey  @ position=CEO;

You can use formatted attributes to automatically add information from DBpedia.

rule:
{
    Person
} = Person@( inherit="true", birthPlace=f"{dbpedia:birthPlace}");

You can then use it later in a rule:

rule:{ Person@(birthPlace="Honolulu") } = Honolulian;

If you run the sentence “I met Barack Obama”, the literal of the Person ‘Barack Obama’ will be looked up in DBpedia, the attribute ‘birthPlace’ will be retrieved, and it will be populated in your resulting annotation:

Person -> Barack Obama @ birthPlace=Honolulu;

Negation lookahead#

Version > 2.0.4: A negative lookahead will continue matching if the expression did not match, but it will not move the current matching location. The reason it is a lookahead is that we can also apply negation to semantic tags, like !( Person ) for ‘not a person’.

// I saw a green train, with a green car on it.
rule:{ 'green' !( 'car' ) <> } = NotAGreenCar;

As the !( expr ) is a lookahead, we still need to capture the token after it, using the any token ‘<>’. If you run the sentence “I saw a green train, with a green car on it.”, only ‘green train’ will be matched:

NotAGreenCar -> green train

Note that this also works with Concepts.

rule:{ 'University' 'of' !( City ) <> } = UNI_NOT_CITY;

Match           : University of Art
Does NOT match  : University of Antwerp