Lemmatization or Stemming

Lemmatization (also called ‘stemming’ in our documentation) links a word to its base or dictionary form, which is known as its lemma (or stem):

“ate” -> ‘eat’

Some languages, such as English, have a fairly simple morphology, which means that the lemmas are not that different from the derived forms (‘work’, ‘worked’). Other languages, such as Spanish, French or German, have more morphological variation. For these languages it is important to lemmatize, so that a single rule can generalize over all the forms of a word:

E.g.: forms of the Spanish verb “poner” (to put):
“pusiera” -> ‘poner’
“pone” -> ‘poner’
“pondrá” -> ‘poner’
“poniéndolo” -> ‘poner’
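
As a rough illustration of why this matters for rule writing, the Python sketch below maps the forms above to their lemma and then matches on the lemma only. The lookup table and the function names `lemmatize` and `matches_lemma` are hypothetical and hand-written for this example; they are not how the product implements lemmatization.

```python
# Toy lemma table, hand-written for illustration (not a real lexicon).
LEMMAS = {
    "pusiera": "poner",
    "pone": "poner",
    "pondrá": "poner",
    "poniéndolo": "poner",
}


def lemmatize(word: str) -> str:
    """Return the lemma of a known form, or the lowercased word otherwise."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)


def matches_lemma(text: str, lemma: str) -> list[str]:
    """Collect the whitespace-separated tokens whose lemma equals `lemma`."""
    return [token for token in text.split() if lemmatize(token) == lemma]


# One condition on the lemma covers every derived form in the text.
print(matches_lemma("pone pusiera casa pondrá", "poner"))
```

A single condition on ‘poner’ picks up all three verb forms in the sample text and skips the unrelated word, which is the generalization the lemma provides.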

For Germanic languages (German, Dutch, Swedish, Danish and Norwegian) there is a mechanism that breaks compound words (words made up of two or more different words, like “cheesecake”) into their components.

E.g.: “Kanzlerkandidat” ‘Kanzlerkandidat’ -> ‘Kanzler’ ‘Kandidat’

This facilitates both search and rule writing. For instance, you might be interested in finding all the words (compounds or not) in which the Dutch word ‘fiets’ (bicycle) appears. To do so, you can use a construct called ‘head’ (the head of the compound, that is, its last word) or another called ‘component’, where it does not matter in which part of the compound the word appears.

E.g., using the head:

“<h’Minister’>” -> Gesundheitsminister, Finanzminister, Premierminister, …

E.g., using the component:

“<c’Minister’>” -> Finanzminister, Kultusministerkonferenz, Ministerpräsidentinnen
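
The difference between the two constructs can be sketched in Python as follows. The component splits are written by hand for this example and are assumptions, as are the names `COMPOUNDS`, `match_head` and `match_component`; in practice the splits would come from the compound-splitting mechanism described above.

```python
# Hand-written component splits (assumptions for illustration only).
COMPOUNDS = {
    "Gesundheitsminister": ["Gesundheit", "Minister"],
    "Finanzminister": ["Finanz", "Minister"],
    "Premierminister": ["Premier", "Minister"],
    "Kultusministerkonferenz": ["Kultus", "Minister", "Konferenz"],
    "Ministerpräsidentinnen": ["Minister", "Präsidentin"],
}


def match_head(stem: str) -> list[str]:
    """Words whose last component (the head) is `stem`."""
    return [word for word, parts in COMPOUNDS.items() if parts[-1] == stem]


def match_component(stem: str) -> list[str]:
    """Words that contain `stem` as a component in any position."""
    return [word for word, parts in COMPOUNDS.items() if stem in parts]


print(match_head("Minister"))
print(match_component("Minister"))
```

Here the head match returns only the three words that end in ‘Minister’, while the component match also returns ‘Kultusministerkonferenz’ and ‘Ministerpräsidentinnen’, where ‘Minister’ sits in the middle or at the start.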

What’s the use of Stems?

As we mentioned, stems let us generalize: they save us the trouble of writing out all the derived word forms.

Stems can also help us to disambiguate. Disambiguation entails distinguishing word meanings: for example, ‘left’ can be a direction or the past form of the verb ‘leave’. If you want to annotate ‘left’ the direction and not the action (left -> leave), you can use the stem to differentiate the two forms:

rule:{'left'} = LeftDirection;
rule:{'leave'} = Departure;
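
To make the effect of these two rules concrete, here is a minimal Python sketch of matching on the stem rather than on the surface form. The token list, with its hand-assigned stems, and the names `RULES` and `annotate` are hypothetical; the sketch only imitates the behaviour described above and is not the rule engine itself.

```python
# (surface form, stem) pairs for "She left early and turned left".
# The stems are assigned by hand here; a lemmatizer would supply them.
TOKENS = [
    ("She", "she"),
    ("left", "leave"),   # past form of the verb 'leave'
    ("early", "early"),
    ("and", "and"),
    ("turned", "turn"),
    ("left", "left"),    # the direction
]

# Stem -> annotation, mirroring the two rules above.
RULES = {"left": "LeftDirection", "leave": "Departure"}


def annotate(tokens: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Annotate each token whose stem has a rule, keeping the surface form."""
    return [(surface, RULES[stem]) for surface, stem in tokens if stem in RULES]


print(annotate(TOKENS))
```

Both occurrences have the same surface form ‘left’, but because the lookup uses the stem, the first is annotated as Departure and the second as LeftDirection.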