Tokenization#

Tokenization is the process of breaking down a stream of text into sentences, words, punctuation marks and symbols. These smaller units (words, punctuation marks and symbols) are from now on referred to as tokens. Tokens, as they appear in a text, are also called literals.

“Excuse me. Are you the Judean People’s Front?”

Token analysis

Sentence 1 --     offsets:  0,11
Excuse                      0,6
me                          7,9
.                           9,10

Sentence 2 --     offsets:  11,45
Are                         11,14
you                         15,18
the                         19,22
Judean                      23,29
People                      30,36
’s                          36,38
Front                       39,44
?                           44,45

Hyphens (-) are separated into their own tokens in most languages:

US-based -> US - based
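
If in doubt, you can check how a hyphenated word is tokenized by running it through the analyzer (the input below is just an illustration):

wow -p english -i "They hired a US-based analyst"

The hyphen should come out as a token of its own, so “US-based” becomes the three tokens “US”, “-” and “based”.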

Apostrophes are kept together with their token when they have meaning in the language:

’s : English
l’ : French
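
For example, to see how the French elision l’ is kept attached, you could run the analyzer against a French pipeline (assuming one is installed under the name french; both the pipeline name and the input are illustrative):

wow -p french -i "l'éléphant"

The apostrophe should stay with the article, yielding “l’” as a single token.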

Tokenization is important for the proper definition of lexicons and rules. There are two important constructs that you need to understand: the sentence and the token.

Note

Offsets are a property of tokens and can be defined as the distance, in characters, from the beginning of the text. Every sentence and token has a beginning and an end offset. Offsets are used for indexing and display functions.

For more info about the .tkz format, see the [ICU tokenizer](http://userguide.icu-project.org/boundaryanalysis/break-rules) documentation.

Sentences#

We consider a sentence to be a stream of text between two sentence delimiters. In English (and most languages) these are the period (“.”), the question mark (“?”) and the exclamation mark (“!”), when they are followed by a word starting with an initial capital:

wow -p english -i "Here's to the crazy ones. The misfits. The rebels."

This will yield:

s(0,25)
{Sentence
   t(0,4)   "Here" (init-cap, init-token)['here':Adv-Std]
   t(4,6)   "'s"   (apostrophe)['be':V-Pres-Sg-be]
   t(7,9)   "to"   ['to':Prep-to]
   t(10,13) "the"   ['the':Det-Def]
   t(14,19) "crazy" ['crazy':Adj-Std]
   t(20,24) "ones"  ['one':Nn-Pl]
   t(24,25) "."     ['.':Punct-Sent]
}Sentence
s(26,38)
{Sentence
   t(26,29) "The" (init-cap, init-token)['the':Det-Def]
   t(30,37) "misfits" ['misfit':Nn-Pl]
   t(37,38) "." ['.':Punct-Sent]
}Sentence
s(39,50)
{Sentence
   t(39,42) "The" (init-cap, init-token)['the':Det-Def]
   t(43,49) "rebels" ['rebel':Nn-Pl]
   t(49,50) "." ['.':Punct-Sent]
}Sentence

These are three sentences. It is important to know that rules run within sentences, so a rule such as:

rule :{ 'crazy' .. 'misfit' } = SpecialPerson;

won't work with the previous input, because the words that we want to match are in separate sentences.
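
A rule whose tokens all fall within a single sentence, on the other hand, can match. As a sketch using the same operator (SpecialPerson is just an illustrative label):

rule :{ 'crazy' .. 'ones' } = SpecialPerson;

Both “crazy” and “ones” belong to the first sentence, so this rule fires on the input above.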

Abbreviations: there is an exception to these rules, the non-sentence-breaking abbreviations, like Mr. or Dr.

wow -p english -i "Mr. Sandman, bring me a dream."
s(0,30)
    t(0,3) "Mr." (abbrev, init-cap, init-token)['Mr.':Prop-Std, +title]
    t(4,11) "Sandman" (init-cap, nf)['Sandman':Prop-Std]
    t(11,12) "," [',':Punct-Comma]
    t(13,18) "bring" ['bring':V-Pres]
    t(19,21) "me" ['I':Pron-Pers, +1P, +sg]
    t(22,23) "a" ['a':Det-Indef]
    t(24,29) "dream" ['dream':Nn-Sg]
    t(29,30) "." ['.':Punct-Sent]

You can add your own abbreviations in ../tir/lxware, where you will find a readable file called <language>.abbrev (english.abbrev, for instance):

[A-Z].([A-Z].)*
Art.
Dr.
etc.
n.
nr.
Mrs.
Mr.
mr.
Ph.D.
p.m.
Tel.
Sr.
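
For example, to keep “Prof. Moriarty” from being split into two sentences, you could append an entry for it (this entry is an illustration; it is not part of the default file shown above):

Prof.

After the change, the tokenizer should mark “Prof.” with the abbrev property instead of treating the period as a sentence delimiter.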

Tokens#

The second construct is the token: to write lexicons and rules properly, you need to know the extent of the token and its properties. Tokens can be deceiving; sometimes you might need to use a space between the tokens, even when they appear together in the original text:

“ex-girlfriend”

To write your lexicons or rules you would need to artificially separate this item into three tokens:

lexicon: { ex - girlfriend } = PeopleIKnow;
or:
rule:{ 'ex' '-' 'girlfriend' } = PeopleIKnow;

If you use spaces within a quoted string in a rule, the compiler will take care of it and treat the parts as different tokens, setting them apart:

rule:{ Person 'run into'  Person } = Encounter;

Example:

wow -p english -i "Oscar Wilde ran into Constance Lloyd"
s(0,36)
    t(0,5) "Oscar" (init-cap, init-token)['Oscar':Prop-Std]
    t(6,11) "Wilde" (init-cap, nf)['Wilde':Prop-Std]
    t(12,15) "ran" ['run':V-Past]
    t(16,20) "into" ['into':Prep-Std]
    t(21,30) "Constance" (init-cap)['Constance':Prop-Std]
    t(31,36) "Lloyd" (init-cap)['Lloyd':Prop-Std]

We see that “run” and “into” are two separate tokens.

In the case of English contractions (“can’t”, “won’t”, “gonna”, “I’ve”), the same applies:

rule :{ Person 'can' 'not' 'help' NP } = NoHelp;
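
If you are unsure how a given contraction is expanded, run a sample before writing the rule (the input below is just an illustration):

wow -p english -i "He can't help smiling"

and inspect the resulting tokens: the rule above assumes the contraction is split into the tokens “can” and “not”.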

As a general rule, tokens are words separated by blanks or punctuation, except for cases like the hyphen above. If you are unsure of how a particular word will be tokenized, for instance an alphanumeric sequence, you should run a sample.
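
For instance, a quick check on an alphanumeric sequence might look like this (the input is illustrative):

wow -p english -i "Flight AB1234 departs at 9:45"

The token boundaries shown in the output tell you exactly what your lexicons and rules have to match.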