Tokenization#
Tokenization is the process of breaking a stream of text down into sentences, words, punctuation marks and symbols. These smaller units (words, punctuation marks and symbols) are from now on referred to as tokens. Tokens, as they appear in the text, are also called literals.
“Excuse me. Are you the Judean People’s Front?”
Token analysis
Sentence 1 -- offsets: 0,11
Excuse 0,6
me 7,9
. 9,10
Sentence 2 -- offsets: 11,45
Are 11,14
you 15,18
the 19,22
Judean 23,29
People 30,36
’s 36,38
Front 39,44
? 44,45
Hyphens (-) are separated in most languages:
US-based -> US - based
Apostrophes are kept together with their clitic when the combination has meaning in the language:
’s : English
l’ : French
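A minimal sketch of these two behaviors in Python (hypothetical code, not the engine's actual tokenizer): hyphens become separate tokens, while meaningful apostrophe clitics stay attached.

```python
import re

def rough_tokenize(text):
    """Toy tokenizer: hyphens become separate tokens, while meaningful
    apostrophe clitics ('s in English, l' in French) stay attached."""
    return re.findall(r"'s|l'|\w+|[^\w\s]", text)

print(rough_tokenize("US-based"))    # ['US', '-', 'based']
print(rough_tokenize("People's"))    # ['People', "'s"]
```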
Tokenization is important for the proper definition of lexicons and rules. There are two important constructs that you need to understand: the sentence and the token.
Note
Offsets are a property of tokens: the distance, in characters, from the beginning of the text. Every sentence and every token has a beginning and an end offset. Offsets are used for indexing and display functions.
For more information about the .tkz format, see [ICU tokenizer](http://userguide.icu-project.org/boundaryanalysis/break-rules)
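The offset convention from the note above can be reproduced with a short sketch (an illustration only, not how the engine computes them):

```python
import re

text = "Excuse me. Are you the Judean People's Front?"

# Tokens with (begin, end) character offsets from the start of the text;
# end offsets are exclusive, as in the token analysis above.  The 's
# clitic is kept as a single token, mirroring the apostrophe rule.
tokens = [(m.group(), m.start(), m.end())
          for m in re.finditer(r"'s|\w+|[^\w\s]", text)]
print(tokens)
```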
Sentences#
We consider a sentence to be a stream of text between two sentence delimiters. In English (and most languages) these are the period (“.”), the question mark (“?”) and the exclamation mark (“!”) when they are followed by a word starting with an initial capital:
wow -p english -i "Here's to the crazy ones. The misfits. The rebels."
This will yield:
{Sentence
t(0,4) "Here" (init-cap, init-token)['here':Adv-Std]
t(4,6) "'s" (apostrophe)['be':V-Pres-Sg-be]
t(7,9) "to" ['to':Prep-to]
t(10,13) "the" ['the':Det-Def]
t(14,19) "crazy" ['crazy':Adj-Std]
t(20,24) "ones" ['one':Nn-Pl]
t(24,25) "." ['.':Punct-Sent]
}Sentence
s(26,38)
{Sentence
t(26,29) "The" (init-cap, init-token)['the':Det-Def]
t(30,37) "misfits" ['misfit':Nn-Pl]
t(37,38) "." ['.':Punct-Sent]
}Sentence
s(39,50)
{Sentence
t(39,42) "The" (init-cap, init-token)['the':Det-Def]
t(43,49) "rebels" ['rebel':Nn-Pl]
t(49,50) "." ['.':Punct-Sent]
}Sentence
These are three sentences. It is important to know that rules run within sentences, so a rule such as:
rule :{ 'crazy' .. 'misfit' } = SpecialPerson;
won't work with the previous input, because the words we want to match lie in separate sentences.
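The init-cap condition on sentence delimiters can be sketched as follows (a naive illustration, not the engine's implementation):

```python
import re

def split_sentences(text):
    """Toy splitter: '.', '?' or '!' ends a sentence only when the
    next word starts with a capital letter (the init-cap condition)."""
    return re.split(r"(?<=[.?!])\s+(?=[A-Z])", text)

print(split_sentences("Here's to the crazy ones. The misfits. The rebels."))
```

This naive version would also break after abbreviations such as Mr.; the engine exempts those, as described next.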
Abbreviations: there is an exception to these rules, the non-sentence-breaking abbreviations, such as Mr. or Dr.:
wow -p english -i "Mr. Sandman, bring me a dream."
s(0,30)
t(0,3) "Mr." (abbrev, init-cap, init-token)['Mr.':Prop-Std, +title]
t(4,11) "Sandman" (init-cap, nf)['Sandman':Prop-Std]
t(11,12) "," [',':Punct-Comma]
t(13,18) "bring" ['bring':V-Pres]
t(19,21) "me" ['I':Pron-Pers, +1P, +sg]
t(22,23) "a" ['a':Det-Indef]
t(24,29) "dream" ['dream':Nn-Sg]
t(29,30) "." ['.':Punct-Sent]
You can add your own abbreviations in ../tir/lxware, where you will find a human-readable file called <language>.abbrev (english.abbrev, for instance):
[A-Z].
([A-Z].)*
Art.
Dr.
etc.
n.
nr.
Mrs.
Mr.
mr.
Ph.D.
p.m.
Tel.
Sr.
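The abbreviation exemption can be sketched like this (a toy illustration with a hard-coded subset of english.abbrev entries; the real file also contains patterns such as [A-Z]. for initials):

```python
import re

# A tiny, hard-coded subset of entries like those in english.abbrev.
ABBREVS = {"Mr.", "Mrs.", "Dr.", "etc.", "Ph.D.", "p.m.", "Tel.", "Sr."}

def split_sentences(text):
    """Split after . ? ! followed by a capitalized word, unless the
    token ending in '.' is a known non-sentence-breaking abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"(?<=[.?!])\s+(?=[A-Z])", text):
        last_word = text[start:m.start()].split()[-1]
        if last_word in ABBREVS:
            continue  # e.g. "Mr." does not end the sentence
        sentences.append(text[start:m.start()])
        start = m.end()
    sentences.append(text[start:])
    return sentences

print(split_sentences("Mr. Sandman, bring me a dream. Make him cute."))
```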
Tokens#
The second construct is the token: to write lexicons and rules properly, you need to know the extent of a token and its properties. Tokens can be deceiving; sometimes you need to use a space between tokens even when they appear together in the original text:
“ex-girlfriend”
To write your lexicons or rules, you need to artificially separate this item into three tokens:
lexicon: { ex - girlfriend } = PeopleIKnow;
or:
rule:{ 'ex' '-' 'girlfriend' } = PeopleIKnow;
If you use spaces within a literal in a rule, the compiler takes care of it and treats the parts as separate tokens:
rule:{ Person 'run into' Person } = Encounter;
Example:
wow -p english -i "Oscar Wilde ran into Constance Lloyd"
s(0,36)
t(0,5) "Oscar" (init-cap, init-token)['Oscar':Prop-Std]
t(6,11) "Wilde" (init-cap, nf)['Wilde':Prop-Std]
t(12,15) "ran" ['run':V-Past]
t(16,20) "into" ['into':Prep-Std]
t(21,30) "Constance" (init-cap)['Constance':Prop-Std]
t(31,36) "Lloyd" (init-cap)['Lloyd':Prop-Std]
We see that “run” and “into” are 2 separate tokens.
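What the compiler does with a spaced literal can be pictured with a small sketch (hypothetical code; the real engine also matches lemmas, so "ran" would match the literal 'run'):

```python
def expand_literal(literal):
    """Sketch: a spaced literal such as 'run into' is compiled into
    two consecutive single-token patterns."""
    return literal.split()

pattern = expand_literal("run into")          # ['run', 'into']
tokens = "They run into each other".split()

# Naive consecutive-sequence match over surface forms only.
hits = [i for i in range(len(tokens) - len(pattern) + 1)
        if tokens[i:i + len(pattern)] == pattern]
print(hits)    # the sequence starts at token index 1
```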
In the case of English contractions (“can’t”, “won’t”, “gonna”, “I’ve”), the same applies:
rule :{ Person 'can' 'not' 'help' NP } = NoHelp;
As a general rule, tokens are words separated by blanks or punctuation, except for cases like the hyphen above. If you are unsure how a particular word will be tokenized (for instance, an alphanumeric sequence), you should run a sample.