Corpus analysis#

The semantic corpus analysis tool (-e) lets you search a corpus using wowool expressions. It helps you to:

  • Check which expressions are most frequent in your corpus

  • See collocations: what words appear around a word or expression

  • Develop your rules, and test them fast and reliably.

  • Create lexicons.

  • See in which context the expression was found, with a statistical analysis of the output.

Run a sample expression:

wow -p english -i "A rose is a rose is a rose. I like roses." -e " 'rose' "

The output has several parts:

First, you get the matches listed after the sentence they have been found in:

SENTENCE:(0,27) A rose is a rose is a rose .
  CAPTURE -> rose
  CAPTURE -> rose
  CAPTURE -> rose
SENTENCE:(28,41) I like roses .
  CAPTURE -> roses

Then you see the options that were used and the expression:

### options:
* language:english
* domains:[]
* text:[A rose is a rose is a rose. I like roses.]
* threshold:1

### expression:
        'rose'

Finally, we get the general statistics, how many times a match has been found for that expression:

## CAPTURE

    |  cnt|                       literal|
    |:---:|                          :---|
    |    1|                         roses|
    |    3|                          rose|
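The statistics table can be reproduced outside the tool with a few lines of Python. The sketch below is only an illustration of the counting step, not wowool's implementation: the function name `frequency_table`, the column widths, and the tie-breaking order are assumptions chosen to resemble the output above.

```python
from collections import Counter


def frequency_table(literals, cnt_width=5, lit_width=30):
    """Render capture literals as a count table, rarest first.

    Assumption: rows are sorted by count, then alphabetically; the real
    tool may order ties differently.
    """
    counts = Counter(literals)
    rows = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    lines = [
        f"|{'cnt':>{cnt_width}}|{'literal':>{lit_width}}|",
        f"|{':---:':>{cnt_width}}|{':---':>{lit_width}}|",
    ]
    lines += [f"|{cnt:>{cnt_width}}|{lit:>{lit_width}}|" for lit, cnt in rows]
    return "\n".join(lines)


# The four captures from the sample run above.
captures = ["rose", "rose", "rose", "roses"]
table = frequency_table(captures)
print(table)
```

The table counts literals, not stems, which is why "rose" and "roses" get separate rows even though both match the stem expression.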

Options#

Input parameters -i/-f#

You can pass a text or files to your tool for analysis:

  • You can pass a sentence (-i)

wow -p english -i "A rose is a rose is a rose" -e " 'rose' "

  • You can pass a file (-f)

wow -p english -f ~/corpus/english/movies/Alien.txt -e " 'alien' "

  • You can pass a folder (-f)

wow -p english -f ~/corpus/english/movies/ -e " 'kill' "

The -e expression#

The search expression is preceded by the -e parameter. Between quotes you can put the rule expression, without the “rule:” prefix (lexicons are not supported).

wow -p english -f ~/corpus/english/movies/ -e " 'kill' "

The default annotation is called CAPTURE, but you can annotate all or part of the expression using the { } = Annotation syntax:

wow -p english -f ~/corpus/english/movies/ -e "{ 'kill' }= Action Det { Nn }=Victim "

When you capture more than one expression, you get statistics for each capture separately (Action and Victim) and for the whole expression (CAPTURE), plus the collocations, which show the whole expression in tabular form:

## Action

|  cnt|                       literal|
|:---:|                          :---|
|    1|                       killing|
|    3|                        killed|
|    4|                         kills|

## CAPTURE

|  cnt|                       literal|
|:---:|                          :---|
|    1|             killed his family|
|    1|               killed his wife|
|    1|                killed the dog|
|    1|               killing the dog|
|    1|         kills another captive|
|    1|                kills both men|
|    1|              kills his friend|
|    1|              kills his guards|

## Victim

|  cnt|                       literal|
|:---:|                          :---|
|    1|                       captive|
|    1|                          wife|
|    1|                        family|
|    1|                        friend|
|    1|                        guards|
|    1|                           men|
|    2|                           dog|

## Collocations:

    { 'kill' }= Action Det { Nn }=Victim


|  cnt|              Action|              Victim|
|:---:|                ---:|                ---:|
|    1|              killed|                 dog|
|    1|              killed|              family|
|    1|              killed|                wife|
|    1|             killing|                 dog|
|    1|               kills|             captive|
|    1|               kills|              friend|
|    1|               kills|              guards|
|    1|               kills|                 men|

The command line does not let you use the same kind of quotation marks (" " or ' ') both around the expression and inside it, so alternate them depending on whether you are matching literals (' "literal" ') or stems (" 'stem' ").
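If you launch the tool from a script, you can sidestep the quoting problem by passing the arguments as a list, so that no shell has to interpret the expression. The sketch below only builds and prints the commands; the helper `build_wow_command` is a hypothetical name, and it uses only the -p, -i and -e options shown in this page.

```python
import shlex


def build_wow_command(pipeline, text, expression):
    """Return the argument list for a wow call (hypothetical helper)."""
    return ["wow", "-p", pipeline, "-i", text, "-e", expression]


text = "he left and went left"

# Stem match: single quotes inside the expression.
stem_cmd = build_wow_command("english", text, " 'left' ")

# Literal match: double quotes inside the expression.
literal_cmd = build_wow_command("english", text, '"left"')

# shlex.join shows a POSIX-shell-safe spelling of each command.
print(shlex.join(stem_cmd))
print(shlex.join(literal_cmd))
```

An argument list like `stem_cmd` can be handed to `subprocess.run` as is, assuming the `wow` executable is on your PATH; the expression then reaches the tool unchanged, whichever quotes it contains.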

Passing a domain to a pipeline (-p)#

Running a set of rules or lexicons (a domain) is an excellent way to start searching your document collection.

You can use a compiled domain file:

wow -p 'english,english-entity' -f ~/corpus/english/movies -e 'Person'

or a directory containing wow files:

wow -p 'english,movie-rules' -f ~/corpus/english/movies -e 'Character'

You can combine domains by listing them separated by commas:

wow -p 'english,english-entity,movie-rules' -f ~/corpus/english/movies -e 'Character .. Event'

--cis#

Prints all matches in lowercase:

wow -p english -f ~/corpus/english/movies/ -e ' Prop ' --cis

Matching#

Literals: use '' around the expression#

wow -p english -i "he left and went left" -e '"left"'

Stem: use "" around the expression#

wow -p english -i "he left and went left" -e " 'left' "

On Windows you cannot use single quotes around the expression, but this should work:

Matching the literal in Windows#

wow -p english -i "he left and went left" -e " \"left\" "

Matching the stem in Windows#

wow -p english -i "he left and went left" -e " 'left' "

POS - statistical analysis#

Part of speech analysis can help you understand what type of information your texts contain.

Nouns refer to concepts. In English, two or more nouns in a row are more meaningful than just one. In Germanic languages, compounds can also give you more meaningful information (-e " +compound "). Romance languages tend to use the preposition 'of' (-e "Nn 'de' Nn ").

wow -p english -f ~/corpus/english/movies/ -e " Nn (Nn)+ "

Proper names are used for entities. Check them out:

wow -p english -f ~/corpus/english/movies/ -e " (Prop)+ "

Adjectives are descriptions; they can give you an idea of the sentiment.

wow -p english -f ~/corpus/english/amazon/ -e " Adj "

Actions are expressed with Verbs. These are important to capture facts.

wow -p english -f ~/corpus/english/movies/ -e " V "

Collocations#

Collocations refer to how an expression relates to other expressions (or rules) in the sentence. To have collocations, you just need to capture 2 annotations or more in your expression.

Let’s see how the word ‘kill’ appears in a text:

wow -p english -f ~/corpus/english/movies/ -e " {'kill'}=Kill (<>){0,3} {( Prop | Nn )+} = Victim"

At the bottom you will see ## Collocations, followed by a sorted table of the collocations. These results already reveal some patterns:

|  cnt|                Kill|              Victim|
|:---:|                ---:|                ---:|
|    2|               kills|              friend|
|    2|               kills|               Benny|
|    2|               kills|                 men|
|    2|                kill|                 men|
|    2|             killing|             process|
|    2|              killed|             farmers|
|    2|              killed|                 dog|
|    2|              killed|            creature|
|    2|              killed|              action|
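Conceptually, a collocation table counts how often each combination of captured values occurs together in one match. The sketch below illustrates that idea with hypothetical data; the `matches` list and the `collocations` helper are assumptions for the example, not the tool's data model.

```python
from collections import Counter

# Hypothetical matches: one dict per match, keyed by annotation name.
matches = [
    {"Kill": "killed", "Victim": "dog"},
    {"Kill": "kills", "Victim": "friend"},
    {"Kill": "killed", "Victim": "dog"},
    {"Kill": "kills", "Victim": "men"},
]


def collocations(matches, names):
    """Count each combination of captured values, in the order of names."""
    return Counter(tuple(match[name] for name in names) for match in matches)


pairs = collocations(matches, ("Kill", "Victim"))

# Print the most frequent pairs first, in a layout similar to the table above.
for (kill, victim), cnt in pairs.most_common():
    print(f"|{cnt:>5}|{kill:>20}|{victim:>20}|")
```

Because the key is the tuple of all captured values, adding a third annotation to the expression would simply add a third column to the table.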

Let’s now check which adjective-noun pairs we find most often:

wow -p english -f ~/corpus/english/movies/ -e " {Adj}=Adj {Nn+} = Noun"

|  cnt|                 Adj|                Noun|
|:---:|                ---:|                ---:|
|   17|                next|             morning|
|   10|                next|                 day|
|    5|                real|               world|
|    4|           following|             morning|
|    4|               first|                 day|
|    4|             present|                 day|

Another example, with a domain and with the stem:

wow++ -p english,syntax -f ~/corpus/english/movies/ -e " {V}=Verb NP[{Nn+} = Noun]"

|  cnt|                Verb|                Noun|
|:---:|                ---:|                ---:|
|    2|             become |               king |
|    1|              break |                net |
|    1|              bring |               news |
|    1|            confess |            feeling |
|    1|             devise |               plan |
|    1|          encounter |             forest |
|    1|          encounter |              shark |

If you pass a file format with the -t option, you get Markdown or JSON output that you can open directly in any visualization tool:

wow++ -p english,entity -f ~/corpus/english/news -e " {Position}=Job {Person}=Name " > position.md

Formatting your output the way you want#

You can also format your output in whatever way is convenient for you by using the wowool plugin function format. Let’s see how this works:

wow -p english,english-syntax,english-drugs -f ~/corpus/english/healthline   -e '{ Triple[ Subject[HealthIssue] VP Object ] }= ::python::eot.wowool.plugin::format@(format="{this.Subject.literal()} == {this.VP.stem()} == {this.Object.literal()} ")'

The argument of format is a Python f-string. You can access the variables in your rules with {this.variable_name}. Between the arguments, we have chosen to print a double equals sign.

Leukemia == be == a form of cancer
heart problems == be == a less common but serious side effect of Sprycel
Hair loss == be == a less common side effect of Sprycel in studies
A more severe allergic reaction == be == rare but possible
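To see how such a format string is evaluated, here is a small stand-alone Python sketch. The `Span` and `TripleMatch` classes are hypothetical stand-ins for whatever object the plugin binds to `this`; only the method names `literal()` and `stem()` are taken from the command above.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for a captured annotation."""
    text: str  # the words as they appear in the sentence
    base: str  # the stemmed form

    def literal(self) -> str:
        return self.text

    def stem(self) -> str:
        return self.base


@dataclass
class TripleMatch:
    """Hypothetical stand-in for the object bound to `this`."""
    Subject: Span
    VP: Span
    Object: Span


this = TripleMatch(
    Subject=Span("Leukemia", "leukemia"),
    VP=Span("is", "be"),
    Object=Span("a form of cancer", "a form of cancer"),
)

# The same template as in the command, written as a regular Python f-string.
line = f"{this.Subject.literal()} == {this.VP.stem()} == {this.Object.literal()} "
print(line)
```

The verb is printed as "be" because the template asks for the stem of the VP, while the subject and object are printed as they appear in the text.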

Rule writing aid#

Semantic search can be used to write rules. Let’s write a rule to discover patterns for company acquisition. Make your expression wide and unordered to have as much recall as possible:

wow -p 'english,snippet(lexicon:(input="stem") {buy,acquire,purchase}=Buy;).app,english-entity' -f ~/corpus/english/news/ -e "(Company %% Buy)"

The results already give us clues about how to write more targeted rules:

2 -> AT & T ‘s acquisition of Time Warner
2 -> AT & T , buying a content producer , Time Warner
2 -> AT & T ( T ) first announced its intention to purchase Time Warner
4 -> AT & T acquisition of Time Warner