wow.cp – Semantic Copy

Semantic Copy – or wow.cp – is the command line utility provided by the eot-wowool-sdk.

Example#


Usage

EyeOnText Corpus Copy Tool. This tool rearranges a corpus folder using language identification, topics, semantic themes and wowool rules. The output argument is formatted at runtime: --output ~/tmp/{language}/{filename} will sort all your files according to the language of the document.

ex: using -n (dry_run)

wow.cp -f ~/corpus/multi_lingual --output ~/tmp/{language}/{filename} -n

Variables:

  • expression or capture (Concept): The Concept that has been captured with the -e or --expression argument.

  • language (str): language of the document

  • topic (list): topics found in the document.

  • theme (list): themes found in the document.

  • counter (int): file counter.

  • input_filename, ifilename (Path): a Path object, which means you can use ifilename.name, ifilename.stem and ifilename.suffix.

  • suffix (str): filename extension.
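
The substitution of these variables can be pictured with Python's built-in str.format, which handles plain names, attribute access on a Path and list indexing. This is only a minimal sketch under assumptions: resolve_output is a hypothetical helper, the sample values are made up, {filename} is assumed to hold the name of the input file, and wow.cp itself may resolve its format differently (str.format alone does not cover slices or function calls).

```python
from pathlib import Path


def resolve_output(template: str, **variables) -> str:
    """Hypothetical helper: fill an output template with document variables."""
    return template.format(**variables)


# Made-up values standing in for what the tool would compute per document.
input_path = Path("corpus/multi_lingual/report.html")
variables = {
    "language": "english",
    "topic": ["finance"],
    "theme": ["economy", "politics"],
    "counter": 1,
    "input_filename": input_path,
    "ifilename": input_path,
    "filename": input_path.name,  # assumption: the bare file name
    "suffix": input_path.suffix,
}

# Plain variables.
by_language = resolve_output("out/{language}/{filename}", **variables)

# List indexing and attribute access on the Path object.
by_theme = resolve_output(
    "out/{language}/{theme[0]}/{ifilename.stem}{suffix}", **variables
)

print(by_language)
print(by_theme)
```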

ex: using the first theme name

wow.cp -f ~/corpus/multi_lingual --output ~/tmp/{language}/{theme[0]}/{filename} -n

ex: copying the file to multiple locations, one per theme, for the first 2 themes

wow.cp -f ~/corpus/multi_lingual --output ~/tmp/{language}/{theme[:2]}/{filename} -n
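
The idea of one file landing in several locations can be sketched as expanding a list-valued variable into one target path per item. The helper fan_out and its sample arguments are hypothetical, and the sketch assumes the slice follows Python semantics; how wow.cp expands a slice internally is not documented here.

```python
from pathlib import Path


def fan_out(root: Path, language: str, themes: list[str],
            filename: str, limit: int) -> list[Path]:
    """Hypothetical helper: one target path per theme, for the first `limit` themes."""
    return [root / language / theme / filename for theme in themes[:limit]]


targets = fan_out(
    Path("out"),
    "english",
    ["economy", "politics", "sports"],
    "report.txt",
    limit=2,
)

for target in targets:
    print(target.as_posix())
```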

Functions:

  • folder(str): converts to lower case and replaces ‘ ‘ with ‘_’

  • camelize(str): converts ‘Streaming service’ -> ‘StreamingService’

  • normalize(str): removes accents

  • initial_caps(str): converts ‘ALlCaPs’ -> ‘Allcaps’
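
To make the behaviour of these helpers concrete, here is a rough Python equivalent of each one, written only from the descriptions in this list. These are not the SDK's implementations; edge cases (punctuation, non-Latin scripts, multiple spaces) may be handled differently by wow.cp.

```python
import unicodedata


def folder(text: str) -> str:
    """Lower-case the text and replace spaces with underscores."""
    return text.lower().replace(" ", "_")


def camelize(text: str) -> str:
    """Upper-case the first letter of each word and join the words."""
    return "".join(word[:1].upper() + word[1:] for word in text.split())


def normalize(text: str) -> str:
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def initial_caps(text: str) -> str:
    """Upper-case the first character and lower-case the rest."""
    return text.capitalize()


print(folder("Streaming service"))
print(camelize("Streaming service"))
print(normalize("caf\u00e9"))
print(initial_caps("ALlCaPs"))
```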

ex: using wowool to sort your files by the gender of the person that has been captured.

wow.cp -f ~/corpus/multi_lingual  -e 'Person' -p "english,entity" --output "~/tmp/{folder(expression.Person.gender)}/{filename}" -x ~/tmp/corpus_test/not_found/notfound_{counter}{suffix}

Note

  • -e, --expression: the wowoolian expression, which is also used to create a variable that can be used in the output variable. In this case we are sorting by gender.

  • -p, --pipeline: the pipeline to run your expression.

  • -x, --output_fallback: in case the output filename cannot be resolved, we fall back on this format. If that fails, the file is skipped.
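
The fallback rule for the output name (try the output format, then the fallback format, otherwise skip the file) can be sketched as follows. resolve_with_fallback is a hypothetical helper built on str.format, and the variable values are made up; the exceptions wow.cp actually catches are an assumption.

```python
from typing import Optional


def resolve_with_fallback(output: str, fallback: Optional[str],
                          variables: dict) -> Optional[str]:
    """Hypothetical helper: return the first template that resolves, else None."""
    for template in (output, fallback):
        if template is None:
            continue
        try:
            return template.format(**variables)
        except (KeyError, AttributeError, IndexError):
            # A variable is missing or empty for this document: try the next template.
            continue
    return None  # nothing resolved: the caller skips the file


# No person was captured, so {gender} cannot be resolved.
variables = {"filename": "a.txt", "counter": 3, "suffix": ".txt"}
target = resolve_with_fallback(
    "out/{gender}/{filename}",
    "not_found/notfound_{counter}{suffix}",
    variables,
)
print(target)
```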

Set --action to ‘link’ to link the file rather than copy it, or to ‘text’ to extract just the text of the file. This means that if your input is HTML, the text will be extracted.
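
The three action values can be pictured with the sketch below. It is an illustration only: apply_action is a hypothetical helper, ‘link’ is assumed to mean a symbolic link, and the HTML text extraction uses Python's standard html.parser rather than whatever wow.cp uses internally.

```python
import os
import shutil
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        # Note: this naive version also keeps script and style contents.
        if data.strip():
            self.chunks.append(data.strip())


def apply_action(source: Path, target: Path, action: str = "copy") -> None:
    """Hypothetical helper: copy, link, or extract text from `source` into `target`."""
    target.parent.mkdir(parents=True, exist_ok=True)
    if action == "copy":
        shutil.copy2(source, target)
    elif action == "link":
        os.symlink(source.resolve(), target)  # assumption: symbolic link
    elif action == "text":
        raw = source.read_text(encoding="utf-8")
        if source.suffix.lower() in {".html", ".htm"}:
            parser = _TextExtractor()
            parser.feed(raw)
            parser.close()
            raw = "\n".join(parser.chunks)
        target.write_text(raw, encoding="utf-8")
    else:
        raise ValueError(f"unknown action: {action!r}")
```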

usage: wow.cp [-h] -f FILE [-o OUTPUT] [-x OUTPUT_FALLBACK] [-l LANGUAGE]
              [-p PIPELINE] [-e EXPRESSION] [--to_text] [-n] [--action ACTION]
              [--overwrite]

Named Arguments

-f, --file

input folder or file

-o, --output

output folder or file; the value is formatted at runtime

-x, --output_fallback

fallback output format, used if there was an exception in generating the output name.

-l, --language

language

Default: “auto”

-p, --pipeline

pipeline

-e, --expression

expression to search

--to_text

clean up the documents

Default: False

-n, --dry_run

dry run

Default: False

--action

what to do with the input file; one of: copy, link, text

Default: “copy”

--overwrite

overwrite

Default: False
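
For readers who want to see the required argument and the defaults in one place, the argument list can be mirrored with Python's argparse. This is a sketch reconstructed from the usage text and the defaults listed here, not the tool's actual parser; build_parser is a hypothetical name, and restricting --action to three choices is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented wow.cp arguments."""
    parser = argparse.ArgumentParser(prog="wow.cp")
    parser.add_argument("-f", "--file", required=True, help="input folder or file")
    parser.add_argument("-o", "--output", help="output format, resolved at runtime")
    parser.add_argument("-x", "--output_fallback", help="fallback output format")
    parser.add_argument("-l", "--language", default="auto")
    parser.add_argument("-p", "--pipeline")
    parser.add_argument("-e", "--expression", help="expression to search")
    parser.add_argument("--to_text", action="store_true")
    parser.add_argument("-n", "--dry_run", action="store_true")
    parser.add_argument("--action", default="copy", choices=["copy", "link", "text"])
    parser.add_argument("--overwrite", action="store_true")
    return parser


# Only -f is required; everything else falls back to its default.
args = build_parser().parse_args(["-f", "corpus", "-n"])
print(args)
```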