wow.cp - Semantic Copy
Semantic Copy, or wow.cp, is the command-line utility provided by the eot-wowool-sdk.
Example
Usage
EyeOnText Corpus Copy Tool. This tool rearranges a corpus folder using language identification, topics, semantic themes, and wowool rules. The output argument is formatted at runtime: --output ~/tmp/{language}/{filename} sorts all your files by the language of each document.
ex: using -n (dry_run)
wow.cp -f ~/corpus/multi_lingual --output ~/tmp/{language}/{filename} -n
Variables:
expression or capture (Concept): the Concept that has been captured with the -e or --expression argument.
language (str): language of the document
topic (list): topics found in the document.
theme (list): themes found in the document.
counter (int): file counter.
input_filename, ifilename (Path): Path object, which means you can use ifilename.name, ifilename.stem, and ifilename.suffix.
suffix (str): filename extension.
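The variables above behave much like fields in a Python format string. The sketch below illustrates that idea only; it is not wow.cp's actual implementation. `build_context` and `resolve_output` are hypothetical helpers, and `filename` is assumed to be the name of the input file (it is used in the examples but not listed above).

```python
from pathlib import Path


def build_context(path: Path, language: str, topics: list, themes: list, counter: int) -> dict:
    """Collect the documented variables into a dict (hypothetical helper)."""
    return {
        "language": language,
        "topic": topics,
        "theme": themes,
        "counter": counter,
        "input_filename": path,
        "ifilename": path,
        "filename": path.name,  # assumption: the input file's name
        "suffix": path.suffix,
    }


def resolve_output(template: str, context: dict) -> Path:
    """Fill an --output style template with the context using str.format()."""
    return Path(template.format(**context))


ctx = build_context(
    Path("/corpus/report.html"), "english", ["finance"], ["economy", "banking"], 3
)
print(resolve_output("/tmp/{language}/{filename}", ctx))
print(resolve_output("/tmp/{language}/{theme[0]}/{ifilename.stem}_{counter}{suffix}", ctx))
```

Plain `str.format()` handles attribute access (`ifilename.stem`) and integer indexes (`theme[0]`), but not slices or function calls, so the tool presumably evaluates its templates with something richer than this sketch.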
ex: using the first theme name
wow.cp -f ~/corpus/multi_lingual --output ~/tmp/{language}/{theme[0]}/{filename} -n
ex: This will copy the file to multiple locations, one for each of the first 2 themes
wow.cp -f ~/corpus/multi_lingual --output ~/tmp/{language}/{theme[:1]}/{filename} -n
Functions:
folder(str): converts to lower case and replaces ' ' with '_'
camelize(str): converts 'Streaming service' -> 'StreamingService'
normalize(str): removes accents
initial_caps(str): converts ALlCaPs -> Allcaps
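To make the behavior of these functions concrete, here is a minimal Python sketch that reproduces the transformations described above. These are hypothetical re-implementations written from the one-line descriptions, not the SDK's own code, and edge cases (punctuation, multiple spaces) may be handled differently by the real tool.

```python
import unicodedata


def folder(text: str) -> str:
    """Lower-case the text and replace spaces with underscores."""
    return text.lower().replace(" ", "_")


def camelize(text: str) -> str:
    """Upper-case the first letter of each word and join the words."""
    return "".join(word[:1].upper() + word[1:] for word in text.split())


def normalize(text: str) -> str:
    """Remove accents by dropping combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def initial_caps(text: str) -> str:
    """Upper-case the first character and lower-case the rest."""
    return text.capitalize()


print(folder("Streaming service"))
print(camelize("Streaming service"))
print(normalize("café"))
print(initial_caps("ALlCaPs"))
```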
ex: using wowool to sort your files by the gender of the person that has been captured.
wow.cp -f ~/corpus/multi_lingual -e 'Person' -p "english,entity" --output "~/tmp/{folder(expression.Person.gender)}/{filename}" -x ~/tmp/corpus_test/not_found/notfound_{counter}{suffix}
Note
- -e, --expression: the wowoolian expression, which is also used to create a variable that can be used in the output format. In this case we are sorting by gender.
- -p, --pipeline: the pipeline used to run your expression.
- -x, --output_fallback: if the output filename cannot be resolved, this format is used as a fallback. If that also fails, the file is skipped.
Use --action to either 'link' the file or, with 'text', extract just the text of the file. This means that if your input is HTML, the text will be extracted.
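The fallback behavior described in the note (try the output format, then the fallback format, then skip the file) can be sketched as follows. This is an illustration based only on that description: `resolve_with_fallback` is a hypothetical helper, and plain `str.format()` stands in for whatever template evaluation wow.cp actually performs.

```python
from pathlib import Path
from typing import Optional


def resolve_with_fallback(template: str, fallback: Optional[str], context: dict) -> Optional[Path]:
    """Return the first template that resolves, or None to signal 'skip this file'."""
    for candidate in (template, fallback):
        if candidate is None:
            continue
        try:
            return Path(candidate.format(**context))
        except (KeyError, IndexError, AttributeError):
            # A variable was missing or empty; try the next candidate.
            continue
    return None


# A document for which no theme was found: {theme[0]} cannot be resolved.
context = {"language": "english", "theme": [], "filename": "a.txt", "counter": 1, "suffix": ".txt"}
target = resolve_with_fallback(
    "/tmp/{language}/{theme[0]}/{filename}",
    "/tmp/not_found/notfound_{counter}{suffix}",
    context,
)
print(target)
```

Returning `None` rather than raising keeps the "skip the file" decision with the caller, which can then log or count skipped documents.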
usage: wow.cp [-h] -f FILE [-o OUTPUT] [-x OUTPUT_FALLBACK] [-l LANGUAGE] [-p PIPELINE] [-e EXPRESSION] [--to_text] [-n] [--action ACTION] [--overwrite]
Named Arguments
- -f, --file
input folder or file
- -o, --output
output folder or file; may be a format string such as ~/tmp/{language}/{filename}
- -x, --output_fallback
fallback output format, used if an exception occurs while generating the output name.
- -l, --language
language
Default: “auto”
- -p, --pipeline
pipeline
- -e, --expression
expression to search
- --to_text
clean up the documents
Default: False
- -n, --dry_run
dry run
Default: False
- --action
what to do with the input file: copy, link, or text
Default: “copy”
- --overwrite
overwrite
Default: False
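When wow.cp is driven from a script, it can help to assemble the argument list in one place and default to a dry run. The sketch below uses only the flags documented above; `build_wow_cp_command` is a hypothetical helper, and it only builds the argument list without running anything.

```python
from typing import List, Optional


def build_wow_cp_command(
    source: str,
    output: str,
    *,
    pipeline: Optional[str] = None,
    expression: Optional[str] = None,
    fallback: Optional[str] = None,
    action: str = "copy",
    dry_run: bool = True,
) -> List[str]:
    """Assemble a wow.cp argument list from the documented options."""
    cmd = ["wow.cp", "-f", source, "--output", output]
    if pipeline:
        cmd += ["-p", pipeline]
    if expression:
        cmd += ["-e", expression]
    if fallback:
        cmd += ["-x", fallback]
    if action != "copy":  # "copy" is the documented default
        cmd += ["--action", action]
    if dry_run:
        cmd.append("-n")
    return cmd


cmd = build_wow_cp_command(
    "/data/corpus/multi_lingual",
    "/data/sorted/{language}/{filename}",
    pipeline="english,entity",
    expression="Person",
)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided wow.cp is installed and on the PATH. No shell is involved in that case, so `~` is not expanded by a shell; the paths in this example are absolute for that reason.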