Document API for Python

The Document class is the core data structure that holds all NLP results for a processed document. Each processed input document (text, HTML, PDF, etc.) is transformed into an instance of Document, which holds the following information:

  • A unique document identifier, either given explicitly or assigned automatically through an input provider

  • The textual data of the document. For .txt files this is identical to the provided data; for most other types, such as .html, .pdf, …, it differs from the raw input

  • Any data such as results or diagnostics added by applications and stored under a unique application identifier
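The three pieces of information listed above can be pictured as a small data model. The sketch below is not the eot.wowool implementation: `DocumentSketch`, its field names `results` and `diagnostics`, and the application identifier `"my_app"` are hypothetical, chosen only to illustrate what a document holds.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class DocumentSketch:
    """Hypothetical stand-in mirroring the information a Document holds."""

    id: str                     # unique document identifier
    text: Optional[str] = None  # textual data extracted from the input
    # Per-application data, keyed by a unique application identifier.
    results: Dict[str, Any] = field(default_factory=dict)
    diagnostics: Dict[str, Any] = field(default_factory=dict)


sketch = DocumentSketch(id="john_smith", text="John Smith ...")
sketch.results["my_app"] = {"people": ["John Smith"]}
print(sketch.id, sorted(sketch.results))
```

The real Document exposes this information through the properties and methods documented below rather than through public dictionaries.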

class eot.wowool.document.Document

Document is a class that stores the data related to a document. Instances of this class are returned from a Pipeline, a Language or a Domain object.

__init__(data: Optional[Union[str, bytes, eot.wowool.io.provider.input_provider.InputProvider]] = None, id: Optional[str] = None, provider_type: Optional[str] = None, encoding='utf8', meta_data=None, **kwargs)

Initialize a Document instance

from eot.wowool.document import Document

document = Document("John Smith ...", id="john_smith")
print(document.id, document.text)

Parameters
  • data (A str, bytes, or an InputProvider) – Data to be processed

  • id (str) – Unique identifier for the document. If not provided, and data is an InputProvider, then the id will be set automatically

  • provider_type (str) – Document type. Defaults to 'txt'

  • encoding (str) – Encoding of the data. Defaults to 'utf8'

Returns

An initialized document

Return type

Document

property id: str
Returns

The unique identifier of the document

Return type

str

property text: Optional[str]
Returns

The text data of the document

Return type

str | None

property analysis
Returns

The Analysis of the document, containing the Sentences, Tokens and Concepts, or None if the document has not been processed by a Language

from eot.wowool.document import Document
from eot.wowool.native.core import PipeLine

document = Document("John Smith works at EyeOnText.", id="john_smith")
english_entities = PipeLine(name="english,entity")
document = english_entities(document)
for concept in document.analysis.concepts():
    print(concept)

Return type

Analysis

app_ids()

Iterate over the application identifiers

Returns

A generator yielding the application identifiers

Return type

Generator of str

has_results(app_id: str) -> bool
Returns

Whether the document contains results for the application identified by the given application identifier

Return type

bool

add_results(app_id: str, results)

Add the given application results to the document

Parameters
  • app_id (str) – Application identifier

  • results (A JSON serializable object type) – Application results
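Because results must be JSON serializable, it can help to check that before storing them. The helper below is a minimal sketch that uses only the standard library; `is_json_serializable` is a hypothetical name and not part of eot.wowool.

```python
import json
from typing import Any


def is_json_serializable(results: Any) -> bool:
    """Return True if json.dumps accepts the object, False otherwise."""
    try:
        json.dumps(results)
    except (TypeError, ValueError):
        # TypeError: unsupported type; ValueError: e.g. a circular reference.
        return False
    return True


print(is_json_serializable({"topics": ["nlp", "python"]}))  # dict of lists
print(is_json_serializable({"topics": {"nlp", "python"}}))  # contains a set
```

A caller could run such a check on its results before passing them to add_results.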

results(app_id: str) -> Optional[Any]
Returns

The results of the given application. See the different types of application results

Parameters
  • app_id (str) – Application identifier

  • defaults – The default result to create when the application identifier is not present
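The methods has_results and results can be combined to read results with a fallback value. The sketch below relies only on those two documented methods; `results_or_default`, the `_FakeDocument` stand-in, and the application identifiers `"topics"` and `"sentiment"` are hypothetical and exist only so the example runs without the SDK.

```python
from typing import Any


def results_or_default(document: Any, app_id: str, default: Any = None) -> Any:
    """Return the stored results for app_id, or default when none are present."""
    if document.has_results(app_id):
        return document.results(app_id)
    return default


class _FakeDocument:
    """Stand-in for a Document, used only to make this sketch runnable."""

    def __init__(self):
        self._results = {}

    def add_results(self, app_id, results):
        self._results[app_id] = results

    def has_results(self, app_id):
        return app_id in self._results

    def results(self, app_id):
        return self._results.get(app_id)


doc = _FakeDocument()
doc.add_results("topics", ["nlp"])
print(results_or_default(doc, "topics"))
print(results_or_default(doc, "sentiment", default={}))
```

With a real Document the helper would be called the same way, since it only assumes the interface described in this section.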

add_diagnostics(app_id: str, diagnostics: eot.wowool.diagnostic.Diagnostics)

Add the given application diagnostics to the document

Parameters
  • app_id (str) – Application identifier

  • diagnostics (Diagnostics) – Application diagnostics

has_diagnostics(app_id: str = None) -> bool
Parameters

app_id (str or None) – Application identifier

Returns

Whether the document contains diagnostics for the given application, or any diagnostics at all if no application identifier is provided

Return type

bool

diagnostics(app_id: str = None) -> eot.wowool.diagnostic.Diagnostics
Parameters

app_id (str or None) – Application identifier

Returns

The diagnostics of the given application. See the different types of application results

Return type

Diagnostics

to_json() -> dict
Returns

A dictionary representing a JSON object of the document

Return type

dict

concepts(filter=<function _filter_pass_thru_concept>)

Access the concepts in the analysis of the document

Parameters

filter (Functor accepting a Concept and returning a bool) – Optional filter to select or discard concepts

Returns

A generator yielding the concepts in the processed document

Return type

Concepts
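A filter is any callable that takes a Concept and returns a bool. The sketch below builds such a callable; it assumes that a Concept exposes a `uri` attribute, which is not stated in this section, and it uses stand-in objects with hypothetical values (`"Person"`, `"Company"`) so it runs without the SDK. `uri_filter` is a hypothetical helper name.

```python
from types import SimpleNamespace
from typing import Any, Callable


def uri_filter(*uris: str) -> Callable[[Any], bool]:
    """Build a filter that keeps only concepts whose uri is in uris.

    Assumption: a concept has a `uri` attribute.
    """
    wanted = set(uris)
    return lambda concept: concept.uri in wanted


# Stand-in concepts, used only to exercise the filter.
concepts = [
    SimpleNamespace(uri="Person", text="John Smith"),
    SimpleNamespace(uri="Company", text="EyeOnText"),
]
keep_persons = uri_filter("Person")
print([c.text for c in concepts if keep_persons(c)])
```

Given the signature above, the same callable could presumably be passed to a processed document as `document.concepts(filter=keep_persons)`.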

See also the corresponding JSON schema.