Document API for Python

The Document class is the core data structure that holds all NLP results for a processed document. Each processed input document (text, HTML, PDF, etc.) is transformed into an instance of Document, which holds the following information:

A unique document identifier, either given explicitly or assigned through the use of input providers
The textual data of the document, which is identical to the provided data for .txt files but differs for most other types such as .html, .pdf, …
Any data such as results or diagnostics added by applications and stored under a unique application identifier
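
The three pieces of information above can be pictured as a small record. The sketch below is a hypothetical stand-in built from a plain dataclass, not the actual eot.wowool implementation; the names DocumentSketch, results, diagnostics and "my_app" are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class DocumentSketch:
    """Hypothetical stand-in mirroring what a Document holds (not the real class)."""

    id: str                      # unique document identifier
    text: Optional[str] = None   # textual data extracted from the input
    # Per-application data, keyed by a unique application identifier.
    results: Dict[str, Any] = field(default_factory=dict)
    diagnostics: Dict[str, Any] = field(default_factory=dict)


doc = DocumentSketch(id="john_smith", text="John Smith ...")
doc.results["my_app"] = {"score": 1}  # "my_app" is a made-up application identifier
print(doc.id, sorted(doc.results))
```

Keying application data by identifier keeps the output of different applications separate within one document.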

class eot.wowool.document.Document

Document is a class that stores the data related to a document. Instances of this class are returned from a Pipeline, a Language or a Domain object.

__init__(data: Optional[Union[str, bytes, eot.wowool.io.provider.input_provider.InputProvider]] = None, id: Optional[str] = None, provider_type: Optional[str] = None, encoding='utf8', meta_data=None, **kwargs)

Initialize a Document instance

from eot.wowool.document import Document

document = Document("John Smith ...", id="john_smith")
print(document.id, document.text)

Parameters

data (A str, bytes, or an InputProvider) – Data to be processed
id (str) – Unique identifier for the document. If it is not provided and data is an InputProvider, the id is set automatically
provider_type (str) – Document type. Defaults to 'txt'
encoding (str) – Encoding of the data. Defaults to 'utf8'

Returns

An initialized document

Return type

Document
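
Since data may be a str or bytes and an encoding is accepted, one plausible reading is that bytes input is decoded with that encoding. This reference does not state how Document handles this internally, so the helper below (as_text, a hypothetical name) only illustrates the idea with the standard library.

```python
from typing import Union


def as_text(data: Union[str, bytes], encoding: str = "utf8") -> str:
    """Return data as str, decoding bytes with the given encoding (assumed behaviour)."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


print(as_text("John Smith"))
print(as_text("Zoë".encode("utf8")))
print(as_text("Zoë".encode("latin-1"), encoding="latin-1"))
```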

property id: str

Returns: The unique identifier of the document
Return type: str

property text: Optional[str]

Returns: The text data of the document
Return type: str | None

property analysis

Returns: The Analysis of the document, containing the Sentences, Tokens and Concepts, or None if the document has not been processed by a Language

from eot.wowool.document import Document
from eot.wowool.native.core import PipeLine

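# Create a document from raw text with an explicit identifier
document = Document("John Smith works at EyeOnText.", id="john_smith")
# Build a pipeline that runs the English language and the entity domain
english_entities = PipeLine(name="english,entity")
# Processing returns the document with its analysis filled in
document = english_entities(document)
for concept in document.analysis.concepts():
    print(concept)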

Return type: Analysis
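
Because analysis is None for a document that has not been processed by a Language, callers may want to guard the access. A minimal sketch: iter_concepts is a hypothetical helper, and the stand-in objects only imitate the analysis property and the concepts() call described above.

```python
from types import SimpleNamespace
from typing import Any, Iterator


def iter_concepts(document: Any) -> Iterator[Any]:
    """Yield the concepts of a document, or nothing if it has no analysis yet."""
    analysis = document.analysis
    if analysis is None:
        return  # not processed by a Language: nothing to yield
    yield from analysis.concepts()


# Stand-ins for illustration only; real objects come from eot.wowool.
unprocessed = SimpleNamespace(analysis=None)
processed = SimpleNamespace(
    analysis=SimpleNamespace(concepts=lambda: iter(["Person", "Company"]))
)

print(list(iter_concepts(unprocessed)))
print(list(iter_concepts(processed)))
```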

app_ids()

Iterate over the application identifiers

Returns: A generator yielding application identifiers
Return type: generator of str

has_results(app_id: str) → bool

Returns: Whether the document contains results for the application identified by the given application identifier
Return type: bool

add_results(app_id: str, results)

Add the given application results to the document

Parameters

app_id (str) – Application identifier
results (A JSON serializable object type) – Application results
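
Since results must be JSON serializable, it can help to check an object before adding it. The helper below (is_json_serializable, a hypothetical name) is a minimal sketch that uses only the standard json module; whether add_results performs such a check itself is not stated here.

```python
import json
from typing import Any


def is_json_serializable(obj: Any) -> bool:
    """Return True if obj can be encoded by the standard json module."""
    try:
        json.dumps(obj)
    except (TypeError, ValueError):
        return False
    return True


print(is_json_serializable({"labels": ["Person"], "count": 2}))
print(is_json_serializable({"labels": {"Person"}}))  # a set cannot be encoded as JSON
```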

results(app_id: str) → Optional[Any]

Returns

The results of the given application. See the different types of application results

Parameters

app_id (str) – Application identifier
defaults – The default result to create when the application identifier is not present
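
How add_results, has_results, results and app_ids fit together can be sketched with a dictionary-backed stand-in. ResultStoreSketch is hypothetical and is not the Document class; in particular, returning None for an unknown identifier is an assumption based on the Optional[Any] return annotation.

```python
from typing import Any, Dict, Iterator, Optional


class ResultStoreSketch:
    """Hypothetical stand-in imitating the result-related methods of Document."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def add_results(self, app_id: str, results: Any) -> None:
        # Store the results under the application identifier.
        self._results[app_id] = results

    def has_results(self, app_id: str) -> bool:
        return app_id in self._results

    def results(self, app_id: str) -> Optional[Any]:
        # Assumption: None is returned when the identifier is absent.
        return self._results.get(app_id)

    def app_ids(self) -> Iterator[str]:
        # Yield identifiers in insertion order.
        yield from self._results


store = ResultStoreSketch()
store.add_results("my_app", {"entities": ["John Smith"]})  # "my_app" is made up
print(store.has_results("my_app"), store.results("my_app"))
print(store.results("unknown_app"))
print(list(store.app_ids()))
```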

add_diagnostics(app_id: str, diagnostics: eot.wowool.diagnostic.Diagnostics)

Add the given application diagnostics to the document

Parameters

app_id (str) – Application identifier
diagnostics (Diagnostics) – Application diagnostics

has_diagnostics(app_id: str = None) → bool

Parameters: app_id (str or None) – Application identifier
Returns: Whether the document contains diagnostics for the given application, or any diagnostics at all if no application identifier is provided
Return type: bool

diagnostics(app_id: str = None) → eot.wowool.diagnostic.Diagnostics

Parameters: app_id (str or None) – Application identifier
Returns: The diagnostics of the given application. See the different types of application results
Return type: Diagnostics

to_json() → dict

Returns: A dictionary representing a JSON object of the document
Return type: dict
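
Because to_json() returns a plain dictionary, it can be passed to the standard json module to obtain a string. In this minimal sketch, to_json_string is a hypothetical helper, and the stand-in object with its "id" and "text" keys is invented for illustration; the actual layout of the dictionary is not described in this reference.

```python
import json
from types import SimpleNamespace
from typing import Any


def to_json_string(document: Any, indent: int = 2) -> str:
    """Serialize the dictionary returned by document.to_json() to a JSON string."""
    return json.dumps(document.to_json(), indent=indent, ensure_ascii=False)


# Stand-in with made-up keys; a real Document would supply its own structure.
fake_document = SimpleNamespace(
    to_json=lambda: {"id": "john_smith", "text": "John Smith ..."}
)

print(to_json_string(fake_document))
```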

concepts(filter=<function _filter_pass_thru_concept>)

Access the concepts in the analysis of the document

Parameters: filter (Callable accepting a Concept and returning a bool) – Optional filter to select or discard concepts
Returns: A generator yielding the concepts in the processed document
Return type: Concepts
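
A filter is simply a callable that takes a concept and returns a bool. The sketch below builds such a predicate; uri_in is a hypothetical helper, and the uri and text attributes on the stand-in concepts are assumptions, since the Concept class is not described in this section.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable


def uri_in(uris: Iterable[str]) -> Callable[[Any], bool]:
    """Build a filter that keeps concepts whose uri is one of the given values."""
    wanted = set(uris)

    def _filter(concept: Any) -> bool:
        return concept.uri in wanted

    return _filter


# Stand-in concepts for illustration only (attribute names are assumed).
concepts = [
    SimpleNamespace(uri="Person", text="John Smith"),
    SimpleNamespace(uri="Company", text="EyeOnText"),
]

only_people = uri_in({"Person"})
selected = [c.text for c in concepts if only_people(c)]
print(selected)
```

With a processed document, such a predicate would presumably be passed as document.concepts(filter=only_people).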