Document API for Python
The Document class is the core data structure that holds all NLP results for a processed document. In short, each processed input document (text, HTML, PDF, etc.) is transformed into an instance of Document, which holds the following information:
- A unique document identifier, either given or assigned through the use of input providers
- The textual data of the document, which is the same as the provided data for .txt files, but is different for most other types such as .html, .pdf, …
- Any data, such as results or diagnostics, added by applications and stored under a unique application identifier
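The three pieces of information listed above can be pictured with a minimal stand-in. This is only a sketch and not the actual eot.wowool implementation: the class name DocumentSketch, the field name apps, and the application identifier "my_app" are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class DocumentSketch:
    """Hypothetical stand-in mirroring what a Document holds (not the real class)."""

    id: str                      # unique document identifier
    text: Optional[str] = None   # textual data extracted from the input
    # Application data (results, diagnostics) keyed by application identifier
    apps: Dict[str, Any] = field(default_factory=dict)


doc = DocumentSketch(id="john_smith", text="John Smith ...")
doc.apps["my_app"] = {"results": ["John Smith"]}  # "my_app" is a made-up identifier
print(doc.id, doc.text, sorted(doc.apps))
```

The real class exposes this information through the properties and methods documented below rather than through a public dictionary.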
- class eot.wowool.document.Document
Document is a class that stores the data related to a document. Instances of this class are returned from a Pipeline, a Language or a Domain object.
- __init__(data: Optional[Union[str, bytes, eot.wowool.io.provider.input_provider.InputProvider]] = None, id: Optional[str] = None, provider_type: Optional[str] = None, encoding='utf8', meta_data=None, **kwargs)
Initialize a Document instance

    from eot.wowool.document import Document

    document = Document("John Smith ...", id="john_smith")
    print(document.id, document.text)
- Parameters
data (str, bytes, or an InputProvider) – Data to be processed
id (str) – Unique identifier for the document. If not provided and data is an InputProvider, the id is set automatically
type (str) – Document type. Defaults to 'txt'
encoding (str) – Encoding of the data. Defaults to 'utf8'
- Returns
An initialized document
- Return type
Document
- property id: str
- Returns
The unique identifier of the document
- Return type
str
- property text: Optional[str]
- Returns
The text data of the document
- Return type
str | None
- property analysis
- Returns
The Analysis of the document, containing the Sentences, Tokens and Concepts, or None if the document has not been processed by a Language

    from eot.wowool.document import Document
    from eot.wowool.native.core import PipeLine

    document = Document("John Smith works at EyeOnText.", id="john_smith")
    english_entities = PipeLine(name="english,entity")
    document = english_entities(document)
    for concept in document.analysis.concepts():
        print(concept)

- Return type
Analysis or None
- app_ids()
Iterate over the application identifiers
- Returns
A generator expression yielding application identifiers
- Return type
str
- has_results(app_id: str) -> bool
- Returns
Whether the application, as identified by the given application identifier, is in the document
- Return type
bool
- add_results(app_id: str, results)
Add the given application results to the document
- Parameters
app_id (str) – Application identifier
results (a JSON-serializable object) – Application results
- results(app_id: str) -> Optional[Any]
- Returns
The results of the given application. See the different types of application results
- Parameters
app_id (str) – Application identifier
defaults – The default result to create when the application identifier is not present
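The way add_results, has_results, results and app_ids fit together can be sketched with a small stand-in. This is not the real Document class: ResultsSketch, the identifier "my_topics" and the stored payload are hypothetical, and returning None for an unknown identifier is an assumption based on the Optional[Any] return annotation.

```python
from typing import Any, Dict, Iterator, Optional


class ResultsSketch:
    """Hypothetical stand-in for the results-related methods (not the real Document)."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def add_results(self, app_id: str, results: Any) -> None:
        # Store JSON-serializable results under the application identifier
        self._results[app_id] = results

    def has_results(self, app_id: str) -> bool:
        return app_id in self._results

    def results(self, app_id: str) -> Optional[Any]:
        # Assumption: an unknown identifier yields None
        return self._results.get(app_id)

    def app_ids(self) -> Iterator[str]:
        # Generator expression over the stored application identifiers
        return (app_id for app_id in self._results)


doc = ResultsSketch()
doc.add_results("my_topics", [{"name": "hiring"}])  # made-up identifier and payload

if doc.has_results("my_topics"):
    print(doc.results("my_topics"))
print(doc.results("unknown_app"))
print(list(doc.app_ids()))
```

Checking has_results before reading keeps calling code independent of what the library returns for a missing identifier.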
- add_diagnostics(app_id: str, diagnostics: eot.wowool.diagnostic.Diagnostics)
Add the given application diagnostics to the document
- Parameters
app_id (str) – Application identifier
diagnostics (Diagnostics) – Application diagnostics
- has_diagnostics(app_id: str = None) -> bool
- Parameters
app_id (str or None) – Application identifier
- Returns
Whether the document contains diagnostics for the given application or any diagnostics if no application identifier is provided
- Return type
bool
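The two modes of has_diagnostics described above (a specific application versus any application) can be illustrated with a stand-in. DiagnosticsSketch is hypothetical, a plain list of strings stands in for the real Diagnostics type, and "my_app" is a made-up identifier.

```python
from typing import Dict, List, Optional


class DiagnosticsSketch:
    """Hypothetical stand-in for the diagnostics methods (not the real Document)."""

    def __init__(self) -> None:
        # A plain list of messages stands in for the real Diagnostics type
        self._diagnostics: Dict[str, List[str]] = {}

    def add_diagnostics(self, app_id: str, diagnostics: List[str]) -> None:
        self._diagnostics[app_id] = diagnostics

    def has_diagnostics(self, app_id: Optional[str] = None) -> bool:
        if app_id is None:
            # No identifier given: report whether any application stored diagnostics
            return bool(self._diagnostics)
        return app_id in self._diagnostics


doc = DiagnosticsSketch()
print(doc.has_diagnostics())             # nothing stored yet
doc.add_diagnostics("my_app", ["input was truncated"])
print(doc.has_diagnostics("my_app"))     # diagnostics for this application
print(doc.has_diagnostics("other_app"))  # no diagnostics for this one
print(doc.has_diagnostics())             # any diagnostics at all
```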
- diagnostics(app_id: str = None) -> eot.wowool.diagnostic.Diagnostics
- Parameters
app_id (str or None) – Application identifier
- Returns
The diagnostics of the given application. See the different types of application results
- Return type
Diagnostics
- to_json() -> dict
- Returns
A dictionary representing a JSON object of the document
- Return type
dict
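Because to_json returns a plain dictionary, it can be passed to the standard json module. The sketch below uses a hypothetical helper, to_json_sketch, in place of the real method. The keys "id", "text" and "apps" are assumptions chosen for illustration, and the actual layout may differ.

```python
import json
from typing import Any, Dict


def to_json_sketch(doc_id: str, text: str, results: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for Document.to_json(); the key names are assumptions."""
    return {
        "id": doc_id,
        "text": text,
        "apps": {app_id: {"results": value} for app_id, value in results.items()},
    }


payload = to_json_sketch("john_smith", "John Smith ...", {"my_app": ["John Smith"]})

# Serialize the dictionary to a JSON string and read it back
serialized = json.dumps(payload, indent=2)
restored = json.loads(serialized)
print(serialized)
```

The round trip only works when every stored result is JSON serializable, which matches the requirement stated for add_results.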
- concepts(filter=<function _filter_pass_thru_concept>)
Access the concepts in the analysis of the document
See also the corresponding JSON schema.
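The filter argument can be read as a predicate applied to each concept. The sketch below assumes that reading, inferred only from the name of the default, _filter_pass_thru_concept. ConceptSketch, its uri and text attributes, and the module-level concepts function are hypothetical stand-ins, not the library's classes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class ConceptSketch:
    """Hypothetical stand-in for a concept; the real class has its own attributes."""

    uri: str
    text: str


def pass_thru(concept: ConceptSketch) -> bool:
    """Accept every concept, as a pass-through default presumably would."""
    return True


def concepts(found: Iterable[ConceptSketch],
             filter: Callable[[ConceptSketch], bool] = pass_thru) -> Iterator[ConceptSketch]:
    """Return a generator over the concepts for which the filter returns True."""
    return (concept for concept in found if filter(concept))


found = [ConceptSketch("Person", "John Smith"), ConceptSketch("Company", "EyeOnText")]
everything = list(concepts(found))
persons = list(concepts(found, filter=lambda concept: concept.uri == "Person"))
print([concept.text for concept in persons])
```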