LanguageIdentifier API for Python

class eot.wowool.native.core.language_identifier.LanguageIdentifier

__init__(default_language: str = '', language_candidates: Optional[list[str]] = None, sections: bool = False, section_data: bool = False, engine: Optional[eot.wowool.native.core.engine.Engine] = None)

LanguageIdentifier loads and shares the data used by the language identifier. The available options are the keyword arguments listed below.

Parameters

default_language (str) – The default language code to use when the language of a section cannot be detected. Default: english
language_candidates (list[str]) – List of the languages that will be considered.
sections (bool) – Analyze the full document and return the sections with their corresponding language. Default: False
section_data (bool) – Add the data of each section to the results. Default: False

from eot.wowool.native.core.language_identifier import LanguageIdentifier

document = """
La juventud no es más que un estado de ánimo.

Record de chaleur battu dans une cinquantaine de villes en France

"""
# Initialize a language identification engine
lid = LanguageIdentifier()
# Process the data
language = lid(document)
print(language)

eot_language_identifier, {'language': 'french'}

__call__(document: Union[str, eot.wowool.document.document.Document]) → eot.wowool.document.document.Document

Parameters: document (Document or str) – the input document.
Returns: a Document object that carries the language information.
Return type: Document

lid = LanguageIdentifier()
doc = lid(document)
sections = doc.results("eot_lid")
# Print the sections found in the document
for section in sections:
    print(section)

sections(document: eot.wowool.document.document.Document) → list[dict]

Return a list of the different sections of a given document, each with its language.

Parameters: document (Document) – document input data.
Return type: list[dict], a JSON-style list with one entry per section in the document.
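As an illustration of how the returned list[dict] might be consumed, the sketch below totals the span covered by each language and picks the largest one. It assumes that every entry carries begin_offset, end_offset and language keys, as in the sample output shown in this section. The helper name language_coverage is hypothetical and not part of the LanguageIdentifier API; the sample list is hard-coded so the snippet runs without the library.

```python
from collections import defaultdict


def language_coverage(sections):
    """Hypothetical helper: sum (end_offset - begin_offset) per language."""
    totals = defaultdict(int)
    for section in sections:
        totals[section["language"]] += section["end_offset"] - section["begin_offset"]
    return dict(totals)


# Hard-coded list in the shape documented for sections() (an assumption).
sample_sections = [
    {"begin_offset": 0, "end_offset": 50, "language": "spanish"},
    {"begin_offset": 50, "end_offset": 117, "language": "french"},
]

coverage = language_coverage(sample_sections)
dominant = max(coverage, key=coverage.get)
print(coverage)   # {'spanish': 50, 'french': 67}
print(dominant)   # french
```

In real use, the hard-coded list would be replaced by the value returned from lid.sections(...).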

english_lid_section.py

from eot.wowool.native.core.language_identifier import LanguageIdentifier

# Initialize a language identification engine
lid = LanguageIdentifier()
document = """
La juventud no es más que un estado de ánimo.

Record de chaleur battu dans une cinquantaine de villes en France

"""
sections = lid.sections(document)
print(sections)
for section in sections:
    print(f"{section['language']}, ({section['begin_offset']}, {section['end_offset']})")

[
    {"begin_offset": 0, "end_offset": 50, "language": "spanish"},
    {"begin_offset": 50, "end_offset": 117, "language": "french"}
]
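The offsets in this sample output add up to 117, which is the length of the example text when encoded as UTF-8 (the accented characters count as two bytes each), not its length in characters. Assuming, from this sample only, that begin_offset and end_offset are UTF-8 byte offsets, the text of each section could be recovered roughly as follows. The helper name section_texts is hypothetical and not part of the API; the section list is copied from the output above so the snippet runs without the library.

```python
# Example text and section list copied from the sample in this section.
text = """
La juventud no es más que un estado de ánimo.

Record de chaleur battu dans une cinquantaine de villes en France

"""

sections = [
    {"begin_offset": 0, "end_offset": 50, "language": "spanish"},
    {"begin_offset": 50, "end_offset": 117, "language": "french"},
]


def section_texts(text, sections):
    """Hypothetical helper: slice each section out of the text.

    Assumption: offsets index into the UTF-8 encoding of the text,
    which is what the sample output suggests.
    """
    data = text.encode("utf-8")
    return [
        (s["language"], data[s["begin_offset"]:s["end_offset"]].decode("utf-8"))
        for s in sections
    ]


for language, chunk in section_texts(text, sections):
    print(f"{language}: {chunk.strip()}")
```

If the offsets turn out to be character offsets in another version of the library, slicing the str directly (text[begin:end]) would be the equivalent operation.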