LanguageIdentifier API for Python

class eot.wowool.native.core.language_identifier.LanguageIdentifier
__init__(default_language: str = '', language_candidates: Optional[list[str]] = None, sections: bool = False, section_data: bool = False, engine: Optional[eot.wowool.native.core.engine.Engine] = None)

LanguageIdentifier loads the data used for language identification and shares it across calls. Its options correspond to the keyword arguments of the constructor.

Parameters
  • default_language (str) – The default language code to use when the language of a section cannot be detected. Default: english

  • language_candidates (list[str]) – List of candidate languages to consider during identification.

  • sections (bool) – Analyze the full document and return its sections, each with the corresponding language. Default: False

  • section_data (bool) – Add the data of the section in the results. Default: False

from eot.wowool.native.core.language_identifier import LanguageIdentifier
from eot.wowool.document import Document

document = """
La juventud no es más que un estado de ánimo.

Record de chaleur battu dans une cinquantaine de villes en France

"""
# Initialize a language identification engine
lid = LanguageIdentifier()
# Process the data
language = lid(document)
print(language)

Output:

eot_language_identifier, {'language': 'french'}

__call__(document: Union[str, eot.wowool.document.document.Document]) -> eot.wowool.document.document.Document
Parameters

document (Document or str) – The input document data.

Returns

A Document object annotated with the language information.

Return type

Document

lid = LanguageIdentifier()
doc = lid(document)
sections = doc.results('eot_lid')
# Print each section found in the document
for section in sections:
    print(section)
sections(document: eot.wowool.document.document.Document) -> list[dict]

Return a list of the sections of a given document, each with its language.

Parameters

document (Document) – document input data.

Return type

A JSON-compatible list of dicts, one per section of the document.
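
Because sections returns a plain list of dicts, the result can be post-processed with ordinary Python. The sketch below assumes each entry carries the language, begin_offset and end_offset keys that appear in the sample output on this page. span_per_language is a hypothetical helper, not part of the library, and the section list is hard-coded so the snippet runs without the engine.

```python
from collections import defaultdict


def span_per_language(sections):
    """Sum the offset span covered by each language (hypothetical helper)."""
    totals = defaultdict(int)
    for section in sections:
        totals[section["language"]] += section["end_offset"] - section["begin_offset"]
    return dict(totals)


# Section list copied from the sample output; normally it would come from lid.sections(...)
sample = [
    {"begin_offset": 0, "end_offset": 50, "language": "spanish"},
    {"begin_offset": 50, "end_offset": 117, "language": "french"},
]
totals = span_per_language(sample)
# Language covering the largest span in this sample
dominant = max(totals, key=totals.get)
print(totals, dominant)
```

This only aggregates the spans that are reported; it makes no claim about how the library itself chooses a document-level language.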

english_lid_section.py
from eot.wowool.native.core.language_identifier import LanguageIdentifier

# Initialize a language identification engine
lid = LanguageIdentifier()
document = """
La juventud no es más que un estado de ánimo.

Record de chaleur battu dans une cinquantaine de villes en France

"""
# Split the document into sections, one per detected language
sections = lid.sections(document)
print(sections)
# Print the language and offset range of each section
for section in sections:
    print(f"{section['language']}, ({section['begin_offset']},{section['end_offset']})")

Output:

[
    {"begin_offset": 0, "end_offset": 50, "language": "spanish"},
    {"begin_offset": 50, "end_offset": 117, "language": "french"}
]
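
The offsets in the sample output look like UTF-8 byte offsets rather than character offsets: the Spanish section ends at 50, which matches the byte length of its text only when each of the two accented characters counts as two bytes. This is an inference from the sample, not something the API description states. Under that assumption, a minimal sketch for recovering the text of each section could look like this (section_texts is a hypothetical helper, not part of the library):

```python
def section_texts(text, sections):
    """Return (language, text) pairs, slicing by assumed UTF-8 byte offsets."""
    raw = text.encode("utf-8")
    pairs = []
    for section in sections:
        chunk = raw[section["begin_offset"]:section["end_offset"]]
        pairs.append((section["language"], chunk.decode("utf-8").strip()))
    return pairs


document = """
La juventud no es más que un estado de ánimo.

Record de chaleur battu dans une cinquantaine de villes en France

"""
# Section list copied from the sample output; normally it would come from lid.sections(document)
sections = [
    {"begin_offset": 0, "end_offset": 50, "language": "spanish"},
    {"begin_offset": 50, "end_offset": 117, "language": "french"},
]
for language, text in section_texts(document, sections):
    print(f"{language}: {text}")
```

If the offsets turn out to be character offsets in a given version, slicing the str directly (text[begin:end]) would be the equivalent approach.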