LanguageIdentifier API for Python
- class eot.wowool.native.core.language_identifier.LanguageIdentifier
- __init__(default_language: str = '', language_candidates: Optional[list[str]] = None, sections: bool = False, section_data: bool = False, engine: Optional[eot.wowool.native.core.engine.Engine] = None)
LanguageIdentifier loads and shares the data used for language identification. Its options are the keyword arguments listed under Parameters.
- Parameters
default_language (str) – The default language code to use when the language of a section cannot be detected. Default: english
language_candidates (list[str]) – List of the languages that will be considered.
sections (bool) – Analyze the full document and return the sections with their corresponding language. Default: False
section_data (bool) – Add the data of each section to the results. Default: False
from eot.wowool.native.core.language_identifier import LanguageIdentifier
from eot.wowool.document import Document

document = """
    La juventud no es más que un estado de ánimo.
    Record de chaleur battu dans une cinquantaine de villes en France
"""

# Initialize a language identification engine
lid = LanguageIdentifier()
# Process the data
language = lid(document)
print(language)
eot_language_identifier, {'language': 'french'}
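The keyword arguments listed under Parameters can be combined when constructing the identifier. The following is a minimal sketch rather than an excerpt from the library's own documentation: the option values are illustrative, the language names are assumed to use the same lowercase form as the sample output, and the import is guarded so the snippet still runs when the package is not installed.

```python
# Keyword arguments taken from the documented __init__ signature.
# ASSUMPTION: language names use the lowercase English form seen in the
# sample output ('french', 'spanish'); check the values your install expects.
lid_options = {
    "default_language": "english",
    "language_candidates": ["spanish", "french"],
    "sections": True,
    "section_data": False,
}

try:
    from eot.wowool.native.core.language_identifier import LanguageIdentifier
except ImportError:
    # Package not available: skip the call instead of failing.
    LanguageIdentifier = None

if LanguageIdentifier is not None:
    lid = LanguageIdentifier(**lid_options)
    doc = lid("Record de chaleur battu dans une cinquantaine de villes en France")
    print(doc)
```

Keeping the options in a plain dict makes it easy to store them in a configuration file and pass them through unchanged.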
- __call__(document: Union[str, eot.wowool.document.document.Document]) → eot.wowool.document.document.Document
- Parameters
document (Document or str) – document input data.
- Returns
a Document object with the language information.
- Return type
Document

lid = LanguageIdentifier()
doc = lid(document)
sections = doc.results('eot_lid')
# Print the sections found in the document data
for section in sections:
    print(section)
- sections(document: eot.wowool.document.document.Document) → list[dict]
Return a list of the different sections in a given document, each with its language.
- Parameters
document (Document) – document input data.
- Return type
A JSON-like list of dicts, one per section in the document.
from eot.wowool.native.core.language_identifier import LanguageIdentifier

document = """
    La juventud no es más que un estado de ánimo.
    Record de chaleur battu dans une cinquantaine de villes en France
"""

# Initialize a language identification engine
lid = LanguageIdentifier()
sections = lid.sections(document)
print(sections)
# Print the language and offsets of each section
for section in sections:
    print(f"{section['language']} ,({section['begin_offset']},{section['end_offset']})")
[ {"begin_offset": 0, "end_offset": 50, "language": "spanish"}, {"begin_offset": 50, "end_offset": 117, "language": "french"} ]
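The section dicts shown above can be used to pull the text of each language out of the original input. The sketch below works on plain data shaped like that output and does not call the library. It assumes that `begin_offset` and `end_offset` are character indices into the input string with an exclusive end, which the source does not state. The helper name `text_by_language` and the sample text are hypothetical.

```python
from collections import defaultdict


def text_by_language(text, sections):
    """Group the text of each section under its detected language.

    ASSUMPTION: offsets are character indices into `text`, end-exclusive.
    """
    grouped = defaultdict(list)
    for section in sections:
        snippet = text[section["begin_offset"]:section["end_offset"]].strip()
        grouped[section["language"]].append(snippet)
    return dict(grouped)


# Hand-built sample in the same shape as the sections() output.
text = "Hola mundo. Bonjour le monde."
sections = [
    {"begin_offset": 0, "end_offset": 12, "language": "spanish"},
    {"begin_offset": 12, "end_offset": 29, "language": "french"},
]

for language, snippets in text_by_language(text, sections).items():
    print(language, snippets)
```

If the offsets turn out to be byte offsets instead, the same slicing would have to be done on `text.encode("utf-8")` and decoded afterwards.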