The «OTELO» project
Language, whether written or oral, is intrinsically ambiguous and polysemic. Linguists aspire to account for this ambiguity in order to understand the complexity of language itself. Researchers in computer science are also concerned with formalizing linguistic variation for application purposes. Work that aims at an exhaustive description of language is rare because it requires approaches from several scientific communities.
Winner of the call for excellence projects launched by the DATAIA Institute and MSH Paris-Saclay in 2020,
OTELO proposes a multi-level analysis of spoken language from large oral corpora, segmented and annotated automatically.
Segmented into phones and words, these data will then be enriched with knowledge about the grammatical status of words and their syntactic and semantic relationships in context. The expected results concern:
- the role of phonetic information in the disambiguation of contextual homophonies involving entities;
- the impact of "high-level" linguistic knowledge (grammatical, syntactic, semantic) on the diffusion of patterns of phonetic variation within a language’s lexicon.
OTELO is led by Ioana Vasilescu, researcher in linguistics at LIMSI, and Fabian Suchanek, researcher in computer science at Télécom Paris.
F. Suchanek is internationally known for creating the YAGO knowledge base, which is used, among others, in the IBM Watson system. His research focuses on extracting entities and facts from natural language text and structuring these data in a knowledge base. Aspects addressed include the analysis of these knowledge bases, rule mining, and completeness determination. His work is supported by an AI Chair funded by ANR.
At LIMSI, the analysis of written and oral language is at the heart of the Language Science and Technology Department. Within this department, I. Vasilescu and her colleagues in the "Spoken Language Processing" group have initiated many humanities and social sciences (SHS) projects dealing with the analysis of sound variation in large multilingual corpora. The analyses are based on massive data explored with automatic tools. The work of I. Vasilescu, supported by the MSH Paris-Saclay, has highlighted the value of this methodology and of large corpora for the study of synchronic variation in relation to the history of languages. LIMSI researchers have also initiated a first joint approach involving a multi-level analysis of oral data related to errors in automatic systems, in the framework of the ANR VERA (adVanced ERror Analysis) project (Santiago et al., 2015; Goryainova et al., 2014).