The use of textual data is growing dramatically: individuals and companies communicate and express themselves through text, often over the Internet, both publicly and privately. Textual data may be fully unstructured (free text) or may be found inside predefined structures (such as web pages, standardized reports, semi-formal models). In the latter case, these predefined structures can be used to suggest, at least partially, how the data should be interpreted. We are interested in extracting information and knowledge from textual data, automatically and semi-automatically. The purpose of this extraction is to organize and define software components useful for designing and building computer-based systems that facilitate several activities performed by individuals (possibly when working for companies). For instance:
- making decisions based on textual data easier, quicker and more reliable,
- making explicit hidden information conveyed by textual data and, as a consequence,
- enabling understanding of other individuals’ behaviors, ideas and so on,
- making computer-based systems more efficient and effective on the basis of available textual data, and finally
- supporting individuals in following what textual data implicitly suggest.
We consider that knowledge extraction from textual data can be divided into three main topics, which are the focus of our work:
- textual data acquisition and filtering,
- text mining, and
- knowledge representation.
- Annotated corpora and reliability measures. Large annotated corpora are needed to develop machine learning techniques. Research is conducted to develop corpora and to investigate the reliability of the annotations [Muzerelle2014] and [Antoine2014].
- Semantic relations and information extraction. Currently, we are investigating multi-text summarization. As a preliminary task, we are using distributional analysis to estimate sentence similarity [Vu2014].
- NLP and data mining. Our contributions on this topic are, first, to develop new data mining algorithms (sequential pattern mining) specifically dedicated to textual data and, second, to use these methods to solve NLP tasks such as linguistic pattern extraction, POS tagging, event detection and ontology extraction from texts. A web site that uses our sequential pattern mining algorithms, developed in conjunction with the GREYC, MODYCO and LIPN institutes, is available online. More information can be found in [BechetCCC14] (in French) and [BechetSAC2015] (to appear; contact us to download it).
- Ontology learning (or ontology generation from texts). Starting from existing tools (specifically Text2Onto), we have proposed an approach based on linguistic patterns and deep parsing for extracting ontologically relevant relationships [Sami2014].
- Ontology validation and verification. Verification and validation are a critical aspect of ontologies, and they become even more critical when ontologies are built automatically or semi-automatically from texts. Indeed, current ontology learning tools produce very low-quality ontologies that require heavy validation and verification work. As a consequence, there is a strong need to couple ontology validation and verification much more closely with ontology learning. We have developed a framework that introduces a standard classification of the symptoms and defects of ontologies. We then ran experiments analysing the use of this classification on learned lightweight ontologies built with Text2Onto. Promising results suggest how validation and verification processes could be configured automatically. See [Harzallah2014a] and [Harzallah2014b].
- Discriminative sequential motif mining for language register characterization. This work studies how texts of different styles differ by extracting discriminative sequential motifs with data mining approaches. The application domain is language registers.
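
To give a rough idea of what discriminative sequential pattern mining on text can look like, here is a minimal Python sketch. It is not the algorithm described in the cited publications: it only considers contiguous token sequences (n-grams), counts in how many sentences each one occurs, and keeps those whose relative frequency in a target corpus is much higher than in a background corpus. The function names, the thresholds, the add-one smoothing and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple

Pattern = Tuple[str, ...]


def contiguous_patterns(tokens: Sequence[str], max_len: int) -> Set[Pattern]:
    """Return the distinct contiguous token sequences of length 1..max_len."""
    found: Set[Pattern] = set()
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            found.add(tuple(tokens[start:start + length]))
    return found


def support(corpus: List[Sequence[str]], max_len: int) -> Counter:
    """Count, for each pattern, the number of sentences that contain it."""
    counts: Counter = Counter()
    for sentence in corpus:
        counts.update(contiguous_patterns(sentence, max_len))
    return counts


def discriminative_patterns(
    target: List[Sequence[str]],
    background: List[Sequence[str]],
    max_len: int = 3,
    min_support: int = 2,
    min_growth: float = 2.0,
) -> List[Tuple[Pattern, int, float]]:
    """Keep patterns frequent in `target` and comparatively rare in `background`.

    Hypothetical scoring: growth = target rate / smoothed background rate.
    """
    target_counts = support(target, max_len)
    background_counts = support(background, max_len)
    results = []
    for pattern, count in target_counts.items():
        if count < min_support:
            continue
        target_rate = count / len(target)
        # Add-one smoothing (an assumption) keeps the ratio finite when the
        # pattern never occurs in the background corpus.
        background_rate = (background_counts[pattern] + 1) / (len(background) + 1)
        growth = target_rate / background_rate
        if growth >= min_growth:
            results.append((pattern, count, growth))
    # Highest growth first, then highest support, then alphabetical order.
    results.sort(key=lambda item: (-item[2], -item[1], item[0]))
    return results


if __name__ == "__main__":
    # Invented toy sentences standing in for two language registers.
    formal = [
        ["i", "would", "like", "to", "thank", "you"],
        ["we", "would", "like", "to", "inform", "you"],
        ["i", "would", "be", "grateful"],
    ]
    informal = [
        ["i", "wanna", "thank", "you"],
        ["we", "gotta", "go"],
        ["i", "like", "it"],
    ]
    for pattern, count, growth in discriminative_patterns(formal, informal):
        print(" ".join(pattern), count, round(growth, 2))
```

Real sequential pattern mining usually also allows gaps between items and works on richer item types (lemmas, POS tags), which this sketch leaves out to stay short.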
ANR TREMoLo (2017-2021)
The main objectives of the TREMoLo project are to study linguistic registers and to develop methods for automatically transforming the register of a text, i.e., translating a text from one register to another. This work will rely on the extraction of register-specific linguistic patterns and their integration into an automatic paraphrase generation process.