Feature extraction from text flows based on semantic similarity for classification tasks: an approach inspired by audio analysis.

http://lattes.cnpq.br/5089116729963334; VASCONCELOS, Larissa Lucena.

URI: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/25059

Date: 2022-03-18

Abstract:

Text classification is one of the most investigated challenges in Natural Language Processing research. The performance of a classification model depends on a representation that can extract valuable information from the texts. The problem discussed in this doctoral research is how to enhance text representations by incorporating semantics to improve the efficacy of text classification models. To avoid losing crucial local text information, texts can be represented as flows, that is, sequences of information collected from texts. This thesis proposes an approach that combines several techniques to represent texts: the representation by flows, the power of word-embedding text representations associated with lexicon information via semantic similarity distances, and the extraction of features inspired by well-established audio analysis features. The approach splits the text into sentences and calculates a semantic similarity metric between each sentence and a lexicon in an embedding vector space. The sequence of semantic similarity metrics composes the text flow. Then, the method extracts twenty-five features inspired by audio analysis (called Audio-Like Features). The adaptation of features from audio analysis is motivated by the similarity between a text flow and a digital signal, in addition to the existing relationship between text, speech, and audio. The experimental evaluation comprises five text classification tasks: Fake News Detection in English and Portuguese; Newspaper Columns versus News; Sentiment Polarity involving Movie Reviews in Portuguese. The experiments comprised six datasets and six lexicons involving the English and Portuguese languages. The efficacy of the approach is compared to baselines that embed semantics in the text representation: the strong Paragraph Vector baseline and BERT.
The objective of the experiments was to investigate whether the proposed approach could match the efficacy of the baseline methods or improve their effectiveness when combined with them. The experimental evaluation demonstrates that the method can enhance the classification efficacy of the baseline methods in four of the five scenarios. In the Fake News Detection in Portuguese task, the approach surpassed the baselines and obtained the best effectiveness (PR-AUC=0.98). The proposed features achieved better results with shallow learning models than with deep learning models in three tasks. No subset of features appeared among the most impactful ones in all classification tasks, highlighting the importance of all twenty-five features.
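
The pipeline described in the abstract (sentence splitting, similarity to a lexicon in an embedding space, a flow of similarity values, then signal-style features) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the toy three-dimensional vectors in `TOY_EMBEDDINGS`, the use of averaged word vectors and cosine similarity, and the function names. The abstract does not list the twenty-five Audio-Like Features, so the sketch computes only three common audio descriptors (mean, root-mean-square energy, and zero-crossing rate) as stand-ins.

```python
import math
import re

# Hypothetical toy word vectors standing in for a trained embedding model.
TOY_EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.9, 0.1, 0.0],
    "awful": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.9, 0.3],
    "plot": [0.1, 0.8, 0.4],
}


def mean_vector(words):
    """Average the vectors of the known words; None if no word is known."""
    vecs = [TOY_EMBEDDINGS[w] for w in words if w in TOY_EMBEDDINGS]
    if not vecs:
        return None
    return [sum(component) / len(vecs) for component in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_flow(text, lexicon):
    """One similarity value per sentence, in reading order.

    Assumption: sentence and lexicon are each summarized by a mean vector.
    """
    lexicon_vec = mean_vector(lexicon)
    flow = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[a-z']+", sentence.lower())
        sentence_vec = mean_vector(words)
        if sentence_vec is None or lexicon_vec is None:
            flow.append(0.0)  # no known words: treat as neutral
        else:
            flow.append(cosine(sentence_vec, lexicon_vec))
    return flow


def audio_like_features(flow):
    """Three illustrative signal descriptors computed over the flow."""
    if not flow:
        raise ValueError("flow must contain at least one value")
    n = len(flow)
    sign_changes = sum(1 for a, b in zip(flow, flow[1:]) if a * b < 0)
    return {
        "mean": sum(flow) / n,
        "rms_energy": math.sqrt(sum(x * x for x in flow) / n),
        "zero_crossing_rate": sign_changes / max(n - 1, 1),
    }


if __name__ == "__main__":
    flow = text_flow("Great movie. Awful plot. Good plot!", ["good", "great"])
    print(flow)
    print(audio_like_features(flow))
```

In a realistic setting, the toy dictionary would be replaced by pretrained word embeddings, and the resulting feature dictionary would be fed to a classifier, alone or concatenated with a baseline representation.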
