FIRMINO, A. A.; http://lattes.cnpq.br/6042902332948785; FIRMINO, Anderson Almeida.
Abstract:
The growth of social media around the world has brought both benefits and challenges
to society. Among the challenges, we highlight the proliferation of hate speech in social
networks. Detecting hate speech has become an arduous task in today’s world. About 22.5
million posts with hate speech were removed from social networks between April and June
2020. Thus, it is necessary to develop research that seeks automated solutions to identify
and remove hate speech in social networks. In this thesis, we propose a new methodology
for detecting hate speech in Portuguese texts. This methodology uses Cross-Lingual
Learning, which consists of applying transfer learning to Pre-Trained Language Models
(PTLMs), using a language with large corpora available (source language) to solve problems
in languages with less annotated data (target language). The proposed methodology comprises
four stages: corpora acquisition, definition of the PTLM, training strategies, and evaluation. We
carried out experiments using Pre-Trained Language Models (BERT and XLM-R) for different
languages, namely English, Italian and Portuguese, to verify which one best suited the proposed
method. Corpora in English (WH) and Italian (Evalita 2018) were used as source-language data,
and two corpora in Portuguese (the target language) were used: OffComBr-2 and Hate Speech
Dataset (HSD). The results of the experiments showed that the proposed methodology is
promising: on the OffComBr-2 corpus, it achieved the best result in the state of the art
(F1 score = 92%), and on the HSD corpus, the second-best result (F1 score = 90%).