A BERT model for extractive text summarization in Federal Police documents (original title: Um modelo BERT para sumarização extrativa de textos em documentos da Polícia Federal).

Collection: Mestrado em Ciência da Computação, Pós-Graduação em Ciência da Computação, Centro de Engenharia Elétrica e Informática (CEEI), Campus Campina Grande


BARROS, T. S.; http://lattes.cnpq.br/7401639950436351; BARROS, Thierry Silva.

URI: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/27174

Date: 2022-04-28

Abstract:

In the Federal Police, a document known as a notitia criminis is the starting point of a criminal investigation. It summarizes the investigative activities and contains all relevant information about the alleged crime. To manage an investigation and correlate it with similar ones, the Federal Police generally needs to extract the most important information from the notitia criminis document. Manual extraction, which requires reading and understanding the entire content, can be exhausting because of the size and complexity of the documents. It is therefore necessary to use Natural Language Processing (NLP) techniques to automatically extract the most important passages, such as the crime that occurred. In the last few years, deep neural networks have been successfully applied to many different NLP tasks. One neural network model that improved results across a wide range of NLP tasks is BERT, an acronym for Bidirectional Encoder Representations from Transformers. This is due to its ability to represent the meaning of textual data, capturing both short-range dependencies (correlations between pieces of text that are close together) and long-range dependencies (correlations between pieces of text that are far apart). This dissertation proposes different approaches based on the BERT model to extract the most important information from a notitia criminis document and build a summary of it. Two types of techniques can be applied to the automatic summarization of textual documents: abstractive and extractive. This dissertation uses the extractive technique. Thus, we aim to analyze the feasibility of using the BERT model to extract and synthesize the most important information from the notitia criminis document.
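To make the idea of extractive summarization concrete, here is a minimal Python sketch: each sentence receives a relevance score, and the top-scoring sentences are returned in their original order. This is not the dissertation's method. The abstract does not describe how BERT scores sentences, so a simple word-frequency scorer stands in for it, and all function names (`split_sentences`, `frequency_scores`, `extractive_summary`) are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens."""
    return re.findall(r"\w+", sentence.lower())


def frequency_scores(sentences):
    """Stand-in scorer (assumption): mean document frequency of a sentence's words.

    A BERT-based system would replace this function with a model that
    assigns each sentence a relevance score.
    """
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    scores = []
    for s in sentences:
        toks = tokenize(s)
        scores.append(sum(counts[t] for t in toks) / len(toks) if toks else 0.0)
    return scores


def extractive_summary(text, k=2, scorer=frequency_scores):
    """Return the k highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    scores = scorer(sentences)
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore the original order so the summary reads naturally.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]
```

Because the scorer is passed in as a parameter, a learned sentence scorer could be swapped in without changing the selection logic.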
We evaluate the performance of the proposed approaches on two real datasets: the Federal Police dataset (a private-domain dataset) and the Brazilian Wikihow dataset (a public-domain dataset). Experimental results on both datasets, measured with different variants of the ROUGE metric, show that our approaches can significantly increase the effectiveness of extractive text summarization without sacrificing efficiency.
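For readers unfamiliar with ROUGE, the following Python sketch shows the core of the ROUGE-N variants: counting n-grams shared between a candidate summary and a reference summary. It is a simplified illustration with assumptions. Tokenization is a plain lowercase word split with no stemming or stopword handling, and the function names `ngrams` and `rouge_n` are hypothetical. The abstract does not say which ROUGE variants or which implementation the dissertation used.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between two texts.

    Assumption: lowercase word tokenization only; established ROUGE
    tools may apply extra preprocessing such as stemming.
    """
    cand = ngrams(re.findall(r"\w+", candidate.lower()), n)
    ref = ngrams(re.findall(r"\w+", reference.lower()), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Recall measures how much of the reference summary the candidate covers, precision measures how much of the candidate appears in the reference, and the F1 score balances the two.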
