BARROS, T. S.; http://lattes.cnpq.br/7401639950436351; BARROS, Thierry Silva.
Abstract:
In the Federal Police, a document known as notitia criminis serves as the starting point
of a criminal investigation. The notitia criminis document summarizes the investigative
activities and contains all relevant information about the alleged crime that
occurred. To manage an investigation and correlate it with similar ones, the Federal Police
generally needs to extract the most important information from the notitia criminis document.
Manual extraction, which requires reading and understanding the entire document, can be
exhausting because of the size and complexity of the documents. Therefore, it is necessary
to use Natural Language Processing (NLP) techniques to automatically extract the most
important passages, such as the crime that occurred. In the last few years, deep neural
networks have been successfully applied to many different NLP tasks. A neural network model
that improved results across a wide range of NLP tasks was BERT, an acronym for
Bidirectional Encoder Representations from Transformers. Its success is due to its ability
to represent the meaning of textual data, capturing both short-range dependencies
(correlations between pieces of text that are close together) and long-range dependencies
(correlations between pieces of text that are far apart). This dissertation proposes
different approaches based on the BERT model to extract the most important information
from the text of a notitia criminis document and build a summary of
it. Two types of techniques can be applied to the automatic summarization of textual
documents: abstractive and extractive. This dissertation uses the extractive technique, which
builds the summary from passages selected from the source text. Thus, we aim to analyze the feasibility
of using the BERT model to extract and synthesize the most important information from the
notitia criminis document. We evaluate the performance of the proposed approaches using
two real-world datasets: the Federal Police dataset (a private-domain dataset) and the Brazilian
Wikihow dataset (a public-domain dataset). Experimental results on the two datasets, using
different variants of the ROUGE metric, show that our approaches can significantly increase
extractive text summarization effectiveness without sacrificing efficiency.
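
For readers unfamiliar with ROUGE, the following is a minimal sketch of how an n-gram overlap score of this kind can be computed. It is not the evaluation code used in this dissertation: the function names `ngrams` and `rouge_n`, the whitespace tokenization, and the example sentences are assumptions made purely for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall and F1 between two strings.

    Assumption: lowercase whitespace tokenization, with no stemming or
    stop-word removal; real evaluation toolkits may preprocess differently.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram,
    # so repeated n-grams are not over-credited.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical sentences, not taken from either dataset.
candidate = "the suspect forged public documents"
reference = "the suspect allegedly forged two public documents"
scores = rouge_n(candidate, reference, n=1)
bigram_scores = rouge_n(candidate, reference, n=2)
```

In this sketch, recall measures how much of the reference summary is covered by the extracted summary, while precision measures how much of the extracted summary also appears in the reference; varying `n` gives the different ROUGE-N variants.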