LUCENA, A. L.; http://lattes.cnpq.br/2962858146566073; LUCENA, Andrielly de Lima.
Abstract:
The wide range of platforms, applications, and online operations now available for solving different problems results in a high volume of user data traffic, including sensitive and identifying data. Privacy is a right guaranteed by data protection laws worldwide, so these data require greater attention to prevent their disclosure. However, identifying sensitive information among many other types of data may not be a trivial task. Existing studies propose applying Natural Language Processing (NLP) techniques to automatically identify Personally Identifiable Information (PII) in Portuguese documents. This work proposes, through a proof of concept, an approach complementary to those used in related studies, based on the NLP task of Relation Extraction. To that end, a component was created that combines a language model specialized in Portuguese with additional relation extraction layers. To train and evaluate the component, a synthetic dataset of sensitive data was generated with the assistance of a Large Language Model (LLM). The results were satisfactory, with precision, recall, and F1-score all above 95%, indicating that the approach could be a good option for the automatic detection of sensitive personal information.