NERY, L. G. A.; http://lattes.cnpq.br/9635566043548464; NERY, Luiz Gustavo Alves.
Abstract:
This study addresses the importance of accurately extracting information from PDF documents and highlights the challenges posed by the lack of uniformity in their structure and layout. Extracting text from PDF documents, especially in contexts such as Official Gazettes, is crucial for automating processes and optimizing the analysis of relevant information. The ROUGE metric is used to evaluate the quality of the text extracted by each tool, with emphasis on recovering all information from the original text while preserving its reading order. Given the inefficiency and high cost of manual text extraction from PDF documents, this study aims to provide insights that help in choosing the most appropriate tool for different text extraction scenarios. Evaluating the selected tools with metrics suited to extracted text makes the comparison of these tools more effective and efficient.
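To make the role of ROUGE concrete, the following is a minimal sketch of how an extracted text could be scored against a reference text. The abstract does not state which ROUGE variants or which tokenization the study uses, so the choice of ROUGE-N and ROUGE-L, the whitespace tokenizer, and all function names here are illustrative assumptions rather than the study's actual procedure.

```python
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace.
    return text.lower().split()


def _prf(match, cand_total, ref_total):
    # Precision, recall and F1 from a match count and the two totals.
    precision = match / cand_total if cand_total else 0.0
    recall = match / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def _ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N: clipped n-gram overlap between reference and candidate."""
    ref = _ngrams(tokenize(reference), n)
    cand = _ngrams(tokenize(candidate), n)
    overlap = sum((ref & cand).values())  # counter intersection clips counts
    return _prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    # Longest common subsequence length, computed one DP row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L: based on the longest common subsequence, so order matters."""
    ref, cand = tokenize(reference), tokenize(candidate)
    return _prf(lcs_length(ref, cand), len(cand), len(ref))


# Hypothetical example texts, not taken from the study's data.
reference = "the decree was published in the official gazette"
shuffled = "in the official gazette the decree was published"  # wrong reading order
truncated = "the decree was published"  # missing content

print("ROUGE-1 shuffled :", rouge_n(reference, shuffled, 1))
print("ROUGE-L shuffled :", rouge_l(reference, shuffled))
print("ROUGE-1 truncated:", rouge_n(reference, truncated, 1))
```

In this sketch the shuffled text keeps a perfect ROUGE-1 score because every word is present, while its ROUGE-L score drops to 0.5 because the longest in-order match covers only half of the reference. The truncated text keeps full precision but reaches only 0.5 recall. This illustrates why an order-sensitive variant is useful when both completeness and reading order matter.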