FARIAS, W. N.; http://lattes.cnpq.br/5834360324217282; FARIAS, Walisson Nascimento de.
Abstract:
Optical character recognition (OCR) plays a key role in the digitization and processing of personal documents. However, it faces accuracy and efficiency challenges, because OCR tools still depend heavily on the quality of the input data and on the conditions under which documents are scanned or photographed. This study investigates a combination of pre-processing and post-processing techniques to improve OCR quality on personal documents. The process begins with the collection of a representative dataset of personal document images. The images are then pre-processed and submitted to OCR, the extracted text is post-processed, and a metric evaluates the OCR output. Pre-processing consisted of modifying the DPI of the images, smoothing them, and converting them to grayscale. Post-processing removed accent marks from the extracted text and converted it to capital letters. The results indicate that pre-processing significantly improved OCR accuracy for identity documents (ID), increasing the F1-score from 0.33 (without pre-processing) to 0.53 (with pre-processing). For CPF images, OCR with pre-processing achieved an accuracy of 73.48% and an error rate of 26.52%, whereas OCR without pre-processing achieved an accuracy of 36.46% and an error rate of 63.54%. These findings contribute to greater OCR accuracy, with potential benefits for applications that extract content from images of personal documents.
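The post-processing step described in the abstract (removing accent marks and converting the text to capital letters) can be sketched with the Python standard library. This is only an illustrative sketch, not the implementation used in the study: the function name `normalize_ocr_text` is hypothetical, and the use of Unicode NFD decomposition to strip accents is an assumption about how such a step might be done.

```python
import unicodedata


def normalize_ocr_text(text: str) -> str:
    """Hypothetical helper: strip accent marks, then upper-case the text."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (tilde, cedilla, acute, ...), keep base letters.
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.upper()


if __name__ == "__main__":
    print(normalize_ocr_text("João da Conceição"))
```

Applying the same normalization to both the OCR output and the reference text before scoring would keep accent and case differences from being counted as recognition errors.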