FARIAS, M. C.; http://lattes.cnpq.br/1287907028022615; FARIAS, Mainara Cavalcanti de.
Abstract:
Machine learning models need large datasets to be well trained. Creating labeled data, however, remains a major challenge for supervised learning, because it is often a time-consuming and costly manual process. Existing databases are therefore commonly used, but in many cases their data does not accurately reflect the context in which the model will operate. The data programming paradigm arose as an alternative to this traditional approach: training sets are created programmatically, with users expressing weak supervision strategies as labeling functions, which may be noisy. To build a model that is robust to the noise these functions produce, researchers at Stanford University created the Snorkel system, which exploits the agreement among the functions to build its model. In this research, Snorkel is used to label people in images. This task differs from Snorkel's usual applications, which are generally in natural language processing, where heuristics that operate on text are simpler to write. The images were captured in a laboratory in the CN block at UFCG. To compare the performance obtained with data from the specific environment (generated by Snorkel) against that obtained with an existing, generic database, a more sophisticated model was trained on each dataset. The final accuracy of the model trained with the Snorkel-generated data was 84.94%, while the model trained with generic images reached only 30%, indicating that a machine learning model specialized for a specific environment performs far better than one trained with generic data.
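
The idea of combining noisy labeling functions can be illustrated with a minimal Python sketch. It is not the method used in this work: Snorkel estimates the accuracy of each function with a generative label model, whereas the sketch below uses a plain majority vote as a simpler stand-in. The image features (`motion_ratio`, `skin_ratio`, `detector_score`), the thresholds, and all function names are hypothetical, since the abstract does not state which heuristics were used.

```python
from collections import Counter
from typing import Callable, Dict, List

# Label conventions used in this sketch (assumed, not taken from the source).
PERSON, NO_PERSON, ABSTAIN = 1, 0, -1

# Hypothetical precomputed features for one image.
Image = Dict[str, float]
LabelingFunction = Callable[[Image], int]


def lf_motion(img: Image) -> int:
    """Vote using the share of pixels that differ from an empty-room frame."""
    if img["motion_ratio"] > 0.10:
        return PERSON
    if img["motion_ratio"] < 0.01:
        return NO_PERSON
    return ABSTAIN


def lf_skin_tone(img: Image) -> int:
    """Vote PERSON when enough skin-colored pixels are present; otherwise abstain."""
    return PERSON if img["skin_ratio"] > 0.05 else ABSTAIN


def lf_detector(img: Image) -> int:
    """Vote using the confidence score of a generic, off-the-shelf detector."""
    if img["detector_score"] > 0.8:
        return PERSON
    if img["detector_score"] < 0.2:
        return NO_PERSON
    return ABSTAIN


def majority_vote(votes: List[int]) -> int:
    """Return the most common non-abstain vote; abstain on ties or no votes."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def label_images(images: List[Image], lfs: List[LabelingFunction]) -> List[int]:
    """Apply every labeling function to every image and combine the votes."""
    return [majority_vote([lf(img) for lf in lfs]) for img in images]


LFS = [lf_motion, lf_skin_tone, lf_detector]
frames = [
    {"motion_ratio": 0.30, "skin_ratio": 0.08, "detector_score": 0.90},
    {"motion_ratio": 0.00, "skin_ratio": 0.00, "detector_score": 0.10},
    {"motion_ratio": 0.05, "skin_ratio": 0.01, "detector_score": 0.50},
    {"motion_ratio": 0.20, "skin_ratio": 0.00, "detector_score": 0.10},
]
labels = label_images(frames, LFS)
print(labels)  # [1, 0, -1, -1]
```

Images on which the functions abstain or disagree evenly receive no label and would be left out of the training set. A learned label model can still use such cases by weighting each function according to its estimated accuracy.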