OLIVEIRA, T. G.; http://lattes.cnpq.br/2170938506941690; CERQUEIRA, Thaciana Guimarães de Oliveira.
Abstract:
GitHub is the largest source code hosting platform, with a large number of users and software repositories. Because of its social and collaborative nature, users are encouraged to contribute to each other's repositories. In this context, it is crucial to understand the factors that represent users' interests on the platform, so that effective services can be designed to help them discover relevant repositories. Recommender systems (RSs) have become increasingly popular for helping users retrieve information and for improving satisfaction with online services. RSs for repositories on GitHub can be designed to personalize the user experience and help users discover relevant projects in a wide and dynamic search space. However, this is a challenging scenario for recommendation. Over the last decade, machine learning methods and techniques have been explored in RS research. In general, techniques such as Collaborative Filtering (CF) and Content-Based Filtering (CBF) are used. The main contribution of this research is to explore the set of features available on the GitHub platform and to use this knowledge to recommend projects of interest to the user. It consists of analyzing the various characteristics of users and projects extracted from the data and metadata available through the platform. Additionally, we propose metrics derived from these characteristics to describe project activity in terms of attractiveness, visibility, statistics, and engagement. All of this information was explicitly modeled using CBF and CF techniques (Item-KNN, Random Forest (RF), XGBoost, and Factorization Machines (FM)) and applied to recommend repositories to which the user might contribute. An offline evaluation was carried out that considers factors affecting classifier learning, for which we propose two datasets, one built with balanced sampling and one with negative sampling, covering about 500 thousand projects. The experiments implemented and performed in this research show that XGBoost and RF are more effective than Item-KNN, CBF, and FM at recommending projects in terms of quality measures such as precision and project coverage. Among the results obtained, XGBoost achieves an accuracy 48% greater than Item-KNN in recommending projects of interest to the user. However, the accuracy of the project recommendations did not reach 1% in the top-10 recommendations. Finally, we propose a new approach based on classification learning, aimed at improving the recommendations produced by the applied techniques, with which we obtained an accuracy of about 40% for projects in the top 10 of the recommendation list. We conclude that a better understanding of the GitHub platform, with regard to the characteristics of users and their respective projects, allowed us to characterize users' interests and implement strategies for recommending repositories that match users' preferences. The results were encouraging, which suggests that the proposed characteristics represent users' preferences. We hope that our findings will pave the way for new, efficient recommender systems and search algorithms that help GitHub users find projects that interest them.
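For readers unfamiliar with the quality measures mentioned above, the following is a minimal sketch of how top-10 precision and project coverage are commonly computed in offline evaluation. It is not the evaluation code used in this research: the function names, the repository identifiers, and the choice to divide by k even when fewer than k items are recommended are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def precision_at_k(recommended: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k recommended repositories that are relevant.

    Assumption: the denominator is k even if fewer than k items were
    recommended, so short lists are penalized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for repo in recommended[:k] if repo in relevant)
    return hits / k


def catalog_coverage(recs_per_user: Dict[str, List[str]],
                     catalog: Set[str], k: int = 10) -> float:
    """Share of the project catalog that appears in at least one top-k list."""
    if not catalog:
        return 0.0
    seen: Set[str] = set()
    for recs in recs_per_user.values():
        seen.update(recs[:k])
    return len(seen & catalog) / len(catalog)


# Hypothetical toy data: two users and a catalog of four repositories.
toy_recs = {"user_1": ["repo_a", "repo_b"], "user_2": ["repo_b", "repo_c"]}
toy_catalog = {"repo_a", "repo_b", "repo_c", "repo_d"}

toy_precision = precision_at_k(["repo_a", "repo_b", "repo_c", "repo_d"],
                               {"repo_a", "repo_c"}, k=4)
toy_coverage = catalog_coverage(toy_recs, toy_catalog, k=10)
```

Under these definitions, a top-10 precision below 1% means that, on average, fewer than one in a hundred recommended slots holds a project the user actually contributed to, which illustrates how sparse the signal is in a catalog of roughly 500 thousand projects.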