A comparative experiment on the effectiveness of different LLMs in generating Gherkin scenarios.

Campus Campina Grande | Centro de Engenharia Elétrica e Informática - CEEI, Graduate Program in Computer Science, Master's in Computer Science.

Author: SOUSA, Hiago Natan Fernandes de (Lattes: http://lattes.cnpq.br/2201042413775848)

URI: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/41048

Date: 2025-01-31

Abstract:

Behavior-Driven Development (BDD) is essential in modern software development, with the Gherkin language playing a crucial role in specifying test scenarios. However, the manual creation of these scenarios is time-consuming and error-prone. Large Language Models (LLMs) emerge as an innovative solution to automate and optimize this process, offering a more efficient and reliable alternative. In this study, we investigated the effectiveness of six LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o Mini, LLaMA 3, Phi-3, and Gemini) in the automated generation of Gherkin scenarios from 1,286 real-world test scenarios. We applied different prompting techniques, such as zero-shot, one-shot, and few-shot, to evaluate the quality and consistency of the generated outputs. The goal was to identify the most suitable technique and model for creating BDD scenarios. To conduct the analysis, we selected quality and variability evaluation measures, which were correlated with qualitative assessments performed by experts. This ensured the choice of representative metrics that adequately reflect the quality of the generated scenarios. Additionally, statistical analyses were performed to verify the existence of significant differences between the models and techniques applied, ensuring the methodological robustness of the study. The variability analysis indicated that the consistency of the models depends on the technique used: in zero-shot, Gemini was more consistent, while LLaMA 3 and GPT-3.5 Turbo showed higher variability. In one-shot, GPT-4o Mini and GPT-4 Turbo stood out for their stability, whereas in few-shot, GPT-4o Mini and LLaMA 3 were the most stable. The performance analysis revealed that the zero-shot technique was the most effective in various contexts, especially when applied to the Gemini model. However, statistical analyses, such as the Kruskal-Wallis test, demonstrated that the observed differences between the models were not statistically significant.
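As an illustration of how the three prompting techniques named in the abstract differ, the following minimal sketch builds a prompt whose number of worked examples determines whether it is zero-shot, one-shot, or few-shot. The instruction wording, the `build_prompt` and `technique_name` helpers, and the prompt layout are all hypothetical and are not taken from the study; the abstract does not describe the prompts actually used.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; the study's actual prompt wording is not given.
INSTRUCTION = (
    "Write a Gherkin scenario (Feature, Scenario, Given/When/Then) "
    "for the following test description."
)


def build_prompt(description: str,
                 examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a prompt from an instruction, optional worked examples,
    and the target description.

    Each example is a (description, gherkin) pair. Passing no examples
    yields a zero-shot prompt, one example a one-shot prompt, and two or
    more a few-shot prompt.
    """
    parts = [INSTRUCTION]
    for example_description, example_gherkin in examples:
        parts.append(
            f"Description: {example_description}\nGherkin:\n{example_gherkin}"
        )
    # The target description comes last, leaving the Gherkin for the model.
    parts.append(f"Description: {description}\nGherkin:")
    return "\n\n".join(parts)


def technique_name(examples: Sequence[Tuple[str, str]]) -> str:
    """Name the prompting technique implied by the number of examples."""
    count = len(examples)
    if count == 0:
        return "zero-shot"
    if count == 1:
        return "one-shot"
    return "few-shot"
```

Under these assumptions, the only difference between the techniques is how many solved description/Gherkin pairs precede the target description, so the same function can produce all three prompt variants for a given test scenario.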
