Unit Test Generation with GitHub Copilot: A Case Study
Humalajoki, Sami (2024-06-17)
This publication is subject to copyright. The work may be read and printed for personal use. Commercial use is prohibited.
Open access
The permanent address of the publication is:
https://urn.fi/URN:NBN:fi-fe2024062056482
Abstract
Artificial intelligence has taken remarkable steps in recent years. Natural language processing technology and large language models have changed many aspects of the software development process, and a variety of tools have been developed to support it. In this study we evaluate GitHub Copilot's abilities in automated unit test generation. The study is motivated by the critical role of testing in software development: effective testing ensures code quality, reliability, and maintainability. Software systems have grown increasingly complex, and with them the demand for efficient test generation tools. This study assesses GitHub Copilot's abilities against real-world testing standards by introducing unit tests to a legacy software system.
Four research questions guide the evaluation: the immediate usability of Copilot's test suggestions based on their compilation success rates, the correctness of these suggestions through execution error analysis, the effectiveness of the test suggestions measured by code coverage, and the presence of test smells indicating potential maintainability problems. The research methodology employs an iterative one-shot method to evaluate GitHub Copilot's performance in generating and refining test cases, structured into three primary steps. First, test cases are generated by prompting Copilot within an IDE. Their syntactic correctness is then verified and, where necessary, corrected using IDE feedback and Copilot's own correction suggestions. Finally, the syntactically correct test cases are executed, corrected as needed, and assessed for functional correctness and for quality metrics such as code coverage and test smells.
The research findings indicate that GitHub Copilot can generate valid unit tests, but its performance is inconsistent and frequently requires human intervention. Copilot struggles with complex mocking scenarios, often fails to detect straightforward errors, and relies heavily on the provided context, leading to potential reliability issues. Code coverage analysis shows that Copilot is effective in straightforward testing scenarios, achieving high coverage on simple methods, but performs poorly on methods with high cyclomatic complexity. Additionally, Copilot's tests exhibit well-known test smells such as Magic Number Tests and Lazy Tests, which occur more frequently in complex code, suggesting a preference for speed over quality and a tendency to overlook best practices in unit testing. Overall, while Copilot can produce tests of reasonable quality, its effectiveness diminishes as code complexity increases.
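To make the reported smells concrete, the sketch below shows how they typically look in generated tests. It assumes a hypothetical Java/JUnit 5 setting; the abstract does not name the studied system's language, and InvoiceCalculator and totalWithTax are invented names used only for illustration. A Magic Number Test hard-codes unexplained literals, and a Lazy Test adds another test method that exercises the same production method.

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    // Hypothetical production class, invented for illustration only.
    class InvoiceCalculator {
        double totalWithTax(double net, double taxRate) {
            return net * (1 + taxRate);
        }
    }

    class InvoiceCalculatorTest {

        // Magic Number Test: the literals 100, 0.185 and 118.5 appear without
        // named constants explaining what they represent.
        @Test
        void calculatesTotalWithTax() {
            assertEquals(118.5, new InvoiceCalculator().totalWithTax(100, 0.185), 0.001);
        }

        // Lazy Test: this second test exercises the same production method
        // (totalWithTax) as the one above instead of covering distinct behaviour.
        @Test
        void calculatesTotalWithZeroTax() {
            assertEquals(100.0, new InvoiceCalculator().totalWithTax(100, 0.0), 0.001);
        }
    }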
The results indicate the need for frequent human intervention for error correction and test quality enhancement. Also, the presence of common test smells may indicate a preference for speed over best practices. Copilot might also benefit from an internal feedback system in which it could execute and assess its own code suggestions. These insights suggest that Copilot is valuable for straightforward testing scenarios, but its reliability decreases with more complex code.