
Construct Validity Threats in Measuring Emotion Causes in Software Engineering Communication

Emotions play a crucial role in software development: they shape communication and collaboration among team members. Identifying the causes of those emotions, however, is challenging, particularly in the natural language comments that software engineers write. In this article, we explore the use of zero-shot large language models (LLMs) to extract emotion causes from domain-specific comments in software engineering communication. We evaluate three popular LLMs (ChatGPT, GPT-4, and flan-alpaca) and discuss their strengths and limitations.

Background

Traditional rule-based approaches to emotion recognition in natural language processing (NLP) rely on hand-crafted features and domain-specific heuristics, which are time-consuming to build and often suboptimal. Zero-shot LLMs offer a more efficient alternative: they draw on large-scale pre-training (and, for instruction-tuned models such as flan-alpaca, fine-tuning on diverse instruction data), which lets them recognize emotions and their causes without any task-specific labels from the target domain.
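To make the zero-shot setup concrete, the sketch below shows one way to prompt an LLM for emotion-cause extraction. The prompt wording, the model name, and the use of the OpenAI chat completions client are illustrative assumptions on our part, not the exact configuration used in the study.

# Minimal sketch of zero-shot emotion-cause extraction with an LLM.
# Prompt wording and model name are illustrative, not the authors' exact setup.
from openai import OpenAI  # requires the `openai` package and an API key

client = OpenAI()

PROMPT = (
    "You are given a comment written by a software developer.\n"
    "Identify the emotion expressed and quote the exact text span that causes it.\n"
    "Answer as: emotion=<label>; cause=<verbatim span>\n\n"
    "Comment: {comment}"
)

def extract_emotion_cause(comment: str, model: str = "gpt-4") -> str:
    # No fine-tuning or labeled examples: the model sees only the instruction above.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(comment=comment)}],
        temperature=0,  # deterministic output makes evaluation more repeatable
    )
    return response.choices[0].message.content.strip()

print(extract_emotion_cause(
    "This build has been failing for three days and nobody responds to my issue."
))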

Methodology

We evaluate ChatGPT, GPT-4, and flan-alpaca on the task of extracting emotion causes from software engineering comments. We annotate a dataset of 450 utterances with manually identified emotion causes and use it to evaluate the models in a zero-shot setting, without fine-tuning them on the annotated data. We compute BLEU scores to compare the extracted cause spans against the gold annotations and perform an error analysis to understand where the models make mistakes.
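The article does not specify the exact BLEU configuration, so the snippet below is a minimal sketch of one plausible setup: scoring a predicted cause span against its gold reference with NLTK's sentence-level BLEU and smoothing. The example spans are invented for illustration.

# Sketch of BLEU-based evaluation of an extracted cause span against a gold span.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def span_bleu(gold_span: str, predicted_span: str) -> float:
    # Compare one predicted cause span to a single gold reference span.
    reference = [gold_span.lower().split()]      # list of reference token lists
    hypothesis = predicted_span.lower().split()  # candidate token list
    smoothing = SmoothingFunction().method1      # avoids zero scores on short spans
    return sentence_bleu(reference, hypothesis, smoothing_function=smoothing)

gold = "the build has been failing for three days"
pred = "build failing for three days"
print(f"BLEU = {span_bleu(gold, pred):.3f}")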

Results

The results show that ChatGPT, GPT-4, and flan-alpaca produce emotion cause spans with average lengths of 8.85, 8.64, and 13.12 words, respectively. BLEU scores for the three models range from 0.467 to 0.598, indicating substantial overlap with the gold annotations but far from exact matches. The error analysis shows that the models struggle most with complex sentences and with causes that are long or ambiguous.
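As a rough illustration of how such an error analysis could be run, the sketch below flags low-BLEU predictions and relates them to the length of the gold cause span. The threshold and the record format are our own assumptions, not details from the study.

# Rough sketch of an error-analysis pass over per-utterance results.
from statistics import mean

def summarize_errors(records, bleu_threshold=0.3):
    # records: list of dicts with "gold", "pred", and "bleu" keys (one per utterance);
    # the 0.3 cut-off is an assumed threshold for flagging a prediction as an error.
    errors = [r for r in records if r["bleu"] < bleu_threshold]
    return {
        "n_errors": len(errors),
        "avg_predicted_span_len": mean(len(r["pred"].split()) for r in records),
        "avg_gold_span_len_in_errors": (
            mean(len(r["gold"].split()) for r in errors) if errors else 0.0
        ),
    }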

Discussion

Our findings demonstrate that zero-shot LLMs can extract emotion causes from software engineering comments with varying degrees of accuracy. While these models require no task-specific labels, their performance depends on factors such as the data they were pre-trained on and the complexity of the target domain.

Conclusion

In conclusion, this study explores the use of zero-shot LLMs for identifying emotion causes in software engineering communication. We evaluate three popular models and provide insights into their strengths and limitations. Our findings have implications for future research, particularly with regard to improving the accuracy of emotion-cause extraction in complex, domain-specific settings. By leveraging zero-shot LLMs, we can better understand the emotional nuances of software development communication and, ultimately, improve collaboration and productivity within teams.