Our paper has been accepted to the ACL 2022 Workshop (Insights from Negative Results in NLP).
■Bibliographic Information
Itsuki Okimura, Machel Reid, Makoto Kawano and Yutaka Matsuo
On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing, the Third Workshop on Insights from Negative Results in NLP, ACL 2022, May 2022
■Outline
Within the broader scope of machine learning, data augmentation is a common strategy to improve the generalization and robustness of machine learning models. While data augmentation has been widely used within computer vision, its use in NLP has been comparably limited. The reason for this is that within NLP, the impact of proposed data augmentation methods on performance has not been evaluated in a unified manner, and effective data augmentation methods remain unclear. In this paper, we look to tackle this by evaluating the impact of 12 data augmentation methods on multiple datasets when finetuning pre-trained language models. We find minimal improvements when data sizes are constrained to a few thousand, with performance degradation when data size is increased. We also use various methods to quantify the strength of data augmentations, and find that these values, though weakly correlated with downstream performance, do not reliably predict it. Furthermore, we find a glaring lack of consistently performant data augmentations. All of this alludes to the difficulty of data augmentation for NLP tasks, and we are inclined to believe that static data augmentations are not broadly applicable given these properties.
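For readers unfamiliar with what a "static" text augmentation looks like, the following is a minimal illustrative sketch of two common token-level operations (random swap and random deletion, in the spirit of EDA-style methods). It shows the general family of techniques the paper evaluates; the function names and parameters here are our own illustrative choices, not the paper's implementation.

```python
import random

def random_swap(tokens: list[str], n_swaps: int = 1) -> list[str]:
    # Randomly swap pairs of tokens; a typical "static" augmentation.
    # (Illustrative sketch, not the paper's code.)
    tokens = tokens[:]
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens: list[str], p: float = 0.1) -> list[str]:
    # Drop each token independently with probability p,
    # keeping at least one token so the example is never empty.
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

if __name__ == "__main__":
    sentence = "data augmentation can improve model robustness".split()
    print(" ".join(random_swap(sentence, n_swaps=2)))
    print(" ".join(random_deletion(sentence, p=0.2)))
```

Such augmentations are "static" in the sense that they transform the training text once, independent of the model or task, which is exactly the property the paper's negative results call into question.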