The increasingly wide uptake of Machine Learning (ML) has heightened the significance of tackling bias (i.e., unfairness), making it a primary software engineering concern.
In this paper, we introduce Fairea, a model behaviour mutation approach to benchmarking ML bias mitigation methods.
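To illustrate the core idea, the following is a minimal sketch of how a mutation-based trade-off baseline can be constructed: an increasing fraction of a trained model's predictions is mutated towards a fixed label, and each mutation degree yields one accuracy-bias point. The function names, the choice of statistical parity difference as the bias metric, and the mutate-to-majority strategy are illustrative assumptions, not necessarily Fairea's exact procedure.

```python
import numpy as np

def statistical_parity_difference(y_pred, sensitive):
    """|P(yhat=1 | unprivileged) - P(yhat=1 | privileged)|; lower is fairer.
    Assumes binary predictions and a binary sensitive attribute."""
    priv = y_pred[sensitive == 1]
    unpriv = y_pred[sensitive == 0]
    return abs(unpriv.mean() - priv.mean())

def mutation_baseline(y_true, y_pred, sensitive,
                      degrees=np.linspace(0.0, 1.0, 11), seed=0):
    """Return (degree, accuracy, bias) points forming a trade-off baseline.
    Hypothetical sketch: mutate a fraction `degree` of predictions to the
    majority predicted class and measure the resulting accuracy and bias."""
    rng = np.random.default_rng(seed)
    majority = np.bincount(y_pred).argmax()  # label to mutate towards
    points = []
    for d in degrees:
        mutated = y_pred.copy()
        idx = rng.choice(len(y_pred), size=int(d * len(y_pred)), replace=False)
        mutated[idx] = majority
        acc = (mutated == y_true).mean()
        bias = statistical_parity_difference(mutated, sensitive)
        points.append((d, acc, bias))
    return points
```

A mitigation method whose accuracy-bias point falls below the region spanned by these baseline points achieves a trade-off no better than naively degrading the model's predictions.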
We also report on a large-scale empirical study that tests the effectiveness of 12 widely studied bias mitigation methods.
Our results reveal that, surprisingly, bias mitigation methods exhibit poor effectiveness in 49% of the cases.
In particular, 15% of the mitigation cases yield worse fairness-accuracy trade-offs than the baseline established by Fairea, and 34% of the cases decrease accuracy while increasing bias.
Fairea has been made publicly available for software engineers and researchers to evaluate their bias mitigation methods.