Biased Algorithms and How They Affect Your Business
Thanks to the open-source ecosystem, a capable team of engineers can build machine learning models quickly. However, with companies and governments waking up to the problem of algorithmic bias, businesses must make sure the models they use are not so biased that they make bad decisions or, even worse, become reputational risks for the organisation.
Biased datasets lead to biased models
Machine learning models owe their performance to the data they have seen. For example, a model could have seen five years of mortgage repayment history for a bank's customers, where the lowest defaulters were customers receiving special tax rebates. Now suppose these rebates are about to expire. The model will treat all these customers as low risk because they sit in the low-default category, when in reality it is "biased" towards customers who receive tax benefits. In such a case, both statistical and machine learning models will fail to identify cases of concern, directly affecting the bank's business.
To make sure the decisions are appropriate for the business, such a model either needs an exceptionally large amount of expert time (thereby defeating the purpose of building an AI) or it needs to see a balanced dataset. Here, the balanced dataset would contain customer repayment history for the same five years, but split roughly equally between 1) customers with tax benefits, 2) customers who never had any benefits, and 3) customers whose tax benefits recently expired, i.e. roughly a third of the data in each of the three buckets. The same applies when you are building a facial recognition application for a smart camera but your dataset is skewed towards white men. In such a case, the model might fail to detect an African or an Asian face, hurting the camera company both in business performance and in reputational risk. In short, the trick lies in better sampling of the training dataset.
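The bucket-balancing idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the function name, the customer segments, and the record counts are all hypothetical, and it simply downsamples every bucket to the size of the smallest one.

```python
import random
from collections import defaultdict

def balance_by_bucket(records, bucket_key, seed=42):
    """Downsample each bucket to the size of the smallest bucket,
    so every segment contributes an equal share of training data."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[bucket_key]].append(record)
    smallest = min(len(group) for group in buckets.values())
    rng = random.Random(seed)
    balanced = []
    for group in buckets.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Illustrative data for the three hypothetical customer segments:
history = (
    [{"segment": "rebate_active", "defaulted": 0}] * 50
    + [{"segment": "no_rebate", "defaulted": 1}] * 20
    + [{"segment": "rebate_expired", "defaulted": 1}] * 30
)
balanced = balance_by_bucket(history, "segment")
# Each segment now contributes 20 records (the smallest bucket's size).
```

Downsampling throws data away, so in practice teams often prefer to collect more data for the underrepresented buckets instead; the point here is only that each bucket ends up with a comparable share.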
Datasets based on biased human decisions also lead to biased models
Human history and decision making are filled with various shades of unintentional bias: we fall back on familiar assumptions when making decisions. This may have served us well in the past, but it is not equally effective in every scenario of modern life. Datasets built from human decision points are often the training material for machine learning models, which means the algorithms will inherit human biases to some extent. We therefore have to be extra cautious when using such datasets to train a model and de-bias the training set as far as possible. For example, if we try to predict future earnings based on a person's potential, we should ideally drop the data attribute of "gender" and focus on attributes like education and past work experience.
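Dropping a protected attribute before training can be sketched as follows. The attribute names are illustrative only, and note an important caveat: removing the column does not remove proxy variables that correlate with it, so this is a first step rather than a complete fix.

```python
# Attributes we do not want the model to train on directly (illustrative).
SENSITIVE_ATTRIBUTES = frozenset({"gender"})

def drop_sensitive(record, sensitive=SENSITIVE_ATTRIBUTES):
    """Strip protected attributes from a record before it reaches the model."""
    return {k: v for k, v in record.items() if k not in sensitive}

applicant = {"gender": "F", "education": "MSc", "years_experience": 7}
features = drop_sensitive(applicant)
# features == {"education": "MSc", "years_experience": 7}
```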
As another example, consider social media websites, where models "learn" our specific interests and views and feed us more of the same kind of stories. This is a classic case of confirmation bias. It is one of the reasons why, after an election, we often wonder who voted against us: all our friends seem to hold the same world view. The issue has been debated heavily of late across America and Europe and deserves special mention, since these models have caused tremendous reputational risks for social media businesses.
How do you protect your models from bias?
As we pointed out earlier, biased datasets lead to biased models. As good and ethical machine learning engineers, we need to understand the nuances of ethics and how software engineering can work around the problem of bias. The answer mostly lies in better sampling.
We can balance datasets so that every bucket in each category used to train the model is fairly represented. When a balanced dataset is not available, it is also important to "generate" new data so that the training set properly represents real business cases.
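One simple way to "generate" data for an underrepresented group is naive random oversampling: resampling minority records with replacement until every class is as large as the largest one. This is a hedged sketch with made-up field names; real projects often use more sophisticated techniques such as SMOTE (available in the imbalanced-learn library), which synthesises new points rather than duplicating existing ones.

```python
import random

def oversample_minority(records, label_key, seed=0):
    """Naive random oversampling: duplicate minority-class records
    (sampled with replacement) until all classes reach the majority size."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(record[label_key], []).append(record)
    target = max(len(group) for group in by_class.values())
    out = []
    for group in by_class.values():
        out.extend(group)
        out.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(out)
    return out

# Illustrative: 3 defaulters vs 7 non-defaulters becomes 7 vs 7.
history = [{"defaulted": 1}] * 3 + [{"defaulted": 0}] * 7
rebalanced = oversample_minority(history, "defaulted")
```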
Finally, if most of a model's predictive power comes from attributes such as gender, race, or monthly salary, the explanations should either be well understood in the context of the business's moral values, or those attributes should be dropped as independent variables. This is especially true in the world of GDPR and other regulations that increasingly demand opening up the black box. The question is pertinent because once the black box is opened and we find that the model discriminated against, for example, cancer patients, the business may be challenged on ethical grounds, which could lead to irreversible reputational damage.
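One standard way to estimate how much predictive power a single attribute carries is permutation importance: shuffle that attribute's values and measure how much a chosen metric drops. The sketch below is model-agnostic and uses only the standard library; the toy model and attribute names are illustrative assumptions, and libraries such as scikit-learn offer a more robust implementation.

```python
import random

def permutation_importance(predict, rows, labels, feature, metric, seed=0):
    """Estimate how much `feature` drives the model: shuffle its values
    across rows and return the resulting drop in the metric."""
    rng = random.Random(seed)
    baseline = metric([predict(r) for r in rows], labels)
    shuffled_values = [r[feature] for r in rows]
    rng.shuffle(shuffled_values)
    shuffled_rows = [dict(r, **{feature: v})
                     for r, v in zip(rows, shuffled_values)]
    return baseline - metric([predict(r) for r in shuffled_rows], labels)

def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Illustrative check on a toy "model" that leans entirely on gender:
rows = [{"gender": "F", "education": 1}, {"gender": "M", "education": 1}] * 2
labels = ["F", "M", "F", "M"]
predict = lambda r: r["gender"]
unused = permutation_importance(predict, rows, labels, "education", accuracy)
used = permutation_importance(predict, rows, labels, "gender", accuracy)
# `unused` is 0.0: shuffling an attribute the model ignores changes nothing.
```

If a sensitive attribute shows a large importance score, that is exactly the situation where the business needs either a defensible explanation or a retrained model without that attribute.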
Standards, validation and knowledge are the cornerstones of every good machine learning architecture. As long as a proper pre-mortem exercise is done to protect against biased models, machine learning models will still be our friends.