8 types of bias in data analysis and how to avoid them

There are several ways in which bias can present itself in analytics, including in the formation and testing of hypotheses, sampling, and preparation of data.

Elif Tutuk, Associate VP of innovation and design at Qlik, emphasizes the critical need to prioritize bias mitigation in all data-related endeavors. She points out that bias can arise at various stages, ranging from data framing and collection to the implementation of analytics, AI, or ML systems. While complete elimination of bias may not be feasible, data scientists can take measures to minimize its extent and impact.

"The first step is the realization that bias exists, not just in the data that is being analyzed or used, but also by the people who are using it," Hariharan Kolam, CEO and founder of Findem, a people intelligence company. It is dangerous as it may result in inaccuracies in decision-making, which in turn affects the profitability of a company and specific stakeholders. One of the common problems is that people do not concentrate on the main question of inquiry. Kolam further recommends that data scientists should establish the goal of an analysis to avoid ambiguous results.

What are the forms of bias in analytics?

According to Alicia Frame, lead product manager at the graph database vendor Neo4j, data scientists primarily associate bias with the data itself. She identifies several sources that can contribute to bias, including human sources, unrepresentative data sets, leading survey questions, and biased reporting and measurement. Frame notes that bias may not become apparent until the data is applied in decision-making processes, such as constructing predictive models.


Medical data, for instance, often contains a disproportionate share of white patients, especially in trials of new drugs, so the experiences and outcomes of people of color are less well represented. This bias became particularly relevant during COVID-19, when vaccine makers trying to accelerate testing needed to recruit trial participants with diverse genetic backgrounds. It is one reason Pfizer announced it was starting new trials and looking to recruit another 15,000 participants. "It's so disappointing that analytics also reflects the biases we see around us," said Sarah Gates, global product marketing manager at SAS.

A further complication is that fairness is not static; it changes as society's definition of it changes. One example, recently reported by Reuters, is the International Baccalaureate program, which canceled its May exams for high school students because of COVID-19. Instead of traditional exams, an algorithm assigned grades for the IB program, producing grades far lower than many students and teachers expected.

In business, bias can also stem from the way different individuals record data.

For instance, "It is extremely unusual for salespeople who are updating CRM data to point to themselves as the reason why a deal was lost," said Dave Weisbeck, Chief Strategy Officer at Visier, a people analytics company. Data source reflects certain biases, and careful selection can minimize them.

Here are eight examples of biases in data analysis and the ways to deal with each of them.

1. Recycling the Current Establishment

A common and dangerous kind of bias in data analysis is recycling the current establishment, said Alicia Frame. For instance, Amazon's now-scrapped recruiting tool was found to favor male applicants because it was trained on the company's historical hiring patterns. The tool didn't consider applicants' gender directly but picked up on gender-correlated signals, such as sports and social activities or the adjectives used to describe accomplishments. In effect, the AI identified these subtle distinctions and looked for candidates who resembled the people the firm had already deemed successful. Frame argued that a good countermeasure is explainability: accompany AI systems with an account of what data they were trained on and how their features relate to the predictions they make. A simple first check, sketched below, is to test whether any input feature acts as a proxy for a protected attribute.
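The following is a minimal sketch of such a proxy check, not Frame's or Amazon's actual method. The DataFrame and its column names (gender, played_sports, years_exp) are hypothetical stand-ins for real candidate features.

```python
import pandas as pd

# Hypothetical resume-screening data: 'gender' is the protected attribute,
# the other columns are candidate features the model would train on.
df = pd.DataFrame({
    "gender":        [1, 1, 0, 0, 1, 0, 1, 0],   # 1 = male, 0 = female
    "played_sports": [1, 1, 0, 0, 1, 0, 1, 0],
    "years_exp":     [5, 3, 6, 2, 4, 7, 3, 5],
})

# Correlate every candidate feature with the protected attribute.
# A high absolute correlation flags a potential proxy for gender,
# even if gender itself is excluded from the model's inputs.
proxy_scores = (
    df.drop(columns="gender")
      .corrwith(df["gender"])
      .abs()
      .sort_values(ascending=False)
)
print(proxy_scores)  # played_sports scores 1.0 here: a perfect proxy
```

Features that score high on such a check deserve scrutiny before training, since excluding the protected attribute alone does not prevent the model from learning it indirectly.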

2. Trained on the Wrong Thing

Arijit Sengupta, CEO and founder of the AI platform Aible, noted that one of the core biases baked into AI comes from training models to maximize accuracy, which is relatively unimportant, rather than business outcomes, which are what matter to the organization. The basic problem is that accuracy metrics treat every error as equally costly, but in business, the payoff for being right and the cost of being wrong are often wildly asymmetric. For instance, suppose the payoff from winning a deal is $100,000 and the cost of pursuing a deal that turns out to be unwinnable is $1,000. An AI that wins only one deal per 100 pursued would be considered highly inaccurate yet would still produce a net gain. Sengupta said data scientists should clearly define what constitutes high or low cost and which outcomes provide high or low value.
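The sketch below makes that arithmetic concrete by scoring a model on business value instead of accuracy. The payoff and cost figures come from the example above; business_value is an illustrative helper, not an Aible API.

```python
import numpy as np

# Hypothetical economics from the example above.
WIN_PAYOFF   = 100_000  # revenue from a won deal
PURSUIT_COST = 1_000    # cost of pursuing a deal that is lost

def business_value(y_true: np.ndarray, y_pred: np.ndarray) -> int:
    """Net value of acting on the model's 'pursue' recommendations."""
    pursued = y_pred == 1
    wins = np.sum(pursued & (y_true == 1))
    losses = np.sum(pursued & (y_true == 0))
    return wins * WIN_PAYOFF - losses * PURSUIT_COST

# A model that recommends pursuing 100 deals and wins only 1 of them:
y_true = np.array([1] + [0] * 99)   # ground truth: one winnable deal
y_pred = np.ones(100, dtype=int)    # model says "pursue" every time
print(business_value(y_true, y_pred))  # 100,000 - 99 * 1,000 = +1,000 net
print((y_true == y_pred).mean())       # accuracy is only 0.01
```

A 1%-accurate model is still in the black here, which is exactly why optimizing and evaluating on accuracy alone can point teams in the wrong direction.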

3. Under-representing Populations

Omitting sections of the population that should have been included in an analysis is one of the main sources of selection bias. It has had grave effects in medicine, which historically overlooked the considerable differences in how heart disease manifests in men and women, as Carlos Melendez, COO and co-founder of Wovenware, explained. The bias often arises because training data is not balanced for gender, race, or economic status. Melendez proposed three practices to address it: building a diverse talent pool of data scientists, providing diversity training for data scientists, and rigorously testing algorithms for bias, for example with a representation audit like the one sketched below.
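As one illustration of such a test, this sketch compares group shares in a training set against assumed population shares; the data, the reference proportions, and the 10-percentage-point threshold are all hypothetical.

```python
import pandas as pd

# Hypothetical training set and census-style reference proportions.
train = pd.DataFrame({"sex": ["M", "M", "M", "F", "M", "M", "F", "M"]})
reference = pd.Series({"M": 0.49, "F": 0.51})  # assumed population shares

# Share of each group actually present in the training data.
observed = train["sex"].value_counts(normalize=True)

# Gap vs. the population; groups missing entirely get the full deficit.
gap = (observed - reference).fillna(-reference)

# Flag any group whose share trails the population by more than 10 points.
under_represented = gap[gap < -0.10]
print(observed)
print(under_represented)  # here, 'F' (25% observed vs. 51% expected) is flagged
```

The same pattern extends to any protected attribute, and running it routinely on new training extracts catches representation drift before a model is retrained.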

4. False Interpretation (Missing the Mark)

"If one goes in with the intention to find a reason to support that belief or opinion, then one will find supporting data in the analysis," Weisbeck said.

Medical researchers counter this bias with double-blind studies, in which neither the participants nor the people collecting the data know enough about group assignments to influence the results. That design is difficult to replicate in business, so data scientists instead need to control the problem by analyzing the source of their conclusions. Weisbeck pointed to an internal study Visier ran on gender pay equity. One tactic was to divide the data into groups where bias was expected and groups where it was not, and then measure differences in outcome variables, such as whether women's pay changes depended on having a male or a female supervisor.

Another tactic was to look for corroborating results wherever bias was likely to appear. For instance, the team tested the hypothesis that if women's compensation was not being adjusted properly, the same pattern should show up in their performance evaluations. The method rests on the assumption that if gender bias exists in one area, it will also be present in related ones. A simple version of the group-comparison tactic is sketched below.
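Here is a minimal, hypothetical sketch of that comparison, not Visier's actual analysis: it contrasts pay raises under male and female managers using Welch's t-test. The HR data and column names are invented.

```python
import pandas as pd
from scipy import stats

# Hypothetical HR extract: pay raises (%) for women, split by the group
# where bias is suspected (male manager) vs. not (female manager).
df = pd.DataFrame({
    "manager_gender": ["M", "M", "M", "F", "F", "F", "M", "F"],
    "raise_pct":      [2.1, 2.4, 1.9, 3.0, 2.8, 3.2, 2.0, 2.9],
})

suspect = df.loc[df["manager_gender"] == "M", "raise_pct"]
control = df.loc[df["manager_gender"] == "F", "raise_pct"]

# Welch's t-test: is the gap in mean raises larger than chance would allow?
t_stat, p_value = stats.ttest_ind(suspect, control, equal_var=False)
print(f"mean (male mgr) = {suspect.mean():.2f}, "
      f"mean (female mgr) = {control.mean():.2f}, p = {p_value:.4f}")
```

On real data the same comparison would be repeated across related outcomes, such as performance ratings, to check whether a suspected bias shows up consistently.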

5. Statistical Biases

Statistical biases can emanate from cognitive biases, according to Charna Parkey, data science lead at Kaskada, a machine learning platform. Analyses often use data that is easily found, or gathered in an ad hoc fashion, rather than purpose-built data sets. The way data is gathered at the source can also bias what is received; this is known as sample bias.

Selection bias, by contrast, occurs when the collected samples are not representative of the future population to which the model will be applied.

To remedy this, one should move from static facts to event-sourced data, which lets records be updated as things change in real time. This also involves creating real-time dashboards and ML models that can be monitored over time. The sketch below illustrates the event-sourced idea.
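The following is a minimal sketch of event sourcing in general, not Kaskada's implementation: every change is stored as a timestamped event, and the state at any moment is derived by replaying the log. The PlanChanged event and customer IDs are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Instead of storing one static fact ("the customer's plan is X"),
# store every change as a timestamped event.
@dataclass
class PlanChanged:
    customer_id: str
    plan: str
    at: datetime

events = [
    PlanChanged("c1", "free",    datetime(2020, 1, 1)),
    PlanChanged("c1", "pro",     datetime(2020, 6, 1)),
    PlanChanged("c1", "premium", datetime(2021, 2, 1)),
]

def plan_as_of(customer_id: str, when: datetime) -> Optional[str]:
    """Replay the event log to recover the state at any point in time."""
    relevant = [e for e in events
                if e.customer_id == customer_id and e.at <= when]
    return max(relevant, key=lambda e: e.at).plan if relevant else None

print(plan_as_of("c1", datetime(2020, 7, 1)))  # "pro" at that date
print(plan_as_of("c1", datetime.now()))        # current state: "premium"
```

Because the full history is preserved, dashboards and models built on such a log can always be recomputed against the data as it actually stood at any time, rather than against a stale snapshot.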

Reminding those who build models and those who make decisions of the unintentional biases that affect them, and giving them strategies for avoiding those biases, can at least partly help reduce them, Parkey said.

6. Analytics bias

Data bias and analytics bias are closely connected: bias in analytics often stems from incomplete data sets and a lack of proper context around them. Elif Tutuk highlighted the importance of acknowledging the insights that are missed when crucial data is excluded. Understanding the relationships that are absent from the data is just as important as recognizing the ones it contains.

Static data, moreover, always describes a period that has passed; it is typically out of date by the time it is obtained. To address these issues, organizations should use associative data technologies that allow them to access and consolidate all of the necessary data in one system.

This matters because business runs on feedback, and analysis needs to keep pace, with data prepared and available for reuse as the business scenario develops. To define true semantic views of the information, and to let data managers give IT the architectural support of business-adaptive views of the data, the required context has to be designed around business requirements rather than around the data structures that happen to exist.

7. Confirmation bias

Confirmation bias is one of the most common pitfalls in research: analysts look only at evidence that supports their hypotheses rather than seeking out evidence that contradicts them.

In an exercise at ERT Simmons in Hunter, New York, participants were asked to solve a hypothetical crime, and most relied on preconceptions rather than concentrating on the evidence. "Most of the time when doing an analysis, we have an idea in mind, and when we go in search of statistics, we only see what confirms our notion," said Eric McGee, senior network engineer at TRG Datacenters.

Confirmation bias can also act as a self-fulfilling prophecy, particularly in how analysts handle results.

"If the results are aligned with our hypotheses, we do not debate these outcomes anymore," shared Theresa Kushner, the Senior Director of Data Intelligence and Automation at NTT Data Services. "But when the results do not support the forecast hypotheses, there is nothing we will not redo - the method, the data or the algorithms - for we know that there is an error somewhere".

Kushner suggested establishing a procedure for detecting bias in existing models and addressing it before they are used. NTT Data Services, for example, has an AI ethics governance program designed to guard against bias from the design phase through deployment and use.

8. Outlier bias

Outlier bias is a form of systematic error in which a few extreme data points distort summary statistics such as the mean.

"If you use Jeff Bezos, for instance, in an attempt to look at average American incomes, your analysis is going to go significantly off track because of his wealth," noted Rick Vasko, director of service delivery and quality at Entrust Solutions.

Outliers threaten the credibility of results because they are values far removed from most of the measurements in a data set. For instance, if nine out of ten individuals in a population hold less than $5,000 in cash and one holds $10,000, the $10,000 holder is an outlier; removing that value, or reporting a robust statistic such as the median alongside the mean, yields more valid results, as the sketch below shows.
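Here is a small numerical illustration of that effect; the cash figures are invented.

```python
import statistics

# Cash holdings for ten individuals; one extreme value skews the mean.
cash = [3_200, 4_100, 2_800, 4_900, 3_500,
        4_400, 3_900, 2_600, 4_700, 10_000]

print(statistics.mean(cash))    # 4410.0: pulled upward by the outlier
print(statistics.median(cash))  # 4000.0: robust to the extreme value

# Dropping the outlier brings the mean back in line with the median.
trimmed = [x for x in cash if x < 5_000]
print(statistics.mean(trimmed))  # ~3788.9
```

Whether to drop an outlier or keep it and use robust statistics depends on the question: an extreme value that is a data-entry error should go, while a genuine one may be exactly what the analysis needs to explain.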

Conclusion

Bias can significantly influence decision-making and fairness by distorting the results of data analysis. Mitigating it begins with recognizing its presence in both the data and the analysts themselves. Common forms include recycling existing patterns, training models on the wrong objective, and under-representing populations.

Mitigation strategies include contextualizing AI and its business impact, promoting diversity, moving to event-sourced data that reflects real-time change, and continuously monitoring deployed AI systems. Addressing statistical and confirmation biases through rigorous testing and robust governance is likewise essential to ethical data analysis.

Handling outliers carefully also improves the precision of an analysis. Overall, mitigating bias leads to more credible and inclusive decision-making, benefiting both businesses and society as a whole.





