Data Science Concepts Every Analyst Should Know


As mentioned in my previous article Three Myths About Data Science Debunked, sooner or later business analysts will be involved in a project with a machine learning or AI component. While BAs don't necessarily need to know how statistical models work, understanding how to interpret their results can give them a competitive advantage.

This article discusses three concepts that can help analysts add value to data science projects (future articles will cover additional ones). Cultivating skills in these areas will increase your ability to build cross-functional alignment between business and data science teams and prevent bad decisions based on flawed analyses.

1) Evaluation metrics for machine learning models

In both business and nonprofit contexts, it’s common for data science teams to produce machine learning models with the purpose of understanding or predicting things: ad clicks, monthly sales, likelihood of students dropping out of school. While such models can aid in operations and fulfill critical business needs, it’s common for the combination of a weak analytics team and a manager with undue faith in analytical models to cause costly mistakes.

I've seen managers get excited by the news that a data scientist achieved 95% accuracy in a prediction model. Without more context, that means nothing: if you're predicting complications for patients receiving a particular treatment, and only 5% of those patients get complications, a model that simply guesses "No complications" for every patient will be 95% accurate (and useless).
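To make the point concrete, here is a minimal sketch (with made-up numbers, not real patient data) showing how an always-negative "model" scores 95% accuracy while detecting none of the actual complications:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical outcomes for 1,000 patients: 50 (5%) experience complications (1)
y_true = [1] * 50 + [0] * 950

# A "model" that predicts "no complications" (0) for every single patient
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))   # 0.95 -- looks impressive
print(recall_score(y_true, y_pred))     # 0.0  -- misses every real complication
```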

Commonly used model evaluation metrics include the Standard Error of the Regression and R-Squared for regression analysis, and the Confusion Matrix and F1 Score for classification problems. A BA with a good understanding of such metrics can help their organization avoid foolish mistakes like relying on a classification model with 90% accuracy but an unreasonably high false positive rate, or a regression model whose predictions are only within +/- 25% when an acceptable interval would be +/- 5%.
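As an illustration (again a sketch with invented numbers), the snippet below shows how a classification model can be 90% accurate overall while still raising a false alarm for 10% of the patients who never have complications; the confusion matrix makes that trade-off visible where the headline accuracy number hides it:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

# Hypothetical labels: 900 patients without complications (0), 100 with complications (1)
y_true = np.array([0] * 900 + [1] * 100)

# Hypothetical predictions: 90 healthy patients wrongly flagged, 90 of 100 true cases caught
y_pred = np.concatenate([
    np.array([1] * 90 + [0] * 810),   # predictions for the 900 healthy patients
    np.array([1] * 90 + [0] * 10),    # predictions for the 100 patients with complications
])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(accuracy_score(y_true, y_pred))   # 0.90 -- the headline number
print(fp / (fp + tn))                   # 0.10 -- false positive rate
print(f1_score(y_true, y_pred))         # ~0.64 -- balances precision and recall
```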

2) Sampling bias

Even in the era of big data, sampling remains a key technique in data science projects, for efficiency and other reasons. But because sampling bias alters the distribution of observed data, analysis based on biased samples can lead to highly inaccurate results.

Imagine a company that decides to run a customer survey to gather input for prioritizing product enhancements in a mobile app. A glitch in the process causes the survey link to be sent only to iPhone users. Knowing that half of the customers use Android phones, a BA knowledgeable about sampling bias would be able to raise the alarm and seek a solution, such as sending the survey to a new batch of 100 customers selected at random, before the data goes off for analysis.
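A quick simulation (a sketch with invented preference rates, not real survey data) shows how much an iPhone-only sample can distort the estimate compared with a random sample of the same size:

```python
import random

random.seed(42)

# Hypothetical customer base: half Android, half iPhone, with different priorities.
# Assume 70% of Android users want "offline mode" versus 30% of iPhone users.
customers = (
    [{"platform": "android", "wants_offline": random.random() < 0.7} for _ in range(5000)]
    + [{"platform": "iphone", "wants_offline": random.random() < 0.3} for _ in range(5000)]
)

def support_rate(sample):
    return sum(c["wants_offline"] for c in sample) / len(sample)

# Biased sample: the survey link only reached iPhone users
iphone_only = [c for c in customers if c["platform"] == "iphone"][:100]

# Corrected sample: 100 customers selected at random from the whole base
random_sample = random.sample(customers, 100)

print(f"iPhone-only estimate: {support_rate(iphone_only):.0%}")      # roughly 30%
print(f"Random-sample estimate: {support_rate(random_sample):.0%}")  # roughly 50%
```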

3) False discoveries in experimentation

Many businesses switch from analyzing large observational data sets to experimentation when they need to understand whether a relationship between variables is reliably predictive. Experiments involving A/B tests or multivariate tests are regularly used to optimize product features, project efficiency, or revenue. In such projects, BAs with a solid grounding in statistical significance and how to interpret p-values can help prevent embarrassing executive decisions based on false discoveries.

Consider a company that wants to learn which color of button leads to more clicks on a piece of advertising. It runs an experiment with the current standard blue and 20 other colors to see if one of them generates a higher ad-click percentage.

The data science team uses p-values to determine whether the differences in clicks are statistically significant, and finds that the color orange has a p-value 10x lower than the standard significance level of 0.05. Should the decision-maker trust the conclusion that the orange button will outperform the blue one?

A BA who understands how significance tests work knows that the more tests are performed, the higher the probability of getting a significant result simply due to chance. When 20 tests are run, there is a 64% chance of observing at least one significant result even if none of the colors actually has an effect. Because the experiment involved so many comparisons, it's impossible to draw a meaningful conclusion without more data.
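The arithmetic behind that 64% figure is straightforward: at a 0.05 significance level, each individual test has a 95% chance of not producing a false positive, so the chance that at least one of 20 independent tests does is 1 - 0.95^20. A quick check (assuming independent tests):

```python
alpha = 0.05      # significance level of each individual test
num_tests = 20    # one comparison per alternative button color

# Probability that at least one test comes out "significant" purely by chance,
# assuming independent tests and no color that truly outperforms blue
p_at_least_one_false_positive = 1 - (1 - alpha) ** num_tests
print(f"{p_at_least_one_false_positive:.0%}")   # ~64%
```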

(On a side note, if involved earlier in the process, the same BA might have suggested a more carefully designed experiment that could lead to a more informative test.)

Likewise, a knowledgeable BA would be able to spot the problem when a naive A/B test experimenter stops an experiment as soon as a positive effect reaches 90% confidence, and explain to stakeholders how improper optional stopping inflates the probability of getting a significant result purely by chance, driving up the false discovery rate.
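A short simulation (a sketch with made-up click rates) illustrates the problem: even when the two variants are identical, an experimenter who checks the results after every batch of visitors and stops at the first "significant" peek will declare a winner far more often than the nominal 10% error rate suggests:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_experiment(n_peeks=10, batch=200, alpha=0.10):
    """A/B test with NO real difference, checked after every batch of visitors."""
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(n_peeks):
        a = np.concatenate([a, rng.binomial(1, 0.05, batch)])  # 5% baseline click rate
        b = np.concatenate([b, rng.binomial(1, 0.05, batch)])  # identical variant
        _, p = stats.ttest_ind(a, b)
        if p < alpha:            # stop as soon as the difference looks "significant"
            return True
    return False

runs = 2000
false_alarms = sum(peeking_experiment() for _ in range(runs))
print(f"False discovery rate with optional stopping: {false_alarms / runs:.0%}")
# Well above the nominal 10%, even though variants A and B are identical
```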

# # #

In an ideal world, all organizations would have data science teams with enough statistical knowledge and understanding of the business domain to avoid mistakes in hypothesis formulation, data collection, and model validation. In practice, machine learning work is often performed by analysts and engineers who lack strong grounding in the fundamentals of applied data science. Since powerful, actionable insights start not with data, but with a valid business question, business analysts are particularly well-positioned to help close that gap. By acquiring practical knowledge of the three concepts discussed here, BAs can greatly increase their ability to protect their organization against dubious insights, flawed models, and recommendations that haven't reached the necessary confidence levels.

Below are a few useful resources for anyone interested in learning more about the topics covered:


Author: Adriana Beal

Adriana Beal worked for more than a decade in business analysis and product management, helping U.S. Fortune 500 companies and high-tech startups make better software decisions. Prior to that she obtained graduate degrees in Electrical Engineering and Strategic Management of Information in her native country, Brazil. In 2016 she earned a certificate in Big Data and Data Analytics from the University of Texas, and since then she has been working on machine learning and data science projects in healthcare, mobility, IoT, customer science, and human services. Adriana has two IT strategy books published in Brazil and work published internationally by IEEE and IGI Global. You can find more of her useful advice for business analysts at bealprojects.com.

 


