Inside the Mind of a Data Analyst

So let’s first talk about what this article is NOT about. There will be no mention of R or Python or analytics software or machine learning algorithms. Nothing about data quality or data visualization or data modelling. Nope, not even data techniques or rules or best practices.

As an aspiring data analyst, I wanted to know everything about the analytics world and I used to put immense value on all things mentioned above (and I still do). All this knowledge and education taught me how to work as an analyst. But hands-on experience over the last few years taught me something else, something far more important. It taught me how to think as an analyst.

This article is about analytical thinking (or well.. at least the way I think!). It’s about being mindful of the things that might influence or affect your decision making skills. It’s about being aware of the thought process you follow and the impact it might have.

I’ll be focusing on 4 key areas which have helped me greatly in the way I approach a data problem and I hope they help you too..

No, I’m not biased.. Or am I?

A couple of years back, I got interested in learning about the way people think, an interest which later translated into a genuine love for the subject when I picked up a book called Freakonomics. Cognitive biases, consumer psychology, behavioral economics.. it taught me a lot about how biases affect your decision making. Since this article is not only about biases, I’ll list just one here.

Confirmation bias – Knowingly or unknowingly, many fall victim to this. Everyone has an assumption or an opinion when working with data and that’s where the trouble starts. You find a desire to get results in line with your opinion because you want to prove your point. This happens because it’s difficult to process or accept something new, something that challenges your beliefs.

What I learnt? When working on a data science project in grad school, I was tasked with identifying “the best places to live in NYC” by analyzing rental data. Before even looking at the data, my mind was telling me – “Manhattan definitely at the top – great buildings, easy accessibility, lots of jobs”. An assumption was made, an opinion was formed. I had to select a few variables from the hundreds available to do the analysis. I picked something along the lines of cleanliness quotient (high), public transit options (high), unemployment rate (low). Manhattan was the clear winner and I concluded that I had done a good analysis. In reality, I had merely rationalized my opinion. What if I had selected variables like cost of living or average rent or population density?

Correlation is NOT Causation

Let’s start off with an example that I came across recently. If you chart data of ice cream sales and deaths by drowning over the course of a year, you’ll see a correlation between the two. So does that mean eating ice cream causes people to drown? Of course not. This correlation is because of a confounding variable – in this case, the weather. Since it’s hot during the summer, more people tend to eat ice cream and also go swimming. Hence increased ice cream sales and also more people drowning. But when the weather starts to cool down, there are fewer people who eat ice cream or go out swimming. A confounding variable is nothing but a third variable that influences both of the others and so produces the correlation between them.
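For anyone who prefers to see this in numbers, here is a minimal sketch of the ice cream example. All of the data is simulated and every coefficient is invented for illustration: temperature drives both ice cream sales and drownings, and the two never affect each other directly. The sketch compares the raw correlation with the correlation left over once temperature has been accounted for (a partial correlation, computed here by regressing each series on temperature and correlating the residuals).

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


# Simulated days. The numbers below are made up, not real sales or safety data.
rng = random.Random(0)
temperature = [rng.uniform(0, 35) for _ in range(5000)]

# Both series depend on temperature plus independent noise, not on each other.
ice_cream_sales = [10 * t + rng.gauss(0, 40) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1.5) for t in temperature]

raw = pearson(ice_cream_sales, drownings)
partial = pearson(residuals(ice_cream_sales, temperature),
                  residuals(drownings, temperature))

print(f"raw correlation:               {raw:.2f}")
print(f"after controlling for weather: {partial:.2f}")
```

In this simulation the raw correlation comes out strong, while the correlation after controlling for temperature sits close to zero, which is what you would expect when the weather is doing all the work.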

What I learnt? Always look out for a confounding variable. Causation has to be established, not assumed: a correlation only points to a cause when there is a direct relationship between the variables involved. From a data science-y perspective, this helps identify variables that should or should not be used when building a data model. If correlation is accepted at face value, it has the potential to seriously affect business decisions, strategic initiatives, marketing plans etc. Imagine if people accepted the ice cream and drowning correlation and stopped manufacturing or selling ice cream altogether!

Deliver Value, Not Numbers

What’s the difference between Data Reporting and Data Analysis? Data reporting involves answering the What? question. But as a data analyst, you’re expected to answer the Why?. Let’s say there was a 7% lift in sales for business A and a $40k decrease in revenue for business B this quarter – everyone loves numbers, but you don’t provide value unless and until you explain the impact of those numbers. Do these numbers mean that business A is doing well and business B is doing poorly? What if the expected lift in sales for business A was 20%, and what if business B had been losing an average of $500k every quarter for the last 2 years?

What I learnt? You need to look at the bigger picture. I believe understanding the importance of the data and its subsequent impact is vital before diving into the analysis. What’s the problem I’m trying to solve? How is this analysis going to impact the business/product/society? Who are the stakeholders involved and what are their expectations? Asking questions is important because communicating your findings effectively is a key skill for every analyst. If you don’t ask questions, you don’t get to know your audience, and if you don’t know your audience, who are you delivering value for?

Context is King

This is in fact a continuation of where I left off in the previous section. If you look closely, the example of business A and business B is also trying to say something else – context. In this world of man vs. machines, it’s the ability to put things into perspective and understand the context that sets us apart (that is, until Artificial Intelligence takes over). Let’s say your online business generated $200k in Q4 of this year, an increase of 80% over Q3. That sounds impressive. But what if this online business only sells winter wear – sweaters, scarves, coats etc.? Your high earnings during Q4 are because well.. it’s winter. Is it then right to compare these results with Q2 or Q3 when people are much more likely to shop for beach shorts than sweaters? This is where experience or a conscious understanding of everything around us helps us analysts make sense of the data.
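To put the winter wear example in numbers, here is a small sketch comparing the same Q4 result against two different baselines. The $200k and the 80% increase come from the example above. Last year’s Q4 figure is purely hypothetical, invented only to show how the story can flip once you compare like with like.

```python
def pct_change(current, previous):
    """Percentage change going from previous to current."""
    return (current - previous) / previous * 100


q4_this_year = 200_000             # from the example above
q3_this_year = q4_this_year / 1.8  # implied by the 80% increase over Q3
q4_last_year = 210_000             # HYPOTHETICAL figure, invented for illustration

quarter_over_quarter = pct_change(q4_this_year, q3_this_year)
year_over_year = pct_change(q4_this_year, q4_last_year)

print(f"vs. Q3 this year: {quarter_over_quarter:+.1f}%")
print(f"vs. Q4 last year: {year_over_year:+.1f}%")
```

Under that assumed figure, the same quarter that looks like an 80% jump next to Q3 turns out to be slightly down on the previous winter. The data did not change, only the context did.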

What I learnt? When I first started with analytics and was working predominantly on open data, the only thing that mattered to me was the spreadsheet and the data it contained. I never asked questions like – Which business is this? Where are they located? What were their sales last year? But today I do. In fact, the biggest change for me in recent times has been the way I now actively look for context rather than just restricting myself to the information at hand. For instance, if I’m dealing with real world data, I just Google everything. So if I spot some outlier in the data for a particular day or year, I Google to see if something happened during that particular time frame that might explain the unusual behavior pattern.

Just being mindful of things like these has helped me take a much more holistic approach towards data. Datasets are not just a few columns in a spreadsheet, they are part of something much bigger.

This was my story. What’s yours?