What to ask when looking at data

I listen to More or Less, a podcast all about data and statistics in the news.

The podcast investigates numbers that have been raised in the news, asking whether each stat is accurate and what it actually means. One of the mantras of presenter Tim Harford is "Is it a big number?".

The question "Is it a big number?" isn't really about the size of the number; it's about the context of the number.  Is it a big number compared to other numbers on the same subject?

It's these extra types of questions that we should be asking ourselves whenever we come across a data point.

What is being measured?

I've worked in many organizations where the terminology being used isn't very clear, and as a result the actions you take based on the data might not always be optimal.

Say you're looking at sales in a retailer with a long sales cycle. When is a sale classified as a sale?

  • Is it when a deposit has been placed?
  • Is it when the product is delivered?
  • Is it when the final payment has been received?

Do all the people in the organization use the same definition of a sale? Or do the sales people count it from final payment, as that's when their commission kicks in, while the finance people count it from when the product has been delivered, as that's when the stock has been transferred out of the business?
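As a rough illustration of why the definition matters, here is a minimal Python sketch that counts the same three orders under each of the three definitions listed above. The Order fields, the dates and the count_sales helper are all hypothetical, invented for this example rather than taken from any real system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Order:
    # Hypothetical milestone dates; None means the milestone hasn't happened yet.
    deposit_on: Optional[date]
    delivered_on: Optional[date]
    paid_in_full_on: Optional[date]


def count_sales(orders, milestone, start, end):
    """Count orders whose chosen milestone date falls within [start, end]."""
    count = 0
    for order in orders:
        reached_on = getattr(order, milestone)
        if reached_on is not None and start <= reached_on <= end:
            count += 1
    return count


# Made-up orders for a retailer with a long sales cycle.
orders = [
    Order(date(2024, 1, 5), date(2024, 2, 10), date(2024, 3, 1)),
    Order(date(2024, 1, 20), date(2024, 3, 2), None),
    Order(date(2024, 1, 28), None, None),
]

quarter_start, quarter_end = date(2024, 1, 1), date(2024, 3, 31)

sales_by_definition = {
    milestone: count_sales(orders, milestone, quarter_start, quarter_end)
    for milestone in ("deposit_on", "delivered_on", "paid_in_full_on")
}
print(sales_by_definition)
```

With these made-up orders, the quarter has three sales by deposit, two by delivery and one by final payment, so two teams could each report a different number from the same data.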

If an organization doesn't have agreed definitions of what each term means, then there is scope for crossed wires and confusion, so whenever you look at data it's important that you understand exactly what is being reported.

Where has the data come from?

"The majority of customer support calls this week have been from customers with issues logging in to the product" is something that the Head of Customer Success might come and tell you, but where has the information come from?

Is there a report that you can view that shows this statistic, and how was the data in that report generated?  Or is it the customer success team's opinion, because they've had a number of customers with major issues in this area and that's what they perceive as being the biggest issue of the week?

Do you trust the source of the data, and if not, how can you get to the point where you do believe what is being presented to you?

You don't want to divert resources in one direction when the reality is that there is a bigger problem in another area that just isn't apparent from the data source you're seeing.

Is it causation or just correlation?

Firstly, what do we mean by correlation and causation?

Correlation is when two variables move in relation to each other.  For example, this week the number of new sign ups went up and income went up.

Causation is when one of the variables moves as a direct result of the movement of the other variable.  For example, this week the number of paying sign ups went up and income from paying sign ups went up.

If we look back at the example we gave in the correlation definition, where the number of new sign ups went up this week, as did income, we can say that just because sign ups increased it doesn't mean that the income increase was a direct result of it.  The increase in new sign ups could all be from free subscribers, and all the income increases could come from existing customers who were upgrading.
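One way to check this is to split the income change by where it came from. The sketch below is only an illustration in Python: the figures and the names (new_sign_ups, income_by_source, income_change_by_source) are made up for the example.

```python
# Hypothetical weekly figures, invented for illustration only.
last_week = {
    "new_sign_ups": 250,
    "income_by_source": {
        "new_paying_sign_ups": 500,
        "existing_customer_upgrades": 1000,
    },
}
this_week = {
    "new_sign_ups": 400,
    "income_by_source": {
        "new_paying_sign_ups": 500,
        "existing_customer_upgrades": 1600,
    },
}


def income_change_by_source(before, after):
    """Return the week-on-week income change for each income source."""
    return {
        source: after["income_by_source"][source] - before["income_by_source"][source]
        for source in after["income_by_source"]
    }


sign_up_change = this_week["new_sign_ups"] - last_week["new_sign_ups"]
income_changes = income_change_by_source(last_week, this_week)

print(sign_up_change)   # 150
print(income_changes)   # {'new_paying_sign_ups': 0, 'existing_customer_upgrades': 600}
```

In this made-up week, sign ups and total income both rose, but none of the extra income came from new sign ups. The two numbers are correlated without one causing the other.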

As such, it's important for us as product managers to understand what the real cause of a data point moving is, so that when we take action we're taking action on the right thing.

Is it a big number?

And then back to the question I asked at the start.  Is it a big number?

"100 customers cancelled their subscription this week"

Of course, it never feels great to lose customers, but in every product there is a natural level of customer churn, so is losing 100 customers good or bad?

If you've got 105 customers and in one week you lose 100, then it seems pretty bad.

If you've got 1 million customers and lose 100, then the number takes on a different perspective.
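The arithmetic behind those two scenarios can be sketched in a few lines of Python. The churn_rate function is a hypothetical helper written for this example, and it assumes the customer count is taken at the start of the week.

```python
def churn_rate(cancelled, customers_at_start):
    """Return cancellations as a percentage of the starting customer base."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100 * cancelled / customers_at_start


small_base = churn_rate(100, 105)
large_base = churn_rate(100, 1_000_000)

print(f"{small_base:.2f}%")  # 95.24%
print(f"{large_base:.2f}%")  # 0.01%
```

The same 100 cancellations are roughly 95% of one customer base and a hundredth of a percent of the other.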

It's for this reason that we should use other measures to provide context.  For example, percentages might show you how things change in relation to the whole, or trends might show you how changes are occurring over time.

"0.25% of customers cancelled their subscription this week, down against the quarterly average of 0.35%" is a more context-driven set of statistics.