Enterprises worldwide, especially technology companies, are continually amassing huge amounts of data, so they need specialists who can analyze and interpret it. This has given rise to “Data Science” specialists: people who apply statistical and probability techniques to the gathered data in order to produce relevant indicators, so that better decisions can be made based on the results.
What is data science about?
Data science is mostly about analyzing, exploring and understanding data in order to help organizations drive improvements or solve problems.
Where is data gathered from?
It can come from anywhere: text files, databases, even smart device sensors.
What can you do with the data?
You need tools to clean up the data and then to manipulate and analyze it. Many tools are available for this, from the widely known Microsoft Excel to more specialized programming tools like Python or R.
Why would you want to analyze the data?
Raw data might not be very informative per se, as it can very well be just a set of numerical or non-numerical values about one or more subject domains that you have gathered from several sources. To make sense of this data, and to be able to analyze and interpret it, you need to combine it and transform it into information.
One of the best ways to turn data into information is to summarize it statistically.
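As a minimal sketch of what summarizing data statistically can look like, the Python snippet below condenses a list of daily sales counts into a few descriptive statistics using the standard library's `statistics` module. The numbers are made up for illustration and do not come from any real data set.

```python
import statistics

# Hypothetical daily sales counts for one week (illustrative values only).
daily_sales = [42, 38, 51, 47, 35, 60, 44]

# Condense the raw numbers into a handful of descriptive statistics.
summary = {
    "count": len(daily_sales),
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "stdev": statistics.stdev(daily_sales),  # sample standard deviation
    "min": min(daily_sales),
    "max": max(daily_sales),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Seven raw numbers say little on their own, while the mean, median and spread already describe what a typical day looks like and how much days differ from each other.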
What can I do with the statistical information about data?
Statistical information is key for data analysis, especially if you are dealing with numerical data. You might be asked to determine the sales trend for a particular line of business, that is, how the volume of sales behaves over a specific time period.
In order to achieve this, at a high level, you need to gather the sales data and then consider the statistical variables that you are dealing with.
Key concepts of statistical analysis
There are several important basic concepts related to statistical analysis. In order to better illustrate how these concepts work, let’s use a very simple example.
DATA ANALYSIS BASICS - Part 2
The coffee cart
Let’s imagine that we own a mobile coffee cart that we usually run downtown, near busy places such as office buildings, a college or the train station.
Also, let’s say we have kept a record of our daily sales along with more details such as the coffee type, the price, where the cart was set up and the outdoor temperature.
Our recorded data might look like this:
Having recorded our daily sales data, we might want to analyze how the variables in our data relate to each other, that is, how closely they move together. This is what we call correlation.
How do you calculate correlation, and how can it help you analyze your data?
Correlation ranges from negative 1 to positive 1. The closer the value is to positive 1, the more closely the two variables move in the same direction (a positive correlation). The opposite is also true: the closer it is to negative 1, the more closely they move in opposite directions (a negative correlation). A value near 0 means there is little or no linear relationship between them.
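To make the idea concrete, here is a minimal Python sketch of the Pearson correlation coefficient, the measure most commonly meant by "correlation". The function name `pearson` and the sample values are assumptions made for illustration; they are not the coffee cart's recorded data.

```python
import math

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of the products of each pair's deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical readings: temperature in degrees and cups sold that day.
temperature = [10, 14, 18, 22, 26, 30]
cups_sold = [60, 55, 52, 41, 38, 30]

r = pearson(temperature, cups_sold)
print(f"r = {r:.4f}")  # negative: sales fall as temperature rises
```

Two variables that rise together in perfect lockstep give exactly 1, and two that move in perfectly opposite directions give exactly -1; real data usually lands somewhere in between.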
With our recorded data in mind, we want to analyze how the different variables relate to our sales, and which of them have a positive or a negative correlation with sales.
Looking at our data, we can pick out a few variables to compare: temperature, price, total sales, and revenue.
Calculating correlation using the Microsoft Excel data analysis tools
We will now use the data analysis tools integrated into Excel to quickly calculate and illustrate the correlation on our data.
First, we need to enable the tools if they are not already enabled. To do this, go into the “Add-Ins” options, use the “Manage Excel Add-ins” option, and enable the “Analysis ToolPak” add-in.
Now we will use the Data Analysis item located in the “Data” menu:
Then select “Correlation” from the options menu. Hit the “OK” button.
Now we need to define the data range containing the data to be analyzed, the grouping and where to put the generated result. For the range, we will select the complete content of the Temperature, Price, Total Sales and Revenue columns, including their headers.
We will also check the “Labels in First Row” option to indicate the first row contains the title of each variable. Lastly, we hit the “OK” button.
Excel will then create a new worksheet containing the resulting correlation matrix.
In our example, this is a very simple and brief matrix because we have selected only four variables.
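For readers who prefer code to a spreadsheet, the sketch below builds the same kind of correlation matrix in plain Python. It is not what Excel runs internally, only an equivalent calculation under stated assumptions: the helper names `pearson` and `correlation_matrix` are hypothetical, and the column values are invented for illustration rather than taken from the coffee cart's records.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Correlate every column with every other column (and with itself)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical records, one value per day (illustrative values only).
data = {
    "Temperature": [10, 14, 18, 22, 26, 30],
    "Price": [3.0, 2.5, 3.5, 3.0, 2.5, 3.5],
    "Total Sales": [60, 55, 52, 41, 38, 30],
}
# Revenue is derived here as price times units sold (an assumption of this sketch).
data["Revenue"] = [p * s for p, s in zip(data["Price"], data["Total Sales"])]

matrix = correlation_matrix(data)

# Print the matrix with one row and one column per variable.
print(" " * 12 + "".join(f"{name:>13}" for name in matrix))
for row_name, row in matrix.items():
    print(f"{row_name:<12}" + "".join(f"{value:>13.4f}" for value in row.values()))
```

Each variable correlates perfectly with itself, so the diagonal is always 1, and the matrix is symmetric: the value for row A and column B equals the value for row B and column A.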
DATA ANALYSIS BASICS - Part 3
How do we interpret the correlation matrix results?
As mentioned earlier, a correlation is a relationship between our analysis variables, and it can be either positive or negative in nature, ranging from -1 to +1.
The values in the correlation matrix are known as “correlation coefficients”, and they measure the degree to which the movements of two variables are associated. In general, the scale used to classify a coefficient as weak, moderate or strong can be depicted as follows:
We now need to analyze how these variables are related and what type of correlation they have.
Take one simple example: the correlation between the temperature and the total sales. We locate the value where the total sales row crosses the temperature column:
We can see that it has a correlation coefficient of “-0.7888”. This is a moderate negative correlation, meaning that as the temperature increases, the total sales tend to decrease. That makes sense because we are selling hot coffee, so we are likely to sell more on cold days, right?
This can also be expressed by saying that the variables move in opposite directions. Note that the coefficient is not a percentage of time: it measures how closely the data points follow a straight line. Squaring it (0.7888 × 0.7888 ≈ 0.62) tells us that about 62% of the variation in total sales is accounted for by a linear relationship with the temperature.
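A common way to turn a correlation coefficient into a percentage is to square it. The result, often called the coefficient of determination, is the share of the variation in one variable that a straight-line relationship with the other accounts for. The short sketch below applies this to the coefficient quoted above; the function name `r_squared` is a hypothetical helper chosen for this example.

```python
def r_squared(r):
    """Square a correlation coefficient to get the share of variation explained."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    return r * r

# Coefficient for temperature vs. total sales, as quoted in the text.
r_temperature_sales = -0.7888

share = r_squared(r_temperature_sales)
print(f"{share:.1%} of the variation is explained")  # prints "62.2% of the variation is explained"
```

The sign disappears when squaring, so the percentage says how strong the linear relationship is, while the sign of the original coefficient says in which direction it runs.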
If we represent this graphically on a scatter chart we will see that the total sales trend line goes downwards as the temperature increases:
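The trend line on a scatter chart is typically a least-squares fit, that is, the straight line that minimizes the squared vertical distances to the points. As a rough sketch of how such a line is obtained, the snippet below fits one to invented temperature and sales values; `fit_line` is a hypothetical helper and the data is illustrative only.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical readings: temperature in degrees and cups sold that day.
temperature = [10, 14, 18, 22, 26, 30]
cups_sold = [60, 55, 52, 41, 38, 30]

slope, intercept = fit_line(temperature, cups_sold)
print(f"sales = {slope:.2f} * temperature + {intercept:.2f}")
```

A negative slope is what a downward trend line looks like in numbers: each extra degree goes with fewer cups sold. The slope always has the same sign as the correlation coefficient.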
On the other hand, we also see an example of a weak positive correlation between the price and the revenue:
This can be expressed by saying that the variables tend to move in the same direction, so an increase in one tends to come with an increase in the other. In this case, the coefficient of roughly 0.18 indicates that a higher price goes with somewhat higher revenue in our data, but only weakly.
It is worth mentioning that because this correlation is weak, the relationship is unreliable: although increasing the price can increase our total revenue, the data gives little assurance that it will.
We can see the chart representation of this correlation below:
The correlation results can help us begin to understand possible courses of action based on the data.
In our example, we have seen that increasing the price can increase our revenue, although the matrix shows that this relationship is weak. So, does this mean that we can definitely increase the price without any consequence? Not necessarily… Why? Keep reading.
Take a look at the correlation between the price and the total sales:
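This shows a moderate negative correlation of “-0.7535”, meaning that higher prices tend to go with lower total sales. The coefficient is not a percentage of time; squaring it (0.7535 × 0.7535 ≈ 0.57) tells us that about 57% of the variation in total sales is accounted for by a linear relationship with the price.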
Based on this, we might think twice before we increase our price since, in the long run, it will most likely bring our total sales down. Yes, raising prices can increase revenue; however, there is also a good chance that we end up selling less, and that is not good for our business.
As you can see, correlation is a very basic analysis technique, and it shows only that variables move together, not that one causes the other. Still, it can be a good starting point for our data analysis. There are also more advanced statistical techniques that can greatly help us interpret the results. Stay tuned for upcoming data science entries!
Microsoft Professional Program / Data Science track (https://academy.microsoft.com/en-us/professional-program/data-science/)