My Data Science Experience


Detailing my bootcamp journey

Understanding ACF and PACF Plots

Time series forecasting uses models trained on data collected at regular intervals (hourly, daily, weekly, etc.) to predict future, meaningful values based on previously observed values. There are several statistical models that help us forecast such data, and one of the most common is the ARIMA model. ARIMA stands for AutoRegressive Integrated Moving Average and has three parameters (p, d, q): the autoregressive order p, the degree of differencing d (which removes trends), and the moving average order q (which models the noise in the data). One of the ways to estimate these parameters is to examine two graphs: the ACF (autocorrelation function) and PACF (partial autocorrelation function) plots.
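As an illustration, here is a minimal sketch of how these plots can be generated with statsmodels, assuming a pandas Series called series that holds the time series after any differencing:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# series is assumed to be a pandas Series containing the (differenced) time series
fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(series, lags=30, ax=axes[0])    # significant lags here help suggest the MA order q
plot_pacf(series, lags=30, ax=axes[1])   # significant lags here help suggest the AR order p
plt.tight_layout()
plt.show()

Lags whose bars extend beyond the shaded confidence band are the ones worth considering when choosing p and q.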


**Multiple Comparisons**

Imagine the following scenario: you are a data scientist asked to help a business make sense of all the data coming in about different aspects of their company. This can include employee performance, discounts, freight costs and sales of their products. As a data scientist, your job is to perform experiments, learn as much as possible from those findings and come up with recommendations for the business to follow. Usually this is done by defining several useful metrics or segmentations; both strategies help you learn more about the business in different ways. In the case of segmentation, you may run statistical analyses to discover significant or non-significant findings; for example, whether discounts increase product revenues and which discount level increases revenues the most. This means digging deeper to find insights. But it comes at a cost: the more comparisons you make, the more likely you are to produce a false positive, or come to a conclusion that is incorrect. This common mistake is known as the multiple comparisons problem, and this post will highlight how it can affect your statistical analysis, along with a technique I used to correct for it.

For this project, I was given a database from the Northwind Co., a food and beverage distributor, and was asked to find insights that would improve their business. Their database consists of 13 tables, all linked by unique ID keys. One table that caught a lot of my interest was the Order Detail table, which contains order quantities, unit prices and discounts. One of the questions I needed to answer was whether discounts affected order quantities, and in particular whether there was an optimal discount level that affected order quantities more than the others. This question is an example of how one can run into the multiple comparisons problem, because you need to compare all the combinations of discount levels and perform the same test multiple times.

In every statistical analysis, it is important to define three things: a null hypothesis, an alternative hypothesis and a significance level. The null hypothesis states that there is no difference between the metrics we are comparing; the alternative hypothesis states that there is a difference, for example, that discounted products account for larger order quantities than products with no discount. The significance level, also known as alpha, is the threshold we set for how much risk of a false positive (rejecting a true null hypothesis) we are willing to accept. In the case of the Northwind project, the significance level used was 0.05. This means that in a world where the level of discount doesn't matter to order quantities, we would still measure that discounts affect order quantities 5% of the time, purely due to random noise.

After the significance level is established and the experiment is performed, we calculate the experiment's p-value. If that value is less than 0.05, we can reject the null hypothesis; alternatively, if the p-value is greater than 0.05, we fail to reject the null hypothesis. However, this is not the entire story, because obtaining a low p-value doesn't guarantee that the null hypothesis is incorrect. For example, a p-value of 0.001 still means that, if the null hypothesis were true, we would see a result this extreme about 1 time in 1,000. And if we perform the same kind of test many times, those small chances of a false positive accumulate, as illustrated below. In order to avoid this problem in the Northwind project, I used ANOVA and Tukey tests to perform the multiple comparisons.
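As a rough illustration of why this matters (the numbers here are illustrative, not taken from the Northwind analysis), the chance of at least one false positive grows quickly with the number of independent comparisons run at a 0.05 significance level:

# Probability of at least one false positive across n independent tests,
# each run at alpha = 0.05 (illustrative calculation only)
alpha = 0.05
for n in (1, 10, 45, 100):
    fwer = 1 - (1 - alpha) ** n
    print(f"{n:>3} comparisons -> chance of at least one false positive: {fwer:.0%}")

Even a few dozen comparisons make a spurious "significant" result more likely than not, which is exactly why corrections for multiple comparisons are needed.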
ANOVA stands for analysis of variance and is a test that compares multiple groups in a single statistical analysis. This makes it a useful alternative to running many pairwise t-tests when you need to compare several groups. For the Northwind sales analysis, I used ANOVA when comparing the different discount levels. Using statsmodels to generate an ANOVA table:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit an OLS model of order quantity on discount level, treated as categorical
lm = ols('Quantity ~ C(Discount)', data=discounted).fit()
lm.summary()
sm.stats.anova_lm(lm, typ=2)  # ANOVA table for the fitted model

In this table, we can see the p-value generated. The next step is performing a Tukey test. The Tukey test (Tukey's HSD) is a single-step statistical test used to find means that are significantly different from each other. For the multiple comparisons, we first need to create a data frame covering the varying discount levels. Then we import the libraries for the Tukey test:

from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison

# Compare order quantities across discount levels at a 0.05 significance level
tk = pairwise_tukeyhsd(discount_tukey.Quantity, discount_tukey.Discount, 0.05)
print(tk)

The resulting table gives us every pair of discount levels to compare, along with the mean difference and whether that difference led us to reject or fail to reject the null hypothesis. In the case of the Northwind project, I found that for the most part the discount level a product was priced at didn't affect order quantities. In fact, there is a possibility that these discounts were negatively affecting revenues. The multiple comparisons problem is an important issue to keep in mind when performing statistical analyses or examining studies, because if we look hard enough, we may see correlations where they don't exist. Humans are very adept at finding patterns and can make the mistake of finding meaning in what is random noise. From a business standpoint, being aware of this means analyzing claims closely and making more rational decisions.


Module 01 Project: Data Cleaning

For the Module 1 project, we were to take a dataset of house prices in King County, Washington and create a multivariate model that would accurately predict those prices. While there are several steps in the OSEMN method to get there, one of the most important is the second step, known as "Scrubbing" or cleaning the data. In this step, we take a dataset filled with outliers, missing data and ambiguous datatypes and get it ready to be explored further and modelled. Cleaning the dataset goes a long way toward improving the model's performance.

The first step in this process is to get a basic understanding of what is inside the dataset. This is done with the help of pandas' .head(), .info() and .describe() methods. The .head() method lets us view the first five entries in the dataset. From there, I could see a few columns that were not relevant to creating the model ("id" and "date"), so they were removed first. With the .info() method, you can see how many entries are in the dataset, how many non-null entries are in each column and what type of data you are working with. The King County dataset had a total of 21,597 entries, and there were 3 columns with missing data. I could also see that the sqft_basement column was an object datatype, which means it contained string values even though the column refers to the square footage of a house's basement. I decided to investigate this column first. Upon checking its unique values, there were over 400 "?" placeholder values. I decided to delete this column because it also had so many zero values.

The next step was using the .describe() method, which gives us summary statistics for each column: the count, mean, standard deviation, minimum, maximum and quartiles (including the median). From there I could see that the bedrooms column had a suspiciously large maximum value, and that the waterfront, yr_renovated and view columns had a lot of null values. First, I dealt with the bedrooms issue. I used pandas' .loc indexer to check whether the maximum value was plausible for the size of the house. Upon inspection, the house in question had 33 bedrooms on one floor and 1.75 bathrooms. Since that did not seem possible, I deleted the row from the dataset.

Next, I dealt with the null values. To confirm where they were, I used pandas' .isna().sum() to count the missing values in each column. In the end, the waterfront column had over 2,000 null values, the yr_renovated column had over 3,000 null values and the view column had 63 null values. There are several ways to deal with columns containing null values: you can drop the columns completely; you can fill the null values with another value that is relevant to the column, or with a placeholder; or you can leave them alone. It all depends on what type of data you are dealing with. First, I dealt with the waterfront column. Using the .unique() method, I saw that the column consisted only of 0, 1 and NaN (not a number), which means the column refers to whether a house has a waterfront view or not. Since having waterfront property can affect the price, I decided not to remove it. To deal with the null values, I separated the dataset into houses that had a waterfront view and houses that didn't, then used pd.concat() to bring the two subsets back together. I used a similar method on the view column.
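As a rough sketch of these cleaning steps (the column names follow the King County dataset; the filename is assumed, and the simple fillna calls stand in for the split-and-concat approach described above):

import pandas as pd

df = pd.read_csv('kc_house_data.csv')            # King County housing data (assumed filename)

# Drop columns that aren't relevant to the model, plus the '?'-riddled sqft_basement
df = df.drop(columns=['id', 'date', 'sqft_basement'])

# Remove the implausible 33-bedroom row
df = df.drop(df.loc[df['bedrooms'] == 33].index)

# Count missing values per column, then fill the waterfront and view nulls
print(df.isna().sum())
df['waterfront'] = df['waterfront'].fillna(0)    # simplified: treat unknown as no waterfront
df['view'] = df['view'].fillna(0)                # simplified: treat unknown as no views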
For the yr_renovated column, I initially tried to keep the column but soon realized that the year the house was renovated had no clear relationship with the price of the house, so I removed it. After making sure that there were no more null values, I checked which features would fit the model best. This was accomplished using scatter plots and histograms. These visualizations gave me an idea of how the features were distributed and whether any fixes needed to be made to improve their quality. For example, the histogram for the sqft_living column was highly skewed, so I log-transformed it to improve its distribution. The most interesting feature to deal with was zip code. Since location is important to house prices, it needed to be investigated. From the zip code column's visualizations it looked categorical, so I decided to one-hot encode it using pandas' .get_dummies() function, as sketched below. This way, the houses could be separated by their individual zip codes.

Making sure the dataset you are working with is clean is very important. In the real world, data is messy for a variety of reasons. As data scientists and aspiring data scientists, we must learn to deal with these issues effectively in order to answer some very important questions for business stakeholders.
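Continuing the sketch from above, the remaining feature-engineering steps might look like this (assuming the same df and the King County column names sqft_living and zipcode):

import numpy as np
import pandas as pd

# yr_renovated showed no clear relationship with price, so drop it
df = df.drop(columns=['yr_renovated'])

# Log-transform the highly skewed sqft_living feature
df['sqft_living_log'] = np.log(df['sqft_living'])

# One-hot encode zip codes so each one becomes its own binary column
zip_dummies = pd.get_dummies(df['zipcode'], prefix='zip', drop_first=True)
df = pd.concat([df.drop(columns=['zipcode']), zip_dummies], axis=1)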