Module 01 Project: Data Cleaning

Posted by Hazel Donaldson on June 18, 2019

For the Module 1 project, we were to take a dataset of house prices in King County, Washington and create a multivariate model that would accurately predict those prices. While there are several steps in the OSEMN method to get there, one of the most important is the second step, known as "Scrubbing" or cleaning the data. In this step, we take a dataset that is filled with outliers, missing data, and ambiguous datatypes and get it ready to be explored further and modeled. Cleaning the dataset goes a long way toward improving the model's performance.

The first step in this process is to get a basic understanding of what is inside the dataset. This is done with the help of pandas' .head(), .info(), and .describe() methods. The .head() method lets us view the first five entries in the dataset. From there, I could see a few columns that were not relevant to creating the model ("id" and "date"), so they were removed first. The .info() method shows how many entries are in the dataset, how many non-null entries are in each column, and what type of data you are working with. In the King County dataset there were 21,597 entries in total, and three columns were missing data. I could also see that the sqft_basement column was an object datatype, which means it held string values even though the column referred to the square footage of a house's basement. I decided to investigate this column first. Upon checking its unique values, there were over 400 "?" placeholder values. Because the column also contained so many zero values, I decided to delete it.

Next I used the .describe() method, which gives the summary statistics of each numeric column: the count, mean, standard deviation, minimum and maximum values, and the quartiles (including the median). From there I could see that the bedrooms column had a suspiciously large maximum value, and that the waterfront, yr_renovated, and view columns had a lot of null values. First, I dealt with the bedrooms issue. I used pandas' .loc indexer to check whether the maximum value was plausible for the size of the house. Upon inspection, the house in question had 33 bedrooms on one floor and 1.75 bathrooms. Since that did not seem possible, I deleted the row from the dataset.

Next, I dealt with the null values. To confirm where they were, I used .isna().sum() to count how many missing values each column had. The waterfront column had over 2,000 null values, the yr_renovated column had over 3,000, and the view column had 63. There are several ways to deal with columns that contain null values: you can drop the column completely, fill the null values with another value that is relevant to the column (or with a placeholder), or leave them alone. It all depends on what type of data you are dealing with. First, I dealt with the waterfront column. Using the .unique() method, I saw that the column consisted only of 0, 1, and NaN (not a number), which means it indicates whether or not a house has a waterfront view. Since having waterfront property can affect the price, I decided not to remove it. To deal with the null values, I separated the dataset into houses that had a waterfront view and houses that did not, and then used pandas' pd.concat() function to bring the two subsets back together. I used a similar method on the view column.
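A minimal sketch of these cleaning steps in pandas is below. The file name and the choice to fill the missing waterfront values with 0 are my assumptions; the post only describes splitting the data into the two waterfront groups and recombining them with pd.concat().

```python
import pandas as pd

# Load the King County housing data (file name assumed here)
df = pd.read_csv('kc_house_data.csv')

# First look at the data
df.head()        # first five rows
df.info()        # row count, per-column non-null counts, dtypes
df.describe()    # count, mean, std, min, quartiles, max for numeric columns

# Drop columns that are not relevant to the model
df = df.drop(columns=['id', 'date'])

# sqft_basement is stored as an object because of "?" placeholders;
# it also contains many zeros, so drop it
df['sqft_basement'].unique()
df = df.drop(columns=['sqft_basement'])

# Inspect the suspicious 33-bedroom house and remove that row
df.loc[df['bedrooms'] == 33]
df = df.drop(df.loc[df['bedrooms'] == 33].index)

# Count missing values per column
df.isna().sum()

# waterfront is a 0/1 indicator with NaNs; split the data into the two
# groups, fill the missing values with 0 (an assumption on my part),
# and concatenate the subsets back together
has_view = df[df['waterfront'] == 1]
no_view = df[df['waterfront'] != 1].fillna({'waterfront': 0})
df = pd.concat([has_view, no_view])
# the view column can be handled with a similar split-and-concat approach
```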
For the yr_renovated column, I initially tried to keep it, but I soon realized that the year a house was renovated had no clear relationship with its price, so I removed the column. After making sure there were no more null values, I checked which features would fit the model best. This was accomplished using scatter plots and histograms. These visualizations gave me an idea of how the features were distributed and whether any fixes were needed to improve their quality. For example, the histogram for the sqft_living column was highly skewed, so I log-transformed it to improve its distribution.

The most interesting feature to deal with was zip code. Since location is important to house prices, it needed to be investigated. From its visualizations the zip code column looked categorical, so I decided to one-hot encode it using pandas' .get_dummies() function. This way, the houses could be separated by their individual zip codes. (A short sketch of these two transformations is included at the end of this post.)

Making sure the dataset you are working with is clean is very important. In the real world, data is messy for a variety of reasons. As data scientists and aspiring data scientists, we must learn to deal with these issues effectively in order to answer some very important questions for business stakeholders.
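As promised above, here is a sketch of the log transform and the one-hot encoding. The column names sqft_living and zipcode come from the raw King County dataset; whether to overwrite the original column or create a new one, and the 'zip' prefix, are my own choices.

```python
import numpy as np
import pandas as pd

# Drop yr_renovated, which showed no clear relationship with price
df = df.drop(columns=['yr_renovated'])

# Log-transform the highly skewed sqft_living column
df['sqft_living_log'] = np.log(df['sqft_living'])

# One-hot encode zip codes so each one becomes its own 0/1 column
zip_dummies = pd.get_dummies(df['zipcode'], prefix='zip')
df = pd.concat([df.drop(columns=['zipcode']), zip_dummies], axis=1)
```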