Can you predict the temperature?
Yes… with Machine Learning
Introduction to machine learning
Machine learning is the study of computer programs that learn from experience. Some of its key ideas are:
- Automating automation
- Getting computers to program themselves
- Writing software is the bottleneck
- Let the data do the work instead
Here we are trying to predict temperature.
We are using Google Colaboratory.
Let's start the process.
There are several steps to follow when building a machine learning model.
- First we have to load the data.
- Next we have to clean (preprocess) the data.
- Then we have to transform the data.
- Perform data coding.
- Feature scaling.
- Feature discretization.
There are several tools for data analysis in Python. We can use Jupyter Notebook or Google Colaboratory. Today we are using Google Colab to create and run our Python notebook.
We are using the weatherHistory.csv file.
- Upload the data to Google Drive.
- Type “google colab” into the Google search bar and click on the link.
- Then click on the Colab and Python link.
- You will then see the page below.
- Go to the menu bar and click File -> New notebook.
- You will then get a new notebook, which you can rename as you wish.
You can use the link below to access and download the weatherHistory.csv file.
After uploading the data to your Google Drive, you have to give the notebook permission to access it. This is called mounting the drive.
Then click on the given link. After selecting your preferred Google account, click the Allow button and you will get a window like the one below. Now you only need to copy the given code, paste it into the box in the notebook, and press Enter.
You will then see output confirming that the drive is mounted, as in the image below.
Now Google Colab has permission to access our data, so we can load the CSV file. For that we use the pandas library: import pandas and call the read_csv() method, passing the path to our dataset inside the brackets. Pressing Shift+Enter runs the code. We then get the dataset as follows.
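The loading step might look roughly like the sketch below. The Colab mount call and the Drive path appear only as comments, because the path is an assumption and the mount works only inside Colab. A tiny in-memory CSV with made-up rows and simplified column names stands in for weatherHistory.csv so the snippet runs anywhere.

```python
import io

import pandas as pd

# Inside Colab, the drive would be mounted first and the real file read from it.
# The path below is an assumption, so adjust it to wherever you uploaded the file:
#   from google.colab import drive
#   drive.mount('/content/drive')
#   df = pd.read_csv('/content/drive/MyDrive/weatherHistory.csv')

# Made-up stand-in data (not rows from the real dataset).
csv_text = """Summary,Precip Type,Temperature,Humidity
Partly Cloudy,rain,9.47,0.89
Foggy,snow,-1.20,0.95
Clear,rain,14.00,0.60
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```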
Since we loaded the data into a variable called df, we can perform many data analysis operations on it. Simply calling df.info() gives us the number of columns, the column names, the data types, the memory usage, and other details.
For further analysis and mathematical operations, we can separate the dataset into object-type (categorical) and numerical columns. We can use the code fragment below to separate the data.
To view the numerical data we can simply use the df[num_vars] command. Here num_vars is the variable that stores the numerical column names from the command above.
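One possible way to do this split is with select_dtypes(), sketched here on a small made-up frame. The name num_vars follows the text. cat_vars and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the weather data.
df = pd.DataFrame({
    "Summary": ["Clear", "Foggy", "Clear"],
    "Precip Type": ["rain", "snow", "rain"],
    "Temperature": [9.5, -1.2, 14.0],
    "Humidity": [0.89, 0.95, 0.60],
})

# Column count, names, dtypes and memory usage in one call.
df.info()

# Names of the numerical columns and of everything else (the object-type columns).
num_vars = df.select_dtypes(include=np.number).columns.tolist()
cat_vars = df.select_dtypes(exclude=np.number).columns.tolist()

# View only the numerical data.
print(df[num_vars])
```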
Now we come to the first step of the temperature prediction model. Preprocessing the data is very important, so we have to clean the data. Data cleansing has two steps.
- Dealing with Missing Values
- Handling Outliers
Let's start with missing values. First we have to identify whether there are any missing values; only then can we handle them. To check for missing values we can use the isnull() method: a cell gives True if its data is missing and False if not.
With isnull() alone it is difficult to see whether data is missing, because we are using a very large dataset. As a summary, we can view the number of missing values in each column by chaining the sum() method.
Now we can see that only one column, Precip Type, has missing values, and the missing value count is 517. Furthermore, we can sort the output using the sort_values() method.
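A minimal sketch of the isnull(), sum() and sort_values() chain, run on a made-up frame rather than the real dataset:

```python
import pandas as pd

# Made-up sample with two missing "Precip Type" entries.
df = pd.DataFrame({
    "Precip Type": ["rain", None, "snow", None],
    "Temperature": [9.5, 7.1, -1.2, 3.3],
    "Humidity": [0.89, 0.80, 0.95, 0.70],
})

# isnull() marks each cell True/False; sum() counts the True values per column.
missing_counts = df.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```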
More than 500 values are missing, and we don't know whether that is considerable or not. So we have to get the percentage of missing data with respect to the available data. To get that, we first need the length (size) of the dataset. We can simply use the len() function and pass our dataset as the argument, as below.
The length of the dataset is 96453. We can calculate the percentage of missing values as in the picture below.
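With the two numbers quoted in the text, the calculation can be sketched as follows. The variable names are illustrative.

```python
# Figures quoted in the text.
missing_count = 517   # missing values in "Precip Type"
total_rows = 96453    # len(df)

missing_pct = missing_count / total_rows * 100
print(f"{missing_pct:.2f}%")

# On a DataFrame, the same idea for every column at once would be:
#   df.isnull().sum() / len(df) * 100
```

So roughly half a percent of the rows lack a Precip Type value.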
The missing value percentage is comparatively small. Therefore we can now handle the missing values. There are many methods for handling missing values.
• Listwise deletion (complete-case analysis) removes all data for an observation that has one or more missing values.
• Listwise deletion can produce biased parameters and estimates, because the missing values might not be due to chance.
• Pairwise deletion occurs when the statistical procedure uses cases that contain some missing data. The procedure cannot include a particular variable when it has a missing value, but it can still use the case when analyzing other variables with non-missing values.
- Dropping variables: drop a variable if data is missing for more than 60% of observations, but only if that variable is insignificant.
There are many more methods, each with advantages and disadvantages.
However, if we drop or delete data we may lose some important information as well. So instead we can keep the data: what I did here was find the most frequently occurring value and use it to replace the missing values.
Now check the missing value percentage again. It is now 0.0%, so we have successfully handled the missing values.
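One way to do this most-frequent-value replacement is with mode() and fillna(), sketched below on made-up values. This is an illustration of the idea, not necessarily the exact code used in the notebook.

```python
import pandas as pd

# Made-up sample: "rain" is the most frequent value, two entries are missing.
df = pd.DataFrame({"Precip Type": ["rain", "rain", None, "snow", None]})

# mode() returns the most frequent value(s); take the first one.
most_common = df["Precip Type"].mode()[0]
df["Precip Type"] = df["Precip Type"].fillna(most_common)

# Re-check the percentage of missing values.
missing_pct = df["Precip Type"].isnull().sum() / len(df) * 100
print(most_common, missing_pct)
```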
Here we get all the categorical variables.
We have to perform feature coding. For that, I checked the features with their value counts.
Here, there are many distinct values in the “Summary” and “Daily Summary” columns. However, there are only two values in “Precip Type”, so we can apply one-hot encoding. We will do that later. Before encoding we have to handle the outliers. We can use the copy() method to keep an additional copy of the data as a safeguard. If our data gets corrupted, we can use the copy without going back to the first steps.
Outliers are unusual data points that differ significantly from the rest of the samples.
Methods to Find Outliers
Two Basic Methods: Percentile and Box Plot
• Percentile:
  • Define a minimum percentile and a maximum percentile.
  • Usually, the minimum percentile is 5%, and the maximum percentile is 95%.
  • All the data points outside the percentile range are considered outliers.
• Box Plot:
  • A box plot is a graphical display for describing the distribution of data. Box plots use the median and the lower and upper quartiles.
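The percentile method described above can be sketched with NumPy as follows. The 5% and 95% cut-offs come from the text. The data is a made-up range of values.

```python
import numpy as np

# Made-up data: the numbers 1..100.
values = np.arange(1, 101, dtype=float)

# 5th and 95th percentiles define the accepted range.
low, high = np.percentile(values, [5, 95])

# Everything outside [low, high] is flagged as an outlier.
is_outlier = (values < low) | (values > high)
print(low, high, values[is_outlier])
```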
For ease of visualization I used the box plot method. It is very simple: we just call the boxplot() method. Here I used the figsize argument to enlarge the box plot.
Now we can see there are no outliers in the “Wind Bearing”, “Visibility”, and “Loud Cover” columns, and “Loud Cover” contains only 0.0 values. The rest of the columns have outliers, and we have to handle them.
Three Basic Methods to Handle Outliers:
• Remove all the outliers
• Replace outlier values with a suitable value
  • Replace them with the min or max quantile value
• Use the IQR
  • The IQR, or interquartile range, is a measure of variability based on dividing the dataset into quartiles.
  • The quartiles are Q1, Q2, and Q3.
  • Q1 is the middle value of the first half of the dataset.
  • Q2 is the median value.
  • Q3 is the middle value of the second half of the dataset.
  • IQR = Q3 - Q1
  • We calculate a lower limit and an upper limit, then replace every value below the lower limit with the lower limit and every value above the upper limit with the upper limit.
  • It also works for data that is left skewed or right skewed.
The code below handles outliers using the IQR method. As an example, I only paste the code for the “Humidity” column, but we have to handle the outliers in all the columns. You can simply copy and paste the code and replace the column names.
Then add the new data to our dataset. The last line of the code below assigns the handled data to the Humidity column, which we are going to use later.
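A minimal sketch of IQR-based capping for one column is shown here. The text does not state how the limits are computed, so the common Q1 - 1.5·IQR and Q3 + 1.5·IQR convention is assumed. The humidity values are made up.

```python
import pandas as pd

# Made-up humidity readings with one unusually low value.
df = pd.DataFrame({
    "Humidity": [0.10, 0.70, 0.72, 0.75, 0.78, 0.80, 0.82, 0.85, 0.88, 0.90]
})

q1 = df["Humidity"].quantile(0.25)
q3 = df["Humidity"].quantile(0.75)
iqr = q3 - q1

# Assumed convention: limits sit 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

before = df["Humidity"].copy()

# Values below/above the limits are replaced by the limits themselves,
# and the result is assigned back to the Humidity column.
df["Humidity"] = df["Humidity"].clip(lower=lower, upper=upper)

print("Before:", before.min(), before.max())
print("After: ", df["Humidity"].min(), df["Humidity"].max())
```

To treat another column, the same steps can be repeated with that column's name.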
This is the output we get. “Before” shows the data before removing outliers, and “After” shows the data after removing outliers.
After handling all the outliers we can try the boxplot() method again. As the picture below shows, all the outliers have been handled successfully.
Now we are going to transform our data. For that, we first have to draw histograms. Only then can we identify which features are left skewed, which are right skewed, and which do not need a transformation.
We can also examine the data by drawing Q-Q plots. We need to import two libraries, matplotlib.pyplot and scipy.stats. Then we can simply call probplot().
We can see that the “Humidity” histogram is left skewed. So we have to apply a transformation to bring the data closer to a normal distribution. There are a few ways to reduce left skewness, but I prefer the exponential transformation. Note that below I used np.exp for the exponential transformation. If you want a power transformation instead, you can simply replace np.exp with lambda x: x**3.
Furthermore, if the graph is right skewed we can use the logarithm transformation; you only need to replace np.exp with np.log. Here I only apply the transformation to the “Humidity” feature, but you can apply it to the other features too if needed. Just remember: if the graph is left skewed use the exp transformation, and if it is right skewed use the log transformation.
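The effect of the two transformations can be checked numerically with the skew() method, as in this sketch. The data is synthetic (generated with a fixed seed), and applying np.exp / np.log directly to a Series is just one way to do it; the notebook may wrap them differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic left-skewed feature (long tail towards small values).
left_skewed = pd.Series(1.0 - rng.exponential(scale=0.2, size=5000))
# Synthetic right-skewed feature (long tail towards large values).
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

# Left skew: exponential transformation (swap in `lambda x: x**3` for a power transform).
exp_transformed = left_skewed.apply(np.exp)
# Right skew: logarithm transformation (requires strictly positive values).
log_transformed = right_skewed.apply(np.log)

# skew() < 0 means left skewed, > 0 right skewed, close to 0 roughly symmetric.
print("left :", left_skewed.skew(), "->", exp_transformed.skew())
print("right:", right_skewed.skew(), "->", log_transformed.skew())
```

In both cases the skewness should move closer to zero, although the exponential transform only reduces the left skew rather than removing it.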
After performing the transformation, we get the graphs as follows.
For the remaining part we have to apply PCA. Before that we have to do the feature coding that I said earlier I would do later. It is very simple: we encode “rain” as 1 and “snow” as 0 in the “Precip Type” column.
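A short sketch of this encoding using map(), on made-up rows. The mapping itself (rain to 1, snow to 0) is the one given in the text.

```python
import pandas as pd

# Made-up sample of the "Precip Type" column.
df = pd.DataFrame({"Precip Type": ["rain", "snow", "rain", "rain"]})

# Replace each category with its numeric code.
df["Precip Type"] = df["Precip Type"].map({"rain": 1, "snow": 0})
print(df)
```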
Feature scaling is also a data preprocessing step.
Feature scaling refers to the methods used to normalize the range of values of independent variables.
Feature magnitude matters for several reasons:
• The scale of the variable directly influences the regression coefficient.
• Variables with a more significant magnitude dominate over the ones with a smaller magnitude
• Gradient descent converges faster when features are on similar scales.
• Feature scaling helps decrease the time to find support vectors for SVMs.
• Euclidean distances are sensitive to feature magnitude.
• The min-max scaling method is very sensitive to outliers.
Here we are using StandardScaler from the sklearn.preprocessing library.
First we create a StandardScaler object and then call its fit() method. Before that, we need to create a variable holding the names of the columns we want to scale.
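The steps just described might look like this. The column list scale_cols and the sample values are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample data.
df = pd.DataFrame({
    "Temperature": [9.5, -1.2, 14.0, 21.3, 3.8],
    "Humidity": [0.89, 0.95, 0.60, 0.45, 0.80],
})

# Columns we want to scale.
scale_cols = ["Temperature", "Humidity"]

scaler = StandardScaler()
scaler.fit(df[scale_cols])                          # learn mean and std per column
df[scale_cols] = scaler.transform(df[scale_cols])   # (x - mean) / std

print(df)
```

After scaling, each column has mean 0 and standard deviation 1.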
After applying the scaling, this is how our data looks as histograms. The values mostly fall between -2 and 2.
Here we use min-max scaling.
After performing min-max scaling we can visualize the data as follows.
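Min-max scaling can be sketched with scikit-learn's MinMaxScaler, which maps each column to the range 0 to 1 by default. The data and column list below are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up sample data.
df = pd.DataFrame({
    "Temperature": [9.5, -1.2, 14.0, 21.3, 3.8],
    "Humidity": [0.89, 0.95, 0.60, 0.45, 0.80],
})
scale_cols = ["Temperature", "Humidity"]

# (x - min) / (max - min) for each column.
scaler = MinMaxScaler()
df[scale_cols] = scaler.fit_transform(df[scale_cols])

print(df)
```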
Now we come to PCA. PCA stands for Principal Component Analysis. It re-expresses the data with respect to its principal components.
This is how PCA works
- Calculate the covariance matrix X of data points.
- Calculate the eigenvectors and corresponding eigenvalues.
- Sort the eigenvectors according to their eigenvalues in decreasing order.
- Choose the first k eigenvectors; these will be the new k dimensions.
- Transform the original n dimensional data points into k dimensions.
In the notebook we import PCA from the sklearn.decomposition library. You will get output like the image below.
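To connect the listed steps with the library call, the sketch below carries them out by hand in NumPy and then runs scikit-learn's PCA on the same synthetic data. The data, the choice k = 2 and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic 3-dimensional data with correlated columns.
mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mix

# Step 1: covariance matrix of the (centered) data points.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step 2: eigenvectors and eigenvalues (eigh is meant for symmetric matrices).
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: sort by eigenvalue in decreasing order.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 4 and 5: keep the first k eigenvectors and project the data onto them.
k = 2
X_manual = X_centered @ eigvecs[:, :k]

# The same reduction with scikit-learn.
pca = PCA(n_components=k)
X_sklearn = pca.fit_transform(X)
print(pca.explained_variance_ratio_)
```

The two results agree up to the sign of each component, since an eigenvector can point in either direction.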
If we use the describe() method we can summarize the data as follows. For each feature we get values such as the mean, standard deviation, Q1, Q2, Q3, min, and max.
To better understand how the features are correlated, we use a correlation matrix. The corr() method gives the correlation matrix, with output as follows. But that is not easy to read. For us humans it is easier when the data is presented in a more visual, more attractive manner. Therefore we can use the heatmap() method from the seaborn library.
Inside the heatmap() method we need to pass the correlation matrix. We use annot=True to show the value inside each cell.
From the correlation matrix we can get an idea of how the features are correlated and which features we can drop. To get those features we can use the code sample below.
With the heat map we can see that “Apparent Temperature” and “Temperature” are highly correlated. So we can use one of those features and simply drop the other, because both features have the same effect on the dataset. Keeping both is like keeping redundant data. We can drop the column using the drop() method.
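One way to find and drop highly correlated features is sketched below on synthetic data. The 0.9 threshold and the choice to keep “Apparent Temperature” and drop “Temperature” are assumptions, since the text does not say which of the two is removed. The heatmap call is left as a comment so the snippet runs without a plotting backend.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
apparent = rng.normal(11, 10, 500)

# Synthetic data: "Temperature" is nearly a copy of "Apparent Temperature".
df = pd.DataFrame({
    "Apparent Temperature": apparent,
    "Temperature": apparent + rng.normal(0, 1, 500),
    "Humidity": rng.uniform(0.3, 1.0, 500),
})

corr = df.corr()
# To visualize it:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True)

# Keep only the upper triangle so each pair of features is checked once.
threshold = 0.9  # assumed cut-off for "highly correlated"
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.abs().where(mask)

# A column is dropped if it is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)
```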
Now we are going to create the model. We are creating a linear regression model to predict the Apparent Temperature. For that purpose we first divide our dataset into two parts: a training dataset and a testing dataset. Normally 80% of the data is used for training and the rest for testing. The code fragment below shows how we divide the dataset in two. In Python we import the train_test_split function from sklearn.model_selection.
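An 80/20 split with train_test_split can be sketched as follows. The arrays are placeholders for the feature matrix and target, and random_state is an assumption added only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features (50 rows, 2 columns) and target.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# test_size=0.2 keeps 80% of the rows for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```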
This is how we create the linear regression model. We import the LinearRegression class. After creating a LinearRegression object we can call the fit() method, passing the x and y values from the training dataset that we split off earlier. Then we can predict the Apparent Temperature from the model and compare the predictions with the testing dataset.
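The whole fit, predict and compare cycle might look like this sketch. It uses synthetic data with a known linear relationship instead of the weather features. Measuring the comparison with the R² score and mean squared error is an assumption, since the text does not name a metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on two features, plus a little noise.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training part only.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the unseen test part and compare with the true values.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("R^2:", r2, "MSE:", mse)
```

An R² close to 1 and a small error on the test set indicate that the model generalizes beyond the rows it was trained on.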