**Temperature Regression Model**

This project predicts the apparent temperature in an area from several regressors (features) and examines the correlation between temperature and features such as humidity and wind speed. The dataset contains hourly temperature readings and other features, measured in Szeged, Hungary between 2006 and 2016.

This project is done to answer these research questions:

• Is there a relationship between humidity and temperature?

• What about between humidity and apparent temperature?

• Can you predict the apparent temperature given the humidity?

We will divide this article into four parts: data preparation, data analysis, modeling, and evaluation. Data preparation explains the content of the data set and cleans it. In the second part, we describe the behavior of the data and the correlation between the target (apparent temperature) and the regressors. The model and its evaluation are explained in the third part, and the conclusions of the project are drawn afterward.

**Data preparation**

The data used is hourly temperature data recorded in Hungary between 2006 and 2016. There are twelve variables in the data set:

• Time (time)

• Summary (string)

• precipType (string)

• Temperature (float)

• ApparentTemperature (float)

• Humidity (float)

• windspeed (float)

• windBearing (float)

• Visibility (float)

• Loud Cover (float)

• Pressure (millibars) (float)

• Daily Summary (string)

There are eight numerical variables, three string variables, and one time variable. Apparent Temperature will be used as the target variable, i.e., the value that we will predict with the regression model.

**Missing Value**

We will check each feature for missing data, to make sure that the data set is free of missing values.

Precip Type contains 517 missing values. The feature itself has two unique values: rain and snow. Since 517 rows are an insignificant fraction of the data, we can simply drop the rows that contain missing values.
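A minimal sketch of this cleaning step with pandas (the toy frame below stands in for the real ~96,000-row Kaggle file, which is not reproduced here):

```python
import pandas as pd

# Hypothetical miniature of the weather data; the real data has 517
# missing Precip Type values out of ~96k hourly rows.
df = pd.DataFrame({
    "Precip Type": ["rain", "snow", None, "rain"],
    "Temperature (C)": [9.4, -1.2, 5.0, 12.1],
})

missing = df["Precip Type"].isna().sum()   # count missing values
df = df.dropna(subset=["Precip Type"])     # drop the affected rows
```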

**Duplicated Value**

We should also make sure that there are no duplicated values, in this case duplicated time stamps in the raw data. There are 24 duplicate time stamps, so we drop the duplicates and keep the last occurrence of each.
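The deduplication could look like this (again on a toy frame; the column name `Formatted Date` follows the data set's naming):

```python
import pandas as pd

# Toy frame with one duplicated time stamp (the real data has 24).
df = pd.DataFrame({
    "Formatted Date": ["2006-04-01 00:00", "2006-04-01 00:00", "2006-04-01 01:00"],
    "Temperature (C)": [9.4, 9.5, 9.1],
})

dupes = df.duplicated(subset=["Formatted Date"]).sum()
df = df.drop_duplicates(subset=["Formatted Date"], keep="last")
```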

**Data Analysis**

**Data Correlation**

Correlation analysis is a method of statistical evaluation used to study the strength of the relationship between two numerically measured, continuous variables (e.g., height and weight).

In this project, the target is the Apparent Temperature (C) feature. Based on the correlation data we decide to omit Temperature (C) and Loud Cover, so we will use only five of the seven numerical regressors.
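Inspecting the correlations could be sketched as follows (toy columns standing in for the real features):

```python
import pandas as pd

# Toy numerical columns; the real data has seven numerical regressors.
df = pd.DataFrame({
    "Temperature (C)": [1.0, 2.0, 3.0, 4.0],
    "Apparent Temperature (C)": [1.1, 2.0, 3.2, 3.9],
    "Humidity": [0.9, 0.7, 0.5, 0.3],
})

corr = df.corr()
# Correlation of every feature with the target, strongest first:
target_corr = corr["Apparent Temperature (C)"].sort_values(ascending=False)
```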

**Categorical Features**

There are four categorical features in the data set. We will omit the Formatted Date and Daily Summary features. Formatted Date is omitted because it carries only time information that cannot be used in the model calculation; Daily Summary describes daily weather and does not match the hourly temperature data. This leaves five numerical and two categorical features.

**Outlier Handling**

An outlier is an observation in a dataset that lies far from the rest of the observations. In statistics, we have three measures of central tendency: mean, median, and mode. The mean describes the data accurately when no outliers are present; the median is preferred if there is an outlier; the mode is used if there is an outlier and about half or more of the data share the same value. The mean is the only measure of central tendency that is affected by outliers, which in turn impacts the standard deviation.

As we will use linear regression, we will remove any outliers that should be omitted to improve the model accuracy. Outliers are checked and treated for all five numerical features.

There are outliers in the numerical features. For outlier handling we use the interquartile range (IQR) to detect them and omit those values in all features, including the target variable. After the outlier handling process, 87,147 rows remain out of the original 93,364.
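The IQR filter could be sketched like this, using the common 1.5 × IQR fence (the exact multiplier used in the project is not stated, so 1.5 is an assumption):

```python
import pandas as pd

def remove_outliers_iqr(df, columns, k=1.5):
    """Keep only rows inside [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Toy data with one obvious outlier.
df = pd.DataFrame({"Humidity": [0.50, 0.55, 0.60, 0.52, 5.0]})
clean = remove_outliers_iqr(df, ["Humidity"])
```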

**Multicolinearity**

Multicollinearity is a statistical concept where several independent variables in a model are correlated; two variables are perfectly collinear if their correlation coefficient is +/- 1.0. Multicollinearity among independent variables makes statistical inferences less reliable: it can skew or mislead the analysis of how well each independent variable predicts or explains the dependent variable, and it generally widens confidence intervals, producing less reliable probabilities for the effects of the independent variables in a model.

We will use the variance inflation factor (VIF) to determine whether an independent feature is multicollinear. The VIF of a feature is the ratio of the variance of its parameter estimate in the full model (including all the other terms) to its variance in a model containing only that term. A higher VIF value indicates that the feature is multicollinear with the others.

We will omit the ‘Pressure (millibars)’ feature, as it has the highest VIF value. After omitting the feature, we re-calculate the VIF values to check for any remaining multicollinearity.

This leaves four numerical features that are not multicollinear with each other.
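The VIF check could be sketched from its definition, VIF = 1 / (1 − R²), where R² comes from regressing one feature on all the others (a hand-rolled version; the project likely used `statsmodels`' `variance_inflation_factor`, which computes the same quantity):

```python
import numpy as np

def vif(X):
    """VIF per column: ss_tot / ss_res = 1 / (1 - R^2), with R^2 taken
    from an OLS regression of that column on all the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for i in range(p):
        y = X[:, i]
        A = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = ((y - A @ coef) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        vifs.append(ss_tot / ss_res)
    return vifs

# Synthetic demo: x3 is nearly collinear with x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
```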

**Categorical Features Handling**

There are two categorical features in this predictive model. As those features are not numerical, they cannot be used directly in the model calculation; we need to transform them into numerical features. We will use the one-hot encoding method to transform the two categorical features into numerical ones.
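One-hot encoding with pandas could look like this (the two remaining categorical columns in the project are Precip Type and Summary):

```python
import pandas as pd

# Toy categorical columns matching the project's two remaining ones.
df = pd.DataFrame({
    "Precip Type": ["rain", "snow", "rain"],
    "Summary": ["Clear", "Overcast", "Clear"],
})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["Precip Type", "Summary"])
```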

**Model and Evaluation**

We continue by creating a model to predict ‘Apparent Temperature’ using a regression algorithm; linear regression is the one we choose. First, we divide the dataset into the target and the regression variables: ‘Apparent Temperature’ is the target variable, while the remaining features are used as regressors.

**Scaling**

Because the regression features come in different measurement units, we need to bring them to a common scale. The standard scaler is one method of doing so: it expresses every feature in standard deviations from its mean.

After scaling the regression features, we split the data into train and test sets. The train set (x and y) is used to fit the model, while the test set is used to evaluate the model with a scoring metric.

The regression model is fitted on the train data set, and the fitted model is used to predict the target (Apparent Temperature).
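The scaling, splitting, and fitting steps could be sketched with scikit-learn as follows (a minimal sketch on synthetic data, since the real feature matrix is not reproduced here; the split ratio is an assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: a target driven linearly by two features with
# very different units (e.g. wind speed in km/h, humidity in [0, 1]).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2)) * [10.0, 0.2] + [10.0, 0.6]
y = 20.0 - 0.3 * X[:, 0] - 15.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_scaled = StandardScaler().fit_transform(X)        # unit variance, zero mean
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)    # fit on train split only
y_train_predict = model.predict(X_train)
```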

Y_train_predict is the prediction on the train data set; it is compared to y_train, the actual target values. We create a scatterplot to show the correlation between the predicted and actual target values.

The difference between the actual and predicted values can be depicted in a histogram to verify that the deltas are normally distributed.

We calculate the R2 score (coefficient of determination) to evaluate the regression model. The best possible R2 score is 1.0 (perfect predictions); the score can be arbitrarily negative for a model that predicts worse than simply using the mean of the targets.

The R2 score is around 0.6. We can also extract the intercept and the slope coefficients, so we can describe this model as a simple linear model.
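Reading off the score, intercept, and slopes could look like this (a tiny synthetic fit; the humidity-like single feature is illustrative, not the project's full matrix):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Tiny synthetic fit: target drops as the humidity-like feature rises.
X = np.array([[0.9], [0.7], [0.5], [0.3]])
y = np.array([5.0, 10.0, 15.2, 19.8])

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
intercept, slopes = model.intercept_, model.coef_
```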

We can see from the slope coefficient that three features have a strong impact for the model. Those features are Humidity, Win Speed, and Precip Type Snow. Those three features will decrease the temperature. Higher the humidity will decrease the temperature that we predicted.

Based on the model fitted on the train data set, we continue by predicting the test data set with the same regression model.

We generate a scatterplot of the actual test targets against the predicted values to compare the predictions with the actual values.

**Polynomial Regression 2nd and 3rd degree**

Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial in x. In this project we will use two-degree polynomial regression to try to improve the R2 score.

Using the two-degree polynomial method increases the number of features from 18 to 189. The fitted polynomial regression model is used to predict the test data set, and the result is evaluated against the actual test values. The r2 score improves to 0.63 from 0.62 (simple linear regression).
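The feature count can be checked with scikit-learn's `PolynomialFeatures`: with 18 inputs, the degree-2 expansion without a bias column has 18 linear terms plus C(19, 2) = 171 squares and pairwise products, i.e. 189 columns, matching the number in the text:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion of 18 features (no bias column): 18 + 171 = 189.
poly = PolynomialFeatures(degree=2, include_bias=False)
X = np.zeros((4, 18))          # placeholder data; only the shape matters here
X_poly = poly.fit_transform(X)
```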

Polynomial regression thus seems to improve the prediction result at a suitable polynomial degree. What if we increase the polynomial degree further: will it increase the model's r2 score?

We increase the polynomial degree to three and calculate the r2 score of the prediction model.

The r2 score is very low (negative), which means the model does not follow the actual target, so we cannot use this model to predict the temperature. The negative r2 score arises as follows.

R2 is computed as R2 = 1 − (SSres / SStot), where SSres is the residual sum of squares and SStot is the total sum of squares around the mean.

When SSres is greater than SStot, this equation yields a negative value for R2. With unconstrained linear regression evaluated on its own training data, r2 must be non-negative and equals the square of the correlation coefficient, r. A negative r2 becomes possible when the intercept or the slope is constrained so that the "best-fit" line fits worse than a horizontal line, when the model is evaluated on held-out data (as here, where the degree-3 model overfits the train set), or with nonlinear regression whenever the best-fit model fits the data worse than a horizontal line.
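A tiny numeric check of the formula: predicting the mean everywhere gives R2 = 0 (SSres = SStot), and any predictions that miss worse than that push R2 below zero:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SSres / SStot."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0])
baseline = np.full(3, y_true.mean())     # horizontal line at the mean -> R2 = 0
y_bad = np.array([3.0, 1.0, 5.0])        # worse than the mean -> SSres > SStot
```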

Based on the three-degree polynomial regression results, we conclude that simply increasing the polynomial degree does not improve the model's predictions.

**Random Forest**

Random Forest regression is another algorithm that we will use to predict the apparent temperature, so we can compare linear regression, both simple and polynomial, with another type of algorithm.

The prediction results show that the random forest algorithm improves the r2 score on the test set. That is a good result, but there is another problem with this algorithm: the R2 score on the train set is very high, and from the large gap between the train and test results we conclude that the random forest model is overfitted. Overfitting is a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points; the resulting model is useful only for its initial data set and not for any other data. So we will use the linear two-degree polynomial regression as the best prediction model for the current data set.
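The train/test gap that signals overfitting could be reproduced on synthetic data (the hyperparameters below are illustrative assumptions, not the project's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Noisy synthetic data: the forest memorises training noise, so the
# train R^2 sits far above the test R^2.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X[:, 0] + rng.normal(scale=1.0, size=300)   # weak signal, heavy noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
train_r2 = rf.score(X_train, y_train)
test_r2 = rf.score(X_test, y_test)
```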

**Conclusion**

This project is done to answer these research questions:

• Is there a relationship between humidity and temperature?

Yes. Temperature and humidity are negatively correlated: the higher the temperature, the lower the humidity.

• What about between humidity and apparent temperature?

Apparent temperature and humidity are also negatively correlated: the higher the apparent temperature, the lower the humidity. Apparent temperature correlates positively with the temperature feature.

• Can you predict the apparent temperature given the humidity?

Yes, we can: the two-degree polynomial regression model predicts the apparent temperature using the humidity feature (together with the other selected regressors).