Logistic regression analysis with R
Logistic regression is a method for predicting a discrete (here binary) value on the basis of a statistical analysis of the given data. Here we will run a simple analysis in R on the built-in dataset mtcars.
I will not go into much detail about cleaning and exploring this dataset. Some useful information about preliminary data analysis can be found at the beginning of the linear regression example.
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Creating model for Logistic regression analysis
mtcars contains the column vs, which represents the engine type: 0 – V-shaped and 1 – straight. We will try to find out whether the engine type can be worked out from the car weight (wt), engine power (hp) and engine displacement (disp), if at all.
This dataset has row names representing the car model, so we will convert the row names into a separate column and drop the columns we are not interested in.
Because our result can take only two values, 0 and 1, logistic regression is the appropriate analysis.
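For reference, the model we are about to fit can be written down as follows. This is the standard logistic regression formulation applied to our three predictors; the symbols p and β0..β3 are my notation and do not appear in the R code.

```latex
\[
  p = P(\mathit{vs} = 1), \qquad
  \log\frac{p}{1-p} = \beta_0 + \beta_1\,\mathit{wt} + \beta_2\,\mathit{hp} + \beta_3\,\mathit{disp}
\]
\[
  p = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\mathit{wt} + \beta_2\,\mathit{hp} + \beta_3\,\mathit{disp})}}
\]
```

The left-hand side of the first equation is the log-odds of a straight engine; the second equation is the same relation solved for the probability.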
Before we start, we need to load the libraries dplyr and tibble:
> library(dplyr)
> library(tibble)
> cars <- rownames_to_column(mtcars, var="car") %>% select(car, vs, hp, disp, wt)
> cars
car vs hp disp wt
1 Mazda RX4 0 110 160.0 2.620
2 Mazda RX4 Wag 0 110 160.0 2.875
3 Datsun 710 1 93 108.0 2.320
4 Hornet 4 Drive 1 110 258.0 3.215
. . .
Now all the data we need is selected and named. The dataset has only 32 rows (not many), so we will create (train) our model on a set of 26 randomly selected cars. The remaining 6 observations will be used as the test set for our model.
> cars_train <- cars %>% sample_n(26)
> cars_test <- cars %>% setdiff(cars_train)
> cars_model <- glm(vs~wt + hp + disp, data=cars_train, family=binomial)
Please note that the random selection will be different every time this script is run. To avoid this, for test purposes only, you can fix the starting point of the random selection by calling set.seed(any number) before sampling.
The model is fitted with the generalized linear model function glm(); family=binomial selects binomial logistic regression.
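The set.seed() remark can be sketched like this. It is only an illustration: it uses base R (sample() and row indexing) instead of the dplyr calls above so that it runs without extra packages, and the seed value 42 and the name train_idx are arbitrary choices of mine.

```r
# Rebuild the reduced data frame in base R (same columns as in the article).
cars <- data.frame(car = rownames(mtcars),
                   mtcars[, c("vs", "hp", "disp", "wt")],
                   row.names = NULL)

set.seed(42)                          # fix the random number generator state
train_idx  <- sample(nrow(cars), 26)  # 26 random row indices
cars_train <- cars[train_idx, ]
cars_test  <- cars[-train_idx, ]      # the remaining 6 rows

set.seed(42)                          # the same seed again ...
same_split <- identical(train_idx, sample(nrow(cars), 26))
same_split                            # ... reproduces the same selection
```

Any other number works as a seed; what matters is that set.seed() is called right before the sampling step.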
Prediction on the basis of our model
Now that our model is created and trained, we can make predictions with it. The function predict() does this, but its result is not discrete: by default it returns a continuous score (the linear predictor, i.e. the log-odds of vs = 1), so we need to decide how to split the results into two groups.
> cars_train["calc"] <- predict(cars_model, cars_train)
> cars_test["calc"] <- predict(cars_model, cars_test)
> mean(cars_train$calc)
[1] -3.186593
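To see what kind of number predict() returns, here is a small sketch comparing its two scales. To keep it short and deterministic I fit the model on all 32 cars, which is an assumption of this example and not the train/test setup used in the article; the names fit, log_odds and prob are mine. On such a small dataset glm() may print warnings.

```r
# Fit on the full mtcars data only for illustration.
fit <- glm(vs ~ wt + hp + disp, data = mtcars, family = binomial)

log_odds <- predict(fit, mtcars)                     # default: type = "link"
prob     <- predict(fit, mtcars, type = "response")  # probability that vs == 1

# The two scales are connected by the logistic function plogis().
scales_match <- isTRUE(all.equal(plogis(log_odds), prob))

# A score above 0 is the same thing as a probability above 0.5.
same_split <- all((log_odds > 0) == (prob > 0.5))

c(scales_match = scales_match, same_split = same_split)
```

So thresholding the default predict() output at 0 is equivalent to thresholding the predicted probability at 0.5.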
We can use 0 as the separating point for this prediction (a score of 0 corresponds to a probability of 0.5) or, if the two types of data are equally distributed, we can use the mean value (-3.186593). Alternatively, and I think this is better, we can cluster the scores into two groups:
> km <- kmeans(cars_test$calc, 2, algorithm = "Lloyd")
> km
K-means clustering with 2 clusters of sizes 2, 4
Cluster means:
[,1]
1 2.747616
2 -6.249240
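The kmeans() result still has to be turned into 0/1 labels. A possible way to do that is sketched below; it uses the six scores from the cars_test table printed further down. The nstart = 25 argument and the names high_cluster and class_cluster are my additions: with so few points a single random start can end in a different split, so several starts are tried and the best one is kept.

```r
# Scores of the six test cars, copied from the cars_test table in this article.
calc <- c(2.043189, -2.779053, -9.737519, 3.452043, -6.425958, -6.054432)

set.seed(1)  # kmeans() picks random starting centers
km <- kmeans(calc, 2, nstart = 25, algorithm = "Lloyd")

# Cluster numbers are arbitrary, so treat the cluster with the larger
# center as class 1 (straight engine) and the other one as class 0.
high_cluster  <- which.max(km$centers[, 1])
class_cluster <- as.integer(km$cluster == high_cluster)
class_cluster
```

The vector class_cluster can then be compared with the vs column in the same way as the threshold-based predictions.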
Now we check the predictions on our test set:
> cars_test["calc"] <- predict(cars_model, cars_test)
> cars_test["predict"] <- ifelse(predict(cars_model, cars_test)>0,1,0)
> cars_test["mean_predict"] <-
ifelse(predict(cars_model, cars_test)>mean(cars_train$calc),1,0)
> cars_test
car vs hp disp wt predict calc mean_predict
1 Datsun 710 1 93 108.0 2.320 1 2.043189 1
2 Hornet 4 Drive 1 110 258.0 3.215 0 -2.779053 1
3 Lincoln Continental 0 215 460.0 5.424 0 -9.737519 0
4 Toyota Corolla 1 65 71.1 1.835 1 3.452043 1
5 Dodge Challenger 0 150 318.0 3.520 0 -6.425958 0
6 AMC Javelin 0 150 304.0 3.435 0 -6.054432 0
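As we can see, every approach classifies at least five of the six test cars correctly. The only questionable point is Hornet 4 Drive, whose score (-2.779053) lies close to the border: the threshold 0 assigns it to the wrong class, while the mean-based threshold classifies it correctly. This can happen because of the small training set, or because of some unusual structural feature of this car's engine.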
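The comparison can also be done in code instead of by eye. The sketch below retypes the values from the cars_test table above, so it runs on its own; the variable names are mine.

```r
# Values copied from the cars_test table above.
vs         <- c(1, 1, 0, 1, 0, 0)   # actual engine type
calc       <- c(2.043189, -2.779053, -9.737519, 3.452043, -6.425958, -6.054432)
train_mean <- -3.186593             # mean(cars_train$calc) reported earlier

predict_zero <- ifelse(calc > 0, 1, 0)           # threshold at 0
predict_mean <- ifelse(calc > train_mean, 1, 0)  # threshold at the training mean

# Confusion matrix for the threshold 0: rows are actual, columns predicted.
table(actual = vs, predicted = predict_zero)

# Share of correctly classified test cars for each threshold.
c(zero = mean(predict_zero == vs), mean = mean(predict_mean == vs))
```

With only six test observations these shares are very rough, so they should be read as a sanity check and not as a reliable estimate of model quality.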
Problem with this dataset
A detailed look at the dataset reveals that one of the cars in the training set stands out from the rest:
> cars_train
. . .
24 Mazda RX4 0 110 160.0 2.620 -0.1783710
25 Merc 450SE 0 180 275.8 4.070 -4.0102746
26 Porsche 914-2 0 91 120.3 2.140 0.9603391
The Porsche 914-2 is equipped with a flat (horizontally opposed) engine, which is neither straight nor V-shaped. This can distort the training of our model.
Published: 2021-11-17 07:09:08
Updated: 2021-11-17 07:20:21