Logistic regression analysis with R

Logistic regression is a method for predicting a discrete (here binary) outcome from one or more predictor variables. In this post we will run a simple analysis in R on the built-in dataset mtcars.

I will not go into detail about cleaning and exploring this dataset. Some useful information about preliminary data analysis can be found at the beginning of the linear regression example.


> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4

Creating a model for logistic regression analysis

mtcars contains the column vs, which represents the engine type: 0 is V-shaped and 1 is straight. We will try to find out whether the engine type can be worked out from the car weight (wt), engine power (hp) and engine displacement (disp).

This dataset has row names that hold the car model. So we will convert the row names into a regular column and also drop the columns we are not interested in.

Because our result can take only two values, 0 and 1, we need to apply logistic regression analysis.

Before we start, we need to load the libraries dplyr and tibble.


> library(dplyr)
> library(tibble)
> cars <- rownames_to_column(mtcars, var="car") %>% select(car, vs, hp, disp, wt)
> cars
                   car vs  hp  disp    wt
1            Mazda RX4  0 110 160.0 2.620
2        Mazda RX4 Wag  0 110 160.0 2.875
3           Datsun 710  1  93 108.0 2.320
4       Hornet 4 Drive  1 110 258.0 3.215
. . .

Now we have all the data selected and named. In total there are 32 rows in the dataset (not many). We will create our model (train it) on a set of 26 randomly selected cars, and the remaining data (6 observations) will be used as the test set for our model.


> cars_train <- cars %>% sample_n(26)
> cars_test  <- cars %>% setdiff(cars_train)
> cars_model <- glm(vs~wt + hp + disp, data=cars_train, family=binomial)

Please note that every time you run this script, the random selection will be different. To avoid this, for test purposes only, you can fix the starting point of the random selection by calling set.seed() with any number before sampling.
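A minimal sketch of such a reproducible split, assuming the dplyr and tibble packages used above are installed. The seed value 42 is an arbitrary choice of mine, not one from the original run, so the selected cars will differ from the output shown in this post:

```r
library(dplyr)
library(tibble)

# Same preparation as above: row names become the column "car"
cars <- rownames_to_column(mtcars, var = "car") %>%
  select(car, vs, hp, disp, wt)

# Fixing the seed makes sample_n() pick the same 26 cars on every run
set.seed(42)  # 42 is arbitrary; any integer works
cars_train <- cars %>% sample_n(26)
cars_test  <- cars %>% setdiff(cars_train)

nrow(cars_train)  # 26
nrow(cars_test)   # 6
```

Remove the set.seed() call again when you want a fresh random split.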

We fit the model with the generalized linear model function glm(); family=binomial makes it a binomial logistic regression.

Prediction on the basis of our model

Now that our model is created and trained, we can make predictions with it. The function predict() does this, but the result is not discrete: by default it returns a continuous score (the linear predictor, on the log-odds scale), and we need to decide how to split these scores into two groups.


> cars_train["calc"] <- predict(cars_model, cars_train)
> cars_test["calc"] <- predict(cars_model, cars_test)
> mean(cars_train$calc)
[1] -3.186593
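By default predict() on a glm returns values on the link (log-odds) scale, which is why they are not restricted to the range from 0 to 1. The sketch below shows how these scores relate to probabilities. For simplicity it fits on the whole mtcars dataset rather than on the 26-car training sample, so its numbers will not match the ones in this post; the variable names are mine:

```r
# Illustration only: fitted on all 32 cars, not on the training sample
model <- glm(vs ~ wt + hp + disp, data = mtcars, family = binomial)

log_odds <- predict(model, mtcars)                     # default type = "link"
prob     <- predict(model, mtcars, type = "response")  # probabilities in [0, 1]

# plogis() is the logistic function 1 / (1 + exp(-x)); it maps log-odds
# to probabilities, so both vectors carry the same information
all.equal(unname(plogis(log_odds)), unname(prob))

# A log-odds score of 0 is exactly a probability of 0.5
plogis(0)

# So cutting the scores at 0 is the same as cutting probabilities at 0.5
all((log_odds > 0) == (prob > 0.5))
```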

We can use 0 as the separating point for this prediction (a score of 0 corresponds to a predicted probability of 0.5). Or, if the two types of engine are equally distributed, we can use the mean value (-3.186593). Or, I think, it is better to cluster the scores into two groups.


> km <- kmeans(cars_test$calc, 2, algorithm = "Lloyd")
> km
K-means clustering with 2 clusters of sizes 2, 4
Cluster means:
       [,1]
1  2.747616
2 -6.249240
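Note that kmeans() numbers its clusters arbitrarily, so cluster 1 is not guaranteed to mean vs = 1. A small sketch of turning the clustering into 0/1 labels, using the six test-set scores from this run typed in by hand. The set.seed() call and nstart = 25 are my additions to make the result stable, not part of the original run:

```r
# Test-set scores (column calc) from the run shown in this post
calc <- c(2.043189, -2.779053, -9.737519, 3.452043, -6.425958, -6.054432)

set.seed(1)  # arbitrary seed, only for repeatability
# nstart = 25 tries 25 random starts and keeps the best clustering
km <- kmeans(calc, centers = 2, algorithm = "Lloyd", nstart = 25)

# The cluster with the higher center is treated as "straight engine" (1)
high <- which.max(km$centers[, 1])
cluster_predict <- as.integer(km$cluster == high)
cluster_predict
```

With a single random start, k-means on so few points can settle on a different split, which is one more reason to treat this threshold with care.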

And now we check the predictions on our test set.


> cars_test["calc"] <- predict(cars_model, cars_test)
> cars_test["predict"] <- ifelse(predict(cars_model, cars_test)>0,1,0)
> cars_test["mean_predict"] <- 
       ifelse(predict(cars_model, cars_test)>mean(cars_train$calc),1,0)
> cars_test
                  car vs  hp  disp    wt predict      calc mean_predict
1          Datsun 710  1  93 108.0 2.320       1  2.043189            1
2      Hornet 4 Drive  1 110 258.0 3.215       0 -2.779053            1
3 Lincoln Continental  0 215 460.0 5.424       0 -9.737519            0
4      Toyota Corolla  1  65  71.1 1.835       1  3.452043            1
5    Dodge Challenger  0 150 318.0 3.520       0 -6.425958            0
6         AMC Javelin  0 150 304.0 3.435       0 -6.054432            0

As we can see, five of the six cars are predicted correctly whichever threshold we use. The only questionable point is the Hornet 4 Drive, which sits near the border: the threshold of 0 predicts 0 for it, while the mean threshold predicts 1, its actual value. This can happen because of the small training dataset, or because of some unusual structural difference in the engine of this car.
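To put a number on this comparison, the predictions can be checked against the actual vs values. A short sketch using the six rows of the table above, typed in by hand (the variable names are mine):

```r
# Values copied from the cars_test output above
actual       <- c(1, 1, 0, 1, 0, 0)  # column vs
predict_zero <- c(1, 0, 0, 1, 0, 0)  # column predict (threshold 0)
predict_mean <- c(1, 1, 0, 1, 0, 0)  # column mean_predict

# Share of correctly classified cars for each threshold
mean(actual == predict_zero)  # 5 of 6 correct
mean(actual == predict_mean)  # 6 of 6 correct

# Confusion table for the threshold 0: rows are actual, columns predicted
table(actual = actual, predicted = predict_zero)
```

With only six test cars, one car changes the accuracy by almost 17 percentage points, so these figures say little about which threshold is better in general.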


Problem with this dataset

A closer look at the dataset reveals that one of the cars in the training set stands out:


> cars_train
. . .
24          Mazda RX4  0 110 160.0 2.620  -0.1783710
25         Merc 450SE  0 180 275.8 4.070  -4.0102746
26      Porsche 914-2  0  91 120.3 2.140   0.9603391

The Porsche 914-2 is equipped with an opposed (flat) engine, which is neither straight nor V-type, yet it is coded as vs = 0. This can distort our trained model.

Published: 2021-11-17 07:09:08
Updated: 2021-11-17 07:20:21