Data Preparation for Cluster Analysis in R

Before clustering, we first need to inspect the dataset: what it is (help()), how it is organised (glimpse()), whether any values are missing (anyNA()) and, if possible, what it looks like when plotted (plot()).

Dataset inspection

We will use iris, a built-in dataset of iris flower measurements. glimpse() comes from the dplyr package, so load it first with library(dplyr).


> help(iris)
iris                 package:datasets                  R Documentation
Edgar Anderson's Iris Data
Description:
     This famous (Fisher's or Anderson's) iris data set gives the
     measurements in centimeters of the variables sepal length and
     width and petal length and width, respectively, for 50 flowers
     from each of 3 species of iris.  The species are _Iris setosa_,
     _versicolor_, and _virginica_.
. . .
> glimpse(iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4,...
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7,...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5,...
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2,...
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa...
> anyNA(iris)
[1] FALSE

As we can see, the dataset is complete: it has 150 rows and 5 columns, and contains no missing values.
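The same checks can also be run with base R alone, which may be handy when dplyr is not installed. The sketch below is an illustration rather than part of the original walkthrough: str() stands in for glimpse(), and colSums(is.na()) breaks the anyNA() result down per column. The names na_per_column and species_counts are made up for this example.

```r
# Base-R sketch of the inspection steps (no extra packages needed).
data(iris)

# Structure: column names, types and first values (similar to glimpse()).
str(iris)

# Count missing values per column; all zeros means the dataset is complete.
na_per_column <- colSums(is.na(iris))
print(na_per_column)

# Number of observations per species.
species_counts <- table(iris$Species)
print(species_counts)
```

A per-column count is more useful than a single TRUE/FALSE once a dataset does contain gaps, because it shows where they are.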

Plotting 2D datasets

We will use plot() to display the data.


> plot(iris$Petal.Length,
+      iris$Petal.Width,
+      pch=c(20,22,24)[unclass(iris$Species)],
+      bg=c("red","green","blue")[unclass(iris$Species)])

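pch selects the point symbol and bg the fill colour. Each is given as a three-element vector indexed by unclass(iris$Species), the integer code (1, 2 or 3) of each flower's species, so every species gets its own symbol and colour. Note that bg only affects symbols 21 to 25, so the pch = 20 points (setosa) are drawn in the default foreground colour rather than red.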
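To see how that lookup works, it may help to build the per-point vectors separately before passing them to plot(). This is a minimal sketch; the variable names (species_codes, shapes, colours, point_shapes, point_colours) are hypothetical and only used for illustration.

```r
# Sketch of the lookup used for pch and bg in the plot() call.
species_codes <- unclass(iris$Species)   # integer codes 1, 2, 3 (one per row)

shapes  <- c(20, 22, 24)                 # one plotting symbol per species
colours <- c("red", "green", "blue")     # one fill colour per species

point_shapes  <- shapes[species_codes]   # length-150 vector of symbols
point_colours <- colours[species_codes]  # length-150 vector of colours

# Cross-tabulate species against the assigned symbol to see the mapping.
print(table(iris$Species, point_shapes))
```

The codes follow the order of levels(iris$Species), so the first entry of each lookup vector belongs to setosa, the second to versicolor and the third to virginica.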

**Iris dataset**
Petal.Width plotted against Petal.Length for the iris dataset


As we can see, the data fall into three groups, one per species. The next step is to confirm this grouping statistically.

Preliminary analysis of the datasets

It is a good idea to inspect the unlabelled dataset manually, to see whether any grouping patterns are present.


> unlabeled_iris <- iris %>% select(-Species)
> pairs(unlabeled_iris)

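This creates the table unlabeled_iris, a copy of iris without the Species column, so the data no longer carry their labels. pairs() then draws a scatter plot for every pair of the remaining variables. select() and the %>% pipe are provided by dplyr.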
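If dplyr is not available, the label column can be dropped with base R instead. The following is a sketch of one possible equivalent, not the method used above; it also checks that every remaining column is numeric before plotting.

```r
# Base-R sketch: drop the Species column without dplyr.
unlabeled_iris <- iris[, setdiff(names(iris), "Species")]

# Confirm that only numeric measurements remain before plotting.
stopifnot(all(sapply(unlabeled_iris, is.numeric)))

# Scatter plot matrix of every pair of the four measurements.
pairs(unlabeled_iris)
```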

**Iris dataset pairs**
Pairwise scatter plots of the unlabelled iris dataset

Visual inspection of these pairs should give some idea of how the data might group, and so of what to expect from automatic clustering.

Published: 2021-11-17 08:22:34
Updated: 2021-11-17 14:04:37