R Programming Language: An Overview
R is an open-source programming language used mainly for statistical computing and data visualization. Developed by Ross Ihaka and Robert Gentleman in 1993, it provides a large collection of tools for statistical inference and graphics.
Primary Uses
In general, the R programming language is used for three main purposes:
1. Statistical Inference
Statistical inference is the process of using the results of data analysis to deduce properties of the population or process that produced the data.
2. Data Analysis
Data analysis is the process of collecting, inspecting, cleaning, transforming, and modeling data about a given population in order to discover useful information.
3. Machine Learning
In this context, machine learning is a method of training a system on data so that it builds statistical models on its own, identifying patterns and making decisions with minimal human intervention.
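As a rough illustration of the first two uses, the sketch below works on the built-in mtcars data set. Neither mtcars nor the hypothesized mean of 20 comes from the text above; both are assumptions chosen only for convenience.

```r
# Illustrative sketch; mtcars and the hypothesized mean of 20 are assumptions.
data("mtcars")

# Data analysis: summarize fuel consumption and compare group averages.
mpg_summary <- summary(mtcars$mpg)
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_summary)
print(mpg_by_cyl)

# Statistical inference: test whether the population mean mpg could be 20.
mpg_test <- t.test(mtcars$mpg, mu = 20)
print(mpg_test$p.value)
print(mpg_test$conf.int)
```

The summary and the group averages describe the sample itself, while the p-value and confidence interval are statements about the wider population the sample is assumed to come from.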
Machine Learning using R
K-Means
K-means divides the available data into k groups, or clusters. It starts by choosing k initial points (picked at random, preferably as far apart from one another as possible), calculating the distance from every data point to each of them, and assigning each data point to the nearest one. A centroid is then calculated for each newly formed cluster, and the data points are reassigned according to their distances to these centroids. This process repeats either for a set number of rounds (provided by the user) or until the clusters no longer change.
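The procedure described above can be sketched in a few lines of R. This is a minimal illustration, not the implementation behind R's own kmeans() function; the name simple_kmeans, the max_rounds argument, the purely random choice of starting rows, and the toy data are all assumptions made for the example.

```r
# Minimal sketch of the iterative procedure described above (not R's kmeans()).
# simple_kmeans, max_rounds and the random starting rows are illustrative choices.
simple_kmeans <- function(x, k, max_rounds = 10) {
  x <- as.matrix(x)
  # Start from k randomly chosen rows as the initial centres.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))

  for (round in seq_len(max_rounds)) {
    # Squared Euclidean distance from every point to every centre.
    dists <- vapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centers[j, ])^2)
    }, numeric(nrow(x)))
    # Assign each point to its nearest centre.
    new_cluster <- max.col(-dists, ties.method = "first")

    # Stop once no point changes cluster.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster

    # Recompute each centroid as the mean of its assigned points.
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

# Toy data: two well-separated groups of 20 points each.
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
print(table(fit$cluster))
```

Because the starting centres are random, a single run can settle on a poor grouping for harder data, which is why running the procedure from several random starts and keeping the best result is a common safeguard.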
Example using R
For this example, we will use USArrests, one of the data sets that ships with R.
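> data("USArrests")        # load the built-in data set
> df <- scale(USArrests)   # standardize each variable to mean 0 and sd 1
> set.seed(1)              # make the random starts reproducible
> kmeans_result <- kmeans(df, 4, nstart = 25)   # 4 clusters, best of 25 random starts
> print(kmeans_result)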
The code above produces the following output:
K-means clustering with 4 clusters of sizes 13, 13, 16, 8
Cluster means:
      Murder    Assault   UrbanPop        Rape
1 -0.9615407 -1.1066010 -0.9301069 -0.96676331
2  0.6950701  1.0394414  0.7226370  1.27693964
3 -0.4894375 -0.3826001  0.5758298 -0.26165379
4  1.4118898  0.8743346 -0.8145211  0.01927104
Clustering vector:
       Alabama         Alaska        Arizona       Arkansas     California
             4              2              2              4              2
      Colorado    Connecticut       Delaware        Florida        Georgia
             2              3              3              2              4
        Hawaii          Idaho       Illinois        Indiana           Iowa
             3              1              2              3              1
        Kansas       Kentucky      Louisiana          Maine       Maryland
             3              1              4              1              2
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri
             3              2              1              4              2
       Montana       Nebraska         Nevada  New Hampshire     New Jersey
             1              1              2              1              3
    New Mexico       New York North Carolina   North Dakota           Ohio
             2              2              4              1              3
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
             3              3              3              3              4
  South Dakota      Tennessee          Texas           Utah        Vermont
             1              4              2              3              1
      Virginia     Washington  West Virginia      Wisconsin        Wyoming
             3              3              1              1              3
Within cluster sum of squares by cluster:
[1] 11.952463 19.922437 16.212213 8.316061
(between_SS / total_SS = 71.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
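The available components listed in the output can be read directly from the fitted object. The sketch below repeats the calls from the example so that it runs on its own, then pulls out a few components; the variable names explained and cluster_profile are illustrative, and the cluster numbering may differ between R versions.

```r
# Sketch: inspecting the fitted object; repeats the calls from the example.
data("USArrests")
df <- scale(USArrests)
set.seed(1)
kmeans_result <- kmeans(df, 4, nstart = 25)

# Number of states in each cluster.
print(kmeans_result$size)

# Share of the total sum of squares explained by the clustering
# (the between_SS / total_SS figure in the printed output).
explained <- kmeans_result$betweenss / kmeans_result$totss
print(round(100 * explained, 1))

# Average of the original (unscaled) variables within each cluster.
cluster_profile <- aggregate(USArrests,
                             by = list(cluster = kmeans_result$cluster),
                             FUN = mean)
print(cluster_profile)
```

Averaging the unscaled variables per cluster tends to be easier to interpret than the standardized cluster means, since the values are back in the original units of the data set.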