R package classInt for univariate class intervals

The R package classInt provides several methods for calculating univariate class intervals. The different styles are shown and briefly explained below, using example data from the UN.

Data

The data used in this tutorial are the global sex ratio (the ratio of women to men) of the total population in December 2012. The original data are provided by the United Nations Statistics Division.
The sex ratio is the number of women per 100 men for each country in the world.
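
As a small worked example of this definition, the ratio can be computed from population counts. The counts and the labels A, B and C below are made up for illustration and are not taken from the UN data.

```r
# Hypothetical population counts (illustrative only, not UN data)
women <- c(A = 5100, B = 4800, C = 5000)
men   <- c(A = 5000, B = 5000, C = 5000)

# Sex ratio as defined above: women per 100 men
sex_ratio <- women / men * 100
print(sex_ratio)
```

A value above 100 means more women than men, a value below 100 means more men than women.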
To see this data in an interactive map please visit: http://climvis.de/worldwide-sex-ratio/

First download the data, read it into R and load the R package ‘classInt’. The example data contains missing values, so remove them and assign the values of the sex ratio to x:

# Library
library(classInt)

# Download and Read Data
# (R's default download method is used here; method='wget' only works if wget is installed)
download.file('http://climvis.de/wp-content/uploads/2014/07/UN_Gender-Relations.csv', destfile='UN_Gender-Relations.csv')
dat=read.csv('UN_Gender-Relations.csv', header=TRUE, sep=';', na.strings='NA')

# Remove missing values
dat=subset(dat, !is.na(dat$Sex.ratio))

# x: Sex Ratio
x=dat$Sex.ratio

Let's look at the data. The plot below shows the values of x, the density and histogram, and the empirical cumulative distribution function (ECDF).

[Figure: values of x, histogram with density, and ECDF]

## Plot: x, Density and Histogram, Empirical Cumulative Distribution Function
par(mfrow=c(1,3), cex=1)

# x
plot(x, t='p', pch=20, xlab='Index', main='x: Gender Ratio')
abline(h=100)

# Density and Histogram
hist(x, prob=TRUE, col='lightblue', main='Histogram and Density of x', ylim=range(density(x)$y))
lines(density(x))

# Empirical Cumulative Distribution Function
plot(ecdf(x))
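
To read the ECDF panel, it may help to recall that ecdf() returns a function giving the share of values less than or equal to its argument. A minimal sketch with a made-up vector v (not the UN data):

```r
# Hypothetical sex ratio values (illustrative only, not UN data)
v <- c(90, 95, 100, 100, 105)

# ecdf() returns a step function; Fn(q) is the share of values <= q
Fn <- ecdf(v)

# 4 of the 5 values are <= 100, so this gives 0.8
print(Fn(100))
```

Applied to x, a call such as ecdf(x)(100) would give the share of countries with at most 100 women per 100 men.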

classInt: Classification Methods

In the following examples the styles are calculated for 6 classes (n=6). In order to visualize the different classes they are plotted with the following colors:

# colors of classes
pcol= c('darkblue', 'blue', 'lightblue', 'palegreen', 'lightpink', 'brown3')

The R package classInt provides the function classIntervals to calculate the class intervals and the function findColours to assign colors from a given vector (pcol) to the classes. findColours returns the colors together with two attributes (table and palette) that can be used to construct a legend. The package also provides different styles of classification, which are briefly explained and visualized:
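
The two attributes can be inspected directly. The following is a minimal sketch with a made-up vector v and three of the colors from pcol; it assumes the classInt package is installed, and the values and breaks are chosen only for illustration.

```r
library(classInt)

# Hypothetical values (illustrative only, not UN data)
v <- c(82, 95, 99, 100, 101, 104, 118)

# Three classes defined by four fixed breaks
ci <- classIntervals(v, n = 3, style = "fixed",
                     fixedBreaks = c(80, 99, 101, 120),
                     intervalClosure = "right")

# One color per value, plus the 'palette' and 'table' attributes
cols <- findColours(ci, c("blue", "palegreen", "brown3"))

print(attr(cols, "palette"))  # one color per class
print(attr(cols, "table"))    # number of values per class, named by interval
```

The names of the table attribute are the interval labels, which is why they can serve as legend text.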

Style: fixed

With the fixed style you choose the class intervals yourself by defining n+1 breaks. The following breaks are used in this Choropleth Map. Because the intervals are closed on the right (intervalClosure='right'), the class from 99 to 100 contains the x values which equal 100 (same number of men and women) and is colored green.
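
The effect of right-closed intervals on values near 100 can be checked with base R's cut(). This is only a stand-in to illustrate the closure, not necessarily how classInt assigns classes internally, and the test values in v are made up.

```r
# Breaks used for the fixed style in this tutorial
brks <- c(30, 80, 90, 99, 100, 110, 120)

# Hypothetical test values around 100 (illustrative only)
v <- c(99, 99.5, 100, 100.5)

# right = TRUE closes each interval on the right: (a, b]
cls <- cut(v, breaks = brks, right = TRUE, include.lowest = TRUE)
print(data.frame(value = v, class = cls))
```

With these breaks, 99 falls into the third class, 99.5 and 100 into the fourth, and 100.5 into the fifth.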

In the next step the classes are calculated and visualized. To generate some extra space for the legend ylim is enlarged.

# style='fixed'
nclass=classIntervals(x, n=6, style='fixed', fixedBreaks=c(30, 80, 90, 99, 100, 110, 120), intervalClosure='right')
colcode = findColours(nclass,pcol)

# plot
par(mfrow=c(1,2))
plot(nclass, pal=pcol , main='fixed')
plot(x, pch=20, col=colcode, xlab='Index', ylab='x', main='fixed', ylim=c(min(x), max(x)*4/3 ))
legend('topleft', legend=c(names(attr(colcode, 'table'))), fill=c(attr(colcode, 'palette')), title='women/100 men', cex=0.8)

[Figure: fixed]

Style: equal, pretty, sd

equal: The range of the variable is divided into n parts of equal width.

pretty: Computes a sequence of about ‘n+1’ equally spaced ‘round’ values which cover the range of the values in ‘x’. The values are chosen so that they are 1, 2 or 5 times a power of 10. The resulting number of classes may therefore differ from n.

sd: Like pretty, but applied to the centred and scaled variable.

# style='equal'
nclass=classIntervals(x, n=6, style='equal', intervalClosure='right')
colcode = findColours(nclass,pcol)
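
The other two styles of this group are called in the same way. The sketch below compares the breaks of all three on a made-up vector v (not the UN data); it assumes the classInt package is installed.

```r
library(classInt)

# Hypothetical sex ratio values (illustrative only, not UN data)
v <- c(84, 88, 91, 93, 95, 96, 97, 98, 99, 100,
       100, 101, 101, 102, 103, 104, 106, 109, 113, 118)

eq  <- classIntervals(v, n = 6, style = "equal",  intervalClosure = "right")
pr  <- classIntervals(v, n = 6, style = "pretty", intervalClosure = "right")
sdv <- classIntervals(v, n = 6, style = "sd",     intervalClosure = "right")

# Compare the breaks; pretty and sd may return a number of classes other than 6
print(eq$brks)
print(pr$brks)
print(sdv$brks)
```

The equal breaks start at min(v) and end at max(v), while the pretty breaks are round values that extend to or beyond the data range.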

[Figures: equal, pretty, sd]

Style: quantile

With the style quantile the breaks are placed at quantiles of x, so that each class contains approximately the same number of values.
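
A minimal sketch of the quantile style on a made-up vector v (not the UN data), assuming the classInt package is installed. The class counts are tabulated with base R's cut() as an independent check.

```r
library(classInt)

# Hypothetical sex ratio values (illustrative only, not UN data)
v <- c(84, 88, 91, 93, 95, 96, 97, 98, 99, 100,
       100, 101, 101, 102, 103, 104, 106, 109, 113, 118)

# Four classes with breaks at quantiles of v
q <- classIntervals(v, n = 4, style = "quantile", intervalClosure = "right")
print(q$brks)

# Count the values per class; ties can make the counts slightly unequal
counts <- table(cut(v, breaks = q$brks, right = TRUE, include.lowest = TRUE))
print(counts)
```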

[Figure: quantile]

Style: kmeans, hclust

kmeans: Clusters with low within-cluster variance and similar size (Wikipedia: K-means_clustering)

hclust: Clusters formed by merging values that are a short distance apart (Wikipedia: Hierarchical_clustering)
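
Both clustering styles can be tried on a small made-up vector with three clearly separated groups (not the UN data). This sketch assumes the classInt package is installed; the seed and the linkage method 'complete' are choices made for this example.

```r
library(classInt)

# Hypothetical values forming three well-separated groups (illustrative only)
v <- c(80, 81, 82, 83, 98, 99, 100, 101, 116, 117, 118, 119)

# kmeans uses random starting centers, so fix the seed for reproducibility
set.seed(42)
km <- classIntervals(v, n = 3, style = "kmeans", intervalClosure = "right")

# method = "complete" is passed on to hclust()
hc <- classIntervals(v, n = 3, style = "hclust", method = "complete",
                     intervalClosure = "right")

print(km$brks)
print(hc$brks)
```

For hclust the inner breaks fall into the gaps between the groups.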

[Figures: kmeans, hclust]

Style: jenks, fisher

jenks: Jenks’ Natural Breaks Classification Method seeks to reduce the variance within classes and maximize the variance between classes.

fisher: Fisher’s Natural Breaks Classification is an improvement of jenks.
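
A minimal sketch that runs both natural breaks styles on a made-up vector v (not the UN data), assuming the classInt package is installed. findCols() is used to list the class index assigned to each value.

```r
library(classInt)

# Hypothetical values forming three well-separated groups (illustrative only)
v <- c(80, 81, 82, 83, 98, 99, 100, 101, 116, 117, 118, 119)

fi <- classIntervals(v, n = 3, style = "fisher", intervalClosure = "right")
je <- classIntervals(v, n = 3, style = "jenks",  intervalClosure = "right")

# The break values of the two styles may differ
print(fi$brks)
print(je$brks)

# Class index of each value under both styles
print(findCols(fi))
print(findCols(je))
```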

[Figure: jenks]