The R package classInt provides several methods (styles) for calculating univariate class intervals. This tutorial demonstrates these styles and briefly explains each one, using example data from the United Nations.
The data used in this tutorial is the global sex ratio (gender relation) of the total population in December 2012. The original data is provided by the United Nations Statistics Division.
The sex ratio is the number of women per 100 men in each country of the world.
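As a quick illustration of this definition, the ratio can be computed as follows. The population counts are made up for the example and are not taken from the UN data.

```r
# Hypothetical population counts, for illustration only
women <- 5100000
men   <- 5000000

# Number of women per 100 men
sex_ratio <- 100 * women / men
sex_ratio  # 102
```

A value above 100 means more women than men, a value below 100 the opposite.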
To see this data in an interactive map please visit: http://climvis.de/worldwide-sex-ratio/
First, download the data, read it into R, and load the R package ‘classInt’. The example data contains missing values, so remove them and assign the sex ratio values to x:
# Library
library(classInt)

# Download and read the data
# (method='wget' requires wget on the PATH; drop the argument to use R's default method)
download.file('http://climvis.de/wp-content/uploads/2014/07/UN_Gender-Relations.csv',
              destfile='UN_Gender-Relations.csv', method='wget')
dat = read.csv('UN_Gender-Relations.csv', header=TRUE, sep=';', na.strings='NA')

# Remove missing values
dat = subset(dat, !is.na(Sex.ratio))

# x: sex ratio
x = dat$Sex.ratio
Let's look at the data. The plot below shows the values of x, the density and histogram, and the empirical cumulative distribution function (ecdf).
## Plot: x, density and histogram, empirical cumulative distribution function
par(mfrow=c(1,3), cex=1)

# x
plot(x, type='p', pch=20, xlab='Index', main='x: Gender Ratio')
abline(h=100)

# Density and histogram
# ylim spans the estimated density values (density(x)$y)
hist(x, prob=TRUE, col='lightblue', main='Histogram and Density of x',
     ylim=range(density(x)$y))
lines(density(x))

# Empirical cumulative distribution function
plot(ecdf(x))
classInt: Classification Methods
In the following examples the styles are calculated for 6 classes (n=6). In order to visualize the different classes they are plotted with the following colors:
# colors of classes
pcol = c('darkblue', 'blue', 'lightblue', 'palegreen', 'lightpink', 'brown3')
The R package classInt provides the function classIntervals to calculate the class intervals and the function findColours to assign a color from a given vector (pcol) to each value. The result of findColours carries two attributes (table and palette) that can be used to construct a legend. The package also offers different styles of classification, which are briefly explained and visualized below.
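The interplay of the two functions can be inspected on a small example. This is a minimal sketch that uses a synthetic vector (x_demo, an assumption standing in for the UN data) so that it runs without the download.

```r
library(classInt)

# Synthetic stand-in for the sex ratio values (assumption, not the UN data)
set.seed(1)
x_demo <- rnorm(100, mean = 100, sd = 8)
pal_demo <- c('darkblue', 'blue', 'lightblue', 'palegreen', 'lightpink', 'brown3')

# classIntervals() stores the data in $var and the n + 1 break points in $brks
ci <- classIntervals(x_demo, n = 6, style = 'equal')
ci$brks

# findColours() returns one color per value ...
cols <- findColours(ci, pal_demo)
length(cols)

# ... plus the two attributes used for a legend
attr(cols, 'palette')  # one color per class
attr(cols, 'table')    # number of values per class, named by interval
```

The names of the table attribute are the interval labels, which is why they can be passed directly to legend().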
Style: fixed
With the fixed style you choose the class intervals yourself by defining n+1 breaks. The breaks below are the ones used in the choropleth map linked above. Values just above 99 up to and including 100 (roughly the same number of men and women) go into one class, colored green.
In the next step the classes are calculated and visualized. To make room for the legend, the upper limit of ylim is enlarged.
# style='fixed'
nclass = classIntervals(x, n=6, style='fixed',
                        fixedBreaks=c(30, 80, 90, 99, 100, 110, 120),
                        intervalClosure='right')
colcode = findColours(nclass, pcol)

# plot
par(mfrow=c(1,2))
plot(nclass, pal=pcol, main='fixed')
plot(x, pch=20, col=colcode, xlab='Index', ylab='x', main='fixed',
     ylim=c(min(x), max(x)*4/3))
legend('topleft', legend=c(names(attr(colcode, 'table'))),
       fill=c(attr(colcode, 'palette')), title='women/100 men', cex=0.8)
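The intervalClosure argument decides which class a value receives when it lies exactly on a break. The sketch below uses a handful of hypothetical values (v, not the UN data) and the exported helper findCols, which returns the class index of each value, to show where a ratio of exactly 100 ends up.

```r
library(classInt)

# Hypothetical values for illustration (not the UN data)
v <- c(72, 85, 88, 95, 99.5, 100, 100.5, 104, 108, 115, 118)
brks <- c(30, 80, 90, 99, 100, 110, 120)

right <- classIntervals(v, n = 6, style = 'fixed', fixedBreaks = brks,
                        intervalClosure = 'right')
left  <- classIntervals(v, n = 6, style = 'fixed', fixedBreaks = brks,
                        intervalClosure = 'left')

# Class index of the value that sits exactly on the break at 100
class_right <- findCols(right)[v == 100]  # class 4, the interval (99, 100]
class_left  <- findCols(left)[v == 100]   # class 5, the interval [100, 110)
```

With intervalClosure='right', as in the fixed example, a ratio of exactly 100 falls into the fourth class, which is the one drawn in palegreen.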
Style: equal, pretty, sd
equal: The range of the variable is divided into n parts of equal width.
pretty: Computes a sequence of about ‘n+1’ equally spaced ‘round’ values which cover the range of the values in ‘x’. The values are chosen so that they are 1, 2 or 5 times a power of 10. The resulting number of classes is therefore not necessarily equal to n.
sd: Like pretty, but applied to the centred and scaled variable; the number of classes may again differ from n.
# style='equal'
nclass = classIntervals(x, n=6, style='equal', intervalClosure='right')
colcode = findColours(nclass, pcol)
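The three styles can be compared side by side. The following sketch again relies on a synthetic vector (x_demo, an assumption standing in for the UN data) and only prints the breaks, so the class counts of pretty and sd can be checked against the requested n.

```r
library(classInt)

# Synthetic stand-in for the sex ratio values (assumption, not the UN data)
set.seed(1)
x_demo <- rnorm(100, mean = 100, sd = 8)

ci_equal  <- classIntervals(x_demo, n = 6, style = 'equal')
ci_pretty <- classIntervals(x_demo, n = 6, style = 'pretty')
ci_sd     <- classIntervals(x_demo, n = 6, style = 'sd')

# equal: all classes have the same width
diff(ci_equal$brks)

# pretty and sd: round breaks, number of classes may differ from n = 6
ci_pretty$brks
length(ci_pretty$brks) - 1
ci_sd$brks
length(ci_sd$brks) - 1
```

When the number of classes differs from the length of the color vector, findColours interpolates the palette to the required number of colors.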
Style: quantile
With the quantile style the intervals are calculated so that each class contains roughly the same number of values.
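That the classes are evenly filled can be verified by counting the values per class. This is a minimal sketch with a synthetic vector (x_demo, an assumption standing in for the UN data); findCols returns the class index of each value.

```r
library(classInt)

# Synthetic stand-in for the sex ratio values (assumption, not the UN data)
set.seed(1)
x_demo <- rnorm(100, mean = 100, sd = 8)

ci_q <- classIntervals(x_demo, n = 6, style = 'quantile')

# Number of values per class: the counts differ by at most one here
counts <- table(findCols(ci_q))
counts
```

With many tied values in the data the counts can become less even, because identical values always land in the same class.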
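Style: kmeans, hclust
kmeans: Groups the values into n clusters with low within-cluster variance, which tends to produce clusters of similar extent; see Wikipedia: K-means_clustering
hclust: Hierarchical clustering, which step by step merges the values that lie closest together and then cuts the resulting tree into n classes; see Wikipedia: Hierarchical_clustering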
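Both clustering styles are called in the same way as the other styles. The sketch below uses a synthetic vector (x_demo, an assumption standing in for the UN data). Because k-means starts from random centres, a seed is set right before the call to make the breaks reproducible.

```r
library(classInt)

# Synthetic stand-in for the sex ratio values (assumption, not the UN data)
set.seed(1)
x_demo <- rnorm(100, mean = 100, sd = 8)

# kmeans: random starting centres, so fix the seed for reproducible breaks
set.seed(42)
ci_km <- classIntervals(x_demo, n = 6, style = 'kmeans')

# hclust: deterministic, uses the default settings of hclust()
ci_hc <- classIntervals(x_demo, n = 6, style = 'hclust')

ci_km$brks
ci_hc$brks
```

The two sets of breaks usually differ, since the two algorithms group the values by different criteria.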
Style: jenks, fisher
jenks: Jenks’ Natural Breaks Classification Method seeks to minimize the variance within classes and maximize the variance between classes.
fisher: Fisher’s Natural Breaks Classification pursues the same objective and is an improvement of jenks; it is often referred to as the Fisher-Jenks algorithm.
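The two natural breaks styles can be compared directly. This is a minimal sketch with a synthetic vector (x_demo, an assumption standing in for the UN data) that prints both sets of breaks.

```r
library(classInt)

# Synthetic stand-in for the sex ratio values (assumption, not the UN data)
set.seed(1)
x_demo <- rnorm(100, mean = 100, sd = 8)

ci_jenks  <- classIntervals(x_demo, n = 6, style = 'jenks')
ci_fisher <- classIntervals(x_demo, n = 6, style = 'fisher')

# Print both sets of breaks for comparison
list(jenks = ci_jenks$brks, fisher = ci_fisher$brks)
```

The breaks need not be identical even when both styles produce the same grouping of values, because they may place a break at different positions between two neighbouring data values.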