Interactive Data Visualization with R ggplot2

RVisLearner - Interactive Data Visualization with R graphics, ggplot2, Shiny

The interactive RVisLearner is a learning app that shows how to create data visualizations with the base R functions hist and plot from the graphics package, and with the corresponding functions of the Tidyverse ggplot2 package. Both the R graphics and the ggplot2 package can be used to plot data as histogram, scatter plot, box plot etc. and the resulting plots are similarly pleasing. They differ in how the arguments are supplied to the functions and in terms of the learning curve.

Start the RVisLearner

1 RVisLearner Usage

Top

The RVisLearner app has three tab panels, the first one for the data set, the second one for histogram plots, the third tab for scatter plots. Each tab is organized in a sidebar panel and a main panel. The sidebar panel contains general information, the main panel contains collapsible panels for parameter selection, plots and source code.

1-1 The Dataset

Top

The first tab panel features the dataset. For our visualizations, we use a synthetic data set demo_data.csv containing a sample of demographic data, with a total of 200 observations for six variables (Age, Height, Weight, BMI, BMI Category, Gender), 100 men and 100 female. Two of the variables (BMI Category and Gender) are categorical, the other numerical. The task is to visualize height, weight and body-mass-index (BMI) at different ages and compare the genders. Before starting the visualization part, it is useful to look at the summary statistics of the data set, which can be obtained using the R function summary. This function takes as input an object, for example a data frame, and returns summaries of the results of model fitting functions such as Min, Max, Median, Mean etc.

Example code

The following example code reads the data from a csv file and stores it in a data frame df, then extracts men and women data in two subsets df_m and df_f and creates the summary statistics for the subsets and the entire dataset.

df = read.csv(
  "demo_data.csv",
  header = TRUE,
  sep = ",",
  encoding = 'UTF-8'
)
df_m = subset(df, df$Gender == "M") # men
df_f = subset(df, df$Gender == "F") # women
summary(df_m) # summary statistics men
summary(df_f) # summary statistics women
summary(df) # summary statistics

1-2 Histograms

Top

A histogram is the graphic representation of the frequency distribution of a variable. The data of the variable are sorted according to their size and divided into a number of bins that can, but do not have to be, of the same width. A rectangle is drawn above each class, the area of which is proportional to the frequency of this class. A histogram can be a frequency plot, where the area represents the count of elements in this class, or a density plot, where the area represents the percentage of elements in this class.

The second tab panel explains the creation of histograms for the example data set. Histograms in R are created using the base R function hist from the graphics package and alternatively using the functions ggplot and geom_histogram from the ggplot2 package. In order to learn the differences between these two packages, select a variable, a histogram type and a package in the Parameter panel and observe the output in the Plot panel, as depicted in the screenshot below.

Example code: Histograms with R graphics vs ggplot2

Histograms created with R graphics

Function hist

This example code shows how to create two frequency histograms with hist for the column Height. The data for the histogram plot are the columns Height extracted from the data frames df_m and df_f, respectively, and are passed as first argument to the function hist, along with other arguments: the number of breaks, color and title of the plot, labels for the axes and a grid for the plot. The function par arranges the two histogram plots in a layout with 1 row and 2 columns.

df = read.csv("demo_data.csv")
df_m = subset(df, df$Gender == "M") # men
df_f = subset(df, df$Gender == "F") # women
x_m = df_m[, "Height"]
x_f = df_f[, "Height"]
par(mfrow = c(1, 2))
c = c("deepskyblue", "darkorange") # colors
h_m = hist(x_m, breaks = 15, col = c[1],
     xlab = 'Height', ylab = 'Frequency', main = 'Men',
     panel.first = grid())
h_f = hist(x_f, breaks = 15, col = c[2],
     xlab = 'Height', ylab = 'Frequency', main = 'Women',
     panel.first = grid())

Output

Histograms created with ggplot2

Functions ggplot and geom_histogram

This example code shows how to create two histograms with ggplot for the column Height. The ggplot graphs are saved as gg_m and gg_f and arranged in a grid. The properties of the plot (fill, color, labels) are set manually. In order to arrange the plots in a grid layout, the function grid.arrange from the package gridExtra is used.

library(ggplot2)
library(gridExtra)

df = read.csv("demo_data.csv")
df_m = subset(df, df$Gender == "M") # men
df_f = subset(df, df$Gender == "F") # women
c = c("deepskyblue", "darkorange") # colors
gg_m = ggplot(df_m, aes(x = Height)) +
  geom_histogram(bins = 15, fill = c[1], color = "#000000") +
  labs(x = "Height", y = "Frequency") +
  ggtitle("Men")     
gg_f = ggplot(df_f, aes(x = Height)) +
  geom_histogram(bins = 15, fill = c[2], color = "#000000") +
  labs(x = "Height", y = "Frequency") +
  ggtitle("Women")
grid.arrange(gg_m, gg_f, ncol = 2)

Output

When plotting only two variables of a data frame and specifying the attributes (labels, colors etc.) manually, the difference between graphics and ggplot2 is not that significant. The strength of ggplot2 is to create visualizations for dataframes with many variables, without having to extract and pre-process the variables, and then pass them as lenghty argument list to a plot function.

1-3 Scatter plots

Top

The third tab panel explains the creation of scatter plots for the example data set. Scatter plots show the raw data of your data set, one variable (y-axis) vs. another (x-axis) as data points, which results in a scattered appearance. Usually, a regression line is added, to determine if the variable shows a trend. When plotting multiple y-variables against the same x-axis, the plot must be visually enhanced with colors and symbols. If the data are grouped in clusters, it is safe to assume that they exhibit correlations, if they are randomly scattered all over the space, they will be independent and without correlation. Scatter plots in R are created using the R function plot from the graphics package and alternatively using the functions ggplot and geom_point from the ggplot2 package. With ggplot, trend lines are added using geom_smooth.

Tools & References

Top

This learning app assumes that you have basic knowledge of R syntax with RStudio as development environment. Here you find a selection of related links for looking deeper into these topics.

[1] R Shiny shiny.rstudio.com/.
[2] Tidyverse ggplot2 ggplot2.tidyverse.org.
[3] R Graphics package rdocumentation.org/packages/graphics.
[4] R-Tutorial "Fundamentals", this website evamariakiss.de/tutorial/r.
[5] R-Tutorial (elab2go, in German) elab2go.de/demo-r1/.