The best solution for data visualization: a ggplot2 tutorial with examples

Time:2020-6-28

Preface

ggplot is a plotting system with a complete grammar that is easy to pick up. It can be imported and used in both Python and R, and it is very widely used in data analysis and visualization. This article covers how to use ggplot2 from R. First, here are the reasons I think make it most worth recommending:

  • It is designed around stacking "layers". On the one hand this strengthens the connection between different charts; on the other hand it makes the package easier to learn and understand. Long-time Photoshop users should appreciate how convenient this is
  • It has a wide range of applications and detailed documentation. Typing ? followed by a function name brings up the function's description and matching examples in the R help pages
  • It can be used in both R and Python, which lowers the learning cost of moving between the two languages
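
The layer idea from the first point can be sketched as follows. This is a minimal illustration, assuming the diamonds dataset bundled with ggplot2; the object names (base, with_points, with_trend) are made up for the example.

```r
library(ggplot2)

# Start with an empty canvas that only knows the data and the axes
base <- ggplot(diamonds, aes(x = carat, y = price))

# Each `+` stacks one more layer on top of the canvas
with_points <- base + geom_point(alpha = 0.3)
with_trend  <- with_points + geom_smooth(method = "lm")

# The layers are stored, in order, inside the plot object
length(base$layers)        # the bare canvas has no layers yet
length(with_trend$layers)  # points plus trend line
```

Because each step returns a new plot object, the same base canvas can be reused for several different charts.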

Basic concepts

This article uses the diamonds dataset that ships with ggplot2.

> head(diamonds)
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

# Variable meanings
price  : price in US dollars ($326–$18,823)
carat  : weight of the diamond (0.2–5.01)
cut    : quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color  : diamond colour, from D (best) to J (worst)
clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x      : length in mm (0–10.74)
y      : width in mm (0–58.9)
z      : depth in mm (0–31.8)
depth  : total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table  : width of top of diamond relative to widest point (43–95)

Building on the concepts of layers and a canvas, ggplot2 develops the following grammatical framework:

Source: https://mp.weixin.qq.com/s/us…

  • data: the data source, usually a data.frame; anything else is converted to that structure
  • Individual and common mappings: the mapping = aes() argument in ggplot() is a common mapping and is inherited by geom_xxx() and stat_xxx(), while the mapping argument in geom_xxx() and stat_xxx() is an individual mapping that only applies inside that layer
  • mapping: the mappings, including color mappings (color, fill), shape mappings (linetype, size, shape) and position mappings (x, y), etc.
  • geom_xxx: geometric objects, including scatter plots, line charts, bar charts and histograms, as well as auxiliary curves, sloped lines, horizontal lines, vertical lines and text
  • aesthetic attributes: drawing parameters, including colour, size, shape, etc.
  • faceting: splitting a dataset into multiple subsets and then drawing the same chart for each subset
  • theme: specifies the theme of the chart

ggplot(data = NULL, mapping = aes(x = , y = )) +   # dataset
    geom_xxx() | stat_xxx() +   # geometric layer / statistical transformation
    coord_xxx() +               # coordinate transformation, Cartesian by default
    scale_xxx() +               # scale adjustment for a specific scale
    facet_xxx() +               # faceting on one of the variables
    guides() +                  # legend adjustment
    theme()                     # theme system
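
The difference between a common mapping and an individual mapping can be checked on a small plot. The sketch below assumes the bundled diamonds dataset and inspects the plot object directly; the fields it reads (mapping, layers) are how current ggplot2 versions store this information and may change.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price)) +  # common mapping: x and y
  geom_point(aes(colour = cut), alpha = 0.3) +       # individual mapping: colour
  geom_smooth(method = "lm")                         # adds no mapping of its own

# Mapping declared in ggplot(): shared by every layer
names(p$mapping)

# Mapping declared inside geom_point(): visible to that layer only
names(p$layers[[1]]$mapping)
```

Here geom_smooth() fits a single line for the whole dataset, because the colour mapping lives only in the point layer and is not inherited.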

You can come back to these concepts after reading the whole article and treat them as a summary. Once you have mastered them, you have mastered the core logic of ggplot2.

The meaning of some core concepts can be picked up from the figures in the official RStudio cheat sheet:

Some examples

The examples and R code below introduce the syntax of ggplot2 step by step, from simple to more involved.

1. A scatter plot with all the essentials

library(ggplot2)

# Use the diamonds dataset
ggplot(diamonds) + 
  # Draw a scatter plot: x is carat, y is price, and the point colour follows the color column;
  # alpha sets the transparency, size the point size, shape the point shape (15 = solid square)
  # and stroke the width of the point border
  geom_point(aes(x = carat, y = price, colour = color), alpha=0.7, size=1.0, shape=15, stroke=1) +
  # Add a fitted line
  geom_smooth(aes(x = carat, y = price), method = 'glm') +
  # Add a horizontal line
  geom_hline(yintercept = 0, size = 1, linetype = "dotted", color = "black") +
  # Add a vertical line
  geom_vline(xintercept = 3, size = 1, linetype = "dotted", color = "black") +
  # Add axis labels and a plot title
  labs(title = "Diamonds Point Plot", x = "Carat", y = "Price") +
  # Adjust the displayed range of the axes
  coord_cartesian(xlim = c(0, 3), ylim = c(0, 20000)) +
  # Change the theme. This one is plain; more themes are available in the ggthemes package
  theme_linedraw()
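
One detail of the axis-range step is worth a sketch: coord_cartesian() only zooms the view, while scale limits such as xlim() discard the points outside the range before anything is drawn or fitted. The toy data frame below is hypothetical and exists only to make the difference countable.

```r
library(ggplot2)

# Hypothetical toy data: ten points on a straight line
toy <- data.frame(x = 1:10, y = 1:10)
base <- ggplot(toy, aes(x, y)) + geom_point()

# coord_cartesian() zooms the view; every point is still part of the plot data
zoomed <- base + coord_cartesian(xlim = c(0, 5))

# xlim() sets scale limits; points outside them are treated as missing
clipped <- base + xlim(0, 5)

# Count the points that remain usable in the built plot data
kept_zoomed  <- sum(!is.na(layer_data(zoomed)$x))
kept_clipped <- sum(!is.na(suppressWarnings(layer_data(clipped))$x))
```

For a plot with a fitted line, this is why coord_cartesian() is usually the safer choice: the fit is still computed from the full data.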

2. Custom plot layout & multiple geometric objects

library(gridExtra)
#Build data set
df <- data.frame(
  x = c(3, 1, 5),
  y = c(2, 4, 6),
  label = c("a","b","c")
)  

p <- ggplot(df, aes(x, y, label = label)) +
  # Remove the x- and y-axis labels
  labs(x = NULL, y = NULL) +
  #Switch theme
  theme_linedraw()

p1 <- p + geom_point() + ggtitle("point")
p2 <- p + geom_text() + ggtitle("text")
p3 <- p + geom_bar(stat = "identity") + ggtitle("bar")
p4 <- p + geom_tile() + ggtitle("raster")
p5 <- p + geom_line() + ggtitle("line")
p6 <- p + geom_area() + ggtitle("area")
p7 <- p + geom_path() + ggtitle("path")
p8 <- p + geom_polygon() + ggtitle("polygon")

#Construct ggplot picture list
plots <- list(p1, p2, p3, p4, p5, p6, p7, p8)
#Custom picture layout
gridExtra::grid.arrange(grobs = plots, ncol = 4)
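
When a plain grid is not enough, gridExtra also accepts a layout_matrix. The following is a minimal sketch with three hypothetical plots, where the first plot spans the whole top row; arrangeGrob() builds the layout without drawing it.

```r
library(ggplot2)
library(gridExtra)

df <- data.frame(x = c(3, 1, 5), y = c(2, 4, 6))
base <- ggplot(df, aes(x, y)) + theme_linedraw()

p_point <- base + geom_point() + ggtitle("point")
p_line  <- base + geom_line()  + ggtitle("line")
p_path  <- base + geom_path()  + ggtitle("path")

# Each number refers to a plot; repeating a number lets that plot span cells
layout <- rbind(c(1, 1),
                c(2, 3))

combined <- arrangeGrob(p_point, p_line, p_path, layout_matrix = layout)

# Draw it on the current device when needed:
# grid::grid.newpage(); grid::grid.draw(combined)
```

The resulting object can also be passed to ggsave() to write the combined figure to a file.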

3. Box plots

In statistics, a box plot is an intuitive chart of how spread out the data is. In exploratory analysis it is often used to show the dispersion of a dependent variable across the levels of a factor variable.

Here are some of the most commonly used ways of drawing box plots:

library(ggplot2)  # plotting
library(ggsci)    # color palettes

# Box plots on the diamonds data: the categorical variable is cut and the target variable is carat
p <- ggplot(diamonds, aes(x = cut, y = carat)) +
  theme_linedraw()

# With a single factor variable, use the fill color to distinguish the categories; the legend is hidden
p1 <- p + geom_boxplot(aes(fill = cut)) + theme(legend.position = "none")
# With two factor variables, put one on the x axis and distinguish the other by fill color
p2 <- p + geom_boxplot(aes(fill = color)) + theme(legend.position = "none")
# Flip the box plot
p3 <- p + geom_boxplot(aes(fill = cut)) + coord_flip() + theme(legend.position = "none")
# Use ready-made color schemes: scale_fill_jama(), scale_fill_nejm() and scale_fill_lancet() from ggsci, or scale_fill_brewer() (blues by default)
p4 <- p + geom_boxplot(aes(fill = cut)) + scale_fill_brewer() + theme(legend.position = "none")

#Construct ggplot picture list
plots <- list(p1, p2, p3, p4)
#Custom picture layout
gridExtra::grid.arrange(grobs = plots, ncol = 2)
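
To make the "dispersion" a box plot shows more concrete, the sketch below computes the quartiles of carat within each cut using base R and compares them with the box edges that geom_boxplot() computes. It assumes the default settings of both quantile() and geom_boxplot().

```r
library(ggplot2)

# Quartiles of carat within each cut, computed with base R
quartiles <- t(sapply(split(diamonds$carat, diamonds$cut),
                      quantile, probs = c(0.25, 0.5, 0.75)))

# The box edges and centre line as geom_boxplot() computes them
p <- ggplot(diamonds, aes(x = cut, y = carat)) + geom_boxplot()
box_stats <- layer_data(p)[, c("lower", "middle", "upper")]

quartiles
box_stats
```

The box spans the first to the third quartile and the line inside it is the median, so a taller box means the middle half of the data is more spread out.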

When the box plot of a continuous variable involves several discrete variables, we often use faceting to make the chart easier to read.

library(ggplot2)

ggplot(diamonds, aes(x = color, y = carat)) +
  # Switch the theme
  theme_linedraw() +
  # Fill the boxes according to the factor variable color
  geom_boxplot(aes(fill = color)) +
  # Faceting: in essence, the data frame is split into subsets by the factor variable cut and the same box plot is drawn for each subset
  # Note: scales = "free" is usually worth adding; otherwise subsets with very different ranges are forced onto one shared scale
  facet_wrap(~cut, scales="free")
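
The effect of the scales argument can be checked numerically. This sketch compares the y-axis range of each panel with shared and with free scales; panel_y_ranges is a hypothetical helper, and the panel_params structure it reads is an internal detail of ggplot2 that may differ between versions.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = color, y = carat)) + geom_boxplot()

fixed <- p + facet_wrap(~cut)                     # shared axes (the default)
free  <- p + facet_wrap(~cut, scales = "free_y")  # each panel gets its own y axis

# Hypothetical helper: the y-axis range of every panel in a built plot
panel_y_ranges <- function(plot) {
  lapply(ggplot_build(plot)$layout$panel_params, function(pp) pp$y.range)
}

fixed_ranges <- panel_y_ranges(fixed)
free_ranges  <- panel_y_ranges(free)
```

With shared scales every panel reports the same range; with free scales each panel is fitted to its own subset, which makes within-panel detail easier to see but makes panels harder to compare directly.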

4. Bar charts

library(ggplot2)

# Plain bar chart
p1 <- ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut)) + 
  theme_linedraw() +
  scale_fill_brewer()

# Stacked bar chart (position = "stack" is the geom_bar() default)
p2 <- ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack") + 
  theme_linedraw() +
  scale_fill_brewer()
  
# Percent-stacked bar chart: every bar is rescaled to the same height
p3 <- ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") + 
  theme_linedraw() +
  scale_fill_brewer()

# Grouped bar chart: the categories are placed side by side
p4 <- ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") + 
  theme_linedraw() +
  scale_fill_brewer()

#Construct ggplot picture list
plots <- list(p1, p2, p3, p4)
#Custom picture layout
gridExtra::grid.arrange(grobs = plots, ncol = 2)
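
The numbers behind position = "fill" can be reproduced with base R: within each cut, the counts per clarity are divided by the total for that cut. A small sketch, using ggplot2 only for the diamonds dataset:

```r
library(ggplot2)  # only needed here for the diamonds dataset

# Counts per cut (rows) and clarity (columns): what the stacked bars show
counts <- table(diamonds$cut, diamonds$clarity)

# Row-wise shares: what position = "fill" shows
shares <- prop.table(counts, margin = 1)

# Share of each clarity grade among Ideal-cut diamonds
round(shares["Ideal", ], 3)
```

Every row of shares sums to 1, which is why all bars have the same height in the percent-stacked chart.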

5. Coordinate systems

Besides coord_flip(), which was used in the box plots above to flip the axes, ggplot2 provides many other functions related to coordinate systems.

library(ggplot2)

bar <- ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) + 
  #Specified ratio: the ratio of length to width is 1, which is convenient to display the figure
  theme(aspect.ratio = 1) +
  scale_fill_brewer() +
  labs(x = NULL, y = NULL)

# Flip the x and y axes
bar1 <- bar + coord_flip()
#Draw polar coordinates
bar2 <- bar + coord_polar()

#Construct ggplot picture list
plots <- list(bar1, bar2)
#Custom picture layout
gridExtra::grid.arrange(grobs = plots, ncol = 2)
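
A common variation on the polar example is a pie chart: a single stacked bar whose height is mapped to the angle. This is a sketch under that assumption; the dummy x value factor(1) exists only to put all categories into one bar.

```r
library(ggplot2)

pie <- ggplot(diamonds, aes(x = factor(1), fill = cut)) +
  # One stacked bar containing every cut
  geom_bar(width = 1) +
  # Map the bar height (y) to the angle instead of the radius
  coord_polar(theta = "y") +
  scale_fill_brewer() +
  # Drop axes and background, which carry no meaning in a pie chart
  theme_void()
```

Leaving theta at its default ("x") on the same bar produces a bullseye chart instead, so the choice of theta decides which variable becomes the angle.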

6. Tile plots and heat maps

In exploratory analysis for machine learning, corrplot is often used to plot the correlation coefficients of all variables directly and judge the overall correlation.

library(corrplot)
#Calculate correlation coefficient matrix of dataset and visualize it
mycor = cor(mtcars)
corrplot(mycor, tl.col = "black")

ggplot2 allows more customized tile plots:

library(RColorBrewer)
#Generate correlation coefficient matrix
corr <- round(cor(mtcars), 2)
df <- reshape2::melt(corr)
p1 <- ggplot(df, aes(x = Var1, y = Var2, fill = value, label = value)) +
  geom_tile() +
  theme_bw() +
  # size is a fixed setting rather than a data mapping, so it goes outside aes()
  geom_text(aes(label = value), size = 3, color = "white") +
  labs(title = "mtcars - Correlation plot") +
  theme(text = element_text(size = 10), legend.position = "none", aspect.ratio = 1)
p2 <- p1 + scale_fill_distiller(palette = "Reds")
p3 <- p1 + scale_fill_gradient2()
gridExtra::grid.arrange(p1, p2, p3, ncol=3)
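
geom_tile() needs one row per cell, which is what the melt step provides. If reshape2 is not installed, the same long format can be sketched with base R only; naming the value column "value" via responseName is a choice made here to match the code above.

```r
# Correlation matrix of mtcars, rounded as in the example above
corr <- round(cor(mtcars), 2)

# Matrix -> long data frame with one row per (Var1, Var2) pair
long <- as.data.frame(as.table(corr), responseName = "value")

head(long, 3)
```

The resulting data frame has the columns Var1, Var2 and value, so it can be passed to the ggplot() call in place of df.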

More examples

Here are 50 classic ggplot2 plotting examples:

http://r-statistics.co/Top50-…
