Data Visualization¶
Display of data in a graphical/tabular format
Helps us understand the data; humans are better at recognizing visual patterns than numeric patterns
Limitation: pareidolia, the tendency to see patterns that do not exist
Why is visualization important?¶
Widely different distributions can have the same summary statistics, so the numbers alone can hide the shape of the data
Example Case | Property |
---|---|
Anscombe's Quartet | Four datasets with nearly identical means, standard deviations, and correlations |
Datasaurus Dozen | All have the same mean, standard deviation, and correlation |
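
A minimal sketch of the point above, using the published Anscombe's Quartet values and only the Python standard library. The helper names (`pearson`, `summarize`, `SUMMARY`) are illustrative choices, not part of the original notes. The four datasets print almost identical summaries, even though their scatter plots look nothing alike.

```python
from math import sqrt
from statistics import mean, stdev

# Anscombe's Quartet: datasets I-III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def summarize(xs, ys):
    """Summary statistics that fail to distinguish the four datasets."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "std_x": stdev(xs),  # sample standard deviation
        "std_y": stdev(ys),
        "corr": pearson(xs, ys),
    }


SUMMARY = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in SUMMARY.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Plotting each `(xs, ys)` pair as a scatter plot is what reveals the difference: a linear trend, a curve, a line with one outlier, and a vertical cluster with one outlier.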
Characteristics of Great Visualization¶
- Story
- Truthful: not aimed to mislead viewers, accurate representation
- Perception: Easy to understand
- Functional: useful and enlightening
- Insightful: reveals something that wouldn't have been possible otherwise
- Aesthetics: Grammar of Graphics
- Beautiful
Grammar of Graphics¶
- Theme
    - Grid
    - Background
    - Typography
- Labels
    - Title
    - Subtitle
    - Caption: extra text, source of data, etc.
    - Axis labels
    - Legend
        - Title
        - Items: try your best to align them with the actual elements
- Coordinates
    - System
        - Cartesian
        - Polar
    - Range
        - Do not skew the axis: use the correct minimum and maximum
        - Exceptions
            - Only possible values are included
                - E.g., for human body temperature, show 98-105 °F; do not start at 0
            - Small movements matter
                - E.g., GDP across time
            - Important time range
        - Tip: when presenting, show the full view first, then zoom in and explain why
- Facets: subplots
- Scales: axis
    - Discrete
    - Continuous
    - Continuous, with breaks
    - Continuous, log
    - Almost never use dual y-axes
        - They create the appearance of spurious correlations
        - Exception: when the 2 axes measure the same thing, e.g.
            - Axis 1: percentages; Axis 2: absolute values
            - Axis 1: Celsius; Axis 2: Fahrenheit
- Geometries: type of plot (see the Plots section below)
- Aesthetics
    - Position-X
    - Position-Y
    - Color
        - Discrete
        - Continuous
    - Size
    - Shape
    - Opacity
    - Animation (for time)
- Data
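
The dual-axis exception above can be made concrete: when both axes measure the same quantity, the second axis is only a relabeling of the first axis's tick positions through a fixed conversion. The sketch below assumes the Celsius/Fahrenheit case from the list; `secondary_tick_labels` is a hypothetical helper name, not a plotting-library function.

```python
def celsius_to_fahrenheit(celsius):
    """Fixed conversion between the two temperature units."""
    return celsius * 9 / 5 + 32


def secondary_tick_labels(primary_ticks, transform):
    """Pair each primary tick with the label the secondary axis shows at the same position."""
    return [(tick, transform(tick)) for tick in primary_ticks]


celsius_ticks = [0, 10, 20, 30, 40]
for c, f in secondary_tick_labels(celsius_ticks, celsius_to_fahrenheit):
    print(f"{c:>3} °C = {f:>5.1f} °F")
```

Because every secondary label is computed from a primary tick, the two axes cannot be rescaled independently, which is exactly what prevents the spurious correlations that unrelated dual axes invite.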
Plots¶
Type | Chart | Purpose | Limitations |
---|---|---|---|
Uni-Variate | Pie Chart | Part of a whole | Hard to compare parts, because values are encoded as angles and areas |
Uni-Variate | Bar Chart | Comparing values of groups | |
Uni-Variate | Lollipop Chart | Comparing values of groups where only the end of the bar matters | |
Uni-Variate | Column Chart | Comparing values across time | |
Uni-Variate | Waffle Chart | Shows values as squares | |
Uni-Variate | Box/Box-Whiskers Plot | Helps understand the distribution of a variable | |
Uni-Variate | 1D Histogram | Visualizes the frequency distribution of an attribute. Continuous data must be binned. Bin convention: values are left-inclusive and right-exclusive; the last bin is also right-inclusive. Relative uncertainty of each bin frequency \(\propto \dfrac{1}{\sqrt{\text{count}}}\) | Shape may change dramatically depending on bin settings; bins with few counts have high statistical uncertainty; interpretation can be difficult without large amounts of data |
Uni-Variate | Density Plot | Smooth version of a histogram, controlled by the bandwidth and the kernel | |
Uni-Variate | Pyramid Histogram | | |
Uni-Variate | Violin Plot | Smooth version of a pyramid histogram | |
Uni-Variate | Strip Plot | Individual points, jittered to reduce overlap | |
Uni-Variate | Q-Q Plot | Quantile-quantile plot comparing a distribution's quantiles with the quantiles of a known distribution (such as the Normal distribution) | |
Uni-Variate | Beeswarm | | |
Uni-Variate | Ridge Plot | Multi-variate density plot that offsets the individual densities | |
Uni-Variate | Raincloud Plot | Jitter plot + box plot + violin plot | |
Uni-Variate | Treemap | | Hard to interpret |
Uni-Variate | Conditional Quantitative | Bin the quantitative data and make a separate plot per bin. Useful for inspecting error distributions | |
Uni-Variate | Stem & Leaf Plot | Shows the distribution of values of an attribute; useful when there aren't many values. Steps: (1) split values into groups whose members are the same except for the last digit; (2) each group becomes a stem (the higher-order digits), and the last digits of its members are the leaves (the lower-order digits); (3) plot stems vertically and leaves horizontally | |
Bi-Variate | Scatter Plot | | |
Bi-Variate | Line Plot | Comparing the trend of a value across time | |
Bi-Variate | 2D Histogram | Helps understand the frequency of co-occurrence of 2 attributes | |
Bi-Variate | Heatmap | Looking at overall patterns | |
Bi-Variate | Mosaic Plot | | Hard to interpret |
Tri-Variate | Contour Plot | Used for spatial data | |
Multi-Variate | Parallel Coordinates | | |
Multi-Variate | Pair Plot / Scatter Plot Matrix | A matrix of scatter plots | May get overwhelming for a large number of variables |
Multi-Variate | Correlogram Heatmap | | |
Multi-Variate | Correlogram Bar | | |
Regression | Coefficients Plot | | |
Regression | Marginal Effects Plot | | |
Comparison | Sparklines | | |
Comparison | Slopegraphs | | |
Geospatial / Map | Choropleth | | May hide details, such as population |
Geospatial / Map | 2D Map | Requires a projection. Best: Robinson. Most common: Mercator, which is also the worst | Distortion of area |
Text | Word Cloud | | Looks cool, but is not very useful, much like a pie chart; use a bar chart instead |
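
The stem-and-leaf steps listed in the table can be sketched as follows. This is a minimal illustration that assumes non-negative integers and a leaf unit of 1; the function names and the sample values are made up for the example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (all digits but the last) and leaf (last digit)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    # Keep empty stems so that gaps in the data stay visible.
    return {stem: groups.get(stem, []) for stem in range(min(groups), max(groups) + 1)}


def render(plot):
    """Stems run vertically, leaves horizontally."""
    return "\n".join(
        f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in plot.items()
    )


scores = [12, 15, 21, 21, 24, 30, 47, 48]  # made-up illustrative data
print(render(stem_and_leaf(scores)))
```

Sorting first keeps the leaves in ascending order within each stem, so the printed rows read like a histogram turned on its side while still showing every original value.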
Cheat Sheet¶
Uncertainty Visualization¶
Helps avoid misunderstanding of uncertainty
- Histogram
- Box plot
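
For the histogram, the binning convention and the per-bin uncertainty noted in the Plots table can be sketched as below. The notes only state that the relative uncertainty is proportional to 1/sqrt(count); treating it as exactly 1/sqrt(count) assumes Poisson counting statistics, which is an assumption of this sketch. The function names and sample data are illustrative.

```python
from math import sqrt


def histogram(values, low, high, n_bins):
    """Count values in n_bins equal-width bins over [low, high].

    Bins are left-inclusive and right-exclusive, except the last bin,
    which also includes its right edge. Values outside the range are ignored.
    """
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in values:
        if low <= value <= high:
            # A value equal to `high` would index past the end, so clamp it into the last bin.
            index = min(int((value - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


def relative_uncertainty(count):
    """Relative uncertainty of a bin count, assuming Poisson statistics: 1 / sqrt(count)."""
    return 1 / sqrt(count) if count > 0 else float("inf")


data = [0, 1, 1, 2, 2, 2, 3, 4, 4]  # made-up illustrative data
counts = histogram(data, low=0, high=4, n_bins=4)
for i, count in enumerate(counts):
    print(f"bin {i}: count={count}, relative uncertainty={relative_uncertainty(count):.2f}")
```

Drawing these relative uncertainties as error bars on the bins makes it visible which parts of the histogram shape are supported by enough data and which are mostly noise.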