Tolkien of Dataviz

Apr 13

"J.R.R. Tolkien has become a sort of mountain, appearing in all subsequent fantasy in the way that Mt. Fuji appears so often in Japanese prints. Sometimes it’s big and up close. Sometimes it’s a shape on the horizon. Sometimes it’s not there at all, which means that the artist either has made a deliberate decision against the mountain, which is interesting in itself, or is in fact standing on Mt. Fuji." - Terry Pratchett

What Tolkien accomplished in the mid-20th century was to codify fantasy, to collect the loose legends of orcs, dwarves, and elves into a single vision, and to wrap them into a singular, epic, essentially non-sequelable story. The Lord of the Rings is written in a way that is not amenable to a direct continuation or expansion (though people tried). However, it is infinitely open to adaptation into movies and video games, or to even looser inspiration. Your story might involve Tolkienesque elves, anti-Tolkien elves, or elves in space, but it is hard to run away from the archetype.

I read The Lord of the Rings and related books multiple times in my teenage years. Peter Jackson's movie trilogy has aged very well in its CGI and epic battles, but less so in gender and racial representation. Yet any new fantasy work is directly compared to those books and movies.

My recent first trip to Japan was defined by not visiting Mt. Fuji or even seeing it in the distance. Perhaps my next trip to Japan will be defined by Mt. Fuji in some other way.

Data visualization is defined by Edward Tufte in "The Visual Display of Quantitative Information". $17.50 in a used bookstore for a 2013 reprint of the 2nd edition.

Tufte presents his book as a self-exemplifying artifact printed on quality paper in a large format with manual typesetting and interlacing of text and graphics. Chapter 8 opens with the question of how many distinct locations the unaided human eye can tell apart in one square inch of space. Tufte draws 80 parallel vertical black lines crossed by 80 parallel horizontal lines. The lines are separated by 79 white gaps in each direction. The resulting picture has (80+79)^2 = 25,281 positions where a line or gap in one direction crosses a line or gap in the other, and all of them are distinguishable locations.
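The count is easy to check. Here is a minimal sketch, assuming (as the passage above does) that every black line and every white gap counts as one distinguishable position along each direction; the function name is mine, not Tufte's.

```python
def distinguishable_locations(n_lines: int) -> int:
    """Count positions in a square grid with n_lines black lines per direction.

    Assumption: each black line and each white gap between adjacent lines
    counts as one distinguishable position along an axis.
    """
    gaps = n_lines - 1          # white gaps between adjacent lines
    per_axis = n_lines + gaps   # 80 + 79 = 159 in the example above
    return per_axis ** 2


print(distinguishable_locations(80))  # 25281
```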

All data visualization amounts to placing ink in some pattern on a blank background (paper or screen) in order to communicate the data to the reader. Communication can be good or bad, but how do you judge? Tufte's stickiest idea is the data-ink ratio, where the numerator is the ink devoted to data points, and the denominator also includes all other parts such as axes, labels, legends, and occasional frivolous decorations. Plots that only show the fitted model instead of any data score a value of exactly zero. Plots that assume axes and labels to be known from context can approach a value of one.
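To make the two limiting cases concrete, here is a small sketch of the ratio as described above. The ink amounts are made-up numbers in arbitrary units (think of counts of dark pixels), and the split into "data" and "non-data" ink follows this post's reading, where a fitted curve does not count as data.

```python
def data_ink_ratio(data_ink: float, non_data_ink: float) -> float:
    """Fraction of the total ink that is spent on the data points themselves."""
    if data_ink < 0 or non_data_ink < 0:
        raise ValueError("ink amounts cannot be negative")
    total = data_ink + non_data_ink
    if total == 0:
        raise ValueError("a plot with no ink has no defined ratio")
    return data_ink / total


# Hypothetical ink budgets, purely for illustration.
typical = data_ink_ratio(data_ink=300, non_data_ink=700)    # points plus box, ticks, legend
model_only = data_ink_ratio(data_ink=0, non_data_ink=500)   # fitted curve, no data points
bare = data_ink_ratio(data_ink=300, non_data_ink=0)         # data only, axes left to context

print(typical, model_only, bare)  # 0.3 0.0 1.0
```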

Tufte associates a higher data-ink ratio with more honesty and proposes methods to increase it. The first round of graphical optimization is a little trim of the plots you would observe in a typical newspaper article, research paper, or conference talk. While Matplotlib and other modern software default to putting a closed four-sided box around each plot, Tufte eagerly ditches the right and top boundaries. Where software places axis tick marks at "nice" numbers like 0, 10, 20, 30, Tufte replaces those with the empirically smallest and largest values in the dataset.
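Both trims take a few lines in Matplotlib. The following is a sketch under my own assumptions rather than a canonical recipe: the sample data are invented, and stopping the axis lines at the data range with `set_bounds` is an optional extra on top of what the paragraph describes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up sample data, purely for illustration.
x = [1.3, 2.1, 3.8, 5.2, 7.9]
y = [0.4, 1.7, 1.1, 2.9, 3.6]

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(x, y, "o", color="black")

# Drop the top and right sides of the default four-sided box.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

# Replace the "nice" ticks with the smallest and largest observed values.
ax.set_xticks([min(x), max(x)])
ax.set_yticks([min(y), max(y)])

# Optional: let the remaining axis lines span only the range of the data.
ax.spines["bottom"].set_bounds(min(x), max(x))
ax.spines["left"].set_bounds(min(y), max(y))
```

Calling `fig.savefig(...)` with a filename of your choice writes the result to disk.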

His biggest crusade is against the boxplot, a common visualization of empirical data that aims to communicate summary statistics (median, quartiles, outliers). Tufte redesigns the boxplot to use as few lines as possible, in what is known as the midgap plot. In theory, the midgap plot makes the data-ink ratio go up... in practice it causes more errors in reader comprehension, as shown by studies on undergrads. The modern, anti-Tuftean school of thought promotes more involved forms such as the raincloud plot, which combines raw data points, summary statistics, and nonparametric estimates of the distribution.
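For reference, here is a sketch of the summary statistics that a conventional boxplot communicates. The quartile interpolation method and the 1.5 × IQR whisker rule are assumptions on my part (they are common defaults, but other conventions exist), and the sample is invented.

```python
from statistics import quantiles


def boxplot_summary(values, whisker=1.5):
    """Return the statistics that a conventional boxplot encodes.

    Assumptions: quartiles use the "inclusive" interpolation method and
    whiskers follow the common 1.5 * IQR rule.
    """
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme points still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Made-up sample with one obvious outlier.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
```

A raincloud plot would show these same numbers alongside the raw points and a density estimate.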

Tufte's second stickiest idea is data density, or the number of data points per unit area. Since he has a very high opinion of the human eye's ability to discern detail, he pushes density to its limit in the sparkline, a time series plot with no axes, inserted directly into the text, taking up the space of an average word and achieving a data-ink ratio of 1. A sparkline relies on the reader's strong understanding of the axes' meaning and scaling, and it quickly communicates the overall trend: is the quantity rising, falling, fluctuating, or oscillating? Sparklines have been implemented in common spreadsheet software and are often used to communicate the dynamics of stock prices, weather time series, and also... Oh, that's it actually. There are few other time series where the axes' meaning and scale are just as immediately obvious and standardized.
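A word-sized sparkline can even be approximated in plain text. This is a toy sketch using eight Unicode block characters, not how spreadsheet software draws them. Rescaling each series to its own minimum and maximum is my choice here, and it is exactly the step that throws away the axis information the reader is expected to supply.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight Unicode block heights, lowest to highest


def sparkline(values):
    """Render a numeric series as a word-sized string of block characters."""
    values = list(values)
    if not values:
        return ""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series has no trend to show; draw it at mid-height.
        return BLOCKS[3] * len(values)
    scale = (len(BLOCKS) - 1) / (hi - lo)
    return "".join(BLOCKS[round((v - lo) * scale)] for v in values)


print(sparkline([1, 2, 3, 4, 5, 6, 7, 8]))  # ▁▂▃▄▅▆▇█
print(sparkline([3, 3, 3]))                 # ▄▄▄
```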

Turning the qualitative guides to good graphics into a quantified target is good in moderation. The numerical target allows you to do gradient descent in images and explain why a figure is improved by removing redundant tick labels or reducing white space. The ink drop test is a useful intuition-building exercise to find which parts of your figure actually carry information. However, too many steps of gradient descent overfit to the bad proxy objective and lead to absolutist recommendations.

Tufte is also conspicuously upset about the use of color in quantitative communication. The only thing he hates more is hatching. Fortunately, the theory and practice of color in scientific communication have advanced a lot in the last two decades. Even Matplotlib switched its default colors from high-saturation pure R, G, B to a much more muted blue-orange-green-red sequence, and I immediately flag the old defaults in any presentation. Similarly, I regularly flag the jet/rainbow colormaps in papers and talks from Very Fancy Labs, as they are just about the worst choice in terms of visual distortion of data. Cynthia Brewer and Fabio Crameri fixed this for me and showed how to use color as a powerful dimension of data visualization.

Long before I read Tufte myself, my PhD advisor applied the ink drop test to my draft figures. We did streamline some subplot organization and removed some dots from panel labels ("a" instead of "a."). But in other examples ink went up as we shifted the message from potential energy landscapes (few smooth curves) to force fields (many spiky arrows). My figures communicate not the empirical values I happened to observe, but the physical meaning and bounds on the weird new quantities I have to introduce in every paper. Design stress anyone?

I put a lot of effort into my figures and take a lot of pride in the results. I am not winning figure awards or scoring journal covers. As a theorist, I aim to illustrate the variety of behaviors that a system can have. The main genre of my figures is, to borrow Tufte's term, small multiples: collections of plots in identical coordinates with some difference of parameters or regimes to facilitate comparison. My paper writing process starts with the skeleton of figures that tell the story: the problem setup, the method, the different effects/regimes, and the final flowchart or vision. The paper text then contextualizes and enriches the central points made in the figures.

However, even if your figures are not beautiful, they can be so useful that someone will want to steal them. My first published paper was on astrophysics, specifically the distribution of star masses. I derived a model that fit the empirical data reasonably well and my coauthor drew up Fig. 4 for the paper. Shortly after the paper was published in early 2016, the figure was blatantly stolen and used without citation in a problem at the International Astronomy Olympiad 2016 (not to be confused with the International Olympiad on Astronomy and Astrophysics because... reasons). The version of the figure presented to the high school student competitors relabeled the tick marks, removed a few of the curves, and got rid of the legend. But the version in the official solutions retained the exact spelling of the surviving legend items, so that the figure is unmistakably ours. A few years later, an unrelated professor contacted us and respectfully asked for permission to reproduce the same figure in an astrophysics textbook, which we gladly allowed. Good graphics have legs.

Orcs can copy or defy Tolkien. Japanese itineraries lead to Mt. Fuji or away from it. Scientific graphics use Tufte's forms and arguments or resist them. You can cherry-pick the methods for increasing the data-ink ratio, or you might never have heard of sparklines and midgap plots. But if someone in the last 40 years critiqued your figures, chances are their language descends from Tufte's book.

Andrei Klishin
