Computational data wizardry
My scientific work involves blurring the line between analytical and computational methods: I try to come up with good new mathematical paradigms that account for phenomena of interest, but I also need to crunch some numbers. This all sounds nice and neat in research statements, but how does this blending work in practice, day to day?
It is very common these days to implement a research computation or simulation and run it on your own computer. When I was starting undergrad some eleven years ago, my computational lab instructor asked the students to come by with their USB sticks, one at a time, and handed out pirated copies of Mathematica, Matlab, and Mathcad. Each required learning a different syntax and waiting until a new version was released (and, where applicable, pirated) to get new features. Today, with the rise of free, lightweight, easily extensible Python, computation is available to everyone. More and more mainstream physics undergrad programs include instruction in computational techniques, with a basic understanding of parameters like step size or pseudorandom seed. Introducing the language of simulations to students is, undoubtedly, a good thing.
It is helpful to think of a simulation as a mapping from the state space to the data space. A point in the state space consists of your input parameters - both the "physical" ones like gravity or coupling strength, and the "computational" ones like step size, simulation length, seed, or replica index. A point in the data space is a configuration, a trajectory recording, a snapshot, or any other detailed description of the system's state or history. Sometimes the mapping between the two is nothing more than plugging definite numbers into an analytic, closed-form formula. Other times it is an implementation of a deterministic algorithm, or even a stochastic one that becomes deterministic once the seed is fixed. It is convenient to encapsulate the simulation as a class or a callable function: you call it with a point in the state space, it burns some CPU (or GPU) time, and it returns a point in the data space.
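To make this concrete, here is a minimal sketch of such an encapsulated simulation in Python; the function name and parameters (width, height, coupling, and so on) are invented for illustration rather than taken from any real project:

```python
import numpy as np

def run_simulation(width, height, coupling, n_steps, seed):
    """Map one point in the state space to one point in the data space.

    The "physical" parameters (width, height, coupling) and the
    "computational" ones (n_steps, seed) together define the input;
    the returned trajectory is the detailed record of the system's history.
    """
    rng = np.random.default_rng(seed)   # fixing the seed makes the stochastic run deterministic
    state = np.zeros((width, height))
    trajectory = []
    for _ in range(n_steps):
        # stand-in for the actual update rule of the model
        state += coupling * rng.normal(size=state.shape)
        trajectory.append(state.copy())
    return np.array(trajectory)
```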
However, the stories we tell in scientific papers are never about just one point. They are always about how a change in input causes a change in output. To trace that change, we usually need to make a systematic sweep over parameters, or run the stochastic simulation several times with different seeds to gather statistics. Often it is not a single one-dimensional sweep, but several sweeps running in different directions, or a whole multi-dimensional grid of points. In other words, our story is not a point, but a tapestry woven across the two spaces. The encapsulated simulation code is still crucial to telling the story, but it is just a building block. Out of such blocks, we need to build a superstructure.
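Sticking with the invented parameters from the sketch above, a multi-dimensional grid of state points is nothing more than a Cartesian product:

```python
from itertools import product

widths = [8, 10, 12]     # a sweep over one system dimension
heights = [7, 9, 11]     # a sweep over another
seeds = range(5)         # several replicas per point for statistics

# Every combination is one state point; together they form the grid.
state_points = [
    {"width": w, "height": h, "coupling": 0.5, "n_steps": 10_000, "seed": s}
    for w, h, s in product(widths, heights, seeds)
]
```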
The typical simulations I run take anywhere from a few seconds to about an hour. Of these, I need dozens to thousands to make my scientific arguments. Sometimes, for every state point I need to perform several operations: for instance, generate a stochastic trajectory and then compute some observables on it. All this requires running my simulation many times and storing the output on the hard drive. The first few times, I used a naive method: I created a list of state points (jobs to be computed) in a file titled job_sizes.yml, and each job wrote its output to a file with an exciting name like out_w10_h9_r2_mm_1.pkl. My superstructure script simply checked whether the output file existed yet, and if not, ran the computation and created it. The digits in the file name encode the state point parameters: width, height, replica index, and multi-marginalizing index (from my paper on entropic order). To run several series of computations, I made several folders with stacks of such files. After each file was produced, a post-processing script iterated through them all and stitched the data into the single figure you see in the final paper.
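In sketch form (reusing the hypothetical run_simulation and state_points from above, not the actual code from that project), the naive superstructure amounts to a loop like this:

```python
import os
import pickle

for sp in state_points:                    # the list of jobs to be computed
    fname = f"out_w{sp['width']}_h{sp['height']}_r{sp['seed']}.pkl"
    if os.path.exists(fname):              # skip state points that already have output
        continue
    data = run_simulation(**sp)            # burn the CPU time
    with open(fname, "wb") as f:
        pickle.dump(data, f)               # one exciting file name per state point
```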
If the code structure above looks familiar, congratulations, you are a computational scientist! If it looks unwieldy, fragile, and depressing, then you also notice the limitations of the approach. What if I need even more parametric sweeps? What if I need to store more parameters in each state point, not just integers but also floating-point numbers or strings? What if I need to perform multiple operations on each data set in an automated pipeline? What if all of this needs to happen on a remote computational cluster? There's gotta be a better way! Fortunately, I am not the only one with these thoughts.
I would like to introduce you to Signac, a Python data management package for scientific computing. Signac the Package is named after the French pointillist painter Paul Signac, who composed his paintings out of a myriad of individual little paint patches, much like we do with state spaces and data spaces today. The Signac package doesn't run simulations for you; it takes care of data management. You create state points described by arbitrarily long dictionaries of parameters, define operations on them, and let it run. Within seconds, Signac finds and sorts state points by any subset of criteria you request. You trade away the "human readability" of your state points (if, of course, you think out_w10_h9_r2_mm_1.pkl was ever human readable). In exchange, you get the freedom to organize the data within each state point however you like, whether it takes up kilobytes or terabytes, on your local laptop or on a remote supercomputing cluster.
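A minimal sketch of the basic workflow, roughly in the spirit of the signac tutorial (the exact calls may differ slightly between signac versions, so check the documentation for the one you install):

```python
import signac

# Assumes `signac init` has already been run in this directory.
project = signac.get_project()

# A state point is an arbitrary dictionary of parameters, not an encoded file name.
for seed in range(5):
    job = project.open_job({"width": 10, "height": 9, "coupling": 0.5, "seed": seed})
    job.init()                                  # creates the job's workspace on disk
    # job.fn("trajectory.npy") is where this state point's data would live

# Later: pull out any slice of the parameter space by any subset of criteria.
for job in project.find_jobs({"coupling": 0.5}):
    print(job.id, job.sp.seed)
```

Each job gets its own workspace directory named by a hash of its state point, which is what replaces the hand-rolled file names.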
Signac was developed by people with serious computer science training, but it is aimed at users without such training, like myself. The tutorial and the detailed reference are excellent and more than enough to get started. If you are interested in the philosophical origins of Signac and a benchmark comparison to other data storage methods, see this paper. More broadly, Signac fits neatly into the larger research software stacks that often originate from lazy refactoring, as described here.
A data management system like Signac encourages dancing with the data: investigating more of the parameter space, staying flexible about organizational schemas and adaptive about analysis pipelines, all without sacrificing rigor or reproducibility. It has become a standard part of my computational infrastructure in most new projects: as soon as I implement and debug a new simulation, I sketch out the operations flow, outline the first few parameter sweeps I want, and start building up the superstructure.
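One way to set up that operations flow is with the companion signac-flow package; a minimal sketch, with a hypothetical operation and decorator details that may vary between signac-flow versions, looks like this:

```python
from flow import FlowProject

class Project(FlowProject):
    pass

@Project.operation
def simulate(job):
    # Placeholder for the real encapsulated simulation acting on this state point.
    result = job.sp.width * job.sp.height
    with open(job.fn("result.txt"), "w") as f:
        f.write(str(result))

if __name__ == "__main__":
    Project().main()   # e.g. `python project.py run` executes eligible operations
```

The same script can, where applicable, submit those operations to a cluster scheduler instead of running them locally.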