plot.data - Danielle Callan

This is a project near and dear to my heart. There are days I can’t decide if it’s a masterpiece or a hot mess. It’s probably a little of both, and I suspect no matter how much effort I put into it I’ll never be quite satisfied. That’s the way it is with works of art. The plot.data project represents significant growth for me both as a developer and a data visualization practitioner. It’s the first R package I wrote and is the statistical layer of the data visualization system used by VEuPathDB.

Development

The plot.data package began life in a functional style. The design was fairly simple: one function for each plot type we knew we needed. In hindsight I sometimes think this was naive. In truth, though, it wouldn’t have made sense to factor it any other way until the broader system had grown and we had gained some perspective.

These functions had a simple goal: take a data.frame of raw data and provide plot-ready aggregates as JSON. For example, the `boxplot` function would take raw numeric values in a data.frame and compute a five-number summary and outliers. It would then bundle that result into JSON and write it to a file. The only thing the function formally returned was the file handle.
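To make the statistics concrete, here is a minimal sketch of the kind of summary a box plot needs. The function name `box_summary` and the 1.5 times spread fence rule are my assumptions for illustration, not the actual plot.data code.

```r
# Hypothetical helper: summarise one numeric column for a box plot.
box_summary <- function(df, column) {
  values <- df[[column]]
  values <- values[!is.na(values)]

  # Tukey five-number summary: min, lower hinge, median, upper hinge, max.
  # Note that min and max here are taken over all values, outliers included.
  five <- stats::fivenum(values)

  # Assumed rule: points more than 1.5 hinge spreads beyond a hinge are outliers.
  spread <- five[4] - five[2]
  lower_fence <- five[2] - 1.5 * spread
  upper_fence <- five[4] + 1.5 * spread
  outliers <- values[values < lower_fence | values > upper_fence]

  list(
    min = five[1], q1 = five[2], median = five[3],
    q3 = five[4], max = five[5],
    outliers = outliers
  )
}

raw <- data.frame(y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50))
result <- box_summary(raw, "y")
```

With this input the only value flagged as an outlier is 50, since the upper fence sits at 15.5.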

That design made meaningful unit tests harder to write than I would have liked. It also created a lot of unnecessary overhead in the form of boilerplate code responsible for basic validation and for guaranteeing that the response met the established API. Soon each plot type had two functions: one returning a data.frame of the results and the other returning the file handle as previously described. In this new paradigm, the function returning a file handle of JSON-formatted data simply called its data.frame counterpart and formatted the results.
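The two-function split described above might look roughly like the sketch below. The names `box_data`, `box_json`, and `to_json_object` are hypothetical, the hand-rolled serializer only stands in for a real JSON library, and the sketch returns a file path rather than a true file handle to stay short.

```r
# Hypothetical: the testable half, returning a plain data.frame.
box_data <- function(df, column) {
  stats <- grDevices::boxplot.stats(df[[column]])$stats
  data.frame(
    min = stats[1], q1 = stats[2], median = stats[3],
    q3 = stats[4], max = stats[5]
  )
}

# Hypothetical: minimal serializer for a one-row numeric data.frame.
to_json_object <- function(row) {
  values <- vapply(row, function(v) format(v), character(1))
  fields <- sprintf('"%s":%s', names(row), values)
  paste0("{", paste(fields, collapse = ","), "}")
}

# Hypothetical: the thin wrapper that only formats and writes.
box_json <- function(df, column, path = tempfile(fileext = ".json")) {
  result <- box_data(df, column)
  writeLines(to_json_object(result), path)
  path
}

raw <- data.frame(y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50))
json_path <- box_json(raw, "y")
```

The appeal of the split is that unit tests can assert directly on the data.frame from `box_data` without reading anything back from disk.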

That helped with unit tests somewhat, but it came at the cost of some very odd arguments on the functions that returned only data.frames. I decided that this problem and the validation problem were really the same: we needed classes. I had some previous exposure to the S3 class system and decided to start there. Thus plot.data v1.0.0 was brought to life. It is made available to clients through Rserve and the EDA Data Service.
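As a rough illustration of how an S3 class can absorb the validation boilerplate, consider the sketch below. The constructor `new_boxplot` and the generic `as_frame` are names I made up for this example, not the package's real interface.

```r
# Hypothetical constructor: all input validation lives in one place.
new_boxplot <- function(df, column) {
  stopifnot(
    is.data.frame(df),
    is.character(column), length(column) == 1,
    column %in% names(df),
    is.numeric(df[[column]])
  )
  structure(list(data = df, column = column), class = "boxplot")
}

# Hypothetical generic: a second generic for the JSON file output
# would follow the same pattern.
as_frame <- function(x, ...) UseMethod("as_frame")

as_frame.boxplot <- function(x, ...) {
  s <- grDevices::boxplot.stats(x$data[[x$column]])
  data.frame(
    min = s$stats[1], q1 = s$stats[2], median = s$stats[3],
    q3 = s$stats[4], max = s$stats[5],
    n_outliers = length(s$out)
  )
}

raw <- data.frame(y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50))
bp <- new_boxplot(raw, "y")
summary_frame <- as_frame(bp)
```

Because the object is validated once at construction, the methods can take a single argument instead of repeating checks or carrying odd extra parameters.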

Future Directions

It has served reasonably well for a couple of years. I suspect the next step will be to refactor the classes. As they are organized now, the classes represent plot types. There is a `boxplot` class, for example, with methods to get either a data.frame or a file handle. I hope the next iteration of plot.data will be more modular and flexible. We’re starting to see plot classes that share similar statistics, and I’d like to be better able to handle that in the long term.

Knowing what I now know, I have a guess what the new classes might be. I think we might see `Visualization` as a class that houses multiple `Plot` objects. The `Plot` class could see the usual suspects as child classes. It would also know about one or more `Panel` objects, to support things like what ggplot2 calls facets. Then any `Panel` could have multiple `Group` objects. These would be for stratifying within a `Panel`, like by color. Finally, there would be a `Bin` object for when any `Group` consists of multiple bins along the X-axis, as in the histogram. I’d also strongly consider adding something like `PlotJSON` in order to make validation against the defined API easier.
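The nesting described above could be sketched as follows. This is only a guess at the shape: every constructor name is hypothetical, plain S3 lists are used for brevity, and `PlotJSON` is left out.

```r
# Every name below is hypothetical; this only illustrates the nesting.
all_inherit <- function(items, class_name) {
  is.list(items) && all(vapply(items, inherits, logical(1), what = class_name))
}

new_bin <- function(start, end, count) {
  structure(list(start = start, end = end, count = count), class = "Bin")
}

new_group <- function(label, bins) {
  stopifnot(all_inherit(bins, "Bin"))
  structure(list(label = label, bins = bins), class = "Group")
}

new_panel <- function(label, groups) {
  stopifnot(all_inherit(groups, "Group"))
  structure(list(label = label, groups = groups), class = "Panel")
}

# Child classes such as "Histogram" sit in front of the shared "Plot" class.
new_plot <- function(type, panels) {
  stopifnot(all_inherit(panels, "Panel"))
  structure(list(panels = panels), class = c(type, "Plot"))
}

new_visualization <- function(plots) {
  stopifnot(all_inherit(plots, "Plot"))
  structure(list(plots = plots), class = "Visualization")
}

# A histogram with one panel, stratified into two color groups.
red <- new_group("red", list(new_bin(0, 1, 5), new_bin(1, 2, 2)))
blue <- new_group("blue", list(new_bin(0, 1, 3), new_bin(1, 2, 7)))
panel <- new_panel("site A", list(red, blue))
viz <- new_visualization(list(new_plot("Histogram", list(panel))))
```

Each level only checks the class of its direct children, so a statistic shared by several plot types could live on `Group` or `Bin` rather than being repeated per plot class.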

I also suspect that plot.data will move from the S3 class system to S4. As the effort grows it is attracting more hands. The S3 system relies heavily on convention, while S4 is more formal. With S3 it was good to be able to implement classes quickly, and maintaining convention was not much trouble with only one or two developers. Sooner rather than later, though, I think that will become too great a headache to continue.
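A small sketch of the difference, using a hypothetical `Bin` class with assumed rules about what makes a bin valid: S3 will happily build an inconsistent object, while S4 lets the class itself reject one.

```r
library(methods)

# Hypothetical S3 object: nothing stops an inconsistent bin from being built.
s3_bin <- structure(list(start = 1, end = 0, count = 3), class = "bin_s3")

# Hypothetical S4 class: slot types are declared, and the validity function
# runs when new() is called with slot values.
setClass(
  "Bin",
  slots = c(start = "numeric", end = "numeric", count = "numeric"),
  validity = function(object) {
    if (length(object@start) != 1 || length(object@end) != 1 ||
        length(object@count) != 1) {
      return("start, end and count must each have length 1")
    }
    if (object@end <= object@start) return("end must be greater than start")
    if (object@count < 0) return("count must be non-negative")
    TRUE
  }
)

ok <- new("Bin", start = 0, end = 1, count = 3)

# The same inconsistent values that S3 accepted are rejected here.
problem <- tryCatch(
  new("Bin", start = 1, end = 0, count = 3),
  error = function(e) conditionMessage(e)
)
```

With more contributors, having the rules enforced by the class definition rather than by shared habit is the main attraction.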

Result

As far as the statistics themselves go, I genuinely believe plot.data does at least as good a job as any other package I’ve seen, and does it more quickly. It often even does a better job. Our binning of histograms, for example, uses one of three different algorithms depending on the statistical characteristics of the data. You can find some thoughts I put together previously on this topic here.
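To give a flavor of data-dependent binning, here is a sketch that picks between three bin-count rules that ship with base R. The selection logic and its thresholds are invented purely for illustration. They are not the rules plot.data uses, and the three algorithms it chooses between may differ from these.

```r
# Hypothetical selector: thresholds and rule choices are assumptions.
choose_bin_count <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)

  # Degenerate input: nothing to spread across bins.
  if (n < 2 || stats::var(x) == 0) {
    return(list(rule = "single bin", bins = 1L))
  }

  # Sample skewness, computed by hand to stay within base R.
  centered <- x - mean(x)
  skew <- mean(centered^3) / (mean(centered^2)^1.5)

  if (n < 30) {
    # Small samples: Sturges' rule depends only on n.
    list(rule = "Sturges", bins = grDevices::nclass.Sturges(x))
  } else if (abs(skew) > 1) {
    # Strongly skewed data: Freedman-Diaconis uses the IQR, which resists tails.
    list(rule = "Freedman-Diaconis", bins = grDevices::nclass.FD(x))
  } else {
    # Roughly symmetric data: Scott's rule uses the standard deviation.
    list(rule = "Scott", bins = grDevices::nclass.scott(x))
  }
}

small <- choose_bin_count(1:20)
symmetric <- choose_bin_count(1:100)
skewed <- choose_bin_count((1:100)^5)
```

In this sketch the three calls land on Sturges, Scott, and Freedman-Diaconis respectively, which shows the general idea of letting the data's shape drive the choice.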

For anyone interested, the source code can be found here.
