RNASeq Transcription Summary

This is the project that made me really internalize the lesson that you need to use the right tools for the job. I'd been slowly learning that over a couple of years, but this was the turning point. At the beginning of the effort I was writing R, and by the end I was writing JavaScript. The final result is still a patchwork of the two, but one day I hope to remedy that.

Background

There was a request to view, in a single visualization, all RNASeq data for a particular gene across every sample in every available dataset. There wasn't much detail besides that. Everyone I talked to had a different idea of what that actually meant, what the visualization should be, or how to implement it. The only thing there was consensus on was that I was writing it. In hindsight I guess I should have seen that as a vote of confidence. At the time it was frustrating.

I probably heard every statistical chart you can think of mentioned. Someone said a heatmap, but couldn't clearly explain what its axes would be. Someone else said a scatter plot, where each point represented a sample-by-sample comparison of expression values and was colored according to the dataset it came from. There were other ideas, but the scatter plot was the most coherent one I heard, so I decided to start there.

The scatter plot was too busy, and that problem would only get worse as we loaded more datasets. I was told it was hard to get the information you needed from it, but no one could say what information they needed. Finally someone showed me a hand drawing of a dot plot. One axis would be categorical, with an entry for each dataset. The other would represent expression and would still have one point per sample. Something in my head clicked, and I realized I knew what they needed.

Development

I started making a box plot that included jittered points for all samples. That way, within an experiment you could still compare raw values from one sample to another, like the example I was shown. Across experiments, however, you could compare the five-number summary: minimum, lower quartile, median, upper quartile, and maximum. It wasn't perfect, because comparing raw values across experiments, where there may be bias and differences in experimental processes, never is. It was the best they'd seen yet, though.
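For anyone curious what sits behind a plot like that, here is a minimal sketch of the data preparation in plain JavaScript. It is not the code I actually wrote, which leaned on the plotly R package. The record shape, the function names, and the jitter width are all hypothetical, and the quartiles use linear interpolation, which is one convention among several.

```javascript
// Hypothetical input: one record per sample, tagged with the dataset it came from.
const samples = [
  { dataset: "study A", expression: 2 },
  { dataset: "study A", expression: 4 },
  { dataset: "study A", expression: 6 },
  { dataset: "study A", expression: 8 },
  { dataset: "study B", expression: 1 },
  { dataset: "study B", expression: 3 },
  { dataset: "study B", expression: 5 },
];

// Quantile by linear interpolation between order statistics.
// Assumption: this matches the convention I want; other definitions exist.
function quantile(sorted, p) {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Minimum, lower quartile, median, upper quartile, maximum.
function fiveNumberSummary(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return {
    min: sorted[0],
    q1: quantile(sorted, 0.25),
    median: quantile(sorted, 0.5),
    q3: quantile(sorted, 0.75),
    max: sorted[sorted.length - 1],
  };
}

// One row per dataset: a summary for the box, plus jittered positions
// so that the raw sample points do not sit on top of each other.
function summarizeByDataset(records, jitterWidth = 0.3, random = Math.random) {
  const groups = new Map();
  for (const { dataset, expression } of records) {
    if (!groups.has(dataset)) groups.set(dataset, []);
    groups.get(dataset).push(expression);
  }
  return [...groups].map(([dataset, values], row) => ({
    dataset,
    summary: fiveNumberSummary(values),
    points: values.map((expression) => ({
      expression,
      // Row index on the categorical axis, nudged by a random offset
      // in [-jitterWidth / 2, jitterWidth / 2).
      position: row + (random() - 0.5) * jitterWidth,
    })),
  }));
}

const rows = summarizeByDataset(samples);
```

Each entry in `rows` carries what a plotting layer would need to draw one box and its points. The jitter only moves points along the categorical axis, so the expression values themselves are never altered.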

Once people saw the box plot, a lot of requests for specific improvements came along. That seemed promising, but also a bit daunting. I had been able to put the original plot together fairly quickly using the plotly R package. The features being requested were not easy to implement that way. They included a lot of custom annotations and buttons providing different types of interaction. I quickly found myself writing bits of JavaScript on top of the R.

Result

Eventually I got something looking like this:

[Figure: RNASeq transcription summary example visualization, with datasets listed along the y-axis and expression values along the x-axis.]
interactive version here

The implementation was a nightmare, and the final result was never quite what I wanted it to be. In terms of both the underlying code and the visualization I delivered, I knew I could do better with a different strategy.

Take-Away

I decided then that I needed more serious tools for developing visualizations. With some time, the EDA project gave me that. To date I've been focused on the statistical layer and the backend. I hope to change that if I have the time. Specifically, I'd like to learn visx and make plot components. I'd like to get away from tools like plotly entirely, but I can't help feeling a fondness for plotly too. It taught me a lot, albeit the hard way.