MicrobiomeDB R package suite

This project made me feel like I was flying. It was pure simple fun from start to finish, except I’m not sure it will ever truly be finished. Better still, I think I might have genuinely built something potentially useful to people, which is a very nice feeling.

Background

The original request was for something much simpler than I produced, if I’m honest. The project PI for MicrobiomeDB wanted to be able to get data from various analyses out of the site in a format he could use to build various plots for publications and presentations. That proved more difficult than anyone liked. Eventually I suggested we make a simple mechanism to reproduce those analyses outside the site instead.

My rationale was that the analyses on the site (alpha diversity, beta diversity, etc.) were being driven by an R package we had developed, and that someone could potentially use that package directly. I figured if we had an adequate mechanism to download the curated data from the site and get it into R, we could produce the same results locally. It worked, and it took maybe a week or so to put together something just good enough for a single person to mostly use.

Then I got excited. Why not let others use it too? Why not add features that let people customize their analyses more than they might be able to on the site? Or make it easier to use with their own data? How much work could that possibly be?

Several months was the answer, and I regret nothing.

Development

I had put together a single R package so far, wrapping an existing package called microbiomeComputations that we used to drive our analyses on the website. The new package also had some helper functions for creating custom MbioDataset objects from our download files. It was enough for one person to muddle through recreating specific analyses, provided I was available to troubleshoot when something didn’t work.

That prototype that I had quickly put together in a week needed some serious refactoring. Firstly, the curated data from the site needed to be more easily available, and in a separate package. So another package got added to the growing suite. Now I had two user-facing packages. The first was MicrobiomeDB and housed the new MbioDataset class definition and constructors and wrapped the existing microbiomeComputations package. The second was microbiomeData, which extended MicrobiomeDB to make the curated data from the website available as MbioDataset objects.

This represented a major improvement, because now a person didn’t have to wait for a large data package to download and install if they weren’t interested in the curated data. Also, as a developer I could incrementally update the code and the data separately. However, an issue arose.

The data was being added to the package semi-manually by me. I had a helper function that I ran per dataset after downloading some files and there were 20 datasets. Then the PI found a systemic bug in the data that required they all be reloaded after my script was updated. I didn’t want to do that.

Instead, I decided my next step was to have the R package track not just the final MbioDataset objects but the raw download files as well. Then I could add a script that would look through a controlled directory and call my helper function for making MbioDataset objects once for every sub-directory it found. I also added some unit tests for the data itself to validate that some minimum usability criteria were met.
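As a rough illustration of that directory-walking idea, here is a minimal base R sketch. It is not the actual script from the package: build_dataset, build_all_datasets and check_dataset are hypothetical names, the file layout is made up, and build_dataset only reads tab-delimited files into a plain list rather than constructing a real MbioDataset.

```r
# Hypothetical stand-in for the real helper that builds an MbioDataset
# from one dataset's download files. Here it only reads the .tsv tables.
build_dataset <- function(dataset_dir) {
  files <- list.files(dataset_dir, pattern = "\\.tsv$", full.names = TRUE)
  tables <- lapply(files, read.delim, stringsAsFactors = FALSE)
  names(tables) <- tools::file_path_sans_ext(basename(files))
  list(name = basename(dataset_dir), tables = tables)
}

# Call the helper once for every immediate sub-directory of `root`.
build_all_datasets <- function(root) {
  dataset_dirs <- list.dirs(root, full.names = TRUE, recursive = FALSE)
  datasets <- lapply(dataset_dirs, build_dataset)
  names(datasets) <- basename(dataset_dirs)
  datasets
}

# A minimal usability check of the kind a data unit test might make:
# every dataset has at least one table, and no table is empty.
check_dataset <- function(dataset) {
  length(dataset$tables) > 0 &&
    all(vapply(dataset$tables, function(tbl) nrow(tbl) > 0, logical(1)))
}

# Demo against a throwaway directory with two made-up studies.
root <- file.path(tempdir(), "raw_downloads_demo")
for (study in c("studyA", "studyB")) {
  dir.create(file.path(root, study), recursive = TRUE, showWarnings = FALSE)
  write.table(
    data.frame(sample_id = c("s1", "s2"), abundance = c(0.4, 0.6)),
    file = file.path(root, study, "abundances.tsv"),
    sep = "\t", quote = FALSE, row.names = FALSE
  )
}

datasets <- build_all_datasets(root)
all_ok <- all(vapply(datasets, check_dataset, logical(1)))
```

The appeal of this shape is that fixing a bug in the helper means rerunning one script instead of repeating a manual step for each dataset.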

For most of the analyses, once a user had a result in their local R session, it was fairly straightforward to use ggplot2 to produce interesting visualizations similar to what could be done on the website. This wasn’t the case for the correlation computations, however, largely because the site renders those as network diagrams rather than statistical charts. For those I introduced custom HTML widgets built in d3 and a helper function that would produce the right network automatically depending on the type of correlation result in hand.
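To give a feel for why correlations call for a network rather than a chart, here is a small base R sketch that turns a correlation matrix into an edge list, which is the kind of input a network widget typically consumes. This is only an assumption about the general approach, not the packages' actual code: correlation_edges is a hypothetical function, the abundance values are invented, and the choice of Spearman correlation with a 0.5 threshold is arbitrary.

```r
# Toy abundance table: rows are samples, columns are taxa (made-up values).
abundances <- data.frame(
  taxonA = c(1, 2, 3, 4, 5),
  taxonB = c(2, 4, 6, 8, 10),
  taxonC = c(5, 3, 4, 1, 2)
)

# Hypothetical helper: compute pairwise correlations and keep each
# unordered pair once, as a source/target/value edge list.
correlation_edges <- function(data, threshold = 0.5) {
  cors <- cor(data, method = "spearman")
  # upper.tri() selects each pair of distinct columns exactly once.
  pairs <- which(upper.tri(cors), arr.ind = TRUE)
  edges <- data.frame(
    source = rownames(cors)[pairs[, "row"]],
    target = colnames(cors)[pairs[, "col"]],
    value = cors[pairs],
    stringsAsFactors = FALSE
  )
  # Drop weak correlations so the network stays readable.
  edges[abs(edges$value) >= threshold, , drop = FALSE]
}

edges <- correlation_edges(abundances, threshold = 0.5)
```

Each remaining row is one edge between two taxa, with the sign and size of the correlation available for styling the link.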

The final step was to make sure people could figure out how to use these cool new tools. So I made pkgdown sites, hosted on GitHub Pages, for each package in the microbiomeDB GitHub organization. Then I started adding examples and vignettes as tutorials for each of the types of analyses provided on the website.

Result

In the end something unexpected had happened. I had a suite of five R packages in front of me:

veupathUtils – a general purpose package that pre-dated this project, but acquired some new features for it. It was also an existing dependency of microbiomeComputations (see below).
corGraph – a new package housing two HTML widgets to build correlation networks with d3 in the same general style as the website. It also includes a shiny app that lets people use those widgets standalone by providing the data tables they want to find correlations in.
microbiomeComputations – another package pre-dating this project, which is where all the functions used to drive the analyses in both the site and the MicrobiomeDB package live. Those analyses include ranked taxonomic abundances, alpha diversity, beta diversity, differential abundance and correlations. This package saw some light refactoring for this project.
MicrobiomeDB – the primary user-facing package, wrapping the three listed above into one simple and consistent API that lets people recreate analyses from the website, and then some.
microbiomeData – an extension to the MicrobiomeDB package, housing all of the curated datasets from the website.

Future Directions

There is more I would like to do with this. I would like to add support for TreeSummarizedExperiment objects, and lean on miaverse to let people import their own data conveniently. I’d also like to add helper functions for generating some simple visualizations, possibly incorporated into R Markdown reports. For a complete list of directions I’d like to go and things I’d like to explore, see the issues associated with each repository.