The EDA Java Services

I’ve never admitted it to anyone, except technically to the entire internet here, but I wasn’t sure I could handle the EDA project when I first started it. Three years later, I think it’s one of the greatest things I’ve ever been a part of professionally. I now find that feeling of being challenged by something you’re not sure you can do, and then doing it, a bit intoxicating, if also a bit stressful.

Background

This project’s goal was to create an Exploratory Data Analysis Workspace, and it was full of technology that was new to me. It was my first time working with RAML. I had to learn the difference between docker-compose (the older standalone tool) and Docker Compose (its successor, a Docker plugin). That’s confusing just to write down, let alone use. I also had to make more use of Java than I ever had before; most of my prior experience with it came from a handful of undergraduate courses. And I had to be comfortable reading Kotlin, though I didn’t have to write it myself.

With time I learned these things at least well enough to be comfortable working in the EDA system. When I first looked at one of these services, though, the combined effect was nearly complete overwhelm. Add to that the complexity of the overall system and how its various services interact with one another, and I was sure I was in over my head. That feeling quickly evaporated, though, replaced by excitement and curiosity as I started to dig in. I’ve always loved learning, and that served me well here.

The EDA Infrastructure

The shortest and least complicated explanation I can give of the system follows. There is a subsetting service, which operates per dataset and is responsible for filtering records. Records belong to what we call entities. An entity might be a person, their home, their community, or a physical sample collected from them. Entities have known relationships to one another. There is also a merging service, which merges subsetted data across entities. There is a compute service, which launches asynchronous jobs for any long-running analyses; it consumes subsetted data, and its results are also handled by the merging service. The output of the merging service is always a single tabular dataset. This is in turn consumed by the only public-facing service, the data service, to create JSON the client can consume, primarily to drive visualizations.
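For the architecturally minded, here’s roughly how I picture that flow in code. This is a compilable sketch only: every name below is my own invention for illustration, not one of the project’s real types.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical shapes for the four services described above.
interface SubsettingService {
    // Filter one entity's records according to the user's filters.
    List<EntityRecord> subset(String entityId, List<Filter> filters);
}

interface MergingService {
    // Merge subsetted (and computed) data across entities into one table.
    Table merge(List<List<EntityRecord>> subsetsByEntity);
}

interface ComputeService {
    // Launch an asynchronous job for a long-running analysis.
    CompletableFuture<Table> compute(String pluginName, Table subsettedData);
}

interface DataService {
    // The only public-facing piece: shape the merged table into JSON
    // the client consumes, primarily to drive visualizations.
    String toJson(Table mergedData);
}

record EntityRecord(String entityId, Map<String, String> values) {}
record Filter(String variableId, String predicate) {}
record Table(List<String> columns, List<List<String>> rows) {}
```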

Truth be told, it’s a bit more complicated than that, but that’s the general idea. The compute and data services have Java plugins responsible for particular computations and visualizations, respectively. Each plugin has a dedicated API schema defined using RAML. This is where I’ve made the majority of my contributions, and where I continue to make them.
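To make the plugin pattern a little more concrete, here’s my guess at its general shape: each plugin is bound to a request type that mirrors its RAML schema, so the framework can deserialize and validate a request before the plugin ever sees it. Again, hypothetical names throughout, not the actual API.

```java
import java.util.List;
import java.util.Map;

// Hypothetical base class for a data-service visualization plugin.
abstract class VisualizationPlugin<R> {
    // The request type corresponding to this plugin's RAML-defined schema.
    abstract Class<R> requestClass();

    // Turn a validated request plus the merged tabular data into the
    // JSON payload the client uses to draw the visualization.
    abstract String processToJson(R request, List<Map<String, Object>> mergedRows);
}
```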

Development

To date, I’ve written plugins in the compute service for various analytical tools common to the microbiome community, such as alpha and beta diversity. I’ve also written a dozen or so plugins in the data service to drive various visualizations. Most of those have been statistical charts so far, but there is now one map visualization as well. There’s a good chance we’ll be making some diagrams in the future too.
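To give a flavor of what these tools compute: alpha diversity measures diversity within a single sample. Here’s a self-contained Java version of one common metric, the Shannon index, H = -Σ pᵢ ln(pᵢ). In the actual plugins the statistics are handed off to R, as described below.

```java
import java.util.stream.DoubleStream;

public class ShannonIndex {
    // Shannon index for one sample's taxon abundance counts.
    static double shannon(double[] counts) {
        double total = DoubleStream.of(counts).sum();
        return -DoubleStream.of(counts)
                .filter(c -> c > 0)        // zero counts contribute nothing (ln 0 is undefined)
                .map(c -> c / total)       // relative abundance p_i
                .map(p -> p * Math.log(p)) // p_i * ln(p_i)
                .sum();
    }

    public static void main(String[] args) {
        // Four taxa; more even abundances would yield a higher index.
        System.out.println(shannon(new double[] {10, 10, 5, 1})); // ≈ 1.18
    }
}
```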

So far all of these plugins call out to Rserve for statistics. Rserve in turn relies on a family of in-house R packages to which I have contributed heavily, and a number of which I created. These include plot.data, which I’ve talked about before. Initially the plan was to write as much in Java as possible and only call out to R for operations that truly required it. It quickly became clear that most plugins would need R at least in part, and so the strategy changed: writing the statistics entirely in R became the priority, and rewriting them in Java came to be seen as a performance optimization to be done only as needed.
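For the curious, talking to Rserve from Java is pleasantly direct with the Rserve Java client. A minimal sketch, assuming an Rserve instance is already listening on the default port (6311) with the vegan package installed; the real plugins are, of course, more involved.

```java
import org.rosuda.REngine.REXP;
import org.rosuda.REngine.Rserve.RConnection;

public class RserveSketch {
    public static void main(String[] args) throws Exception {
        RConnection r = new RConnection(); // connects to localhost:6311
        try {
            // Ship the abundance counts into the R session...
            r.assign("counts", new double[] {10, 10, 5, 1});
            // ...and let R do the statistics. vegan's diversity() computes
            // the Shannon index, matching the pure-Java version above.
            REXP result = r.eval("vegan::diversity(counts, index = 'shannon')");
            System.out.println("Shannon index from R: " + result.asDouble());
        } finally {
            r.close();
        }
    }
}
```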

We intend to incorporate the concept of derived variables. The idea is that a user might take a variable native to a study and perform some operation on it to create a new variable. These aren’t long-running like the computes, and they need to be subsettable like any native variable. We have tentative plans to incorporate them into the merging service, at least in the short term. When that happens, I’ll likely be writing those plugins as well. I imagine these will be the first to rely entirely on Java.
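If I had to guess at the shape of a derived-variable plugin, it’s something like a pure, row-by-row function over native variables: cheap enough to run inline during subsetting and merging rather than as an asynchronous compute. A speculative sketch, with every name invented for illustration:

```java
import java.util.Map;
import java.util.function.Function;

public class DerivedVariableSketch {
    // Derive BMI from two hypothetical native variables:
    // weight in kilograms and height in meters.
    static final Function<Map<String, Double>, Double> BMI =
            row -> row.get("weight_kg") / Math.pow(row.get("height_m"), 2);

    public static void main(String[] args) {
        System.out.println(BMI.apply(Map.of("weight_kg", 70.0, "height_m", 1.75))); // ≈ 22.9
    }
}
```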

For those interested, source code can be found at the following locations:
