Leveraging ADAM and Spark for Genomic Analysis

Justin Paschall from the ADAM project will discuss ADAM and how it can be leveraged for genomic analysis.

The open source ADAM and Big Data Genomics projects ( https://github.com/bigdatagenomics ), founded in the UC Berkeley AMPLab (which also brought the community Apache Spark), have re-envisioned the way genomic analysis, and scientific analysis in general, uses clusters of computers on premises and in the cloud to solve Big Data challenges in science and biomedicine.

Spark provides the following advantages:

• In-memory cluster computing that takes advantage of data locality

• Efficiency in development time through the use of higher-level operations to manipulate distributed collections

• End-user advantages of interactive data exploration and rock-solid reproducibility, critical to the scientific process.

Presentation Overview

In this talk, Justin will present and discuss:

• The motivations behind the application of Apache Spark to the genomics domain

• Details of the implementation of Spark-based genomics data formats and APIs (for both users and developers)

• Strategies for integrating Spark/BDG-based tools into larger bioinformatics pipelines that combine workflow management and Dockerized components.

• Steps for getting involved in the open source BDG developer and user community.

Reference: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-207.pdf

About Justin Paschall:

Justin Paschall, M.S., a scientific software engineer from the AMPLab at UC Berkeley, works at the nexus of Big Data and genomics driven by next-generation sequencing. His previous experience includes five years at the NIH's National Center for Biotechnology Information (NCBI), as well as three years at the European Bioinformatics Institute (EBI) in the UK. Justin is driven by a desire to enable science and medical research through efficient and accessible solutions to Big Data challenges in the life sciences.