Efficient Manipulation and Retrieval of Large-Scale Next Generation Sequencing (NGS) Data

Andy Kuhn
University of Wisconsin

Kuhn A, He M.
Biomedical Informatics Research Center

Research area: Biomedical Informatics 

Background: A new generation of technologies allows sequencing and genotyping of personal genomes at a speed and accuracy unimaginable a few years ago. These technologies have already produced an explosion of data on human genome diversity and on how interactions between the entire genome and non-genomic factors affect health. The most common data generated by next-generation sequencing (NGS) are stored in variant call format (VCF) files. A single VCF file can contain hundreds of gigabytes of genetic information, which makes the data difficult to manipulate and query. We developed a method to manipulate and retrieve information from multiple large-scale VCF files efficiently.

Methods: We developed a programming method to simulate a number of VCF files for evaluating the performance of manipulating multiple VCF files. We then incorporated the Tabix program into our software to compress and index the VCF files. We also developed a program, with both a user-friendly interface and a command-line mode, that loads multiple VCF files and manipulates and retrieves genetic information and quality scores from them. To test efficiency, we compared the performance of processing 500 versus 1000 VCF files on a BIRC server.
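As an illustration of the simulation step, a minimal sketch might look like the following. The field values, file name, and variant distribution here are hypothetical; the abstract does not describe the actual simulator's parameters.

```python
import random

def simulate_vcf(path, n_variants=100, seed=0):
    """Write a minimal simulated VCF file containing random SNPs.

    Hypothetical sketch: the real simulator's fields and
    distributions are not specified in the abstract.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    with open(path, "w") as fh:
        fh.write("##fileformat=VCFv4.2\n")
        fh.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        # Positions are sorted, as Tabix requires coordinate-sorted input.
        for pos in sorted(rng.sample(range(1, 1_000_000), n_variants)):
            ref = rng.choice(bases)
            alt = rng.choice([b for b in bases if b != ref])
            qual = round(rng.uniform(20, 60), 1)
            fh.write(f"chr1\t{pos}\t.\t{ref}\t{alt}\t{qual}\tPASS\t.\n")

simulate_vcf("sim0.vcf", n_variants=50)
# Each simulated file would then be compressed and indexed for
# random access, e.g.:  bgzip sim0.vcf && tabix -p vcf sim0.vcf.gz
```

Repeating this for hundreds of files yields a test corpus whose size and structure mimic a real NGS cohort without requiring patient data.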

Results: The running time for retrieving a variant and its quality scores across 1000 VCF files was 0.193 milliseconds (ms), while the running time across 500 VCF files was 0.083 ms.
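This near-constant lookup time is what indexed access is designed to deliver: each query jumps to the requested coordinate instead of scanning the file. A toy in-memory analogue (plain dictionaries standing in for Tabix indexes; not the authors' actual implementation) illustrates the idea:

```python
# Toy illustration of indexed point retrieval across many "files".
# In-memory dicts stand in for Tabix indexes; hypothetical values.
def build_index(records):
    """Map (chrom, pos) -> qual for O(1) point lookups."""
    return {(chrom, pos): qual for chrom, pos, qual in records}

# 1000 simulated files, each holding a variant at chr1:12345
# with its own quality score.
indexes = [build_index([("chr1", 12345, 30.0 + i)]) for i in range(1000)]

def retrieve(indexes, chrom, pos):
    """Collect the quality score for one variant from every indexed file."""
    return [idx.get((chrom, pos)) for idx in indexes]

quals = retrieve(indexes, "chr1", 12345)
```

Because each per-file lookup is independent of file size, total retrieval time grows only with the number of files queried, consistent with the roughly twofold difference observed between 500 and 1000 files.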

Conclusions: The comparison shows that the running time for manipulating a large number of VCF files is similar to that for a small number. This indicates the developed method can manipulate thousands of VCF files, the scale of a real NGS study dataset, in a reasonable time.