PetaGene software to be used by AstraZeneca to compress huge genomics data sets
PetaGene’s PetaSuite software has been chosen to compress the huge data sets used at AstraZeneca’s Centre for Genomics Research (CGR).
The CGR investigates the underlying genetic causes of disease and aims to integrate genomics across AstraZeneca’s drug discovery platform.
The scale of data required for such work is enormous. AstraZeneca’s CGR has so far processed more than 200,000 genomics datasets, generating more than a petabyte of data, equivalent to streaming HD movies for 40 years without a break.
PetaSuite is designed to accelerate data transfer for cloud computing and reduce storage costs for research projects involving genomics data.
Dr Vaughan Wittorff, co-founder and chief commercial officer of PetaGene, which is based in Cambridge’s Hauser Forum, said: “Using genomic data for biopharmaceutical targets discovery requires large cohorts with massive multi-petabyte data sets.
“The time required to transfer these data from sequencers to compute clusters as well as the cost of storage can cripple these large initiatives.
“PetaSuite addresses the challenges caused by growing volumes of genomics data and achieves up to 10x reductions in storage costs and transfer times, while adhering to the industry-standard BAM and FASTQ genomics file formats.”
PetaGene says its compression software will enable the CGR to compress more than 200,000 BAM files in a 24-hour period and will add the compressed data to tiered cloud storage.
Using the software, the CGR can reduce its data by an average of 76 per cent or, to look at it another way, achieve a roughly fourfold expansion in storage capacity. The lossless compression of files cuts transfer times to less than a quarter of what they were, and the software enables unmodified analysis tools to run more quickly.
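The link between the 76 per cent reduction and the fourfold figure can be checked with simple arithmetic. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes that transfer time scales in direct proportion to file size, which the article does not state.

```python
def capacity_factor(reduction_fraction: float) -> float:
    """Return how many times more data fits in the same storage
    once files shrink by the given fraction (0.76 means 76 per cent smaller)."""
    if not 0 <= reduction_fraction < 1:
        raise ValueError("reduction_fraction must be in the range [0, 1)")
    return 1 / (1 - reduction_fraction)


reduction = 0.76                      # average reduction quoted in the article
remaining = 1 - reduction             # fraction of the original size left over
factor = capacity_factor(reduction)   # effective expansion in storage capacity

print(f"Compressed data is {remaining:.0%} of the original size")
print(f"Effective storage capacity grows by a factor of about {factor:.1f}")
# Assumption: transfer time is proportional to file size.
print(f"Transfer time is below a quarter of the original: {remaining < 0.25}")
```

Files that are 76 per cent smaller occupy 24 per cent of their original space, so the same storage holds just over four times as much data, and a transfer that scales with size would take just under a quarter of the time.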
Slavé Petrovski, vice president and head of genome analytics and bioinformatics, discovery sciences, R&D, at AstraZeneca, said: “AstraZeneca’s Centre for Genomics Research has the bold ambition to analyse up to two million genomes by 2026. Minimizing the storage footprint and transfer time of genome data while maximizing data access and compute processing is a necessity to enable us to achieve our ambition.”
PetaSuite is typically used as an intrinsic part of a client’s cloud or locally hosted analysis pipeline.
Data is compressed as it is processed and moves to the next stage of analysis ready to use, without needing to be decompressed later.