A Google search engine for DNA

Google for DNA: New Search Engine Revolutionizes Genetic Data Mining

Computer scientists at ETH Zurich have developed "MetaGraph," a powerful digital tool that acts like a search engine for DNA. This innovation allows researchers to scan through millions of published DNA sequences in seconds, dramatically accelerating the pace of biomedical research.

The Problem: A Data Deluge in Genetics

The revolution in DNA sequencing has generated an avalanche of public genetic data. Central repositories like the Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA) now hold a colossal 100 petabytes of raw sequence data—roughly equivalent to the entire text content of the internet.

Until now, searching this mountain of data was a massive bottleneck. Scientists needed immense computing power and had to download entire datasets—a process that was slow, expensive, and inefficient, often making comprehensive searches practically impossible.

The Solution: A Full-Text Search for Genetic Code

MetaGraph solves this by indexing the raw DNA data, much like Google indexes the web. Instead of downloading files, researchers can paste a DNA sequence of interest into a search field. Within seconds, the tool lists every indexed public dataset in which that sequence appears.
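
The idea of looking up a pasted sequence in a prebuilt index can be illustrated with a minimal sketch. It assumes a plain inverted index over short substrings (k-mers); the function names, the k-mer length, and the sample labels are hypothetical, and this is not MetaGraph's actual interface or data structure.

```python
from collections import defaultdict

K = 5  # hypothetical k-mer length; real indexes use a larger k


def kmers(seq, k=K):
    """Yield every overlapping substring of length k."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(datasets):
    """Map each k-mer to the set of dataset labels that contain it."""
    index = defaultdict(set)
    for label, sequences in datasets.items():
        for seq in sequences:
            for km in kmers(seq):
                index[km].add(label)
    return index


def search(index, query, min_fraction=1.0):
    """Return labels of datasets containing at least min_fraction of the query's k-mers."""
    query_kmers = list(kmers(query))
    if not query_kmers:
        return []
    hits = defaultdict(int)
    for km in query_kmers:
        for label in index.get(km, ()):
            hits[label] += 1
    needed = min_fraction * len(query_kmers)
    return sorted(label for label, n in hits.items() if n >= needed)


# Made-up toy datasets standing in for public sequencing samples.
datasets = {
    "sample_A": ["ACGTACGTGA", "TTGACCA"],
    "sample_B": ["GGACGTACGTCC"],
    "sample_C": ["TTTTTTTTTT"],
}
index = build_index(datasets)
result = search(index, "ACGTACGT")
print(result)
```

The index is built once; each query then costs only a handful of dictionary lookups, which is why no dataset has to be downloaded at search time.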

"It's a kind of Google for DNA," summarizes Professor Gunnar Rätsch, the data scientist at ETH Zurich who led the development.

This method is not only fast but also cost-effective. The study notes that compressing all public biological sequences onto a few hard drives makes large queries remarkably cheap, at approximately $0.74 per megabase.
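
As a rough illustration of what that rate means in practice, the sketch below multiplies a query size by the quoted figure. It assumes the cost scales linearly with query length, which the article states only for large queries; the function name is hypothetical and small queries may cost more per base.

```python
USD_PER_MEGABASE = 0.74        # figure quoted in the article, for large queries
BASES_PER_MEGABASE = 1_000_000


def estimate_query_cost(num_bases, usd_per_megabase=USD_PER_MEGABASE):
    """Estimate query cost in USD, assuming cost is linear in query length."""
    if num_bases < 0:
        raise ValueError("num_bases must be non-negative")
    return num_bases / BASES_PER_MEGABASE * usd_per_megabase


# Example: a hypothetical 5-megabase query.
cost = estimate_query_cost(5_000_000)
print(f"Estimated cost: ${cost:.2f}")
```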

How It Works: The Power of Compression and Indexing

The core innovation of MetaGraph is its ability to intelligently compress and structure data.

Intelligent Indexing: The tool uses complex mathematical graphs to index both the raw genetic sequences and their associated metadata (descriptive information about the sample).
Massive Compression: This process compresses the data by a factor of about 300. Imagine a book summary: while it doesn't contain every single word, it retains all the crucial plotlines and connections. Similarly, MetaGraph's index is compact but retains all the necessary biological information.
Scalability: Unlike other approaches, MetaGraph becomes more efficient as the amount of data grows, requiring less additional computing power for larger queries.
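
The three points above can be made concrete with a toy example. The article does not spell out the data structure, so this sketch assumes a de Bruijn-style graph in which every distinct k-mer is stored once and annotated with the samples it occurs in. The class name, the value of k, and the sequences are hypothetical, and a Python dictionary stands in for the far more compact structures a real index would need.

```python
def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class AnnotatedGraph:
    """Toy index: each distinct k-mer is a node, annotated with sample labels."""

    def __init__(self, k):
        self.k = k
        self.labels = {}     # k-mer -> set of sample labels (the metadata)
        self.raw_bases = 0   # total number of bases fed into the index

    def add_sample(self, label, sequences):
        for seq in sequences:
            self.raw_bases += len(seq)
            for km in kmers(seq, self.k):
                # A k-mer seen before costs no new node, only a new label.
                self.labels.setdefault(km, set()).add(label)

    def successors(self, kmer):
        """Nodes that overlap this k-mer by k-1 bases (the graph's edges)."""
        suffix = kmer[1:]
        return [suffix + b for b in "ACGT" if suffix + b in self.labels]


graph = AnnotatedGraph(k=4)
reference = "ACGTTGCATGCCGATAGCTA"
variant = "ACGTTGCATGACGATAGCTA"  # same sequence with one substitution
for label, seq in [("s1", reference), ("s2", reference), ("s3", variant)]:
    graph.add_sample(label, [seq])
    print(label, "raw bases:", graph.raw_bases,
          "distinct k-mers:", len(graph.labels))
```

Because related samples share most of their k-mers, the number of nodes grows much more slowly than the raw data: an identical sample adds no nodes at all, and a near-identical one adds only the few k-mers that span the difference.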

Broad Applications: From Pandemics to Balcony Plants

Because MetaGraph is open source, its potential applications are broad:

Antibiotic Resistance: Quickly identifying resistance genes in bacterial DNA or finding bacteriophages (viruses that kill bacteria).
Pandemic Response: Accelerating the analysis of new pathogens, as was crucial during the COVID-19 pandemic.
Cancer Research: Detecting specific mutations in tumor cells.
Rare Diseases: Helping diagnose patients with uncommon hereditary conditions.

The tool is already live and available for queries, with about half of the world's public sequence data currently indexed. The team aims to have the entire repository searchable by the end of 2025.

Looking even further ahead, co-developer Dr. Andre Kahles speculates, "If the rapid development in DNA sequencing continues, it may become commonplace to identify your balcony plants more precisely," suggesting a future where genetic search is as ubiquitous as a web search is today.

References

Karasikov, M., Mustafa, H., et al. Efficient and accurate search in petabase-scale sequence repositories. Nature (2025). doi:10.1038/s41586-025-09603-w