Here is an alternative solution, although yours is a good one. If you only need ...

Here is an alternative solution, although yours is a good one.

If you only need a single SNP or a group of SNPs within a region, you can use tabix[1] to index gzipped TSVs and query by genomic position. The position of SNPs can be obtained from a lookup table (2.5M is not very big even for R) or from an API if you were to say - query by rsid.

tabix also works over http (and s3) and can utilize RANGE queries to select a subset of a file...so you only wind up downloading a or reading a small portion once it is indexed and can do something like this:

tabix https://www.file.url.tsv chr1:1-1000

The command above would return variants on chromosome 1 between 1 and 1000.

The following variant browser works in this way: https://elegansvariation.org/data/browser/ - Theres no formal database (e.g. MySQL) running here, just tabix (actually bcftools which uses tabix) to select variants in a particular region, wrap them in JSON, and return to the client.

Setting this up on S3 requires configuring CORS... the igv browser also uses tabix indexes and provides guidance on how to set this up [2]

[1] https://www.htslib.org/doc/tabix.html [2] https://github.com/igvteam/igv.js/wiki/Data-Server-Requireme...

I created a similar solution to what you have done using this alternative approach, writing a wrapper in R that invoked bcftools under the hood. The dataset I was working with was a lot smaller (1.6M Snps x 252 individuals), but should work with larger genotype sets as well.