
Thread: file formats in bioinformatics

  1. #1
    Member
    Join Date
    Feb 2014
    Location
    Canada
    Posts
    17
    Thanks
    23
    Thanked 0 Times in 0 Posts

    file formats in bioinformatics

    Quote Originally Posted by JamesB View Post
    Putting numbers to things; where I work (Wellcome Trust Sanger Institute) I've been told we have more spinning disk than CERN, perhaps in part because they have some extremely aggressive data filtering techniques. Even so, there has been a lot of reluctance to adopt modern compression techniques unless they're also not noticeably slower than their old counterparts (based on Deflate; now based on very simplistic modelling + rANS amongst other things). There's quite a bit of room for lossy compression of some data types too.
    James, just curious: is your application sensitive to decompression speed only, or both compression and decompression speed?

  2. #2
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Quote Originally Posted by PSHUFB View Post
    James, just curious: is your application sensitive to decompression speed only, or both compression and decompression speed?
    It's a bit of both; compression speed matters, but decompression slightly more so. Which means SIMD tANS is probably a better fit than SIMD rANS, as it looks to be a bit more symmetrical, but that's just a historical quirk. That said, when comparing against the tried and trusted methods (good old-fashioned zlib) it's quite easy to beat it on compression speed for these files (e.g. BAM vs CRAM), so some asymmetry has been accepted for a long time.

    I'd say for people looking to improve on existing bioinformatics file formats, BCF is the one to go for. This is a format holding variant calls (DNA mutations) and is ultimately the end product of the DNA -> sequencing (FASTQ) -> alignment (BAM) -> analysis (BCF) pipeline; even better if it's searchable too. This isn't something that most people have focussed on yet, but in the long term I think the DNA alignment files will largely just be considered an intermediate and so only appropriate for lightweight on-the-fly compression rather than permanent archival. (Depending on rarity of the sample, of course.)

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I found an interesting post about this: https://kaushikghose.wordpress.com/2...-bam-vcf-what/
    But still no .bcf samples

  4. #4
    Member RamiroCruzo's Avatar
    Join Date
    Jul 2015
    Location
    India
    Posts
    15
    Thanks
    137
    Thanked 10 Times in 7 Posts
    Sometimes it's good to be a Biotech student, here ya go,

    ftp://ftp-trace.ncbi.nih.gov/1000gen..._07/exon/snps/

  5. The Following 2 Users Say Thank You to RamiroCruzo For This Useful Post:

    PSHUFB (4th January 2017),Shelwien (30th December 2016)

  6. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    .vcf files are just patches; their contents are mostly unrelated to the actual DNA data.
    But there are lots of huge files on that FTP, in all kinds of formats (except .bcf), so it's still interesting.

  7. #6
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Be aware that the VCF world has a variety of formats too, with confusing minor version number changes that somehow break format compatibility in major ways. Joy of joys!

    Anyway, VCF is typically one line (record) per SNP or indel. (SNP being Single Nucleotide Polymorphism - basically a substitution - and indel being short for insertion or deletion; so they're essentially the edit operators when aligning two strings.) Each line has a fixed set of standard fields and then a variable number of INFO fields. These contain various bits of data from the variant callers (bcftools, GATK, etc.), often written as overly precise floating-point values amongst other things.
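
    For concreteness, a minimal sketch of how one of those records breaks down (the record here is a cut-down 1000 Genomes-style example; real parsers such as htslib/pysam also handle the header and typed INFO declarations, which this does not):
    Code:
    # Minimal sketch: split one VCF data line into its fixed fields and the
    # variable INFO key=value pairs.
    line = ("22\t16390331\trs575263612\tG\tA\t100\tPASS\t"
            "AC=15;AF=0.00299521;AN=5008;DP=27596;VT=SNP\tGT\t0|0\t0|1")

    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    fmt, samples = fields[8], fields[9:]        # FORMAT column + per-sample data

    # INFO is a ;-separated list of KEY or KEY=VALUE entries.
    info_dict = dict(kv.split("=", 1) if "=" in kv else (kv, True)
                     for kv in info.split(";"))

    print(chrom, pos, ref, ">", alt, "AF =", info_dict["AF"], "samples:", samples)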

    gVCF is a variant of VCF that has one line per genomic coordinate, regardless of whether it is a SNP. Hence most lines basically say "I match the reference genome". It's probably still just called .vcf (or .bcf) though, for added confusion.

    Both follow the same VCF specification, but obviously gVCF will be much larger. The purpose of the extra "matches-ref" entries in the gVCF files becomes clear when you consider merging data between multiple samples. Previously we took our data from 1000 Genomes and did a joint variant analysis on it to produce a 1000 Genomes VCF file. If we then added 10 more genomes, we'd have to reanalyse all 1010 genomes to get the new joint calls. This is slow and tedious! gVCF, however, permits the addition of extra samples without reanalysing all the old ones, because the joint callers can now tell the difference between "this sample found no SNP" and "this sample had insufficient data to call a SNP". Furthermore, it has all the stats (in the INFO fields) needed to merge the various marginal probabilities together to come up with the joint answer.
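
    As a toy illustration of that "no SNP" versus "no data" distinction (the records here are invented; "./." is the standard VCF notation for a missing genotype):
    Code:
    # Toy illustration (invented records): why per-sample "matches-ref" data
    # matters when merging.  Without it, an absent record at a position is
    # ambiguous: a confident reference call, or simply no data?
    per_sample = {
        "sampleA": {16390331: {"GT": "0/0", "DP": 35}},  # covered, matches ref
        "sampleB": {},                                   # position not covered at all
    }

    def genotype_at(sample, pos):
        rec = per_sample[sample].get(pos)
        if rec is None:
            return "./."      # no data -> missing genotype, not a reference call
        return rec["GT"]

    for s in per_sample:
        print(s, genotype_at(s, 16390331))
    # sampleA 0/0   (evidence for the reference allele)
    # sampleB ./.   (insufficient data; a plain VCF merge couldn't tell these apart)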

    I suspect gVCF is gaining more traction due to this, but it's a bit beyond my normal work so I only loosely understand what people are doing. See http://gatkforums.broadinstitute.org...-a-regular-vcf for more.

    Then of course there are binary variants of these, called BCF. This is basically just a binary-encoded VCF chucked through a specialised usage of zlib that permits random access. It's rather poor as a compressed format. For a long while (even now?) GATK didn't support the newer BCF formats, so people just used VCF.gz instead (but be warned that's BGZF-compressed and not plain gz: a BGZF file can be decompressed by gzip, but the reverse isn't true). The bulk of CPU time in a lot of processing software is spent in string-to-float/int and int/float-to-string conversion routines. Tragic! (I optimised some of this in bcftools, but even so it's a sign of starting in the wrong place.)
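
    A small sketch of that BGZF-versus-gzip point: BGZF is a series of gzip members, each carrying a "BC" extra subfield with the compressed block size (which is what enables the random access), so a quick way to tell the two apart is to look for that subfield (field layout as per the SAM/BGZF spec; the path is a placeholder):
    Code:
    # Sketch: distinguish a BGZF file (e.g. a .vcf.gz produced by bgzip) from a
    # plain gzip file by looking for the "BC" extra subfield.  Standard gzip
    # tools can still decompress BGZF because each block is a valid gzip
    # member; the reverse is not true.
    import struct, sys

    def is_bgzf(path):
        with open(path, "rb") as f:
            hdr = f.read(12)
            # gzip magic, deflate method, FEXTRA flag set (as bgzip writes it)
            if len(hdr) < 12 or hdr[:4] != b"\x1f\x8b\x08\x04":
                return False
            xlen = struct.unpack("<H", hdr[10:12])[0]
            extra = f.read(xlen)
        # Walk the extra subfields looking for SI1='B', SI2='C'.
        while len(extra) >= 4:
            si1, si2, slen = extra[0], extra[1], struct.unpack("<H", extra[2:4])[0]
            if si1 == 66 and si2 == 67:
                return True
            extra = extra[4 + slen:]
        return False

    if __name__ == "__main__":
        print(is_bgzf(sys.argv[1]))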

  8. The Following 3 Users Say Thank You to JamesB For This Useful Post:

    Jarek (3rd January 2017),Shelwien (4th January 2017),yogitetradim (4th January 2017)

  9. #7
    Member
    Join Date
    Feb 2014
    Location
    Canada
    Posts
    17
    Thanks
    23
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by JamesB View Post
    That said, when comparing against the tried and trusted methods (good old-fashioned zlib) it's quite easy to beat it on compression speed for these files (e.g. BAM vs CRAM), so some asymmetry has been accepted for a long time.
    I didn't follow the above. In particular, I didn't follow the reasoning from "it's quite easy to beat it on compression speed for these files" to "so some asymmetry has been accepted for a long time". Clearly, gzip is asymmetric, and there are also algorithms with faster compression, faster decompression, or both, but I didn't understand the connection.

  10. #8
    Member
    Join Date
    Feb 2014
    Location
    Canada
    Posts
    17
    Thanks
    23
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by RamiroCruzo View Post
    Sometimes it's good to be a Biotech student, here ya go,

    ftp://ftp-trace.ncbi.nih.gov/1000gen..._07/exon/snps/
    Thanks! The directory you linked only has some small files, but can you recommend a good directory with "typical" sequences/patches to download?

    For example:

    ftp://ftp-trace.ncbi.nih.gov/1000gen...coverage/snps/

    Has some big files, but I have no idea what "low coverage" means, or whether it implies that it's a non-representative set.

    Note that here by "representative" I don't mean like "important sequences" or anything like that, but basically sequences that are representative, entropy-wise, of average data stored and retrieved in bioinformatics systems.

  11. #9
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Quote Originally Posted by PSHUFB View Post
    I didn't follow the above. In particular, I didn't follow the reasoning from "it's quite easy to beat it on compression speed for these files" to "so some asymmetry has been accepted for a long time". Clearly, gzip is asymmetric, and there are also algorithms with faster compression, faster decompression, or both, but I didn't understand the connection.
    Reading that back I don't understand what I was trying to say either!

    Anyway, historically BAM has been the accepted standard format for aligned data for a considerable time. It has strong asymmetry in encode/decode speeds, but due to the "mix it all together and then gzip" approach it compresses poorly. I guess what I meant is that we can beat it on both compression and decompression speed, and asymmetry has been deemed acceptable so far, so we shouldn't be too worried about it.

    Regarding your other point, "low coverage" relates to the DNA sequences placed against the genome. It's like random sampling; take enough samples and you'll gradually observe most of the sample space, possibly with considerable redundancy, although the nature of random sampling means there will always be gaps. "Low coverage" usually refers to the average redundancy ("depth"): a 5x sample, say, has the genome sequenced, on average, 5 times over. It's not uniform, and there will be many holes where no data covers a particular region. Hence people generally prefer high depth (say 40x) to get higher consensus accuracy and to reduce the number of holes not covered by any data. This obviously changes the compression statistics somewhat too, as it implies more redundancy, but it is data-type dependent: the DNA sequence is redundant, but the identifiers and quality values (one quality value per "base"/nucleotide) are not.
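
    To put rough numbers on that "random sampling" picture, a back-of-envelope sketch (uniform random read placement on a toy genome, which real sequencing only approximates; under this idealised model the uncovered fraction is roughly exp(-depth)):
    Code:
    # Back-of-envelope sketch: drop fixed-length reads uniformly at random on a
    # toy "genome" and measure what fraction is left uncovered at 5x vs 40x.
    import random

    def uncovered_fraction(genome_len=200_000, read_len=100, depth=5):
        covered = bytearray(genome_len)
        n_reads = depth * genome_len // read_len
        for _ in range(n_reads):
            start = random.randrange(genome_len - read_len + 1)
            covered[start:start + read_len] = b"\x01" * read_len
        return 1 - sum(covered) / genome_len

    for d in (5, 40):
        print(f"{d}x: ~{uncovered_fraction(depth=d):.4%} uncovered")
    # roughly 0.7% of positions uncovered at 5x; effectively none at 40x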

    See for example some of the graphical visualisations of these aligned sequences at
    https://ics.hutton.ac.uk/tablet/tablet-screenshots/ or https://sourceforge.net/projects/sta...urce=directory

    Edit: putting it another way, if you want a representative subset of a high coverage data set, you need to pick a few regions and slice through it (so all the data for chr1:10,000,000 to chr1:11,000,000 for example) so you have data at the same depth / coverage as the total set without having a huge file. Maybe just pick an entire chromosome, like chr22, which is only a few % of the total genome. There are various tools that can do this subsetting, including samtools (SAM, BAM, CRAM) and bcftools (VCF, BCF). They permit this over ftp and http too so you don't need to download the entire data set.
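
    As a sketch of that kind of region slicing in code rather than on the command line (assuming pysam is installed, the BAM is coordinate-sorted and indexed, and the file/contig names are placeholders):
    Code:
    # Roughly equivalent to:
    #   samtools view -b in.bam chr22:10,000,000-11,000,000 -o subset.bam
    # Requires an index (.bai) alongside in.bam; names are placeholders.
    import pysam

    with pysam.AlignmentFile("in.bam", "rb") as bam, \
         pysam.AlignmentFile("subset.bam", "wb", template=bam) as out:
        for read in bam.fetch("chr22", 10_000_000, 11_000_000):
            out.write(read)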

    Edit 2: Albeit slow, an example of viewing an alignment file (a CRAM, in this case) directly over FTP:

    samtools tview -p chr22:16388319 ftp://ftp-trace.ncbi.nih.gov/1000gen..._coverage.cram
    Code:
      16388321  16388331  16388341  16388351  16388361  16388371  16388381  16388391  16388401  16388411  16388421
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    AGCAGACACATGTGTGCAGTGATGGCTACTGTCAGGAGGTCTGTGCAGATGCTTGGAGGGGGCTGGGGCCAGCAGAGTCGGGTGGATTCAGAGATGAGTTCACTGAAAAGGAGGCCAG
    agcagac        GCAGTGATGGCTACTGTCAGGAGGTCTGTGCAGATGCTTGGAGGGGGCTGGGGCCAGCAGAGTCGGGTGGATTCAGAGATGAGTTCACTG       ggccag
    AGCAGACACA     gcagtgatggataatgtcaggacgtgtgtgcagatggttggagggggctggggccagcagagtcgcgtgtattcagagatgagttccctg
    AGCAGACACAT    GCAGTGATGGCTACTGTCAGGAGGTCTGTGCAGATGCTTGGAGGGGGCTGGGGCCAGCAGAGTCGGGTGGATTCAGAGATGAGGTCACTG
    AGCAGACACATGTGTGCAGTGATGGCTACTG                GATGCTTGGAGGGGGCTGGGGCCAGCAGAGTCGGGTGGATTCAGAGATGAGTTCACTGAAAAGGAGGCCAG
    agcagacacatgtgtgcagtgatggctactgtca               tgcttggagggggctggggccagcagagtcgggtggattcagagatgagttcactgaaaaggaggccag
    AGCAGACACATGTGTGCAGTGGTGGCTACTGTCAGGAGGTCTGTGCAGATGCTTGGAGGGGGCTGGGGCCAGCAGA            cagagatgagttcactgaaaaggaggccag
    AGCAGACACATGTGTGCAGTGATGGCTACTGTCAGGAGGTCTGTGCAGATGCTTGGAGGGGGCTGGGGCCAGCAGAGTCGGGTGGATTC        gttcactgaaaaggaggccag
    agcagacacatgtgtgcagagatggctactgtcaggaggtctgtgcagatgcttggagggggctggggccagcagagtcgggtggattca
    agcagacacatgtgtgcagtgatggctactgtcaggaggtctgtgcagatgcttggagggggctggggccagcagagtcgggtggattca
                                                                ggctggggccagcagagtcgggtggattcagagatgagttcactgaaaaggaggccag
                                                                   tggggccagcagagtcgggtggattcagagatgagttcactgaaaaggaggccag
                                                                             agagtcgggtggattcagagatgagttcactgaaaaggaggccag
    And similarly with one of the joint-called VCF files:

    Code:
    @ seq3a[/tmp]; bcftools view -H ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 22:16390000-16393000|cut -c 1-250
    22    16390164    rs545334900    C    G    100    PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=21032;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0.001;SAS_AF=0;AA=c|||;VT=SNP    GT    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0
    22    16390218    rs563622392    G    T    100    PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=21303;EAS_AF=0.001;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=g|||;VT=SNP    GT    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0
    22    16390331    rs575263612    G    A    100    PASS    AC=15;AF=0.00299521;AN=5008;NS=2504;DP=27596;EAS_AF=0.001;AMR_AF=0.0043;AFR_AF=0;EUR_AF=0.0089;SAS_AF=0.002;AA=g|||;VT=SNP    GT    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0
    22    16390504    rs542582108    C    A    100    PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=27647;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0.001;AA=c|||;VT=SNP    GT    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0
    22    16391873    rs539358425    CAA    C    100    PASS    AC=11;AF=0.00219649;AN=5008;NS=2504;DP=21663;EAS_AF=0.002;AMR_AF=0;AFR_AF=0.0038;EUR_AF=0.002;SAS_AF=0.002;VT=INDEL    GT    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    
    22    16391910    rs539916033    C    CA    100    PASS    AC=40;AF=0.00798722;AN=5008;NS=2504;DP=23249;EAS_AF=0.0089;AMR_AF=0;AFR_AF=0.0098;EUR_AF=0;SAS_AF=0.0184;AA=?|AAAAAAAAA|AAAAAAAAAA|unsure;VT=INDEL    GT    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|
    22    16391918    rs561239096    A    T    100    PASS    AC=23;AF=0.00459265;AN=5008;NS=2504;DP=20930;EAS_AF=0;AMR_AF=0.0029;AFR_AF=0.0083;EUR_AF=0.002;SAS_AF=0.0082;AA=a|||;VT=SNP    GT    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|
    22    16392063    rs375187140    T    G    100    PASS    AC=17;AF=0.00339457;AN=5008;NS=2504;DP=13848;EAS_AF=0.003;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0.0133;AA=t|||;VT=SNP    GT    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0
    22    16392087    rs369384048    G    A    100    PASS    AC=33;AF=0.00658946;AN=5008;NS=2504;DP=9786;EAS_AF=0.0109;AMR_AF=0.0101;AFR_AF=0.0015;EUR_AF=0.006;SAS_AF=0.0072;AA=g|||;VT=SNP    GT    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|
    22    16392597    rs565129926    G    C    100    PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=27908;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0.001;AA=g|||;VT=SNP    GT    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0|0    0
    Those lines are actually really long; around 10kb each. They list the genotypes per sample, and as this is a joint call across all 1000 Genomes samples (far more than 1000!) it's *hideously* inefficiently encoded. The format really needs dumping and starting again from scratch IMO.
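
    Just to illustrate how repetitive those genotype columns are, a toy run-length sketch (not a format proposal; real tools use much smarter genotype codecs):
    Code:
    # Toy sketch: run-length encode the per-sample GT columns of one record.
    # E.g. an AC=1 site across 2504 samples: one het, everything else 0|0.
    from itertools import groupby

    genotypes = ["0|0"] * 1200 + ["0|1"] + ["0|0"] * 1303

    rle = [(gt, sum(1 for _ in grp)) for gt, grp in groupby(genotypes)]
    print(len(genotypes), "columns ->", rle)
    # 2504 columns -> [('0|0', 1200), ('0|1', 1), ('0|0', 1303)]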

