EDIT: total fubar on my part - ignore the below unless you're curious about the *previous* contest.
That is for 454 sequencing instrument output files ("SFF" format).
The instrument works via a technique known as pyrosequencing, where a single base type is offered up to the DNA template and, if it gets incorporated, a flash of light is emitted with intensity proportional to the number of bases incorporated. E.g. if the template is:
ACCAG
and we flow past a repeating cycle of T, A, C, G, then the events generated by the machine essentially amount to run lengths:
ACCAG => T(0) A(1) C(2) G(0) T(0) A(1) C(0) G(1)
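To make the run-length idea concrete, here is a minimal Python sketch that turns a sequence into its ideal (noise-free) flow values. The function name `sequence_to_flows` and the default flow order `"TACG"` are my own choices for illustration, not anything taken from the instrument software or the SFF spec.

```python
def sequence_to_flows(seq, flow_order="TACG"):
    """Return the ideal run length seen at each flow for `seq`.

    Hypothetical helper for illustration only: each flow offers one base
    type, and the value recorded is how many consecutive copies of that
    base get incorporated at the current position (possibly zero).
    """
    unknown = set(seq) - set(flow_order)
    if unknown:
        # A base that is never flowed would stall the loop below forever.
        raise ValueError("bases not in flow order: %s" % sorted(unknown))

    flows = []
    pos = 0  # next base in seq still waiting to be incorporated
    i = 0    # flow counter
    while pos < len(seq):
        base = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(seq) and seq[pos] == base:
            run += 1
            pos += 1
        flows.append(run)
        i += 1
    return flows


print(sequence_to_flows("ACCAG"))  # [0, 1, 2, 0, 0, 1, 0, 1]
```

Note how most of the information is in which flows are zero and which are not; the zeros are what keep the flow cycle and the sequence in step.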
Those 0, 1 and 2 values are theoretical. What actually happens is you get a bunch of values clustered around each of those integers, and you have to tease the distributions apart to work out where the 0, 1, 2, etc. peaks are. Graphically the intensity values, base-calls and quality values may look something like this (real data, but not from the competition):
So there are very obviously high correlations between the signal intensities and the base-calls, as one was called directly from the other, and likely high correlations between the quality values and the signals too. I haven't looked at the competition data, but if it's like the early 454 data I saw then the signals themselves will have been normalised and processed to remove artifacts. The normalisation makes sure that intensity 400 is the median value for a 4-mer (e.g. AAAA), and so on for other run lengths, i.e. roughly 100 units per base.
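As a rough illustration of why the base-calls are so tightly tied to the signals, here is a naive Python sketch that calls bases by rounding each normalised signal to the nearest run length. It assumes about 100 units per base (which is what "400 for a 4-mer" implies) and a T, A, C, G flow order; the function name and the signal values are made up for the example, and the real base-caller is certainly more sophisticated than plain rounding.

```python
def call_bases(signals, flow_order="TACG", scale=100):
    """Naive base-calling: round each signal to the nearest run length.

    Assumptions (mine, for illustration): `scale` units of intensity per
    incorporated base, and flows cycling through `flow_order`.
    """
    called = []
    for i, signal in enumerate(signals):
        # Round to the nearest whole number of bases, never below zero.
        run = max(0, int(signal / scale + 0.5))
        called.append(flow_order[i % len(flow_order)] * run)
    return "".join(called)


# Made-up noisy signals clustered around 0, 100 and 200.
noisy = [4, 103, 197, 12, 8, 96, 15, 108]
print(call_bases(noisy))  # ACCAG
```

Since the called sequence is a deterministic function of the signals, anything that stores both is storing a lot of redundant information.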
The other processing involves removing cross correlations. Each read actually comes from many identical copies of the DNA molecule, which is done to improve the signal strength. In theory all molecules incorporate the same number of DNA bases at the same rate, but over time some lag behind. This means that the signal starts to spread into neighbouring flows. I believe a mathematical model for how to correct this was published by Svantesson in the early days of pyrosequencing. It's presumably been improved since then, but her papers may give you an idea of the processing that happens to the raw data before it's presented in files to the user.
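To show what "signal starts to spread" means, here is a toy Python simulation of my own. It is not Svantesson's model, just a sketch under one crude assumption: at every flow where a molecule could extend, a fixed fraction `lag` of those molecules fails to do so and stays put. The names `simulate_flows` and `lag` are hypothetical.

```python
def simulate_flows(seq, n_flows, flow_order="TACG", lag=0.0):
    """Toy model of a population of identical molecules drifting out of phase.

    state[p] is the fraction of molecules that have incorporated the first
    p bases of `seq`. At each flow, molecules that could extend do so with
    probability (1 - lag); the rest are left behind.
    """
    state = [0.0] * (len(seq) + 1)
    state[0] = 1.0
    signals = []
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        new_state = [0.0] * (len(seq) + 1)
        signal = 0.0
        for p, frac in enumerate(state):
            if frac == 0.0:
                continue
            # Length of the run of `base` starting at position p.
            q = p
            while q < len(seq) and seq[q] == base:
                q += 1
            run = q - p
            if run == 0:
                new_state[p] += frac          # nothing to incorporate
            else:
                new_state[q] += frac * (1.0 - lag)
                new_state[p] += frac * lag    # laggards stay behind
                signal += frac * (1.0 - lag) * run
        state = new_state
        signals.append(signal)
    return signals


print(simulate_flows("ACCAG", 8))            # ideal, in-phase signals
print(simulate_flows("ACCAG", 8, lag=0.1))   # same flows with 10% lag per step
```

With `lag=0.0` you get back the clean run lengths; with a non-zero lag the peaks shrink and the lagging molecules contribute their light to later flows instead, which is the smearing that the correction step tries to undo.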
I think discussing techniques for how to compress the data may disqualify you from entering the contest, though. Explaining what the data means and what its attributes are is presumably OK. You'd have to check the rules carefully. (I'm not entering anyway.)
Last edited by JamesB; 8th February 2012 at 17:22.
Bah, ignore me, this is a new one I wasn't aware of. I was assuming it was a link to the previous sequencing instrument compression competition from TopCoder:
Apparently that's finished, so there are discussions between the entrants now on how they did it: