The investigators propose to establish a general statistical methodology to<br/>estimate error rates for any of the existing second-generation<br/>sequencing technologies. They have identified an approach that is broadly<br/>applicable, fast, and easy to implement. Important strengths are that it<br/>requires only intensity data; it is applicable to the data types that<br/>are typically shared publicly; and it does not require the availability<br/>of a reference genome or genomes, a key condition in threat detection<br/>applications.<br/><br/>Sequencing is a technology for reading small words made of DNA or RNA.<br/>In many applications across biology we need to identify words that occur<br/>in a book where they are not supposed to be (for example, mutations in<br/>cancer or pathogens in the intestinal flora). Often these 'bad' words<br/>are similar to other words that occur elsewhere in the book, differing<br/>only by a letter or two. Because sequencing is not error-free, to know<br/>whether we are seeing a bad word or a poorly read good word we need to<br/>know how easy it is to misread a letter. This proposal aims to assess<br/>exactly that, with the goal of providing better foundations for all<br/>scientific research that uses sequencing.
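To make the reasoning in the second paragraph concrete, the following sketch (not the investigators' method; the word length and error rates are hypothetical) computes how likely it is that a read differs from a known 'good' word purely by misread letters, assuming each letter is misread independently with a fixed error rate:

```python
from math import comb

def p_misread(word_len: int, n_mismatches: int, error_rate: float) -> float:
    """Probability of seeing exactly n_mismatches misread letters in a
    word of word_len letters, each misread independently with error_rate
    (a simple binomial model)."""
    return (comb(word_len, n_mismatches)
            * error_rate ** n_mismatches
            * (1 - error_rate) ** (word_len - n_mismatches))

# A 100-letter read differing from a good word by one letter:
print(p_misread(100, 1, 0.01))   # plausible as an error if errors are ~1%
print(p_misread(100, 1, 0.001))  # much less so if errors are ~0.1%
```

The comparison illustrates why an accurate estimate of the error rate matters: the same one-letter difference is easily explained away under a 1% error rate but becomes evidence for a genuine 'bad' word when the rate is known to be ten times lower.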