Page 54 - Fister jr., Iztok, and Andrej Brodnik (eds.). StuCoSReC. Proceedings of the 2015 2nd Student Computer Science Research Conference. Koper: University of Primorska Press, 2015
living cells are mostly carried out by proteins that are encoded by genes on the DNA. However, the genetic material contains many other types of sequences, for example: regulatory sequences, repeats of various lengths (up to several kilobases) and numbers of occurrences, special telomere sequences at the ends of chromosomes, pseudo-genes (sequences almost identical to functional genes) etc. The protein-coding portion of the human genome is approx. 1-2%. Since most eukaryotic organisms (organisms with a cell nucleus) are diploid (the genome is present in two copies), natural variations between the copies may occur, not to mention some plants, where the genome may be present in 6-8 copies.

2. INTRODUCTION TO NEXT-GENERATION SEQUENCING (NGS) AND BIOINFORMATICS

2.1 Next-Generation Sequencing

The term sequencing refers to the process through which researchers obtain the nucleotide order of a specific DNA molecule. The first sequencing method was introduced in 1975 by Sanger and co-workers. This method was slow and could only obtain about 400 bp of sequence information at a time. Since then, a 3rd generation of sequencing methods has been developed and the 4th generation is already on its way. In this work, by the term NGS we mean 3rd generation sequencers, of which the first was introduced in 2005. The common properties of these platforms are:

• Most of them use the sequencing by synthesis method

• They simultaneously sequence many (up to millions of) templates

• They produce millions of short reads² (typically 30-100 bases, rarely 200-300, depending on the platform), resulting in hundreds of millions (or even a few billion) high-quality bases

• They use universal adapters to prime the synthesis reaction and also to immobilize templates

During sequencing by synthesis, information is obtained by synthesizing a new strand complementary to the template and reading the identity of each newly incorporated base. Enzymes carrying out DNA synthesis cannot produce new DNA strands on their own; they require a template they can "copy". Labelled bases can be recorded when incorporated into the newly forming strand and, due to complementarity, the sequences of the two strands are interchangeable (each can be derived from the other).

The general schema of an NGS method is to acquire small fragments (templates), separate them, then sequence them. There are various methods to generate such a large number of templates (e.g. random priming, PCR amplification, breaking of DNA), and each may produce characteristic errors that need to be corrected later on, during data processing. It is also important to note that the different NGS platforms have different, yet characteristic, error types and various error rates. The high data throughput, achieved by massive parallelization, enables sequencing of entire genomes within days (although the evaluation of the data obtained may take weeks). One drawback of NGS technologies compared to conventional methods is the relatively high error rate per read, but these techniques rely on quantity over quality: due to the immense number of reads produced, incorrect / faulty reads can eventually be either corrected or ignored. More details can be found in references [8] and [9].

2.2 Insights into Bioinformatics and de novo assembly

Data analysis starts with "cleaning" the data of faulty reads and artifacts left behind by template generation methods etc. During sequencing, determining the identity of a single nucleotide based on the signal information is termed base calling. Since no system is perfect, base calling has an error rate. Each platform has its own measure of raw base calling quality, but for the final output it is converted to a so-called Phred score. The Phred score is by definition −10 × log10 P, where P is the probability of the given nucleotide being called incorrectly. Raw sequences coming directly from the sequencers have typical Phred scores of 30 to 40 and rarely exceed 60; higher values may occur after processing of raw data. A Phred score of 30 corresponds to one expected error per 1000 bases and is considered good, since most NGS methods produce reads far shorter than 1000 bases. In general, bases with a Phred score lower than 20 should be ignored.

Coverage (the number of times a single position is read / sequenced, preferably by overlapping reads and not identical ones) is another important attribute of sequencing experiments. Coverage can be applied to single bases or positions, but without data processing only an approximation of the average coverage can be given. If the length of the target sequence is known (or there is an acceptable approximation), the number of reads required for a desired average coverage can easily be calculated. For most applications 20-30× coverage is suitable, but some, for example identifying rare variants and de novo assembly, often require coverage of up to 80-100×. Actual coverage may differ considerably from theoretical coverage, because generation of the templates is non-random. When analyzing data, regions with low or too high coverage must be handled with caution.

Sequence information is usually stored in FASTQ format, which is a regular text file and contains short metadata regarding the sequence (e.g. read id, sequence accession id, date/time, platform / machine specific information), the actual sequence, and the corresponding Phred quality scores encoded as ASCII characters instead of numbers.

Filtering experimental data is data-intensive rather than CPU-intensive, as raw data from sequencers can reach 600 GiB per run and the computations required for "data cleaning" are relatively simple. Yet, like all applications handling vast amounts of data, this step can also benefit from a distributed system. There are many applications of NGS, but we will only introduce de novo assembly in this work. De novo genome assembly is the construction of a reference genome for an organism or, more generally, the assembly of the genome of any species, e.g. for validation. From the technical / informatics point of view, this is analogous to reconstructing a large (unknown) string

² In bioinformatics, the sequence obtained from a single DNA molecule is called a read.
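The FASTQ layout and the Phred-based filtering described in Section 2.2 can be illustrated with a minimal Python sketch. It is not the pipeline used in this work. It assumes simple four-line FASTQ records and an ASCII offset of 33 for the quality characters (the text does not specify the offset, and some older platforms use 64). The function names, the sample record and the choice to mask low-quality bases with N (rather than trimming or discarding them) are hypothetical.

```python
import io

PHRED_OFFSET = 33  # assumption: Phred+33 encoding; not stated in the text


def phred_to_error_probability(q):
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=PHRED_OFFSET):
    """Turn the ASCII-encoded quality line into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def read_fastq(handle):
    """Yield (read_id, sequence, scores) from simple four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, decode_quality(quality)


def mask_low_quality(sequence, scores, threshold=20):
    """Replace bases whose Phred score is below the threshold with 'N'."""
    return "".join(
        base if q >= threshold else "N" for base, q in zip(sequence, scores)
    )


# Hypothetical record, for illustration only.
example = io.StringIO("@read_1 hypothetical metadata\nACGTAC\n+\nII5+I?\n")
for read_id, seq, scores in read_fastq(example):
    print(read_id, scores, mask_low_quality(seq, scores))
```

With the threshold of 20 mentioned in the text, only the fourth base of the sample read (Phred score 10, i.e. an error probability of 0.1) is masked. Each record is handled independently, which is why this step lends itself to a distributed system.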
StuCoSReC Proceedings of the 2015 2nd Student Computer Science Research Conference 54
Ljubljana, Slovenia, 6 October
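The read-count estimate mentioned in the coverage discussion of Section 2.2 follows from the usual relation: average coverage = number of reads × read length / target length. The sketch below only rearranges that relation. The function names and the 5 Mbp genome are hypothetical, and the result is the theoretical average only, since (as noted in the text) actual coverage may differ because template generation is non-random.

```python
import math


def reads_required(target_length, read_length, coverage):
    """Number of reads needed for a desired average coverage."""
    return math.ceil(coverage * target_length / read_length)


def average_coverage(num_reads, read_length, target_length):
    """Theoretical average coverage obtained from a given number of reads."""
    return num_reads * read_length / target_length


# Hypothetical example: 5 Mbp target, 100-base reads.
genome_length = 5_000_000
read_length = 100
for desired in (30, 100):
    print(desired, reads_required(genome_length, read_length, desired))
```

For the hypothetical 5 Mbp target, 30× coverage needs 1.5 million reads of 100 bases, while the 100× that de novo assembly may require needs 5 million.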