Page 56 - Fister jr., Iztok, and Andrej Brodnik (eds.). StuCoSReC. Proceedings of the 2015 2nd Student Computer Science Research Conference. Koper: University of Primorska Press, 2015
P. 56
ers (”the nodes” of the de Bruijn based graph) of user utility program. Although the de novo assembly algoritm is
defined length. The first two jobs prepare data for the third still under development it already shows promise and great
job which assembles k-mers into as large contigs as possible results. Further work includes assembly of contigs into a
(this is carried out in the Reduce phase of the job). A simple full genome if possible, add data cleaning part to the suite
but efficient trick is utilized to ensure that k-mers that most to process raw data and widen the accepted type and format
likely belong together, end up at the same Reducer and can of input data.
be assembled into a contig. Throughout the algorithm the
sequence information is used as key, thus k-mers arrive to 6. REFERENCES
this job in lexicographically sorted format, and since a strict
overlapping policy is used (adjacent ”nodes” must overlap all [1] Stefan K. Bohlander ABCs of genomics ASH Education
but one base), sequences that are close to each other in the Book December 6, 2013 vol. 2013 no. 1 316-323
data, are most likely belong together. When data is read
by the framework from the disk, it is split to smaller chunks [2] Ombrello MJ, Sikora KA, Kastner DL. Genetics,
and these data splits are sent to Mappers. In this job each genomics and their relevance to pathology and therapy.
unique Mapper uses a unique key when it emits key-value Best practice & research Clinical rheumatology.
pairs, thus when the framework groups the intermediate re- 2014;28(2):175-189. doi:10.1016/j.berh.2014.05.001.
sults, data that was processed by a particular Mapper gets
grouped together and sent to same Reducer. [3] Velvet: algorithms for de novo short read assembly
using de Bruijn graphs. D.R. Zerbino and E. Birney.
We neither explicitly build any graph nor contigs are as- Genome Research 18:821-829.
sembled via an Eulerian path, but rather the connectivity
of the de Bruijn based graph is used. The Reducer iter- [4] Y.-J. Chang, C.-C. Chen, C.-L. Chen, and J.-M. Ho, A
ates through the data is was assigned to, and for each k-mer de novo next generation genomic sequence assembler
it collects the start- and end-slices. These are stored in based on string graph and MapReduce cloud computing
TreeMap structures and the k-mers (”nodes”) are stored in framework., BMC Genomics, vol. 13 Suppl 7, no. Suppl
a set. After this, the contig assembly is carried out by an 7, p. S28, Jan. 2012.
iterative process. The first element of the node set is taken
and attempt is made to extend it in both directions. Also [5] R. Dahm, Discovering DNA: Friedrich Miescher and
forking (multiple adjacent ”node” is present) is kept track of. the early years of nucleic acid research, Human
This extension step is continued as long as the current node Genetics, vol. 122, no. 6. pp. 565-581, 2008.
can be combined with another at least in one direction. The
nodes used in the extension are removed from the set. If a [6] J. D. Watson and F. H. C. Crick, Molecular structure
node cannot be extended it is saved and a new starting node of nucleic acids, Nature, vol. 171, no. 4356, pp. 737-738,
is taken from the node set. 1953.
4. EVALUATION OF CORRECTNESS [7] F. H. C. Crick, The origin of the genetic code, Journal
of Molecular Biology, vol. 38, no. 3. pp. 367-379, 1968.
The software / assembly algorithm was first tested for cor-
rectness by using artificially generated test data, using the [8] A. Alekseyenko et al, NEXT-GENERATION DNA
small utility tool developed alongside the main program. A SEQUENCING INFORMATICS. New York, New York:
random sequence of the four letters A,T,C,G was generated Cold Spring Harbor Laboratory Press, 2013.
in arbitrary length and then random subsequences were gen-
erated as reads in the amount to reach 5-10 times average [9] L. Liu, et al., Comparison of next-generation
coverage. Given this input, the algorithm is capable to fully sequencing systems, Journal of Biomedicine and
reconstruct the original sequence. For the next stage of test- Biotechnology, vol. 2012. 2012.
ing, we chose to use the genome of the Pseudorabies virus
(PRV). The Kaplan strain of PRV has genome size of ap- [10] Phillip E C Compeau, Pavel A Pevzner & Glenn
proximately 143kbp (kilobase pair). It serves our purpose Tesler, How to apply de Bruijn graphs to genome
well, because it is small, it is very tightly packed (approx. assembly Nature Biotechnology 29, 987-991 (2011)
70 are contained within the genome and it has nested and doi:10.1038/nbt.2023
overlapping genes), it has two large repeat regions that have
opposite orientation. Due to these features it poses a great [11] http://www.bioinformatics.org/sms/iupac.html.
challenge for a de novo genome assembly algorithm such as
detailed above. Using 40 base pair reads with 38 base pair
long k-mers the algorithm does fairly well. It produces 238
contigs, with maximal contig length of 58210bps, minimum
of 47bps and an average of 12020bps. With some manual
work the original sequence can be assembled.
5. CONCLUSION
Our aim was to create a de novo assembly algorithm us-
ing the MapReduce paradigm and Hadoop as a MapReduce
execution framework. Our current work presents the devel-
opment of such program along with a useful compression
StuCoSReC Proceedings of the 2015 2nd Student Computer Science Research Conference 56
Ljubljana, Slovenia, 6 October
defined length. The first two jobs prepare data for the third still under development it already shows promise and great
job which assembles k-mers into as large contigs as possible results. Further work includes assembly of contigs into a
(this is carried out in the Reduce phase of the job). A simple full genome if possible, add data cleaning part to the suite
but efficient trick is utilized to ensure that k-mers that most to process raw data and widen the accepted type and format
likely belong together, end up at the same Reducer and can of input data.
be assembled into a contig. Throughout the algorithm the
sequence information is used as key, thus k-mers arrive to 6. REFERENCES
this job in lexicographically sorted format, and since a strict
overlapping policy is used (adjacent ”nodes” must overlap all [1] Stefan K. Bohlander ABCs of genomics ASH Education
but one base), sequences that are close to each other in the Book December 6, 2013 vol. 2013 no. 1 316-323
data, are most likely belong together. When data is read
by the framework from the disk, it is split to smaller chunks [2] Ombrello MJ, Sikora KA, Kastner DL. Genetics,
and these data splits are sent to Mappers. In this job each genomics and their relevance to pathology and therapy.
unique Mapper uses a unique key when it emits key-value Best practice & research Clinical rheumatology.
pairs, thus when the framework groups the intermediate re- 2014;28(2):175-189. doi:10.1016/j.berh.2014.05.001.
sults, data that was processed by a particular Mapper gets
grouped together and sent to same Reducer. [3] Velvet: algorithms for de novo short read assembly
using de Bruijn graphs. D.R. Zerbino and E. Birney.
We neither explicitly build any graph nor contigs are as- Genome Research 18:821-829.
sembled via an Eulerian path, but rather the connectivity
of the de Bruijn based graph is used. The Reducer iter- [4] Y.-J. Chang, C.-C. Chen, C.-L. Chen, and J.-M. Ho, A
ates through the data is was assigned to, and for each k-mer de novo next generation genomic sequence assembler
it collects the start- and end-slices. These are stored in based on string graph and MapReduce cloud computing
TreeMap structures and the k-mers (”nodes”) are stored in framework., BMC Genomics, vol. 13 Suppl 7, no. Suppl
a set. After this, the contig assembly is carried out by an 7, p. S28, Jan. 2012.
iterative process. The first element of the node set is taken
and attempt is made to extend it in both directions. Also [5] R. Dahm, Discovering DNA: Friedrich Miescher and
forking (multiple adjacent ”node” is present) is kept track of. the early years of nucleic acid research, Human
This extension step is continued as long as the current node Genetics, vol. 122, no. 6. pp. 565-581, 2008.
can be combined with another at least in one direction. The
nodes used in the extension are removed from the set. If a [6] J. D. Watson and F. H. C. Crick, Molecular structure
node cannot be extended it is saved and a new starting node of nucleic acids, Nature, vol. 171, no. 4356, pp. 737-738,
is taken from the node set. 1953.
4. EVALUATION OF CORRECTNESS [7] F. H. C. Crick, The origin of the genetic code, Journal
of Molecular Biology, vol. 38, no. 3. pp. 367-379, 1968.
The software / assembly algorithm was first tested for cor-
rectness by using artificially generated test data, using the [8] A. Alekseyenko et al, NEXT-GENERATION DNA
small utility tool developed alongside the main program. A SEQUENCING INFORMATICS. New York, New York:
random sequence of the four letters A,T,C,G was generated Cold Spring Harbor Laboratory Press, 2013.
in arbitrary length and then random subsequences were gen-
erated as reads in the amount to reach 5-10 times average [9] L. Liu, et al., Comparison of next-generation
coverage. Given this input, the algorithm is capable to fully sequencing systems, Journal of Biomedicine and
reconstruct the original sequence. For the next stage of test- Biotechnology, vol. 2012. 2012.
ing, we chose to use the genome of the Pseudorabies virus
(PRV). The Kaplan strain of PRV has genome size of ap- [10] Phillip E C Compeau, Pavel A Pevzner & Glenn
proximately 143kbp (kilobase pair). It serves our purpose Tesler, How to apply de Bruijn graphs to genome
well, because it is small, it is very tightly packed (approx. assembly Nature Biotechnology 29, 987-991 (2011)
70 are contained within the genome and it has nested and doi:10.1038/nbt.2023
overlapping genes), it has two large repeat regions that have
opposite orientation. Due to these features it poses a great [11] http://www.bioinformatics.org/sms/iupac.html.
challenge for a de novo genome assembly algorithm such as
detailed above. Using 40 base pair reads with 38 base pair
long k-mers the algorithm does fairly well. It produces 238
contigs, with maximal contig length of 58210bps, minimum
of 47bps and an average of 12020bps. With some manual
work the original sequence can be assembled.
5. CONCLUSION
Our aim was to create a de novo assembly algorithm us-
ing the MapReduce paradigm and Hadoop as a MapReduce
execution framework. Our current work presents the devel-
opment of such program along with a useful compression
StuCoSReC Proceedings of the 2015 2nd Student Computer Science Research Conference 56
Ljubljana, Slovenia, 6 October