Page 55 - Fister jr., Iztok, and Andrej Brodnik (eds.). StuCoSReC. Proceedings of the 2015 2nd Student Computer Science Research Conference. Koper: University of Primorska Press, 2015
P. 55
m its sub-strings. One of the key difficulties is, that the key, and values with the same key are passed to the same
genetic material is far from random sequence of letters. As reducer), which is then aggregated - again by a program-
mentioned before, the genetic material has many different mer defined function - by Reducers. The actual processing
types of sequences and also variations between the copies of is coordinated by the execution framework (including the
the genome that naturally occur, must be taken into consid- division of data to several splits and gathering intermediate
eration when reconstructing a genome (coverage plays im- data). Although, it may seem that this two phase processing
portant role in filtering out plausible and artificial variants structure is very restrictive, in reality many algorithms can
of the same sequence). Due to these characteristics of the ge- be carried out this way, especially if complex problems are
netic material, automated full reconstruction of a reference decomposed to a series of MapReduce jobs. In MapReduce
genome is quite difficult, even current solutions aim to cre- key-value pairs are the basic data structure and both keys
ate as large contiguous fragments (contigs) as possible, and and values can be primitives or arbitrarily complex struc-
then these contigs are assembled manually or algorithmically tures. The data is stored in a distributed file system (HDFS)
using references from closely related species (although even across data nodes. To avoid moving large amount of data
in the latter case, manual proofreading is still a must). around the cluster, Hadoop brings the code to be executed
to the data and tries to schedule tasks on nodes where the
Modern genome assemblers use de Bruijn digraphs or meth- required data is present, to achieve data locality. It is impor-
ods based on them. The concept behind de Bruijn digraphs tant to note that each Mapper and Reducer runs in isolation
in de novo genome assembly (see details in reference [10]), and there are no real global variables in Hadoop. More in-
is to use these k-mers as nodes, and connect the nodes formation about Hadoop and MapReduce can be found at
if they overlap all but one base, that can be done with the official website: https://hadoop.apache.org/.
a computationally simple exact matching. The use of k-
mers is very useful, because at given genome size (L) and To reduce file size and speed up computation we have devel-
fixed read length, the number of k-mers is also fixed, thus oped a compression method that can reduce pure sequence
using more reads to achieve higher coverage does not re- information to 25% of its original size, and can compress
sult in increased complexity. The n-dimensional k-ary de (lossy compression) FastQ files to 50% of their size, on av-
Bruijn digraph (D(k,n)) is a directed graph with kn nodes erage. The compression stores only a subset of the original
that are labelled as an n-digit k-ary number (the k-mers Phred scores, and the practical idea behind this, is that in
of the original sequence), and node a1a2a3a4...an is adja- most cases the range 20 to 83 is more than enough handle
cent to k other nodes that are a2a3a4...anα, where α is every scenario as Phred scores below 20 is so poor and scores
from {0, 1, 2, ...k −1}. Due to biological constraints genomes above 80 are so good that values beyond these can be cut
rarely contain all possible subsequence even for short read off. Also ambiguity codes (letters that may represent more
lengths, thus a slightly different graph is used instead, that than one nucleobase see ref [11]) are not considered useful
is called de Bruijn based digraph, where the principle is the in de novo assembly, therefore they are omitted from this
same, but nodes have different in-, and out-degrees. A single compressed file format. This small program was developed
Eulerian-path can reconstruct the original sequence in such separately, but can easily be expressed as a MapReduce job.
graph. However, in large graphs bubbles and tips may oc-
cur and to avoid these, further transformation of the graph is The de novo assembly program is written in Java since Hadoop
required. Let S be a genomic sequence of length L. A gener- itself is implemented in this language. The algorithm itself
alized de Bruijn digraph (GDB) G(p’,e’,K,T,H) is a directed is a series of MapReduce jobs, where each subsequent job
graph with p’ nodes that are labelled as b1b2b3...bk for some takes its input from the previous job’s output stored on the
k in K = {k1k2k3...kn} set of integers and ki < L and e’ arcs distributed file system. Currently the algorithm contains
labelled as (t, c1c2c3...ck) with t in T = {t1t2t3ti}, ti ≤ k −1, four jobs that do the following tasks:
and some h in H = {h1h2h3...hs}, hi ≤ L in a way that node
b1b2b3...bk is adjacent to b(t+1)b(t+2)...bkc1c2c3...ch, where 1. counts unique reads present in the input
b1 and ci are from the set {A, C, T, G}. It is clear from the
definition that nodes in the GDB have variable length. An- 2. produce k-mers of user defined size from reads
other observation can be made: in the GDB a node can be
shifted t spaces to the left (whereas in DBB t = 1, thus 3. assembles of contigs from k-mers
only a single base can be shifted), and a sequence of var-
ious length of h can be concatenated form the right. An 4. assembles contigs into a single genome (if possible). -
Eulerian-path in the graph gives back the original sequence. Currently in development.
3. THE MAPREDUCE FRAMEWORK AND Processing of input arguments, configuring separate jobs
THE DEVELOPED ALGORITHMS and submitting them to the framework is orchestrated by
a MapReduce Driver. This is a separate class that holds the
MapReduce can refer to a programming paradigm and an main entry point for executing the Java program (Hadoop
execution framework at the same time. MapReduce as a pro- accepts a collection of jobs in jar format).
gramming paradigm has it roots in the functional program-
ming, where higher-order functions exist, which can accept The first job is a simple word count implemented in MapRe-
other functions as arguments. This paradigm defines a two duce fashion. The Mapper emits a sequence and a integer 1,
phase general logic to process large amounts of data. The and the reducer counts the unique sequences and emits each
data is first transformed by Mappers in a programmer de- sequence and an integer representing the number of occur-
fined function (in a parallel manner) to yield intermediate re- rence. This serves as input for the second job, that produces
sults (the immediate results are sorted and grouped by their
StuCoSReC Proceedings of the 2015 2nd Student Computer Science Research Conference 55
Ljubljana, Slovenia, 6 October
genetic material is far from random sequence of letters. As reducer), which is then aggregated - again by a program-
mentioned before, the genetic material has many different mer defined function - by Reducers. The actual processing
types of sequences and also variations between the copies of is coordinated by the execution framework (including the
the genome that naturally occur, must be taken into consid- division of data to several splits and gathering intermediate
eration when reconstructing a genome (coverage plays im- data). Although, it may seem that this two phase processing
portant role in filtering out plausible and artificial variants structure is very restrictive, in reality many algorithms can
of the same sequence). Due to these characteristics of the ge- be carried out this way, especially if complex problems are
netic material, automated full reconstruction of a reference decomposed to a series of MapReduce jobs. In MapReduce
genome is quite difficult, even current solutions aim to cre- key-value pairs are the basic data structure and both keys
ate as large contiguous fragments (contigs) as possible, and and values can be primitives or arbitrarily complex struc-
then these contigs are assembled manually or algorithmically tures. The data is stored in a distributed file system (HDFS)
using references from closely related species (although even across data nodes. To avoid moving large amount of data
in the latter case, manual proofreading is still a must). around the cluster, Hadoop brings the code to be executed
to the data and tries to schedule tasks on nodes where the
Modern genome assemblers use de Bruijn digraphs or meth- required data is present, to achieve data locality. It is impor-
ods based on them. The concept behind de Bruijn digraphs tant to note that each Mapper and Reducer runs in isolation
in de novo genome assembly (see details in reference [10]), and there are no real global variables in Hadoop. More in-
is to use these k-mers as nodes, and connect the nodes formation about Hadoop and MapReduce can be found at
if they overlap all but one base, that can be done with the official website: https://hadoop.apache.org/.
a computationally simple exact matching. The use of k-
mers is very useful, because at given genome size (L) and To reduce file size and speed up computation we have devel-
fixed read length, the number of k-mers is also fixed, thus oped a compression method that can reduce pure sequence
using more reads to achieve higher coverage does not re- information to 25% of its original size, and can compress
sult in increased complexity. The n-dimensional k-ary de (lossy compression) FastQ files to 50% of their size, on av-
Bruijn digraph (D(k,n)) is a directed graph with kn nodes erage. The compression stores only a subset of the original
that are labelled as an n-digit k-ary number (the k-mers Phred scores, and the practical idea behind this, is that in
of the original sequence), and node a1a2a3a4...an is adja- most cases the range 20 to 83 is more than enough handle
cent to k other nodes that are a2a3a4...anα, where α is every scenario as Phred scores below 20 is so poor and scores
from {0, 1, 2, ...k −1}. Due to biological constraints genomes above 80 are so good that values beyond these can be cut
rarely contain all possible subsequence even for short read off. Also ambiguity codes (letters that may represent more
lengths, thus a slightly different graph is used instead, that than one nucleobase see ref [11]) are not considered useful
is called de Bruijn based digraph, where the principle is the in de novo assembly, therefore they are omitted from this
same, but nodes have different in-, and out-degrees. A single compressed file format. This small program was developed
Eulerian-path can reconstruct the original sequence in such separately, but can easily be expressed as a MapReduce job.
graph. However, in large graphs bubbles and tips may oc-
cur and to avoid these, further transformation of the graph is The de novo assembly program is written in Java since Hadoop
required. Let S be a genomic sequence of length L. A gener- itself is implemented in this language. The algorithm itself
alized de Bruijn digraph (GDB) G(p’,e’,K,T,H) is a directed is a series of MapReduce jobs, where each subsequent job
graph with p’ nodes that are labelled as b1b2b3...bk for some takes its input from the previous job’s output stored on the
k in K = {k1k2k3...kn} set of integers and ki < L and e’ arcs distributed file system. Currently the algorithm contains
labelled as (t, c1c2c3...ck) with t in T = {t1t2t3ti}, ti ≤ k −1, four jobs that do the following tasks:
and some h in H = {h1h2h3...hs}, hi ≤ L in a way that node
b1b2b3...bk is adjacent to b(t+1)b(t+2)...bkc1c2c3...ch, where 1. counts unique reads present in the input
b1 and ci are from the set {A, C, T, G}. It is clear from the
definition that nodes in the GDB have variable length. An- 2. produce k-mers of user defined size from reads
other observation can be made: in the GDB a node can be
shifted t spaces to the left (whereas in DBB t = 1, thus 3. assembles of contigs from k-mers
only a single base can be shifted), and a sequence of var-
ious length of h can be concatenated form the right. An 4. assembles contigs into a single genome (if possible). -
Eulerian-path in the graph gives back the original sequence. Currently in development.
3. THE MAPREDUCE FRAMEWORK AND Processing of input arguments, configuring separate jobs
THE DEVELOPED ALGORITHMS and submitting them to the framework is orchestrated by
a MapReduce Driver. This is a separate class that holds the
MapReduce can refer to a programming paradigm and an main entry point for executing the Java program (Hadoop
execution framework at the same time. MapReduce as a pro- accepts a collection of jobs in jar format).
gramming paradigm has it roots in the functional program-
ming, where higher-order functions exist, which can accept The first job is a simple word count implemented in MapRe-
other functions as arguments. This paradigm defines a two duce fashion. The Mapper emits a sequence and a integer 1,
phase general logic to process large amounts of data. The and the reducer counts the unique sequences and emits each
data is first transformed by Mappers in a programmer de- sequence and an integer representing the number of occur-
fined function (in a parallel manner) to yield intermediate re- rence. This serves as input for the second job, that produces
sults (the immediate results are sorted and grouped by their
StuCoSReC Proceedings of the 2015 2nd Student Computer Science Research Conference 55
Ljubljana, Slovenia, 6 October