Page 6 - Krész, Miklós, and Andrej Brodnik (eds.). MATCOS-13. Proceedings of the 2013 Mini-Conference on Applied Theoretical Computer Science. Koper: University of Primorska Press, 2016.
P. 6
query and the target structures, thus we also considered (e.g., C, N, O) and bonds (e.g., simple, double, triple) as
this problem. We tested the correctness and efficiency of our well.
implementations on real data sets.
In another screening method, two matrices are calculated for
2. THE SUBGRAPH ISOMORPHISM each molecule. Columns and rows correspond to frequent
atom types3 (e.g., C, N, O) and bond types (e.g., simple,
PROBLEM double), respectively. Initially, all matrix elements are zero.
The bonds of the molecules are examined sequentially. If
Definition 1. A labeled graph is defined as a 6-tuple the current bond connects two atoms of frequent type, then
G = (V, E, ΣV , ΣE, lV , lE) where V is the set of vertices, E the two elements of the matrix are increased by one. It can
is the set of undirected edges, ΣV and ΣE are sets of vertex easily be proved that each element of the query’s matrix
and edge labels, and lV : V → ΣV , lE : E → ΣE denote is less than or equal to the corresponding element in the
label functions mapping a vertex or an edge to a label, re- target’s matrix if query ⊆ target.
spectively.
Simple example molecules are shown in Figure 3. The cor-
Definition 2. A graph G1 = (V1, E1, ΣV1 , ΣE1 , lV1 , lE1 ) responding matrices are shown in Table 1. The elements of
is subgraph isomorphic to G2 = (V2, E2, ΣV2 , ΣE2 , lV2 , lE2 ) the matrices of the query and the target are separated by a
(denoted by G1 ⊆ G2) if there exists an injective function colon. As the query’s matrix contains non-zero elements and
f : V1 → V2 satisfying all elements are zero in the target’s matrix, query target
holds.
1. ∀u∈V1 (lV1 (u) = lV2 (f (u)))
Figure 3: Example molecules to screening
2. ∀u, v∈V1 ((u, v) ∈ E1 ⇒ (f (u), f (v)) ∈ E2)
3. ∀u, v∈V1 ((u, v)∈E1 ⇒ lE1 ((u, v)) = lE2 ((f (u), f (v))))
Notice that G2 does not necessarily contain G1 as induced
subgraph. A simple example is shown in Figure 2. A possible
mapping is
f (1) = 14 f (5) = 5 f (9) = 12
f (2) = 2 f (6) = 8 f (10) = 3
f (3) = 1 f (7) = 4
f (4) = 6 f (8) = 11
Figure 2: An example where G2 contains G1 as sub- C N O F P S ...
structure single 4 : 0 1 : 0 1 : 0 0 : 0 0 : 0 0 : 0 . . .
double 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 . . .
Note furthermore, that hydrogen is usually not represented triple 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 . . .
in molecules as its presence can be deduced from our chem-
ical knowledge. Table 1: Matrix
In cheminformatics, query molecule, target molecule, atoms, 4. ALGORITHMS
bonds and substructure are usually used instead of G1, G2,
vertices, edges and subgraph, respectively, therefore these We have studied three well-known substructure search algo-
notions are used in the rest of the paper. rithms, the Ullmann algorithm [5], the VF2 [2], and the
atom-reordering step of the QuickSI [4]. Although they
3. SCREENING are all backtracking algorithms, they build the backtrack-
ing trees in different ways. Furthermore, QuickSI rearranges
A preprocessing step, called screening, is used to exclude as the atoms of the query molecule in the first step to achieve
many molecules as possible from the target database that better performance. Each node of the backtracking tree rep-
cannot contain the query molecule as substructure. Our aim resents a partial isomorphism from the query to the target
was not to study and improve the existing screening meth- molecule. As this tree can be exponential in size, the pur-
ods, therefore only some simple conditions were checked pose of the algorithms is to filter out nodes that can not
before running the implemented substructure search algo- be extended to achieve a subgraph isomorphism function.
rithms. If query ⊆ target then the atom/bond count of the Therefore, different feasibility functions are applied by the
query is less than or equal to the atom/bond count of the methods.
target. Similar inequalities hold for various types of atoms
The Ullmann algorithm maintains a boolean matrix (de-
noted by M ) representing the possible branches at the dif-
ferent depths. Mi,j is true if the current partial mapping
can be extended by adding f (i) = j to achieve a subgraph
3The frequent atom types were determined based on a public
molecule set of National Cancer Institute [3].
m a t c o s -1 3 Proceedings of the 2013 Mini-Conference on Applied Theoretical Computer Science 6
Koper, Slovenia, 10-11 October
this problem. We tested the correctness and efficiency of our well.
implementations on real data sets.
In another screening method, two matrices are calculated for
2. THE SUBGRAPH ISOMORPHISM each molecule. Columns and rows correspond to frequent
atom types3 (e.g., C, N, O) and bond types (e.g., simple,
PROBLEM double), respectively. Initially, all matrix elements are zero.
The bonds of the molecules are examined sequentially. If
Definition 1. A labeled graph is defined as a 6-tuple the current bond connects two atoms of frequent type, then
G = (V, E, ΣV , ΣE, lV , lE) where V is the set of vertices, E the two elements of the matrix are increased by one. It can
is the set of undirected edges, ΣV and ΣE are sets of vertex easily be proved that each element of the query’s matrix
and edge labels, and lV : V → ΣV , lE : E → ΣE denote is less than or equal to the corresponding element in the
label functions mapping a vertex or an edge to a label, re- target’s matrix if query ⊆ target.
spectively.
Simple example molecules are shown in Figure 3. The cor-
Definition 2. A graph G1 = (V1, E1, ΣV1 , ΣE1 , lV1 , lE1 ) responding matrices are shown in Table 1. The elements of
is subgraph isomorphic to G2 = (V2, E2, ΣV2 , ΣE2 , lV2 , lE2 ) the matrices of the query and the target are separated by a
(denoted by G1 ⊆ G2) if there exists an injective function colon. As the query’s matrix contains non-zero elements and
f : V1 → V2 satisfying all elements are zero in the target’s matrix, query target
holds.
1. ∀u∈V1 (lV1 (u) = lV2 (f (u)))
Figure 3: Example molecules to screening
2. ∀u, v∈V1 ((u, v) ∈ E1 ⇒ (f (u), f (v)) ∈ E2)
3. ∀u, v∈V1 ((u, v)∈E1 ⇒ lE1 ((u, v)) = lE2 ((f (u), f (v))))
Notice that G2 does not necessarily contain G1 as induced
subgraph. A simple example is shown in Figure 2. A possible
mapping is
f (1) = 14 f (5) = 5 f (9) = 12
f (2) = 2 f (6) = 8 f (10) = 3
f (3) = 1 f (7) = 4
f (4) = 6 f (8) = 11
Figure 2: An example where G2 contains G1 as sub- C N O F P S ...
structure single 4 : 0 1 : 0 1 : 0 0 : 0 0 : 0 0 : 0 . . .
double 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 . . .
Note furthermore, that hydrogen is usually not represented triple 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 0 : 0 . . .
in molecules as its presence can be deduced from our chem-
ical knowledge. Table 1: Matrix
In cheminformatics, query molecule, target molecule, atoms, 4. ALGORITHMS
bonds and substructure are usually used instead of G1, G2,
vertices, edges and subgraph, respectively, therefore these We have studied three well-known substructure search algo-
notions are used in the rest of the paper. rithms, the Ullmann algorithm [5], the VF2 [2], and the
atom-reordering step of the QuickSI [4]. Although they
3. SCREENING are all backtracking algorithms, they build the backtrack-
ing trees in different ways. Furthermore, QuickSI rearranges
A preprocessing step, called screening, is used to exclude as the atoms of the query molecule in the first step to achieve
many molecules as possible from the target database that better performance. Each node of the backtracking tree rep-
cannot contain the query molecule as substructure. Our aim resents a partial isomorphism from the query to the target
was not to study and improve the existing screening meth- molecule. As this tree can be exponential in size, the pur-
ods, therefore only some simple conditions were checked pose of the algorithms is to filter out nodes that can not
before running the implemented substructure search algo- be extended to achieve a subgraph isomorphism function.
rithms. If query ⊆ target then the atom/bond count of the Therefore, different feasibility functions are applied by the
query is less than or equal to the atom/bond count of the methods.
target. Similar inequalities hold for various types of atoms
The Ullmann algorithm maintains a boolean matrix (de-
noted by M ) representing the possible branches at the dif-
ferent depths. Mi,j is true if the current partial mapping
can be extended by adding f (i) = j to achieve a subgraph
3The frequent atom types were determined based on a public
molecule set of National Cancer Institute [3].
m a t c o s -1 3 Proceedings of the 2013 Mini-Conference on Applied Theoretical Computer Science 6
Koper, Slovenia, 10-11 October