Fister jr., Iztok, and Andrej Brodnik (eds.). StuCoSReC. Proceedings of the 2018 5th Student Computer Science Research Conference. Koper: University of Primorska Press, 2018
Context-dependent chaining of heterogeneous data
[Extended Abstract]
Rok Gomišček
University of Ljubljana, Faculty of Computer and Information Science
Večna pot 113, 1000 Ljubljana, Slovenia
rok.gomiscek@fri.uni-lj.si

Tomaž Curk
University of Ljubljana, Faculty of Computer and Information Science
Večna pot 113, 1000 Ljubljana, Slovenia
tomaz.curk@fri.uni-lj.si
ABSTRACT

A typical heterogeneous data set may contain different types of objects, where only a small subset of direct relations among object types is supported by data. In our research we develop methods to predict the relation between indirectly related object types by chaining connections on intermediate object types, and methods to discover contexts of indirect relations.

A chain $C_{XZ}$ is a sequence of relations that connect object types $X$ and $Z$, while $C_{XZ|T}$ specifies that $X$ and $Z$ are connected through objects of type $T$. Two object types can have multiple parallel chains connecting them, each chain providing a different prediction. Different chains might result in different predictions, each correctly describing the relation, but in a different context.

We wish to ensure that different chains between two indirectly connected object types result in similar predictions, but only when sharing the same context. When predicting a new connection between object types using different paths between them, we might obtain different predictions. We propose to deal with this by grouping the predictions of different chains based on their similarity.

First we will propose a method to improve the results of chaining matrices on different paths between two indirectly connected object types. To predict the relation between indirectly connected object types, the matrices on a given chain are multiplied and then all predictions can be combined together. Transitivity of relations tells us that two objects that belong to two indirectly connected object types are connected if they connect to intermediate objects on some path. The number of intermediate objects between two objects can vary greatly, but that number is not taken into account. We will propose an approach to normalize the predictions by accounting for the number of different paths connecting objects from the two object types.

Since parallel chains between two indirectly connected object types can return different relations, we can see each chain as representing a different context, which presents a different kind of relation. For example, the structure of protein-protein interaction networks, which are used to represent functional relations among gene products, may vary greatly across different tissues. We propose to group parallel chains according to their contexts and change the optimization criteria so that similar chains will be clustered together. We will evaluate various clustering methods, such as k-means, k-NN and hierarchical clustering. We will also explore various possible data representations for clustering, such as the original data, approximations, and the product of factor matrices on a path.

We will validate our model on synthetic and real data. We will create synthetic data where we will guarantee that the target relation between indirectly connected object types can be predicted by a chain of connecting relations. We will generate the data by creating relations between object types where transitivity can be applied, that is, if $x_k \to v_l$ and $v_l \to z_n$ are true then $x_k \to z_n$ must also be true, where $\to$ denotes a function that maps objects to objects. This value will be used to evaluate the predictions of our method by computing the squared Euclidean distance between the actual data and the prediction. Some values will be removed from the data to simulate missing values and used for evaluation. We will also validate our model on real data where there are multiple parallel paths between two indirectly connected object types. We can find this kind of data in biological data sets with relations connecting genes, proteins, diseases, gene function annotation (gene ontology terms), pathways, and other biological entities.

Keywords

machine learning, matrix factorization, data fusion, chaining, transitivity
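As an illustration of the chaining idea in the abstract, the following minimal Python sketch multiplies the relation matrices along one chain, counts the number of distinct paths between end objects, and scores a prediction with the squared Euclidean distance. The abstract does not specify the normalization, so dividing each chained value by its path count is only an assumption made here; the function names and the toy matrices `R_XV`, `R_VZ`, and `target` are likewise hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def chain_prediction(relations):
    """Multiply the relation matrices along one chain, left to right."""
    result = relations[0]
    for matrix in relations[1:]:
        result = matmul(result, matrix)
    return result


def path_counts(relations):
    """Count distinct paths between end objects by chaining binarized matrices."""
    binary = [[[1 if value != 0 else 0 for value in row] for row in matrix]
              for matrix in relations]
    return chain_prediction(binary)


def normalized_prediction(relations):
    """Divide each chained value by its path count.

    ASSUMPTION: this particular normalization is illustrative only;
    the abstract does not state which normalization will be used.
    """
    raw = chain_prediction(relations)
    counts = path_counts(relations)
    return [[raw[i][j] / counts[i][j] if counts[i][j] else 0.0
             for j in range(len(raw[0]))]
            for i in range(len(raw))]


def squared_euclidean(a, b):
    """Sum of squared entrywise differences between two matrices."""
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


# Toy synthetic data (hypothetical): 2 objects of type X, 3 of type V, 2 of type Z.
R_XV = [[1, 1, 0],
        [0, 0, 1]]
R_VZ = [[1, 0],
        [1, 0],
        [0, 1]]
# Transitive target relation: x0 -> z0 and x1 -> z1.
target = [[1, 0],
          [0, 1]]

raw = chain_prediction([R_XV, R_VZ])
normalized = normalized_prediction([R_XV, R_VZ])

print(raw)                                    # [[2, 0], [0, 1]]
print(normalized)                             # [[1.0, 0.0], [0.0, 1.0]]
print(squared_euclidean(raw, target))         # 1
print(squared_euclidean(normalized, target))  # 0.0
```

In this toy case x0 reaches z0 through two intermediate objects, so the raw chained value is inflated to 2, while the path-count normalization recovers the transitive target exactly. Real relation matrices would typically be real-valued approximations obtained by matrix factorization rather than binary indicators.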
StuCoSReC. Proceedings of the 2018 5th Student Computer Science Research Conference, Ljubljana, Slovenia, 9 October. DOI: https://doi.org/10.26493/978-961-7055-26-9.71