Page 59 - Fister jr., Iztok, and Andrej Brodnik (eds.). StuCoSReC. Proceedings of the 2015 2nd Student Computer Science Research Conference. Koper: University of Primorska Press, 2015
ce, eliminate the possibility of unique alleles occurring because of different populations breeding. This is one of the reasons why we implemented the possibility to include GPS coordinates of each genotyped sample. Individuals can therefore also be compared by viewing the geographical locations where the samples were obtained.

4.3 Allele Frequency

Allele frequency is the number of copies of a particular allele divided by the number of copies of all alleles at the genetic place (locus) in a population. It is usually expressed as a percentage. In population genetics, allele frequencies are used to depict the amount of genetic diversity at the individual, population, and species level. It is also the relative proportion of all alleles of a gene that are of a designated type. The frequencies of alleles at a given locus are usually graphed as a histogram, an allele frequency spectrum, or an allele frequency distribution. We have done the latter. In population genetics it is important to study the changes in the distribution of allele frequencies, which usually occur on account of genetic drift, mutation, or migration of a population [5].

4.4 Polymorphism Information Content (PIC)

The PIC value is mainly used to assess the diversity of a gene or DNA segment in a population, which sheds light on the evolutionary pressure on the allele and the mutations the locus might have undergone over a period of time. The PIC value will be close to zero if there is little allelic variation, and it can reach a maximum of 1.0 if a genotype has only new alleles, which is a rare phenomenon. PIC values are defined as:

$$PIC_i = 1 - \sum_{j=1}^{n} p_{ij}^2$$

where $PIC_i$ is the polymorphic information content of a marker $i$, $p_{ij}$ is the frequency of the $j$th pattern for marker $i$, and the summation extends over $n$ patterns.

4.5 Effective number of alleles (ne)

This measure is the number of equally frequent alleles it would take to achieve a given level of gene diversity. That is, it allows us to compare populations where the number and distributions of alleles differ drastically. The formula is:

$$A_E = \frac{1}{r} \sum_{j=1}^{r} \frac{1}{1 - D_j}$$

where $D_j$ is the gene diversity of the $j$th of $r$ loci.

4.6 Probability of Identity (PI)

Identifying individuals is of great importance in conservation genetics and molecular ecology. Statistical estimates are commonly used to compute the probability of sampling identical genotypes. These statistical estimates usually assume a random association between alleles within and among loci. Probability of Identity is a popular estimator [10], [6], [8], [13] that represents the probability that two individuals randomly drawn from a population have the same genotype at multiple loci. The P(ID) estimator is defined as:

$$P_{(ID)} = \sum_{i} p_i^4 + \sum_{i} \sum_{j > i} (2 p_i p_j)^2$$

where $p_i$ and $p_j$ are the frequencies of the $i$th and $j$th alleles and $i \neq j$ [9].

We implemented the P(ID) for each locus using allele frequencies from a population sample.

5. FUTURE WORK

The database application is a basic tool that helps researchers gain more insight into their genotyping data, but the key problem remains unsolved. Individual identification is challenging because the allele lengths are somewhat inaccurate due to mutation, but even more importantly due to the different laboratory equipment and processes used for amplifying DNA. We explored different ideas, one of which is clustering using convex hulls. The data is by nature multidimensional and hence the clustering is subject to the curse of dimensionality [2].

Computing convex hulls in high dimensions takes time exponential in the number of dimensions. Arguably, in our case we are not required to compute the convex hulls; instead, we only need to answer whether a point would fall inside the convex hull of a set of points. The following are a few approaches/ideas on how to answer such queries efficiently.

1. Finding a feasible solution to a Linear Program.
   For a set of points $A = (X[1]\, X[2]\, \ldots\, X[n])$ and a query point $P$, solve the following linear program:

   $$\begin{aligned}
   \text{minimize (over } x\text{):} \quad & 1 \\
   \text{subject to:} \quad & Ax = P \\
   & x^T \cdot [1] = 1 \\
   & x[i] \geq 0 \ \text{for all } i
   \end{aligned}$$

   where:
   $x^T$ is the transpose of $x$
   $[1]$ is the all-1 vector

   The problem has a solution if and only if the point is in the convex hull.

2. Solving m linear equations with n unknowns.
   We use the following property:

   Any vector point $v$ inside the convex hull of points $[v_1, v_2, \ldots, v_n]$ can be represented as $\sum_i k_i \cdot v_i$, where $0 \leq k_i \leq 1$ and $\sum_i k_i = 1$. Correspondingly, no point outside of the convex hull will have such a representation. In $m$-dimensional space, this results in a set of $m$ linear equations with $n$ unknowns.

3. A point lies outside of the convex hull of a set of points if and only if the directions of all the vectors from it to all other points are on less than one half of a circle as
StuCoSReC Proceedings of the 2015 2nd Student Computer Science Research Conference 59
Ljubljana, Slovenia, 6 October
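The per-locus measures defined in Sections 4.3 to 4.6 can be illustrated with a minimal Python sketch. It is not the implementation used in the database application; the function names, the representation of a genotype as a pair of allele lengths, and the sample values are assumptions made only for this example. The sketch also assumes that the gene diversity $D_j$ of a locus is computed with the same expression as the PIC value in Section 4.4.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Relative allele frequencies at one locus.

    `genotypes` is a list of (allele_a, allele_b) pairs, one pair per
    diploid individual (hypothetical input format).
    """
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def pic(freqs):
    """PIC as defined in Section 4.4: 1 minus the sum of squared frequencies."""
    return 1.0 - sum(p ** 2 for p in freqs.values())


def effective_alleles(diversities):
    """A_E from Section 4.5: mean of 1 / (1 - D_j) over the r loci."""
    return sum(1.0 / (1.0 - d) for d in diversities) / len(diversities)


def probability_of_identity(freqs):
    """P(ID) from Section 4.6 for a single locus."""
    p = list(freqs.values())
    homozygous = sum(x ** 4 for x in p)
    # Each unordered pair of distinct alleles (i != j) is counted once.
    heterozygous = sum(
        (2 * p[i] * p[j]) ** 2
        for i in range(len(p))
        for j in range(i + 1, len(p))
    )
    return homozygous + heterozygous


# Made-up allele lengths for four individuals at one locus.
sample = [(100, 100), (100, 104), (104, 108), (100, 108)]
freqs = allele_frequencies(sample)   # {100: 0.5, 104: 0.25, 108: 0.25}
locus_pic = pic(freqs)               # 0.625
locus_pid = probability_of_identity(freqs)  # 0.2109375
```

For this sample, allele 100 accounts for half of the eight allele copies, so the locus has a PIC of 0.625 and a P(ID) of about 0.21. Passing `[locus_pic]` to `effective_alleles` gives roughly 2.67 equally frequent alleles.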