Page 37 - Fister jr., Iztok, Andrej Brodnik, Matjaž Krnc and Iztok Fister (eds.). StuCoSReC. Proceedings of the 2019 6th Student Computer Science Research Conference. Koper: University of Primorska Press, 2019
imum similarity from instances of the other groups [13]. In Figure 1, a simple clustering problem is presented, where we can simply find the centers of the clusters.

[Figure 1: Example of a simple clustering dataset — scatter plot of attributes a1 and a2 for Class 0, Class 1, and Class 2; figure not reproduced]

If we have a dataset as presented in Figure 1, we can use the basic clustering function, expressed as:

    f(O, Z) = \sum_{j=0}^{K-1} \sum_{i=0}^{N-1} w_{i,j} \times \|o_i - z_j\|_2,    (7)

where O is a dataset without data labels, Z represents the centers of the clusters, W is a matrix of weights of the instances for each cluster, N is the number of instances in the dataset O, K is the number of clusters in Z, and \|o_i - z_j\|_2 is the Euclidean distance between instance o_i and cluster center z_j. In paper [7], the authors added the weights W, which are set to one in the basic clustering function. Through the weights, we can provide additional information to the PSO algorithms. One example of weight usage is to set the weight w_{i,j} to one if instance i belongs to cluster j, and to zero otherwise. A second example of weight usage is fuzzy clustering, where the weight w_{i,j} takes a value in [0, 1].
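The basic clustering function from Eq. (7) can be sketched in plain Python as follows. This is a minimal illustration, not code from the paper; the function name, dataset, centers, and the hard 0/1 weight matrix below are hypothetical.

```python
import math

def basic_clustering_fitness(O, Z, W):
    """Eq. (7): f(O, Z) = sum over j and i of w[i][j] * ||o_i - z_j||_2."""
    total = 0.0
    for j, z in enumerate(Z):                   # j = 0 .. K-1 (cluster centers)
        for i, o in enumerate(O):               # i = 0 .. N-1 (instances)
            total += W[i][j] * math.dist(o, z)  # weighted Euclidean distance
    return total

# Hypothetical data: 4 instances, 2 centers, hard 0/1 membership weights.
O = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
Z = [(0.5, 0.0), (5.5, 5.0)]
W = [[1, 0], [1, 0], [0, 1], [0, 1]]   # w[i][j] = 1 iff instance i belongs to cluster j

print(basic_clustering_fitness(O, Z, W))  # each instance is 0.5 from its center -> 2.0
```

With fuzzy weights, each row of W would instead hold membership degrees in [0, 1], and the same function applies unchanged.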

In Figure 2, a harder clustering problem is shown, where the centers of the clusters representing Class 0 and Class 2 are not so trivial to find.

[Figure 2: Example of a hard clustering dataset — scatter plot of attributes a1 and a2 for Class 0, Class 1, and Class 2; figure not reproduced]

The clustering problem defined in Figure 2 is quite hard to solve with the K-means algorithm, because the mean values of the attributes a1 and a2 are very close for both classes; the instances of class 0 and class 2 even overlap in both attributes. Therefore, we use optimization to solve the problem of clustering all three classes represented in Figure 2. For clustering optimization, we need a fitness function that guides our PSO algorithms toward optimal solutions in the search space. In this study, we propose the following fitness functions for clustering:

• basic clustering function,
• K-means clusters evaluation function,
• K-means clusters evaluation function with penalty function 1, and
• K-means clusters evaluation function with penalty function 2.

In the remainder of this section, the proposed fitness functions for clustering optimization are described in more detail.

The second fitness function for clustering optimization is defined as follows:

    f(z, O) = p(z) + \sum_{i=0}^{N-1} \min_{j=0}^{K-1} \left( w_{i,j} \times \|o_i - z_j\|_2 \right),    (8)

where p denotes a penalty function [1]. The optimization function from Eq. (8) is also used for evaluating clusters in the K-means algorithm [9], where the penalty function is not used.

The third fitness function for clustering optimization additionally takes intra-cluster distances into account, with the penalty function defined as:

    p(z) = \sum_{\forall e \in I} \sum_{j=0}^{A-1} \min\left( \frac{r_j}{K}, \max\left( 0, \frac{r_j}{K} - |z_{e_0,j} - z_{e_1,j}| \right) \right),    (9)

where r_j, calculated as |u_j - l_j|, represents the intra-cluster distance between the upper limit u_j and the lower limit l_j of attribute j, A is the number of attributes in one instance, and I is a set of vectors with two components that represent indexes of clusters. The set I is generated with the help of Alg. 1, which yields the pairs of cluster indexes that need to be checked for intra-cluster distances.

Algorithm 1: Creating the set I for calculating the penalty of individual Z based on cluster distances
Input: K
Output: I
1  I ← {};
2  for i ← 0 to K − 1 do
3      for j ← 1 to K − i − 1 do
4          if i ≠ K − j − 1 then
5              I ← I ∪ {(i, K − j − 1)};
6          end
7      end
8  end
9  return I;

The penalty function 1 adds a penalty based on the intra-cluster distances, which means that if an intra-cluster distance is too small, then the solution gets its penalty based on that distance. The maximum penalty
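Equations (8) and (9) can be combined into a short Python sketch. This is a minimal reading of the formulas, not code from the paper: for brevity we enumerate every unordered pair of cluster indexes with itertools.combinations as a stand-in for the set I produced by Alg. 1, we interpret r_j/K as a per-attribute separation threshold, and all names and data below are hypothetical.

```python
import math
from itertools import combinations

def penalty(Z, lower, upper):
    """Eq. (9): penalize pairs of cluster centers that lie closer than
    r_j / K in attribute j, where r_j = |u_j - l_j| is the attribute range."""
    K, A = len(Z), len(Z[0])
    p = 0.0
    for e0, e1 in combinations(range(K), 2):  # stand-in for the set I of Alg. 1
        for j in range(A):
            r = abs(upper[j] - lower[j])
            p += min(r / K, max(0.0, r / K - abs(Z[e0][j] - Z[e1][j])))
    return p

def kmeans_fitness(Z, O, lower, upper, use_penalty=True):
    """Eq. (8) with all weights w[i][j] = 1: every instance contributes
    its distance to the nearest cluster center, plus the penalty p(z)."""
    f = sum(min(math.dist(o, z) for z in Z) for o in O)
    return f + (penalty(Z, lower, upper) if use_penalty else 0.0)

# Hypothetical data: well-separated centers incur no penalty,
# while coincident centers are penalized and score worse (higher fitness).
O = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
lower, upper = (0.0, 0.0), (6.0, 5.0)
good = [(0.5, 0.0), (5.5, 5.0)]
bad = [(0.5, 0.0), (0.5, 0.0)]
print(kmeans_fitness(good, O, lower, upper) < kmeans_fitness(bad, O, lower, upper))  # True
```

Setting use_penalty=False recovers the plain K-means evaluation function, matching the remark after Eq. (8) that the penalty is not used there.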

StuCoSReC Proceedings of the 2019 6th Student Computer Science Research Conference 37
Koper, Slovenia, 10 October