Figure 4: Visualization of ten clusters resulting from clustering the first three principal components of the training segments' feature vectors

The recommended dictionary size ranges from 1000 to 3500 words. Classification with a smaller number of words does not reach the highest possible accuracy, but accuracy starts to stabilize once the dictionary grows beyond 1000 words [17].

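As an illustration of how the dictionary size enters the pipeline, the following minimal sketch builds a dictionary of a chosen size by clustering segment feature vectors with scikit-learn's KMeans, in line with the clustering shown in Figure 4; the array shapes, the random placeholder data and the choice of k-means itself are assumptions for illustration, not details taken from this section.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder training data: one row per extracted segment feature vector
# (in the paper these come from the feature extraction of Chapter 2.1).
rng = np.random.default_rng(0)
segment_features = rng.random((5000, 12))

dictionary_size = 1000  # recommended range is roughly 1000-3500 words [17]

# Cluster the segment feature vectors; each cluster centre acts as one
# dictionary "word".
kmeans = KMeans(n_clusters=dictionary_size, n_init=10, random_state=0)
kmeans.fit(segment_features)
dictionary = kmeans.cluster_centers_  # shape: (dictionary_size, n_features)
```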
2.3 Time series features

Defining the time series features starts with segmentation and feature extraction, described in Chapter 2.1. Segment feature vectors are compared with the dictionary words using the 1-nearest-neighbor algorithm with Euclidean distance as the distance measure. A histogram of dictionary word occurrences is created and normalized with the L2 norm. An example of a time series histogram is shown in Figure 5.

Figure 5: Normalized histogram based on a dictionary of ten words
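A minimal sketch of this step, assuming the segment_features and dictionary arrays from the previous example (placeholders, not the authors' implementation): each segment feature vector is assigned to its nearest dictionary word by Euclidean distance, and the resulting word counts are L2-normalized.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def time_series_histogram(segment_features, dictionary):
    """Bag-of-words feature vector of one time series.

    segment_features: (n_segments, n_features) vectors of its segments
    dictionary:       (n_words, n_features) dictionary word vectors
    """
    # 1-nearest-neighbor assignment with Euclidean distance.
    nn = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(dictionary)
    word_ids = nn.kneighbors(segment_features, return_distance=False).ravel()

    # Histogram of dictionary word occurrences.
    hist = np.bincount(word_ids, minlength=len(dictionary)).astype(float)

    # L2 normalization.
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```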
Training time series, described with these feature vectors, are used to train a classification model. For testing the classification model, time series from the test dataset are used.

2.4 Classification algorithms

Classification models were trained using the classification algorithms described below.

2.4.1 K-Nearest neighbors

The K-nearest neighbors (KNN) algorithm classifies a sample based on the distance between the training samples and the unclassified sample. Euclidean distance is usually used as the distance measure, and the label is determined from the K nearest training samples. An odd value should be selected for the parameter K, because the classification label is decided by majority vote [11]. The choice of K has a large effect on the final classification accuracy, and the best approach to selecting an appropriate K is trial and error. The time complexity of the algorithm is high, because most computational operations are performed during classification rather than in the training phase [10].

In this work, the Chi-square distance measure [18] was used, because it can measure differences between histograms. It is frequently used in the classification of textures, objects and shapes. The Chi-square distance takes the distribution of values and their frequencies into account and also has the important characteristic of weighting rarely occurring histogram values [1]. The Chi-square distance measure χ² is

\chi^2(x_i, x_j) = \frac{1}{2} \sum_{k=1}^{d} \frac{(x_{ik} - x_{jk})^2}{x_{ik} + x_{jk}}    (1)

where:

x = feature vector
i, j = feature vector indices
d = length of the feature vectors
k = feature index
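As a hedged illustration of how equation (1) can be plugged into the classifier, the sketch below defines the Chi-square distance and passes it to scikit-learn's KNeighborsClassifier as a user-defined metric; the odd K value, the placeholder data and the small epsilon guarding against division by zero are assumptions, not settings reported in the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def chi_square_distance(x_i, x_j, eps=1e-10):
    """Chi-square distance between two histograms, equation (1)."""
    return 0.5 * np.sum((x_i - x_j) ** 2 / (x_i + x_j + eps))

# Placeholder L2-normalized time series histograms and their labels.
rng = np.random.default_rng(0)
X_train = rng.random((100, 10))
y_train = rng.integers(0, 3, size=100)

# An odd K is chosen, following the majority-vote recommendation [11].
knn = KNeighborsClassifier(n_neighbors=5, metric=chi_square_distance)
knn.fit(X_train, y_train)
predicted = knn.predict(rng.random((4, 10)))
```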
2.4.2 Support vector machine

The algorithm belongs to the group of linear classifiers. It forms an optimal hyperplane that maximizes the distance between parallel planes placed on the outer boundaries of the training samples' feature vectors, according to their labels [12, 15]. The support vector machine (SVM) is a binary classifier, but it can also be used for multiclass classification: multiple binary classifiers are formed, with the i-th label treated as positive and the rest as negative. This approach is called One-Vs-All [14]. An alternative approach, called One-Vs-One, forms a binary classifier for every pair of labels. One-Vs-One is useful for dealing with unbalanced datasets, but this comes at the cost of higher time complexity during training and classification [8].
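A minimal sketch of the two multiclass strategies with scikit-learn (illustrative only; the linear kernel, the regularization parameter C and the placeholder data are assumptions rather than settings reported in the paper):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Placeholder L2-normalized histogram features and labels.
rng = np.random.default_rng(0)
X_train = rng.random((120, 10))
y_train = rng.integers(0, 4, size=120)

# One-Vs-All: one binary SVM per label (that label vs. the rest).
ova = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X_train, y_train)

# One-Vs-One: one binary SVM per pair of labels; more classifiers are
# trained, hence the higher time complexity noted above.
ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0)).fit(X_train, y_train)

print(ova.predict(X_train[:3]), ovo.predict(X_train[:3]))
```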
2.4.3 Random forest

Training starts by forming a set of decision trees with induced randomness. A sample is classified by the majority vote of the decision trees [7]. The number of trees, their depth, the minimal number of training samples required to split a node and the minimal number of samples required at a leaf node must be tuned to optimize classification accuracy. Random forest is a fast-learning, general-purpose classification algorithm which is resistant to overfitting even with a large number of features [4]. Achieving high accuracy requires a large set of decision trees, which in turn means a slower classification process.
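The hyperparameters listed above map directly onto scikit-learn's RandomForestClassifier; the values below are placeholders showing that mapping, not the settings used by the authors:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder L2-normalized histogram features and labels.
rng = np.random.default_rng(0)
X_train = rng.random((120, 10))
y_train = rng.integers(0, 4, size=120)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=None,        # tree depth (None lets trees grow fully)
    min_samples_split=2,   # minimal training samples needed to split a node
    min_samples_leaf=1,    # minimal samples required at a leaf node
    random_state=0,
)
forest.fit(X_train, y_train)
labels = forest.predict(X_train[:5])  # class decided by majority vote of the trees
```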

3. RESULTS

The presented feature extraction approach was implemented in Python using the scientific computing library NumPy. The machine learning library scikit-learn was used for clustering and classification. For the discrete wavelet transform, open-source soft-
