Figure 4: Visualization of ten clusters obtained by clustering the first three principal components of the training segments' feature vectors

The recommended dictionary size ranges from 1000 to 3500 words. Classification with fewer words does not reach the highest possible accuracy, while accuracy starts to stabilize once the dictionary size exceeds 1000 words [17].

2.3 Time series features

Defining time series features starts with segmentation and feature extraction, as described in Chapter 2.1. Segment feature vectors are compared with dictionary words using the 1-nearest-neighbor algorithm with Euclidean distance as the distance measure. A histogram of dictionary word occurrences is then created and normalized with the L2-norm. An example of a time series histogram is shown in Figure 5.

Figure 5: Normalized histogram based on a dictionary of ten words

Training time series, described with these feature vectors, are used to train a classification model. Time series from the test dataset are used for testing the classification model.
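The histogram construction described above can be sketched in NumPy as follows; the array names segment_features and dictionary and the helper function itself are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def time_series_histogram(segment_features, dictionary):
    """Bag-of-words feature vector for one time series.

    segment_features : (n_segments, n_features) array of segment feature vectors
    dictionary       : (n_words, n_features) array of dictionary words (cluster centers)
    Returns an L2-normalized histogram of dictionary word occurrences.
    """
    # 1-nearest-neighbor assignment with Euclidean distance:
    # distance of every segment to every dictionary word
    dists = np.linalg.norm(
        segment_features[:, None, :] - dictionary[None, :, :], axis=2)
    nearest_word = dists.argmin(axis=1)

    # histogram of dictionary word occurrences
    hist = np.bincount(nearest_word, minlength=len(dictionary)).astype(float)

    # L2 normalization (empty histograms are returned unchanged)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```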
2.4 Classification algorithms

Classification models were trained using the classification algorithms described below.

2.4.1 K-Nearest neighbors

The K-nearest neighbors (KNN) algorithm classifies a sample based on the distances between the training samples and the unclassified sample. Euclidean distance is usually used as the distance measure, and the label is determined from the K nearest training samples. An odd value should be selected for the parameter K, because the classification label is decided by majority vote [11]. The choice of K has a large effect on the final classification accuracy, and the best way to select an appropriate K is by trial and error. The time complexity of the algorithm is high, because most of the computational work is done during classification rather than in the training phase [10].

The Chi-square distance measure [18] was used, because it can measure differences between histograms and is frequently used in the classification of textures, objects and shapes. The Chi-square distance measure takes the distribution of values and their frequencies into account and has the important property of giving more weight to rarely occurring histogram values [1]. The Chi-square distance measure \chi^2 is defined as

\chi^2(x_i, x_j) = \frac{1}{2} \sum_{k=1}^{d} \frac{(x_{ik} - x_{jk})^2}{x_{ik} + x_{jk}}    (1)

where:
x = feature vector
i, j = feature vector indices
d = length of the feature vectors
k = feature index
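A minimal sketch of how Eq. (1) could be combined with a KNN classifier in scikit-learn is shown below; the function name, the value K = 5, the zero-denominator handling, and the training array names are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def chi_square_distance(x_i, x_j):
    """Chi-square distance from Eq. (1); inputs are non-negative histograms."""
    denom = x_i + x_j
    # skip bins that are empty in both histograms to avoid division by zero
    mask = denom > 0
    return 0.5 * np.sum((x_i[mask] - x_j[mask]) ** 2 / denom[mask])

# K is chosen odd and tuned by trial and error; 5 is only an example value
knn = KNeighborsClassifier(n_neighbors=5, metric=chi_square_distance,
                           algorithm="brute")
# X_train: rows are normalized time series histograms, y_train: class labels
# knn.fit(X_train, y_train)
# y_pred = knn.predict(X_test)
```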
2.4.2 Support vector machine

The algorithm belongs to the group of linear classifiers. It forms an optimal hyperplane that maximizes the distance between parallel planes placed on the outer boundaries of the training samples' feature vectors, according to their labels [12, 15]. The support vector machine (SVM) is a binary classifier, but it can also be used for multiclass classification. Multiple binary classifiers are formed, with the i-th label treated as positive and the rest as negative; this approach is called One-vs-All [14]. An alternative approach, called One-vs-One, forms a binary classifier for every pair of labels. The One-vs-One approach is useful for dealing with unbalanced datasets, but at the cost of higher time complexity during training and classification [8].
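The two multiclass strategies can be expressed with scikit-learn's meta-estimators, as in the following sketch; the choice of a linear-kernel SVC and the array names are assumptions for illustration.

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# One-vs-All: one binary SVM per label, the i-th label against all the rest
ova_svm = OneVsRestClassifier(SVC(kernel="linear"))

# One-vs-One: one binary SVM for every pair of labels
ovo_svm = OneVsOneClassifier(SVC(kernel="linear"))

# X_train holds the time series histograms, y_train the class labels
# ova_svm.fit(X_train, y_train)
# y_pred = ova_svm.predict(X_test)
```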
2.4.3 Random forest

Training starts by forming a set of decision trees with induced randomness. A sample is classified based on the majority vote of the decision trees [7]. The number of trees, their depth, the minimal number of training samples required to split a node and the minimal number of samples required at a leaf node must be tuned to optimize classification accuracy. Random forest is a fast-learning, general-purpose classification algorithm that is resistant to overfitting even with a large number of features [4]. Achieving high accuracy requires a large set of decision trees, which in turn slows down the classification process.
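The hyperparameters listed above map directly onto scikit-learn's RandomForestClassifier; the values in the sketch below are illustrative defaults rather than the settings used in this work.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees (illustrative value)
    max_depth=None,        # maximum tree depth (None = grow until pure)
    min_samples_split=2,   # minimal number of training samples to split a node
    min_samples_leaf=1,    # minimal number of samples required at a leaf node
    random_state=0)

# forest.fit(X_train, y_train)
# y_pred = forest.predict(X_test)
```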
3. RESULTS

The presented feature extraction approach was implemented in Python using the scientific computing library NumPy. The machine learning library scikit-learn was used for clustering and classification. For the discrete wavelet transform, open-source soft-