Fister jr., Iztok, and Andrej Brodnik (eds.). StuCoSReC. Proceedings of the 2018 5th Student Computer Science Research Conference. Koper: University of Primorska Press, 2018
Time series classification with Bag-Of-Words approach

Domen Kavran

Faculty of Electrical Engineering and Computer Science
Koroska cesta 46
Maribor, Slovenia

domen.kavran@um.si

ABSTRACT

The amount of data generated every day increases each year, and the pace is accelerating with the development of the Internet of Things (IoT). Gathered data alone does not contain much information, but machine learning can extract additional information and hidden patterns that contribute to time series analysis.

General-purpose and problem-specific time series feature extraction methods have been developed over the years. This paper presents a new feature extraction approach derived from Bag-Of-Words. The main part of the approach is obtaining a dictionary of time series segments, the so-called words. K-Means clustering is used to form a dictionary containing K words, which is then used to define the feature vector of an individual time series as a histogram of word occurrences inside it. The described approach can be used for feature extraction from time series without prior knowledge of the nature of the data. Moreover, the approach is robust and produces good classification results. The highest accuracy achieved on the datasets presented in Results was 99.96%.

Keywords

time series, classification, machine learning, bag-of-words

1. INTRODUCTION

Technological progress in industries such as automotive, healthcare and electronics has increased the amount of data produced over the past years. Large, complex datasets, often referred to as Big Data, must be analysed quickly and efficiently to reveal additional information and hidden patterns. Data analysis aids companies with their product development, research and customer services [5]. Different technologies and programming libraries have been developed to help engineers create pipelines for manipulating data and extracting features. Unsupervised machine learning algorithms are often used for analysing data and finding correlations between variables. Various advanced supervised techniques have been developed to solve regression and classification tasks. Time series data play an important role in today's largest industries, and this paper presents a general-purpose time series feature extraction approach.

Broader interest in time series classification began at the start of the century [9], which led to the development of many different feature extraction methods [16]. For machine learning algorithms to be successful at classifying time series, appropriate features must be selected. Features can be obtained through statistical analysis or by calculating features specifically selected for a given dataset, although this requires expert knowledge. The presented alternative approach, derived from Bag-Of-Words [17], needs no prior knowledge about the provided data. Its main part is the definition of dictionary words: segments that are extracted from the training time series data and later clustered. Segments of an individual time series are then compared with the dictionary words, and a histogram of word occurrences is formed. The histogram is the feature vector representing the individual time series.

The Bag-Of-Words approach was originally intended for document and image classification. The same concepts can be applied to time series with good results. The presented approach uses elements of the well-known image patch extraction algorithm to obtain overlapping windows, or segments, containing parts of the time series data. To speed up the training phase and the classification process, the discrete wavelet transform was used. The transform is known for its role in data compression and image processing [3], and it provided a suitable way of reducing the dimensionality of each segment's feature vector. Mini Batch K-Means clustering was used to speed up the dictionary creation process.

The second section of the paper describes the feature extraction procedure. Classification was done with the 1-nearest neighbor algorithm using the Chi-squared distance measure; its results are compared with those of support vector machines and random decision forests in Results.

2. TIME SERIES CLASSIFICATION

The classification pipeline is shown in Figure 1. A time series must be described with features that appropriately represent the events that occurred in it and that are independent of its length. Beforehand, the input time series dataset has to be split into training and test datasets. Words inside a time series are the feature vectors of its segments, whose features are the approximation coefficients of the discrete wavelet transform. Dictionary words are the result of clustering the segment feature vectors extracted from the training time series set. The feature vector of each time series is the histogram of dictionary word occurrences.

The original approach is called Bag-Of-Words when classifying documents and Bag-Of-Features when classifying images [6]. The advantages of the approach are its robustness, simplicity and good results on real-world data.
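
A minimal, self-contained Python sketch may help make the pipeline steps concrete: overlapping segments, wavelet approximation coefficients, a clustered dictionary, a word histogram, and 1-nearest neighbor classification with the Chi-squared distance. It is not the author's implementation. The Haar wavelet, the single decomposition level, the window width and step, the dictionary size K, the histogram normalisation, the particular form of the Chi-squared distance and the toy sine and sawtooth data are all assumptions chosen for illustration, and plain Lloyd's k-means stands in for Mini Batch K-Means so that only the standard library is needed.

```python
import math
import random

def segments(series, width, step):
    """Overlapping windows of `width` samples, one every `step` samples."""
    return [series[i:i + width] for i in range(0, len(series) - width + 1, step)]

def haar_approx(segment, levels=1):
    """Approximation coefficients of a Haar DWT (assumed wavelet).
    Each level halves the vector length."""
    coeffs = list(segment)
    for _ in range(levels):
        coeffs = [(coeffs[i] + coeffs[i + 1]) / math.sqrt(2)
                  for i in range(0, len(coeffs) - 1, 2)]
    return coeffs

def words_of(series, width, step):
    """Segment feature vectors ("words") of one time series."""
    return [haar_approx(s) for s in segments(series, width, step)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, dictionary):
    """Index of the dictionary word closest to `vec`."""
    return min(range(len(dictionary)), key=lambda j: sq_dist(vec, dictionary[j]))

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means (a stand-in for Mini Batch K-Means)."""
    rng = random.Random(seed)
    centres = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, centres)].append(v)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centre
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return centres

def histogram(series, dictionary, width, step):
    """Normalised histogram of dictionary word occurrences (normalisation is
    an assumption, used here to make the vector independent of series length)."""
    counts = [0] * len(dictionary)
    for vec in words_of(series, width, step):
        counts[nearest(vec, dictionary)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def chi2(h1, h2):
    """One common form of the Chi-squared distance between histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def classify_1nn(hist, train_hists, train_labels):
    best = min(range(len(train_hists)), key=lambda i: chi2(hist, train_hists[i]))
    return train_labels[best]

def make(kind, n):
    """Hypothetical toy data: a sine wave or a sawtooth of length n."""
    if kind == "sine":
        return [math.sin(2 * math.pi * t / 16) for t in range(n)]
    return [(t % 16) / 8 - 1 for t in range(n)]

WIDTH, STEP, K = 8, 2, 6  # illustrative values, not taken from the paper
train = [("sine", 64), ("saw", 64), ("sine", 96), ("saw", 96)]
train_series = [make(kind, n) for kind, n in train]
train_labels = [kind for kind, _ in train]

# Dictionary: cluster the segment feature vectors of the training set only.
all_vectors = [v for s in train_series for v in words_of(s, WIDTH, STEP)]
dictionary = kmeans(all_vectors, K)

train_hists = [histogram(s, dictionary, WIDTH, STEP) for s in train_series]
test_hist = histogram(make("sine", 80), dictionary, WIDTH, STEP)
prediction = classify_1nn(test_hist, train_hists, train_labels)
print(prediction)
```

The training and test series deliberately have different lengths, yet every histogram has exactly K entries, which is what lets a fixed-length classifier operate on variable-length time series.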

StuCoSReC Proceedings of the 2018 5th Student Computer Science Research Conference, Ljubljana, Slovenia, 9 October. DOI: https://doi.org/10.26493/978-961-7055-26-9.55-59