Page 19 - Fister jr., Iztok, and Andrej Brodnik (eds.). StuCoSReC. Proceedings of the 2017 4th Student Computer Science Research Conference. Koper: University of Primorska Press, 2017
a mining big data inpatient database using Cuckoo
search
Uroš Mlakar
Faculty of Electrical Engineering and Computer Science, University of Maribor
Smetanova 17, 2000 Maribor
uros.mlakar@um.si

Iztok Fister Jr.
Faculty of Electrical Engineering and Computer Science, University of Maribor
Smetanova 17, 2000 Maribor
iztok.fister1@um.si

Monika Marković
Faculty of Medicine, University of Maribor
Taborska 6b, 2000 Maribor
monika.markovic@student.um.si

Iztok Fister
Faculty of Electrical Engineering and Computer Science, University of Maribor
Smetanova 17, 2000 Maribor
iztok.fister@um.si
ABSTRACT
This paper investigates data mining in a medical dataset using the stochastic, population-based, nature-inspired Cuckoo search algorithm. In particular, association rules are mined by applying an objective function composed of support and confidence, weighted by two parameters that control the importance of each measure. The rules are mined in the Nationwide Inpatient Sample dataset, which is a collection of discharge records from several hospitals in the USA. Only records in which a patient was diagnosed with Type II diabetes mellitus were extracted for association rule mining. The results show that the rules found are simple, easy to understand, and also interesting, as they were verified against actual clinical studies. The results obtained can be beneficial to both doctors and insurance companies.

Keywords
data mining, big data, association rule mining, cuckoo search

1. INTRODUCTION
With the increasing rate at which data is collected every day, there is a need for automatic mining of the useful information hidden within it. This may be a difficult task, since the data is big in volume, has variety (different data sources or multiple data types), or is collected at a very fast pace (velocity). The discharge records of hospital patients are one example of such data. A lot of information is hidden within this data, such as interesting connections between apparently unrelated diseases, or interesting risk factors that contribute to a particular disease without being directly related to it. Such knowledge would be beneficial to hospitals and also to insurance companies, which can make evidence-based decisions and can optimize, validate, and refine the rules that govern their business [6]. This important hidden knowledge can be found with the help of data mining, using methods such as clustering, feature selection, association rule mining, and many more.

This paper is structured as follows. After the introduction, data mining methods are briefly discussed in Section 2, then the Cuckoo search algorithm and the Nationwide Inpatient Sample (NIS) dataset are presented in Sections 3 and 4. The preliminary results are presented in the form of association rules in Section 5, and the paper concludes with future directions in Section 6.

2. DATA MINING METHODS
Data mining is a computing process of discovering patterns in large datasets. The goal of data mining is to extract useful information from a dataset and transform it into an understandable structure, which may be used directly or processed further by another algorithm. Several methods are used for data mining, such as cluster analysis [4], dimensionality reduction [8], and association rule mining [9]. Within the research community, association rule mining has gained a lot of attention for mining interesting patterns from large databases.

2.1 Association rule mining
Association rule mining (ARM) is a rule-based machine learning method for discovering interesting relations between attributes in large databases. ARM is used for identifying strong rules using measures of interestingness, the most established method being the Apriori algorithm introduced by Agrawal et al. [1]. ARM can be expressed mathematically as follows. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of attributes called items and $T = \{t_1, t_2, \ldots, t_m\}$ a set of transactions (i.e. the database). Each rule is defined as an implication $X \rightarrow Y$, where $X, Y \subseteq I$. X and Y are two different sets of items, also known as item-sets; X
StuCoSReC Proceedings of the 2017 4th Student Computer Science Research Conference DOI: https://doi.org/10.26493/978-961-7023-40-4.19-22 19
Ljubljana, Slovenia, 11 October
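The rule notation from Section 2.1 can be made concrete with a small Python sketch. It computes the standard support and confidence of a rule X → Y over a list of transactions, then combines them in a weighted sum. The combined form, the weight names `alpha` and `beta`, the function names, and the toy transactions are all assumptions made for illustration; this page does not state the exact objective function, and the data below is invented, not taken from the NIS dataset.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]


def support(itemset: Itemset, transactions: List[Itemset]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x: Itemset, y: Itemset, transactions: List[Itemset]) -> float:
    """Standard confidence of X -> Y: supp(X u Y) / supp(X)."""
    supp_x = support(x, transactions)
    if supp_x == 0.0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(x | y, transactions) / supp_x


def weighted_fitness(x: Itemset, y: Itemset, transactions: List[Itemset],
                     alpha: float = 1.0, beta: float = 1.0) -> float:
    """ASSUMED objective: normalized weighted sum of support and confidence.

    alpha and beta are hypothetical names for the two weighting parameters.
    """
    supp = support(x | y, transactions)
    conf = confidence(x, y, transactions)
    return (alpha * supp + beta * conf) / (alpha + beta)


# Invented toy transactions (each one is the item-set of a single record).
transactions = [
    frozenset({"diabetes", "hypertension"}),
    frozenset({"diabetes", "hypertension", "obesity"}),
    frozenset({"diabetes"}),
    frozenset({"hypertension"}),
]
X = frozenset({"diabetes"})
Y = frozenset({"hypertension"})

print("support   :", support(X | Y, transactions))
print("confidence:", confidence(X, Y, transactions))
print("fitness   :", weighted_fitness(X, Y, transactions))
```

Representing each transaction as a `frozenset` keeps the containment test (`itemset <= t`) close to the set notation X, Y ⊆ I used in the text. A search algorithm such as Cuckoo search would evaluate a function like `weighted_fitness` for each candidate rule.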