Page 20 - Fister jr., Iztok, and Andrej Brodnik (eds.). StuCoSReC. Proceedings of the 2017 4th Student Computer Science Research Conference. Koper: University of Primorska Press, 2017
also defined as the antecedent, while Y is called the consequent. To be able to select interesting rules from the set of all possible rules, various measures of interestingness are utilised and also constrained. The most used constraints are the minimum support and confidence, which are defined as follows:

$$ \mathrm{Support}(X \to Y) = \frac{|\{t \in T;\ X \subseteq t \wedge Y \subseteq t\}|}{|T|} \tag{1} $$

$$ \mathrm{Confidence}(X \to Y) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)} \tag{2} $$

Table 1: Data elements considered from the NIS dataset.

Element name    Element description
Age             Age of patient at admission (years)
Atype           Admission type
Died            Died during hospitalization
Female          Indicator of sex
Los             Length of stay
DX1             Principal diagnosis
DX{2-15}        Diagnoses 2-15
PR1             Principal procedure
PR{2-15}        Procedures 2-15
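As an illustration only, Equations (1) and (2) can be sketched in Python as follows. The transactions are made-up item sets, not NIS records, and the function names `support` and `confidence` are our own choices rather than part of any library or of the original implementation.

```python
from typing import AbstractSet, Sequence


def support(transactions: Sequence[AbstractSet[str]],
            antecedent: AbstractSet[str],
            consequent: AbstractSet[str]) -> float:
    """Eq. (1): share of transactions that contain both X and Y."""
    if not transactions:
        return 0.0
    both = set(antecedent) | set(consequent)
    hits = sum(1 for t in transactions if both <= t)
    return hits / len(transactions)


def confidence(transactions: Sequence[AbstractSet[str]],
               antecedent: AbstractSet[str],
               consequent: AbstractSet[str]) -> float:
    """Eq. (2): Support(X and Y together) divided by Support(X)."""
    support_x = support(transactions, antecedent, frozenset())
    if support_x == 0.0:
        return 0.0  # X never occurs, so the rule has no evidence
    return support(transactions, antecedent, consequent) / support_x


# Hypothetical toy transactions (illustrative items only).
transactions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "C"}),
]

rule_support = support(transactions, {"A"}, {"B"})
rule_confidence = confidence(transactions, {"A"}, {"B"})
print(rule_support, rule_confidence)
```

For the rule A → B in this toy example, two of the four transactions contain both items and three contain A, so the support is 0.5 and the confidence is 2/3.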

Support is defined as the ratio of the number of transactions containing both X and Y to the total number of transactions, while confidence is the proportion of the transactions containing X that also contain Y.

Although it is able to find interesting rules on smaller datasets, it faces computational problems when confronted with bigger datasets. To overcome this problem, research has moved towards stochastic population-based nature-inspired algorithms that treat ARM as an optimization problem.

3. CUCKOO SEARCH

Cuckoo search (CS) is a stochastic population-based nature-inspired optimization algorithm proposed by Yang and Deb in 2009 [11]. It is classified as a Swarm Intelligence (SI) algorithm, since its mechanisms are inspired by the natural behaviour of some cuckoo species. To capture the behaviour of cuckoos and adapt it for use as a computer optimization algorithm, the authors idealised three rules:

• A cuckoo lays only one egg, then dumps it into a randomly chosen nest,

• Nests that contain high-quality eggs are carried over to the next generation,

• Any cuckoo egg may be discovered by the host bird with probability $p_a \in [0, 1]$. If an egg is discovered, the host bird may abandon the nest and build a new one at a new location.

Each solution in the population of the CS algorithm corresponds to a cuckoo nest, which represents the position of the egg within the search space, and can be mathematically expressed as follows:

$$ x_i^{(g)} = \{x_{i,j}^{(g)}\}, \quad \text{for } i = 1, \ldots, N_P \text{ and } j = 1, \ldots, D, \tag{3} $$

where $N_P$ is the population size, and D the dimension of the optimization problem. In the CS algorithm, new solutions are created by exploitation of the current solutions as:

$$ x_i^{(g+1)} = x_i^{(g)} + \alpha L(s, \lambda), \tag{4} $$

where

$$ L(s, \lambda) = \frac{\lambda \Gamma(\lambda) \sin(\pi\lambda/2)}{\pi} \cdot \frac{1}{s^{1+\lambda}}. \tag{5} $$

The term L(s, λ) determines the characteristic scale and α > 0 is the scaling factor of the step size s.

3.1 Association rule mining using CS algorithm

Since CS is used for ARM in this paper, the solution representation has to be adapted accordingly. There are two well-established encodings available for representing rules in evolutionary algorithms (EA) and SI-based algorithms. The first is the Michigan encoding [5], where each solution represents a separate association rule. In the second, the Pittsburgh encoding [5], each solution represents a set of association rules. For the purpose of this study, the Michigan encoding was used. Additionally, a fitness evaluation function needs to be defined in order to find the most promising rules:

$$ f(x_i^{(g)}) = \frac{\beta\,\mathrm{Support}(X \to Y) + \gamma\,\mathrm{Confidence}(X \to Y)}{\beta + \gamma}. \tag{6} $$

The fitness function f(x) is defined as a weighted sum of support and confidence, normalised by β + γ. The weights β and γ control the importance of the two measures. The user can set the values of these weights according to the importance of each measure in the domain of association rule mining. For the purpose of this study, both weights were set to β = γ = 1.

4. NATIONWIDE INPATIENT SAMPLE DATASET

The Nationwide Inpatient Sample (NIS) dataset holds records of hospital inpatient discharges dating back to 1988, and is used for identifying, tracking and analysing trends in health care access, quality, and outcomes. It is a publicly available dataset without any patient identifiers. It is worth noting that it consists only of US hospital discharges. It holds about 64 million records, with 126 clinical and non-clinical data elements. Only the elements listed in Table 1 were used in this study.

DX1 and DX{2-15} are the principal diagnosis and other diagnoses, respectively. The diagnoses are represented as codes following the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM). Since the NIS dataset holds a large number of records, it is hard to find association rules by considering the whole dataset. The goal of this research is to uncover other risk factors for patients who suffer from one particular disease. For this reason, we chose the disease with code '250.30', Type II diabetes mellitus (TIIDM), a heterogeneous group of disorders characterized by a variable degree of insulin resistance, impaired insulin secretion, and increased glucose production. There are many causes of which doctors and patients should be aware, and maybe even more of those

StuCoSReC Proceedings of the 2017 4th Student Computer Science Research Conference
Ljubljana, Slovenia, 11 October