Page 49 - Fister jr., Iztok, and Andrej Brodnik (eds.). StuCoSReC. Proceedings of the 2017 4th Student Computer Science Research Conference. Koper: University of Primorska Press, 2017
P. 49
tiset representation of objects in information retrieval
systems

Mikita Akulich1 Matjaž Krnc1,2 Iztok Savnik1
89172019@student.upr.si matjaz.krnc@sbg.ac.at iztok.savnik@famnit.upr.si

Riste Škrekovski1,3,4
riste.skrekovski@famnit.upr.si

1 University of Primorska, FAMNIT, Koper, Slovenia
2 University of Salzburg, Department of Efficient Algorithms, Salzburg, Austria

3 Faculty of Information Studies, Novo Mesto, Slovenia
4 University of Ljubljana, FMF, Ljubljana, Slovenia

ABSTRACT In IR most systems use the concept of an inverted index
to achieve full-text indexing of a database. The so-called
In this paper we present multiset-trie – a novel data struc- dictionary of the inverted index can be organized in differ-
ture which operates on objects represented as multisets. The ent ways in order to meet the required types of queries and
multiset-trie is a search-tree-based data structure with prop- specification of data. In our research we will be focused on
erties similar to those of a trie. In particular, we efficiently search trees [2, 4, 5, 7].
implement the standard search tree operations together with
the special set containment operations, i.e. subset and su- The proposed data structure multiset-trie can be used as
perset queries in the context of multisets. These are called an alternative implementation of the dictionary in an in-
submultiset and supermultiset, respectively, and are used for verted index. It is an extension of the set-trie data struc-
implementation of various queries that can be performed ture proposed by Savnik [6]. Set-trie is used for storing
on multisets in a multiset-trie. The corresponding running and fast retrieval of objects represented as sets and provides
times of the developed functions are mathematically and ex- the nearest-neighbor search by implementing methods that
perimentally analyzed. One of the most important queries perform set-containment queries. Multiset-trie extends the
is the search of the nearest neighbor given an input object. abilities of set-trie and provides support for storing and re-
The nearest neighbor search of a multiset-trie makes it a trieving objects that can be represented as both sets and
good alternative for the index data structures that are used multisets.
in information retrieval systems. In particular, our research
is focused on the application of the multiset-trie to full-text 2. MULTISET-TRIE DATA STRUCTURE
search systems.
2.1 Description
Keywords
Let Σ be a set of distinct symbols that define an alphabet
multiset, trie, multiset-trie, information retrieval, inverted and let σ be the cardinality of Σ. Define φ to be a bijective
index, full-text search mapping φ : Σ → I, where I = {1, 2, 3, . . . , σ}. In this way,
we obtain an indexing of elements from the alphabet Σ, so
1. INTRODUCTION we can work directly with integers rather then with specific
symbols from Σ.
Information retrieval (IR) systems are most commonly used
for searching a text-based content such as text documents The multiset-trie M is an n-ary tree based data structure
in a database. Such IR systems are called full-text search with the properties of trie. A node in multiset-trie always
systems. The application of the full-text search techniques has out-degree n. Some of the child nodes may be Null, but
on the database almost always requires an index for a good at most n − 1 of them. All the children of a node, including
performance of the queries. Indexes narrow down the search the Null children, are uniquely labeled from left to right with
using pre-generated meta data constructed from the data in labels cj, where j ∈ {0, 1, . . . , n − 1}.
the database.
The height of a multiset-trie is always σ + 1 if the structure
is not empty. Levels in multiset-trie are enumerated by their
height, i.e. a level Li has height i, where L1 is the root level.
The levels Li and symbols from Σ are related as follows. For
i ∈ {1, 2, . . . , σ}, a level Li represents a symbol s ∈ Σ, such
that φ−1(i) = s. The last level Lσ+1 does not represent any
symbol and is named leaf level (LL for short).

Consider two nodes u, v in a multiset-trie at levels Li, Li+1

StuCoSReC Proceedings of the 2017 4th Student Computer Science Research Conference DOI: https://doi.org/10.26493/978-961-7023-40-4.49-52 49
Ljubljana, Slovenia, 11 October
   44   45   46   47   48   49   50   51   52   53   54