Page 193 - Weiss, Jernej, ur./ed. 2025. Glasbena interpretacija: med umetniškim in znanstvenim┊Music Interpretation: Between the Artistic and the Scientific. Koper/Ljubljana: Založba Univerze na Primorskem in Festival Ljubljana. Studia musicologica Labacensia, 8


exploring musicological discourses ...
A custom algorithm was used to convert the PDF documents into plain
text, but the conversion ran into several common problems:

1) Encoding challenges: PDFs use various character encodings,
which resulted in incorrectly represented characters in the
extracted text.
2) Layout complexities: Academic articles often include columns,
footnotes, headers, and footers, which disrupt the linear flow
of the extracted text and lead to incorrect automatic text
recognition.
3) Musicology-specific features: textual and graphic elements
in the text that make automatic text processing difficult.
4) Multilingual and parallel texts within individual issues.

These issues call for further manual cleanup of the articles and the
compilation of several individual corpora in different languages, which we
intend to do in further research so as to provide a publicly available
resource for further quantitative analyses.
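
The cleanup problems listed above can be illustrated with a minimal sketch. It is not the custom algorithm used in this study, whose details are not given here; the function name clean_extracted_text and the three heuristics (Unicode normalization, dropping digit-only lines, rejoining words hyphenated across line breaks) are assumptions chosen for illustration.

```python
import re
import unicodedata


def clean_extracted_text(raw: str) -> str:
    """Apply a few generic cleanup heuristics to PDF-extracted text.

    Hypothetical helper for illustration only; not the authors' algorithm.
    """
    # Map compatibility characters (e.g. the single-glyph "fi" ligature)
    # to their plain equivalents.
    text = unicodedata.normalize("NFKC", raw)

    # Drop lines that hold only digits: page numbers and footnote
    # markers that the extraction placed on a line of their own.
    lines = [ln for ln in text.splitlines()
             if not re.fullmatch(r"\s*\d+\s*", ln)]
    text = "\n".join(lines)

    # Rejoin words split by end-of-line hyphenation:
    # "comprehen-\nsive" becomes "comprehensive".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Turn the remaining single line breaks into spaces, keeping
    # blank lines as paragraph boundaries.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text.strip()


sample = "a comprehen-\n26\nsive corpus tool"
cleaned = clean_extracted_text(sample)
```

The hyphen rule is deliberately naive: it would also fuse a genuine compound broken at its hyphen (for example "PDF-to-" followed by "text" on the next line), which is one reason manual cleanup remains necessary.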

Corpus Construction and Analysis
After extracting the articles into plain text, we uploaded the raw textual
data to the Sketch Engine platform.26 Sketch Engine is a comprehensive
corpus management and analysis tool widely used in linguistic research.27
It allows researchers to create custom corpora and provides a suite of
analytical functions, including frequency lists, concordances,
collocations, and keyword analyses.
To make in-depth analysis of the texts possible, the uploaded texts first
had to be compiled, which essentially means assigning a value to every
word in the corpus so that statistical research becomes possible. This
process has two parts, lemmatization and part-of-speech tagging: each word
form is assigned its base form (lemma) and its word class. The corpus was
lemmatized and tagged in this way to allow more precise searches and
analyses.
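
What lemmatization and part-of-speech tagging add to a corpus can be shown with a toy sketch. It does not reproduce Sketch Engine's processing: the Token type, the annotate function, and the hand-written TOY_LEXICON are hypothetical stand-ins for a real tagger, used only to show how annotated tokens enable lemma-based counts and word-class filters.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # form as it appears in the text
    lemma: str  # base (dictionary) form
    pos: str    # part-of-speech label


# Hypothetical mini-lexicon: lowercase word form -> (lemma, POS).
# A real tagger derives this information from a trained model.
TOY_LEXICON = {
    "the": ("the", "DET"),
    "pianist": ("pianist", "NOUN"),
    "sonata": ("sonata", "NOUN"),
    "interprets": ("interpret", "VERB"),
    "interpreted": ("interpret", "VERB"),
}


def annotate(text: str) -> list[Token]:
    """Split on whitespace and attach a lemma and POS tag to each word."""
    tokens = []
    for raw in text.split():
        word = raw.strip(".,;:!?\"'()")
        if not word:
            continue
        # Unknown words keep their lowercase form and get the tag "X".
        lemma, pos = TOY_LEXICON.get(word.lower(), (word.lower(), "X"))
        tokens.append(Token(word, lemma, pos))
    return tokens


tokens = annotate(
    "The pianist interpreted the sonata; another interprets it differently."
)
# Counting lemmas groups "interpreted" and "interprets" together.
lemma_counts = Counter(t.lemma for t in tokens)
# The POS tag allows a search restricted to verbs.
verbs = [t.word for t in tokens if t.pos == "VERB"]
```

Counting by lemma rather than by word form is what lets a frequency list treat inflected variants as one item, and the tag makes queries such as "all verbs" possible.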
The following analyses were conducted:

1) Frequency analysis: Generated word and lemma frequency lists to
identify the most commonly used terms across the corpus. This
26 Sketch Engine, https://www.sketchengine.eu/.
27 Kilgarriff, et al., “The Sketch Engine: Ten Years on.”
