Page 193 - Weiss, Jernej, ur./ed. 2025. Glasbena interpretacija: med umetniškim in znanstvenim┊Music Interpretation: Between the Artistic and the Scientific. Koper/Ljubljana: Založba Univerze na Primorskem in Festival Ljubljana. Studia musicologica Labacensia, 8

exploring musicological discourses ...
A custom algorithm was used to convert the PDF documents into plain text. Several common issues were encountered during PDF-to-text conversion, including:

1) Encoding challenges: PDFs use various character encodings, which resulted in incorrect character representation in the extracted text.
2) Layout complexities: Academic articles often include columns, footnotes, headers, and footers, which disrupt the linear flow of text when extracted, leading to incorrect automatic text recognition.
3) Musicology-specific textual and graphic features in the text, which make automatic text processing difficult.
4) Multilingual and parallel texts in individual issues.

These issues call for further manual cleanup of the articles and for the compilation of several individual corpora in different languages. We intend to do this in further research so as to provide a publicly available resource for further quantitative analyses.
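
Some of the mechanical issues listed above can be reduced automatically before any manual cleanup. The following is a minimal sketch under stated assumptions, not the custom algorithm used in this study: the function name `clean_extracted_text` and the three rules it applies (Unicode normalization, dropping digit-only lines, rejoining line-break hyphenation) are illustrative choices.

```python
import re
import unicodedata


def clean_extracted_text(raw: str) -> str:
    """Apply a few generic cleanup rules to text extracted from a PDF.

    Hypothetical helper for illustration; the rules are assumptions,
    not the procedure used in the study.
    """
    # Normalize Unicode so that ligatures and compatibility forms
    # (e.g. the single character "ﬁ") become plain characters ("fi").
    text = unicodedata.normalize("NFKC", raw)

    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Drop lines consisting only of digits; in extracted PDFs these
        # are usually page numbers or stray footnote markers.
        if stripped.isdigit():
            continue
        kept.append(stripped)
    text = "\n".join(kept)

    # Rejoin words hyphenated across a line break: "col-\numns" -> "columns".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs left over from the page layout.
    text = re.sub(r"[ \t]+", " ", text)
    return text


sample = "Academic articles often include col-\numns and ﬁgures.\n   193\nNext   line."
cleaned = clean_extracted_text(sample)
print(cleaned)
```

Rules of this kind are deliberately crude. Rejoining at every line-final hyphen also merges genuine compounds that happen to break at the hyphen, and dropping digit-only lines discards footnote markers, which is one reason manual cleanup remains necessary.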

                 Corpus Construction and Analysis
After extracting the text from the articles into plain text, we uploaded the raw textual data to the Sketch Engine platform. Sketch Engine[26] is a comprehensive corpus management and analysis tool widely utilized in linguistic research.[27] It allows researchers to create custom corpora and provides a suite of analytical functions, including frequency lists, concordances, collocations, and keyword analyses.
To facilitate in-depth analysis, the uploaded texts first have to be compiled, which essentially means that every word in the corpus is assigned annotation values (its lemma and its part-of-speech tag); this is what makes statistical research possible. The process is two-fold, consisting of lemmatization and part-of-speech tagging. Accordingly, the corpus was lemmatized and tagged for parts of speech to enable more precise searches and analyses.
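
To illustrate why this annotation matters for counting, the sketch below attaches a lemma and a part-of-speech tag to each token and then compares word-form and lemma frequencies. It is a toy example only: `TOY_LEXICON`, `annotate`, and `frequency_lists` are hypothetical names, and the hand-made lexicon stands in for the statistical taggers that a platform such as Sketch Engine applies during compilation.

```python
from collections import Counter

# Hypothetical mini-lexicon: word form -> (lemma, part-of-speech tag).
# A real tagger derives this information from trained language models.
TOY_LEXICON = {
    "the": ("the", "DET"),
    "corpus": ("corpus", "NOUN"),
    "corpora": ("corpus", "NOUN"),
    "was": ("be", "VERB"),
    "were": ("be", "VERB"),
    "tagged": ("tag", "VERB"),
    "lemmatized": ("lemmatize", "VERB"),
}


def annotate(tokens):
    """Return (word, lemma, pos) triples for a list of tokens.

    Unknown words keep their lowercased form as lemma and get the tag "X".
    """
    annotated = []
    for token in tokens:
        lemma, pos = TOY_LEXICON.get(token.lower(), (token.lower(), "X"))
        annotated.append((token, lemma, pos))
    return annotated


def frequency_lists(annotated):
    """Build separate frequency lists for word forms and for lemmas."""
    word_freq = Counter(word.lower() for word, _, _ in annotated)
    lemma_freq = Counter(lemma for _, lemma, _ in annotated)
    return word_freq, lemma_freq


tokens = "The corpus was lemmatized the corpora were tagged".split()
annotated = annotate(tokens)
word_freq, lemma_freq = frequency_lists(annotated)

print(word_freq.most_common(3))
print(lemma_freq.most_common(3))
```

In the word-form list, "corpus" and "corpora" are counted separately, whereas the lemma list groups them under "corpus". This grouping is what allows a search or a frequency count to cover all inflected forms of a term at once.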
                 The following analyses were conducted:

            1)   Frequency analysis: Generated word and lemma frequency lists to
                 identify the most commonly used terms across the corpus. This
            26   Sketch Engine, https://www.sketchengine.eu/.
            27   Kilgarriff et al., “The Sketch Engine: Ten Years on.”

