# Swiss-AL Pipeline


<!-- Image with Usemap 
<img src="_static/swiss-al-pipeline.png" alt="Clickable Image" usemap="#image-map">
-->

```{raw} html

<figure style="text-align: center;">
    <figcaption>Swiss-AL Pipeline: Step by step guide</figcaption>
    <img id="pipeline-image" src="_static/pipelinex.png" alt="Swiss-AL Pipeline" usemap="#image-map">
</figure>
<!-- Image Map -->
<map name="image-map">
    <area id="area-section1" shape="rect" coords="0,0,0,0" href="#data-selection-and-collection" alt="Data Selection and Collection">
    <area id="area-section2" shape="rect" coords="0,0,0,0" href="#data-filtering" alt="Data Filtering">
    <area id="area-section3" shape="rect" coords="0,0,0,0" href="#linguistic-annotation" alt="Linguistic Annotation">
    <area id="area-section4" shape="rect" coords="0,0,0,0" href="#final-data-storage" alt="Final Data Storage">
    <area id="area-section5" shape="rect" coords="0,0,0,0" href="#workbench-tools" alt="Swiss-AL Workbench Tools">
</map>
```



<!-- Linked Section -->
<div id="data-selection-and-collection" style="margin-top: 50px;">
</div>

The Swiss-AL Pipeline is a step-by-step process for collecting, processing, and organizing text data. The pipeline starts by gathering texts from different sources such as media databases, websites, and digital archives. Then, the data is carefully checked, filtered, and labeled with useful linguistic information. Finally, it is stored in dedicated databases, making it easy to access and analyze. This section explains each stage of the pipeline in more detail.

## Data Selection and Collection 

The data for **Swiss-AL corpora** come from various sources:
- **Swiss Media Database**: retrieved through [Swissdox@LiRI](https://www.liri.uzh.ch/en/services/swissdox.html).
- **Crawled Data**: from a carefully selected set of websites run by Swiss actors of public communication, such as government websites, political parties, etc. We use [Selenium](https://www.selenium.dev/) for browser automation whenever it is needed (i.e., for websites that load content dynamically).
- **Data Dumps**: sources in other formats, such as database dumps from specific projects.

All data, including metadata, are converted (if necessary) into **JSON format** and stored in [Elasticsearch](https://www.elastic.co/elasticsearch). Elasticsearch is a distributed, open-source search and analytics engine designed for fast and scalable full-text search. Like all other components, it is hosted on ZHAW servers.
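A collected document might look like the sketch below before indexing. The field names here are purely illustrative, not the actual Swiss-AL schema:

```python
import json

# Illustrative sketch of a collected document as it might be stored in
# Elasticsearch; field names are hypothetical, not the Swiss-AL schema.
doc = {
    "source": "web_crawl",
    "url": "https://www.example.ch/news/article",
    "date": "2021-03-15",
    "language": "de",
    "text": "Der Bundesrat hat heute entschieden ...",
}

# Elasticsearch stores documents as JSON, so serialization is direct.
payload = json.dumps(doc, ensure_ascii=False)
print(payload)
```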



<div id="data-filtering" style="margin-top: 50px;">
</div>


## Data Filtering

### Filter
Documents in **Elasticsearch** are filtered based on the following criteria:

- **Word Count**: Documents with fewer than 100 words or more than 10'000 words are excluded from further processing. As a result, very short news snippets and long-form journalism may not be represented.
- **Language**: Document languages are identified using the [Compact Language Detector](https://github.com/CLD2Owners/cld2). This detector determines the primary language and annotates segments in other languages as foreign. Foreign language segments (e.g. quotes) are not removed from the text, but they are annotated as such (e.g. "FM" in the German tagset).

- **Date Filtering**: Documents are filtered by specific dates to align with the defined timespan of the corpora.
Texts for which no date can be detected are filtered out. For instance, in high-reach journalistic corpora, we consider texts published from 2010 onward.
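The word-count and date criteria above can be sketched as a simple predicate. The thresholds mirror the text, but the actual pipeline code may differ, and language filtering is omitted here:

```python
from datetime import date
from typing import Optional

# Thresholds taken from the text; the actual pipeline logic may differ.
MIN_WORDS, MAX_WORDS = 100, 10_000
EARLIEST = date(2010, 1, 1)  # example: high-reach journalistic corpora

def keep_document(text: str, pub_date: Optional[date]) -> bool:
    """Return True if the document passes word-count and date filters."""
    n_words = len(text.split())
    if not (MIN_WORDS <= n_words <= MAX_WORDS):
        return False
    if pub_date is None:  # texts without a detectable date are dropped
        return False
    return pub_date >= EARLIEST

print(keep_document("word " * 500, date(2015, 6, 1)))  # True
print(keep_document("too short", date(2015, 6, 1)))    # False
```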

### Deduplication
**Deduplication** removes duplicate or highly similar texts based on a predefined **similarity threshold**. 
Texts that exceed this threshold are filtered out to ensure that only unique content remains in the corpus. This process improves the accuracy of frequency analysis and helps avoid biases. As a threshold, we use a Jaccard overlap of 0.85 within the [SpotSigs](https://dl.acm.org/doi/10.1145/1390334.1390431) algorithm.

However, this means that identical or nearly identical texts (e.g., reports from media agencies republished in various outlets) will not reflect their full numerical presence in the media landscape.
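The similarity test can be illustrated with a plain Jaccard overlap on token sets. This is a simplification: SpotSigs compares selected characteristic token chains rather than full token sets, but the 0.85 threshold is the one stated above:

```python
# Simplified sketch of the deduplication test: plain Jaccard overlap on
# token sets with the 0.85 threshold from the text. The actual pipeline
# uses SpotSigs, which selects characteristic token chains instead.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def is_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    return jaccard(a, b) >= threshold

print(is_duplicate("the vote passed today", "the vote passed today"))  # True
print(is_duplicate("the vote passed today", "rain expected tomorrow"))  # False
```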


<div id="linguistic-annotation" style="margin-top: 50px;">
</div>

## Linguistic Annotation
Linguistic annotation refers to the process of enriching text data with linguistic information. It involves adding structured labels or metadata that describe various aspects of language, such as word boundaries, part-of-speech categories, or named entities like people and locations. For all Swiss-AL corpora, linguistic annotation is done automatically by Natural Language Processing (NLP) tools. The annotation makes texts easier to query and analyze. 

- If you want to know the accuracy of a particular tool, please consult the publications related to that tool.

- We do not perform error analysis ourselves, so please contact us if you notice systematic annotation errors (e.g., if verbs are consistently annotated as nouns, or if certain annotations are missing).


### Tokenization
The first step breaks the text down into smaller units called **tokens**, the smallest meaningful units that a computer can process. Tokens can be words, punctuation marks, numbers, or other character sequences.

For example, the sentence *"The cat sat on the mat."* might be tokenized as:

`["The", "cat", "sat", "on", "the", "mat", "."]`

Although each tokenizer has its own language-specific rules, tokens can also be roughly defined as sequences of characters separated by whitespace or non-letter characters.
For tokenization, we use [TreeTagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/).
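The rough definition above can be sketched with a regular expression. Real tokenizers such as TreeTagger apply language-specific rules (abbreviations, clitics) that this sketch ignores:

```python
import re

# Rough approximation of the definition above: tokens are runs of
# letters/digits, or single punctuation marks. Language-specific rules
# used by real tokenizers (e.g. TreeTagger) are ignored here.
def tokenize(text: str) -> list:
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

print(tokenize("The cat sat on the mat."))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
```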
   
### Part-of-Speech Tagging
This step identifies the **word categories** of the tokens, such as nouns, verbs, or determiners. Even punctuation is tagged (e.g., a period "." may mark the end of a sentence or be part of an abbreviation). The tagger uses the context of surrounding words for accurate syntactic identification. For **German**, we use the [STTS](https://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/germantagsets/) tagset. For **[French](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/french-tagset.html)** and **[Italian](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-tagset.txt)**, we use the tagsets created by Achim Stein.
For assigning Part-of-Speech tags automatically (i.e. *tagging*), we use [TreeTagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). In addition to **pos** tags, German texts have **rfpos** tags, which store morphosyntactic information (see [Search Modes](./search-modes.md)) produced by [RFTagger](https://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/).
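When run with the usual options, TreeTagger emits one token per line as tab-separated `token  tag  lemma` triples; a small parser for such output might look like the sketch below. The sample lines are illustrative, not actual tagger output:

```python
# Sketch of parsing tagger output in the common one-token-per-line,
# tab-separated "token<TAB>tag<TAB>lemma" layout. The sample below is
# illustrative (STTS-style tags), not actual TreeTagger output.
sample = """\
Das\tART\tdie
Spiel\tNN\tSpiel
beginnt\tVVFIN\tbeginnen
.\t$.\t."""

def parse_tagged(output: str) -> list:
    rows = []
    for line in output.splitlines():
        token, pos, lemma = line.split("\t")
        rows.append((token, pos, lemma))
    return rows

for token, pos, lemma in parse_tagged(sample):
    print(f"{token:10} {pos:6} {lemma}")
```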

### Lemmatization
Following PoS tagging, each token is assigned its lemma (base form).
For example:
- **German**: *denkst* ('(you) think') and *dachte* ('thought') are lemmatized as *denken* ('to think').
- **French**: *mangeais* ('was eating') and *mangé* ('eaten') are lemmatized as *manger* ('to eat').
- **Italian**: *pensavi* ('you were thinking') and *pensato* ('thought') are lemmatized as *pensare* ('to think').

The PoS tagging step and lemmatization help distinguish words that look the same but have different meanings depending on their role in a sentence.  

**German**
- *Spiel* ('game', noun) and *spiel* ('play!', verb) look the same apart from capitalization but have different meanings.
  - In *das Spiel* ('the game'), it is a noun.
  - In *Spiel mit mir* ('play with me'), it is a verb.
  - PoS tagging helps determine the correct lemma: *Spiel* (noun) vs. *spielen* (verb).

**French**
- *marche* can mean either:
  - *la marche* ('the walk', noun) → lemma: *marche*
  - *il marche* ('he walks', verb) → lemma: *marcher*

**Italian**
- *corso* can mean:
  - *un corso* ('a course', noun) → lemma: *corso*
  - *ho corso* ('I ran', verb) → lemma: *correre*

By combining **lemmatization** and **PoS tagging**, linguistic tools can correctly interpret words based on their context.

For lemmatization, we use [TreeTagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/).
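Why the lemma depends on the PoS tag can be illustrated with a toy lookup keyed by (form, tag). The mini-lexicon below is made up for this example; real lemmatizers use full lexicons plus guessing rules:

```python
# Toy illustration of PoS-dependent lemmatization: the same surface
# form maps to different lemmas depending on its category. The mini-
# lexicon is invented for this example.
LEXICON = {
    ("Spiel", "NOUN"): "Spiel",
    ("spiel", "VERB"): "spielen",
    ("marche", "NOUN"): "marche",
    ("marche", "VERB"): "marcher",
    ("corso", "NOUN"): "corso",
    ("corso", "VERB"): "correre",
}

def lemmatize(token: str, pos: str) -> str:
    # Fall back to the surface form when the pair is unknown.
    return LEXICON.get((token, pos), token)

print(lemmatize("marche", "NOUN"))  # marche
print(lemmatize("marche", "VERB"))  # marcher
```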

### Sentence Boundary Detection
Boundaries between sentences are detected using punctuation marks like periods (.), exclamation marks (!), and question marks (?). This ensures correct sentence segmentation in further analysis.
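A naive version of this step splits after sentence-final punctuation followed by whitespace. Real systems also handle abbreviations, ordinals, and ellipses, which this sketch ignores:

```python
import re

# Naive sentence boundary detection: split after '.', '!' or '?'
# followed by whitespace. Abbreviations ("z.B."), ordinals and
# ellipses would need extra rules not shown here.
def split_sentences(text: str) -> list:
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(split_sentences("It rained. Did we stay in? Yes!"))
# ['It rained.', 'Did we stay in?', 'Yes!']
```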

### Syntactic Parsing
In this step, the grammatical structure of sentences is analyzed to identify syntactic roles such as **subjects**, **objects**, and **modifiers**.
German and French texts are parsed with the [Mate](https://code.google.com/archive/p/mate-tools/) parser. For Italian, we use the [Stanford CoreNLP Parser](https://stanfordnlp.github.io/CoreNLP/parser-standalone.html).

### Temporal Tagging
This step identifies specific points in time, either through explicit dates (e.g., *2024*) or relative expressions (e.g., *today*, *last week*), which are linked to the publication date of the text. For temporal tagging we use [HeidelTime](https://github.com/HeidelTime/heideltime).
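The anchoring of relative expressions to the publication date can be sketched as follows. HeidelTime itself uses rich, language-specific rule sets; the mapping below covers just two illustrative cases:

```python
from datetime import date, timedelta

# Sketch of what temporal tagging resolves: relative expressions are
# anchored to the publication date. HeidelTime uses rich rule sets;
# this toy mapping covers only two illustrative cases.
def resolve(expression: str, publication_date: date) -> date:
    if expression == "today":
        return publication_date
    if expression == "last week":
        return publication_date - timedelta(weeks=1)
    raise ValueError(f"unsupported expression: {expression}")

pub = date(2024, 5, 10)
print(resolve("today", pub))       # 2024-05-10
print(resolve("last week", pub))   # 2024-05-03
```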

### Entity Recognition 
Entities identifiable by proper names, e.g., **people, locations, organizations**, are tagged and classified into categories. This process uses the [Stanford Named Entity Recognition](https://nlp.stanford.edu/software/CRF-NER.html) software.

### Toponym Disambiguation
Geographic names (**toponyms**) undergo ambiguity resolution. This process distinguishes between locations with identical names, using contextual information to determine the correct one. For toponym disambiguation we use [CLAVIN](https://github.com/bigconnect/clavin).
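The idea can be illustrated with a toy resolver: among candidate places sharing a name, prefer those matching contextual clues, falling back to the most populous candidate. The gazetteer entries below are made up; CLAVIN works on a full GeoNames-based gazetteer with richer context features:

```python
# Toy toponym disambiguation: prefer candidates whose country matches
# contextual clues, otherwise pick the most populous one. Gazetteer
# entries are invented; CLAVIN uses a GeoNames-based gazetteer.
GAZETTEER = {
    "Freiburg": [
        {"country": "CH", "population": 38_000},
        {"country": "DE", "population": 231_000},
    ],
}

def disambiguate(toponym: str, context_countries: set) -> dict:
    candidates = GAZETTEER[toponym]
    in_context = [c for c in candidates if c["country"] in context_countries]
    pool = in_context or candidates  # fall back to all candidates
    return max(pool, key=lambda c: c["population"])

print(disambiguate("Freiburg", {"CH"}))  # {'country': 'CH', 'population': 38000}
print(disambiguate("Freiburg", set()))   # {'country': 'DE', 'population': 231000}
```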

 &rarr; Go to [Advanced Search](./search-modes.md) to see which annotations you can query directly in the Workbench and how.


<div id="final-data-storage" style="margin-top: 50px;">
</div>

## Final Data Storage

Processed and annotated corpora are stored in the **[IMS Open Corpus Workbench (CWB)](https://cwb.sourceforge.io/)**, a set of open-source tools for managing and querying large text corpora with linguistic annotations. Storing data in the CWB format enables fast access to the data retrieved by the Workbench tools [Words in Context](./tools.md) and [Distribution of Words](./tools.md).

The CWB format is not suitable for certain Swiss-AL tools, such as Topics and Keywords. For these modules, we store the results in an **SQL database**, from which they can be retrieved more easily.
The SQL database is also used by other tools in addition to CWB, and it stores corpus metainformation, such as corpus types (media, organisational, etc.), corpus sources, and other information about the texts.
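The kind of metainformation described above could be modeled with a small table like the one below. The table and column names are hypothetical, not the actual Swiss-AL schema (SQLite stands in for the production database):

```python
import sqlite3

# Illustrative sketch of corpus metainformation in an SQL table; table
# and column names are hypothetical, not the actual Swiss-AL schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE corpus (
        name    TEXT PRIMARY KEY,
        type    TEXT,   -- e.g. 'media', 'organisational'
        source  TEXT
    )
""")
conn.execute(
    "INSERT INTO corpus VALUES (?, ?, ?)",
    ("example_corpus", "media", "Swissdox@LiRI"),
)
row = conn.execute(
    "SELECT type FROM corpus WHERE name = ?", ("example_corpus",)
).fetchone()
print(row[0])  # media
```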


<div id="workbench-tools" style="margin-top: 50px;">
</div>

## Swiss-AL Workbench Tools

Depending on the user's query, the Swiss-AL Workbench tools (see [Tools](./tools.md)) retrieve information from CWB, from the SQL database, or from both, process the data, and visualise the results in the Swiss-AL Workbench. For example, the text snippets accessible in *Words in Context* are retrieved from CWB, where the full texts are stored. *Topics*, on the other hand, are retrieved from the SQL database, where word lists are stored.


<script>
    function setResponsiveCoords() {
        const img = document.getElementById('pipeline-image');
        const areas = [
            { id: 'area-section1', originalCoords: [0,0,200,200] },
            { id: 'area-section2', originalCoords: [0,0,400,400] },
            { id: 'area-section3', originalCoords: [0,0,600,600] },
            { id: 'area-section4', originalCoords: [0,0,700,700] },
            { id: 'area-section5', originalCoords: [0,0,774,774] }
        ];

        const originalWidth = 774;
        const originalHeight = 175;

        const scaleX = img.clientWidth / originalWidth;
        const scaleY = img.clientHeight / originalHeight;

        areas.forEach(area => {
            const [x1, y1, x2, y2] = area.originalCoords;
            const scaledCoords = [
                Math.round(x1 * scaleX),
                Math.round(y1 * scaleY),
                Math.round(x2 * scaleX),
                Math.round(y2 * scaleY)
            ];
            document.getElementById(area.id).coords = scaledCoords.join(',');
        });
    }

    window.addEventListener('load', setResponsiveCoords);
    window.addEventListener('resize', setResponsiveCoords);
</script>