
<style>
  .howtosum {
      color: white;
      background-color: #00919B; /* teal */
      padding: 8px;
      font-weight: bold;
      border-radius: 5px;
      cursor: pointer;
  }
  .howtodet {
      border: 1px solid #007bff;
      border-radius: 5px;
      padding: 5px;
      background-color: #fffade; /* light yellow */
  }
</style>

# Swiss-AL Tools


## Topics 

**What are Topics?**

Topics are clusters of words that frequently appear together in a set of documents and represent a common theme or subject.

In a collection of texts about different subjects, a **topic model** analyzes the words in these articles and tries to find groups of words that commonly appear together. Each word group (or topic) might correspond to a theme like "storm", "flood" or "climate_change" without anyone manually labeling them.

We use Latent Dirichlet Allocation (**LDA**) to detect topics in a selected corpus [Blei et al. 2003](./bibliography.md). LDA is a generative probabilistic model that represents documents as mixtures of topics, where each topic is a distribution over words. It is widely used for topic modeling due to its ability to handle large corpora efficiently and produce interpretable results.

<details class="howtodet">
<summary class="howtosum"><b>How does LDA work?</b></summary>

Let’s say we have a corpus of texts about climate that are unfamiliar to us, and we want to know which topics they address. We don’t have time to read all of them, but we want to group them into meaningful categories based on their major themes.

**Step 1: Random topic assignment**

At the start, LDA assigns every word in every document to a random topic.
This is completely random, so initially, the topics don’t make sense.

*Example: random topic assignments*

| Document | Text | Random topic assignments |
|----------|------|--------------------------|
| D1 | "The glacier is melting quickly" | glacier → T1, melting → T2, quickly → T3 |
| D2 | "Rising temperatures and stronger storms" | Rising → T2, storms → T1 |


**Step 2: Iterative refinement (Gibbs Sampling)**

LDA goes through many iterations where it reassigns words to topics based on probabilities. Each word looks at the topics of surrounding words and adjusts its topic accordingly.

*Example: after some iterations*

| Document | Text | Adjusted topic assignments |
|----------|------|----------------------------|
| D1 | "The glacier is melting quickly" | glacier → T1, melting → T1, quickly → T1 |
| D2 | "Rising temperatures and stronger storms" | Rising → T2, storms → T2 |

Now, words that often appear together are grouped into the same topic.
In this case:

- Topic 1 (T1) → glacier, melting, quickly (climate effects)
- Topic 2 (T2) → Rising, storms (extreme weather)

**Step 3: Probability adjustment**

LDA continuously refines topics using two main probabilities:

1. **How often does this word appear in a topic?**
   - If "glacier" appears mostly in Topic 1, it will likely stay there.
2. **How often does this topic appear in the document?**
   - If a document mostly has Topic 1 words, new words in it will likely be assigned to Topic 1.

*Example: topic distribution in documents*

| Document | Topic 1 (Climate Effects) | Topic 2 (Extreme Weather) |
|----------|---------------------------|---------------------------|
| D1 ("The glacier is melting quickly") | 80% | 20% |
| D2 ("Rising temperatures and stronger storms") | | |


**Step 4: Convergence (stable topics)**

After many iterations, words settle into stable topics. Now, every document has a mix of topics, and every topic has important words.

*Example: extracted topics*

**Topic 1** (Climate Effects)\
**glacier, melting, drought, sea-level, permafrost, erosion**

**Topic 2** (Extreme Weather)\
**storm, hurricane, heatwave, flood, temperature, rainfall**


</details>
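For readers who want to see the mechanics, the four steps above can be sketched as a toy collapsed Gibbs sampler in pure Python. This is an illustration only: the Workbench does not use this code, and the documents, hyperparameters, and function names are invented.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustration only)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                # topic counts per document
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    assignments = []

    # Step 1: assign every word to a random topic
    for d, doc in enumerate(docs):
        z = [rng.randrange(num_topics) for _ in doc]
        assignments.append(z)
        for w, t in zip(doc, z):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    # Steps 2-3: repeatedly reassign each word using the two probabilities
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                doc_topic[d][t] -= 1; topic_word[t][w] -= 1; topic_total[t] -= 1
                # P(topic | word, doc) is proportional to
                # "how often this word appears in the topic" x
                # "how often this topic appears in the document"
                weights = [
                    (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1; topic_word[t][w] += 1; topic_total[t] += 1

    # Step 4: after convergence, report the top words per topic
    top_words = [
        sorted(topic_word[k], key=topic_word[k].get, reverse=True)[:3]
        for k in range(num_topics)
    ]
    return assignments, top_words
```

Calling `lda_gibbs([["glacier", "melting", "quickly"], ["rising", "storms", "glacier"]], num_topics=2)` returns one topic assignment per word plus the top words per topic.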

-------

**Topics in Swiss-AL Workbench**

The Swiss-AL Workbench provides pre-calculated topics for each corpus created with the Python package **[tomotopy](https://bab2min.github.io/tomotopy/v0.13.0/en/)**. Additionally, users can generate topic models for their custom subcorpora. The following options and requirements are available:

-  **Number of topics**: When a user creates a subcorpus, a model with 50 topics is calculated automatically. Afterwards, users can select between 5 and 100 topics for their analysis.
-  **Model storage**: Topic models generated for subcorpora can be saved to the user's workspace for future reference.

💡 Note that:<br>
There is no perfect topic number — it depends on your corpus size and how diverse your content is.
A good starting point is to try out several values and evaluate the coherence and interpretability of the results.
If you are working with a large corpus from different sources, we recommend trying a higher number of topics.
On the other hand, if your corpus is focused on a specific theme (e.g., *measles vaccination*), a smaller number of topics may be more useful. Experiment until you find a model that fits your needs best.


All topic models are calculated using up to **four consecutive words (lemmas)** as the basic unit of analysis. We use [TermWeight.IDF](https://bab2min.github.io/tomotopy/v0.13.0/en/#tomotopy.TermWeight) to weight the terms considered as topic keywords: with this method, terms occurring in almost every document receive a low weight, while terms occurring in only a few documents receive a high weight. Term weighting is based on [Wilson et al. (2010)](bibliography.md). For topic visualisations in 2D and 3D, we use [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html).
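As a sketch of the t-SNE projection step only: scikit-learn's `TSNE` maps each topic (one row of a topic-term matrix) to a point in 2D, so that similar topics end up close together. The matrix below is random, not a real model, and the parameter values are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

# hypothetical topic-term matrix: 50 topics x 300 vocabulary terms
topic_term = np.random.RandomState(0).rand(50, 300)

# project each topic to a point in 2D; similar topics land close together
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(topic_term)
print(coords.shape)  # (50, 2)
```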


There are no restrictions regarding corpus size for topic modeling. However, we encourage you to choose subcorpora containing at least **1,000 texts** to ensure more reliable topic modeling results.
<!--maximum of models that the user can save?-->
<!--removed from the list of topics or ignored prior to calculation-->

**Topic Modeling View**

- **Topic List**: Contains topic numbers (Topic Nr.), a list of 30 keywords per topic (Words), and the topic size in percent (Size (%)).
- **Topic Network**: Visualizes the topics in a 2D space. Similar topics are placed close to each other. The bubble size depends on the number of documents associated with the topic. Hovering over an individual bubble shows the 5 most prominent keywords per topic. A click on a bubble highlights the corresponding topic row in the Topic List.

💡 Note that:<br>
- Topic numbers (e.g., 1, 2, 3) are arbitrary and do not inherently carry any meaning.
- It is common for one or two topics to act as "background topics" capturing words that do not strongly belong to any specific theme, often consisting of high-frequency, general-purpose terms.


![Topics in Swiss-AL Workbench1](/_static/topics1.png)
*Overview of topics in the corpus "German Demo Corpus"*


**One-Topic View**
- **Top Words**: Shows a list of keywords for the selected topic.
- **Topic Over Time**: This feature visualizes the frequency of the selected topic over time by showing the proportion of all tokens in each time period that belong to the topic. Users can switch between monthly and yearly view.
- **Top 20 Documents**: Displays the texts most strongly associated with this topic, based on the **coherence score**, which measures the semantic similarity among a topic’s top words. 
- **Top 5 Similar Topics**: Shows a list of the five most similar topics, calculated based on topic-term distributions. The similarity measure used is **cosine similarity**.
- **Top Sources for This Topic**: Highlights the 10 sources in which this topic appears most prominently, ranked by the proportion of their lemmas that are assigned to the topic.
- **Topic Proportions by Category**: Shows, for each category value (e.g. media type, social system, etc.), the proportion of its total lemmas that are assigned to the topic.

<!--Warum finde ich das gesuchte Wort nicht in Topics? es kommt schon vor-->

<!-- in comparision to all sources in the corpus in which it is contained or in comparison to all sources -->
<!-- the numbers in csv do not correspond to the proportions in the graph, the users do not have the number for the proportion -->

![Topics in Swiss-AL Workbench2](/_static/topics2.png)
*One-topic view: "bag, gesundheit, patient, medizinisch, spital,... "*


<!--Abbildungen 2 mal per Topics einfügen, anklickbare bubbles (kann man das in git tool machen?)
-> lieber einfach gelb markieren und schreiben, was die einzelnen Teile machen
-->

## Keywords 

**What are *keywords*?**

The tool *Keywords* identifies terms that occur considerably more frequently in one corpus (study corpus) compared to another corpus (reference corpus). 

<!--add examples with fairy tales-->
In Swiss-AL Workbench, the keywords are determined using **[Log Ratio (LR)](https://cass.lancs.ac.uk/log-ratio-an-informal-introduction/)**, which is the binary logarithm of the ratio of relative frequencies. In the Log Ratio score, each additional point represents a doubling of the difference in frequency between the two corpora.

Examples:

- log<sub>2</sub> ratio = 0
  -  The term has roughly the same relative frequency in the study corpus and the reference corpus (2<sup>0</sup> = 1).
- log<sub>2</sub> ratio = 1
  - The term is roughly **twice** as frequent in the study corpus as in the reference corpus (2<sup>1</sup> = 2).
- log<sub>2</sub> ratio = 2
  -  The term is roughly **four** times more frequent in the study corpus than in the reference corpus (2<sup>2</sup> = 4).
- log<sub>2</sub> ratio = 3, 4, ...
  - The term is eight times (2<sup>3</sup> = 8), sixteen times (2<sup>4</sup> = 16), etc., more frequent in the study corpus than in the reference corpus.
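A minimal sketch of the Log Ratio computation from raw frequencies and corpus sizes; the term, frequencies, and corpus sizes below are invented for illustration.

```python
import math

def log_ratio(freq_study, size_study, freq_ref, size_ref):
    """Binary log of the ratio of relative frequencies (Log Ratio)."""
    return math.log2((freq_study / size_study) / (freq_ref / size_ref))

# invented example: "glacier" occurs 40 times in a 10,000-token study corpus
# and 10 times in a 10,000-token reference corpus
score = log_ratio(40, 10_000, 10, 10_000)
print(score)  # 2.0, i.e. four times as frequent in the study corpus
```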

When a user creates a subcorpus, keywords are calculated relative to the parent corpus, which serves as the reference corpus (with the overlapping part between the selected subcorpus and the parent corpus excluded).

However, users can re-calculate keywords using another reference corpus of their choice (advanced mode).

Note that:<br>
💡 Only keywords that are more common in the study corpus are shown.<br>
<!--💡 The minimum frequency threshold in the study corpus is **5**. For the reference corpus, the minimum is 0; however, to avoid division by zero, any frequency of 0 is replaced with 0.5.<br>-->
<!--💡 If a lemma in the study corpus is not found in the reference corpus, 1e-10 is given as a frequency (*dummy value*). -->
💡 Keywords are filtered using a **Log-Likelihood Ratio (LLR)** test, with a p-value threshold of 0.001.
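One common formulation of the log-likelihood ratio test for a single term is Dunning's G2 statistic. The sketch below illustrates the general method and is an assumption, not necessarily the Workbench's exact implementation.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Dunning's log-likelihood (G2) for one term across two corpora."""
    expected_study = size_study * (freq_study + freq_ref) / (size_study + size_ref)
    expected_ref = size_ref * (freq_study + freq_ref) / (size_study + size_ref)
    g2 = 0.0
    if freq_study:
        g2 += freq_study * math.log(freq_study / expected_study)
    if freq_ref:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * g2

# with one degree of freedom, p < 0.001 corresponds to G2 >= 10.83
log_likelihood(40, 10_000, 10, 10_000)  # ~19.3, passes the threshold
```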


## Words in Context

The tool *Words in Context* allows you to view text segments or documents containing your search term. This tool has two modes:

**KWIC View**: By default, you see 5 tokens to the left and right of your search term. You can add more context, up to two sentences on each side. This view is traditionally used in corpus linguistics (where it is usually called KWIC, for "Key Word In Context") and is useful if you are interested in the contexts in which your search term occurs throughout the corpus. You can sort the table columns in ascending or descending order.

![KWIC View](/_static/kwic.png)
*KWIC View for the search term "Corona"*


**Document View**: This view has been developed to facilitate the search for relevant documents. In contrast to the KWIC View, which displays a row for each individual occurrence of the search term in the corpus, the Document View shows one row for each document that contains the search term(s). Additionally, it shows:
  - **Distribution of your search term in the document**: A blue square on the horizontal bar represents the approximate position of the term in the text. The color intensity represents the frequency of occurrence in the given segment. You can sort this table and explore, for instance, whether your search terms occur more frequently at the beginning or end of a given document.
  - **Keywords for each document**, calculated with TF-IDF (Term Frequency–Inverse Document Frequency). TF-IDF measures how important a word is in a document relative to a collection (corpus) of documents. It increases with the number of times a word appears in a document (TF), but is offset by how common the word is across all documents (IDF).
  - Text metadata, including **titles**, which will also likely help you find the relevant documents.
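A minimal TF-IDF sketch over documents represented as lists of lemmas; the toy data is invented, and the Workbench's exact TF-IDF variant is not specified here.

```python
import math

def tf_idf(term, doc, docs):
    """TF-IDF of `term` in `doc` relative to the collection `docs`."""
    tf = doc.count(term) / len(doc)             # term frequency in this document
    df = sum(1 for d in docs if term in d)      # number of documents containing the term
    idf = math.log(len(docs) / df)              # inverse document frequency
    return tf * idf

docs = [
    ["corona", "pandemie", "corona", "massnahme"],
    ["wetter", "sonne", "regen"],
    ["corona", "impfung", "dosis"],
]
# "corona" is weighted higher in the first document, where it occurs twice
tf_idf("corona", docs[0], docs)  # ~0.20
```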

![Document View](/_static/kwic_text.png)
*Document View for the search term "Corona"*

Note that:<br>
💡 *Document keywords* in Words in Context differ from *corpus keywords* created with [Keywords](./tools.md).<br>
<!--genaue methode mit Klaus besprechen-->

## Word Distribution 

The tool *Word Distribution* is divided into four modules:

- **Frequency Table**: this functionality shows the most frequent forms of your search term. If you are interested in the frequency of the main form (lemma) of your search term, select "Group by lemma".
- **Word Distribution over Time**: this functionality shows the distribution of your search term through time in two parallel diagrams: year view and month view. You can adjust the time span by interacting with both graphs.
- **Grouped by Source**: this functionality shows the distribution of your search term across the sources in your corpus.
- **Grouped by Category**: this functionality shows the distribution of your search term across the categories in your corpus. For instance, for media texts: **category 1** is *publisher* or *media type*. The latter is further divided into **category 2**: *daily/online newspaper*, *weekly/magazine*, *special interest magazine* and *radio/tv*.

![Word Distribution](/_static/distribution.png)
*Word Distribution for the search term "Corona"*

💡  See examples for all categories in [Corpora](./corpora.md).   <br>
💡  The relative frequencies shown in the diagrams indicate the total number of hits found in a specific category divided by the total number of hits in the corpus. 
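With invented hit counts, this computation looks like:

```python
# hypothetical hits per category for one search term
hits_per_category = {"daily/online newspaper": 120, "weekly/magazine": 40, "radio/tv": 40}
total_hits = sum(hits_per_category.values())  # total hits in the corpus
relative = {cat: n / total_hits for cat, n in hits_per_category.items()}
print(relative["daily/online newspaper"])  # 0.6
```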

<!--this is actually missing now, only the share per lemma is given-->

<!--## Cooccurrences (Co-occurring words)

genaue methode mit Klaus besprechen-->

## Semantic Space

With this tool, users can search for **semantically similar words** by looking at *nearest neighbors* of a word in a semantic space. A *semantic space* is a mathematical representation of word meanings, where words (or phrases) are mapped to points in a multi-dimensional space based on their relationships with other words. In this space, words with similar meanings or usage tend to be closer together. 

The nearest neighbors search produces:

- Diagrams visualizing the vector space in both 2D and 3D (up to 5,000 most frequent words in the corpus);
- A table displaying the **cosine similarity** values between the searched word(s) and their closest neighbors in the vector space.
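Cosine similarity itself can be sketched in a few lines (toy low-dimensional vectors for illustration; real embedding vectors have many more dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # ~1.0 (parallel vectors)
cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # 0.0 (orthogonal vectors)
```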


<!--genaue methode mit Sooyeon besprechen-->


![Semantic_Space3](/_static/word_embeddings_3D.png)
*Semantic space in corpus "German Demo Corpus"*

![Semantic_Space2](/_static/word_embeddings_radikal_table.png)
*Semantic neighbors for "radikal" 'radical' in the corpus "German Demo Corpus"*



<!--
💡 The search for similar words is also embedded in the Search view: once you have entered your search term, you can use the function **Show similar words**, which displays 10 nearest neighbors of your search term in the semantic space.

-->

A paper showing the potential of word embeddings for discourse analysis is [Bubenhofer et al. 2019](./bibliography.md). If you would like to read more on the general principle behind word embeddings, we recommend [Lenci 2018](./bibliography.md).

<!-- {cite}`Lenci2018`-->


<!--## (Search Corpus)-->

<!--### Frequency (-> move this to Text snippets)-->

<!--genaue methode mit Klaus besprechen-->

<!--https://www.3blue1brown.com/lessons/gpt-->
