# Glossary

## Corpus Annotations

- **Lemma**: The base form of a word under which it is listed in dictionaries (e.g., "go" for "goes," "went").
- **Token**: A single unit of a text, such as a word, number, or punctuation mark.
- **Part-of-Speech (PoS)**: The grammatical category of a word, such as noun, verb, or adjective.
- **Tag**: An annotated label for a token that indicates its word class or grammatical properties (e.g., "verb" for "go").

## Corpus Processing & Technology

- **Corpus**: A collection of texts used for linguistic analysis.
- **Workbench**: A platform or software for analyzing and visualizing corpus data.
- **Pipeline**: A sequence of processing steps for extracting, filtering, and annotating text data.
- **Algorithm**: A systematic sequence of operations used to solve a problem or process data.
- **CQP (Corpus Query Processor)**: A query processor, and the query language of the same name, for searching linguistic corpora using complex patterns over tokens and their annotations.

## Text Analysis

- **Semantic Space / Vector Space**: A mathematical representation of word meanings in a multidimensional space based on their usage in texts.
- **Word Vector**: A numerical representation of a word in a semantic space that reflects its relationships with other words in a corpus.
- **Nearest Neighbor**: The word or words that are closest to a given word in the semantic space.
- **Corpus Keywords**: Words that occur considerably more frequently in a study corpus than in a reference corpus.
- **Text Keywords**: The key terms of an individual text that best represent its content.
- **Log Ratio**: An effect-size measure of how large the difference in a word's frequency is between two corpora, calculated as the binary logarithm of the ratio of its relative frequencies.
- **Log Likelihood**: A significance statistic that indicates how much evidence there is that a word's frequency differs between a study corpus and a reference corpus, helping to identify statistically significant keywords.
- **KWIC (Key Word in Context)**: A display format where search terms are shown with their immediate context.
- **Coherence Score**: A measure of the degree of similarity between the top words in each topic, reflecting how likely they are to appear together in similar contexts.
- **Cosine Similarity**: A measure commonly used to assess how similar two word vectors are based on the angle between them.
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: A measure used to identify terms that are distinctive to a specific text and to rank document relevance in information retrieval.
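
Several of the measures above can be illustrated with a minimal Python sketch. All function names and the `smoothing` parameter are hypothetical and chosen for illustration only. The log-likelihood function uses the two-term form that is common in keyword analysis, and the TF-IDF function uses one simple variant among several, so actual tools may compute slightly different values.

```python
import math


def log_ratio(freq_study, size_study, freq_ref, size_ref, smoothing=0.5):
    """Binary logarithm of the ratio of two relative frequencies."""
    # Assumption: a zero frequency is replaced by a small constant
    # so that the ratio stays defined.
    rel_study = (freq_study or smoothing) / size_study
    rel_ref = (freq_ref or smoothing) / size_ref
    return math.log2(rel_study / rel_ref)


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-term log-likelihood statistic for a word in two corpora."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    score = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2 * score


def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def tf_idf(term, doc_tokens, all_docs):
    """Relative term frequency times the log of inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    doc_freq = sum(1 for doc in all_docs if term in doc)
    idf = math.log(len(all_docs) / doc_freq) if doc_freq else 0.0
    return tf * idf


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines
```

In this sketch, a word that occurs 40 times in a study corpus of 1,000 tokens and 10 times in a reference corpus of the same size has a Log Ratio of about 2, meaning it is roughly four times as frequent in the study corpus.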

## Areas of Research

- **Public Communication**: The exchange of information and opinions in publicly accessible discourses.
- **Discourse Analysis**: The study of language use, particularly patterns, structures, and meanings in large text collections.
- **Applied Linguistics**: A field of research that applies linguistic knowledge to practical problems.
- **ORD (Open Research Data)**: Scientific research data that is publicly accessible to promote transparency and reuse.