9. Glossary

Lemma: The base form of a word under which it is listed in dictionaries (e.g., go for goes, went).
Token: A single unit of a text, such as a word, number, or punctuation mark.
Part-of-Speech (PoS): The grammatical category of a word, such as noun, verb, or adjective.
Tag: An annotated label for a token that indicates its word class or grammatical properties (e.g., verb for go).
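
To make the relationship between the four terms above concrete, here is a minimal sketch of one annotated sentence. The `Token` class and its field names are hypothetical, the annotations are written by hand rather than produced by a tagger, and the tag labels follow the Universal Dependencies PoS tagset as an assumption; the corpus at hand may use a different tagset.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One token with its annotations (hypothetical structure for illustration)."""
    form: str   # the token as it appears in the text
    lemma: str  # the base form under which it is listed in dictionaries
    pos: str    # the part-of-speech tag assigned to the token


# Hand-annotated example sentence: "She went home."
sentence = [
    Token("She", "she", "PRON"),
    Token("went", "go", "VERB"),
    Token("home", "home", "ADV"),
    Token(".", ".", "PUNCT"),
]

# Lemmas of all tokens, including punctuation.
lemmas = [token.lemma for token in sentence]

# Surface forms of all tokens tagged as verbs.
verbs = [token.form for token in sentence if token.pos == "VERB"]

print(lemmas)  # ['she', 'go', 'home', '.']
print(verbs)   # ['went']
```

Note that the punctuation mark counts as a token of its own, and that searching by lemma (`go`) finds the inflected form `went`.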

Corpus: A collection of texts used for linguistic analysis.
Workbench: A platform or software for analyzing and visualizing corpus data.
Pipeline: A sequence of processing steps for extracting, filtering, and annotating text data.
Algorithm: A systematic sequence of operations used to solve a problem or process data.
CQP (Corpus Query Processor): The query engine of the IMS Open Corpus Workbench (CWB); its query language is used to search linguistic corpora with complex patterns over tokens and their annotations.

Semantic Space / Vector Space: A mathematical representation of word meanings in a multidimensional space based on their usage in texts.
Word Vector: A numerical representation of a word in a semantic space that reflects its relationships with other words in a corpus.
Nearest Neighbor: The word or words that are closest to a given word in the semantic space.
Corpus Keywords: Words that occur considerably more frequently in a study corpus than in a reference corpus.
Text Keywords: The key terms of an individual text that best represent its content.
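Log Ratio: An effect-size measure of how large the difference in a word's frequency is between two corpora, computed as the binary logarithm of the ratio of the word's relative frequencies (a value of 1 means the word is twice as frequent in the study corpus as in the reference corpus).
Log Likelihood: A significance statistic that indicates how much evidence there is that a word's frequency differs between a study corpus and a reference corpus beyond what chance would produce; it is used to identify statistically significant keywords.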
KWIC (Key Word in Context): A display format where search terms are shown with their immediate context.
Cosine Similarity: A measure commonly used to assess how similar two word vectors are, based on the angle between them (range -1 to 1, or 0 to 1 when all vector components are non-negative; higher = more similar).
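
The three measures defined above (Log Ratio, Log Likelihood, Cosine Similarity) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular workbench: the function and parameter names are hypothetical, the log-likelihood shown is the common two-term form that compares observed with expected frequencies, and replacing a zero frequency with 0.5 in the Log Ratio is one possible convention among several.

```python
import math


def log_ratio(freq_study, size_study, freq_ref, size_ref, zero_replacement=0.5):
    """Binary log of the ratio of relative frequencies (an effect size)."""
    # Assumption: a zero frequency is replaced by a small constant so that
    # the ratio and its logarithm stay defined.
    f_study = freq_study if freq_study > 0 else zero_replacement
    f_ref = freq_ref if freq_ref > 0 else zero_replacement
    return math.log2((f_study / size_study) / (f_ref / size_ref))


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-term log-likelihood statistic (a significance measure)."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were equally frequent in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    statistic = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a term with zero observations contributes nothing
            statistic += observed * math.log(observed / expected)
    return 2 * statistic


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Invented counts: a word occurs 200 times in a study corpus and 50 times
# in a reference corpus, each containing one million tokens.
print(log_ratio(200, 1_000_000, 50, 1_000_000))
print(log_likelihood(200, 1_000_000, 50, 1_000_000))

# Toy two-dimensional vectors pointing in opposite directions.
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

With these invented counts the word is four times as frequent in the study corpus, so the Log Ratio is about 2. The last call shows why the cosine range extends below 0: vectors pointing in opposite directions yield -1.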

Public Communication: The exchange of information and opinions in publicly accessible discourses.
Discourse Analysis: The study of language use, particularly patterns, structures, and meanings in large text collections.
Applied Linguistics: A field of research that applies linguistic knowledge to practical problems.
ORD (Open Research Data): Scientific research data that is publicly accessible to promote transparency and reuse.