9. Glossary
9.1. Corpus Annotations
Lemma: The base form of a word under which it is listed in dictionaries (e.g., go for goes, went).
Token: A single unit of a text, such as a word, number, or punctuation mark.
Part-of-Speech (PoS): The grammatical category of a word, such as noun, verb, or adjective.
Tag: An annotated label for a token that indicates its word class or grammatical properties (e.g., verb for go).
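As a rough illustration of how these four terms relate, the sketch below stores a short hand-annotated sentence as a list of tokens, each carrying its lemma and a PoS tag. The `Token` class, the helper `forms_of`, and the tag labels (written in the style of the Universal PoS tagset) are assumptions made for this example; the tagset actually used in a given corpus may differ.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One unit of text together with its annotations (hypothetical structure)."""
    form: str   # the token as it appears in the text
    lemma: str  # dictionary base form
    pos: str    # tag naming the part of speech (illustrative labels)


# "He went, she goes." annotated by hand for illustration.
sentence = [
    Token("He", "he", "PRON"),
    Token("went", "go", "VERB"),
    Token(",", ",", "PUNCT"),
    Token("she", "she", "PRON"),
    Token("goes", "go", "VERB"),
    Token(".", ".", "PUNCT"),
]


def forms_of(lemma, tokens):
    """Return every word form in `tokens` that is annotated with `lemma`."""
    return [t.form for t in tokens if t.lemma == lemma]
```

Searching by lemma rather than by word form is what lets a single query for go find both went and goes.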
9.2. Corpus Processing & Technology
Corpus: A collection of texts used for linguistic analysis.
Workbench: A platform or software for analyzing and visualizing corpus data.
Pipeline: A sequence of processing steps for extracting, filtering, and annotating text data.
Algorithm: A systematic sequence of operations used to solve a problem or process data.
CQP (Corpus Query Processor): A query processor, with its own query language, for searching linguistic corpora using complex patterns over tokens and their annotations.
9.3. Text Analysis
Semantic Space / Vector Space: A mathematical representation of word meanings in a multidimensional space based on their usage in texts.
Word Vector: A numerical representation of a word in a semantic space that reflects its relationships with other words in a corpus.
Nearest Neighbor: The word or words that are closest to a given word in the semantic space.
Corpus Keywords: Words that occur considerably more frequently in a study corpus than in a reference corpus.
Text Keywords: The key terms of an individual text that best represent its content.
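Log Ratio: An effect-size measure of how large the difference in a word's frequency between two corpora is, calculated as the binary logarithm of the ratio of its relative frequencies in the two corpora (e.g., a Log Ratio of 1 means the word is twice as frequent in the study corpus).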
Log Likelihood: A measure used to determine how strongly a word is associated with a specific corpus compared to a reference corpus, helping to identify statistically significant keywords.
KWIC (Key Word in Context): A display format where search terms are shown with their immediate context.
Coherence Score: A measure of the degree of similarity between the top words in each topic, reflecting how likely they are to appear together in similar contexts (typical range: 0.3 – 0.7; higher = better).
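Cosine Similarity: A measure commonly used to assess how similar two word vectors are, based on the angle between them (range −1 to 1; values near 1 = very similar, values near 0 = unrelated; negative values are possible but less common for word vectors).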
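TF-IDF (Term Frequency/Inverse Document Frequency): A weighting that scores a term higher the more often it occurs in a text and the fewer texts in the collection contain it; used to identify terms that are distinctive to a specific text and to rank document relevance in information retrieval.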
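Several of the measures in this section are short formulas, so a minimal Python sketch may help make them concrete. It is not the implementation used by any particular tool: the function names are hypothetical, the zero-frequency substitution of 0.5 in `log_ratio` is an assumed convention, `log_likelihood` uses the two-term form common in keyword analysis, and `tf_idf` uses one of several TF-IDF variants (relative term frequency times the natural log of the inverse document frequency).

```python
import math
from collections import Counter


def log_ratio(freq_study, size_study, freq_ref, size_ref):
    """Binary log of the ratio of relative frequencies (an effect size)."""
    # Assumption: a zero frequency is replaced by 0.5 to avoid division by zero.
    rel_study = (freq_study or 0.5) / size_study
    rel_ref = (freq_ref or 0.5) / size_ref
    return math.log2(rel_study / rel_ref)


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-corpus log-likelihood statistic for one word (a significance measure)."""
    total = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def tf_idf(term, document, collection):
    """TF-IDF of `term` in `document`, a token list drawn from `collection`."""
    tf = Counter(document)[term] / len(document)
    doc_freq = sum(1 for doc in collection if term in doc)
    if doc_freq == 0:
        return 0.0
    return tf * math.log(len(collection) / doc_freq)
```

For a word with 200 occurrences in a 1,000,000-token study corpus and 50 in an equally large reference corpus, `log_ratio(200, 1_000_000, 50, 1_000_000)` comes out at about 2, meaning the word is four times as frequent in the study corpus. The log-likelihood value for the same counts says how much statistical evidence there is for that difference, which is why the two measures are often reported together.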
9.4. Areas of Research
Public Communication: The exchange of information and opinions in publicly accessible discourses.
Discourse Analysis: The study of language use, particularly patterns, structures, and meanings in large text collections.
Applied Linguistics: A field of research that applies linguistic knowledge to practical problems.
ORD (Open Research Data): Scientific research data that is publicly accessible to promote transparency and reuse.