9. Glossary

9.1. Corpus Annotations

  • Lemma: The base form of a word under which it is listed in dictionaries (e.g., go for goes, went).

  • Token: A single unit of a text, such as a word, number, or punctuation mark.

  • Part-of-Speech (PoS): The grammatical category of a word, such as noun, verb, or adjective.

  • Tag: An annotated label for a token that indicates its word class or grammatical properties (e.g., verb for go).

9.2. Corpus Processing & Technology

  • Corpus: A collection of texts used for linguistic analysis.

  • Workbench: A platform or software for analyzing and visualizing corpus data.

  • Pipeline: A sequence of processing steps for extracting, filtering, and annotating text data.

  • Algorithm: A systematic sequence of operations used to solve a problem or process data.

  • CQP (Corpus Query Processor): The query engine of the IMS Open Corpus Workbench (CWB), whose query language is used to search linguistic corpora with complex patterns over tokens and their annotations.

9.3. Text Analysis

  • Semantic Space / Vector Space: A mathematical representation of word meanings in a multidimensional space based on their usage in texts.

  • Word Vector: A numerical representation of a word in a semantic space that reflects its relationships with other words in a corpus.

  • Nearest Neighbor: The word or words that are closest to a given word in the semantic space.

  • Corpus Keywords: Words that occur considerably more frequently in a study corpus than in a reference corpus.

  • Text Keywords: The key terms of an individual text that best represent its content.

  • Log Ratio: An effect-size measure of how large the difference in a word's frequency is between two corpora, calculated as the binary logarithm of the ratio of its relative frequencies in the two corpora.

  • Log Likelihood: A test statistic used to determine how confident one can be that a word's higher frequency in a specific corpus, compared to a reference corpus, is not due to chance, helping to identify statistically significant keywords.

  • KWIC (Key Word in Context): A display format where search terms are shown with their immediate context.

  • Coherence Score: A measure of the degree of similarity between the top words in each topic, reflecting how likely they are to appear together in similar contexts (typical range: 0.3 – 0.7; higher = better).

  • Cosine Similarity: A measure commonly used to assess how similar two word vectors are, based on the cosine of the angle between them (range −1 to 1, or 0 to 1 when all vector components are non-negative; higher = more similar).

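  • TF-IDF (Term Frequency/Inverse Document Frequency): A weighting that scores a term higher the more often it occurs in a text and the fewer texts in the collection contain it. It is used to identify terms that are distinctive to a specific text and to rank document relevance in information retrieval.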
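
The frequency-based measures in this list (Log Ratio, Log Likelihood, Cosine Similarity, and TF-IDF) can be illustrated with a minimal Python sketch. The function names and parameters are hypothetical, and the formulas are common textbook variants rather than necessarily the ones used by any particular workbench: the Log Ratio replaces a zero frequency with 0.5 (an assumed smoothing constant), the Log Likelihood uses the two-term form based on expected frequencies, and TF-IDF uses the unsmoothed natural-log IDF.

```python
import math


def log_ratio(freq_study, size_study, freq_ref, size_ref, zero_replacement=0.5):
    """Binary log of the ratio of relative frequencies (an effect size)."""
    # Assumption: a zero frequency is replaced by a small constant so the
    # ratio stays defined; non-zero frequencies are left unchanged.
    rel_study = max(freq_study, zero_replacement) / size_study
    rel_ref = max(freq_ref, zero_replacement) / size_ref
    return math.log2(rel_study / rel_ref)


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-term log-likelihood statistic for a word in two corpora."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero observation contributes nothing
            ll += observed * math.log(observed / expected)
    return 2 * ll


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def tf_idf(term_count, doc_length, n_docs, docs_with_term):
    """Relative term frequency times the natural-log inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log(n_docs / docs_with_term)
    return tf * idf
```

For example, a word with 40 occurrences in a 10,000-token study corpus and 10 occurrences in a 10,000-token reference corpus has a Log Ratio of about 2, meaning it is roughly four times as frequent in the study corpus. A term that occurs in every document of a collection receives a TF-IDF of 0 under this variant, because its inverse document frequency is log(1) = 0.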

9.4. Areas of Research

  • Public Communication: The exchange of information and opinions in publicly accessible discourses.

  • Discourse Analysis: The study of language use, particularly patterns, structures, and meanings in large text collections.

  • Applied Linguistics: A field of research that applies linguistic knowledge to practical problems.

  • ORD (Open Research Data): Scientific research data that is publicly accessible to promote transparency and reuse.