Resources to Examine the Quality of Word Embedding Models Trained on n-Gram Data

  • Authors:

    Ábel Elekes, Adrian Englhardt, Martin Schäler, Klemens Böhm

  • Source:

    22nd Conference on Computational Natural Language Learning (CoNLL 2018). DOI: 10.18653/v1/K18-1041

  • Date: 01.11.2018
  • This is the supplementary material for the article "Resources to Examine the Quality of Word Embedding Models Trained on n-Gram Data"

    Abstract: Word embeddings are powerful tools that facilitate better
    analysis of natural language. However, their quality highly depends
    on the resource used for training. There are various approaches
    relying on n-gram corpora, such as the Google n-gram corpus. It is
    the largest currently available corpus (with historic data) and
    exists for several languages. However, n-gram corpora only offer a
    small window into the full text – 5 words for the Google corpus at
    best. This gives rise to the concern of whether the extracted word
    semantics are of high quality. In this paper, we address this concern
    with two contributions. First, we provide a resource containing 120
    word-embedding models – one of the largest collections of embedding
    models. Furthermore, the resource contains the n-gramed versions of
    all corpora used, as well as our scripts for corpus generation, model
    generation, and evaluation. Second, we define a set of meaningful
    experiments that allow evaluating the aforementioned quality
    differences. We conduct these experiments using our resource to show
    its usage and significance. The evaluation results confirm that one
    can generally expect high quality for n-grams with n ≥ 3. We deem
    these contributions valuable resources fostering scientific
    advancement in this area.


Supplementary Materials

Embedding models

Here we provide all embedding models we trained for the publication, grouped by the minimum count parameter. The parameters encoded in the model names are explained in the paper.
  • No minimum count threshold models: [7.6 GB]
  • Models with minimum count parameter = 2: [3.6 GB]
  • Models with minimum count parameter = 5: [2.0 GB]
  • Models with minimum count parameter = 10: [1.2 GB]

The embedding models are licensed under a Creative Commons Attribution 4.0 International License. If you use these models in your scientific work, please reference the companion paper.
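
As a quick sanity check after downloading, a model can be loaded and queried for nearest neighbours. The sketch below assumes the models were trained and saved with gensim's Word2Vec implementation and uses a hypothetical file name; the actual file names and their parameter encoding are described in the paper.

```python
from gensim.models import Word2Vec

# Hypothetical file name; the real model names encode the training
# parameters as described in the paper.
model = Word2Vec.load("eng_3gram_mincount5_300d.model")

# Ten nearest neighbours of a word in the embedding space.
print(model.wv.most_similar("science", topn=10))

# Cosine similarity between two words.
print(model.wv.similarity("science", "research"))
```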

Datasets

We also provide the datasets for future studies. The datasets contain the preprocessed text data and the n-gramed versions for n = 2, 3, 5, and 8.

The n-gramed versions of the datasets are licensed under a Creative Commons Attribution 4.0 International License. If you use this dataset in your scientific work, please reference the companion paper.
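
For orientation, an n-gramed version of a corpus can be thought of as the result of sliding a window of n tokens over the preprocessed text, analogous to how the Google n-gram corpus is organized. The snippet below is a minimal illustration of that idea, not the exact corpus-generation script provided with this resource.

```python
def ngrams(tokens, n):
    """Yield all consecutive n-token windows of a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield tokens[i:i + n]

sentence = "word embeddings are powerful tools".split()
for gram in ngrams(sentence, 3):
    print(" ".join(gram))
# word embeddings are
# embeddings are powerful
# are powerful tools
```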

Scripts

We provide the Python scripts of our experiments here, comprising the training script and the evaluation script.
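
To illustrate what such a training script does, the sketch below trains a word2vec model with gensim, treating every n-gram as a short sentence and exposing the minimum count parameter by which the models above are grouped. The hyperparameter values and file name are assumptions for illustration, not the exact settings of our scripts.

```python
from gensim.models import Word2Vec

# Toy stand-in for an n-gramed corpus: every n-gram is treated as a
# short "sentence" of n tokens.
ngram_corpus = [
    ["word", "embeddings", "are"],
    ["embeddings", "are", "powerful"],
    ["are", "powerful", "tools"],
]

# Illustrative hyperparameters; gensim >= 4 uses `vector_size`
# (older versions call it `size`). `min_count` corresponds to the
# minimum count parameter used to group the models above.
model = Word2Vec(
    sentences=ngram_corpus,
    vector_size=100,
    window=2,
    min_count=1,
    workers=4,
)

model.save("ngram_word2vec.model")  # hypothetical output file name
```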