Resources to Examine the Quality of Word Embedding Models Trained on n-Gram Data

Authors:
Ábel Elekes, Adrian Englhardt, Martin Schäler, Klemens Böhm
Source:
22nd Conference on Computational Natural Language Learning (CoNLL 2018). DOI: 10.18653/v1/K18-1041
Date: 1 November 2018
This is the supplementary material for the article "Resources to Examine the Quality of Word Embedding Models Trained on n-Gram Data".

Abstract: Word embeddings are powerful tools that facilitate better analysis of natural language. However, their quality highly depends on the resource used for training. There are various approaches relying on n-gram corpora, such as the Google n-gram corpus. It is the largest currently available corpus (with historic data) and exists for several languages. However, n-gram corpora only offer a small window into the full text – 5 words for the Google corpus at best. This gives rise to the concern whether the extracted word semantics are of high quality. In this paper, we address this concern with two contributions. First, we provide a resource containing 120 word-embedding models – one of the largest collections of embedding models. Furthermore, the resource contains the n-gramed versions of all used corpora, as well as our scripts used for corpus generation, model generation and evaluation. Second, we define a set of meaningful experiments that allow evaluating the aforementioned quality differences. We conduct these experiments using our resource to show its usage and significance. The evaluation results confirm that one can generally expect high quality for n-grams with n ≥ 3. We deem these contributions valuable resources fostering scientific advancement in this area.

Supplementary Materials

Embedding models

Here we provide all the embedding models we trained for the publication. They are grouped by the minimum count parameter. The parameters encoded in the model names are explained in the paper.

Models with no minimum count threshold: [7.6 GB]
Models with minimum count parameter = 2: [3.6 GB]
Models with minimum count parameter = 5: [2.0 GB]
Models with minimum count parameter = 10: [1.2 GB]

The embedding models are licensed under a Creative Commons Attribution 4.0 International License. If you use these models in your scientific work, please cite the companion paper.
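To illustrate what grouping by a minimum count could mean in practice, here is a minimal Python sketch. It assumes that the minimum count parameter is a threshold on how often an n-gram must occur in the corpus to be kept; the paper is the authoritative source for the exact definition. The function name `apply_min_count` and the toy counts are hypothetical and are not taken from the provided scripts.

```python
from collections import Counter


def apply_min_count(ngram_counts: Counter, min_count: int) -> Counter:
    """Keep only the n-grams that occur at least `min_count` times.

    Assumption: the threshold is inclusive. A threshold of 1 keeps every
    n-gram, which here stands in for "no minimum count threshold".
    """
    return Counter({gram: count
                    for gram, count in ngram_counts.items()
                    if count >= min_count})


# Toy 3-gram counts, invented for illustration only.
toy_counts = Counter({
    ("the", "quick", "brown"): 12,
    ("quick", "brown", "fox"): 5,
    ("brown", "fox", "jumps"): 2,
    ("fox", "jumps", "over"): 1,
})

# Thresholds 2, 5 and 10 match the model groups listed above; 1 is the baseline.
kept_per_threshold = {t: len(apply_min_count(toy_counts, t)) for t in (1, 2, 5, 10)}
```

Under this reading, a higher threshold leaves fewer distinct n-grams for training, which may be why the model archives shrink as the minimum count grows.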

Datasets

We also provide the datasets for future studies. Each dataset contains the preprocessed text data and its ngrammed versions for n = 2, 3, 5 and 8.

Ngrammed 1-Billion word dataset: [14.6 GB]
Ngrammed Wikipedia dataset: [16.8 GB]

The ngrammed versions of the datasets are licensed under a Creative Commons Attribution 4.0 International License. If you use these datasets in your scientific work, please cite the companion paper.
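For readers unfamiliar with the idea, the following Python sketch shows one way a tokenized text could be turned into counted n-grams for the values of n listed above. It is an illustration under stated assumptions, not the authors' corpus generation script: whitespace tokenization, sentence-level windows, and the names `ngrams` and `ngram_counts` are all hypothetical, and the file format of the published datasets may differ.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token windows of `tokens`, in order.

    A token list shorter than n yields no n-grams.
    """
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences: Iterable[str], n: int) -> Counter:
    """Count the n-grams of every sentence (assumed whitespace-tokenized)."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts


# Toy sentences, invented for illustration only.
toy_sentences = [
    "the quick brown fox jumps",
    "the quick brown dog sleeps",
]

# One count table per value of n used for the provided datasets.
tables = {n: ngram_counts(toy_sentences, n) for n in (2, 3, 5, 8)}
```

In this sketch the windows never cross sentence boundaries, so a five-word sentence contributes exactly one 5-gram and no 8-grams. This mirrors the point made in the abstract that an n-gram corpus exposes only a small window of the full text.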

Scripts

Here we provide the Python scripts used for training and evaluating the models in our experiments.