Resources to Examine the Quality of Word Embedding Models Trained on n-Gram Data
- Author:
- Source: 22nd Conference on Computational Natural Language Learning (CoNLL 2018). DOI: 10.18653/v1/K18-1041
- Date: 01.11.2018
This is the supplementary material for the article "Resources to Examine the Quality of Word Embedding Models Trained on n-Gram Data".
Abstract
Word embeddings are powerful tools that facilitate better
analysis of natural language. However, their quality highly depends on
the resource used for training. There are various approaches relying on
n-gram corpora, such as the Google n-gram corpus. It is the largest
currently available corpus (with historic data) and exists for several
languages. However, n-gram corpora only offer a small window into the full
text – 5 words for the Google corpus at best. This gives rise to the
concern whether the extracted word semantics are of high quality. In this
paper, we address this concern with two contributions. First, we provide
a resource containing 120 word-embedding models – one of the largest
collections of embedding models. Furthermore, the resource contains the
n-grammed versions of all used corpora, as well as our scripts used for
corpus generation, model generation and evaluation. Second, we define
a set of meaningful experiments that allow evaluating the aforementioned
quality differences. We conduct these experiments using our resource to
show its usage and significance. The evaluation results confirm that one
generally can expect high quality for n-grams with n ≥ 3. We deem these
contributions valuable resources fostering scientific advancement in this
area.
Supplementary Materials
Embedding models
- No minimum count threshold models: [7.6 GB]
- Models with minimum count parameter = 2: [3.6 GB]
- Models with minimum count parameter = 5: [2.0 GB]
- Models with minimum count parameter = 10: [1.2 GB]
The embedding models are licensed under a Creative Commons Attribution 4.0 International License. If you use these models in your scientific work, please cite the companion paper.
Datasets
The n-grammed version of the dataset is licensed under a Creative Commons Attribution 4.0 International License. If you use this dataset in your scientific work, please cite the companion paper.
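
For readers unfamiliar with n-gram corpora, the following is a minimal sketch of what fragmenting full text into n-grams might look like. It is not the authors' corpus-generation script. The whitespace tokenizer, the toy corpus, and the function names `ngram_counts` and `apply_min_count` are assumptions made for illustration. In particular, reading a minimum count as a threshold on n-gram frequency is an assumption here; this page does not state how the parameter listed for the models above is defined.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count every contiguous n-token window in each sentence."""
    counts = Counter()
    for sentence in sentences:
        # Naive lowercase + whitespace tokenizer (an assumption for this sketch).
        tokens = sentence.lower().split()
        # Sentences shorter than n tokens yield no n-grams.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def apply_min_count(counts, min_count):
    """Keep only n-grams that occur at least min_count times (hypothetical filter)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
]

trigrams = ngram_counts(corpus, 3)
frequent = apply_min_count(trigrams, 2)

for gram, count in sorted(frequent.items()):
    print(" ".join(gram), count)
```

Once the corpus is reduced to counted n-grams, any context beyond the n-token window is lost. This is the limitation the abstract refers to when it questions the quality of semantics extracted from such data.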