CH-Bench: a User-oriented Benchmark for Systems for Efficient Distant Reading (Design, Performance, and Insights)

  • Author:

    Jens Willkomm, Markus Raster, Martin Schäler, Klemens Böhm

  • Source:

    International Journal on Digital Libraries (Int J Digit Libr)

  • Date: 15.03.2023
  • Abstract

    Data Science deals with the discovery of information from large volumes of data. The data studied by scientists in the humanities include large textual corpora. An important objective is to study the ideas and expectations of a society regarding specific concepts, like "freedom" or "democracy", both for today's society and, even more so, for societies of the past. Studying the meaning of words using large corpora requires efficient systems for text analysis, so-called distant reading systems. Making such systems efficient calls for a specification of the necessary functionality and clear expectations regarding typical workloads. Both are currently unclear, and there is no benchmark to evaluate distant reading systems. In this article, we propose such a benchmark, with the following innovations: As a first step, we collect and structure various information needs of the target users. We then formalize the notion of word context to facilitate the analysis of specific concepts. Using this notion, we formulate queries in line with the information needs of users. Finally, based on this, we propose concrete benchmark queries. To demonstrate the benefit of our benchmark, we conduct an evaluation with two objectives. First, we aim to gain insights into the content of different corpora, i.e., whether and how their size and nature (e.g., popular and broad literature or specific expert literature) affect results. Second, we benchmark different data management technologies. This has allowed us to identify performance bottlenecks.

     

     


Citation

Cite this article as:

Willkomm, J., Raster, M., Schäler, M. et al. CH-Bench: a user-oriented benchmark for systems for efficient distant reading (design, performance, and insights). Int J Digit Libr (2023). https://doi.org/10.1007/s00799-023-00347-4

Bibtex:

@article{Willkomm:2023:10.1007/s00799-023-00347-4,
    author    = {Jens Willkomm and Markus Raster and Martin Schäler and Klemens Böhm},
    journal   = {International Journal on Digital Libraries},
    title     = {{CH}-Bench: a user-oriented benchmark for systems for efficient distant reading (design, performance, and insights)},
    year      = {2023},
    month     = {mar},
    issn      = {1432-1300},
    doi       = {10.1007/s00799-023-00347-4},
    publisher = {Springer Science and Business Media {LLC}},
}