Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees

Abstract

Estimating the cost of a query plan is one of the hardest problems in query optimization. This includes cardinality estimates of string search patterns, of multi-word strings like phrases or text snippets in particular. At first sight, suffix trees address this problem. To curb the memory usage of a suffix tree, one often prunes the tree to a certain depth. But this pruning method "takes away" more information from long strings than from short ones. This problem is particularly severe with sets of long strings, the setting studied here. In this article, we propose respective pruning techniques. Our approaches remove characters with low information value. The various variants determine a character's information value in different ways, e.g., by using conditional entropy with respect to previous characters in the string. Our experiments show that, in contrast to the well-known pruned suffix tree, our technique provides significantly better estimations when the tree size is reduced by 60% or less. Due to the redundancy of natural language, our pruning techniques yield hardly any error for tree-size reductions of up to 50%.

Download pdf

© Jens Willkomm, Martin Schäler, and Klemens Böhm 2021.

This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version was published in Database Systems for Advanced Applications (DASFAA '21), April 11-14, 2021, Taipei, Taiwan, https://doi.org/10.1007/978-3-030-73197-7_50.

Citation

Cite this paper as:

Willkomm J., Schäler M., Böhm K. (2021) Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees. In: Jensen C.S. et al. (eds) Database Systems for Advanced Applications. DASFAA 2021. Lecture Notes in Computer Science, vol 12682. Springer, Cham. https://doi.org/10.1007/978-3-030-73197-7_50

Bibtex:
@inproceedings{Willkomm:2021:10.1007/978-3-030-73197-7_50,
     author         = {Jens Willkomm and Martin Schäler and Klemens Böhm},
     editor          = {Christian S. Jensen and Ee-Peng Lim and De-Nian Yang and Wang-Chien Lee and Vincent S. Tseng and Vana Kalogeraki and Jen-Wei Huang and Chih-Ya Shen},
     booktitle     = {Database Systems for Advanced Applications},
     title = {Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees},
     year = {2021},
     pages          = {721--737},
     publisher    = {Springer International Publishing},
     doi  = {10.1007/978-3-030-73197-7_50},
     isbn = {978-3-030-73197-7}
}