Self-Tuning Histogramms

Efficient query processing in databases requires that the database system can estimate selectivities of queries, i.e., which percentage of the data objects satisfy the query predicates. Histograms are data structures commonly used for such estimations. Self-tuning histograms are a variant of histograms that is easy to build and maintain, compared to static histograms: When constructing them, the system does not scan the whole data set, but only looks at past query results. Because of this property, self-tuning histograms have broader applicability than static histograms. On the other hand, the question how good selectivity estimations based on self-tuning histograms actually are has not been investigated sufficiently. An important observation in this context is that clustering algorithms and histograms have similar goals: partition the data space into regions that differ in density compared to adjacent regions. In other words, clustering algorithms could be used to construct histograms, and we want to compare the quality of these histograms to the one of self-tuning histograms (and of static conventional histograms). Another goal of this project is to quantify the uncertainties in selectivity estimation using self-tuning histograms. Instead of estimating only an expected value, as is typically done so far, we want to deduce a probability distribution of selectivities. Having such a distribution is known to improve query processing. We have certain results regarding this part and have recently submitted a paper to a conference. Having addressed the issues of estimation quality and imprecision in estimations, a natural next step is to generalize the concept of self-tuning histograms. Estimating response time of services depending on the parameter values seems to be similar to selectivity estimation, given the parameters of the query. However, the difference is that the response time of a service can vary from one invocation to another, even if the two have exactly the same parameter values. Again, the vision is to have (a concise summary of) the history of response times, and to use it to estimate future ones.