[Discussion] [Star Tree] Percentile Aggregation support in Star Tree using TDigest Algorithm. #17504

Shailesh-Kumar-Singh · 2025-03-04T12:25:35Z

Is your feature request related to a problem? Please describe

We need to support Percentile Aggregation in Star Tree and hence I was exploring the TDigest Algorithm for it.

Describe the solution you'd like

Current Star Tree Aggregation Process

When building a reduced Star Tree, multiple documents (with the same values across the first N-1 fields) are combined into a single reduced Star Tree document. For example, if there are N fields and we want to compute percentiles on the Nth field, then for all documents where the first N-1 fields are identical, we combine these into a single reduced document.
This works well for simple aggregations like SUM. For example:

F1	F2	F3
0	1	2
0	1	3
0	1	4
0	2	3
0	2	4
0	2	5

The above can be reduced into:

F1	F2	F3 (SUM)
0	1	9
0	2	12

This works efficiently because SUM produces a single compact value for each reduced document, regardless of the number of source documents being combined.

Challenges with Percentile Aggregation

Percentile aggregation is fundamentally different from basic aggregations like SUM. Instead of storing a single numeric value, we need to encode and store the distribution of values for the Nth field using a TDigest structure. [ This stores the compression, centroidCount, and a List - Collection].

The TDigest combines data points into a compressed form that allows approximate percentile calculations. For each reduced Star Tree document, we would build a TDigestState object representing all M documents being merged.
This process can be illustrated with:

double compression = 100;
TDigestState digest = new TDigestState(compression);
int tDigestCardinality = 10000;
Random random = new Random();
for (int j = 0; j < M; j++) {
    digest.add(random.nextDouble(tDigestCardinality));
}

In this example,
M documents are being combined, and for each document, the value from the Nth field is added to the TDigest state.

Storage Concerns

The size of a TDigestState object is highly variable and depends on both the number of values added and the order in which they are inserted. The worst-case size can reach up to 20 times the compression factor. (Note - The compression factor is 100 by default). This worst case occurs when values are inserted in strictly increasing or strictly decreasing order — though there are potential algorithmic modifications that could mitigate this, which we can explore further if needed.

To estimate practical storage requirements, I inserted using random values (representing average-case behavior). I observed the following approximate relationship between the number of combined documents and the resulting TDigest object size:

Number of Documents Combined	Approximate TDigest Size (relative to initial size S)
1 document	S (approximately 20 bytes)
100 documents	60-70x S
1,000 documents	110-120x S
10,000 documents	140-150x S
100,000 documents	180-190x S

Due to the high storage cost associated with maintaining TDigests in Star Tree, we have decided not to proceed with supporting Percentile Aggregation in Star Tree at this time.
We will work on supporting other star tree queries and come back to percentiles aggregations if OpenSearch users have use cases around this despite the storage cost. The community can feel free to update this issue as well if they find this feature's usefulness or suggestions in terms of improving the feature.

Related component

Search:Aggregations

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

Shailesh-Kumar-Singh added enhancement Enhancement or improvement to existing feature or request untriaged labels Mar 4, 2025

github-actions bot added the Search:Aggregations label Mar 4, 2025

github-project-automation bot added this to Search Project Board Mar 4, 2025

github-project-automation bot moved this to 🆕 New in Search Project Board Mar 4, 2025

sandeshkr419 removed the untriaged label Mar 5, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Discussion] [Star Tree] Percentile Aggregation support in Star Tree using TDigest Algorithm. #17504

[Discussion] [Star Tree] Percentile Aggregation support in Star Tree using TDigest Algorithm. #17504

Shailesh-Kumar-Singh commented Mar 4, 2025

[Discussion] [Star Tree] Percentile Aggregation support in Star Tree using TDigest Algorithm. #17504

[Discussion] [Star Tree] Percentile Aggregation support in Star Tree using TDigest Algorithm. #17504

Comments

Shailesh-Kumar-Singh commented Mar 4, 2025

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Related component

Describe alternatives you've considered

Additional context