Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] [Star Tree] Percentile Aggregation support in Star Tree using TDigest Algorithm. #17504

Open
Shailesh-Kumar-Singh opened this issue Mar 4, 2025 · 0 comments
Labels
enhancement Enhancement or improvement to existing feature or request Search:Aggregations

Comments

@Shailesh-Kumar-Singh
Copy link
Contributor

Is your feature request related to a problem? Please describe

We need to support Percentile Aggregation in Star Tree and hence I was exploring the TDigest Algorithm for it.

Describe the solution you'd like

Current Star Tree Aggregation Process

When building a reduced Star Tree, multiple documents (with the same values across the first N-1 fields) are combined into a single reduced Star Tree document. For example, if there are N fields and we want to compute percentiles on the Nth field, then for all documents where the first N-1 fields are identical, we combine these into a single reduced document.
This works well for simple aggregations like SUM. For example:

F1 F2 F3
0 1 2
0 1 3
0 1 4
0 2 3
0 2 4
0 2 5

The above can be reduced into:

F1 F2 F3 (SUM)
0 1 9
0 2 12

This works efficiently because SUM produces a single compact value for each reduced document, regardless of the number of source documents being combined.

Challenges with Percentile Aggregation

Percentile aggregation is fundamentally different from basic aggregations like SUM. Instead of storing a single numeric value, we need to encode and store the distribution of values for the Nth field using a TDigest structure. [ This stores the compression, centroidCount, and a List - Collection].

The TDigest combines data points into a compressed form that allows approximate percentile calculations. For each reduced Star Tree document, we would build a TDigestState object representing all M documents being merged.
This process can be illustrated with:

double compression = 100;
TDigestState digest = new TDigestState(compression);
int tDigestCardinality = 10000;
Random random = new Random();
for (int j = 0; j < M; j++) {
    digest.add(random.nextDouble(tDigestCardinality));
}

In this example,
M documents are being combined, and for each document, the value from the Nth field is added to the TDigest state.

Storage Concerns

The size of a TDigestState object is highly variable and depends on both the number of values added and the order in which they are inserted. The worst-case size can reach up to 20 times the compression factor. (Note - The compression factor is 100 by default). This worst case occurs when values are inserted in strictly increasing or strictly decreasing order — though there are potential algorithmic modifications that could mitigate this, which we can explore further if needed.

To estimate practical storage requirements, I inserted using random values (representing average-case behavior). I observed the following approximate relationship between the number of combined documents and the resulting TDigest object size:

Number of Documents Combined Approximate TDigest Size (relative to initial size S)
1 document S (approximately 20 bytes)
100 documents 60-70x S
1,000 documents 110-120x S
10,000 documents 140-150x S
100,000 documents 180-190x S

Due to the high storage cost associated with maintaining TDigests in Star Tree, we have decided not to proceed with supporting Percentile Aggregation in Star Tree at this time.
We will work on supporting other star tree queries and come back to percentiles aggregations if OpenSearch users have use cases around this despite the storage cost. The community can feel free to update this issue as well if they find this feature's usefulness or suggestions in terms of improving the feature.

Related component

Search:Aggregations

Describe alternatives you've considered

No response

Additional context

No response

@Shailesh-Kumar-Singh Shailesh-Kumar-Singh added enhancement Enhancement or improvement to existing feature or request untriaged labels Mar 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Aggregations
Projects
Status: 🆕 New
Development

No branches or pull requests

2 participants