You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem? Please describe
We need to support Percentile Aggregation in Star Tree and hence I was exploring the TDigest Algorithm for it.
Describe the solution you'd like
Current Star Tree Aggregation Process
When building a reduced Star Tree, multiple documents (with the same values across the first N-1 fields) are combined into a single reduced Star Tree document. For example, if there are N fields and we want to compute percentiles on the Nth field, then for all documents where the first N-1 fields are identical, we combine these into a single reduced document.
This works well for simple aggregations like SUM. For example:
F1
F2
F3
0
1
2
0
1
3
0
1
4
0
2
3
0
2
4
0
2
5
The above can be reduced into:
F1
F2
F3 (SUM)
0
1
9
0
2
12
This works efficiently because SUM produces a single compact value for each reduced document, regardless of the number of source documents being combined.
Challenges with Percentile Aggregation
Percentile aggregation is fundamentally different from basic aggregations like SUM. Instead of storing a single numeric value, we need to encode and store the distribution of values for the Nth field using a TDigest structure. [ This stores the compression, centroidCount, and a List - Collection].
The TDigest combines data points into a compressed form that allows approximate percentile calculations. For each reduced Star Tree document, we would build a TDigestState object representing all M documents being merged.
This process can be illustrated with:
double compression = 100;
TDigestState digest = new TDigestState(compression);
int tDigestCardinality = 10000;
Random random = new Random();
for (int j = 0; j < M; j++) {
digest.add(random.nextDouble(tDigestCardinality));
}
In this example,
M documents are being combined, and for each document, the value from the Nth field is added to the TDigest state.
Storage Concerns
The size of a TDigestState object is highly variable and depends on both the number of values added and the order in which they are inserted. The worst-case size can reach up to 20 times the compression factor. (Note - The compression factor is 100 by default). This worst case occurs when values are inserted in strictly increasing or strictly decreasing order — though there are potential algorithmic modifications that could mitigate this, which we can explore further if needed.
To estimate practical storage requirements, I inserted using random values (representing average-case behavior). I observed the following approximate relationship between the number of combined documents and the resulting TDigest object size:
Number of Documents Combined
Approximate TDigest Size (relative to initial size S)
1 document
S (approximately 20 bytes)
100 documents
60-70x S
1,000 documents
110-120x S
10,000 documents
140-150x S
100,000 documents
180-190x S
Due to the high storage cost associated with maintaining TDigests in Star Tree, we have decided not to proceed with supporting Percentile Aggregation in Star Tree at this time.
We will work on supporting other star tree queries and come back to percentiles aggregations if OpenSearch users have use cases around this despite the storage cost. The community can feel free to update this issue as well if they find this feature's usefulness or suggestions in terms of improving the feature.
Related component
Search:Aggregations
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered:
Is your feature request related to a problem? Please describe
We need to support Percentile Aggregation in Star Tree and hence I was exploring the TDigest Algorithm for it.
Describe the solution you'd like
Current Star Tree Aggregation Process
When building a reduced Star Tree, multiple documents (with the same values across the first N-1 fields) are combined into a single reduced Star Tree document. For example, if there are N fields and we want to compute percentiles on the Nth field, then for all documents where the first N-1 fields are identical, we combine these into a single reduced document.
This works well for simple aggregations like SUM. For example:
The above can be reduced into:
This works efficiently because SUM produces a single compact value for each reduced document, regardless of the number of source documents being combined.
Challenges with Percentile Aggregation
Percentile aggregation is fundamentally different from basic aggregations like SUM. Instead of storing a single numeric value, we need to encode and store the distribution of values for the Nth field using a TDigest structure. [ This stores the compression, centroidCount, and a List - Collection].
The TDigest combines data points into a compressed form that allows approximate percentile calculations. For each reduced Star Tree document, we would build a TDigestState object representing all M documents being merged.
This process can be illustrated with:
In this example,
M documents are being combined, and for each document, the value from the Nth field is added to the TDigest state.
Storage Concerns
The size of a TDigestState object is highly variable and depends on both the number of values added and the order in which they are inserted. The worst-case size can reach up to 20 times the compression factor. (Note - The compression factor is 100 by default). This worst case occurs when values are inserted in strictly increasing or strictly decreasing order — though there are potential algorithmic modifications that could mitigate this, which we can explore further if needed.
To estimate practical storage requirements, I inserted using random values (representing average-case behavior). I observed the following approximate relationship between the number of combined documents and the resulting TDigest object size:
Due to the high storage cost associated with maintaining TDigests in Star Tree, we have decided not to proceed with supporting Percentile Aggregation in Star Tree at this time.
We will work on supporting other star tree queries and come back to percentiles aggregations if OpenSearch users have use cases around this despite the storage cost. The community can feel free to update this issue as well if they find this feature's usefulness or suggestions in terms of improving the feature.
Related component
Search:Aggregations
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: