-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incremental (delta) update #928
base: main
Are you sure you want to change the base?
Conversation
Deploying datachain-documentation with
|
Latest commit: |
3e7fa37
|
Status: | ✅ Deploy successful! |
Preview URL: | https://217ce101.datachain-documentation.pages.dev |
Branch Preview URL: | https://ilongin-798-incremental-upda.datachain-documentation.pages.dev |
@ilongin it would be great to extract all logic outside of the fat file Also, should we call it incremental or delta? :) Delta seems better but I don't like it do to a conflict with Delta Lake. Any ideas? :) |
Deploying datachain-documentation with
|
Latest commit: |
8fa1534
|
Status: | ✅ Deploy successful! |
Preview URL: | https://c897b9bc.datachain-documentation.pages.dev |
Branch Preview URL: | https://ilongin-798-incremental-upda.datachain-documentation.pages.dev |
@dmpetrov one question just to be 100% sure. How do we deal with different statuses : added, modified, removed, same? My assumption is to:
Currently Regarding the name, delta makes more sense if we are not just appending new ones, otherwise it's more like incremental, but I don't have strong opinion here...both sound reasonable to me. |
Deploying datachain-documentation with
|
Latest commit: |
67824e6
|
Status: | ✅ Deploy successful! |
Preview URL: | https://b84a6f31.datachain-documentation.pages.dev |
Branch Preview URL: | https://ilongin-798-incremental-upda.datachain-documentation.pages.dev |
Let's use the same default for the incremental update.
Then let's use Delta 🙂 |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #928 +/- ##
==========================================
+ Coverage 87.65% 87.74% +0.08%
==========================================
Files 131 132 +1
Lines 11896 11923 +27
Branches 1622 1625 +3
==========================================
+ Hits 10428 10462 +34
+ Misses 1058 1053 -5
+ Partials 410 408 -2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Overview
This PR introduces incremental (delta) update functionality to optimize dataset updates by computing and merging diffs rather than re‐processing the entire source, and it updates related tests and supporting modules accordingly.
- Implements a new delta_update function in the core module.
- Adds functional and unit tests covering delta updates from both datasets and storage.
- Updates the save and query methods (and related documentation) to support delta processing and refactors file signal retrieval.
Reviewed Changes
File | Description |
---|---|
src/datachain/delta.py | Adds delta_update for incremental diff computation. |
tests/func/test_delta.py | Introduces tests to validate delta updates from various sources. |
tests/unit/lib/test_signal_schema.py | Adds unit tests for file signal retrieval. |
src/datachain/lib/dc.py | Updates the save method to allow a delta flag. |
src/datachain/query/dataset.py | Refactors StartingStep to QueryStep; updates type annotation. |
src/datachain/lib/signal_schema.py | Replaces contains_file() with get_file_signal() for clarity. |
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (2)
src/datachain/delta.py:36
- Consider reviewing the logic of appending the original chain's steps onto the diff chain to ensure that the order of applied transformations is correct and consistent with user expectations.
diff._query.steps += dc._query.steps
tests/func/test_delta.py:146
- Duplicate file entries (e.g., 'images/img6.jpg' and 'images/img8.jpg' appear twice) in the expected output may indicate an unintended behavior in the union operation. Please confirm whether duplicates are intentional or if filtering/aggregation is needed.
"images/img6.jpg",
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me 👍🔥
|
||
|
||
def test_get_file_signal(): | ||
assert SignalSchema({"name": str, "f": File}).get_file_signal() == "f" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also test for nested file signal or is it out of the scope of this method?
E.g.:
class CustomModel(DataModel):
file: File
foo: str
bar: float
assert SignalSchema({"name": str, "custom": CustomModel}).get_file_signal() == "custom.file"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now it only works for top level file objects. In future we can add nested as well if needed
Adding the ability to do incremental, or delta updates with dataset. Idea is to not re-build the whole dataset from source once source has some changes (new or modified files in s3 bucket for example), but to create diff chain between the last version of dataset and source with all modifications added (chain methods like mappers, filters etc. what has been added to create original chain from source), and then merge diff with last version of dataset. This way we will have much better performance.
The way user will run delta updates is to just re-run the whole script where it creates dataset, with one small modification - adding
delta=True
onDataChain.save()
method.Example: