Add tokenizer for the new sql parser #4024
Conversation
common/ast/Cargo.toml (Outdated)

@@ -8,12 +8,12 @@ edition = "2021"

[lib]
doctest = false
test = false
Did we intend to disable the unit tests?
Hi, Databend's unit tests build too slowly, so we have moved all of them to integration tests instead. Please take a look at #3473.
Currently, we do not have any so-called unit tests :)
Two methods:
- Add the test to src/tests and remove test = false.
- Make it public for now and put it under tests/it.

The vast majority of tests in Databend follow the second method. We are exploring new ways to balance the testing needs of private modules.
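To illustrate the second method, here is a minimal sketch of what an integration test under tests/it might look like; the crate name, module path, and function are placeholders, not the actual Databend API.

// tests/it/my_feature.rs (illustrative path)
// Integration tests link the crate as an external dependency, so any
// item exercised here must be `pub` in the crate's public API.
use my_crate::my_feature::do_work; // `my_crate` and `do_work` are placeholders

#[test]
fn do_work_smoke_test() {
    // Exercise the public API just like an external consumer would.
    assert!(do_work().is_ok());
}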
Oh, then the fabulous mocking tool mockall will not work...
@andylokandy We do use mockall, but I'm not sure how you plan to use it. Here is one of our use cases, and the only one. 😭 https://github.com/datafuselabs/databend/blob/c54f06fc3788307aad67511e494dece562cb6d67/common/management/tests/it/user.rs
> Two methods:
> - Add the test to src/tests and remove test = false.
> - Make it public for now and put it under tests/it.

Thanks, I'm pretty clear now.
> but I'm not sure how you plan to use it.

Think of two structs A and B, where B is one of the fields of A. In the unit test, where #[cfg(test)] is on, I'll mock a B and inject it into A to test how A interacts with B. In the integration test, where #[cfg(test)] is off, I'll test A with the real B.
To give a concrete example:
#[cfg(test)]
use other::MockB as B;
#[cfg(not(test))]
use other::B;

pub struct A {
    inner: B,
}

mod other {
    pub struct B {
        // ...
    }

    // `automock` is applied to the impl block, so a `MockB` with
    // `expect_*` methods is generated in test builds only.
    #[cfg_attr(test, mockall::automock)]
    impl B {
        // `value` is a made-up method, just for illustration.
        pub fn value(&self) -> i32 {
            0
        }
    }
}
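To round this out, a hedged sketch of the test side of that pattern; the value method and the numbers above are made up for illustration. Under cfg(test), B resolves to MockB, so A can be built around a mock with canned expectations.

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn a_interacts_with_mocked_b() {
        // `B` is `MockB` here because of the cfg-gated `use` above.
        let mut mock = B::new();
        mock.expect_value().returning(|| 42);

        let a = A { inner: mock };
        // A real test would drive `A`'s own methods, which in turn call into `inner`.
        assert_eq!(a.inner.value(), 42);
    }
}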
got it, thanks. It looks like our testing style is not up to the task for now.
@andylokandy It's great to see you in the Databend community!
CommentBlock,

#[regex(r#"[_a-zA-Z][_$a-zA-Z0-9]*"#)]
Ident,
Ident should be distinguished from reserved words. For example:

SELECT select FROM t;        -- not allowed
SELECT select_result FROM t; -- allowed

I think we either need a complicated regular expression to handle this, or an extra pass to check the validity of Idents (not a good choice). What do you think?
I think this should be the work of the parser. For example, your first example will be lexed into SELECT SELECT FROM Ident(t), on which the parser should be able to detect the grammar error.
I was concerned about the ambiguity between reserved words and identifiers, but I found that logos does have a clear rule to handle this. It looks good to me now.
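For reference, a minimal sketch of that rule as I understand it (the variant set here is illustrative, not the exact one in this PR): logos picks the longest match first and, on a tie, the higher-priority pattern, so an exact #[token] keyword beats the Ident regex for "select" while "select_result" still lexes as Ident.

use logos::Logos;

#[allow(non_camel_case_types)]
#[derive(Logos, Clone, Copy, Debug, PartialEq)]
pub enum TokenKind {
    #[error]
    Error,

    #[regex(r"[ \t\n\f]+", logos::skip)]
    Whitespace,

    // Exact keyword match; ties with the Ident regex on "select" and wins on priority.
    #[token("SELECT", ignore(ascii_case))]
    SELECT,

    // General identifier; wins on "select_result" because the match is longer.
    #[regex(r#"[_a-zA-Z][_$a-zA-Z0-9]*"#)]
    Ident,
}

fn main() {
    let kinds: Vec<TokenKind> = TokenKind::lexer("select select_result").collect();
    assert_eq!(kinds, vec![TokenKind::SELECT, TokenKind::Ident]);
}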
#[allow(non_camel_case_types)]
#[derive(Logos, Clone, Copy, Debug, PartialEq)]
pub enum TokenKind {
    #[error]
What about reserving a TokenKind::Error for this?
This is required by logos according to its documentation.
...
    #[error]
    Error,

    #[regex(r"[ \t\n\f]+", logos::skip)]
    Whitespace,
...

This can work fine. And IMO it's more reasonable to give an invalid token an Error instead.
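For the record, a small sketch of that behaviour, reusing the illustrative TokenKind defined in the earlier sketch: an unrecognised character simply yields TokenKind::Error and lexing continues.

// Reuses the illustrative `TokenKind` from the earlier sketch.
use logos::Logos;

fn main() {
    // '?' matches none of the patterns, so it becomes TokenKind::Error,
    // while lexing continues for the rest of the input.
    let kinds: Vec<TokenKind> = TokenKind::lexer("a ? b").collect();
    assert_eq!(
        kinds,
        vec![TokenKind::Ident, TokenKind::Error, TokenKind::Ident]
    );
}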
Addressed
Codecov Report

@@           Coverage Diff            @@
##            main    #4024     +/-   ##
=========================================
  Coverage     57%      57%
=========================================
  Files        817      819       +2
  Lines      43402    43425      +23
=========================================
+ Hits       24787    24805      +18
- Misses     18615    18620       +5

Continue to review full report at Codecov.
@andylokandy Is this PR ready for merging? If so, I will approve and merge it.

@leiysky Ok, let's roll!

Wait for another reviewer approval.
I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/
Summary
While attempting to implement #866, and after some deep discussion with @leiysky, we've decided to reimplement the SQL parser with nom. However, while the nom idiom is to parse directly on &str, we, like most other SQL parsers, need a tokenizer to achieve better error messages and better maintainability of the parser. So this PR adds the tokenizer as the first step of implementing the new SQL parser.

I've investigated the tokenizer of sqlparser-rs and tried to implement one using the regex crate. In the end, the logos crate achieved higher performance with the least effort.

The lexer rules in this PR follow the official lexical reference of PostgreSQL, and the keyword definitions are copied verbatim from sqlparser-rs, which I think may contain keywords not defined in PostgreSQL.
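To make the "tokenize first, then parse" benefit concrete, here is a hedged sketch (the Token struct and tokenize function are illustrative, not necessarily the API in this PR), reusing the illustrative TokenKind from the earlier sketch: each token keeps its byte span, so the nom-based parser can report errors against the original SQL text.

use logos::Logos;
use std::ops::Range;

// A token paired with the byte range it was lexed from, so parser
// diagnostics can point back at the original SQL source.
pub struct Token<'a> {
    pub kind: TokenKind,
    pub text: &'a str,
    pub span: Range<usize>,
}

pub fn tokenize(source: &str) -> Vec<Token<'_>> {
    let mut lexer = TokenKind::lexer(source);
    let mut tokens = Vec::new();
    while let Some(kind) = lexer.next() {
        tokens.push(Token {
            kind,
            text: lexer.slice(),
            span: lexer.span(),
        });
    }
    tokens
}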
Changelog
Related Issues
Ref #866
Test Plan
Unit Tests