Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tokenizer for the new sql parser #4024

Merged
merged 4 commits into from
Jan 30, 2022
Merged

Conversation

andylokandy
Copy link
Contributor

@andylokandy andylokandy commented Jan 29, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

As attempting to implement #866, and after some deep discussion with @leiysky, we've decided to reimplement the sql parser with nom. But, while the idiom of nom parses directly on &str, we, like most of the other sql parsers, need a tokenizer to achieve better error messaging and better maintainability of the parser. So this is the PR to add the tokenizer as the first step of implementing the new sql parser.

I've investigated the tokenizer of sqlparser-rs and have tried to implement one using the crate regex. As the result, the crate logos achieved higher performance while paying the least effort.

The lexer rules in this PR follow this official reference of PostgreSQL, and the keyword definitions are completely copied from sqlparser-rs which I think may contain the keywords not defined in PostgreSQL.

Changelog

  • New Feature

Related Issues

Ref #866

Test Plan

Unit Tests

@andylokandy andylokandy requested a review from BohuTANG as a code owner January 29, 2022 12:26
@databend-bot databend-bot added pr-feature this PR introduces a new feature to the codebase labels Jan 29, 2022
@databend-bot
Copy link
Member

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

1 similar comment
@databend-bot
Copy link
Member

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

@vercel
Copy link

vercel bot commented Jan 29, 2022

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/databend/databend/123rP78WQfiuCxrBHPb31draEDVj
✅ Preview: Canceled

[Deployment for dfd529d canceled]

@@ -8,12 +8,12 @@ edition = "2021"

[lib]
doctest = false
test = false
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we intended to disable the unit tests?

Copy link
Member

@Xuanwo Xuanwo Jan 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, databend's unit tests are built too slow, so we move all of them to integration tests instead. Please take a look at #3473

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, we do not have any so-called unit tests :)

Copy link
Contributor Author

@andylokandy andylokandy Jan 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PsiACE @Xuanwo Ah, thanks for your fast response. So I'm curious about how do you test the private interfaces. In addition, how do you test with some mock technics with #[cfg(test)].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PsiACE @Xuanwo Ah, thanks for your fast response. So I'm curious about how do you test the private interfaces. In addition, how do you test with some mock technics with #[cfg(test)].

By making it pub 😆

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two methods:

  1. add the test to src/tests , remove test = false.
  2. Make it public for now and put it under tests/it.

The vast majority of tests in Databend follow the 2nd method.

We are exploring new ways to balance the testing needs of private modules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, then the fabulous mocking tool mockall will not work...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andylokandy We do use mockall, but I'm not sure how you plan to use it. Here is one of our use cases, and the only one. 😭 https://github.com/datafuselabs/databend/blob/c54f06fc3788307aad67511e494dece562cb6d67/common/management/tests/it/user.rs

Copy link
Contributor Author

@andylokandy andylokandy Jan 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two methods:

  1. add the test to src/tests , remove test = false.
  2. Make it public for now and put it under tests/it.

Thanks, I'm pretty clear now.

but I'm not sure how you plan to use it.

Think of two structs A and B, where B is one of the fields of A. In the unit test, where #[cfg(test)] is on, I'll mock a B and inject it into A to test how A interacts with B. In the integration test, where #[cfg(test)] is off, I'll test A with the real B.

To give a concrete example:

#[cfg(test)]
use other::MockB as B;
#[not(cfg(test))]
use other::B;

pub struct A {
    inner: B
}

mod other {
    #[cfg_attr(test, mockall::automock)]
    pub struct B {
        // ...
    }
} 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thanks. It looks like our testing style is not up to the task for now.

@Xuanwo
Copy link
Member

Xuanwo commented Jan 29, 2022

@andylokandy It's great to see you in the databend community!

@Xuanwo Xuanwo requested a review from leiysky January 29, 2022 13:00
CommentBlock,

#[regex(r#"[_a-zA-Z][_$a-zA-Z0-9]*"#)]
Ident,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ident should be distinguished with reserved words. For example:

SELECT select FROM t; -- not allowed
SELECT select_result FROM t; -- allowed

I think we need a complicated regular expression to handle this, or make an extra pass to check validity of Idents(not a good choice).

How do you think about this?

Copy link
Contributor Author

@andylokandy andylokandy Jan 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be the work of the parser, e.g, your first example will be lexed into SELECT SELECT FROM ident(t) on which the parser should be able to detect the grammar error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was concerning about the ambiguation between reserved words and identifier, but I found logos do have a clear rule to handle this.

It looks good to me now.

#[allow(non_camel_case_types)]
#[derive(Logos, Clone, Copy, Debug, PartialEq)]
pub enum TokenKind {
#[error]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about reserve a TokenKind::Error for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required by logos according to its documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...
    #[error]
    Error,

    #[regex(r"[ \t\n\f]+", logos::skip)]
    Whitespace,
...

This can work fine. And IMO it's more reasonible to give invalid token a Error instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

@codecov-commenter
Copy link

codecov-commenter commented Jan 29, 2022

Codecov Report

Merging #4024 (dfd529d) into main (b9070da) will increase coverage by 0%.
The diff coverage is 69%.

Impacted file tree graph

@@          Coverage Diff          @@
##            main   #4024   +/-   ##
=====================================
  Coverage     57%     57%           
=====================================
  Files        817     819    +2     
  Lines      43402   43425   +23     
=====================================
+ Hits       24787   24805   +18     
- Misses     18615   18620    +5     
Impacted Files Coverage Δ
common/ast/src/error.rs 0% <0%> (ø)
common/ast/src/parser/mod.rs 81% <ø> (ø)
common/ast/src/parser/token.rs 80% <80%> (ø)
metasrv/src/network.rs 97% <0%> (-2%) ⬇️
metasrv/src/meta_service/meta_service_impl.rs 92% <0%> (-2%) ⬇️
metasrv/src/meta_service/raftmeta.rs 48% <0%> (ø)
common/management/src/cluster/cluster_mgr.rs 80% <0%> (+1%) ⬆️
common/dal/src/context.rs 88% <0%> (+2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9070da...dfd529d. Read the comment docs.

@leiysky
Copy link
Contributor

leiysky commented Jan 30, 2022

@andylokandy Is this PR ready for merging? If so I will approve and merge it.

@andylokandy
Copy link
Contributor Author

@leiysky Ok, let's roll!

@databend-bot
Copy link
Member

Wait for another reviewer approval

@sundy-li sundy-li merged commit 0921b0d into databendlabs:main Jan 30, 2022
@leiysky leiysky mentioned this pull request Jan 31, 2022
15 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need-review pr-feature this PR introduces a new feature to the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants