Add tokenizer for the new sql parser #4024
Conversation
common/ast/Cargo.toml (Outdated)

@@ -8,12 +8,12 @@ edition = "2021"

[lib]
doctest = false
test = false
Did we intend to disable the unit tests?
Hi, Databend's unit tests build too slowly, so we have moved all of them to integration tests instead. Please take a look at #3473.
Currently, we do not have any so-called unit tests :)
Two methods:
- Add the test to src/tests and remove test = false.
- Make it public for now and put it under tests/it.

The vast majority of tests in Databend follow the second method. We are exploring new ways to balance the testing needs of private modules.
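To illustrate the second method, here is a minimal sketch of what an integration test under tests/it might look like; the crate name, module path, and function are placeholders, not the actual Databend API.

// tests/it/my_feature.rs (illustrative path)
// Integration tests link the crate as an external dependency, so any
// item exercised here must be `pub` in the crate's public API.
use my_crate::my_feature::do_work; // `my_crate` and `do_work` are placeholders

#[test]
fn do_work_smoke_test() {
    // Exercise the public API just like an external consumer would.
    assert!(do_work().is_ok());
}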
Oh, then the fabulous mocking tool mockall will not work...
@andylokandy We do use mockall, but I'm not sure how you plan to use it. Here is one of our use cases, and the only one. 😭 https://github.com/datafuselabs/databend/blob/c54f06fc3788307aad67511e494dece562cb6d67/common/management/tests/it/user.rs
> Two methods:
> - Add the test to src/tests and remove test = false.
> - Make it public for now and put it under tests/it.

Thanks, I'm pretty clear now.
> but I'm not sure how you plan to use it.

Think of two structs A and B, where B is one of the fields of A. In the unit test, where #[cfg(test)] is on, I'll mock a B and inject it into A to test how A interacts with B. In the integration test, where #[cfg(test)] is off, I'll test A with the real B.
To give a concrete example:
#[cfg(test)]
use other::MockB as B;
#[cfg(not(test))]
use other::B;

pub struct A {
    inner: B,
}

mod other {
    pub struct B {
        // ...
    }

    // `automock` is applied to the impl block, so a `MockB` with
    // `expect_*` methods is generated in test builds only.
    #[cfg_attr(test, mockall::automock)]
    impl B {
        // `value` is a made-up method, just for illustration.
        pub fn value(&self) -> i32 {
            0
        }
    }
}
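To round this out, a hedged sketch of the test side of that pattern; the value method and the numbers above are made up for illustration. Under cfg(test), B resolves to MockB, so A can be built around a mock with canned expectations.

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn a_interacts_with_mocked_b() {
        // `B` is `MockB` here because of the cfg-gated `use` above.
        let mut mock = B::new();
        mock.expect_value().returning(|| 42);

        let a = A { inner: mock };
        // A real test would drive `A`'s own methods, which in turn call into `inner`.
        assert_eq!(a.inner.value(), 42);
    }
}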
got it, thanks. It looks like our testing style is not up to the task for now.
@andylokandy It's great to see you in the Databend community!
CommentBlock,

#[regex(r#"[_a-zA-Z][_$a-zA-Z0-9]*"#)]
Ident,
Ident should be distinguished from reserved words. For example:

SELECT select FROM t;        -- not allowed
SELECT select_result FROM t; -- allowed

I think we either need a complicated regular expression to handle this, or an extra pass to check the validity of Idents (not a good choice). What do you think?
I think this should be the work of the parser. For example, your first example will be lexed into SELECT SELECT FROM Ident(t), on which the parser should be able to detect the grammar error.
I was concerned about the ambiguity between reserved words and identifiers, but I found that logos does have a clear rule to handle this. It looks good to me now.
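For reference, a minimal sketch of that rule as I understand it (the variant set here is illustrative, not the exact one in this PR): logos picks the longest match first and, on a tie, the higher-priority pattern, so an exact #[token] keyword beats the Ident regex for "select" while "select_result" still lexes as Ident.

use logos::Logos;

#[allow(non_camel_case_types)]
#[derive(Logos, Clone, Copy, Debug, PartialEq)]
pub enum TokenKind {
    #[error]
    Error,

    #[regex(r"[ \t\n\f]+", logos::skip)]
    Whitespace,

    // Exact keyword match; ties with the Ident regex on "select" and wins on priority.
    #[token("SELECT", ignore(ascii_case))]
    SELECT,

    // General identifier; wins on "select_result" because the match is longer.
    #[regex(r#"[_a-zA-Z][_$a-zA-Z0-9]*"#)]
    Ident,
}

fn main() {
    let kinds: Vec<TokenKind> = TokenKind::lexer("select select_result").collect();
    assert_eq!(kinds, vec![TokenKind::SELECT, TokenKind::Ident]);
}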
#[allow(non_camel_case_types)]
#[derive(Logos, Clone, Copy, Debug, PartialEq)]
pub enum TokenKind {
    #[error]
What about reserving a TokenKind::Error for this?
This is required by logos according to its documentation.
...
    #[error]
    Error,

    #[regex(r"[ \t\n\f]+", logos::skip)]
    Whitespace,
...

This can work fine. And IMO it's more reasonable to give an invalid token an Error instead.
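For the record, a small sketch of that behaviour, reusing the illustrative TokenKind defined in the earlier sketch: an unrecognised character simply yields TokenKind::Error and lexing continues.

// Reuses the illustrative `TokenKind` from the earlier sketch.
use logos::Logos;

fn main() {
    // '?' matches none of the patterns, so it becomes TokenKind::Error,
    // while lexing continues for the rest of the input.
    let kinds: Vec<TokenKind> = TokenKind::lexer("a ? b").collect();
    assert_eq!(
        kinds,
        vec![TokenKind::Ident, TokenKind::Error, TokenKind::Ident]
    );
}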
Addressed
Codecov Report

@@           Coverage Diff            @@
##            main    #4024     +/-   ##
=========================================
  Coverage     57%      57%
=========================================
  Files        817      819       +2
  Lines      43402    43425      +23
=========================================
+ Hits       24787    24805      +18
- Misses     18615    18620       +5

Continue to review full report at Codecov.
@andylokandy Is this PR ready for merging? If so, I will approve and merge it.

@leiysky Ok, let's roll!

Wait for another reviewer approval.
I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/
Summary
While attempting to implement #866, and after some deep discussion with @leiysky, we've decided to reimplement the SQL parser with nom. However, while the nom idiom is to parse directly on &str, we, like most other SQL parsers, need a tokenizer to achieve better error messages and better maintainability of the parser. So this PR adds the tokenizer as the first step of implementing the new SQL parser.

I've investigated the tokenizer of sqlparser-rs and tried to implement one using the regex crate. In the end, the logos crate achieved higher performance with the least effort.

The lexer rules in this PR follow the official lexical reference of PostgreSQL, and the keyword definitions are copied verbatim from sqlparser-rs, which I think may contain keywords not defined in PostgreSQL.
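To make the "tokenize first, then parse" benefit concrete, here is a hedged sketch (the Token struct and tokenize function are illustrative, not necessarily the API in this PR), reusing the illustrative TokenKind from the earlier sketch: each token keeps its byte span, so the nom-based parser can report errors against the original SQL text.

use logos::Logos;
use std::ops::Range;

// A token paired with the byte range it was lexed from, so parser
// diagnostics can point back at the original SQL source.
pub struct Token<'a> {
    pub kind: TokenKind,
    pub text: &'a str,
    pub span: Range<usize>,
}

pub fn tokenize(source: &str) -> Vec<Token<'_>> {
    let mut lexer = TokenKind::lexer(source);
    let mut tokens = Vec::new();
    while let Some(kind) = lexer.next() {
        tokens.push(Token {
            kind,
            text: lexer.slice(),
            span: lexer.span(),
        });
    }
    tokens
}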
Changelog
Related Issues
Ref #866
Test Plan
Unit Tests