Parsing an expression containing a huge list of token triggers a stack overflow #162

evomassiny · 2019-10-16T22:28:02Z

Hello,

I've stumbled across a weird bug in the JavaScript parser:
when a single expression uses a lot of tokens (>120 in debug mode, and >3400 in release mode),
Parser::parse() triggers a stack overflow.

Here is a minimal way to reproduce the bug:

use boa::{
    exec::{Executor, Interpreter},
    js::value::ResultValue,
    realm::Realm,
    syntax::{lexer::Lexer, parser::Parser},
};

fn main() {
    // A simple but huge expression
    const src: &str = r#"
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1;
"#;
    // build interpreter
    let realm = Realm::create();
    let mut engine: Interpreter = Executor::new(realm);
    // lex input
    let mut lexer = Lexer::new(&src);
    lexer.lex().expect("lexing failed");
    // /!\ parsing this token list triggers a Stack Overflow
    let _ = Parser::new(lexer.tokens).parse_all();
}

My guess is that the function recursively calls itself to process the right-hand side of each addition,
which slowly increases the size of the stack until it eventually overflows.

The only solution that I can think of is to re-implement the parser as a stack machine, which would basically turn the recursive calls into a while loop (but this might represent a ton of work :S ).
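
To make the guess above concrete, here is a minimal sketch that is not Boa's parser. The `Tok` type and the `chain`, `eval_recursive` and `eval_iterative` functions are hypothetical names made up for this illustration. The recursive version makes one call per operand, so its depth grows with the input; the loop version handles the same chain at constant stack depth.

```rust
/// Toy token type for chains such as `1 + 1 + 1` (hypothetical, not Boa's).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok {
    Num(i64),
    Plus,
}

/// Builds the token list for `1 + 1 + ... + 1` with `terms` operands.
fn chain(terms: usize) -> Vec<Tok> {
    let mut toks = Vec::with_capacity(terms * 2);
    for i in 0..terms {
        if i > 0 {
            toks.push(Tok::Plus);
        }
        toks.push(Tok::Num(1));
    }
    toks
}

/// Recursive version: every `+` adds one call frame.
/// Returns the value and the deepest call level that was reached.
fn eval_recursive(toks: &[Tok], pos: usize, depth: usize) -> Option<(i64, usize)> {
    let lhs = match *toks.get(pos)? {
        Tok::Num(n) => n,
        Tok::Plus => return None,
    };
    match toks.get(pos + 1) {
        None => Some((lhs, depth)),
        Some(Tok::Plus) => {
            // The right-hand side is handled by a nested call.
            let (rhs, max_depth) = eval_recursive(toks, pos + 2, depth + 1)?;
            Some((lhs + rhs, max_depth))
        }
        Some(Tok::Num(_)) => None,
    }
}

/// Loop version: same input, but the stack depth does not depend on its length.
fn eval_iterative(toks: &[Tok]) -> Option<i64> {
    let mut iter = toks.iter();
    let mut acc = match iter.next()? {
        Tok::Num(n) => *n,
        Tok::Plus => return None,
    };
    loop {
        match iter.next() {
            None => return Some(acc),
            Some(Tok::Plus) => match iter.next()? {
                Tok::Num(n) => acc += *n,
                Tok::Plus => return None,
            },
            Some(Tok::Num(_)) => return None,
        }
    }
}
```

Calling `eval_recursive(&chain(n), 0, 1)` reports a depth equal to `n`, which is the growth pattern suspected here; a real parser has much larger frames per call, so it overflows far earlier than this toy would.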

Thank you for your time,

evomassiny

jasonwilliams · 2019-10-17T08:13:48Z

Thanks @evomassiny
This is a great issue, I think we should certainly look at that. The reproduction steps are useful.

I’m not sure how the refactor would work right now but it’s something to think about. I’m open to ideas. Stack machine sounds about right.

evomassiny · 2019-10-19T21:30:35Z

Hi,

I found a way to build a parser without function recursion, by using an intermediate representation of the Abstract Syntax Tree in Polish Notation, and building an AST from it in a second pass.

The benefit of the Polish notation is its "flatness": we can represent the whole AST as a vector of expressions, which makes it easier to work with, especially using a stack machine.

The main parsing algorithm would be something like this:

Collect the tokens from the lexer.
Starting with idx = 0, repeat until every token is consumed:
- try to identify an expression pattern starting from tokens[idx..]; if one matches:
  - build the associated expression; the expression type should hold the number of sub-expressions needed to execute it, but not the sub-expressions themselves
  - push this expression onto a stack; the sub-expressions will be parsed in the following iterations
  - invalidate the parsed tokens, so the next iteration won't try to parse them again
- increment idx
Then iterate over the stack in reverse to build the actual AST, from the leaves to the trunk.
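
The second pass described above (flat Polish-notation list to tree) could look roughly like the sketch below. It is an illustration under assumptions, not the code from the toylang repo: `Flat`, `Ast` and `build_ast` are hypothetical names, and the first pass that matches token patterns is not shown. Each flat node only knows how many sub-expressions it needs, and the tree is assembled by walking the list in reverse with an explicit stack.

```rust
/// One entry of the flat, prefix-ordered list (hypothetical node set).
#[derive(Debug, Clone, PartialEq)]
enum Flat {
    Num(i64),
    Neg,
    Add,
}

impl Flat {
    /// Number of sub-expressions this node needs; the node does not store them.
    fn arity(&self) -> usize {
        match self {
            Flat::Num(_) => 0,
            Flat::Neg => 1,
            Flat::Add => 2,
        }
    }
}

/// Tree node produced by the second pass.
#[derive(Debug, PartialEq)]
struct Ast {
    node: Flat,
    children: Vec<Ast>,
}

/// Builds the tree from a prefix list without any recursive call.
/// Returns `None` if the list is not a single well-formed expression.
fn build_ast(prefix: &[Flat]) -> Option<Ast> {
    let mut stack: Vec<Ast> = Vec::new();
    // Reverse order means leaves are seen before the nodes that use them.
    for node in prefix.iter().rev() {
        let mut children = Vec::with_capacity(node.arity());
        for _ in 0..node.arity() {
            // The leftmost child was pushed last, so it is popped first.
            children.push(stack.pop()?);
        }
        stack.push(Ast { node: node.clone(), children });
    }
    let root = stack.pop()?;
    if stack.is_empty() {
        Some(root)
    } else {
        None
    }
}
```

For example, `1 + -2` in prefix form is `[Add, Num(1), Neg, Num(2)]`, and `build_ast` turns it into an `Add` node whose children are `Num(1)` and `Neg(Num(2))`. One caveat of this sketch: a very deep tree built this way is still dropped recursively by Rust's default destructor.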

The hard part is implementing the function that matches expression patterns (basically regex, but for tokens).

To convince myself that it could be done, I implemented a parser for a subset of the JavaScript language in this repo: https://github.com/evomassiny/toylang.
If you want to see the part that builds the AST in Polish Notation it's here, and the part that builds an actual AST from it is there.

I hope this helps, but to be honest I don't have much knowledge of AST parsers; there might be easier solutions.

simonbuchan · 2020-01-14T04:19:02Z

Also look into https://en.wikipedia.org/wiki/Shunting-yard_algorithm

yovoslav · 2020-01-14T15:38:51Z

thanks @simonbuchan, that is what my current WIP implementation is based upon

jasonwilliams · 2020-01-16T11:05:45Z

Also look into https://en.wikipedia.org/wiki/Shunting-yard_algorithm

@simonbuchan that's a great algorithm!
I believe our change is basically that with some tweaks

simonbrahan · 2020-01-17T10:19:57Z

Also look into https://en.wikipedia.org/wiki/Shunting-yard_algorithm

@simonbrahan thats a great algorithm!
I believe our change is basically that with some tweaks

Wrong @simon :P

jasonwilliams · 2020-01-18T17:24:43Z

Adding benchmark to keep track of expression parsing jasonwilliams#226

jrop · 2020-02-14T21:27:35Z

What algorithm does the parser currently use? I've not had the stack overflow problem with Pratt parsing. I even tried plugging the text in from #162 into my Pratt expression parser, and it parsed very quickly (and this is written in TypeScript: imagine the gains if it was in Rust).

The nice thing about Pratt parsing is that it integrates very naturally with recursive descent parsing.
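
For readers unfamiliar with Pratt parsing, here is a small sketch of why a flat chain such as the one in this issue stays shallow. It is not jrop's TypeScript parser nor Boa's code; `Tok`, `Pratt`, `binding_power` and `apply` are hypothetical names, and it evaluates directly instead of building an AST to stay short. Operators of the same precedence are consumed by the `while` loop, and recursion only happens for the right operand.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok {
    Num(i64),
    Op(char),
}

/// Higher value binds tighter; all four operators are left-associative here.
fn binding_power(op: char) -> Option<u8> {
    match op {
        '+' | '-' => Some(1),
        '*' | '/' => Some(2),
        _ => None,
    }
}

fn apply(op: char, lhs: i64, rhs: i64) -> Option<i64> {
    match op {
        '+' => lhs.checked_add(rhs),
        '-' => lhs.checked_sub(rhs),
        '*' => lhs.checked_mul(rhs),
        '/' => lhs.checked_div(rhs),
        _ => None,
    }
}

struct Pratt<'a> {
    toks: &'a [Tok],
    pos: usize,
    /// Deepest nesting of `expr` calls seen so far (for illustration only).
    max_depth: usize,
}

impl<'a> Pratt<'a> {
    fn new(toks: &'a [Tok]) -> Self {
        Pratt { toks, pos: 0, max_depth: 0 }
    }

    /// Parses operators that bind tighter than `min_bp` and evaluates them.
    fn expr(&mut self, min_bp: u8, depth: usize) -> Option<i64> {
        self.max_depth = self.max_depth.max(depth);
        let mut lhs = match *self.toks.get(self.pos)? {
            Tok::Num(n) => n,
            Tok::Op(_) => return None,
        };
        self.pos += 1;
        // Same-level operators are handled by this loop, not by recursion.
        while let Some(&Tok::Op(op)) = self.toks.get(self.pos) {
            let bp = binding_power(op)?;
            if bp <= min_bp {
                break;
            }
            self.pos += 1;
            let rhs = self.expr(bp, depth + 1)?;
            lhs = apply(op, lhs, rhs)?;
        }
        Some(lhs)
    }

    /// Parses the whole input; fails if tokens are left over.
    fn parse(&mut self) -> Option<i64> {
        let value = self.expr(0, 1)?;
        if self.pos == self.toks.len() {
            Some(value)
        } else {
            None
        }
    }
}
```

In this sketch the call depth for `1 + 1 + ... + 1` stays at 2 regardless of length. Nested constructs (parentheses, array literals) are not covered here and would still add one level of recursion per nesting level.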

jasonwilliams · 2020-02-15T00:51:08Z

This bug was fixed by #223

@jrop for this issue it was fixed using the Shunting Yard algorithm.
Also if you have better ideas, please leave a comment on this thread #225

simonbuchan · 2020-02-16T23:21:38Z

@jrop Pratt parsers are very similar to the shunting yard algorithm: both are based on attaching precedence and associativity to infix operators. The main difference seems to be that Pratt parsers store nested precedence levels on the call stack, while shunting yard handles that with an explicit stack. If you have a sensible number of precedence levels, Pratt parsing won't hit the stack limit. In theory it's a bit slower because of the higher number of calls, but in practice you won't notice a difference. Really the difference is that there's much better documentation on how to implement shunting yard; Pratt parsers tend to use weird terms like "null denotation".
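
As a rough illustration of the "explicit stack" side of that comparison, the sketch below implements a basic shunting yard for four left-associative operators and parentheses. It is not the code from #223; `Tok`, `precedence`, `to_rpn` and `eval_rpn` are hypothetical names, and the converter does only minimal validation (malformed operand sequences are caught by the evaluator instead).

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok {
    Num(i64),
    Op(char),
    LParen,
    RParen,
}

/// Higher value binds tighter; all four operators are left-associative.
fn precedence(op: char) -> Option<u8> {
    match op {
        '+' | '-' => Some(1),
        '*' | '/' => Some(2),
        _ => None,
    }
}

/// Converts infix tokens to reverse Polish notation with an explicit operator stack.
fn to_rpn(tokens: &[Tok]) -> Option<Vec<Tok>> {
    let mut output = Vec::with_capacity(tokens.len());
    let mut ops: Vec<Tok> = Vec::new();
    for &tok in tokens {
        match tok {
            Tok::Num(_) => output.push(tok),
            Tok::Op(op) => {
                let prec = precedence(op)?;
                // Left associativity: flush operators of equal or higher precedence.
                while let Some(&Tok::Op(top)) = ops.last() {
                    if precedence(top)? < prec {
                        break;
                    }
                    output.push(Tok::Op(top));
                    ops.pop();
                }
                ops.push(tok);
            }
            Tok::LParen => ops.push(tok),
            // Flush back to the matching `(`; a missing `(` yields `None`.
            Tok::RParen => loop {
                match ops.pop()? {
                    Tok::LParen => break,
                    other => output.push(other),
                }
            },
        }
    }
    while let Some(tok) = ops.pop() {
        if tok == Tok::LParen {
            return None; // unbalanced parenthesis
        }
        output.push(tok);
    }
    Some(output)
}

/// Evaluates an RPN token list with a value stack.
fn eval_rpn(rpn: &[Tok]) -> Option<i64> {
    let mut stack = Vec::new();
    for &tok in rpn {
        match tok {
            Tok::Num(n) => stack.push(n),
            Tok::Op(op) => {
                let rhs = stack.pop()?;
                let lhs = stack.pop()?;
                stack.push(match op {
                    '+' => lhs.checked_add(rhs)?,
                    '-' => lhs.checked_sub(rhs)?,
                    '*' => lhs.checked_mul(rhs)?,
                    '/' => lhs.checked_div(rhs)?,
                    _ => return None,
                });
            }
            _ => return None,
        }
    }
    let result = stack.pop()?;
    if stack.is_empty() { Some(result) } else { None }
}
```

Because both stacks live on the heap, the length of a chain or the depth of parenthesis nesting affects `Vec` capacity rather than the call stack.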

dapper-gh · 2020-10-11T17:34:23Z

This bug still occurs - here are a few examples.

1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1

[[[[[[[[[[[[[[[[[[[]]]]]]]]]]]]]]]]]]]

Both of those panic in debug mode/the website WASM demo. You may need to increase the size of the expressions to cause a panic in release mode.
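
To scale the two examples above without pasting huge literals, helpers along these lines could generate them at any size. This is only a sketch: `addition_chain` and `nested_arrays` are hypothetical names, and how the resulting string is handed to the parser depends on the Boa version, so that part is left out.

```rust
/// Builds `1+1+...+1` with `terms` operands (at least one).
fn addition_chain(terms: usize) -> String {
    vec!["1"; terms.max(1)].join("+")
}

/// Builds `depth` nested array literals, e.g. `[[[]]]` for a depth of 3.
fn nested_arrays(depth: usize) -> String {
    format!("{}{}", "[".repeat(depth), "]".repeat(depth))
}
```

The first shape stresses long operator chains, the second stresses nesting depth; they exercise different code paths in a recursive parser, so it seems worth testing both.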

Razican · 2020-10-12T17:50:02Z

I will re-open this issue, since I think we have gone back to recursive parsing with our new parser. Maybe @jasonwilliams can confirm or give more insight.

jasonwilliams · 2024-04-14T14:37:17Z

The original description doesn't overflow anymore, as there have been some optimizations, but this example still causes the issue:
https://gist.github.com/jasonwilliams/9f461a7fac0e7721702d82b05fb1012c

jasonwilliams · 2024-12-06T21:47:49Z

I think all that's left here is to add a test on the parser to make sure this doesn't happen again.

jasonwilliams · 2025-02-27T13:29:42Z

discussion is happening at #4089 (comment)

jasonwilliams added the help wanted Extra attention is needed label Oct 17, 2019

yovoslav self-assigned this Jan 12, 2020

yovoslav mentioned this issue Jan 18, 2020

Reimplement the parser #225

Closed

16 tasks

jasonwilliams closed this as completed Feb 15, 2020

Razican reopened this Oct 12, 2020

Razican unassigned yovoslav Jan 11, 2021

Razican mentioned this issue Jan 31, 2022

stack-overflow caused by deep call stack #1402

Open

jedel1043 added this to Boa pre-v1 Aug 22, 2022

jedel1043 added the bug Something isn't working label Aug 22, 2022

jedel1043 moved this to Todo in Boa pre-v1 Aug 22, 2022

jedel1043 self-assigned this Apr 14, 2024

jedel1043 added the triaged Issue reviewed by the maintainer team label Apr 14, 2024

jasonwilliams added the E-Easy Easy label Dec 6, 2024

hansl self-assigned this Dec 13, 2024

hansl mentioned this issue Dec 13, 2024

Add a stress test to the parser to parser multi-millions tokens #4086

Merged

hansl closed this as completed in #4086 Dec 14, 2024

github-project-automation bot moved this from To do to Done in Boa pre-v1 Dec 14, 2024

jasonwilliams mentioned this issue Feb 26, 2025

boa_tester stack overflows in debug mode #4089

Open
