jaq outputs invalid utf8 escaped json #209

Open
ibash-corpusant opened this issue Sep 9, 2024 · 5 comments

Comments

@ibash-corpusant
ibash-corpusant commented Sep 9, 2024

I'm not sure if jaq is outputting invalid utf8, or if jq is too liberal in what it accepts. In any case this is easy to reproduce:

Steps:

  1. Have a json string with unicode codepoints
  2. Pass it through jaq

Expected:

  1. jaq outputs the unicode characters in json

Actual:

  1. jaq tries to escape the unicode characters

// foo.js
a = "\u{200b}banana \u{200e}man"
console.log(JSON.stringify({a}))
# this outputs the string with the unicode characters (but they're not visible)
❯ node foo.js 
{"a":"​banana ‎man"}

# jaq tries to escape the unicode characters, but this is invalid json
❯ node foo.js | jaq
{
  "a": "\u{200b}banana \u{200e}man"
}

# jq outputs the unicode characters (but they're not visible)
❯ node foo.js | jq
{
  "a": "​banana ‎man"
}

# we can see jaq is producing invalid json
❯ node foo.js | jaq | jaq
Error: failed to parse: invalid hexadecimal sequence

# but jq produces valid json
❯ node foo.js | jq | jaq 
{
  "a": "\u{200b}banana \u{200e}man"
}

❯ node foo.js | jaq | jq 
jq: parse error: Invalid characters in \uXXXX escape at line 2, column 35
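
For illustration, here is a minimal Node.js sketch of why the second stage of the pipeline fails. JSON only allows `\u` followed by exactly four hexadecimal digits, so the braced `\u{200b}` form is rejected by a conforming parser. The helper name `isValidJson` is hypothetical and is not part of jaq or jq.

```javascript
// Two spellings of the same object, as raw JSON text.
// String.raw keeps the backslashes so they reach JSON.parse untouched.
const jsonStyle = String.raw`{"a":"\u200bbanana \u200eman"}`;
const bracedStyle = String.raw`{"a":"\u{200b}banana \u{200e}man"}`;

// Hypothetical helper: report whether a text parses as JSON.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isValidJson(jsonStyle));   // true
console.log(isValidJson(bracedStyle)); // false
```

The braced form is legal in a JavaScript source file such as foo.js, which is why the input script runs fine, but it is not part of the JSON grammar.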
@pkoppstein

This is to confirm that jaq (including jaq 2.0.0-alpha) is incorrect:

$ cat bananaman.json
"\u200bbanana \u200eman"

$ jq -r . bananaman.json | jq -R . | jaq . 
"\u{200b}banana \u{200e}man"

Perhaps the intention was that jaq should write "\u200bbanana \u200eman",
which would be reasonable although at variance with both the C and Go implementations:

$ jq -r . bananaman.json | jq -R . > bananaman.quoted.txt
$ jq . bananaman.quoted.txt
"​banana ‎man"
$ gojq . bananaman.quoted.txt
"​banana ‎man"

@01mf02 closed this as completed in b8f8e4f on Sep 10, 2024
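
The two valid spellings mentioned in the comment above (raw characters, as jq and gojq print them, or four-digit `\uXXXX` escapes) can be sketched in Node.js as follows. This is only an illustration under stated assumptions: `toAsciiJson` is a hypothetical helper, and it is not how jaq implements its output.

```javascript
// Hypothetical helper, not jaq's code: escape every non-ASCII UTF-16 code
// unit as \uXXXX. Code points above U+FFFF come out as a surrogate pair,
// because JavaScript strings are already sequences of UTF-16 code units.
function toAsciiJson(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (unit) =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

const original = "\u{200b}banana \u{200e}man";
const raw = JSON.stringify(original);  // keeps the invisible characters as-is
const escaped = toAsciiJson(original); // "\u200bbanana \u200eman"

console.log(escaped);
// Both spellings decode to the same string.
console.log(JSON.parse(raw) === JSON.parse(escaped)); // true
```

Either output is valid JSON; the choice only affects whether the invisible characters are visible in the serialized text.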
@01mf02
Owner

01mf02 commented Sep 10, 2024

Thank you for your bug report, @ibash-corpusant! Because your PR adds a new dependency, allocates a new string, and may fail, I corrected the problem in a different way that avoids all of that.

@ibash-corpusant
Author

thank you for the quick fix!

@maribox

maribox commented Feb 8, 2025

I still see this problem on version 2.1.0. Is that possible, or am I doing something wrong?
The input is the raw Wiktextract data for Wiktionary from https://kaikki.org/dictionary/rawdata.html

$ jaq -c 'select((has("form_of") | not) and has("sounds")) | {(.word): (.sounds | map(select(has("ipa"))))}' ./raw-wiktextract-data.jsonl | rg '\\u\{' 
{"विश\u{94d}व":[{"tags":["Delhi"],"ipa":"/ʋɪʃ.ʋə/"},{"tags":["Delhi"],"ipa":"[ʋɪʃ.ʋɐ]"}]}
{"विश\u{94d}व":[{"tags":["Delhi"],"ipa":"/ʋɪʃ.ʋə/"},{"tags":["Delhi"],"ipa":"[ʋɪʃ.ʋɐ]"}]}
{"विश\u{94d}व":[{"ipa":"/ʋiɕ.ʋə/"}]}
{"विश\u{94d}व":[{"ipa":"[bisːo]"}]}
{"विश\u{94d}व":[{"tags":["Vedic"],"ipa":"/ʋíɕ.ʋɐ/"},{"tags":["Classical-Sanskrit"],"ipa":"/ˈʋiɕ.ʋɐ/"}]}
{"विश\u{94d}व":[{"tags":["Vedic"],"ipa":"/ʋíɕ.ʋɐ/"},{"tags":["Classical-Sanskrit"],"ipa":"/ˈʋiɕ.ʋɐ/"}]}
{"क\u{941}त\u{94d}ता":[{"tags":["Delhi"],"ipa":"/kʊt̪.t̪ɑː/"},{"tags":["Delhi"],"ipa":"[kʊt̪̚.t̪äː]"}]}
{"अ\u{902}":[{"ipa":"/əŋ/"},{"ipa":"[aŋ]"}]}
{"अ\u{902}":[{"tags":["Delhi"],"ipa":"/ə̃/"},{"tags":["Delhi"],"ipa":"[ɐ̃]"}]}
{"अ\u{902}":[{"ipa":"/əm/"}]}
{"अ\u{902}":[{"ipa":"[ʌ̃]"},{"ipa":"[ʌm]"}]}
{"अ\u{902}":[{"ipa":"[ə̃ː]"}]}
{"अ\u{901}":[{"ipa":"/ə̃ː/"},{"ipa":"[ãː]"}]}
{"अ\u{901}":[{"tags":["Delhi"],"ipa":"/ə̃/"},{"tags":["Delhi"],"ipa":"[ɐ̃]"}]}
{"अ\u{901}":[{"ipa":"[ʌ̃]"}]}
{"அகரம\u{bcd}":[{"ipa":"/aɡaɾam/"}]}
{"அக\u{bcd}க\u{bbe}":[{"ipa":"/akːaː/"}]}
{"அகம\u{bcd}":[{"ipa":"/aɡam/"}]}
{"அங\u{bcd}கம\u{bcd}":[{"ipa":"/aŋɡam/"}]}
{"יי\u{5b4}דיש":[{"ipa":"/ˈjɪdɪʃ/"}]}
{"יי\u{5b4}דיש":[{"ipa":"/ˈjɪdɪʃ/"}]}
{"m\u{327}uļe":[{"note":"phonetic","ipa":"[mˠulʷe]"},{"note":"phonemic","ipa":"/mˠilʷej/"}]}
{"क\u{94d}या":[{"tags":["Delhi"],"ipa":"/kjɑː/"},{"tags":["Delhi"],"ipa":"[kjäː]"}]}
{"क\u{94d}या":[{"tags":["Delhi"],"ipa":"/kjɑː/"},{"tags":["Delhi"],"ipa":"[kjäː]"}]}
{"פ\u{5bf}ינף":[{"ipa":"/fɪnf/"},{"ipa":"/ˈfɪnəf/"}]}
{"मध\u{94d}य प\u{94d}रद\u{947}श":[{"tags":["Delhi"],"ipa":"/məd̪ʱ.jə .pɾə.d̪eːʃ/"},{"tags":["Delhi"],"ipa":"[mɐd̪ʱ.jɐ‿.pɾɐ.d̪eːʃ]"}]}
{"ฝร\u{e31}\u{e48}ง":[{"tags":["standard"],"ipa":"/fa˨˩.raŋ˨˩/"},{"tags":["standard"],"ipa":"/fa˦˥.raŋ˨˩/"}]}
{"ฝร\u{e31}\u{e48}ง":[{"tags":["standard"],"ipa":"/fa˨˩.raŋ˨˩/"},{"tags":["standard"],"ipa":"/fa˦˥.raŋ˨˩/"}]}

@01mf02
Owner

01mf02 commented Feb 14, 2025

Hi @maribox, thanks for reporting this issue. You're not doing anything wrong; I originally forgot to apply the Unicode formatting code for strings to object keys as well. #259 should correct this. Can you confirm that?
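
One way to check a fix like this is to verify that every output line parses as JSON, since a braced escape in an object key is rejected just like one in a string value. The following Node.js sketch assumes JSON Lines output (as produced by `-c` above); `invalidJsonLines` is a hypothetical helper and the sample lines are modelled on the report, not taken from an actual run.

```javascript
// Hypothetical checker: return the non-empty lines that are not valid JSON.
function invalidJsonLines(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .filter((line) => {
      try {
        JSON.parse(line);
        return false;
      } catch (err) {
        return true;
      }
    });
}

// Sample lines modelled on the report above. The second key uses the
// braced escape that this thread identifies as the bug.
const sample = [
  String.raw`{"अ\u0902":[{"ipa":"/əŋ/"}]}`,
  String.raw`{"अ\u{902}":[{"ipa":"/əŋ/"}]}`,
].join("\n");

console.log(invalidJsonLines(sample).length); // 1
```

To run it against real output, the text could be read from standard input, for example with `require("fs").readFileSync(0, "utf8")`, and an empty result would indicate that no line contains an invalid escape.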

@01mf02 reopened this on Feb 14, 2025