Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't escape space characters in strings #525

Merged
merged 14 commits into from
Apr 29, 2022

Conversation

Lagovas
Copy link
Collaborator

@Lagovas Lagovas commented Apr 13, 2022

Found that valid strings with \n characters for string types we encode as hex values and then it rendered incorrectly in web apps. In this PR update our check of printable characters that skipped for encoding. Also found that postgresql sends unicode values in another way then simple latin strings (with encoding into hex). But I didn't find how to fix it correctly, so I created T2531 for futher investigations. And this PR just short fix that should fix obvious and frequent cases.

Checklist

@G1gg1L3s
Copy link
Contributor

Found that valid strings with \n characters for string types we encode as hex values and then it rendered incorrectly in web apps.

Could you provide an example of the full workflow where the problem occurs and for what reason? I just don't quite understand what do you mean

@Lagovas
Copy link
Collaborator Author

Lagovas commented Apr 15, 2022

Could you provide an example of the full workflow where the problem occurs and for what reason? I just don't quite understand what do you mean

first time I found it in our rubygems example:
image
There is hex value:

>>> unhexlify(b'2d2d2d205b5d0a')
b'--- []\n'

And that is how it looks like with correct encoding:
image
Empty value filtered on the app's side.
There is Acra encoded in such way due to \n symbol which is not printable

@G1gg1L3s
Copy link
Contributor

Ok, thank you for explanation! So my question is, are there any issues with other control characters? I saw \n and \t in tests. But what about for example \r? Maybe it's better to just check that slice is valid utf8, and if not, use hex encoding? At least this complies with postgres' varchar type.

@Lagovas
Copy link
Collaborator Author

Lagovas commented Apr 15, 2022

Ok, thank you for explanation! So my question is, are there any issues with other control characters? I saw \n and \t in tests. But what about for example \r? Maybe it's better to just check that slice is valid utf8, and if not, use hex encoding? At least this complies with postgres' varchar type.

Now it is checks "SPACE" group of unicode symbols and ignores other control symbols.

And now it checks for utf8 strings, spaces and \\. Otherwise encodes into the hex.

@G1gg1L3s
Copy link
Contributor

I am not sure if it work correctly. For example, I've configured two tables: encrypted and plain with one text field. Then I insert data with one backslash. Here are the results:
image

On the left there is a connection to acra, and on the right direct connection to postgres. As you can see, the Acra encoded plain results into hex, and encrypted removes \ from decrypted results entirely (checked in wireshark).

Maybe we should revise and somehow document the encoding/decoding strategy?

@Lagovas
Copy link
Collaborator Author

Lagovas commented Apr 27, 2022

so, I debugged how postgres sends string and binary values. And yes, if data has text type then postgres doesn't encode it and sens as is. Even if it is not unprintable control characters but valid utf8.
But if its binary data, then it sends in the escaped format (hex/octal)

Postgres knows types in the database but doesn't know type on application side. For example if app operates with str types, in postgres it is stored as binary. So anyway it comes to acra-server as binary values. But after decryption we should send forwared it as string values without any encodings, otherwise encode as binary data.

Also I met a problem with encoding binary data when we don't know real type. When we received data that not related to any encryptor_config setting. Acra decrypts acrastructs/acrablocks transparently without any encryptor_config because data can be stored directly from app to database with acrawriter. Or data can be encrypted with AcraTranslator and also saved directly to database. So on Acra side we try all data to decode, match to known signatures, try to decrypt and then return encoded decrypted data or as is. So I added saving value as is to return in same representation in case if it was already decoded but not decrypted, and acra doesn't have info about type.

Comment on lines +174 to 186
case sqlparser.StrVal:
// if it is valid string, return as is. otherwise there is binary data, and we should encode it
if utils.IsPrintablePostgresqlString(data) {
return data, nil
}
val.Type = sqlparser.PgEscapeString
fallthrough
case sqlparser.PgEscapeString:
// valid strings we pass as is without extra encoding
if utils.IsPrintablePostgresqlString(data) {
return data, nil
}
return PgEncodeToHexString(data), nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two cases look almost the same, maybe we can combine them back?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are not the same. in case with StrVal we check that it is printable string. If so, we return it as string literal as is. If it came as select 'some string', then we return 'some string'. Otherwise we change type to PgEscapeString that will return as E'some string'. We return as PgEscapeString because with such format we can encode not printable characters as hex or oct literals.

When we fall below - yes, we check second time with IsPrintable. But it will fail and encode string.
Or we can return PgEncodeToHexString in case with StrVal too. But I prefer to change type and use fallthrough for a future if case with PgEscapeString will process it in more complicated or different way. Then we should not to maintain copy-paste

// string values can contain hex values and should be returned as is
{input: []byte("\\xTT"), decodedData: []byte("\\xTT"), encodedData: []byte("\\xTT"), decodeErr: hex.InvalidByteError('T'), encodeErr: nil,
setting: &config.BasicColumnEncryptionSetting{DataType: "str"}},

// invalid binary hex value that should be returned as is. Also encoded into hex due to invalid hex value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover confusing comment detected :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch, thanks. will update it in the second copy-pasted row too

Comment on lines 88 to 96
if !base.IsDecryptedFromContext(ctx) {
encodedValue, ok := getEncodedValueFromContext(ctx)
if ok {
return ctx, &identityValue{encodedValue}, nil
}
} else {
return ctx, &byteSequenceValue{seq: data}, nil
}
return ctx, &identityValue{data}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I thought that 96th line was unreachable, but it isn't. Though, can we swap branches in this if statement to simplify it (I'm talking about checking if base.IsDecryptedFromContext(ctx) first). I think It would be easier to perceive and it would make branches more explicit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, will reverse it

Comment on lines 88 to 96
if !base.IsDecryptedFromContext(ctx) {
encodedValue, ok := getEncodedValueFromContext(ctx)
if ok {
return ctx, &identityValue{encodedValue}, nil
}
} else {
return ctx, &byteSequenceValue{seq: data}, nil
}
return ctx, &identityValue{data}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, correct me if I am wrong.
If the decoding was successful in decodeText we save the result into the context. Then here, in the encoding function, if the data type is not known, we check if data was decrypted. And if it wasn't, we try to return that saved decoded value.

But, as I understand, if the decryptor handlers cannot decrypt the ciphertext, they will return it as is. So, theoretically, the decoded and saved value should not change. It means that in any case encodedValue must always be equal to data. So, my question is, in what cases that isn't true?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how acra processes data from database: it has 3 handlers.

  1. decoder
  2. decryptor
  3. encoder
    When it receive data from database, we should decode it from encoded format into raw binary. For example binary string 'hello' database will send as bytea data: \x68656c6c6f (in hex format). On our side we should decode it into raw binary value: 'hello' and pass it to our handlers that decrypt or detokenize data.
    Okay, we had decrypted or did nothing with this data. Anyway to pass forward we should encode it in correct format to application. If we didn't do anything, we should pass it as is in hex format \x68656c6c6f. If we decrypted data, we look on data_type. If it is "str" then we don't need to encode. If we don't have data_type then pass it as binary as it came to us and encode to hex format.

In function above we describe 3 step: encoder. Decide what to do: encode or don't encode.

// IsPrintablePostgresqlString returns true if it's valid ASCII or utf8 string except '\' character that used as escape
// character in strings
// IsPrintablePostgresqlString returns true if it's valid utf8 string
// except '\' character that used as escape character in strings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another confusing comment detected :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, thanks. removed second check at end after extra checks that it doesn't need anymore

@Lagovas Lagovas requested a review from G1gg1L3s April 28, 2022 13:43
Copy link
Contributor

@G1gg1L3s G1gg1L3s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, cool!

@Lagovas Lagovas merged commit f1cb21a into master Apr 29, 2022
@Lagovas Lagovas deleted the lagovas/dont-escape-space-characters branch April 29, 2022 12:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants