Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specialize OsString::push and OsString as From for UTF-8 #137777

Merged
merged 2 commits into from
Mar 7, 2025

Conversation

thaliaarchi
Copy link
Contributor

@thaliaarchi thaliaarchi commented Feb 28, 2025

When concatenating two WTF-8 strings, surrogate pairs at the boundaries need to be joined. However, since UTF-8 strings cannot contain surrogate halves, this check can be skipped when one string is UTF-8. Specialize OsString::push to use a more efficient concatenation in this case.

The WTF-8 version of OsString tracks whether it is known to be valid UTF-8 with its is_known_utf8 field. Specialize From<AsRef<OsStr>> so this can be set for UTF-8 string types.

Unfortunately, a specialization for T: AsRef<str> conflicts with T: AsRef<OsStr>, so stamp out string types with a macro.

r? @ChrisDenton

@rustbot
Copy link
Collaborator

rustbot commented Feb 28, 2025

r? @joboet

rustbot has assigned @joboet.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Feb 28, 2025
@rustbot rustbot assigned ChrisDenton and unassigned joboet Feb 28, 2025
@bjorn3
Copy link
Member

bjorn3 commented Feb 28, 2025

Avoid trying to join surrogates at boundaries when concatenating a WTF-8 OsString and a UTF-8 string.

Is this a performance optimization or a correctness fix?

@thaliaarchi
Copy link
Contributor Author

Performance. When the LHS ends with a leading surrogate and the RHS starts with a trailing surrogate, then they need to be joined. A WTF-8 OsString can contain surrogate halves, but a &str cannot since UTF-8 forbids them. Thus, concatenating the two will never join surrogates.

When concatenating two WTF-8 strings, surrogate pairs at the boundaries
need to be joined. However, since UTF-8 strings cannot contain surrogate
halves, this check can be skipped when one string is UTF-8. Specialize
`OsString::push` to use a more efficient concatenation in this case.

Unfortunately, a specialization for `T: AsRef<str>` conflicts with
`T: AsRef<OsStr>`, so stamp out string types with a macro.
The WTF-8 version of `OsString` tracks whether it is known to be valid
UTF-8 with its `is_known_utf8` field. Specialize `From<AsRef<OsStr>>` so
this can be set for UTF-8 string types.
@thaliaarchi thaliaarchi changed the title Specialize OsString::push for strings Specialize OsString::push and OsString as From for UTF-8 Feb 28, 2025
@thaliaarchi
Copy link
Contributor Author

I have now pushed another similar specialization. Since the WTF-8 OsString tracks whether it is known to be UTF-8 with its is_known_utf8 field, make sure From<AsRef<OsStr>> sets that to true when converting from UTF-8 strings.

@joboet
Copy link
Member

joboet commented Mar 6, 2025

Great idea!
@bors r+

@bors
Copy link
Contributor

bors commented Mar 6, 2025

📌 Commit 83407b8 has been approved by joboet

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 6, 2025
@joboet joboet self-assigned this Mar 6, 2025
workingjubilee added a commit to workingjubilee/rustc that referenced this pull request Mar 7, 2025
…joboet

Specialize `OsString::push` and `OsString as From` for UTF-8

When concatenating two WTF-8 strings, surrogate pairs at the boundaries need to be joined. However, since UTF-8 strings cannot contain surrogate halves, this check can be skipped when one string is UTF-8. Specialize `OsString::push` to use a more efficient concatenation in this case.

The WTF-8 version of `OsString` tracks whether it is known to be valid UTF-8 with its `is_known_utf8` field. Specialize `From<AsRef<OsStr>>` so this can be set for UTF-8 string types.

Unfortunately, a specialization for `T: AsRef<str>` conflicts with `T: AsRef<OsStr>`, so stamp out string types with a macro.

r? `@ChrisDenton`
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 7, 2025
…kingjubilee

Rollup of 12 pull requests

Successful merges:

 - rust-lang#136667 (Revert vita's c_char back to i8)
 - rust-lang#136780 (std: move stdio to `sys`)
 - rust-lang#137107 (Override default `Write` methods for cursor-like types)
 - rust-lang#137363 (compiler: factor Windows x86-32 ABI impl into its own file)
 - rust-lang#137528 (Windows: Fix error in `fs::rename` on Windows 1607)
 - rust-lang#137537 (Prevent `rmake.rs` from using unstable features, and fix 3 run-make tests that currently do)
 - rust-lang#137777 (Specialize `OsString::push` and `OsString as From` for UTF-8)
 - rust-lang#137832 (Fix crash in BufReader::peek())
 - rust-lang#137904 (Improve the generic MIR in the default `PartialOrd::le` and friends)
 - rust-lang#138115 (Suggest typo fix for static lifetime)
 - rust-lang#138125 (Simplify `printf` and shell format suggestions)
 - rust-lang#138129 (Stabilize const_char_classify, const_sockaddr_setters)

r? `@ghost`
`@rustbot` modify labels: rollup
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 7, 2025
…iaskrgr

Rollup of 8 pull requests

Successful merges:

 - rust-lang#136667 (Revert vita's c_char back to i8)
 - rust-lang#137107 (Override default `Write` methods for cursor-like types)
 - rust-lang#137777 (Specialize `OsString::push` and `OsString as From` for UTF-8)
 - rust-lang#137832 (Fix crash in BufReader::peek())
 - rust-lang#137904 (Improve the generic MIR in the default `PartialOrd::le` and friends)
 - rust-lang#138115 (Suggest typo fix for static lifetime)
 - rust-lang#138125 (Simplify `printf` and shell format suggestions)
 - rust-lang#138129 (Stabilize const_char_classify, const_sockaddr_setters)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit d986027 into rust-lang:master Mar 7, 2025
6 checks passed
@rustbot rustbot added this to the 1.87.0 milestone Mar 7, 2025
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Mar 7, 2025
Rollup merge of rust-lang#137777 - thaliaarchi:os_string-push-str, r=joboet

Specialize `OsString::push` and `OsString as From` for UTF-8

When concatenating two WTF-8 strings, surrogate pairs at the boundaries need to be joined. However, since UTF-8 strings cannot contain surrogate halves, this check can be skipped when one string is UTF-8. Specialize `OsString::push` to use a more efficient concatenation in this case.

The WTF-8 version of `OsString` tracks whether it is known to be valid UTF-8 with its `is_known_utf8` field. Specialize `From<AsRef<OsStr>>` so this can be set for UTF-8 string types.

Unfortunately, a specialization for `T: AsRef<str>` conflicts with `T: AsRef<OsStr>`, so stamp out string types with a macro.

r? ``@ChrisDenton``
@thaliaarchi thaliaarchi deleted the os_string-push-str branch March 7, 2025 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants