Struct Char

pub struct Char<T>(pub T);

𝐓 🛠️ Low-level const operations on Unicode scalars.

📍 text/unicode/scalar

§Terminology

A code point is an integer in U+0000..=U+10FFFF.
A surrogate is a code point in U+D800..=U+DFFF.
A Unicode scalar value is a code point that is not a surrogate. Rust’s char represents exactly this set.
A scalar rank is the zero-based position of a scalar value in the ordered set of all Unicode scalar values, with the surrogate range omitted.
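
The three definitions above can be expressed as range checks. The following is a minimal sketch using only the standard library; the free functions `is_code_point`, `is_surrogate` and `is_scalar` are hypothetical names chosen for illustration, not this crate's API:

```rust
/// Largest Unicode code point.
const MAX_CODE_POINT: u32 = 0x10_FFFF;

/// True if `v` is a code point (U+0000..=U+10FFFF).
const fn is_code_point(v: u32) -> bool {
    v <= MAX_CODE_POINT
}

/// True if `v` is a surrogate (U+D800..=U+DFFF).
const fn is_surrogate(v: u32) -> bool {
    matches!(v, 0xD800..=0xDFFF)
}

/// True if `v` is a Unicode scalar value:
/// a code point that is not a surrogate.
const fn is_scalar(v: u32) -> bool {
    is_code_point(v) && !is_surrogate(v)
}
```

Under these definitions `is_scalar(v)` should agree with `char::from_u32(v).is_some()` for every `u32`.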

§Methods

over char
- len_utf8.
- width (common).
- is_combining (common).
- is_control (common).
- is_fullwidth (common).
- to_utf8_bytes.
- write_utf8_to.
- as_ascii (unchecked).
- to_ascii_fold (unchecked).
- random_next.
- random_from_seed.
over u16
- is_surrogate (high, low).
- decode_surrogate_pair.
over u32
- scalar_from_rank.
- scalar_rank.
- is_valid_code.
- is_valid_scalar.
- is_surrogate (high, low).
- len_bytes.
- len_utf8 (unchecked).
- width (common).
- is_ascii (unchecked).
- is_noncharacter.
- is_combining (common).
- is_control (common).
- is_fullwidth (common).
- as_ascii (unchecked).
- to_utf8_bytes (unchecked).
- write_utf8_to_unchecked.
- random_next.
- random_from_seed.
over u8
- len_utf8 (unchecked).
- len_utf8_match (naive).
- is_utf8_boundary.
- is_utf8_continuation.
- as_char.
over &[u8]
- to_char (lenient, unchecked ⚠).
- to_scalar (unchecked).
- has_overlong_encoding.
- has_valid_continuation.
- is_utf8_boundary.
- ceil_utf8_boundary.
- floor_utf8_boundary.

Tuple Fields§

§0: T

Implementations§

impl Char<char>

§Methods over `char`

pub const fn len_utf8(self) -> usize

Returns the number of bytes needed to encode the given Unicode scalar as UTF-8.

pub const fn width(self) -> usize

Returns the monospace display width.

0: Non-printing characters (controls, combining marks)
1: Regular characters (Latin, Greek, Cyrillic, etc.)
2: Wide characters (CJK, emoji, fullwidth forms)

pub const fn width_common(self) -> usize

Returns the monospace display width using a faster calculation.

Uses optimized checks that cover common cases but may incorrectly report some obscure Unicode characters as width 1 instead of 2.

pub const fn is_combining(self) -> bool

Returns true for all Unicode combining characters.

Includes musical notation, historic scripts, and obscure diacritics. Comprehensive but slightly slower than is_combining_common.

pub const fn is_combining_common(self) -> bool

Returns true for common combining marks used in modern text.

Covers Latin, Greek, and most European language diacritics. Fast and suitable for 95% of use cases.

pub const fn is_control(self) -> bool

Returns true for all Unicode control characters.

pub const fn is_control_common(self) -> bool

Returns true for common Unicode control characters.

Covers only ASCII controls, zero-width spaces, bidi formatting, word joiners and invisible operators.

pub const fn is_fullwidth(self) -> bool

Returns true for all Unicode fullwidth characters.

pub const fn is_fullwidth_common(self) -> bool

Returns true for common fullwidth characters (ASCII variants, basic CJK).

pub const fn to_utf8_bytes(self) -> [u8; 4]

Converts this Unicode scalar to a UTF-8 encoded byte sequence.

Always returns a [u8; 4] array, with unused bytes set to 0.

pub const fn write_utf8_to(self, buf: &mut [u8]) -> usize

Writes this Unicode scalar as UTF-8 into buf.

Returns the number of bytes written.

A buffer of 4 bytes is always large enough for any Unicode scalar.

§Panics

Panics if buf.len() < self.0.len_utf8().
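
For comparison, the same zero-padded layout can be produced with the standard library's `char::encode_utf8`, which also panics when the buffer is too small. This is only a sketch of the documented behavior; `utf8_padded` is a hypothetical helper, not part of this crate:

```rust
/// Encodes `c` as UTF-8 into a zero-padded 4-byte array.
/// Returns the array and the number of bytes actually used.
fn utf8_padded(c: char) -> ([u8; 4], usize) {
    let mut buf = [0u8; 4];
    // `encode_utf8` writes into the front of `buf` and returns the
    // encoded `&mut str`; its length is the number of bytes written.
    let len = c.encode_utf8(&mut buf).len();
    (buf, len)
}
```

For example, `utf8_padded('€')` yields `([0xE2, 0x82, 0xAC, 0x00], 3)`: three encoded bytes followed by one padding zero.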

pub const fn as_ascii(self) -> &'static str

Returns the ASCII representation as a &'static str, or "" if non-ASCII.

pub const fn as_ascii_unchecked(self) -> &'static str

Returns the ASCII representation as a &'static str.

§Panics

Panics if the character is not ASCII.

pub const fn to_ascii_fold(self) -> Option<char>

Converts a character to its closest ASCII equivalent, if possible.

This function attempts to replace accented or special characters with their ASCII counterparts. If a mapping exists, it returns Some(char); otherwise it returns None.

pub const fn to_ascii_fold_unchecked(self) -> char

Converts a character to its closest ASCII equivalent, or returns the input character if no mapping exists.

This function is similar to to_ascii_fold, but never returns None. If no ASCII equivalent exists, the input character is returned unchanged.

pub const fn random_next(rng: &mut Pcg32) -> char

Returns a Unicode scalar selected from the next output of rng.

pub const fn random_from_seed(seed: u64) -> char

Returns a Unicode scalar deterministically selected from seed.

impl Char<u16>

§Methods over `u16`.

pub const fn is_surrogate(self) -> bool

Returns true if the given value is a surrogate code point.

pub const fn is_surrogate_high(self) -> bool

Returns true if the given value is a leading (high) surrogate.

pub const fn is_surrogate_low(self) -> bool

Returns true if the given value is a trailing (low) surrogate.

pub const fn decode_surrogate_pair(high: u16, low: u16) -> Option<char>

Decodes the given surrogate pair.

§Features

Uses the unsafe_str feature to skip duplicated validation checks.
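
A surrogate pair maps to a supplementary code point through the standard UTF-16 arithmetic. The sketch below shows that arithmetic with the standard library only. The free function and its choice to return None for anything other than a high surrogate followed by a low surrogate are assumptions, since the conditions for None are not stated here:

```rust
/// Combines a high and a low surrogate into the scalar they encode.
/// Assumption: returns `None` unless `high` is in 0xD800..=0xDBFF
/// and `low` is in 0xDC00..=0xDFFF.
const fn decode_surrogate_pair(high: u16, low: u16) -> Option<char> {
    if !matches!(high, 0xD800..=0xDBFF) || !matches!(low, 0xDC00..=0xDFFF) {
        return None;
    }
    // Each surrogate carries 10 payload bits; the result is offset by 0x1_0000.
    let code = 0x1_0000 + (((high as u32) - 0xD800) << 10) + ((low as u32) - 0xDC00);
    char::from_u32(code)
}
```

For instance, the pair (0xD834, 0xDD1E) decodes to U+1D11E, matching what `char::decode_utf16` produces for the same input.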

impl Char<u32>

§Methods over `u32`.

pub const MAX_CODE_POINT: u32 = 0x10_FFFF

Maximum Unicode code point.

pub const SCALAR_COUNT: u32

Number of Unicode scalar values.

pub const SURROGATE_COUNT: u32

Number of Unicode surrogate code points.

pub const SURROGATE_START: u32 = 0xD800

First Unicode surrogate code point.

pub const SURROGATE_END: u32 = 0xDFFF

Last Unicode surrogate code point.

pub const SURROGATE_HIGH_END: u32 = 0xDBFF

Last high-surrogate code point, also called a leading surrogate.

pub const SURROGATE_LOW_START: u32 = 0xDC00

First low-surrogate code point, also called a trailing surrogate.

pub const fn scalar_from_rank(rank: u32) -> u32

Maps a dense Unicode scalar rank to its scalar value.

Scalar ranks are contiguous in 0..SCALAR_COUNT, with the surrogate range omitted from the resulting values.

§Panics

Panics if rank >= SCALAR_COUNT.

pub const fn scalar_rank(self) -> Option<u32>

Returns the dense scalar rank of this value, or None if it is not a Unicode scalar value.
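
The rank mapping amounts to skipping the surrogate gap. A minimal sketch follows, with free functions standing in for the methods and the constant values derived from the ranges given in the terminology section. It is an illustration under those assumptions, not the crate's implementation:

```rust
const SURROGATE_START: u32 = 0xD800;
/// U+D800..=U+DFFF holds 0x800 (2048) code points.
const SURROGATE_COUNT: u32 = 0x800;
/// All code points minus the surrogates: 1_112_064 scalar values.
const SCALAR_COUNT: u32 = 0x11_0000 - SURROGATE_COUNT;

/// Rank -> scalar: ranks at or past the gap are shifted over it.
const fn scalar_from_rank(rank: u32) -> u32 {
    assert!(rank < SCALAR_COUNT, "rank out of range");
    if rank < SURROGATE_START { rank } else { rank + SURROGATE_COUNT }
}

/// Scalar -> rank: the inverse mapping, `None` for non-scalars.
const fn scalar_rank(value: u32) -> Option<u32> {
    match value {
        0..=0xD7FF => Some(value),
        0xE000..=0x10_FFFF => Some(value - SURROGATE_COUNT),
        _ => None,
    }
}
```

With this mapping, rank 0xD800 corresponds to U+E000, and the last rank, SCALAR_COUNT - 1, corresponds to U+10FFFF.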

pub const fn is_valid_code(self) -> bool

Checks whether the value is a Unicode code point.

A valid Unicode code point is any integer in the range 0..=MAX_CODE_POINT.

This includes surrogate code points, which are valid code points but cannot be represented as Unicode scalars.

§Examples

assert!(Char('A' as u32).is_valid_code()); // regular character
assert!(Char(0x00).is_valid_code());       // NULL is valid
assert!(Char(0x10FFFF).is_valid_code());   // maximum Unicode code point
// surrogates are valid code points:
assert!(Char(0xD800).is_valid_code());     // high surrogate
assert!(Char(0xDFFF).is_valid_code());     // low surrogate
// invalid:
assert!(!Char(0x110000).is_valid_code());  // above max Unicode

pub const fn is_valid_scalar(self) -> bool

Checks whether the value is a Unicode scalar value representable as char.

This excludes surrogate code points, which are invalid in UTF-8 and cannot be represented as Unicode scalars.

§Examples

assert!(Char('A' as u32).is_valid_scalar()); // regular character
assert!(Char(0x00).is_valid_scalar());       // NULL is valid
assert!(Char(0x10FFFF).is_valid_scalar());   // maximum Unicode scalar
// invalid:
assert!(!Char(0xD800).is_valid_scalar());    // high surrogate
assert!(!Char(0xDFFF).is_valid_scalar());    // low surrogate
assert!(!Char(0x110000).is_valid_scalar());  // above max Unicode

pub const fn is_surrogate(self) -> bool

Returns true if the value is a Unicode surrogate code point.

pub const fn is_surrogate_high(self) -> bool

Returns true if the value is a Unicode high surrogate code point.

pub const fn is_surrogate_low(self) -> bool

Returns true if the value is a Unicode low surrogate code point.

pub const fn len_bytes(self) -> usize

Returns the number of bytes required to store the given Unicode code point in a non-UTF encoding.

This function does not determine the UTF-8 byte length. It assumes a simple encoding where values up to 0xFF use 1 byte, values in 0x100..=0xFFFF use 2 bytes, and anything larger uses 3 bytes.
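
The thresholds described above can be written as a single match. This sketch uses a hypothetical free function in place of the method:

```rust
/// Bytes needed to store `code` as a plain integer (not UTF-8).
const fn len_bytes(code: u32) -> usize {
    match code {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        _ => 3, // the maximum code point, 0x10FFFF, fits in 3 bytes
    }
}
```

Note the difference from UTF-8: U+00E9 (é) needs 1 byte here but 2 bytes in UTF-8, and U+1F600 needs 3 bytes here but 4 in UTF-8.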

pub const fn len_utf8(self) -> Option<usize>

Returns the number of bytes required to encode the given Unicode scalar as UTF-8.

Returns None if it’s not a valid Unicode scalar.

pub const fn len_utf8_unchecked(self) -> usize

Returns the UTF-8 byte length of the current Unicode scalar without validation.

Assumes the code is a valid Unicode scalar. Use len_utf8 for a checked version.
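
The checked length computation follows directly from the UTF-8 ranges. A minimal sketch, written as a hypothetical free function rather than the crate's method:

```rust
/// UTF-8 length of `code`, or `None` if it is not a Unicode scalar.
const fn len_utf8(code: u32) -> Option<usize> {
    match code {
        0..=0x7F => Some(1),
        0x80..=0x7FF => Some(2),
        // The surrogate range U+D800..=U+DFFF is excluded.
        0x800..=0xD7FF | 0xE000..=0xFFFF => Some(3),
        0x1_0000..=0x10_FFFF => Some(4),
        _ => None,
    }
}
```

For any valid scalar, the result should match the standard library's `char::len_utf8`.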

pub const fn width(self) -> usize

Returns the monospace display width.

0: Non-printing characters (controls, combining marks)
1: Regular characters (Latin, Greek, Cyrillic, etc.)
2: Wide characters (CJK, emoji, fullwidth forms)

pub const fn width_common(self) -> usize

Returns the monospace display width using a faster calculation.

Uses optimized checks that cover common cases but may incorrectly report some obscure Unicode characters as width 1 instead of 2.

pub const fn is_ascii(self) -> bool

Checks if the given value is a 7-bit ASCII character (U+0000..=U+007F).

pub const fn is_noncharacter(self) -> bool

Returns true if the given Unicode scalar code is a noncharacter.

Note that this also checks against reserved, potential non-characters.

pub const fn is_combining(self) -> bool

Returns true for all Unicode combining characters.

Includes musical notation, historic scripts, and obscure diacritics. Comprehensive but slightly slower than is_combining_common.

pub const fn is_combining_common(self) -> bool

Returns true for common combining marks used in modern text.

Covers Latin, Greek, and most European language diacritics. Fast and suitable for 95% of use cases.

pub const fn is_control(self) -> bool

Returns true for all Unicode control characters.

pub const fn is_control_common(self) -> bool

Returns true for common Unicode control characters.

Covers only ASCII controls, zero-width spaces, bidi formatting, word joiners and invisible operators.

pub const fn is_fullwidth(self) -> bool

Returns true for all Unicode fullwidth characters.

pub const fn is_fullwidth_common(self) -> bool

Returns true for common fullwidth characters (ASCII variants, basic CJK).

pub const fn as_ascii(self) -> &'static str

Returns the ASCII &'static str representation of the value, or "" if non-ASCII.

pub const fn as_ascii_unchecked(self) -> &'static str

Returns the ASCII &'static str representation of the value, or panics if non-ASCII.

§Panics

Panics if the character is not ASCII.

pub const fn to_utf8_bytes(self) -> Option<[u8; 4]>

Converts the Unicode scalar value to a UTF-8 encoded byte sequence array.

Returns None if the value is not a valid Unicode scalar. The result is always a [u8; 4] array, with unused bytes set to 0.

pub const fn to_utf8_bytes_unchecked(self) -> [u8; 4]

Converts the Unicode scalar value to a UTF-8 encoded byte sequence without validation.

Assumes the value is a valid Unicode scalar. Always returns a [u8; 4] array, with unused bytes set to 0.

See also Char::to_utf8_bytes for a checked version.

pub const fn write_utf8_to_unchecked(self, buf: &mut [u8]) -> usize

Writes this Unicode scalar as UTF-8 into buf without validation.

Returns the number of bytes written.

A buffer of 4 bytes is always large enough for any Unicode scalar.

§Panics

Panics if buf.len() < self.0.len_utf8().

pub const fn random_next(rng: &mut Pcg32) -> u32

Returns a Unicode scalar selected from the next output of rng.

pub const fn random_from_seed(seed: u64) -> u32

Returns a Unicode scalar deterministically selected from seed.

impl Char<u8>

§Methods over `u8`.

pub const fn len_utf8(self) -> Option<usize>

Returns the expected UTF-8 byte length based on the given first byte, or None if invalid.

Based on a lookup table (256-byte array).

pub const fn len_utf8_unchecked(self) -> usize

Returns the expected UTF-8 byte length based on the given first byte, or 0 if invalid.

Based on a lookup table (256-byte array).

pub const fn len_utf8_match(self) -> Option<usize>

Returns the expected UTF-8 byte length based on the given first byte, or None if invalid.

Match-based, for when memory accesses are more expensive than branches.
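
A match-based classification of the first byte might look like the sketch below. Which bytes count as invalid is an assumption here: this version rejects continuation bytes (0x80..=0xBF), the always-overlong leads 0xC0 and 0xC1, and 0xF5..=0xFF, which may differ from the crate's choice. The function name is hypothetical:

```rust
/// Expected sequence length for a UTF-8 leading byte,
/// or `None` if `b` cannot start a well-formed sequence.
const fn len_utf8_from_first_byte(b: u8) -> Option<usize> {
    match b {
        0x00..=0x7F => Some(1), // ASCII
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None, // continuation byte or never-valid lead
    }
}
```

The branches replace the table lookup, so no memory beyond the code itself is touched.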

pub const fn len_utf8_match_naive(self) -> usize

Returns the expected UTF-8 byte length based on the given first byte.

Match-based, for when memory accesses are more expensive than branches.

This function does not validate UTF-8 but determines how many bytes a valid sequence should occupy based on the leading byte.

§Caveat

If used on malformed UTF-8, it may suggest a length longer than the actual valid sequence.
Always use in conjunction with proper UTF-8 validation if handling untrusted input.

pub const fn is_utf8_boundary(self) -> bool

Returns true if this byte is a valid starting point for a UTF-8 sequence.

This checks if the byte is not a UTF-8 continuation byte (i.e., it’s either an ASCII character or a valid leading byte of a multi-byte sequence).

pub const fn is_utf8_continuation(self) -> bool

Returns true if this byte is a UTF-8 continuation byte.

Continuation bytes have the bit pattern 10xxxxxx.
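
Both byte-level checks reduce to a mask on the two high bits. A minimal sketch using hypothetical free functions:

```rust
/// True if `b` has the bit pattern 10xxxxxx.
const fn is_utf8_continuation(b: u8) -> bool {
    (b & 0b1100_0000) == 0b1000_0000
}

/// True if `b` is not a continuation byte, so a sequence may start here.
const fn is_utf8_boundary(b: u8) -> bool {
    !is_utf8_continuation(b)
}
```

On well-formed input, `is_utf8_boundary(bytes[i])` should agree with the standard library's `str::is_char_boundary(i)` for every in-bounds `i`.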

pub const fn as_char(self) -> char

Returns the current byte as a char.

See char::from(u8).

impl Char<&[u8]>

§Methods over `u8` slice.

pub const fn to_char(self, index: usize) -> Option<(char, usize)>

Decodes a UTF-8 scalar at index.

Returns Some((char, len)) if the input is a valid UTF-8 sequence and the decoded value is a valid Unicode scalar.

Returns None if:

- The index is out of bounds.
- The bytes do not form a valid UTF-8 sequence.
- The decoded value is not a valid Unicode scalar.

This is implemented via Char::to_scalar().

§Examples

// Valid UTF-8 sequence
let result = Char(b"\xE2\x82\xAC").to_char(0); // €
assert_eq!(result, Some(('€', 3)));

// Invalid continuation bytes
let invalid_continuation = Char(b"\xE2\x41\xAC").to_char(0);
assert_eq!(invalid_continuation, None);

// Surrogate code point
let surrogate = Char(b"\xED\xA0\x80").to_char(0); // U+D800
assert_eq!(surrogate, None);

// Out of bounds index
let out_of_bounds = Char(b"hello").to_char(10);
assert_eq!(out_of_bounds, None);

// Incomplete sequence
let incomplete = Char(b"\xE2\x82").to_char(0); // Missing third byte
assert_eq!(incomplete, None);

§Features

Uses the unsafe_str feature to skip duplicated validation checks.

pub const fn to_char_lenient(self, index: usize) -> (char, usize)

Decodes a UTF-8 scalar leniently at index, validating only the final Unicode scalar.

This method is forgiving of UTF-8 encoding errors but ensures the result is a valid Unicode scalar value.

- Does not validate UTF-8 continuation bytes (may decode malformed sequences).
- If the leading byte is invalid, it returns the replacement character (�).

This is implemented via Char::to_scalar_unchecked().

§Panics

Panics if the decoded value is not a valid Unicode scalar value, or if the index is out of bounds.

§Examples

// Valid UTF-8 sequence
let result = Char(b"\xE2\x82\xAC").to_char_lenient(0); // €
assert_eq!(result, ('€', 3));

// Invalid UTF-8 but decodes to valid scalar - behavior depends on input
// This may return unexpected characters rather than panicking
let result = Char(b"\xE2\x41\xAC").to_char_lenient(0);
assert_eq!(result, ('\u{206c}', 3));

// Surrogate code point - will panic
// let result = Char(b"\xED\xA0\x80").to_char_lenient(0); // PANIC: U+D800 is invalid

// Out of bounds index - will panic
// let result = Char(b"hello").to_char_lenient(10); // PANIC: index out of bounds
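
To see why the malformed input above decodes to U+206C, consider the following sketch of a lenient decoder. It is an illustration only: the function name, the set of bytes treated as leads, and the length of 1 reported alongside the replacement character are assumptions, not the crate's implementation:

```rust
/// Decodes the sequence at `index`, using only the leading byte to pick
/// the length. Continuation bytes are masked with 0x3F, never validated.
/// Panics if the sequence runs past the end of `bytes`.
fn decode_lenient(bytes: &[u8], index: usize) -> (u32, usize) {
    let b0 = bytes[index] as u32;
    // Low six bits of the i-th byte after the lead.
    let cont = |i: usize| (bytes[index + i] as u32) & 0x3F;
    match b0 {
        0x00..=0x7F => (b0, 1),
        0xC0..=0xDF => (((b0 & 0x1F) << 6) | cont(1), 2),
        0xE0..=0xEF => (((b0 & 0x0F) << 12) | (cont(1) << 6) | cont(2), 3),
        0xF0..=0xF7 => (
            ((b0 & 0x07) << 18) | (cont(1) << 12) | (cont(2) << 6) | cont(3),
            4,
        ),
        _ => (0xFFFD, 1), // invalid lead: replacement character
    }
}
```

For `b"\xE2\x41\xAC"` the payload bits are 0x2, 0x01 and 0x2C, giving (0x2 << 12) | (0x01 << 6) | 0x2C = 0x206C. For `b"\xED\xA0\x80"` the same arithmetic gives 0xD800, a surrogate, which cannot be converted to a char.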

pub const unsafe fn to_char_unchecked(self, index: usize) -> (char, usize)

Available only when the crate feature unsafe_str is enabled and the feature safe_text is disabled.

Decodes a UTF-8 scalar at index without any validation.

If the leading byte is invalid, it returns the replacement character (�).

This is implemented via Char::to_scalar_unchecked.

§Safety

The caller must ensure that:

- index is within the bounds of bytes.
- bytes[index..] contains a valid UTF-8 sequence.
- The decoded value is a valid Unicode scalar.

Violating these conditions may lead to undefined behavior.

pub const fn to_scalar(self, index: usize) -> Option<(u32, usize)>

Decodes a UTF-8 scalar from the given byte slice, starting at index.

Returns Some((scalar, len)), where scalar is the decoded Unicode scalar, and len is the number of bytes consumed.

Returns None if:

- The index is out of bounds.
- The bytes do not form a valid UTF-8 sequence.
- The decoded value is not a valid Unicode scalar.

§Examples

assert_eq!(Char("Ħ".as_bytes()).to_scalar(0), Some((u32::from('Ħ'), 2)));

let invalid = b"\x80"; // Invalid leading byte
assert_eq!(Char(invalid).to_scalar(0), None);
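
The documented contract can be cross-checked against the standard library's validator. The sketch below delegates all validation to `std::str::from_utf8`. It is not const and is not how the crate implements the method; `to_scalar_std` is a hypothetical name:

```rust
/// Strictly decodes one scalar at `index`, using `std::str::from_utf8`
/// to reject malformed, overlong, surrogate and incomplete sequences.
fn to_scalar_std(bytes: &[u8], index: usize) -> Option<(u32, usize)> {
    let first = *bytes.get(index)?; // None if index is out of bounds
    let len = match first {
        0x00..=0x7F => 1,
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        _ => return None, // not a valid leading byte
    };
    let seq = bytes.get(index..index + len)?; // None if incomplete
    let c = std::str::from_utf8(seq).ok()?.chars().next()?;
    Some((c as u32, len))
}
```

Comparing the results of such a reference function with `to_scalar` over a set of inputs is one way to test a hand-written decoder.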

pub const fn to_scalar_unchecked(self, index: usize) -> (u32, usize)

Decodes a UTF-8 scalar from the given byte slice, starting at index, without validation.

Returns (scalar, len), where scalar is the decoded Unicode scalar, and len is the number of bytes consumed.

It assumes bytes[index..] contains a valid UTF-8 sequence, and it doesn’t validate the resulting Unicode scalar.

If the leading byte is invalid it returns the replacement character (�).

§Panics

Panics if the index is out of bounds.

pub const fn has_overlong_encoding(self, index: usize, len: usize) -> bool

Returns true if the UTF-8 sequence starting at index is overlong encoded.

This method checks only for overlong encodings, not for other UTF-8 validity rules. It verifies neither continuation byte patterns nor the validity of the scalar value.

Overlong encodings use more bytes than necessary to represent a character, which is invalid in well-formed UTF-8.

§Examples

assert!(Char(b"\xE0\x80\x80").has_overlong_encoding(0, 3)); // overlong encoding
assert!(!Char(b"\xE0\xA0\x80").has_overlong_encoding(0, 3)); // valid 3-byte sequence
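
An overlong check only needs the first two bytes of the sequence. The following sketch assumes `len` is the sequence length implied by the leading byte. Both the function name and the choice to return false when fewer than two bytes are available are assumptions:

```rust
/// True if the `len`-byte sequence at `index` uses more bytes than needed.
fn is_overlong(bytes: &[u8], index: usize, len: usize) -> bool {
    // Fewer than two bytes available: nothing to inspect.
    let Some(&[b0, b1, ..]) = bytes.get(index..) else {
        return false;
    };
    match len {
        // 0xC0 and 0xC1 can only encode values below U+0080.
        2 => matches!(b0, 0xC0 | 0xC1),
        // 0xE0 must be followed by 0xA0 or higher to reach U+0800.
        3 => b0 == 0xE0 && b1 < 0xA0,
        // 0xF0 must be followed by 0x90 or higher to reach U+10000.
        4 => b0 == 0xF0 && b1 < 0x90,
        _ => false,
    }
}
```

As in the documented method, continuation byte patterns are not verified here, so a malformed second byte can also satisfy these comparisons.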

pub const fn has_valid_continuation(self, index: usize, len: usize) -> bool

Verifies that the continuation bytes following a UTF-8 leading byte are properly formatted.

Each continuation byte must match the pattern 10xxxxxx (i.e., have the high bits 0b10). This ensures the byte sequence follows proper UTF-8 encoding rules.

This method verifies only the syntax, not the semantics. It does not check for overlong encodings or invalid scalar values.

§Examples

assert!(Char(b"\xE2\x82\xAC").has_valid_continuation(0, 3)); // euro sign €
assert!(!Char(b"\xE2\x41\xAC").has_valid_continuation(0, 3)); // second byte is ASCII 'A'
assert!(!Char(b"\xC2").has_valid_continuation(0, 2)); // incomplete sequence
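
The syntactic check can be sketched as a scan over the bytes that follow the leading byte. The free function below is a hypothetical, non-const stand-in that treats an incomplete or out-of-bounds sequence as invalid:

```rust
/// True if the `len - 1` bytes after `bytes[index]` all match 10xxxxxx.
fn continuation_ok(bytes: &[u8], index: usize, len: usize) -> bool {
    match bytes.get(index + 1..index + len) {
        Some(tail) => tail.iter().all(|&b| (b & 0xC0) == 0x80),
        None => false, // sequence is incomplete or out of bounds
    }
}
```

A complete validator would combine this check with an overlong check and a scalar range check, since each covers a different rule.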