The software does not properly handle input that contains Unicode encoding.
Time of Introduction
Implementation
Applicable Platforms
Languages: All
Common Consequences
Scope: Integrity
Technical Impact: Unexpected state
Demonstrative Examples
Example 1
Windows provides the MultiByteToWideChar(), WideCharToMultiByte(),
UnicodeToBytes(), and BytesToUnicode() functions to convert between
arbitrary multibyte (usually ANSI) character strings and Unicode (wide
character) strings. The size arguments to these functions are specified in
different units (one in bytes, the other in characters), which makes their
use prone to error.
In a multibyte character string, each character occupies a varying
number of bytes, and therefore the size of such strings is most easily
specified as a total number of bytes. In a Windows Unicode (UTF-16)
string, however, each WCHAR code unit is a fixed size, and string lengths
are typically given as a count of those units, which the API calls
characters. Mistakenly specifying the wrong units in a size argument can
lead to a buffer overflow.
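To make the two units concrete, the following minimal sketch measures the same data both ways. It assumes the multibyte encoding is UTF-8 and that the input is well-formed (Windows ANSI code pages differ in detail but vary in width in the same way). The names utf8_char_count and ELEMENT_COUNT are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts characters (code points) in a well-formed
 * UTF-8 string by skipping continuation bytes, which look like 10xxxxxx. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

/* Number of elements in a true array (not a pointer). This is the unit
 * that wide-character size arguments normally expect. */
#define ELEMENT_COUNT(arr) (sizeof(arr) / sizeof((arr)[0]))
```

For the string "h\xC3\xA9llo", strlen() reports 6 bytes while utf8_char_count() reports 5 characters. For a buffer declared as wchar_t buf[16], ELEMENT_COUNT(buf) is 16 while sizeof(buf) is 16 * sizeof(wchar_t) bytes.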
The following function takes a username specified as a multibyte
string and a pointer to a structure for user information and populates
the structure with information about the specified user. Since Windows
authentication uses Unicode for usernames, the username argument is
first converted from a multibyte string to a Unicode string.
This function incorrectly passes the size of unicodeUser in bytes
instead of characters. The call to MultiByteToWideChar() can therefore
write up to (UNLEN+1)*sizeof(WCHAR) wide characters, or
(UNLEN+1)*sizeof(WCHAR)*sizeof(WCHAR) bytes, to the unicodeUser array,
which has only (UNLEN+1)*sizeof(WCHAR) bytes allocated.
If the username string contains more than UNLEN characters, the call
to MultiByteToWideChar() will overflow the buffer unicodeUser.
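One way to avoid this mistake is to pass the destination capacity as an element count and to treat a result that leaves no room for the terminator as an error. The sketch below is a portable analogue, not the Windows function described above: it uses the standard C mbstowcs(), whose size argument is likewise a count of wide characters. The names to_wide_name and NAME_CAPACITY are hypothetical, with NAME_CAPACITY standing in for UNLEN+1.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical capacity in wide characters, including the terminator. */
#define NAME_CAPACITY 8

/* Converts a multibyte string to a wide string.
 * dst_count is the capacity of dst in wide characters, NOT in bytes.
 * Returns 0 on success, -1 if the input is invalid or does not fit. */
int to_wide_name(const char *src, wchar_t *dst, size_t dst_count)
{
    if (src == NULL || dst == NULL || dst_count == 0) {
        return -1;
    }

    size_t written = mbstowcs(dst, src, dst_count);
    if (written == (size_t)-1) {
        return -1;              /* invalid multibyte sequence */
    }
    if (written == dst_count) {
        dst[0] = L'\0';         /* buffer filled with no terminator: too long */
        return -1;
    }
    return 0;                   /* dst is terminated and within bounds */
}
```

A caller with wchar_t name[NAME_CAPACITY] would pass NAME_CAPACITY, or sizeof(name) / sizeof(name[0]), as the third argument. In the Windows case, the corresponding change is to pass the element count of unicodeUser, rather than sizeof(unicodeUser), to MultiByteToWideChar().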
Observed Examples
Server allows remote attackers to read documents outside of the web root,
and possibly execute arbitrary commands, via malformed URLs that contain
Unicode-encoded characters.
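This entry does not say which encoding trick the server mishandled. One well-known family of malformed Unicode input is the overlong UTF-8 sequence, for example the bytes 0xC0 0xAF, which a lenient decoder turns into "/" after a path check has already run. The sketch below shows a strict well-formedness check that rejects such input; the function name utf8_is_well_formed is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true only if buf[0..len) is well-formed UTF-8: no overlong
 * forms, no surrogates, nothing above U+10FFFF, no truncated sequences. */
bool utf8_is_well_formed(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char lead = buf[i];
        size_t extra;         /* continuation bytes expected */
        unsigned long cp;     /* code point being decoded */
        unsigned long min;    /* smallest code point legal at this length */

        if (lead < 0x80) {
            i++;
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            return false;     /* stray continuation byte or invalid lead */
        }

        if (len - i <= extra) {
            return false;     /* sequence runs past the end of the input */
        }
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) {
                return false; /* not a continuation byte */
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
        if (cp > 0x10FFFF) return false;                 /* out of range */
        i += extra + 1;
    }
    return true;
}
```

Running a check like this before any security decision means that "/" can reach the path logic only as the single byte 0x2F, never under an alternate encoding.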
Potential Mitigations
Avoid making decisions based on names of resources (e.g., files) if
those resources can have alternate names.
Phase: Implementation
Strategy: Input Validation
Assume all input is malicious. Use an "accept known good" input
validation strategy, i.e., use a whitelist of acceptable inputs that
strictly conform to specifications. Reject any input that does not
strictly conform to specifications, or transform it into something that
does.
When performing input validation, consider all potentially relevant
properties, including length, type of input, the full range of
acceptable values, missing or extra inputs, syntax, consistency across
related fields, and conformance to business rules. As an example of
business rule logic, "boat" may be syntactically valid because it only
contains alphanumeric characters, but it is not valid if the input is
only expected to contain colors such as "red" or "blue."
Do not rely exclusively on looking for malicious or malformed inputs
(i.e., do not rely on a blacklist). A blacklist is likely to miss at
least one undesirable input, especially if the code's environment
changes. This can give attackers enough room to bypass the intended
validation. However, blacklists can be useful for detecting potential
attacks or determining which inputs are so malformed that they should be
rejected outright.
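As a minimal sketch of the "accept known good" strategy, the check below accepts a value only if it exactly matches an entry in a fixed list, following the color example above. The list contents and the name is_allowed_color are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list of acceptable values for a field that names a color. */
static const char *const ALLOWED_COLORS[] = { "red", "blue" };

/* Accepts input only if it is byte-for-byte equal to a known-good value.
 * Everything else is rejected, including values no one anticipated. */
bool is_allowed_color(const char *input)
{
    if (input == NULL) {
        return false;
    }
    size_t count = sizeof ALLOWED_COLORS / sizeof ALLOWED_COLORS[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(input, ALLOWED_COLORS[i]) == 0) {
            return true;
        }
    }
    return false;
}
```

Because the comparison is exact, "boat" is rejected even though it is alphanumeric, and so is "red" followed by a non-breaking space or any other look-alike variant.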
Phase: Implementation
Strategy: Input Validation
Inputs should be decoded and canonicalized to the application's current internal representation before being validated (CWE-180). Make sure that the application does not decode the same input twice (CWE-174). Such errors could be used to bypass whitelist validation schemes by introducing dangerous inputs after they have been checked.
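The ordering described above can be sketched as follows, using URL percent-encoding as the example encoding. The input is decoded exactly once, validation runs on the decoded form, and the caller is handed the decoded buffer so that nothing downstream decodes again. All function names and the accepted character set are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Percent-decodes src into dst exactly once. Returns false on a malformed
 * escape, a decoded NUL byte, or insufficient space in dst. */
static bool percent_decode_once(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;
    if (dst_size == 0) return false;
    for (size_t i = 0; src[i] != '\0'; i++) {
        unsigned char ch = (unsigned char)src[i];
        if (ch == '%') {
            int hi = hex_value((unsigned char)src[i + 1]);
            /* Only look at the second digit if the first one was valid,
             * so the read never passes the terminating NUL. */
            int lo = (hi < 0) ? -1 : hex_value((unsigned char)src[i + 2]);
            if (hi < 0 || lo < 0) return false;
            ch = (unsigned char)(hi * 16 + lo);
            if (ch == '\0') return false;
            i += 2;
        }
        if (out + 1 >= dst_size) return false;
        dst[out++] = (char)ch;
    }
    dst[out] = '\0';
    return true;
}

/* Decodes raw once, then validates the DECODED text against an allowlist
 * of ASCII letters, digits, '-' and '_'. On success, decoded holds the
 * canonical value that the rest of the program should use. */
bool decode_and_validate_segment(const char *raw, char *decoded, size_t size)
{
    if (raw == NULL || decoded == NULL) return false;
    if (!percent_decode_once(raw, decoded, size)) return false;
    if (decoded[0] == '\0') return false;
    for (size_t i = 0; decoded[i] != '\0'; i++) {
        unsigned char c = (unsigned char)decoded[i];
        bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                  (c >= '0' && c <= '9') || c == '-' || c == '_';
        if (!ok) return false;
    }
    return true;
}
```

With this ordering, "%2e%2e%2f" is rejected because it decodes to "../" before the check runs. The doubly encoded "%252e" decodes once to "%2e", which still contains a "%" and is rejected rather than being decoded a second time later.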
References
[REF-7] Mark Dowd, John McDonald, and Justin Schuh. "The Art of Software Security Assessment". Chapter 8, "Character Sets and Unicode", Page 446. 1st Edition. Addison Wesley. 2006.