The software prepares a structured message for communication with another component, but encoding or escaping of the data is either missing or done incorrectly. As a result, the intended structure of the message is not preserved.
Extended Description
Improper encoding or escaping can allow attackers to change the commands that are sent to another component, inserting malicious commands instead.
Most software follows a certain protocol that uses structured messages for communication between components, such as queries or commands. These structured messages can contain raw data interspersed with metadata or control information. For example, "GET /index.html HTTP/1.1" is a structured message containing a command ("GET") with a single argument ("/index.html") and metadata about which protocol version is being used ("HTTP/1.1").
If an application uses attacker-supplied inputs to construct a structured message without properly encoding or escaping, then the attacker could insert special characters that will cause the data to be interpreted as control information or metadata. Consequently, the component that receives the output will perform the wrong operations, or otherwise interpret the data incorrectly.
Alternate Terms
Output Sanitization
Output Validation
Output Encoding
Terminology Notes
The usage of the "encoding" and "escaping" terms varies widely. For
example, in some programming languages, the terms are used interchangeably,
while other languages provide APIs that use both terms for different tasks.
This overlapping usage extends to the Web, such as the "escape" JavaScript
function whose purpose is stated to be encoding. Of course, the concepts of
encoding and escaping predate the Web by decades. Given such a context, it
is difficult for CWE to adopt a consistent vocabulary that will not be
misinterpreted by some constituency.
The communications between components can be modified in unexpected
ways. Unexpected commands can be executed, bypassing other security
mechanisms. Incoming data can be misinterpreted.
Likelihood of Exploit
Very High
Detection Methods
Automated Static Analysis
This weakness can often be detected using automated static analysis
tools. Many modern tools use data flow analysis or constraint-based
techniques to minimize the number of false positives.
Effectiveness: Moderate
This is not a perfect solution, since 100% accuracy and coverage are
not feasible.
Automated Dynamic Analysis
This weakness can be detected using dynamic tools and techniques that
interact with the software using large test suites with many diverse
inputs, such as fuzz testing (fuzzing), robustness testing, and fault
injection. The software's operation may slow down, but it should not
become unstable, crash, or generate incorrect results.
Demonstrative Examples
Example 1
Here a value read from an HTML form parameter is reflected back to
the client browser without having been encoded prior to output.
Consider a chat application in which a front-end web application
communicates with a back-end server. The back-end is legacy code that does
not perform authentication or authorization, so the front-end must implement
it. The chat protocol supports two commands, SAY and BAN, although only
administrators can use the BAN command. Each argument must be separated by a
single space. The raw inputs are URL-encoded. The messaging protocol allows
multiple commands to be specified on the same line if they are separated by
a "|" character.
Back End: Command Processor Code
(Bad Code)
Example
Language: Perl
$inputString = readLineFromFileHandle($serverFH);
# generate an array of strings separated by the "|"
character.
@commands = split(/\|/, $inputString);
foreach $cmd (@commands) {
# separate the operator from its arguments based on a
single whitespace
($operator, $args) = split(/ /, $cmd, 2);
$args = UrlDecode($args);
if ($operator eq "BAN") {
ExecuteBan($args);
}
elsif ($operator eq "SAY") {
ExecuteSay($args);
}
}
Front End: Web Application
In this code, the web application receives a command, encodes it for
sending to the server, performs the authorization check, and sends the
command to the server.
(Bad Code)
Example
Language: Perl
$inputString = GetUntrustedArgument("command");
($cmd, $argstr) = split(/\s+/, $inputString, 2);
# removes extra whitespace and also changes CRLF's to
spaces
$argstr =~ s/\s+/ /gs;
$argstr = UrlEncode($argstr);
if (($cmd eq "BAN") && (!
IsAdministrator($username))) {
die "Error: you are not the admin.\n";
}
# communicate with file server using a file handle
$fh = GetServerFileHandle("myserver");
print $fh "$cmd $argstr\n";
Diagnosis
It is clear that, while the protocol and back-end allow multiple
commands to be sent in a single request, the front end only intends to
send a single command. However, the UrlEncode function could leave the
"|" character intact. If an attacker provides:
(Attack)
SAY hello world|BAN user12
then the front end will see this is a "SAY" command, and the $argstr
will look like "hello world | BAN user12". Since the command is "SAY",
the check for the "BAN" command will fail, and the front end will send
the URL-encoded command to the back end:
(Result)
SAY hello%20world|BAN%20user12
The back end, however, will treat these as two separate
commands:
(Result)
SAY hello world
BAN user12
Notice, however, that if the front end properly encodes the "|" with
"%7C", then the back end will only process a single command.
Example 3
This example takes user input, passes it through an encoding scheme
and then creates a directory specified by the user.
(Bad Code)
Example
Language: Perl
sub GetUntrustedInput {
return($ARGV[0]);
}
sub encode {
my($str) = @_;
$str =~ s/\&/\&/gs;
$str =~ s/\"/\"/gs;
$str =~ s/\'/\'/gs;
$str =~ s/\</\</gs;
$str =~ s/\>/\>/gs;
return($str);
}
sub doit {
my $uname = encode(GetUntrustedInput("username"));
print "<b>Welcome,
$uname!</b><p>\n";
system("cd /home/$uname; /bin/ls -l");
}
The programmer attempts to encode dangerous characters, however the blacklist for encoding is incomplete (CWE-184) and an attacker can still pass a semicolon, resulting in a chain with command injection (CWE-77).
Additionally, the encoding routine is used inappropriately with
command execution. An attacker doesn't even need to insert their own
semicolon. The attacker can instead leverage the encoding routine to
provide the semicolon to separate the commands. If an attacker supplies
a string of the form:
(Attack)
' pwd
then the program will encode the apostrophe and insert the semicolon,
which functions as a command separator when passed to the system
function. This allows the attacker to complete the command
injection.
OS command injection in backup software using
shell metacharacters in a filename; correct behavior would require that this
filename could not be changed.
Web application does not set the charset when
sending a page to a browser, allowing for XSS exploitation when a browser
chooses an unexpected encoding.
Cross-site scripting in chat application via a
message, which normally might be allowed to contain arbitrary
content.
Potential Mitigations
Phase: Architecture and Design
Strategy: Libraries or Frameworks
Use a vetted library or framework that does not allow this weakness to
occur or provides constructs that make this weakness easier to
avoid.
For example, consider using the ESAPI Encoding control [R.116.1] or a similar tool, library, or framework. These will help the programmer encode outputs in a manner less prone to error.
Alternately, use built-in functions, but consider using wrappers in
case those functions are discovered to have a vulnerability.
Phase: Architecture and Design
Strategy: Parameterization
If available, use structured mechanisms that automatically enforce the
separation between data and code. These mechanisms may be able to
provide the relevant quoting, encoding, and validation automatically,
instead of relying on the developer to provide this capability at every
point where output is generated.
For example, stored procedures can enforce database query structure
and reduce the likelihood of SQL injection.
Phases: Architecture and Design; Implementation
Understand the context in which your data will be used and the
encoding that will be expected. This is especially important when
transmitting data between different components, or when generating
outputs that can contain multiple encodings at the same time, such as
web pages or multi-part mail messages. Study all expected communication
protocols and data representations to determine the required encoding
strategies.
Phase: Architecture and Design
In some cases, input validation may be an important strategy when
output encoding is not a complete solution. For example, you may be
providing the same output that will be processed by multiple consumers
that use different encodings or representations. In other cases, you may
be required to allow user-supplied input to contain control information,
such as limited HTML tags that support formatting in a wiki or bulletin
board. When this type of requirement must be met, use an extremely
strict whitelist to limit which control sequences can be used. Verify
that the resulting syntactic structure is what you expect. Use your
normal encoding methods for the remainder of the input.
Phase: Architecture and Design
Use input validation as a defense-in-depth measure to reduce the likelihood of output encoding errors (see CWE-20).
Phase: Requirements
Fully specify which encodings are required by components that will be
communicating with each other.
Phase: Implementation
When exchanging data between components, ensure that both components
are using the same character encoding. Ensure that the proper encoding
is applied at each interface. Explicitly set the encoding you are using
whenever the protocol allows you to do so.
This weakness is primary to all weaknesses related to injection (CWE-74) since the inherent nature of injection involves the violation of structured messages.
CWE-116 and CWE-20 have a close association because, depending on the nature of the structured message, proper input validation can indirectly prevent special characters from changing the meaning of a structured message. For example, by validating that a numeric ID field should only contain the 0-9 characters, the programmer effectively prevents injection attacks.
However, input validation is not always sufficient, especially when less
stringent data types must be supported, such as free-form text. Consider a
SQL injection scenario in which a last name is inserted into a query. The
name "O'Reilly" would likely pass the validation step since it is a common
last name in the English language. However, it cannot be directly inserted
into the database because it contains the "'" apostrophe character, which
would need to be escaped or otherwise neutralized. In this case, stripping
the apostrophe might reduce the risk of SQL injection, but it would produce
incorrect behavior because the wrong name would be recorded.
Research Gaps
While many published vulnerabilities are related to insufficient output
encoding, there is such an emphasis on input validation as a protection
mechanism that the underlying causes are rarely described. Within CVE, the
focus is primarily on well-understood issues like cross-site scripting and
SQL injection. It is likely that this weakness frequently occurs in custom
protocols that support multiple encodings, which are not necessarily
detectable with automated techniques.
Theoretical Notes
This is a data/directive boundary error in which data boundaries are not
sufficiently enforced before it is sent to a different control
sphere.
Taxonomy Mappings
Mapped Taxonomy Name
Node ID
Fit
Mapped Node Name
WASC
22
Improper Output Handling
CERT Java Secure Coding
IDS00-J
Sanitize untrusted data passed across a trust
boundary
CERT Java Secure Coding
IDS12-J
Perform lossless conversion of String data between differing
character encodings
CERT Java Secure Coding
IDS05-J
Use a subset of ASCII for file and path
names
CERT C++ Secure Coding
MSC09-CPP
Character Encoding - Use Subset of ASCII for
Safety