Handling the 0x1A error when processing XML files: Hexadecimal value 0x is an invalid character

I am 忧心母鸡, a blogger on 靠谱客. This article covers handling the 0x1A error when processing XML files ("Hexadecimal value 0x is an invalid character"); I am sharing it here as a reference.

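When processing XML files, you will often run into error messages such as the 0x1A one. The characters allowed in XML are in fact defined by an international standard (http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char), so characters that do not conform to it need to be removed; otherwise the consequences can be unpredictable.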

Soap client throws System.Xml.XmlException : hexadecimal value 0x1A, is an invalid character

I found a solution: http://prettycode.org/2009/05/07/hexadecimal-value-0x-is-an-invalid-character/

...when trying to load a XML document using one of the .NET XML API objects like XmlReader, XmlDocument, or XDocument? Was "0x" by chance one of these characters?

0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08
0x0B, 0x0C, 0x0E, 0x0F
0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19
0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F
0x7F

The problem that causes these "invalid character" XmlExceptions is that the data being read or loaded contains characters that are illegal according to the XML 1.0 specification (which is what System.Xml conforms to—not XML 1.1).

Most of these illegal characters are in the ASCII control character range (think wacky characters like null, bell, backspace, etc). These aren't characters that have any business being in XML data; they're illegal characters, usually having found their way into the data from file format conversions, like when someone tries to create an XML file from Excel data, or export their data to XML from a format that may be stored as binary like PDF. In fact, if XML data containing the character '\a' (bell, 0x07) is echoed to a console, the machine may actually make the bell sound before the XmlException is thrown.

Although most ASCII control characters are disallowed, the formatting characters '\n', '\r', and '\t' are legal in XML (1.0 and 1.1), and therefore do not cause this XmlException.
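
To see the difference between a legal and an illegal control character, here is a minimal sketch that feeds a tiny document to XDocument.Parse. The class name InvalidCharDemo and the helper ParsesWith are made up for illustration; only XDocument.Parse and XmlException come from the framework.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class InvalidCharDemo
{
    // Hypothetical helper: returns true if System.Xml accepts an element
    // whose text contains the given character, false if it throws.
    public static bool ParsesWith(char c)
    {
        string xml = "<root>a" + c + "b</root>";
        try
        {
            XDocument.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            // e.g. "hexadecimal value 0x1A, is an invalid character"
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ParsesWith('\t'));     // True: tab is allowed
        Console.WriteLine(ParsesWith('\u001A')); // False: 0x1A is rejected
    }
}
```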

Sanitizing Strings

If you're encountering XML data that causes an XmlException because it "contains invalid characters", the offending data should be sanitized of illegal XML characters before it is used.

/// <summary>
/// Remove illegal XML characters from a string.
/// </summary>
public string SanitizeXmlString(string xml)
{
    if (string.IsNullOrEmpty(xml))
    {
        return xml;
    }
 
    var buffer = new StringBuilder(xml.Length);
 
    foreach (char c in xml)
    {
        if (IsLegalXmlChar(c))
        {
            buffer.Append(c);
        }
    }
 
    return buffer.ToString();
}
 
/// <summary>
/// Whether a given character is allowed by XML 1.0.
/// </summary>
public bool IsLegalXmlChar(int character)
{
    return
    (
         character == 0x9 /* == '\t' == 9   */        ||
         character == 0xA /* == '\n' == 10  */        ||
         character == 0xD /* == '\r' == 13  */        ||
        (character >= 0x20    && character <= 0xD7FF) ||
        (character >= 0xE000  && character <= 0xFFFD) ||
        (character >= 0x10000 && character <= 0x10FFFF)
    );
}
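
One caveat with the loop above: it inspects one UTF-16 char at a time, so a character above U+FFFF (stored as a surrogate pair) never reaches the 0x10000 to 0x10FFFF branch, and both halves of the pair are dropped. If you want to keep such characters, a surrogate-aware variant might look like the sketch below. The class name XmlTextSketch and the method name Sanitize are assumptions for illustration, not part of the original article.

```csharp
using System;
using System.Text;

public static class XmlTextSketch
{
    // Same XML 1.0 Char ranges as IsLegalXmlChar above, applied to a full code point.
    public static bool IsLegalXmlChar(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD ||
               (codePoint >= 0x20 && codePoint <= 0xD7FF) ||
               (codePoint >= 0xE000 && codePoint <= 0xFFFD) ||
               (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static string Sanitize(string xml)
    {
        if (string.IsNullOrEmpty(xml))
        {
            return xml;
        }

        var buffer = new StringBuilder(xml.Length);

        for (int i = 0; i < xml.Length; i++)
        {
            // A valid high/low surrogate pair encodes one code point above U+FFFF.
            if (i + 1 < xml.Length && char.IsSurrogatePair(xml[i], xml[i + 1]))
            {
                int codePoint = char.ConvertToUtf32(xml[i], xml[i + 1]);
                if (IsLegalXmlChar(codePoint))
                {
                    buffer.Append(xml[i]).Append(xml[i + 1]);
                }
                i++; // step past the low surrogate
            }
            else if (IsLegalXmlChar(xml[i]))
            {
                // Lone surrogates (0xD800-0xDFFF) fail this check and are dropped.
                buffer.Append(xml[i]);
            }
        }

        return buffer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Sanitize("a\u001Ab")); // ab
    }
}
```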

Useful as these methods are, don't go off pasting them into your code anywhere. Create a class instead. Let's say you use the routine to sanitize a string in one section of code. Then another section of code uses that same string that has been sanitized. How does the other section positively know that the string doesn't contain any control characters anymore, without checking? It doesn't.

Who knows where that string has been, or whether it has been sanitized, before it gets to a different routine further down the processing pipeline. Program defensively and agnostically. If the sanitized string isn't a string and is instead a different type that represents sanitized strings, you can guarantee that the string doesn't contain illegal characters. Use something like this instead:

using System.Text; // for StringBuilder

public class XmlSanitizedString
{
    private readonly string value;
 
    public XmlSanitizedString(string s)
    {
        this.value = XmlSanitizedString.SanitizeXmlString(s);
    }
 
    /// <summary>
    /// Get the XML-sanitized string.
    /// </summary>
    public override string ToString()
    {
        return this.value;
    }
 
    /// <summary>
    /// Remove illegal XML characters from a string.
    /// </summary>
    private static string SanitizeXmlString(string xml)
    {
        if (string.IsNullOrEmpty(xml))
        {
            return xml;
        }
 
        var buffer = new StringBuilder(xml.Length);
 
        foreach (char c in xml)
        {
            if (XmlSanitizedString.IsLegalXmlChar(c))
            {
                buffer.Append(c);
            }
        }
 
        return buffer.ToString();
    }
 
    /// <summary>
    /// Whether a given character is allowed by XML 1.0.
    /// </summary>
    private static bool IsLegalXmlChar(int character)
    {
        return
        (
             character == 0x9 /* == '\t' == 9   */        ||
             character == 0xA /* == '\n' == 10  */        ||
             character == 0xD /* == '\r' == 13  */        ||
            (character >= 0x20    && character <= 0xD7FF) ||
            (character >= 0xE000  && character <= 0xFFFD) ||
            (character >= 0x10000 && character <= 0x10FFFF)
        );
    }
}
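
The payoff of the wrapper type shows up at the call site: a method that takes XmlSanitizedString instead of string cannot be handed unsanitized text by accident. The sketch below illustrates this; FeedLoader and its Load method are hypothetical consumers invented for the example, and the class is repeated in compact form only so the sample compiles on its own.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

// Compact copy of the class above so this sketch is self-contained.
public sealed class XmlSanitizedString
{
    private readonly string value;

    public XmlSanitizedString(string s)
    {
        if (string.IsNullOrEmpty(s)) { value = s; return; }

        var buffer = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (IsLegalXmlChar(c)) buffer.Append(c);
        }
        value = buffer.ToString();
    }

    public override string ToString() { return value; }

    private static bool IsLegalXmlChar(int character)
    {
        return character == 0x9 || character == 0xA || character == 0xD ||
               (character >= 0x20 && character <= 0xD7FF) ||
               (character >= 0xE000 && character <= 0xFFFD) ||
               (character >= 0x10000 && character <= 0x10FFFF);
    }
}

// Hypothetical consumer: the parameter type documents and enforces sanitization.
public static class FeedLoader
{
    public static XDocument Load(XmlSanitizedString xml)
    {
        return XDocument.Parse(xml.ToString());
    }
}

public static class SanitizedUsageDemo
{
    public static void Main()
    {
        string raw = "<root>a\u001Ab</root>";
        XDocument doc = FeedLoader.Load(new XmlSanitizedString(raw));
        Console.WriteLine(doc.Root.Value); // ab

        // FeedLoader.Load(raw); // would not compile: a plain string is not accepted
    }
}
```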


Sanitizing Streams

Now, if the strings that need to be sanitized are being retrieved from a Stream, via a TextReader for example, we can create a custom StreamReader class that will skip over illegal characters. Let's say that you're retrieving XML data like so:

string xml;
 
using (WebClient downloader = new WebClient())
{
    using (TextReader reader = new StreamReader(downloader.OpenRead(uri)))
    {
        xml = reader.ReadToEnd();
    }
}
 
// xml potentially contains illegal characters

You could use the XmlSanitizedString class above like this:

string xml;
 
using (WebClient downloader = new WebClient())
{
    using (TextReader reader = new StreamReader(downloader.OpenRead(uri)))
    {
        xml = reader.ReadToEnd();
    }
}
 
// Sanitize the XML
 
XmlSanitizedString safeXml = new XmlSanitizedString(xml);
 
// Do something with safeXml.ToString()

But StreamReader can be inherited, so we can create a custom reader that skips over illegal characters, which avoids the costly StringBuilder operations. After it's created, we can sanitize XML streams like this:

string xml;
 
using (WebClient downloader = new WebClient())
{
    using(var reader = new XmlSanitizingStream(downloader.OpenRead(uri)))
    {
        xml = reader.ReadToEnd();
    }
}
 
// xml contains no illegal characters

The declaration for this XmlSanitizingStream, with IsLegalXmlChar() that we'll need, looks like:

public class XmlSanitizingStream : StreamReader
{
    // Pass 'true' to automatically detect encoding using BOMs.
    // BOMs: http://en.wikipedia.org/wiki/Byte-order_mark
 
    public XmlSanitizingStream(Stream streamToSanitize)
        : base(streamToSanitize, true)
    { }
 
    /// <summary>
    /// Whether a given character is allowed by XML 1.0.
    /// </summary>
    public static bool IsLegalXmlChar(int character)
    {
        return
        (
             character == 0x9 /* == '\t' == 9   */          ||
             character == 0xA /* == '\n' == 10  */          ||
             character == 0xD /* == '\r' == 13  */          ||
            (character >= 0x20    && character <= 0xD7FF  ) ||
            (character >= 0xE000  && character <= 0xFFFD  ) ||
            (character >= 0x10000 && character <= 0x10FFFF)
        );
    }
 
    // ...

To get this XmlSanitizingStream working correctly, we'll first need to override two integral methods: Peek() and Read(). The Read method should only return legal XML characters, and Peek should skip past a character if it's not legal.

private const int EOF = -1;
 
public override int Read()
{
    // Read each char, skipping ones XML has prohibited
 
    int nextCharacter;
 
    do
    {
        // Read a character
 
        if ((nextCharacter = base.Read()) == EOF)
        {
            // If the char denotes end of file, stop
            break;
        }
    }
 
    // Skip char if it's illegal, and try the next
 
    while (!XmlSanitizingStream.IsLegalXmlChar(nextCharacter));
 
    return nextCharacter;
}
 
public override int Peek()
{
    // Return next legal XML char w/o reading it 
 
    int nextCharacter;
 
    do
    {
        // See what the next character is 
        nextCharacter = base.Peek();
    }
    while
    (
        // If it's illegal, skip over and try the next.
 
        !XmlSanitizingStream.IsLegalXmlChar(nextCharacter) &&
        (nextCharacter = base.Read()) != EOF
    );
 
    return nextCharacter;
}

Next, we'll need to override the other Read* methods (the buffer overload of Read, ReadToEnd, ReadLine, ReadBlock). These all rely on Peek and Read to derive their return values. If they are not overridden, calling them on XmlSanitizingStream will invoke them on the underlying base StreamReader, which will then use its Peek and Read methods, not the XmlSanitizingStream's, resulting in unsanitized characters making their way through.

To make life easy and avoid writing these other Read* methods from scratch, we can disassemble the TextReader class using Reflector, and copy its versions of the other Read* methods, without having to change more than a few lines of code related to ArgumentExceptions.
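
For readers who would rather not disassemble anything, here is a hand-written sketch of what those overrides could look like when expressed purely in terms of the sanitizing Read() and Peek(). This is not the Reflector-derived code the article refers to; the class name XmlSanitizingReaderSketch is an assumption, and ReadBlock plus the newer Span-based overloads are left out.

```csharp
using System;
using System.IO;
using System.Text;

public class XmlSanitizingReaderSketch : StreamReader
{
    private const int EOF = -1;

    public XmlSanitizingReaderSketch(Stream stream) : base(stream, true) { }

    public static bool IsLegalXmlChar(int c)
    {
        return c == 0x9 || c == 0xA || c == 0xD ||
               (c >= 0x20 && c <= 0xD7FF) ||
               (c >= 0xE000 && c <= 0xFFFD) ||
               (c >= 0x10000 && c <= 0x10FFFF);
    }

    // Return the next legal character, consuming any illegal ones before it.
    public override int Read()
    {
        int next;
        do { next = base.Read(); }
        while (next != EOF && !IsLegalXmlChar(next));
        return next;
    }

    // Return the next legal character without consuming it.
    public override int Peek()
    {
        int next;
        while ((next = base.Peek()) != EOF && !IsLegalXmlChar(next))
        {
            base.Read(); // discard the illegal character
        }
        return next;
    }

    // Fill the buffer one sanitized character at a time.
    public override int Read(char[] buffer, int index, int count)
    {
        if (buffer == null) throw new ArgumentNullException("buffer");
        if (index < 0 || count < 0 || buffer.Length - index < count)
            throw new ArgumentException("index and count must lie inside buffer.");

        int written = 0;
        int next;
        while (written < count && (next = Read()) != EOF)
        {
            buffer[index + written++] = (char)next;
        }
        return written;
    }

    public override string ReadToEnd()
    {
        var text = new StringBuilder();
        int next;
        while ((next = Read()) != EOF) text.Append((char)next);
        return text.ToString();
    }

    // Lines end at "\n", "\r", or "\r\n"; returns null once the stream is exhausted.
    public override string ReadLine()
    {
        var line = new StringBuilder();
        int next;
        while ((next = Read()) != EOF)
        {
            if (next == '\n') return line.ToString();
            if (next == '\r')
            {
                if (Peek() == '\n') Read();
                return line.ToString();
            }
            line.Append((char)next);
        }
        return line.Length > 0 ? line.ToString() : null;
    }

    public static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes("a\u001Ab\r\nc\u0000d");
        using (var reader = new XmlSanitizingReaderSketch(new MemoryStream(bytes)))
        {
            Console.WriteLine(reader.ReadLine()); // ab
            Console.WriteLine(reader.ReadLine()); // cd
        }
    }
}
```

Reading one character per call is slower than the buffered base implementation, so this trades some throughput for simplicity.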