Handling UTF-8 with TinyXML


This post comes from the 靠谱客 blogger 感动长颈鹿 and covers how to handle UTF-8 with TinyXML. It is shared here as a reference.

I previously wrote a post about producing UTF-8 documents with TinyXML.

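A defining trait of TinyXML is that it does nothing about the actual encoding of XML node content; that is left entirely to the user. As a result, every character-related TinyXML function accepts only char* data.
For example: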

 
    TiXmlElement *pRoot = new TiXmlElement("test");
    pRoot->SetAttribute("name", "名字");

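If the node produced by the code above is saved directly with TiXmlDocument's SaveFile function, the output can only be in the local ANSI encoding (whether or not the program is built as Unicode), even when TiXmlDeclaration specifies utf-8. TinyXML writes out exactly the bytes it was given, and the narrow literal "名字" is compiled in the local code page. One workaround is to output to a TiXmlPrinter, convert TiXmlPrinter.CStr() into a UTF-8 encoded char*, and save that.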

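Under a double-byte encoding, char* is a rather peculiar kind of string. With the VC compiler on a Chinese-language platform, a char* can hold GBK Chinese characters and the compiler still recognizes the characters correctly, because the high bit of every ASCII byte is 0, while the high bit of the lead byte of a two-byte GBK character is 1.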
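
The high-bit rule can be sketched as a small character counter. This is only an illustration under assumptions: `countGbkChars` is a hypothetical helper, the input is taken to be well-formed GBK, and any byte with the high bit set is treated as the lead byte of a two-byte character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count characters in a well-formed GBK byte string.
// A byte with the high bit clear is ASCII (one byte per character).
// A byte with the high bit set is assumed to be the lead byte of a
// two-byte character, so the following byte is consumed with it.
std::size_t countGbkChars(const std::string& gbk) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < gbk.size()) {
        unsigned char b = static_cast<unsigned char>(gbk[i]);
        i += (b & 0x80) ? 2 : 1;
        ++count;
    }
    return count;
}
```

For instance, the six bytes `"hi\xD6\xD0\xCE\xC4"` (the letters "hi" followed by 中文 in GBK) count as four characters.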

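When working with UTF-8 strings, keep one principle firmly in mind: UTF-8 should be used only for transport and is not well suited as the object that functions operate on. What counts as transport? Network transfer and file I/O. Take file I/O as an example: the file is stored as UTF-8, and as soon as it is read into memory it should be converted to a Unicode wide string. Inside the program, only Unicode wide strings are processed. Only at the moment of writing to a file is the Unicode wide string converted back to a UTF-8 string.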

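A UTF-8 string is a variable-length string and has no dedicated data type. It is stored as char*, but its bytes do not follow any double-byte encoding, and treating it as one fails immediately. In fact, char* is merely storage; using void* or unsigned char* would be no different in essence. (If you like, you can even store a Unicode wide string in a char* buffer; just memcpy two bytes at a time.)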
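
To make the "just bytes" point concrete, here is a minimal sketch that dumps the bytes of a UTF-8 string and counts its code points. Both `hexDump` and `countCodePoints` are hypothetical helpers introduced for illustration; the counter assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte as two uppercase hex digits.
std::string hexDump(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned char>(bytes[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// Hypothetical helper: count code points in well-formed UTF-8 by
// skipping continuation bytes, which always look like 10xxxxxx.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}
```

The string 名字 encoded as UTF-8 is the six bytes `E5 90 8D E5 AD 97`: its byte length is 6, yet it holds only two code points, which is why byte-oriented functions such as strlen report a length that is not a character count.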

TinyXML can be used without depending on a double-byte encoding such as GBK.
For example, the code above can be changed to:

 
    TiXmlElement *pRoot = new TiXmlElement("test");
    CStringA UTF8Str = CW2A(L"名字", CP_UTF8);
    pRoot->SetAttribute("name", UTF8Str);

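UTF8Str converts implicitly to a pointer to the first byte of the char* string it holds. You can write your own replacement for CW2A; it is not hard to implement. With this change, you can call TiXmlDocument's SaveFile function directly to save a UTF-8 document without a BOM. To save a UTF-8 document with a BOM you still need TiXmlPrinter, but TiXmlPrinter.CStr() no longer needs any processing.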
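
As a rough idea of what a hand-written replacement for CW2A(..., CP_UTF8) might look like, here is a portable sketch. `utf16ToUtf8` is a hypothetical name, it takes a std::u16string rather than an MFC string, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch, not something the original code does.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for CW2A(text, CP_UTF8): encode UTF-16 as UTF-8.
// Assumption: unpaired surrogates are replaced with U+FFFD.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {          // high surrogate
            if (i + 1 < in.size() &&
                in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Combine the pair into one code point above U+FFFF.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {   // stray low surrogate
            cp = 0xFFFD;
        }
        if (cp < 0x80) {                             // 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                     // 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                   // 1110xxxx + 2 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                     // 11110xxx + 3 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result's `c_str()` could then be passed wherever TinyXML expects a char*, for example as the value argument of SetAttribute.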

 
    XmlEntityTree = new TiXmlDocument;
    TiXmlDeclaration *dec = new TiXmlDeclaration("1.0", "utf-8", "");
    XmlEntityTree->LinkEndChild(dec);
    TiXmlElement *pRoot = new TiXmlElement("test");
    CStringA UTF8Str = CW2A(L"名字", CP_UTF8);
    pRoot->SetAttribute("name", UTF8Str);
    XmlEntityTree->LinkEndChild(pRoot);
    // Serialize the tree into memory; the text is already UTF-8
    TiXmlPrinter printer;
    XmlEntityTree->Accept(&printer);
    // The UTF-8 BOM is the three bytes EF BB BF
    char UTF8BOM[3] = {'\xEF', '\xBB', '\xBF'};
    CFile theFile;
    theFile.Open(_T("test.xml"), CFile::modeCreate | CFile::modeWrite);
    theFile.Write(UTF8BOM, 3);
    theFile.Write(printer.CStr(), (UINT)strlen(printer.CStr()));
    theFile.Close();
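
For readers without MFC, the same "BOM first, then the UTF-8 text" step can be sketched with the standard library. `writeUtf8WithBom` is a hypothetical helper; the text passed in is assumed to be UTF-8 already (for example, the contents of a printer buffer).

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: write the UTF-8 BOM followed by already-encoded
// UTF-8 text. Binary mode keeps the bytes untouched on every platform.
bool writeUtf8WithBom(const std::string& path, const std::string& utf8Text) {
    static const char bom[3] = {'\xEF', '\xBB', '\xBF'};
    std::ofstream file(path, std::ios::binary | std::ios::trunc);
    if (!file) return false;
    file.write(bom, sizeof bom);
    file.write(utf8Text.data(),
               static_cast<std::streamsize>(utf8Text.size()));
    file.flush();
    return static_cast<bool>(file);
}
```

Opening the stream in binary mode matters here: text mode on Windows would translate line endings and alter the byte stream.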

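When loading an XML document, TinyXML takes an encoding flag: TiXmlDocument.LoadFile(TiXmlEncoding encoding);
This flag does not change what you get back: whether it is set to TIXML_ENCODING_UTF8 or TIXML_ENCODING_LEGACY, the node data read in is still char*.
What setting TIXML_ENCODING_UTF8 mainly does is let TinyXML handle the document's BOM automatically.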
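
If you read the file bytes yourself before handing them to a parser, BOM handling could look like the sketch below. `stripUtf8Bom` is a hypothetical helper written for illustration and is not part of TinyXML.

```cpp
#include <string>

// Hypothetical helper: remove a leading UTF-8 BOM (EF BB BF) in place.
// Returns true if a BOM was present and removed.
bool stripUtf8Bom(std::string& data) {
    if (data.size() >= 3 && data.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        data.erase(0, 3);
        return true;
    }
    return false;
}
```

Remembering the return value lets a program write the file back out in the same form it was read, with or without a BOM.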

For the document below, how can the content of the TemplateStr node be read correctly? Simply convert it as you read it.

 
   
    <?xml version="1.0" encoding="utf-8" ?>
    <config>
        <TemplateStr>中文</TemplateStr>
        <AutoFixCue>true</AutoFixCue>
        <AutoFixTTA>true</AutoFixTTA>
        <AcceptDragFLAC>true</AcceptDragFLAC>
        <AcceptDragTAK>true</AcceptDragTAK>
        <AcceptDragAPE>true</AcceptDragAPE>
    </config>
 
 

 
    TiXmlDocument *xmlfile = new TiXmlDocument(FilePath);
    xmlfile->LoadFile(TIXML_ENCODING_UTF8);
    TiXmlHandle hRoot(xmlfile);
    TiXmlElement *pElem;
    TiXmlHandle hXmlHandle(0);
    // config node
    pElem = hRoot.FirstChildElement().Element();
    if (!pElem) return FALSE;
    if (strcmp(pElem->Value(), "config") != 0)
        return FALSE;
    // TemplateStr node
    hXmlHandle = TiXmlHandle(pElem);
    pElem = hXmlHandle.FirstChild("TemplateStr").Element();
    if (!pElem) return FALSE;
    // GetText() returns UTF-8 bytes; convert them to a wide string here
    CString TemplateStr = UTF8toUnicode(pElem->GetText());

The UTF8toUnicode function:

 
    CString UTF8toUnicode(const char* utf8Str, UINT length)
    {
        CString unicodeStr;
        unicodeStr = _T("");
        if (!utf8Str)
            return unicodeStr;
        if (length == 0)
            return unicodeStr;
        // Convert one UTF-8 sequence per iteration
        WCHAR chr = 0;
        for (UINT i = 0; i < length;)
        {
            if ((0x80 & utf8Str[i]) == 0) // ASCII
            {
                chr = utf8Str[i];
                i++;
            }
            else if ((0xE0 & utf8Str[i]) == 0xC0 && i + 1 < length) // 110xxxxx 10xxxxxx
            {
                chr  = (utf8Str[i + 0] & 0x1F) << 6;
                chr |= (utf8Str[i + 1] & 0x3F);
                i += 2;
            }
            else if ((0xF0 & utf8Str[i]) == 0xE0 && i + 2 < length) // 1110xxxx 10xxxxxx 10xxxxxx
            {
                chr  = (utf8Str[i + 0] & 0x0F) << 12;
                chr |= (utf8Str[i + 1] & 0x3F) << 6;
                chr |= (utf8Str[i + 2] & 0x3F);
                i += 3;
            }
            /*
            else if() // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            {}
            else if() // 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
            {}
            else if() // 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
            {}
            */
            else // Not a UTF-8 string (or a truncated or unsupported sequence)
            {
                return unicodeStr;
            }
            unicodeStr.AppendChar(chr);
        }
        return unicodeStr;
    }

    CString UTF8toUnicode(const char* utf8Str)
    {
        // GetText() returns NULL for an element without text, so guard strlen
        if (!utf8Str)
            return CString();
        UINT theLength = (UINT)strlen(utf8Str);
        return UTF8toUnicode(utf8Str, theLength);
    }
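
The function above leaves the four-byte case commented out. A portable sketch of a decoder that also covers four-byte sequences, emitting UTF-16 surrogate pairs, might look like the following. `utf8ToUtf16` is a hypothetical name; it uses std::u16string instead of CString, and rejecting overlong and surrogate encodings is a choice of this sketch rather than behavior of the original code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical portable decoder: UTF-8 bytes -> UTF-16 code units.
// Returns false on truncated or malformed input; out then holds the
// code units decoded before the error.
bool utf8ToUtf16(const std::string& in, std::u16string& out) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const char32_t minValue[] = {0, 0x80, 0x800, 0x10000};
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // continuation bytes expected after b0
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else return false;  // stray continuation or invalid lead byte
        if (extra > in.size() - i - 1) return false;  // truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;     // not 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;
        // Reject overlong forms, surrogates and values past U+10FFFF.
        if (cp < minValue[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {            // split into a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return true;
}
```

The five- and six-byte forms listed in the commented-out branches are not handled here because RFC 3629 restricts UTF-8 to at most four bytes per code point.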