
Regular expressions - garbled Chinese characters when matching with C++ regex

#include <cstdlib>
#include <iostream>
#include <string>
#include <regex>
using namespace std;

int main() {
    string str = "今天是个好日子圣达菲阿斯qweer";
    // Intended to match any Chinese character (U+4E00 to U+9FA5)
    regex pattern("[\u4e00-\u9fa5]");
    sregex_token_iterator end;  // note: a default-constructed iterator marks the end
    for (sregex_token_iterator j(str.begin(), str.end(), pattern); j != end; ++j) {
        cout << *j;
    }
    system("pause");
    return 0;
}

When matching Chinese text in C++, some of the characters come out garbled. Has anyone else run into this?

迷茫  2764 days ago  1480

All replies (1)

  • ringa_lee

    ringa_lee  2017-04-17 11:59:54

    \u4e00-\u9fa5 is the Unicode range commonly used to match Chinese characters.
    C++ does not support Unicode very well. If the program is compiled with VS on Windows, ordinary string literals are ANSI-encoded after compilation (GBK on a Chinese-locale system), while L"" strings are UTF-16 LE. Since C++11 you can use the u8"" (UTF-8), u"" (UTF-16), and U"" (UTF-32) prefixes to choose the Unicode encoding of a string literal.

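    Judging from the source code, the regex used here is the one from the C++ standard library. It works on char, so it compares one byte at a time and a multi-byte Chinese character is split across several elements. Searching Stack Overflow, the general consensus is that the standard library's regex does not support Unicode well.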
    http://stackoverflow.com/questions/11254232/do-c11-regular-expressions...
    http://stackoverflow.com/questions/15882991/range-of-utf-8-characters-...
    http://stackoverflow.com/questions/17103925/how-well-is-unicode-suppor...

    I don’t know whether using UTF-32 or UTF-16 would solve the problem. The generally recommended approach is boost::regex + ICU.
    This particular example looks like it could be solved by using u"".
