Using Unicode (UTF-8) in C++


Currently, I have to deal with Unicode in C++11 (Linux environment). UTF-8 is the default encoding. The operations I need are:

  • Replace.
  • Regex.
  • Iterate through a UTF-8 string. I don't know whether using std::string and "for (character c : s)" will do what I want, because each character must be a whole Unicode character. For example, ế is one character, and mão is a word containing 3 characters.
  • Substring.
  • Concatenate a substring with Unicode characters, or concatenate Unicode characters.
  • Length.
  • Trim.
  • Read and write files.

What library should I use to achieve the best result?

Thank you very much. Looking forward to hearing from you soon.


Answers

answered 5 days ago Davislor #1

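For the regex, replace, and search functions, I've previously used PCRE. It is designed to work with UTF-8 strings when the pattern is compiled with its UTF-8 option (PCRE_UTF8, or PCRE2_UTF in PCRE2). You might be able to work with the standard library's regular expressions (std::regex), but not in any portable way. (Windows, in particular, has historically not supported UTF-8 locales.)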
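For plain literal replacement (no pattern syntax), a regex engine may not be needed at all. In valid UTF-8, the encoding of one character never appears inside the encoding of another, so a byte-wise search on std::string cannot match in the middle of a character. The sketch below relies on that property; replace_all is a hypothetical helper name, not part of any library, and it assumes both strings are valid UTF-8 in the same normalization form (a precomposed é will not match e followed by a combining accent).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces every occurrence of `from` with `to`,
// comparing raw bytes. Assumes valid UTF-8 in one normalization form.
std::string replace_all(std::string text,
                        const std::string& from,
                        const std::string& to) {
    if (from.empty()) return text;  // nothing to search for
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();           // continue after the inserted text
    }
    return text;
}
```

For example, replace_all("m\xC3\xA3o", "\xC3\xA3", "a") turns the UTF-8 bytes of "mão" into "mao". Anything involving character classes, case folding, or "any character" matching still needs a Unicode-aware engine such as PCRE.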

Iterating through a UTF-8 string is even more complicated than you describe if you need to support combining marks or the zero-width joiner. A letter such as é looks like one character, but it might be two Unicode code points: Latin small letter e (U+0065) followed by combining acute accent (U+0301). If you simply want to iterate through code points, you might use mbtowc() or std::codecvt::in() (the public wrapper around do_in()) from the standard library; both depend on the locale or facet in use. If you need to iterate through graphemes, the most portable way to do that is with ICU.
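To illustrate what iterating by code point involves, here is a minimal sketch that decodes UTF-8 by hand. It deliberately does not use mbtowc() or std::codecvt, so it does not depend on the process locale. The name decode_utf8 is hypothetical. The sketch substitutes U+FFFD for malformed bytes and does not reject overlong encodings or surrogates, so treat it as a starting point rather than a validating decoder. It also works at the code point level only; grouping code points into graphemes would still need something like ICU.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes a UTF-8 byte string into code points.
// Malformed or truncated sequences produce U+FFFD.
// Overlong encodings and surrogates are NOT rejected in this sketch.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte determines the sequence length.
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        if (i + len > s.size()) { out.push_back(0xFFFD); break; }  // truncated

        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this, "for (char32_t cp : decode_utf8(s))" visits one code point per iteration. The bytes of "mão" yield three code points, the precomposed ế (U+1EBF) yields one, and e followed by a combining acute accent yields two.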

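Regular string concatenation should work, because joining two valid UTF-8 strings always yields valid UTF-8. For length, the standard library has mblen(), which reports how many bytes the next multibyte character occupies, so you can step through the string with it and count characters. This isn't completely portable, because the multibyte encoding depends on the current locale and does not have to be UTF-8 (although there is a standard set of conversion functions).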
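As a locale-independent alternative to stepping through the string with mblen(), the number of code points can be counted directly from the bytes: every UTF-8 character has exactly one byte that is not a continuation byte (10xxxxxx). The sketch below uses that rule; utf8_length is a hypothetical name, and the function assumes the input is already valid UTF-8. It counts code points, not graphemes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a valid UTF-8 string by
// counting every byte that is not a continuation byte (10xxxxxx).
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

Concatenation needs no special handling: std::string("\xE1\xBA\xBF") + "m\xC3\xA3o" is 7 bytes long according to size(), while utf8_length reports 4 code points for it.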
