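On my system, wcrtomb() seems to think that "narrow multibyte representation" means "ASCII only", even when I compile with -fexec-charset=utf-8. My impression was that the -fexec-charset GCC flag controls what "narrow multibyte representation" means, and that wcrtomb converts from the "wide character set" to the "narrow multibyte representation". If the "narrow multibyte representation" is UTF-8 and the "wide character set" is UTF-32, then wcrtomb should convert from UTF-32 to UTF-8. I know the practical answer is probably to use an explicit UTF-32 to UTF-8 conversion instead of depending on the "wide character set" and "narrow multibyte representation". What I want to understand is why this does not behave the way I expect.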
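#include <clocale>
#include <cstdlib>   // MB_CUR_MAX
#include <cwchar>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    const wchar_t max = 0x10FFFF;
    // Room for the longest possible multibyte sequence per code point.
    std::vector<char> out(MB_CUR_MAX * max);
    char *end = &out[0];
    // Try to convert every code point below max, keeping whatever succeeds.
    for (wchar_t c = 0; c < max; ++c) {
        std::mbstate_t state{};
        std::size_t ret = std::wcrtomb(end, c, &state);
        if (ret != static_cast<std::size_t>(-1)) {
            end += ret;
        }
    }
    // Dump the converted bytes so they can be inspected with cat -v.
    std::ofstream outfile("out", std::ios::out | std::ios::binary);
    outfile.write(&out[0], end - &out[0]);
    return 0;
}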
(export LC_ALL=en_US.UTF-8; g++ -fwide-exec-charset=utf-32le -fexec-charset=utf-8 main.cpp && ./a.out && cat -v ./out && echo)
^@^A^B^C^D^E^F^G^H
^K^L^M^N^O^P^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^^^_ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~^?
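What I've tried:
- Setting -fexec-charset=utf-8, even though the GCC documentation says this is the default
- Setting -fwide-exec-charset=utf-32le, even though that already appears to be the case
- Setting LC_ALL=en_US.UTF-8 for both compilation and execution
- Compiling with clang instead of GCC (it doesn't support -fwide-exec-charset, but printing __clang_wide_literal_encoding__ gives UTF-32)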
System info: Ubuntu 22.04.3 LTS, g++ (Ubuntu 11.4-1ubuntu1~22.04) 11.4.0, Ubuntu clang version 14.0.0-1ubuntu1.1