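I have a problem with inconsistent encoding of a character vector in R.
The text file I read the table from is encoded in UTF-8 (via Notepad++); I also tried UTF-8 without BOM.
I want to read the table from this text file, convert it to a data.table, set a key, and use binary search. When I tried to do so, the following appeared: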
Warning message:
and binary search does not work.
I realized that the key column of my data.table contains both "unknown" and "UTF-8" encoding types:
> table(Encoding(poli.dt$word))
unknown   UTF-8
2061312 2739122
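To make that table easier to interpret, here is a minimal sketch on a made-up three-word vector. The sample `word` vector and the helper `has_non_ascii` are illustrative assumptions, not the real data. R never marks pure-ASCII strings with an encoding, so they always report "unknown"; the entries that would actually matter are those that are "unknown" and also contain non-ASCII bytes.

```r
# Illustrative sample only; the real column is poli.dt$word.
word <- c("kot", "\u017cona", "dom")

# Declared encoding of each element.
enc <- Encoding(word)

# Hypothetical helper: TRUE if the string contains any byte above 0x7F.
has_non_ascii <- function(x) {
  vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
         logical(1), USE.NAMES = FALSE)
}
non_ascii <- has_non_ascii(word)

# Cross-tabulate declared encoding against actual content.
print(table(encoding = enc, non_ascii = non_ascii))

# Non-ASCII elements without an encoding mark are the suspicious ones.
suspicious <- word[enc == "unknown" & non_ascii]
```

If `suspicious` came back non-empty on the real column, that would point to strings stored in the native code page (CP1250 in my locale) rather than in UTF-8.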
I tried to convert this column (before creating the data.table object) using:
Encoding(word) <- "UTF-8"
word <- enc2utf8(word)
but it had no effect.
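For context, a small sketch of what these two calls do on toy strings (the variables below are made up for illustration). `Encoding<-` only relabels the bytes that are already there, and neither call ever puts a mark on a pure-ASCII string, which may be part of why the "unknown" count does not drop to zero.

```r
# UTF-8 bytes for "ż" (0xC5 0xBC), built from raw so the string starts unmarked.
x <- rawToChar(as.raw(c(0xc5, 0xbc)))
before <- Encoding(x)

# Encoding<- relabels the existing bytes; it does not convert them.
Encoding(x) <- "UTF-8"
after <- Encoding(x)
same_bytes <- identical(charToRaw(x), as.raw(c(0xc5, 0xbc)))

# A pure-ASCII string keeps reporting "unknown" after either call.
a <- "kot"
Encoding(a) <- "UTF-8"
ascii_after_mark <- Encoding(a)
ascii_after_enc2utf8 <- Encoding(enc2utf8("kot"))

print(c(before = before, after = after,
        ascii_after_mark = ascii_after_mark,
        ascii_after_enc2utf8 = ascii_after_enc2utf8))
```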
I also tried several different ways of reading the file into R (setting all the relevant parameters, e.g. encoding = "UTF-8"):
- data.table::fread
- utils::read.table
- base::scan
- colbycol::cbc.read.table
but with no effect.
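As a self-contained illustration of one of these attempts, the sketch below writes a tiny UTF-8 file and reads it back with `utils::read.table`. The file contents are made up and stand in for the real table; only base R is used so that it runs without extra packages.

```r
# Write a tiny UTF-8 file (illustrative data, not the real table).
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeLines(enc2utf8(c("word", "kot", "\u017cona")), con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" marks the strings that are read; it does not re-encode them.
df <- read.table(tmp, header = TRUE, encoding = "UTF-8",
                 stringsAsFactors = FALSE)

# Inspect the declared encoding of each value that came back.
enc_read <- Encoding(df$word)
print(data.frame(word = df$word, encoding = enc_read))
unlink(tmp)
```

Even in this toy case the result is a mix of "unknown" (the ASCII word) and "UTF-8" (the word containing a Polish letter), so a mixed `table(Encoding(...))` on its own does not necessarily mean the file was read incorrectly.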
==================================================
My R version:
> R.version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          0.3
year           2014
month          03
day            06
svn rev        65126
language       R
version.string R version 3.0.3 (2014-03-06)
nickname       Warm Puppy
My session info:
> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.2 colbycol_0.8 filehash_2.2-2 rJava_0.9-6
loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 tools_3.0.3