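I am having trouble working with a dataset in Python that contains non-UTF-8 characters. The strings are imported as binary, but I run into problems converting a binary column to strings when some of its cells contain non-UTF-8 bytes.
A minimal working example of my problem is: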
import polars as pl
import pandas as pd
pd_df = pd.DataFrame([[b"bob", b"value 2", 3], [b"jane", b"\xc4", 6]], columns=["a", "b", "c"])
df = pl.from_pandas(pd_df)
column_names = df.columns
# Loop through the column names
for col_name in column_names:
    # Check if the column has binary values
    if df[col_name].dtype == pl.Binary:
        # Convert the binary column to string format
        print(col_name)
        df = df.with_columns(pl.col(col_name).cast(pl.String))
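This raises an error when converting column b. As a workaround, I would be fine with converting any non-UTF-8 characters to blanks.
I have tried many other conversion suggestions found online, but I could not get any of them to work.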
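To make the desired behaviour concrete, here is a minimal sketch in plain Python (no Polars) of what I mean by "converting to blanks". The helper name `decode_lossy` and its `replacement` parameter are hypothetical, introduced only for illustration. It relies on the standard `bytes.decode` with `errors="replace"`, which substitutes U+FFFD for each undecodable sequence, and then swaps that marker for a space.

```python
def decode_lossy(raw: bytes, replacement: str = " ") -> str:
    """Decode bytes as UTF-8, substituting `replacement` for invalid sequences.

    Hypothetical helper for illustration only; not part of Polars or pandas.
    """
    # errors="replace" turns every undecodable sequence into U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    # Swap the U+FFFD marker for the requested blank. Caveat: a U+FFFD that
    # was legitimately present in the data would be replaced as well.
    return text.replace("\ufffd", replacement)


# Same rows as in the example above, cleaned cell by cell.
rows = [[b"bob", b"value 2", 3], [b"jane", b"\xc4", 6]]
cleaned = [
    [decode_lossy(v) if isinstance(v, bytes) else v for v in row]
    for row in rows
]
print(cleaned)  # [['bob', 'value 2', 3], ['jane', ' ', 6]]
```

I assume something like this could be applied to each binary column with `pl.col(col_name).map_elements(decode_lossy, return_dtype=pl.String)`, but that runs Python per element, so I would prefer a native Polars approach if one exists.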