Django: How to filter (or replace) Unicode characters that take more than 3 bytes in UTF-8?

Posted on July 11

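I'm using Python and Django, but I've run into a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, its utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters through utf8mb4, and some day in the future utf8 might support them as well.

But my server is not ready to upgrade to MySQL 5.5, so I'm limited to UTF-8 characters that take 3 bytes or fewer.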
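To make the limit concrete, here is a minimal sketch (written for Python 3, which is an assumption; the question itself uses Python 2) that reports how many bytes a few sample characters need in UTF-8. Only code points above U+FFFF need four bytes, so those are the ones a 3-byte utf8 column cannot store. The sample characters and the helper name `utf8_length` are arbitrary illustrations.

```python
# Python 3 sketch: UTF-8 byte length of a few sample code points.
samples = {
    "U+0041": "\u0041",       # LATIN CAPITAL LETTER A
    "U+00E9": "\u00e9",       # LATIN SMALL LETTER E WITH ACUTE
    "U+20AC": "\u20ac",       # EURO SIGN
    "U+1F600": "\U0001F600",  # GRINNING FACE (outside the BMP)
}

def utf8_length(char):
    """Return the number of bytes `char` occupies when encoded as UTF-8."""
    return len(char.encode("utf-8"))

for name, char in samples.items():
    print(name, utf8_length(char))
```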

My question is: How to filter (or replace) unicode characters that would take more than 3 bytes?

I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.

In other words, I want a behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter). Edit: I want a behavior similar to encode(), but I don't want to actually encode the string. I want to still have a unicode string after filtering.
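For comparison, this is the encode() behavior being referred to, as a Python 3 sketch (the sample string is made up for illustration). With 'replace', encode() swaps every unencodable character for ?, but it returns bytes and, with a narrow codec such as ASCII, also destroys characters that MySQL could store. The goal is the same kind of substitution applied only to characters above U+FFFF, with a text string as the result.

```python
# Python 3 sketch; the sample text is a made-up illustration.
text = "caf\u00e9 \U0001F600"

# What encode(..., 'replace') does: each unencodable character becomes b'?',
# and the result is bytes, not text.
encoded = text.encode("ascii", "replace")
print(encoded)  # b'caf? ?'

# What the question asks for instead: still a text string, with only the
# 4-byte character replaced by U+FFFD and the 2-byte e-acute left untouched.
wanted = "caf\u00e9 \ufffd"
print(max(len(c.encode("utf-8")) for c in wanted))  # 3
```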

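I DON'T want to escape the characters before storing them in MySQL, because that would mean unescaping every string I get back from the database, which is very annoying and unfeasible.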

See also:

"Incorrect string value" warning when saving some unicode characters to MySQL (at the Django ticket system)
‘?’ Not a valid unicode character, but in the unicode character set? (at Stack Overflow)

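[EDIT] Added tests of the proposed solutions

So far I've received good answers. Thanks, everyone! Now, in order to choose one of them, I ran a quick test to find the simplest and fastest one.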

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
        unichr(random.randrange(32,
            0x10ffff if random.randrange(100) > normal_chars else 0x0fff
        )) for i in xrange(string_size) )

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)

print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')

#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')

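The results:

filter_using_re() made 515 function calls in 0.139 CPU seconds (0.138 CPU seconds spent in sub())
filter_using_python() made 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds spent in the join() call and 1.900 CPU seconds evaluating the generator expression)
I did not test the itertools approach because... well... that solution, although interesting, is quite big and complex.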

Conclusion

The RegEx solution was by far the fastest one.
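The test script above targets Python 2 (unichr, xrange, print statements). A minimal Python 3 sketch of the same two filters follows, assuming timeit in place of cProfile and a made-up sample string instead of the random one. Timings depend on the machine and interpreter, so none are quoted here.

```python
import re
import timeit

# Matches every character outside the Basic Multilingual Plane, plus lone
# surrogates (the same character class as re_pattern in the test script).
NON_BMP = re.compile("[^\u0000-\uD7FF\uE000-\uFFFF]")

def filter_using_re(text):
    """Replace every character that needs 4 UTF-8 bytes with U+FFFD."""
    return NON_BMP.sub("\ufffd", text)

def filter_using_python(text):
    """Same result, computed character by character."""
    return "".join(
        c if c < "\ud800" or "\ue000" <= c <= "\uffff" else "\ufffd"
        for c in text
    )

# Made-up test data: mostly BMP characters with one 4-byte character mixed in.
sample = "na\u00efve \U0001F600 \u20ac" * 1000

# Both filters must agree before their speed is compared.
assert filter_using_re(sample) == filter_using_python(sample)

for func in (filter_using_re, filter_using_python):
    seconds = timeit.timeit(lambda: func(sample), number=50)
    print(func.__name__, round(seconds, 4))
```

The filtered string stays a text string, so it can be passed on unchanged to whatever saves it to the database.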
