How Python encoding detection works, and using the chardet module

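Sometimes you need to detect a file's encoding before converting it to another one. This is where chardet comes in: it is a third-party Python library that is very good at identifying encodings.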

chardet offers two ways to detect a file's encoding:
1. Call chardet.detect() on the file's contents:

>>> import chardet
>>> f = open('songs.txt', 'rb')  # binary mode: on Python 3, chardet.detect() needs bytes, not str
>>> result = chardet.detect(f.read())
>>> result
{'confidence': 0.99, 'encoding': 'utf-8'}
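
The introduction mentions converting a file to another encoding once it has been detected. The following is a minimal sketch of that step, not an official chardet recipe. The function name to_utf8 and its detect and fallback parameters are hypothetical, introduced only for illustration; the only chardet call it relies on is chardet.detect().

```python
def to_utf8(raw, detect=None, fallback="utf-8"):
    """Decode raw bytes with the detected encoding and re-encode them as UTF-8.

    `detect` is any callable that returns a dict with an 'encoding' key,
    as chardet.detect() does. `fallback` is used when detection fails.
    """
    if detect is None:
        # Imported lazily so another detector (or a test stub) can be injected.
        import chardet
        detect = chardet.detect

    guess = detect(raw)
    # The encoding can be None when the detector cannot decide.
    encoding = guess.get("encoding") or fallback
    # errors="replace" keeps going if the guess is slightly off;
    # use errors="strict" to fail loudly instead.
    text = raw.decode(encoding, errors="replace")
    return text.encode("utf-8")
```

For the file above, one could read songs.txt in binary mode, pass the bytes to to_utf8(), and write the result to a new file opened with 'wb'. Since the detection is only a guess, it is worth checking the reported confidence before overwriting anything.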

2. chardet comes with a command-line script which reports on the encodings of one or more files:

% chardetect.py somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0

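from chardet.universaldetector import UniversalDetector

def description_of(file, name='stdin'):
    """Return a string describing the probable encoding of a file."""
    u = UniversalDetector()
    # feed() expects bytes, so `file` should be opened in binary mode.
    for line in file:
        u.feed(line)
    # close() finalizes the detection and fills in u.result.
    u.close()
    result = u.result
    if result['encoding']:
        return '%s: %s with confidence %s' % (name,
                                              result['encoding'],
                                              result['confidence'])
    else:
        return '%s: no result' % name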
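
description_of() above feeds every line of the file to the detector. For large files, UniversalDetector also exposes a done attribute that becomes true once it is confident enough, so feeding can stop early. The sketch below shows that pattern. The helper names read_chunks and detect_incrementally, the 4096-byte chunk size, and the optional detector parameter are assumptions made for this example, not part of chardet.

```python
def read_chunks(path, size=4096):
    """Yield a file's contents as fixed-size byte chunks."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(size)
            if not chunk:
                return
            yield chunk


def detect_incrementally(chunks, detector=None):
    """Feed byte chunks to a detector and stop as soon as it is confident."""
    if detector is None:
        # Imported lazily so another detector object can be passed in.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()

    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:
            # The detector has made up its mind; skip the rest of the input.
            break
    # close() finalizes the guess even if the input ran out first.
    detector.close()
    return detector.result
```

A call such as detect_incrementally(read_chunks('songs.txt')) would then return the same kind of dictionary as chardet.detect(), typically without reading the whole file when the encoding is easy to recognize.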

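Guess: the first detection method may work like vim does. It would start by parsing the data with a small character set (ASCII, for example) and compute the decoding error rate. If the error rate exceeds a threshold, it would switch to a larger character set, repeating until it gets a decoding result with a tolerable error rate. That would make it slow, and so not well suited to large files. I don't know whether this is actually how it works.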
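
To make the guess above concrete, here is a small sketch of that trial-decoding strategy using only the standard library. It illustrates the hypothesis only and is not a description of how chardet works internally. The function name guess_by_trial, the candidate list, and the 1% threshold are all assumptions chosen for the example.

```python
def guess_by_trial(raw, candidates=("ascii", "utf-8", "gbk"), max_error_rate=0.01):
    """Return (encoding, error_rate) for the first candidate that decodes well enough.

    Candidates are tried in order, from the smallest character set to larger ones.
    Returns (None, None) if every candidate exceeds the error threshold.
    """
    for encoding in candidates:
        # With errors="replace", undecodable bytes become U+FFFD,
        # so counting that character approximates the number of errors.
        text = raw.decode(encoding, errors="replace")
        error_rate = text.count("\ufffd") / max(len(text), 1)
        if error_rate <= max_error_rate:
            return encoding, error_rate
    return None, None


print(guess_by_trial(b"hello"))                  # plain ASCII passes on the first try
print(guess_by_trial("中文".encode("gbk")))       # falls through to a later candidate
```

In this sketch every rejected candidate means decoding the whole input again, so the cost grows with both the file size and the number of candidates, which is the slowness the guess worries about.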

Question: