Language Detection

Published: 2023-06-13 10:41:05 | Author: 香家小叽

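I have recently been doing translation-related work, which requires identifying the language of the input first. I found the following options: fasttext, fastlid (built on fasttext), langid, langdetect, googletrans, and google_trans_new (an improved googletrans). The script below tries each of them.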

import fasttext
from fastlid import fastlid
import langid
from langdetect import detect
from googletrans import Translator
from httpcore import SyncHTTPProxy
from google_trans_new import google_translator

## Calling Google Translate requires a proxy
http_proxy = SyncHTTPProxy((b'http', b'xxx.xxx.xxx.xxx', xxxx, b''))
proxies = {'http': http_proxy, 'https': http_proxy}
translator = Translator(service_urls=['translate.google.com', 'translate.google.hk'], proxies=proxies)
## Proxy setup for google_trans_new is simpler
detector2 = google_translator(url_suffix="com", proxies={'http': 'xxx.xxx.xxx.xxx:xxxx', 'https': 'xxx.xxx.xxx.xxx:xxxx'})
text = ["全键热插拔", "主水路无双酚A,出水水路无硅胶,保障饮水安全", "图片、表格", "聯交所"]

## Suppress the warning that fasttext prints when loading a model
fasttext.FastText.eprint = lambda x: None
## The model must be downloaded first from https://fasttext.cc/docs/en/language-identification.html
path_to_pretrained_model = 'resources/lid.176.bin'
fmodel = fasttext.load_model(path_to_pretrained_model)


def retry(func, text, max_retries=3):
    '''
    Retry wrapper: calls to Google Translate fail easily.
    '''
    for i in range(max_retries):
        try:
            result = func(text)
            return result
        except Exception as e:
            print(f'Error: {e}')
            print(f'Retrying ({i+1}/{max_retries})...')
    print(f'Failed after {max_retries} retries.')
    return None


for i in text:
    print("fasttext: ", fmodel.predict(i))
    print("fastlid: ", fastlid(i))
    print("langid: ", langid.classify(i))
    print("langdetect: ", detect(i))
    print("googletrans: ", retry(translator.detect, i))
    print("google_trans_new: ", retry(detector2.detect, i))
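The detectors above print their results in different shapes, which makes a side-by-side comparison awkward. The following is a minimal sketch of a normalizer that reduces each result to a primary language code. The helper name `normalize_label` and the sample values are hypothetical, and the result shapes are assumptions about what each library returns (a labels/probabilities pair for fasttext, a code/score pair for langid, a plain string for langdetect, an object with a `.lang` attribute for googletrans, a list for google_trans_new), so check them against the versions you have installed.

```python
from types import SimpleNamespace


def normalize_label(raw):
    """Reduce a detector result to a lowercase primary code such as 'zh'."""
    if raw is None:  # e.g. retry() gave up and returned None
        return None
    label = raw
    # Unwrap nested tuples/lists by repeatedly taking the first element.
    while isinstance(label, (tuple, list)) and label:
        label = label[0]
    if isinstance(label, (tuple, list)):  # an empty container was left over
        return None
    # Result objects are assumed to expose the code as a `.lang` attribute.
    label = str(getattr(label, "lang", label))
    if label.startswith("__label__"):  # fasttext-style label prefix
        label = label[len("__label__"):]
    # Drop region/script suffixes: 'zh-CN' and 'zh_TW' both become 'zh'.
    return label.replace("_", "-").split("-")[0].lower()


# Hypothetical raw results, shaped like what each library is assumed to return.
samples = {
    "fasttext": (("__label__zh",), [0.99]),
    "langid": ("zh", -45.2),
    "langdetect": "zh-cn",
    "googletrans": SimpleNamespace(lang="zh-TW", confidence=1.0),
    "google_trans_new": ["zh-CN", "chinese (simplified)"],
    "failed call": None,
}

for name, raw in samples.items():
    print(f"{name}: {normalize_label(raw)}")
```

Note that collapsing to the primary code discards the Simplified/Traditional distinction (for example `zh-cn` versus `zh-tw`). If that distinction matters for the translation step, keep the full label instead of splitting on the hyphen.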

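As for speed and detection quality, you can judge for yourself. I mainly tested Chinese text: googletrans and google_trans_new were the most accurate, while fasttext and fastlid were the fastest.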
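To measure the speed difference yourself, a small timing harness is enough. The sketch below is one possible approach, not the measurement method used for this post: `benchmark` and `dummy_detect` are hypothetical names, and the dummy detector only stands in for a real one so the example runs without any model or network access.

```python
import time


def benchmark(detectors, texts, repeats=3):
    """Return {name: best average seconds per text} for each detector callable."""
    timings = {}
    for name, detect_fn in detectors.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            for t in texts:
                detect_fn(t)
            elapsed = time.perf_counter() - start
            # Keep the fastest run to reduce noise from other processes.
            best = min(best, elapsed / len(texts))
        timings[name] = best
    return timings


def dummy_detect(text):
    """Stand-in detector: 'zh' if any CJK unified ideograph is present."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "und"


sample_texts = ["全键热插拔", "图片、表格", "聯交所"]
results = benchmark({"dummy": dummy_detect}, sample_texts)

# Print the detectors from fastest to slowest.
for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds * 1e6:.1f} microseconds per text")
```

With the objects from the script above, the dictionary could instead be something like `{"fasttext": fmodel.predict, "langid": langid.classify, "langdetect": detect}`. The network-based detectors are best timed separately, since their latency depends mostly on the proxy and the remote service rather than on the library.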