1. Speech transcription

When it comes to speech transcription, the best-performing open-source model right now is arguably OpenAI's Whisper.

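Installation and usage instructions are in the project's README.

Whisper is released in several model sizes. The parameter counts and VRAM requirements are as follows:

If you do not specify a model when running it, the small model is used by default.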
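
For reference, here is a minimal sketch of calling Whisper from Python and saving the result as an SRT file. `whisper.load_model` and `model.transcribe` are the package's entry points; the helper names `srt_timestamp`, `segments_to_srt`, and `transcribe_to_srt` are made up for this illustration, and the sketch assumes that each transcribed segment carries `start`, `end` (in seconds), and `text`.

```python
def srt_timestamp(seconds):
    """Convert a time in seconds (float) to the SRT form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    s, ms = divmod(total_ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments (dicts with "start", "end", "text") as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Entries are separated by one blank line.
    return "\n".join(blocks)


def transcribe_to_srt(audio_path, srt_path, model_name="small"):
    """Transcribe an audio file with Whisper and write the subtitles to srt_path."""
    # Imported lazily so the helpers above work without Whisper installed.
    import whisper  # needs the openai-whisper package and ffmpeg

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    with open(srt_path, "w", encoding="utf-8") as f:
        f.write(segments_to_srt(result["segments"]))
```

A call would look like `transcribe_to_srt("talk.mp3", "talk.srt")`, where both file names are placeholders. Passing the model name explicitly avoids depending on whatever default the installed version uses.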

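If you are on a server where the download is slow, you can fetch the weights with a download manager such as Xunlei instead. The links below are taken from the source code: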

"tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt",
    "tiny": "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt",
    "base.en": "https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt",
    "base": "https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt",
    "small.en": "https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt",
    "small": "https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt",
    "medium.en": "https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt",
    "medium": "https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt",
    "large-v1": "https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt",
    "large-v2": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
    "large": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",

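If you download the weights by hand, it is worth checking that the file arrived intact. The long hexadecimal path segment in each URL above appears to be the SHA-256 checksum of the file, so a minimal sketch of a check, under that assumption, could look like this. The function names are hypothetical.

```python
import hashlib


def expected_sha256(url):
    """Return the checksum embedded in the URL (the segment before the file name)."""
    return url.rstrip("/").split("/")[-2]


def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 of a file, reading it in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_download(path, url):
    """True if the file at `path` matches the checksum embedded in `url`."""
    return file_sha256(path) == expected_sha256(url)
```

Once the file checks out, `whisper.load_model` should also accept the path to a local checkpoint in place of a model name, so the downloaded `.pt` file can be loaded directly.
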
2. Subtitle merging

An SRT file produced by Whisper typically looks like this:

1
00:00:00,000 --> 00:00:08,240
Hello there, I hope you have all seen this, this is a new system by Facebook AI and what

2
00:00:08,240 --> 00:00:14,500
you're seeing here is a visualization of the attention maps of that neural network.

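As you can see, a long sentence is sometimes split across two subtitle entries during transcription. If you feed these entries straight into a translation tool, it translates each one as an independent sentence, and the result is painful to read.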

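So before translating, the entries need to be merged back into a single sentence, and their timestamps must be merged as well. A simple Python script can do this.

The merged result looks like this: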

1
00:00:00,000 --> 00:00:14,500
Hello there, I hope you have all seen this, this is a new system by Facebook AI and what you're seeing here is a visualization of the attention maps of that neural network.
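
The merge script itself is not shown here, so the following is only a minimal sketch of one way to do it. It assumes an entry is complete once its text ends with sentence-final punctuation (`.`, `!`, or `?`), and it keeps timestamps as strings, taking the start of the first entry and the end of the last. All function names are hypothetical.

```python
import re

SENTENCE_END = (".", "!", "?")  # assumption: these mark the end of a sentence


def parse_srt(text):
    """Parse SRT text into a list of dicts with "start", "end", and "text"."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip anything that is not a well-formed entry
        start, end = (part.strip() for part in lines[1].split("-->"))
        body = " ".join(line.strip() for line in lines[2:])
        cues.append({"start": start, "end": end, "text": body})
    return cues


def merge_cues(cues):
    """Join consecutive entries until the text ends a sentence."""
    merged, pending = [], None
    for cue in cues:
        if pending is None:
            pending = dict(cue)
        else:
            pending["end"] = cue["end"]  # extend the time range
            pending["text"] += " " + cue["text"]
        if pending["text"].rstrip().endswith(SENTENCE_END):
            merged.append(pending)
            pending = None
    if pending is not None:  # trailing text without final punctuation
        merged.append(pending)
    return merged


def format_srt(cues):
    """Render entries back to SRT text, renumbering from 1."""
    return "\n".join(
        f"{i}\n{c['start']} --> {c['end']}\n{c['text']}\n"
        for i, c in enumerate(cues, start=1)
    )
```

This rule is deliberately simple: an abbreviation such as "Dr." ends a sentence early, and a transcript without punctuation collapses into one entry, so real input may need a cap on merged length.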

3. Subtitle translation

Many tools can translate subtitles. For a list of recommendations, see https://www.jihosoft.cn/zimu/tutorial/translate-subtitles/

I use Subtitling Translation: drag the srt file in, choose the target language, and click Translate.

4. Subtitle splitting

Because of the earlier merge, the translated entries are very long and not suitable for use as subtitles directly:

4
00:00:30,160 --> 00:00:52,360
你可以看到这个系统既没有被训练来了解狗是什么，也没有被训练来进行任何类型的分割，但如果你看一下注意力图，它显然可以跟踪物体，它知道在图像中要注意什么，而且它可以做更多的事情。

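Subtitles generally follow some industry conventions; see https://zhuanlan.zhihu.com/p/348776142

Netflix's rule: at most 42 characters per line and at most 84 characters per subtitle.

So each entry needs to be split into shorter ones at a suitable character count. This can also be done in Python. The result after splitting looks like this: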

5
00:00:30,160 --> 00:00:39,194
你可以看到这个系统既没有被训练来了解狗是什么，也没有被训练来进行任何类

6
00:00:39,194 --> 00:00:48,229
型的分割，但如果你看一下注意力图，它显然可以跟踪物体，它知道在图像中要

7
00:00:48,229 --> 00:00:52,360
注意什么，而且它可以做更多的事情

I split purely by character count, which is crude but simple.
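
A minimal sketch of such a character-count split follows. It is an illustration rather than the exact script used here: it assumes each piece gets a share of the entry's duration proportional to its character count, that a trailing `。` is dropped before splitting, and that the limit is 35 characters per piece. These choices are inferred from the example output above, and the function names are hypothetical.

```python
def to_ms(timestamp):
    """Convert an SRT timestamp HH:MM:SS,mmm to milliseconds."""
    hms, ms = timestamp.split(",")
    h, m, s = (int(part) for part in hms.split(":"))
    return ((h * 60 + m) * 60 + s) * 1000 + int(ms)


def to_timestamp(ms):
    """Convert milliseconds back to the SRT form HH:MM:SS,mmm."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def split_cue(start, end, text, max_chars=35):
    """Split one entry into pieces of at most max_chars characters.

    Returns a list of (start, end, text) tuples whose time ranges are
    contiguous and proportional to the number of characters in each piece.
    """
    text = text.strip().rstrip("。")  # assumption: drop the final full stop
    if not text:
        return []
    start_ms = to_ms(start)
    duration = to_ms(end) - start_ms
    total = len(text)
    pieces = []
    for offset in range(0, total, max_chars):
        chunk = text[offset:offset + max_chars]
        # Integer arithmetic keeps boundaries exact and shared by neighbours.
        piece_start = start_ms + duration * offset // total
        piece_end = start_ms + duration * (offset + len(chunk)) // total
        pieces.append((to_timestamp(piece_start), to_timestamp(piece_end), chunk))
    return pieces
```

Applied to the long entry above with `max_chars=35`, this yields three pieces with the same boundaries as the example (39,194 and 48,229). Because it cuts at a fixed count, it can break a word in the middle, as the example does; preferring a break at the nearest `，` would be a natural refinement.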

At this point the subtitles are basically done. If you need higher quality, proofread them by hand from here.