Kaldi lesson tutorial example (repost)

Reposted from:

https://blog.csdn.net/q_xiami123/article/details/117019177?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522170312043616800188564167%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=170312043616800188564167&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduend~default-1-117019177-null-null.142^v96^pc_search_result_base6&utm_term=--remove-archive%20&spm=1018.2226.3001.4187

Creating the example directory
Step 1: under the egs directory, create a lesson folder, and inside it create a version folder named v1
mkdir lesson
cd lesson
mkdir v1
Result:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs# ls -l lesson/
total 4
drwxr-xr-x 2 root root 4096 5月 19 10:32 v1

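Why create v1 under lesson?

v1 marks the first version, which makes version management easier.
It also keeps the directory depth consistent with path.sh, which is created later. path.sh reaches KALDI_ROOT, the install directory, by going up a fixed number of levels, so without the v1 folder that path would have to be changed.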

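Step 2: create symlinks to steps, utils, and sid (for speaker recognition tasks)
Enter the v1 folder and create the symlinks with the commands below. If the task is not related to speaker recognition, the sid symlink is not needed.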

cd v1
ln -s ../../wsj/s5/steps/ .
ln -s ../../wsj/s5/utils/ .
ln -s ../../sre08/v1/sid .
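Symlinks are used to save disk space; a Linux symlink is comparable to a Windows shortcut. steps contains the standard processing scripts for each stage of speech recognition, and utils contains the tools and helper scripts those stages use.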
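The symlink layout can be tried in a throwaway directory without a Kaldi checkout. This is only a sketch: the temporary root stands in for kaldi/egs and is an assumption for illustration, while the relative link targets are the ones used above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for kaldi/egs, created in a temporary directory.
egs=$(mktemp -d)
mkdir -p "$egs/wsj/s5/steps" "$egs/wsj/s5/utils" "$egs/lesson/v1"

cd "$egs/lesson/v1"
# Relative links, as in the tutorial: two levels up from v1 is egs/.
ln -s ../../wsj/s5/steps steps
ln -s ../../wsj/s5/utils utils

# readlink prints the stored target; [ -d ... ] follows the link and checks it resolves.
steps_target=$(readlink steps)
utils_target=$(readlink utils)
echo "steps -> $steps_target"
echo "utils -> $utils_target"
[ -d steps ] && [ -d utils ] && links_ok=yes || links_ok=no
echo "links resolve: $links_ok"
```

Because the targets are relative, the links keep working if the whole Kaldi tree is moved, but they break if v1 is moved to a different depth.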

Step 3: create cmd.sh to choose how jobs run, on a single machine or on a cluster
vim cmd.sh
Add the following content:

export train_cmd="run.pl"

run.pl runs jobs on the local machine. On a cluster, a configuration like this can be used:

export train_cmd="queue.pl --mem 2G"
export decode_cmd="queue.pl --mem 4G"
export mkgraph_cmd="queue.pl --mem 8G"

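train_cmd controls how training jobs run, decode_cmd how decoding jobs run, and mkgraph_cmd how the decoding graph is built. In most cases only run.pl, the single-machine mode, is needed.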
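As a sketch, a cmd.sh can switch between the two modes with one flag. The variable use_cluster is hypothetical and introduced only for this example; the run.pl and queue.pl settings are the ones shown above.

```shell
#!/usr/bin/env bash
# Sketch of a cmd.sh that picks the job runner from one switch.
# use_cluster is a hypothetical variable introduced for this example.
use_cluster=${use_cluster:-false}

if [ "$use_cluster" = true ]; then
  # Grid engine: memory requests taken from the tutorial's example.
  export train_cmd="queue.pl --mem 2G"
  export decode_cmd="queue.pl --mem 4G"
  export mkgraph_cmd="queue.pl --mem 8G"
else
  # Single machine: every stage runs through run.pl.
  export train_cmd="run.pl"
  export decode_cmd="run.pl"
  export mkgraph_cmd="run.pl"
fi

echo "train_cmd=$train_cmd"
```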

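Step 4: create the path.sh environment file
Either copy path.sh from another egs recipe into this directory, or run vim path.sh and paste in the content below.

Option 1: run vim path.sh, paste the content below into the file, then save with :wq!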
export KALDI_ROOT=`pwd`/../../..
[ -f $KALDI_ROOT/tools/env.sh ] && . $KALDI_ROOT/tools/env.sh
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH
[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1
. $KALDI_ROOT/tools/config/common_path.sh
export LC_ALL=C

Option 2:
cp ../../aishell/s5/path.sh .
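A closer look at the content of path.sh:

export KALDI_ROOT=`pwd`/../../.. puts the Kaldi install directory into the environment, which here is equivalent to setting it to /opt/asr/kaldi. A variable set with export is temporary: it lasts only for the current shell session and is gone after a reboot. Setting the Kaldi environment temporarily like this is the usual choice.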
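To see why three levels up from egs/lesson/v1 lands on the install root, the following sketch rebuilds the same directory depth in a temporary location. The temporary root is an assumption that stands in for /opt/asr/kaldi, and `$(pwd -P)` is used in place of the backtick form only to avoid surprises with symlinked temp directories.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical install root in a temporary directory (stands in for /opt/asr/kaldi).
root=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$root/egs/lesson/v1"

cd "$root/egs/lesson/v1"
# Same idea as in path.sh: the value still contains the ../../.. segments.
KALDI_ROOT=$(pwd -P)/../../..
echo "raw:      $KALDI_ROOT"

# Normalising the path shows that it points three levels up, at the install root.
resolved=$(cd "$KALDI_ROOT" && pwd -P)
echo "resolved: $resolved"
```

If the recipe lived directly in egs/lesson without the v1 level, the same expression would point one directory above the install root.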

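[ -f $KALDI_ROOT/tools/env.sh ] && . $KALDI_ROOT/tools/env.sh first checks whether env.sh exists and, if it does, sources it. env.sh puts the Python executable on PATH, and this is also where the Python version is usually pinned. To change the version, go to the tools/python folder. In this installation Kaldi uses python2; to use python3, repoint the python symlink at the python3 executable.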

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1# more ../../../tools/env.sh
export PATH=/opt/asr/kaldi/tools/python:${PATH}
export PATH=/opt/asr/kaldi/tools/python:${PATH}
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1# cd ../../../tools/python/
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/tools/python# ls
python python2
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/tools/python# ls -l
total 0
lrwxrwxrwx 1 root root 18 3月 10 22:50 python -> /usr/bin/python2.7
lrwxrwxrwx 1 root root 18 1月 24 08:02 python2 -> /usr/bin/python2.7
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH adds the recipe's utils directory, Kaldi's tools/openfst/bin, and the current directory to PATH.

[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1
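This checks whether common_path.sh exists. If it does not, an error message is written to stderr and the current shell exits with status 1.

If the four lines above run without problems, . $KALDI_ROOT/tools/config/common_path.sh is sourced next. This script adds the directories of the executables built from the Kaldi sources to PATH, so they can be run by name without spelling out their location. For the list of directories, see the content of common_path.sh, which groups the executables into folders by function.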

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1# more ../../../tools/config/common_path.sh
# we assume KALDI_ROOT is already defined
[ -z "$KALDI_ROOT" ] && echo >&2 "The variable KALDI_ROOT must be already defined" && exit 1
# The formatting of the path export command is intentionally weird, because
# this allows for easy diff'ing
export PATH=\
${KALDI_ROOT}/src/bin:\
${KALDI_ROOT}/src/chainbin:\
${KALDI_ROOT}/src/featbin:\
${KALDI_ROOT}/src/fgmmbin:\
${KALDI_ROOT}/src/fstbin:\
${KALDI_ROOT}/src/gmmbin:\
${KALDI_ROOT}/src/ivectorbin:\
${KALDI_ROOT}/src/kwsbin:\
${KALDI_ROOT}/src/latbin:\
${KALDI_ROOT}/src/lmbin:\
${KALDI_ROOT}/src/nnet2bin:\
${KALDI_ROOT}/src/nnet3bin:\
${KALDI_ROOT}/src/nnetbin:\
${KALDI_ROOT}/src/online2bin:\
${KALDI_ROOT}/src/onlinebin:\
${KALDI_ROOT}/src/rnnlmbin:\
${KALDI_ROOT}/src/sgmm2bin:\
${KALDI_ROOT}/src/sgmmbin:\
${KALDI_ROOT}/src/tfrnnlmbin:\
${KALDI_ROOT}/src/cudadecoderbin:\
$PATH

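The last line, export LC_ALL=C, sets the locale to C (this has nothing to do with the C programming language). Under this locale, tools such as sort order text byte by byte, which is the ordering Kaldi expects for files like wav.scp and utt2spk. Being able to call Kaldi's compiled binaries by name, just like cat or ls, comes from the PATH entries added above, not from this line.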
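To see what export LC_ALL=C changes in practice, the small sketch below sorts four letters under the C locale. The input letters are arbitrary sample data.

```shell
#!/usr/bin/env bash
# Under the C locale, sort compares raw bytes, so all uppercase letters
# come before all lowercase letters regardless of the system language settings.
sorted=$(printf '%s\n' b A a B | LC_ALL=C sort | paste -s -d ' ' -)
echo "$sorted"
```

Under many other locales the same input would be ordered case-insensitively, which is why data files sorted on one machine can look unsorted on another unless the locale is fixed.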

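Run source path.sh to see what it does.

[Screenshot: a tool from ${KALDI_ROOT}/src/featbin run by name after sourcing path.sh]

Once path.sh has been sourced, the programs under ${KALDI_ROOT}/src/featbin can be called from any directory. Without these PATH entries, the shell cannot find them:

[Screenshot: the same command before the environment variables were added]

That covers path.sh. The next step is to create the local, feats, and data directories.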

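Step 5: create the conf, local, feats, data, and exp folders
Create the conf and local folders. conf holds the configuration files for feature extraction, such as the sample rate and the number of mel filters; it usually contains mfcc.conf and vad.conf. local holds the scripts specific to this project, whereas utils/ and steps/, introduced earlier, hold the shared main scripts.

feats stores the MFCC or fbank feature data. data stores the data lists, such as wav.scp, utt2spk, and spk2utt. exp stores the model files and intermediate training results.

Apart from conf and local, the other three folders do not have to be created during preparation; the main scripts create them as they run.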

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1# mkdir -p local conf data exp feats
That completes the example directory. Here is what the lesson project now contains:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1# ls
cmd.sh conf data exp feats local path.sh steps utils
Writing run.sh
Everything below revolves around run.sh. The template is shown here, and the following sections fill in each stage.

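#!/usr/bin/env bash
set -eo pipefail
. ./path.sh
. ./cmd.sh
# Set global env info
# Data preparation
# Feature extraction
# Mono training
# Graph compilation
# Decoding

The first four lines of run.sh almost never change. In a script, "." means execute in the current shell and is equivalent to source. The script starts with set -eo pipefail, which makes it exit as soon as a later command fails. By default a shell script runs from top to bottom and keeps going even after a command has failed, yet later commands usually depend on the results of earlier ones. With set -eo pipefail, the script stops at the first failing command, including a failure anywhere inside a pipeline.
. ./path.sh and . ./cmd.sh source the environment script and the run-mode script, both of which were covered in detail above.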
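The effect of the two options can be checked in isolation. The sketch below runs each case in its own bash process; the commands false and true are just stand-ins for a failing and a succeeding pipeline stage.

```shell
#!/usr/bin/env bash
# Each case runs in its own bash so the options do not leak between them.

# Without pipefail the pipeline status is that of the last command (true).
bash -c 'false | true'
without=$?

# With pipefail the failure of any command in the pipeline is reported.
bash -c 'set -o pipefail; false | true'
with=$?

# With -e the script stops at the first failing command, so "after" is never printed.
out=$(bash -c 'set -e; false; echo after')

echo "without pipefail: $without, with pipefail: $with, output after failure: '$out'"
```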

Set global env info
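Specify the source data directory, the download URL, the stage counters that control which steps run, the training data directory, and the feature directory, as in the following snippet.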

data=/data/cn-celeb1
data_url=https://openslr.magicdatatech.com/resources/82
stage=1
stop_stage=10
train_data=data/mfcc/train
featdir=feats/mfcc
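. utils/parse_options.sh

To override these variables when launching run.sh, the line . utils/parse_options.sh is required (the script lives in the utils directory linked earlier). It lets the defaults be changed from the command line, for example bash run.sh --stage 2 --stop_stage 20.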
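The idea behind command-line overrides can be sketched in a few lines of bash. This is a simplified stand-in written for illustration, not the real parse_options.sh; the function name parse_opts is hypothetical.

```shell
#!/usr/bin/env bash
# Simplified stand-in for option parsing; NOT the real parse_options.sh.
stage=1
stop_stage=10

parse_opts() {
  while [ $# -ge 2 ]; do
    case "$1" in
      --*)
        name=${1#--}          # drop the leading dashes
        name=${name//-/_}     # --stop-stage and --stop_stage both map to stop_stage
        # Only override variables that already have a default value.
        if [ -n "${!name+x}" ]; then
          printf -v "$name" '%s' "$2"
        else
          echo "unknown option: $1" >&2
          return 1
        fi
        shift 2
        ;;
      *) break ;;
    esac
  done
}

parse_opts --stage 2 --stop-stage 20
echo "stage=$stage stop_stage=$stop_stage"
```

The defaults must be assigned before the parsing step, which is why the variable block in run.sh comes first and the parsing line comes last.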

Data preparation
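The data preparation stage mainly downloads the data. If the data has already been downloaded by hand, this step can be skipped.
The thing to learn here is how to adapt the download_and_untar.sh script to your own task.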

# Data preparation
if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
for part in cn-celeb; do
local/download_and_untar.sh $data $data_url $part
done
fi

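download_and_untar.sh downloads and extracts the data; its internals need little analysis. The usage is shown below, where angle brackets mark required arguments and square brackets mark optional ones.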

download_and_untar.sh [--remove-archive] <data-base> <url-base> <corpus-part>
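With --remove-archive, the downloaded archive is deleted once it has been extracted. data-base is the directory where the downloaded data is stored; create it beforehand if it does not exist. url-base is the dataset URL with the file name removed. For example, the full URL of the cn-celeb dataset is https://www.openslr.org/resources/82/cn-celeb.tgz, so url-base=https://www.openslr.org/resources/82 and corpus-part=cn-celeb. To download three datasets, such as cn-celeb, cn-celeb2-part1, and a second cn-celeb2 part, the url-base is the same for all of them and only the part name changes. In practice this means editing the list variable inside the script, e.g. list="cn-celeb cn-celeb2-part1 cn-celeb-part2". For the script itself, the download_and_untar.sh written by CSLT is a good reference.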

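Comment out lines 54-67 of that script: on a re-download, this block checks whether the existing archive is complete and, if it is not, deletes it and downloads it again.

Key points for download_and_untar.sh:
- Modify list
- Comment out lines 54-67 of the script

That is essentially all there is to this stage; in most cases you just adapt this template.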
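The relationship between url-base, corpus-part, and the full archive URL can be written out as a short sketch. It only builds and prints URLs and downloads nothing; the .tgz suffix follows the single full URL quoted above and is an assumption for any other part name.

```shell
#!/usr/bin/env bash
# url_base and the part name are the ones quoted in the text; nothing is downloaded.
url_base=https://www.openslr.org/resources/82
list="cn-celeb"

urls=()
for part in $list; do
  # Full archive URL = <url-base>/<corpus-part>.tgz
  urls+=("$url_base/$part.tgz")
done
printf '%s\n' "${urls[@]}"
```

Adding more part names to list produces one URL per part while url_base stays unchanged.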

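Preparing the wav.scp, utt2spk, spk2utt, and text files
These four files usually live under the top-level data directory. All of them are required, and they provide the input for the feature extraction that follows.
The full dataset has roughly 1000 speakers (the count of 1002 below includes the total line printed by ls -l and a nested data entry); only part of the cn-celeb data is used here.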

(base) root@ai-PowerEdge-R740:/data/cn-celeb1/CN-Celeb/data# ls -l | wc -l
1002
(base) root@ai-PowerEdge-R740:/data/cn-celeb1/CN-Celeb/data# ls | head
data
id00000
id00001
id00002
id00003
id00004
id00005
id00006
id00007
id00008
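Under the data directory, create the mfcc/train folder and write the paths of 400 audio files into wav.lst. The key tool is find. The paths in wav.lst may be relative or absolute; absolute paths are the usual choice.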

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# find /data/cn-celeb1/CN-Celeb/data/ -name "*.wav" | head -400 > wav.lst
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# ls
wav.lst
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# wc wav.lst
400 400 23531 wav.lst
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head -2 wav.lst
/data/cn-celeb1/CN-Celeb/data/id00055/play-02-015.wav
/data/cn-celeb1/CN-Celeb/data/id00055/play-02-006.wav

Use soxi to inspect the audio format:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# soxi /data/cn-celeb1/CN-Celeb/data/id00055/play-02-016.wav

Input File : '/data/cn-celeb1/CN-Celeb/data/id00055/play-02-016.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:10.91 = 174615 samples ~ 818.508 CDDA sectors
File Size : 349k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM

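Next, create the wav.scp file. Its first column is the uttID and its second column is the wav path. The uttID has the form spkID_wavID, so the spkID is a prefix of the uttID.

To get the spkID, awk -F '/' sets "/" as the field separator for each line. Split this way, the spkID is exactly the 6th field, so '{print $6}' prints it.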

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat wav.lst | awk -F '/' '{print $6}' | head
id00055
id00055
Print the wavID. The .wav extension does not have to be removed from it.

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat wav.lst | awk -F '/' '{print $7}' | head
play-02-015.wav
play-02-006.wav

Print spkID_wavID by concatenating the fields as $6"_"$7:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat wav.lst | awk -F '/' '{print $6"_"$7}' | head
id00055_play-02-015.wav
id00055_play-02-006.wav

Print uttID and spkID:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat wav.lst | awk -F '/' '{print $6"_"$7,$6}' | head
id00055_play-02-015.wav id00055

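Strip the .wav extension from the uttID with sed "s\.wav\ \g": s means substitute, and here .wav is replaced with a space. The uttID could in fact keep the extension; leaving it in does not affect feature extraction.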

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat wav.lst | awk -F '/' '{print $6"_"$7,$6}' | sed "s\.wav\ \g" | head
id00055_play-02-015 id00055
id00055_play-02-006 id00055
id00055_play-02-004 id00055

Collapse the resulting two spaces into one with awk -F ' ' '{print $1, $2}':

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat wav.lst | awk -F '/' '{print $6"_"$7,$6}' | sed "s\.wav\ \g" | awk -F ' ' '{print $1, $2}' | head
id00055_play-02-015 id00055
id00055_play-02-006 id00055
id00055_play-02-004 id00055
The utt2spk file can now be generated:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat wav.lst | awk -F '/' '{print $6"_"$7,$6}' | sed "s\.wav\ \g" | awk -F ' ' '{print $1, $2}' > utt2spk
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# ls
utt2spk wav.lst
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head utt2spk
id00055_play-02-015 id00055
id00055_play-02-006 id00055
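The extension stripping and the space clean-up can also be folded into a single awk call. The sketch below is an alternative to the pipeline above, not the author's command; it uses two sample paths copied from wav.lst and assumes the same directory depth, so that the speaker is field 6 and the file name is field 7.

```shell
#!/usr/bin/env bash
# Two sample paths copied from wav.lst above.
paths='/data/cn-celeb1/CN-Celeb/data/id00055/play-02-015.wav
/data/cn-celeb1/CN-Celeb/data/id00055/play-02-006.wav'

# Split on "/", strip a trailing ".wav" from the file name, print "uttID spkID".
utt2spk=$(printf '%s\n' "$paths" |
  awk -F '/' '{ name = $7; sub(/\.wav$/, "", name); print $6 "_" name, $6 }')
echo "$utt2spk"
```

Anchoring the pattern with `\.wav$` removes only a real extension at the end of the name, so a file whose name happens to contain "wav" elsewhere is left intact.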

wav.scp can now be generated from utt2spk and wav.lst:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head -2 *
==> utt2spk <==
id00055_play-02-015 id00055
id00055_play-02-006 id00055

==> wav.lst <==
/data/cn-celeb1/CN-Celeb/data/id00055/play-02-015.wav
/data/cn-celeb1/CN-Celeb/data/id00055/play-02-006.wav
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# paste -d ' ' utt2spk wav.lst | awk '{print $1, $3}' | head -2
id00055_play-02-015 /data/cn-celeb1/CN-Celeb/data/id00055/play-02-015.wav
id00055_play-02-006 /data/cn-celeb1/CN-Celeb/data/id00055/play-02-006.wav
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# paste -d ' ' utt2spk wav.lst | awk '{print $1, $3}' > wav.scp
id00055_play-02-015 /data/cn-celeb1/CN-Celeb/data/id00055/play-02-015.wav
id00055_play-02-006 /data/cn-celeb1/CN-Celeb/data/id00055/play-02-006.wav
The wav.scp file has been generated:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head -2 *
==> utt2spk <==
id00055_play-02-015 id00055
id00055_play-02-006 id00055

==> wav.lst <==
/data/cn-celeb1/CN-Celeb/data/id00055/play-02-015.wav
/data/cn-celeb1/CN-Celeb/data/id00055/play-02-006.wav

==> wav.scp <==
id00055_play-02-015 /data/cn-celeb1/CN-Celeb/data/id00055/play-02-015.wav
id00055_play-02-006 /data/cn-celeb1/CN-Celeb/data/id00055/play-02-006.wav
Next, wav.scp and utt2spk need to be sorted so that their lines correspond one to one:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat wav.scp | sort -k 1,1 -u -o wav.scp
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat utt2spk | sort -k 1,1 -u -o utt2spk
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head utt2spk wav.scp
==> utt2spk <==
id00055_interview-01-001 id00055
id00055_interview-01-002 id00055
id00055_interview-01-003 id00055
id00055_interview-01-004 id00055
id00055_interview-01-005 id00055
id00055_interview-01-006 id00055
id00055_interview-01-007 id00055
id00055_interview-01-008 id00055
id00055_play-01-001 id00055
id00055_play-01-002 id00055

==> wav.scp <==
id00055_interview-01-001 /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-001.wav
id00055_interview-01-002 /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-002.wav

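Basic usage: sort -k 1,1 -u -o utt2spk. -k selects the sort key: 1,1 means sort on the first field only, and -k 2,2 would sort on the second. -u drops lines with duplicate keys, and -o writes the sorted result to the named file, much like a redirection. See the sort manual for the full set of options.

After sorting, the line order of both utt2spk and wav.scp has changed.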
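The behaviour of these sort options can be checked on a tiny file. The sketch below uses a temporary directory and three sample lines, one of them a duplicate; the file content is made up from IDs that appear in this tutorial.

```shell
#!/usr/bin/env bash
# Work in a temporary directory with a small unsorted utt2spk containing one duplicate.
tmp=$(mktemp -d)
printf '%s\n' \
  'id00055_play-02-015 id00055' \
  'id00055_interview-01-001 id00055' \
  'id00055_play-02-015 id00055' > "$tmp/utt2spk"

# Sort on the first field only, drop duplicate keys, write back to the same file.
LC_ALL=C sort -k 1,1 -u -o "$tmp/utt2spk" "$tmp/utt2spk"
cat "$tmp/utt2spk"
n_lines=$(wc -l < "$tmp/utt2spk" | tr -d ' ')
first_id=$(head -n 1 "$tmp/utt2spk" | cut -d ' ' -f 1)
```

Passing the file name as an argument together with -o lets sort read and rewrite the same file safely, which a plain `> utt2spk` redirection would not.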

Finally, create the spk2utt file with the utils/utt2spk_to_spk2utt.pl script:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# ../../../utils/utt2spk_to_spk2utt.pl utt2spk > spk2utt
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# ls
spk2utt utt2spk wav.lst wav.scp
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head -1 spk2utt
id00055 id00055_interview-01-001 id00055_interview-01-002 id00055_interview-01-003 id00055_interview-01-004 id00055_interview-01-005 id00055_interview-01-006 id00055_interview-01-007 id00055_interview-01-008 id00055_play-01-001 id00055_play-01-002 id00055_play-01-003 id00055_play-01-004 id00055_play-01-005 id00055_play-01-006 id00055_play-01-007 id00055_play-01-008 id00055_play-01-009 id00055_play-01-010 id00055_play-01-011 id00055_play-01-012 id00055_play-01-013 id00055_play-01-014 id00055_play-02-001 id00055_play-02-002 id00055_play-02-003 id00055_play-02-004 id00055_play-02-005 id00055_play-02-006 id00055_play-02-007 id00055_play-02-008 id00055_play-02-009 id00055_play-02-010 id00055_play-02-011 id00055_play-02-012 id00055_play-02-013 id00055_play-02-014 id00055_play-02-015 id00055_play-02-016 id00055_play-02-017 id00055_play-02-018 id00055_play-02-019 id00055_play-02-020 id00055_play-02-021 id00055_play-02-022 id00055_play-02-023 id00055_play-02-024 id00055_play-02-025 id00055_play-02-026 id00055_play-02-027 id00055_play-02-028 id00055_play-02-029 id00055_play-02-030 id00055_play-02-031
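What the conversion does can be sketched with awk: group the utterance IDs by speaker and print one line per speaker. This is an illustrative equivalent, not the Perl script itself; the three sample lines use IDs that appear in this tutorial.

```shell
#!/usr/bin/env bash
# Sample utt2spk lines covering two speakers.
utt2spk='id00055_interview-01-001 id00055
id00055_play-01-001 id00055
id00534_interview-02-060 id00534'

# Collect utterance IDs per speaker, keeping speakers in first-seen order.
spk2utt=$(printf '%s\n' "$utt2spk" | awk '
  !($2 in utts) { order[++n] = $2 }
  { utts[$2] = utts[$2] " " $1 }
  END { for (i = 1; i <= n; i++) print order[i] utts[order[i]] }')
echo "$spk2utt"
```

Since utt2spk is already sorted and every uttID starts with its spkID, the first-seen order of speakers is also their sorted order.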

Next, prepare text, the transcript that goes with each utterance:

(base) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head /data/aishell_s0/data_aishell/transcript/aishell_transcript_v0.8.txt
BAC009S0002W0122 而对楼市成交抑制作用最大的限购
BAC009S0002W0123 也成为地方政府的眼中钉
BAC009S0002W0124 自六月底呼和浩特市率先宣布取消限购后
BAC009S0002W0125 各地政府便纷纷跟进
BAC009S0002W0126 仅一个多月的时间里
BAC009S0002W0127 除了北京上海广州深圳四个一线城市和三亚之外
BAC009S0002W0128 四十六个限购城市当中
BAC009S0002W0129 四十一个已正式取消或变相放松了限购
BAC009S0002W0130 财政金融政策紧随其后而来
BAC009S0002W0131 显示出了极强的威力
(base) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat wav.lst | wc -l
1000
(base) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat /data/aishell_s0/data_aishell/transcript/aishell_transcript_v0.8.txt | wc -l
141600

With data preparation done, we have the three files wav.scp, utt2spk, and spk2utt. The key commands were:

find /data/cn-celeb1/CN-Celeb/data/ -name "*.wav" | head -400 > wav.lst
cat wav.lst | awk -F '/' '{print $6"_"$7,$6}' | sed "s\.wav\ \g" | awk -F ' ' '{print $1, $2}' > utt2spk
paste -d ' ' utt2spk wav.lst | awk '{print $1, $3}' > wav.scp
cat wav.scp | sort -k 1,1 -u -o wav.scp
cat utt2spk | sort -k 1,1 -u -o utt2spk
../../../utils/utt2spk_to_spk2utt.pl utt2spk > spk2utt
A simplified version:

find /data/cn-celeb1/CN-Celeb/data/ -name "*.wav" | head -800 > wav.lst
cat wav.lst | awk -F '/' '{print $6"_"$7,$0}' > wav.scp
cat wav.lst | awk -F '/' '{print $6"_"$7,$6}' > utt2spk
cat utt2spk | sort -k 1,1 -u -o utt2spk
cat wav.scp | sort -k 1,1 -u -o wav.scp
../../../utils/utt2spk_to_spk2utt.pl utt2spk > spk2utt
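Summary of this step

In real projects this part can be a little more involved. For example, audio_dir may contain three subsets: train, dev, and test. In that case, first write all wav paths under audio_dir into wav.flist, then use grep to filter the train, dev, and test entries into a wav.flist inside each subset's data directory. The aishell_data_prep.sh script under local in the aishell recipe shows one way to do this.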
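The find-then-grep split can be sketched on an empty dummy corpus. The directory layout audio_dir/{train,dev,test}/speaker/file.wav is a hypothetical example created in a temporary directory; it is not taken from aishell_data_prep.sh.

```shell
#!/usr/bin/env bash
# Hypothetical corpus layout: audio_dir/{train,dev,test}/<speaker>/<file>.wav
tmp=$(mktemp -d)
mkdir -p "$tmp/audio_dir/train/s1" "$tmp/audio_dir/dev/s2" "$tmp/audio_dir/test/s3"
touch "$tmp/audio_dir/train/s1/a.wav" "$tmp/audio_dir/train/s1/b.wav" \
      "$tmp/audio_dir/dev/s2/c.wav" "$tmp/audio_dir/test/s3/d.wav"

# One list of every wav file, then one filtered list per subset directory.
find "$tmp/audio_dir" -name '*.wav' | LC_ALL=C sort > "$tmp/wav.flist"
for part in train dev test; do
  mkdir -p "$tmp/data/$part"
  grep "/audio_dir/$part/" "$tmp/wav.flist" > "$tmp/data/$part/wav.flist"
done

n_train=$(wc -l < "$tmp/data/train/wav.flist" | tr -d ' ')
n_dev=$(wc -l < "$tmp/data/dev/wav.flist" | tr -d ' ')
n_test=$(wc -l < "$tmp/data/test/wav.flist" | tr -d ' ')
echo "train=$n_train dev=$n_dev test=$n_test"
```

Matching on the path component with slashes on both sides avoids picking up files whose names merely contain the word train, dev, or test.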

Feature extraction
Before extracting features, configure the MFCC extraction parameters:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cp /opt/asr/kaldi/egs/aishell/v1/conf/mfcc.conf ../../../conf/
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cd ../../../conf/
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/conf# ls
mfcc.conf
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/conf# cat mfcc.conf
--sample-frequency=16000
--num-mel-bins=40 #higher than the default which is 23
--num-ceps=20 # higher than the default which is 12.

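The sample rate, --sample-frequency, must be provided; here it is 16000, matching the rate reported by soxi above. Other parameters can be given on the command line. To see which options are available, run the program without arguments: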

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/conf# compute-mfcc-feats
compute-mfcc-feats

Create MFCC feature files.
Usage: compute-mfcc-feats [options...] <wav-rspecifier> <feats-wspecifier>

Options:
--allow-downsample : If true, allow the input waveform to have a higher frequency than the specified --sample-frequency (and we'll downsample). (bool, default = false)
--allow-upsample : If true, allow the input waveform to have a lower frequency than the specified --sample-frequency (and we'll upsample). (bool, default = false)
--blackman-coeff : Constant coefficient for generalized Blackman window. (float, default = 0.42)
--cepstral-lifter : Constant that controls scaling of MFCCs (float, default = 22)
--channel : Channel to extract (-1 -> expect mono, 0 -> left, 1 -> right) (int, default = -1)
--debug-mel : Print out debugging information for mel bin computation (bool, default = false)
--dither : Dithering constant (0.0 means no dither). If you turn this off, you should set the --energy-floor option, e.g. to 1.0 or 0.1 (float, default = 1)
--energy-floor : Floor on energy (absolute, not relative) in MFCC computation. Only makes a difference if --use-energy=true; only necessary if --dither=0.0. Suggested values: 0.1 or 1.0 (float, default = 0)
--frame-length : Frame length in milliseconds (float, default = 25)

Next, set up vad.conf:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/conf# compute-vad
compute-vad

This program reads input features and writes out, for each utterance,
a vector of floats that are 1.0 if we judge the frame voiced and 0.0
otherwise. The algorithm is very simple and is based on thresholding
the log mel energy (and taking the consensus of threshold decisions
within a window centered on the current frame). See the options for
more details, and egs/sid/s1/run.sh for examples; this program is
intended for use in speaker-ID.

Usage: compute-vad [options] <feats-rspecifier> <vad-wspecifier>
e.g.: compute-vad scp:feats.scp ark:vad.ark

Options:
--omit-unvoiced-utts : If true, do not write out voicing information for utterances that were judged 100% unvoiced. (bool, default = false)
--vad-energy-mean-scale : If this is set to s, to get the actual threshold we let m be the mean log-energy of the file, and use s*m + vad-energy-threshold (float, default = 0.5)
--vad-energy-threshold : Constant term in energy threshold for MFCC0 for VAD (also see --vad-energy-mean-scale) (float, default = 5)
--vad-frames-context : Number of frames of context on each side of central frame, in window for which energy is monitored (int, default = 0)
--vad-proportion-threshold : Parameter controlling the proportion of frames within the window that need to have more energy than the threshold (float, default = 0.6)

Standard options:
--config : Configuration file to read (this option may be repeated) (string, default = "")
--help : Print out usage message (bool, default = false)
--print-args : Print the command line arguments (to stderr) (bool, default = true)
--verbose : Verbose level (higher->more logging) (int, default = 0)

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/conf# echo --vad-energy-threshold=5.5 >> vad.conf
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/conf# echo --vad-energy-mean-scale=0.5 >> vad.conf
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/conf# echo --vad-proportion-threshold=0.12 >> vad.conf
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/conf# echo --vad-frames-context=2 >> vad.conf
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/conf# head vad.conf
--vad-energy-threshold=5.5
--vad-energy-mean-scale=0.5
--vad-proportion-threshold=0.12
--vad-frames-context=2

The configuration under conf is now complete:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/conf# head *
==> mfcc.conf <==
--sample-frequency=16000
--num-mel-bins=40 #higher than the default which is 23
--num-ceps=20 # higher than the default which is 12.

==> vad.conf <==
--vad-energy-threshold=5.5
--vad-energy-mean-scale=0.5
--vad-proportion-threshold=0.12
--vad-frames-context=2

Prepare the feature extraction script:

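for x in train_data; do
  # Resolve the variable named by $x (train_data) to its value (data/mfcc/train).
  eval data=\$$x
  echo $data
  # Extract MFCC features, then compute VAD decisions on them.
  steps/make_mfcc.sh --cmd "$train_cmd" --nj 40 --write-utt2num-frames true --mfcc-config conf/mfcc.conf $data $featdir/log_mfcc_$x $featdir/feat_$x || exit 1;
  sid/compute_vad_decision.sh --cmd "$train_cmd" --nj 2 $data $featdir/log_vad_$x $featdir/vad_$x || exit 1;
  # Keep the lists in the data directory consistent with each other.
  utils/fix_data_dir.sh $data
done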

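Walking through the script
train_data is the directory that holds wav.scp, utt2spk, and spk2utt: train_data=data/mfcc/train.
eval data=\$$x makes data take the value of the variable whose name is stored in x, so data also ends up as data/mfcc/train. Writing $x directly would not work, because the value of x is the name train_data. Writing \$$x without eval would not work either, because the script would receive the literal text $train_data rather than the directory. With eval, the line is expanded a second time, and $data becomes data/mfcc/train.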
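The different expansions can be compared side by side. In the sketch below the variable names plain, escaped, via_eval, and via_indirect are introduced only for illustration; `${!x}` is bash's built-in indirect expansion and is shown as an alternative to eval.

```shell
#!/usr/bin/env bash
train_data=data/mfcc/train
x=train_data            # x holds the NAME of the variable, not its value

plain=$x                # -> train_data
escaped=\$$x            # -> the literal text $train_data (no second expansion)
eval via_eval=\$$x      # eval re-parses "via_eval=$train_data" -> data/mfcc/train
via_indirect=${!x}      # bash indirect expansion, same result without eval

echo "$plain | $escaped | $via_eval | $via_indirect"
```

The indirect form avoids eval, which re-parses its whole argument as shell code, but it is specific to bash.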

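Usage of make_mfcc.sh: data-dir is the directory holding wav.scp, utt2spk, and spk2utt. The log-dir and mfcc-dir arguments that follow have default values and may be omitted. The rest are options. --nj sets the number of parallel jobs; the data is typically split by speaker, so the value should not exceed the number of speakers. --mfcc-config points to the mfcc.conf file under conf, described above. --write-utt2num-frames records the number of frames of each wav.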

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1# steps/make_mfcc.sh
steps/make_mfcc.sh
Usage: steps/make_mfcc.sh [options] <data-dir> [<log-dir> [<mfcc-dir>] ]
e.g.: steps/make_mfcc.sh data/train
Note: <log-dir> defaults to <data-dir>/log, and
<mfcc-dir> defaults to <data-dir>/data.
Options:
--mfcc-config <config-file> # config passed to compute-mfcc-feats.
--nj <nj> # number of parallel jobs.
--cmd <run.pl|queue.pl <queue opts>> # how to run jobs.
--write-utt2num-frames <true|false> # If true, write utt2num_frames file.
--write-utt2dur <true|false> # If true, write utt2dur file.
Usage of compute_vad_decision.sh: before using it, create the sid symlink first:

ln -s ../../sre08/v1/sid/ .
compute_vad_decision.sh works much like make_mfcc.sh. VAD does not need a sample rate. data-dir is again the folder holding wav.scp, utt2spk, and spk2utt, and the --vad-config settings are those in conf/vad.conf.

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1# sid/compute_vad_decision.sh
sid/compute_vad_decision.sh
Usage: sid/compute_vad_decision.sh [options] <data-dir> [<log-dir> [<vad-dir>]]
e.g.: sid/compute_vad_decision.sh data/train exp/make_vad mfcc
Note: <log-dir> defaults to <data-dir>/log, and <vad-dir> defaults to <data-dir>/data
Options:
--vad-config <config-file> # config passed to compute-vad-energy
--nj <nj> # number of parallel jobs
--cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.

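The last line of the script cross-checks the extracted features against the original wav.scp: it verifies that the generated lists and the input lists correspond, and it backs up the original files to .backup.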

utils/fix_data_dir.sh $data
Now run the feature extraction script and look at the output:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1# ./run.sh
./run.sh [info]: extract feat and make vad
data/mfcc/train
steps/make_mfcc.sh --cmd run.pl --nj 40 --write-utt2num-frames true --mfcc-config conf/mfcc.conf data/mfcc/train feats/mfcc/log_mfcc_train_data feats/mfcc/feat_train_data
steps/make_mfcc.sh: moving data/mfcc/train/feats.scp to data/mfcc/train/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/mfcc/train
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
steps/make_mfcc.sh: Succeeded creating MFCC features for train
sid/compute_vad_decision.sh --cmd run.pl --nj 2 data/mfcc/train feats/mfcc/log_vad_train_data feats/mfcc/vad_train_data
sid/compute_vad_decision.sh: moving data/mfcc/train/vad.scp to data/mfcc/train/.backup
Created VAD output for train
fix_data_dir.sh: kept all 400 utterances.
fix_data_dir.sh: old files are kept in data/mfcc/train/.backup
Output like the above shows that feature extraction succeeded.

The results are written to two folders:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1# ls -a data/mfcc/train/
. .. .backup conf feats.scp frame_shift spk2utt split2 split20 utt2dur utt2num_frames utt2spk vad.scp wav.lst wav.scp
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1# ls -a feats/mfcc/
. .. feat_train_data log_mfcc_train_data log_vad_train_data vad_train_data
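In the original data directory, everything except wav.lst, wav.scp, utt2spk, and spk2utt is newly generated. feats/mfcc is also new; it contains the feature files and logs at the locations specified in the script.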

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head utt2dur
id00055_interview-01-001 10.70938 (in seconds)
id00055_interview-01-002 12.90669
id00055_interview-01-003 7.701375
id00055_interview-01-004 7.701375
id00055_interview-01-005 10.83737
id00055_interview-01-006 15.70138
id00055_interview-01-007 7.701375
id00055_interview-01-008 8.512
id00055_play-01-001 1.509313
id00055_play-01-002 3.71525
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head frame_shift
0.01 (in seconds)
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head utt2num_frames
id00055_interview-01-001 1069 (number of frames)
id00055_interview-01-002 1289
id00055_interview-01-003 768
id00055_interview-01-004 768
id00055_interview-01-005 1082
id00055_interview-01-006 1568
id00055_interview-01-007 768
id00055_interview-01-008 849
id00055_play-01-001 149
id00055_play-01-002 370
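The three files are related by simple arithmetic. The sketch below assumes a 25 ms frame length (the default shown in the compute-mfcc-feats help), the 10 ms shift from frame_shift, and that only frames lying completely inside the signal are counted; under those assumptions it reproduces the frame counts listed above from the durations in utt2dur. The function name frames_for is hypothetical.

```shell
#!/usr/bin/env bash
# Assumptions: 25 ms frames, 10 ms shift, and only frames that fit entirely
# inside the signal are counted.
frames_for() {
  awk -v dur="$1" -v win=0.025 -v hop=0.01 \
    'BEGIN { print int((dur - win) / hop) + 1 }'
}

f1=$(frames_for 10.70938)   # id00055_interview-01-001
f2=$(frames_for 12.90669)   # id00055_interview-01-002
f3=$(frames_for 8.512)      # id00055_interview-01-008
echo "$f1 $f2 $f3"
```

For the first utterance this gives int((10.70938 - 0.025) / 0.01) + 1 = 1068 + 1 = 1069 frames, the value recorded in utt2num_frames.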

feats.scp and utt2spk correspond line by line:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head feats.scp utt2spk
==> feats.scp <==
id00055_interview-01-001 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:25
id00055_interview-01-002 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:21611
id00055_interview-01-003 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:47597
id00055_interview-01-004 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:63163
id00055_interview-01-005 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:78729
id00055_interview-01-006 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:100575
id00055_interview-01-007 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:132141
id00055_interview-01-008 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:147707
id00055_play-01-001 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:164888
id00055_play-01-002 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:168069

==> utt2spk <==
id00055_interview-01-001 id00055
id00055_interview-01-002 id00055
id00055_interview-01-003 id00055
id00055_interview-01-004 id00055
id00055_interview-01-005 id00055
id00055_interview-01-006 id00055
id00055_interview-01-007 id00055
id00055_interview-01-008 id00055
id00055_play-01-001 id00055
id00055_play-01-002 id00055
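A quick way to confirm such a one-to-one correspondence is to compare only the ID columns of the two files. The sketch below uses small made-up stand-ins for feats.scp and utt2spk in a temporary directory; it is a simple manual check and not what fix_data_dir.sh does internally.

```shell
#!/usr/bin/env bash
# Small stand-ins for feats.scp and utt2spk, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'utt1 feats.ark:25' 'utt2 feats.ark:900' > "$tmp/feats.scp"
printf '%s\n' 'utt1 spkA' 'utt2 spkA' > "$tmp/utt2spk"

# Compare only the first column (the utterance IDs) of both files.
cut -d ' ' -f 1 "$tmp/feats.scp" > "$tmp/feats.ids"
cut -d ' ' -f 1 "$tmp/utt2spk" > "$tmp/utt2spk.ids"
if cmp -s "$tmp/feats.ids" "$tmp/utt2spk.ids"; then
  ids_match=yes
else
  ids_match=no
fi
echo "utterance IDs match: $ids_match"
```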

feats.scp maps each utterance ID to the location of its features:

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head feats.scp
id00055_interview-01-001 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:25
id00055_interview-01-002 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:21611
id00055_interview-01-003 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:47597
id00055_interview-01-004 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:63163
id00055_interview-01-005 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:78729
id00055_interview-01-006 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:100575
id00055_interview-01-007 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:132141
id00055_interview-01-008 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:147707
id00055_play-01-001 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:164888
id00055_play-01-002 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data/raw_mfcc_train.1.ark:168069
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# ls ../../../feats/mfcc/feat_train_data/
raw_mfcc_train.1.ark raw_mfcc_train.16.ark raw_mfcc_train.22.ark raw_mfcc_train.29.ark raw_mfcc_train.35.ark raw_mfcc_train.5.ark
raw_mfcc_train.1.scp raw_mfcc_train.16.scp raw_mfcc_train.22.scp raw_mfcc_train.29.scp raw_mfcc_train.35.scp raw_mfcc_train.5.scp
raw_mfcc_train.10.ark raw_mfcc_train.17.ark raw_mfcc_train.23.ark raw_mfcc_train.3.ark raw_mfcc_train.36.ark raw_mfcc_train.6.ark
raw_mfcc_train.10.scp raw_mfcc_train.17.scp raw_mfcc_train.23.scp raw_mfcc_train.3.scp raw_mfcc_train.36.scp raw_mfcc_train.6.scp
raw_mfcc_train.11.ark raw_mfcc_train.18.ark raw_mfcc_train.24.ark raw_mfcc_train.30.ark raw_mfcc_train.37.ark raw_mfcc_train.7.ark
raw_mfcc_train.11.scp raw_mfcc_train.18.scp raw_mfcc_train.24.scp raw_mfcc_train.30.scp raw_mfcc_train.37.scp raw_mfcc_train.7.scp
raw_mfcc_train.12.ark raw_mfcc_train.19.ark raw_mfcc_train.25.ark raw_mfcc_train.31.ark raw_mfcc_train.38.ark raw_mfcc_train.8.ark
raw_mfcc_train.12.scp raw_mfcc_train.19.scp raw_mfcc_train.25.scp raw_mfcc_train.31.scp raw_mfcc_train.38.scp raw_mfcc_train.8.scp
raw_mfcc_train.13.ark raw_mfcc_train.2.ark raw_mfcc_train.26.ark raw_mfcc_train.32.ark raw_mfcc_train.39.ark raw_mfcc_train.9.ark
raw_mfcc_train.13.scp raw_mfcc_train.2.scp raw_mfcc_train.26.scp raw_mfcc_train.32.scp raw_mfcc_train.39.scp raw_mfcc_train.9.scp
raw_mfcc_train.14.ark raw_mfcc_train.20.ark raw_mfcc_train.27.ark raw_mfcc_train.33.ark raw_mfcc_train.4.ark
raw_mfcc_train.14.scp raw_mfcc_train.20.scp raw_mfcc_train.27.scp raw_mfcc_train.33.scp raw_mfcc_train.4.scp
raw_mfcc_train.15.ark raw_mfcc_train.21.ark raw_mfcc_train.28.ark raw_mfcc_train.34.ark raw_mfcc_train.40.ark
raw_mfcc_train.15.scp raw_mfcc_train.21.scp raw_mfcc_train.28.scp raw_mfcc_train.34.scp raw_mfcc_train.40.scp

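raw_mfcc_train.1.ark is a binary file that holds the features of many utterances. The number after the colon is the byte offset of that utterance within the archive, that is, the position at which its feature data starts.

vad.scp works the same way: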

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head vad.scp
id00055_interview-01-001 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:25
id00055_interview-01-002 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:4336

Because nj was set to 2, there are two ark files here. vad_train.1.ark holds the per-frame VAD values for the utterances in the first split.

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data# head vad_train.1.scp
id00055_interview-01-001 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:25
id00055_interview-01-002 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:4336
id00055_interview-01-003 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:9527
id00055_interview-01-004 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:12634
id00055_interview-01-005 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:15741
id00055_interview-01-006 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:20104
id00055_interview-01-007 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:26411
id00055_interview-01-008 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:29518
id00055_play-01-001 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:32944
id00055_play-01-002 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:33570
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data# tail vad_train.1.scp
id00534_interview-02-060 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:627042
id00534_interview-02-061 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:627433
id00534_interview-02-062 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:628192
id00534_interview-02-063 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:632103
id00534_interview-02-064 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:633134
id00534_interview-02-065 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:635445
id00534_interview-02-066 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:645752
id00534_interview-02-067 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:646115
id00534_interview-02-068 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:649710
id00534_interview-02-069 /opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data/vad_train.1.ark:653621
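One quick way to see how utterances were distributed over the nj splits is to count scp entries per archive. The sketch below does that with awk on a few made-up entries; the helper name count_per_ark and the sample ids are illustrative only, not taken from the real data.

```shell
# Count how many utterances each ark file holds, based on scp entries.
count_per_ark() {
  awk '{
    sub(/:[0-9]+$/, "", $2)   # drop the ":<offset>" suffix
    count[$2]++
  }
  END { for (ark in count) print ark, count[ark] }' | sort
}

counts=$(printf '%s\n' \
  'utt_a vad_train.1.ark:25' \
  'utt_b vad_train.1.ark:4336' \
  'utt_c vad_train.2.ark:25' \
  | count_per_ark)
echo "$counts"
```

On a real data directory the same function could be fed with cat vad.scp, assuming the paths contain no spaces.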

How do you look inside this binary file? Convert it to text with copy-feats.

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data# copy-feats
copy-feats

Copy features [and possibly change format]
Usage: copy-feats [options] <feature-rspecifier> <feature-wspecifier>
or: copy-feats [options] <feats-rxfilename> <feats-wxfilename>
e.g.: copy-feats ark:- ark,scp:foo.ark,foo.scp
or: copy-feats ark:foo.ark ark,t:txt.ark

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data# copy-feats ark:raw_mfcc_train.1.ark ark,t:raw_mfcc_train.1.txt
copy-feats ark:raw_mfcc_train.1.ark ark,t:raw_mfcc_train.1.txt
LOG (copy-feats[5.5.874~1-e1dd0]:main():copy-feats.cc:143) Copied 10 feature matrices.
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data# head raw_mfcc_train.1.txt
id00055_interview-01-001 [
12.43639 -21.39647 -23.50363 -17.66363 -10.20084 -0.6567631 -6.793909 6.670956 -14.87434 -4.254652 2.793976 5.536527 -0.8086554 -1.156208 -6.419188 6.523203 -7.518467 -11.07696 9.717544 -0.02353048
14.46793 -13.07955 -26.43559 -5.3449 -23.41877 -13.53366 -21.76398 -1.957577 -29.76107 1.194513 -7.264804 13.69981 1.204508 2.946947 -10.77626 2.03009 -17.69189 -16.5446 4.321137 1.757263
15.53716 -3.25045 -24.09002 3.625901 -24.6204 1.324299 -31.77462 0.5817032 -35.44082 18.79673 -20.66233 14.74193 -20.92516 7.214229 -23.69198 -2.927828 -5.302175 -0.3031273 3.130547 -0.4949169
15.96486 -7.030872 -22.18425 8.80039 -31.52427 8.626863 -38.25284 7.328015 -8.916069 30.71368 -17.47999 14.39456 -21.65611 6.065345 -23.69198 -15.30304 -23.45328 -7.799612 -0.1237345 -0.9139271
16.39255 -13.07955 -31.27333 -2.796198 -37.71728 22.09692 -63.96703 19.81213 -12.3208 27.1998 -38.63321 3.452284 -13.61557 13.12277 -14.26191 -4.477177 -16.86884 -4.277167 -0.838089 0.2907271
16.82024 -14.59171 -32.15292 10.60021 -36.39021 17.27431 -66.83359 10.77757 -19.53751 23.83857 -28.1502 17.34724 -26.95556 11.97389 -3.36924 6.523203 -3.61357 6.011729 -3.854251 1.076371
17.46178 -13.07955 -33.03251 4.750791 -56.12315 23.0947 -57.51726 6.83522 -21.32258 20.10388 -48.66392 8.315518 -41.63607 -1.812713 -22.82569 -4.477177 -19.33801 -3.915891 -6.653427 -0.3377881
18.10332 -13.07955 -37.59008 1.876423 -63.64797 35.60001 -39.54848 16.6911 -16.57671 29.81409 -45.49289 48.881 -15.62571 21.78878 -7.72631 -3.237698 -5.091099 5.638618 -7.21608 -0.2330356
18.42409 -16.10388 -42.90724 -2.371414 -48.77622 43.22137 -46.83647 26.0542 -9.767251 30.00082 -23.09589 40.46823 -3.492873 26.10878 11.88573 -6.181461 -4.56341 6.011729 -0.7587161 1.023995
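Once the features are in text form, ordinary text tools can summarize them. The following sketch counts frames (rows) and feature dimensions (columns) per utterance. It assumes the text layout seen above, where a header line ends in "[" and the last row of a matrix ends in "]"; the helper name matrix_shapes and the tiny sample matrices are made up for illustration.

```shell
# Print "<utt-id> <num-frames> <dim>" for each matrix in a text-mode ark.
matrix_shapes() {
  awk '
    $NF == "[" { utt = $1; frames = 0; dim = 0; next }   # header line: "<utt-id> ["
    {
      closing = ($NF == "]")            # the last row carries the closing bracket
      cols = closing ? NF - 1 : NF
      if (cols > 0) { frames++; dim = cols }
      if (closing) print utt, frames, dim
    }'
}

shapes=$(printf '%s\n' \
  'utt_a  [' \
  '  1.0 2.0 3.0' \
  '  4.0 5.0 6.0 ]' \
  'utt_b  [' \
  '  7.0 8.0 9.0 ]' \
  | matrix_shapes)
echo "$shapes"
```

Run against a file such as raw_mfcc_train.1.txt, the frame counts should agree with utt2num_frames if that file was written for the same data.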

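In the command copy-feats ark:raw_mfcc_train.1.ark ark,t:raw_mfcc_train.1.txt, the first argument ark:raw_mfcc_train.1.ark is the read specifier: it says the input is the archive raw_mfcc_train.1.ark. The second argument ark,t:raw_mfcc_train.1.txt is the write specifier: ark means the output is an archive, t means it is written in text mode, and the name after the colon is the output file.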
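The structure of these specifiers is simply "options, then a colon, then file names". The sketch below pulls the two parts apart with shell parameter expansion; describe_specifier is a hypothetical helper for illustration and is not part of Kaldi.

```shell
# Split a Kaldi-style specifier into its option list and its file list.
describe_specifier() {
  spec=$1
  opts=${spec%%:*}    # everything before the first colon
  files=${spec#*:}    # everything after the first colon
  echo "options=$opts files=$files"
}

read_spec=$(describe_specifier "ark:raw_mfcc_train.1.ark")
write_spec=$(describe_specifier "ark,t:raw_mfcc_train.1.txt")
both_spec=$(describe_specifier "ark,t,scp:raw_mfcc_train.1.txt,raw_mfcc_train_t.1.scp")
printf '%s\n' "$read_spec" "$write_spec" "$both_spec"
```

The third call shows the combined form used in this article, where one write specifier names both the text archive and its index.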

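Besides the text file, copy-feats can also write a matching index (scp) file in the same run. Note that the index then stores the relative path given on the command line. In ark,t,scp the ark option must come before scp, and the file names after the colon follow the same order: archive first, then index.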

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data# copy-feats ark:raw_mfcc_train.1.ark ark,t,scp:raw_mfcc_train.1.txt,raw_mfcc_train_t.1.scp
copy-feats ark:raw_mfcc_train.1.ark ark,t,scp:raw_mfcc_train.1.txt,raw_mfcc_train_t.1.scp
LOG (copy-feats[5.5.874~1-e1dd0]:main():copy-feats.cc:143) Copied 10 feature matrices.
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/feats/mfcc/feat_train_data# head raw_mfcc_train_t.1.scp
id00055_interview-01-001 raw_mfcc_train.1.txt:25
id00055_interview-01-002 raw_mfcc_train.1.txt:205739
id00055_interview-01-003 raw_mfcc_train.1.txt:454346
id00055_interview-01-004 raw_mfcc_train.1.txt:602555
id00055_interview-01-005 raw_mfcc_train.1.txt:750627
id00055_interview-01-006 raw_mfcc_train.1.txt:959657
id00055_interview-01-007 raw_mfcc_train.1.txt:1261969
id00055_interview-01-008 raw_mfcc_train.1.txt:1409955
id00055_play-01-001 raw_mfcc_train.1.txt:1573316
id00055_play-01-002 raw_mfcc_train.1.txt:1602534

Convert the vectors listed in the index file to a text file

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data# copy-vector scp:vad_train.1.scp ark,t:vad_train.1.txt
copy-vector scp:vad_train.1.scp ark,t:vad_train.1.txt
LOG (copy-vector[5.5.874~1-e1dd0]:main():copy-vector.cc:90) Copied 212 vectors.
In this file, 1 marks a speech frame and 0 marks a non-speech frame.

id00055_interview-01-007 [ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ]
id00055_interview-01-008 [ 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ]
id00055_play-01-001 [ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ]
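Since each VAD vector is just a row of 0s and 1s, the share of speech frames per utterance is easy to compute. This is a minimal sketch assuming the one-line text format shown above ("<utt-id> [ values ]"); the helper name speech_ratio and the short sample vectors are illustrative, not real data.

```shell
# Print "<utt-id> <speech-frames>/<total-frames>" for each text-mode VAD vector.
speech_ratio() {
  awk '{
    speech = 0; total = 0
    for (i = 3; i < NF; i++) {    # fields between "[" and "]"
      total++
      if ($i == 1) speech++
    }
    printf "%s %d/%d\n", $1, speech, total
  }'
}

ratios=$(printf '%s\n' \
  'utt_a  [ 0 0 1 1 1 0 ]' \
  'utt_b  [ 1 1 1 1 ]' \
  | speech_ratio)
echo "$ratios"
```

Applied to vad_train.1.txt, this would show at a glance which utterances are mostly silence.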

The text vectors can also be generated directly from the ark file.

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/feats/mfcc/vad_train_data# copy-vector ark:vad_train.1.ark ark,t:vad_train.1.txt
copy-vector ark:vad_train.1.ark ark,t:vad_train.1.txt
LOG (copy-vector[5.5.874~1-e1dd0]:main():copy-vector.cc:90) Copied 212 vectors.

That covers the feature extraction part.

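Next, a short aside on the sox tool.
sox can change the sample rate, the bit depth, the file encoding format, the volume, and the speed or pitch of the audio.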

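To convert a 16 kHz file to 8 kHz: -r 8000 resamples the output to 8000 Hz, -b 16 sets the bit depth to 16 bits, -c 1 keeps a single output channel, and -t wav selects wav as the output format. A piped wav.scp entry should end with "- |": the "-" tells sox to write to standard output, and the trailing "|" tells Kaldi to run the entry as a command and read the audio from the pipe. The transcripts below show this suffix as -l; an entry without the trailing pipe is treated as a plain file name.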

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat wav.scp | awk '{print $1,"sox",$2,"-r 8000 -b 16 -c 1 -t wav -l"}'|head
id00055_interview-01-001 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-001.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-002 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-002.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-003 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-003.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-004 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-004.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-005 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-005.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-006 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-006.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-007 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-007.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-008 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-008.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_play-01-001 sox /data/cn-celeb1/CN-Celeb/data/id00055/play-01-001.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_play-01-002 sox /data/cn-celeb1/CN-Celeb/data/id00055/play-01-002.wav -r 8000 -b 16 -c 1 -t wav -l

The output above can be redirected to a new file, which then replaces the original wav.scp.

(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cat wav.scp | awk '{print $1,"sox",$2,"-r 8000 -b 16 -c 1 -t wav -l"}' > wav_new.scp
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# ls
conf feats.scp frame_shift spk2utt split2 split20 utt2dur utt2num_frames utt2spk vad.scp wav.lst wav.scp wav_new.scp
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head wav.scp wav_new.scp
==> wav.scp <==
id00055_interview-01-001 /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-001.wav
id00055_interview-01-002 /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-002.wav
id00055_interview-01-003 /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-003.wav
id00055_interview-01-004 /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-004.wav
id00055_interview-01-005 /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-005.wav
id00055_interview-01-006 /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-006.wav
id00055_interview-01-007 /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-007.wav
id00055_interview-01-008 /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-008.wav
id00055_play-01-001 /data/cn-celeb1/CN-Celeb/data/id00055/play-01-001.wav
id00055_play-01-002 /data/cn-celeb1/CN-Celeb/data/id00055/play-01-002.wav

==> wav_new.scp <==
id00055_interview-01-001 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-001.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-002 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-002.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-003 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-003.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-004 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-004.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-005 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-005.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-006 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-006.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-007 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-007.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_interview-01-008 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-008.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_play-01-001 sox /data/cn-celeb1/CN-Celeb/data/id00055/play-01-001.wav -r 8000 -b 16 -c 1 -t wav -l
id00055_play-01-002 sox /data/cn-celeb1/CN-Celeb/data/id00055/play-01-002.wav -r 8000 -b 16 -c 1 -t wav -l
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# mv wav.scp wav_bak.scp
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# mv wav_new.scp wav.scp
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# ls
conf feats.scp frame_shift spk2utt split2 split20 utt2dur utt2num_frames utt2spk vad.scp wav.lst wav.scp wav_bak.scp
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# head -1 wav.scp
id00055_interview-01-001 sox /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-001.wav -r 8000 -b 16 -c 1 -t wav -l
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc/train# cd ..
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data/mfcc# cd ..
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1/data# cd ..
(notebook) root@ai-PowerEdge-R740:/opt/asr/kaldi/egs/lesson/v1# ./run.sh
./run.sh [info]: extract feat and make vad
data/mfcc/train
steps/make_mfcc.sh --cmd run.pl --nj 40 --write-utt2num-frames true --mfcc-config conf/mfcc.conf data/mfcc/train feats/mfcc/log_mfcc_train_data feats/mfcc/feat_train_data
steps/make_mfcc.sh: moving data/mfcc/train/feats.scp to data/mfcc/train/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/mfcc/train
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
run.pl: 40 / 40 failed, log is in feats/mfcc/log_mfcc_train_data/make_mfcc_train.*.log

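The run fails. The error log shows that the sample rate in mfcc.conf had not been changed to match the resampled audio; once the sample rate is updated, the feature extraction above runs normally.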
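For a consistent setup, the resampling rate in wav.scp and the rate in the MFCC config need to agree. The sketch below generates a piped wav.scp entry ending in "- |" and the matching --sample-frequency line from one shared value. The helper name make_piped_wav_scp is hypothetical, and the sketch assumes wav paths without spaces; it only prints text and does not call sox or Kaldi.

```shell
# Build piped wav.scp entries and a matching MFCC config line from one rate.
rate=8000

make_piped_wav_scp() {
  # stdin: "<utt-id> <wav-path>"; stdout: "<utt-id> sox <wav-path> ... - |"
  awk -v rate="$1" '{print $1, "sox", $2, "-r", rate, "-b 16 -c 1 -t wav - |"}'
}

entry=$(printf '%s\n' \
  'id00055_interview-01-001 /data/cn-celeb1/CN-Celeb/data/id00055/interview-01-001.wav' \
  | make_piped_wav_scp "$rate")
conf_line="--sample-frequency=$rate"   # the line conf/mfcc.conf would need

echo "$entry"
echo "$conf_line"
```

Keeping the rate in a single variable makes it harder for the two files to drift apart when the target sample rate changes.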
————————————————
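Copyright notice: this is an original article by CSDN blogger "ai-ai360", licensed under CC 4.0 BY-SA. When reposting, please include the link to the original source and this notice.
Original link: https://blog.csdn.net/q_xiami123/article/details/117019177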