HW2

Task description

Phoneme classification: the training data maps audio to phonemes. The goal is to train a model that learns this mapping and then, given audio, predicts its phonemes.

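Phoneme

A phoneme is the smallest unit of sound in a given human language that can distinguish meaning, and it is the basic concept of phonological analysis. Every language has its own phoneme system.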

Audio processing

The continuous audio signal is processed and cut into a number of frames; each frame corresponds to one phoneme.
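The notes do not say which method is used to cut the signal. A common approach is a sliding window, sketched below in plain Python. The function name split_into_frames and the parameters frame_length and hop_length are hypothetical and chosen only for illustration.

```python
def split_into_frames(signal, frame_length, hop_length):
    """Cut a 1-D sequence into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frames.append(signal[start:start + frame_length])
    return frames


# Toy "signal" of 10 samples, window of 4 samples, step of 2 samples.
signal = list(range(10))
frames = split_into_frames(signal, frame_length=4, hop_length=2)

print(len(frames))   # 4
print(frames[0])     # [0, 1, 2, 3]
print(frames[-1])    # [6, 7, 8, 9]
```

In the real pipeline each frame would then be turned into a feature vector, and each frame would carry one phoneme label.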

Data

The overall data layout is as follows

feat
- test
  
  1078 test samples. Each sample's features are stored in an id.pt file, where id uniquely identifies one sample (one audio clip). torch.load reads each file into a Tensor of shape (n_frames, feature_dim)
  - n_frames
    
    The number of frames produced from one audio sample (see Audio processing above); it differs from sample to sample
  - feature_dim
    
    The dimension of the feature vector computed for each frame; it is 39 for every frame
- train
  
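  The features of all training samples, stored as id.pt files in the same format as test; this set can be further split into a training set and a validation set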
- test_split.txt
  
  The ids of all test samples
- train_labels.txt
  
  The labels of all training samples; the first column of each line indicates
- train_split.txt
  
  The ids of all training samples
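The layout above can be exercised with a small sketch: read the ids from a split file, divide them into training and validation ids, and load one feature tensor with torch.load. It assumes the split files hold one id per line, which the notes do not state. The helper names read_ids, split_train_val and load_feature are hypothetical, and the demo writes a tiny fake feat directory to a temporary folder rather than touching the real data.

```python
import random
import tempfile
from pathlib import Path

import torch


def read_ids(split_file):
    """Return the ids in a split file (assumption: one id per line)."""
    lines = Path(split_file).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def split_train_val(ids, train_ratio=0.8, seed=0):
    """Shuffle the ids reproducibly and split them by train_ratio."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_ratio)
    return ids[:n_train], ids[n_train:]


def load_feature(feat_dir, split, sample_id):
    """Load the (n_frames, feature_dim) tensor stored in feat_dir/split/<id>.pt."""
    return torch.load(Path(feat_dir) / split / f"{sample_id}.pt")


# Demo on fake data that mimics the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    feat_dir = Path(tmp) / "feat"
    (feat_dir / "train").mkdir(parents=True)
    demo_ids = [f"demo{i}" for i in range(10)]
    for k, sample_id in enumerate(demo_ids):
        # Each fake sample has a different number of frames, 39 features each.
        torch.save(torch.randn(5 + k, 39), feat_dir / "train" / f"{sample_id}.pt")
    (feat_dir / "train_split.txt").write_text("\n".join(demo_ids) + "\n")

    ids = read_ids(feat_dir / "train_split.txt")
    train_ids, val_ids = split_train_val(ids, train_ratio=0.8)
    first_shape = tuple(load_feature(feat_dir, "train", train_ids[0]).shape)

print(len(train_ids), len(val_ids))  # 8 2
print(first_shape[1])                # 39
```

Splitting by sample id rather than by frame keeps all frames of one audio clip on the same side of the split.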

Code details

torch.permute()

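Rearranges the dimensions of a Tensor. The arguments 0, 1, 2, ... are the indices of the original dimensions, listed in the order they should appear in the result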

import torch
x = torch.randn((2, 3, 4))
print(x.size())
x = x.permute(2, 0, 1)
print(x.size())

# output
# torch.Size([2, 3, 4])
# torch.Size([4, 2, 3])

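Python division

Python has two division operators

/ is true division: in Python 3 it keeps the fractional part, and returns a float even when both operands are int
// is floor division: the result is rounded down toward negative infinity, so for positive operands the fractional part is simply dropped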
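A short check of both operators under Python 3. The last two lines show a typical use: finding the middle index of an odd-sized window. The variable names are only illustrative.

```python
print(7 / 2)     # 3.5
print(6 / 2)     # 3.0  (still a float)
print(7 // 2)    # 3
print(-7 // 2)   # -4   (rounded down, not truncated toward zero)
print(7.0 // 2)  # 3.0  (float operand gives a float result)

concat_nframes = 17
mid = concat_nframes // 2
print(mid)       # 8
```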

Tuning

sample baseline

Just run the sample code as is

self.block = nn.Sequential(
    nn.Linear(input_dim, output_dim),
    nn.ReLU(),
)

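concat_nframes = 1              # the number of frames concatenated into one input
train_ratio = 0.8               # the ratio of data used for training, the rest is for validation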

# training parameters
batch_size = 512                # batch size
num_epoch = 5                   # the number of training epoch
learning_rate = 0.0001          # learning rate

# model parameters
hidden_layers = 1               # the number of hidden layers
hidden_dim = 256                # the hidden dim
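The notes do not spell out what concat_nframes does. The sketch below assumes it means joining each frame with its neighbours on both sides into one input vector, so the input dimension becomes concat_nframes * 39. The function concat_neighbors is a hypothetical illustration, not the sample code's implementation, and repeating the first and last frame at the edges is one possible padding choice.

```python
import torch


def concat_neighbors(x, concat_nframes):
    """Map (n_frames, feature_dim) to (n_frames, concat_nframes * feature_dim).

    Each output row is the frame together with its neighbours, oldest first.
    Positions past either end reuse the first or last frame (an assumption).
    """
    assert concat_nframes % 2 == 1, "concat_nframes must be odd"
    n_frames = x.size(0)
    half = concat_nframes // 2
    columns = []
    for offset in range(-half, half + 1):
        # Clamp indices so that edge frames are repeated instead of wrapping.
        idx = (torch.arange(n_frames) + offset).clamp(0, n_frames - 1)
        columns.append(x[idx])
    return torch.cat(columns, dim=1)


x = torch.arange(5 * 39, dtype=torch.float32).view(5, 39)
out = concat_neighbors(x, 3)
print(out.shape)  # torch.Size([5, 117])
```

With concat_nframes = 1 the function returns the input unchanged, which matches the sample setting above.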

Medium Baseline

Increase concat_nframes, make the model wider and deeper, and train for more epochs. Increasing batch_size also improved the model's performance in my runs

concat_nframes = 17
train_ratio = 0.9
batch_size = 2048
num_epoch = 20
learning_rate = 0.001
hidden_layers = 5
hidden_dim = 1700
dropout = 0.35
# plus batch normalization (BN)

Score: 0.756

Private score: 0.75667
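For reference, a minimal sketch of how BN and dropout might be added to the block from the sample baseline. The notes do not give the layer order or the number of output classes, so Linear, BatchNorm1d, ReLU, Dropout and num_classes = 41 are assumptions, as are the class names. The demo uses a small hidden_dim so it runs quickly; the run above used hidden_dim = 1700.

```python
import torch
from torch import nn


class BasicBlock(nn.Module):
    """Linear layer followed by BN, ReLU and dropout (assumed order)."""

    def __init__(self, input_dim, output_dim, dropout=0.35):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.BatchNorm1d(output_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.block(x)


class Classifier(nn.Module):
    """Stack of hidden_layers blocks plus a final linear output layer."""

    def __init__(self, input_dim, output_dim, hidden_layers=5,
                 hidden_dim=1700, dropout=0.35):
        super().__init__()
        layers = [BasicBlock(input_dim, hidden_dim, dropout)]
        layers += [BasicBlock(hidden_dim, hidden_dim, dropout)
                   for _ in range(hidden_layers - 1)]
        layers.append(nn.Linear(hidden_dim, output_dim))
        self.fc = nn.Sequential(*layers)

    def forward(self, x):
        return self.fc(x)


concat_nframes = 17
num_classes = 41  # assumption: not stated in these notes
model = Classifier(input_dim=39 * concat_nframes, output_dim=num_classes,
                   hidden_layers=5, hidden_dim=64)
model.eval()  # use running BN statistics and disable dropout for the demo
with torch.no_grad():
    logits = model(torch.randn(8, 39 * concat_nframes))
print(logits.shape)  # torch.Size([8, 41])
```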

Reaching the boss baseline requires an RNN; I will come back and add that later
