Proj. CAR Paper Reading: CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code


Abstract

This paper explores how the choice of subtokenization affects LLM pretraining on source code.
Subtokenization: splitting long tokens into smaller subtokens so that every subtoken remains relatively frequent.
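
A minimal, self-contained sketch of BPE-style subtokenization (my own Python illustration, not the paper's implementation): learn merge rules from a toy corpus of identifiers, then split an unseen identifier into subtokens.

```python
# Minimal BPE-style subtokenization sketch (illustrative only):
# repeatedly merge the most frequent adjacent symbol pair so that rare long
# tokens decompose into frequent subtokens.
from collections import Counter

def learn_merges(corpus, num_merges):
    """Learn BPE merge rules from a list of tokens (e.g. identifiers)."""
    # Start with each token split into characters, keeping its frequency.
    vocab = Counter(tuple(token) for token in corpus)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        # Apply the chosen merge to every entry in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def subtokenize(token, merges):
    """Split one token into subtokens using the learned merge rules."""
    symbols = list(token)
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

if __name__ == "__main__":
    corpus = ["get_value", "set_value", "get_name", "value_error"]
    merges = learn_merges(corpus, num_merges=20)
    print(subtokenize("get_value_error", merges))  # prints the subtoken split of an unseen identifier
```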

Results:

  1. Reduces the average tokenized sequence length by 17-40% with no significant drop in downstream task performance (a sketch of this length metric follows the list)
  2. A carefully chosen subtokenization can even improve downstream performance by 0.5-2%, at the cost of a possible increase in sequence length
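
A hypothetical sketch of how the length reduction above could be measured; the names `baseline_tokenize` and `new_tokenize` are placeholders, not from the paper:

```python
# Compare the mean number of subtokens per example under two tokenizers.
def average_length(examples, tokenize):
    """Mean subtoken count over a corpus for a given tokenize(str) -> list of subtokens."""
    return sum(len(tokenize(ex)) for ex in examples) / len(examples)

# Relative length reduction of a new tokenizer vs. a baseline (placeholder names):
# reduction = 1 - average_length(corpus, new_tokenize) / average_length(corpus, baseline_tokenize)
```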

Base Model: PLBART