Proj. CAR Paper Reading: CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code


Abstract

This paper explores how the choice of subtokenization affects LLM pretraining on source code.
Subtokenization: splitting long tokens into smaller subtokens so that every subtoken remains relatively frequent.
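
A minimal, self-contained sketch of BPE-style subtokenization (my own Python illustration, not the paper's implementation): learn merge rules from a toy corpus of identifiers, then split an unseen identifier into subtokens.

```python
# Minimal BPE-style subtokenization sketch (illustrative only):
# repeatedly merge the most frequent adjacent symbol pair so that rare long
# tokens decompose into frequent subtokens.
from collections import Counter

def learn_merges(corpus, num_merges):
    """Learn BPE merge rules from a list of tokens (e.g. identifiers)."""
    # Start with each token split into characters, keeping its frequency.
    vocab = Counter(tuple(token) for token in corpus)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        # Apply the chosen merge to every entry in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def subtokenize(token, merges):
    """Split one token into subtokens using the learned merge rules."""
    symbols = list(token)
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

if __name__ == "__main__":
    corpus = ["get_value", "set_value", "get_name", "value_error"]
    merges = learn_merges(corpus, num_merges=20)
    print(subtokenize("get_value_error", merges))  # prints the subtoken split of an unseen identifier
```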

Results:

  1. Reduces the average tokenized sequence length by 17-40% with no significant drop in downstream task performance (a sketch of this length metric follows the list)
  2. A carefully chosen subtokenization can even improve downstream performance by 0.5-2%, at the cost of a possible increase in sequence length
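
A hypothetical sketch of how the length reduction above could be measured; the names `baseline_tokenize` and `new_tokenize` are placeholders, not from the paper:

```python
# Compare the mean number of subtokens per example under two tokenizers.
def average_length(examples, tokenize):
    """Mean subtoken count over a corpus for a given tokenize(str) -> list of subtokens."""
    return sum(len(tokenize(ex)) for ex in examples) / len(examples)

# Relative length reduction of a new tokenizer vs. a baseline (placeholder names):
# reduction = 1 - average_length(corpus, new_tokenize) / average_length(corpus, baseline_tokenize)
```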

Base Model: PLBART