Publication | Open Access
Big code != big vocabulary
Citations: 140
References: 73
Year: 2020
Unknown Venue
Statistical Language, Engineering, Api Migration, Software Engineering, Multilingual Pretraining, Semantics, Large Language Model, Corpus Linguistics, Text Mining, Natural Language Processing, Large Language Models, Big Code, Syntax, Data Science, Computational Linguistics, Language Engineering, Language Studies, Language Models, Machine Translation, Variable-length Code, Source Code, Code Generation, Code Representation, Linguistics
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.