Publication | Open Access
Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty
Citations: 229
References: 24
Year: 2009
Venue: Unknown
Topics: Structured Prediction, Engineering, Machine Learning, Part-of-speech Tagging, Language Processing, Text Mining, Natural Language Processing, Data Science, Computational Linguistics, Entity Recognition, Language Studies, Regularization (Mathematics), Supervised Learning, Machine Translation, Computational Learning Theory, NLP Task, Large Scale Optimization, Computer Science, Statistical Learning Theory, Deep Learning, Shallow Parsing, Cumulative Penalty, L1-regularized Log-linear Models, Stochastic Gradient Descent, Approximate Gradients, Chunking, POS Tagging
Stochastic gradient descent offers fast online updates but struggles with L1 regularization in high‑dimensional NLP models due to gradient fluctuations and large feature spaces. The authors propose a simple cumulative‑penalty method to enable efficient L1 regularization during SGD training. They evaluate this approach on text chunking, named‑entity recognition, and part‑of‑speech tagging, comparing it to standard training procedures. Results show the method yields compact, accurate models significantly faster than a state‑of‑the‑art quasi‑Newton baseline for L1‑regularized log‑linear models.
Stochastic gradient descent (SGD) uses approximate gradients estimated from subsets of the training data and updates the parameters in an online fashion. This learning framework is attractive because it often requires much less training time in practice than batch training algorithms. However, L1 regularization, which is becoming popular in natural language processing because of its ability to produce compact models, cannot be efficiently applied in SGD training, due to the large dimensions of feature vectors and the fluctuations of approximate gradients. We present a simple method to solve these problems by penalizing the weights according to cumulative values for L1 penalty. We evaluate the effectiveness of our method in three applications: text chunking, named entity recognition, and part-of-speech tagging. Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularized log-linear models.
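The abstract's central idea, applying the L1 penalty in a cumulative, clipped form so that noisy per-example gradients do not repeatedly drag weights back and forth across zero, can be illustrated with a short sketch. The snippet below is a minimal illustration on a squared-loss linear model, not the paper's log-linear/CRF training code; the function name, learning-rate schedule, loss, and toy data are assumptions made only to keep the example self-contained.

```python
import numpy as np

def sgd_l1_cumulative(X, y, l1_strength=1.0, eta0=0.1, epochs=5, seed=0):
    """SGD for a squared-loss linear model with a cumulative, clipped L1 penalty (sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)   # model weights
    q = np.zeros(d)   # L1 penalty actually applied to each weight so far
    u = 0.0           # L1 penalty every weight *should* have received so far
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            eta = eta0 / (1.0 + t / n)     # decaying learning rate (illustrative schedule)
            u += eta * l1_strength / n     # accumulate this step's share of the L1 penalty

            # ordinary SGD step on the unregularized loss for one example
            grad = (X[i] @ w - y[i]) * X[i]
            w -= eta * grad

            # clipped cumulative-penalty step: pull each positive weight down by at most
            # (u + q_i) and each negative weight up by at most (u - q_i), clipping at zero
            # so the penalty itself never flips a weight's sign
            w_half = w.copy()
            pos, neg = w_half > 0, w_half < 0
            w[pos] = np.maximum(0.0, w_half[pos] - (u + q[pos]))
            w[neg] = np.minimum(0.0, w_half[neg] + (u - q[neg]))
            q += w - w_half                # record the penalty actually absorbed
            t += 1
    return w

# toy usage: a sparse 20-dimensional regression problem (illustrative only)
X = np.random.default_rng(1).normal(size=(200, 20))
true_w = np.zeros(20); true_w[:3] = [2.0, -1.5, 0.5]
y = X @ true_w
w = sgd_l1_cumulative(X, y, l1_strength=5.0)
print("nonzero weights:", np.count_nonzero(np.abs(w) > 1e-8))
```

Here `u` tracks the total L1 penalty each weight should have received over all updates, while `q` tracks the penalty it has actually absorbed; applying only their difference, and clipping at zero, is what lets weights reach exactly zero despite the fluctuations of the stochastic gradients.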