Handling Imbalanced Data Sets in Insurance Risk Modeling

Abstract

As owners of cars, homes, and other property, consumers buy property and casualty insurance to protect themselves against unexpected events such as accidents, fire, and theft. Such events occur very rarely at the level of individual policyholders, so data sets constructed for insurance risk modeling are highly imbalanced: in any given time period, most policyholders file no claims, a small percentage file one claim, and an even smaller percentage file two or more. This paper presents some of the tree-based learning techniques we have developed to model insurance risks. Two aspects of our approach distinguish it from other tree-based methods. First, it incorporates a split-selection criterion tailored to the specific statistical characteristics of insurance data. Second, it uses constraints on the statistical accuracy of model parameter estimates to guide the construction of splits, in order to overcome the selection biases that arise from the imbalance in the data.
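
To make the imbalance and the idea of an accuracy constraint concrete, the following is a minimal sketch, not the method developed in the paper. It assumes claim counts per policy follow a Poisson distribution with a small hypothetical rate, and it uses a simple illustrative rule: a candidate split is admissible only if each branch contains enough claims for the relative standard error of its estimated claim rate, roughly 1/sqrt(n) for n observed claims under the Poisson assumption, to stay below a chosen bound. The function names, the rate, and the bound are all assumptions introduced for illustration.

```python
import math


def claim_count_probs(rate):
    """Return (P(0), P(1), P(2 or more)) claims for one policy in one period,
    under an assumed Poisson model with the given expected claim rate."""
    p0 = math.exp(-rate)
    p1 = rate * p0
    return p0, p1, 1.0 - p0 - p1


def min_claims_for_rel_error(max_rel_error):
    """Smallest claim count n for which 1/sqrt(n) <= max_rel_error.

    Under the Poisson assumption, the relative standard error of an
    estimated claim rate is approximately 1/sqrt(n) for n observed claims.
    """
    return math.ceil(1.0 / max_rel_error ** 2)


def split_is_admissible(left_claims, right_claims, max_rel_error=0.1):
    """Hypothetical constraint: accept a candidate split only if both
    branches hold enough claims to estimate their rates accurately."""
    n_min = min_claims_for_rel_error(max_rel_error)
    return left_claims >= n_min and right_claims >= n_min


if __name__ == "__main__":
    # Hypothetical rate of 0.05 claims per policy per period.
    p0, p1, p2_plus = claim_count_probs(0.05)
    print(f"P(0)={p0:.4f}  P(1)={p1:.4f}  P(2+)={p2_plus:.4f}")
    print(split_is_admissible(left_claims=150, right_claims=40))
```

The constraint is stated in terms of claims rather than policies because, with rare events, a branch can contain many thousands of policies and still too few claims to support a reliable rate estimate.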
