Publication | Closed Access
Estimating the Support of a High-Dimensional Distribution
Citations: 5.8K | References: 44 | Year: 2001
Mathematical Programming, Engineering, Machine Learning, Data Science, Data Mining, Pattern Recognition, Kernel Expansion, Estimation Theory, Approximation Theory, Statistics, Supervised Learning, Density Estimation, Computational Learning Theory, Weight Vector, Knowledge Discovery, Computer Science, Statistical Learning Theory, Functional Data Analysis, High-dimensional Method, Reproducing Kernel Method, High-dimensional Distribution, Statistical Inference, Support Vector Algorithm, Kernel Method
The problem is to estimate a simple subset S of input space from unlabeled data such that the probability of a test point falling outside S equals a specified value between 0 and 1. The authors propose estimating a function f that is positive on S and negative on its complement, and they analyze the statistical performance of the resulting algorithm, which extends the support vector algorithm to unlabeled data. The function f is a kernel expansion over a potentially small subset of the training data, regularized by controlling the length of the weight vector in feature space; the coefficients are obtained by solving a quadratic programming problem via sequential optimization over pairs of input patterns.
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a "simple" subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1. We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. The expansion coefficients are found by solving a quadratic programming problem, which we do by carrying out sequential optimization over pairs of input patterns. We also provide a theoretical analysis of the statistical performance of our algorithm. The algorithm is a natural extension of the support vector algorithm to the case of unlabeled data.
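The pairwise optimization described in the abstract can be illustrated with a minimal sketch. It assumes the commonly stated ν-parameterized dual (minimize ½ Σ αᵢαⱼ k(xᵢ, xⱼ) subject to 0 ≤ αᵢ ≤ 1/(νl) and Σ αᵢ = 1), which is not spelled out in the abstract itself. The Gaussian kernel, the function names, the parameters `nu`, `gamma`, and `sweeps`, the uniform starting point, and the exhaustive sweep over all pairs are illustrative assumptions, not the authors' implementation or pair-selection rule.

```python
import math


def rbf(x, y, gamma):
    """Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2) (assumed kernel)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def fit_one_class(points, nu=0.2, gamma=0.5, sweeps=50):
    """Approximately solve the assumed dual by optimizing two coefficients at a time."""
    n = len(points)
    upper = 1.0 / (nu * n)  # box constraint: 0 <= alpha_i <= 1/(nu * n)
    K = [[rbf(p, q, gamma) for q in points] for p in points]
    alpha = [1.0 / n] * n   # feasible start (assumption): sums to 1, inside the box
    # out[i] = sum_k alpha_k * K[i][k], updated incrementally after each step
    out = [sum(a * K[i][k] for k, a in enumerate(alpha)) for i in range(n)]

    for _ in range(sweeps):
        for i in range(n):
            for j in range(i + 1, n):
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]  # curvature along the pair
                if eta <= 1e-12:
                    continue
                total = alpha[i] + alpha[j]  # kept fixed so sum(alpha) stays 1
                lo, hi = max(0.0, total - upper), min(upper, total)
                # Unconstrained minimizer along the pair, clipped to the box
                new_j = min(hi, max(lo, alpha[j] + (out[i] - out[j]) / eta))
                delta = new_j - alpha[j]
                if delta == 0.0:
                    continue
                alpha[j], alpha[i] = new_j, total - new_j
                for k in range(n):
                    out[k] += delta * (K[k][j] - K[k][i])

    # Offset rho: average output over coefficients strictly inside the box;
    # if there are none, fall back to the largest output among bounded ones.
    tol = 1e-8
    margin = [out[i] for i in range(n) if tol < alpha[i] < upper - tol]
    if margin:
        rho = sum(margin) / len(margin)
    else:
        rho = max(out[i] for i in range(n) if alpha[i] >= upper - tol)
    return alpha, rho


def decision(x, points, alpha, rho, gamma=0.5):
    """f(x) = sum_i alpha_i k(x_i, x) - rho; positive values mean x is estimated to lie in S."""
    return sum(a * rbf(p, x, gamma) for p, a in zip(points, alpha) if a > 0.0) - rho
```

Calling `fit_one_class` on a list of coordinate tuples returns the expansion coefficients and the offset; `decision` then evaluates f at a new point. Only training points with nonzero coefficients contribute to f, which corresponds to the "potentially small subset of the training data" mentioned above. In this sketch, `nu` plays the role of the a priori specified outlier fraction.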