Soft competitive adaptation: neural network learning algorithms based on fitting statistical mixtures

Abstract

In this thesis, we consider learning algorithms for neural networks which are based on fitting a mixture probability density to a set of data. We begin with an unsupervised algorithm which is an alternative to the classical winner-take-all competitive algorithms. Rather than updating only the parameters of the winner on each case, the parameters of all competitors are updated in proportion to their relative responsibility for the case. Use of such a competitive algorithm is shown to give better performance than the more traditional algorithms, with little additional cost. We then consider a supervised modular architecture in which a number of simple networks compete to solve distinct pieces of a large task. A soft competitive mechanism is used to determine how much an expert learns on a case, based on how well the expert performs relative to the other expert networks. At the same time, a separate gating network learns to weight the output of each expert according to a prediction of its relative performance based on the input to the system. Experiments on a number of tasks illustrate that this architecture is capable of uncovering interesting task decompositions and of generalizing better than a single network with small training sets. Finally, we consider learning algorithms in which we assume that the actual output of the network should fall into one of a small number of classes or clusters. The objective of learning is to make the variance of these classes as small as possible. In the classical decision-directed algorithm, we decide that an output belongs to the class it is closest to and minimize the squared distance between the output and the center (mean) of this closest class. In the version of this algorithm, we minimize the squared distance between the actual output and a weighted average of the means of all of the classes. The weighting factors are the relative probability that the output belongs to each class. This idea may also be used to model the weights of a network, to produce networks which generalize better from small training sets.