MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning

Abstract

The requirement for accurate labels in supervised learning often forces us to train our networks on synthetic data. However, synthetic experiments do not reflect the realities of field experiments, and the trained neural network (NN) models end up performing poorly at the inference stage. We therefore describe a novel approach that enhances NN model training with real-data features (domain adaptation). This is accomplished by applying two operations to the input data of the NN model, whether they come from the synthetic or the real data subset class: 1) the crosscorrelation of the input data section (i.e., shot gather or seismic image) with a fixed reference trace from that section; 2) the convolution of the resulting data with a randomly chosen autocorrelated section from the other subset class. In the training stage, the input data are from the synthetic subset class and the autocorrelated sections are from the real subset class; in the inference/application stage, it is the opposite. An example application to passive seismic data for microseismic event source location is used to demonstrate the power of this approach in improving the applicability of our trained models to real data.
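The two operations described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: sections are stored as arrays of shape (n_traces, n_samples), the autocorrelation and convolution are applied trace by trace along the time axis, both subset classes share the same number of traces, and the function names (`crosscorrelate_with_reference`, `autocorrelate`, `mlreal_transform`) and the `ref_index` parameter are hypothetical.

```python
import numpy as np


def crosscorrelate_with_reference(section, ref_index=0):
    """Crosscorrelate every trace of a section with one fixed reference trace.

    section: array of shape (n_traces, n_samples) (assumed layout).
    Returns an array of shape (n_traces, 2 * n_samples - 1); zero lag sits at
    column n_samples - 1.
    """
    ref = section[ref_index]
    return np.array([np.correlate(trace, ref, mode="full") for trace in section])


def autocorrelate(section):
    """Autocorrelate each trace of a section along the time axis (assumption:
    the 'autocorrelated section' is formed trace by trace)."""
    return np.array([np.correlate(trace, trace, mode="full") for trace in section])


def mlreal_transform(section, other_sections, rng, ref_index=0):
    """Apply both operations to one input section.

    1) Crosscorrelate the section with its own reference trace.
    2) Convolve the result, trace by trace, with the autocorrelation of a
       randomly chosen section from the other subset class.
    """
    xc = crosscorrelate_with_reference(section, ref_index)
    other = other_sections[rng.integers(len(other_sections))]
    if other.shape[0] != section.shape[0]:
        raise ValueError("this sketch assumes equal trace counts in both classes")
    ac = autocorrelate(other)
    return np.array([np.convolve(x, a, mode="full") for x, a in zip(xc, ac)])


# Placeholder random arrays stand in for seismic sections (not real data).
rng = np.random.default_rng(0)
synthetic = rng.standard_normal((4, 16))
real_pool = [rng.standard_normal((4, 16)) for _ in range(3)]

# Training stage: synthetic input, real autocorrelated section.
train_input = mlreal_transform(synthetic, real_pool, rng)

# Inference stage: real input, synthetic autocorrelated section.
infer_input = mlreal_transform(real_pool[0], [synthetic], rng)
```

Under these assumptions, the training and inference inputs each carry the spectral imprint of both subset classes, which is the intent of swapping the roles of the two classes between the stages.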