Abstract
Model generalization remains a key challenge in the analysis of large volumes of heterogeneous satellite image data. In supervised learning, one major factor limiting the development of generalizable models is the lack of high-quality training datasets. A model's capacity to perform well on new data is often inhibited by imbalance and bias in the training data. This problem is especially pronounced when convolutional neural networks are used to classify urban land use in satellite images. Notable dataset imbalance issues in this application include land-use type imbalance and image scene imbalance. To begin understanding these imbalance problems in more detail, we develop and test a number of sampling methods for generating training image datasets from subjective training polygons for urban land-use classification. We investigate sampling at different point densities as a means of reducing content repetition, and therefore content imbalance and bias, in the training image dataset.