Investigating the performance of satellite-based models in estimating the surface PM2.5 over China

Accurate estimation of surface PM2.5 concentration is critical for assessing PM2.5 exposure and the associated health impacts. Because ground monitoring stations have limited spatial coverage, many studies use satellite products to estimate surface PM2.5 concentration, using machine learning (ML) to construct a relationship between satellite-retrieved aerosol optical depth (AOD) and ground-measured PM2.5 concentration. However, uncertainties in ML-based models may lead to considerable biases in PM2.5 estimation, which need to be carefully examined. Here we evaluate the accuracy of PM2.5 concentrations estimated by two popular ML models (i.e., Random Forest and the BP Neural Network), which were trained and tested on hourly data for the whole of 2017 over China: satellite-retrieved AOD from HIMAWARI, ground-measured PM2.5 from the China National Environmental Monitoring Center, ERA5 meteorological conditions, and other auxiliary variables. We propose a new validation method that takes the spatial pattern of the data into account. The results suggest that traditional validation methods may overestimate model performance in estimating PM2.5 in areas with sparse in-situ measurements. Moreover, the spatial distribution of the training data strongly affects the evaluation of model performance and should be carefully considered. Future studies should include at least a site-specific validation rather than relying only on random-sampling validation.
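The difference between random-sampling validation and site-specific validation can be illustrated with a minimal sketch. This is not the validation method proposed in the study, whose details are not given here. It only shows, on hypothetical toy data, why the two splitting strategies differ. The function names (`random_kfold`, `site_kfold`, `shared_sites`) and the station and record counts are assumptions made for illustration.

```python
import random

def random_kfold(n_samples, k, seed=0):
    """Sample-based K-fold: shuffle individual records, ignoring their station."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def site_kfold(site_ids, k, seed=0):
    """Site-based K-fold: hold out whole stations, so test stations are unseen in training."""
    sites = sorted(set(site_ids))
    random.Random(seed).shuffle(sites)
    site_fold = {s: i % k for i, s in enumerate(sites)}
    for i in range(k):
        train = [j for j, s in enumerate(site_ids) if site_fold[s] != i]
        test = [j for j, s in enumerate(site_ids) if site_fold[s] == i]
        yield train, test

def shared_sites(site_ids, train, test):
    """Stations that contribute records to both the training and the test set."""
    return {site_ids[j] for j in train} & {site_ids[j] for j in test}

# Hypothetical toy data: 20 stations with 50 hourly records each.
site_ids = [s for s in range(20) for _ in range(50)]

random_overlap = [len(shared_sites(site_ids, tr, te))
                  for tr, te in random_kfold(len(site_ids), 5)]
site_overlap = [len(shared_sites(site_ids, tr, te))
                for tr, te in site_kfold(site_ids, 5)]

print("stations shared between train and test, random split:", random_overlap)
print("stations shared between train and test, site split:  ", site_overlap)
```

With a random split, records from the same station end up on both sides, so the test score mostly reflects performance at locations the model has already seen. With a site-based split, every test station is absent from training, which is closer to the situation in areas without in-situ measurements.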