EDBench: Large-Scale Electron Density Data for Molecular Modeling

Hongxin Xiang1*, Ke Li2*, Mingquan Liu1, Zhixiang Cheng1, Bin Yao3, Wenjie Du4, Jun Xia5, Li Zeng1, Xin Jin6†, Xiangxiang Zeng1†
1College of Computer Science and Electronic Engineering, Hunan University, 2College of Life Sciences, East China Normal University, 3College of Materials Science and Engineering, Hunan University, 4University of Science and Technology of China, 5Westlake University, 6Eastern Institute of Technology

arXiv 2025


*Indicates Equal Contribution
Correspondence: jinxin@eitech.edu.cn, xzeng@hnu.edu.cn
Introduction of EDBench

Figure 1: (a) Advancing MLFFs from atomic-level interactions, based on discrete atomistic representations, to electronic-level modeling using continuous ED, enabling richer and more physically grounded supervision; (b) Overview of the proposed EDBench dataset; (c) DFT method selection guided by Jacob's ladder to balance accuracy and computational cost.

Abstract

Existing molecular machine learning force fields (MLFFs) generally focus on learning from atoms, molecules, and simple quantum chemical properties (such as energy and force), but overlook the importance of the electron density (ED) ρ(r) for accurately understanding molecular force fields (MFFs). ED describes the probability of finding electrons at specific locations around atoms or molecules and, according to the Hohenberg-Kohn theorem, uniquely determines all ground-state properties (such as energy and molecular structure) of interacting many-particle systems. However, calculating ED relies on time-consuming first-principles density functional theory (DFT), which has led to a lack of large-scale ED data and limits its application in MLFFs.
In this paper, we introduce EDBench, a large-scale, high-quality dataset of ED designed to advance learning-based research at the electronic scale. Built upon PCQM4Mv2, EDBench provides accurate ED data covering 3.3 million molecules. To comprehensively evaluate the ability of models to understand and utilize electronic information, we design a suite of ED-centric benchmark tasks spanning prediction, retrieval, and generation. Our evaluation of several state-of-the-art methods demonstrates that learning from EDBench is not only feasible but also achieves high accuracy. Moreover, we show that learning-based methods can efficiently calculate ED with comparable precision while significantly reducing the computational cost relative to traditional DFT calculations. All data and benchmarks from EDBench will be freely available, laying a robust foundation for ED-driven drug discovery and materials science.

EDBench Dataset

We performed large-scale density functional theory (DFT) calculations on 3,359,472 molecules from the PCQM4Mv2 dataset using Psi4 1.7. We employed the widely used B3LYP hybrid functional due to its strong empirical performance across diverse molecular systems. The choice of reference wavefunction was determined by the spin multiplicity, which was computed from the number of unpaired electrons according to Hund’s rule. Specifically, we used a restricted Hartree-Fock (RHF) reference for closed-shell systems (multiplicity = 1), and an unrestricted Hartree-Fock (UHF) reference for open-shell systems (multiplicity > 1) to allow for independent optimization of α and β spin orbitals. Basis sets were assigned based on elemental composition. We used 6-31G** for molecules without sulfur, while for sulfur-containing molecules we used 6-31+G** to incorporate diffuse functions that better capture the more delocalized and polarizable electron distributions of heavier atoms like sulfur. After achieving self-consistent field (SCF) convergence, we generated cube files containing electron density (ED) data from Equation 1 with a grid spacing of 0.4 Bohr, a padding of 4.0 Bohr, and a density fraction threshold of 0.85 to define the isosurface region. All computations were carried out on a high-performance server equipped with 8 Intel(R) Xeon(R) Platinum 8270 CPUs, each with 26 physical cores and 2 threads per core, yielding a total of 416 logical cores. The total computational cost exceeded 205,000 core-hours, equivalent to approximately 23.4 years of single-core compute time.
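The selection rules in the paragraph above can be summarized in a few lines of Python. This is only an illustrative sketch, not the authors' pipeline: the names `DftSettings` and `choose_dft_settings` are hypothetical, the multiplicity is assumed to be 2S + 1 with S equal to half the number of unpaired electrons, and no Psi4 call is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DftSettings:
    """Hypothetical container for the per-molecule settings described in the text."""
    functional: str
    basis: str
    reference: str
    multiplicity: int


def choose_dft_settings(elements, n_unpaired_electrons=0):
    """Pick reference and basis set from element symbols and unpaired-electron count."""
    # Assumption: multiplicity = 2S + 1, with S = n_unpaired_electrons / 2.
    multiplicity = n_unpaired_electrons + 1
    # Closed-shell (multiplicity 1) uses a restricted reference, otherwise unrestricted.
    reference = "rhf" if multiplicity == 1 else "uhf"
    # Sulfur-containing molecules get diffuse functions; exact match avoids Si, Se, etc.
    basis = "6-31+G**" if "S" in set(elements) else "6-31G**"
    return DftSettings("b3lyp", basis, reference, multiplicity)


closed_shell = choose_dft_settings(["C", "H", "O"])
radical_with_sulfur = choose_dft_settings(["C", "H", "S"], n_unpaired_electrons=1)
print(closed_shell)
print(radical_with_sulfur)
```

The reported hardware and cost figures can be checked the same way: 8 × 26 × 2 = 416 logical cores, and 205,000 core-hours divided by 8,760 hours per year is roughly 23.4 years.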



Statistical information

Figure 2: (a) Distribution of molecular lengths (heavy atoms only) in EDBench. (b) Distribution of heavy atom counts. (c) Distribution of ED vector lengths. (d) Distribution of per-molecule mean ED values.

Example of ED visualization

Figure 3: Example of ED visualization of a molecule with different thresholds ωϱ. "Point" denotes the number of ED points.
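To illustrate how the number of ED points can depend on a threshold, the sketch below reads the volumetric values from a Gaussian-style cube file and counts the grid points at or above several cutoffs. It assumes that the threshold acts as a lower bound on ρ(r) and that the cube has the standard layout (two comment lines, one atom-count line, three grid-axis lines, one line per atom, then the values); the function names and the toy data are hypothetical and not taken from EDBench.

```python
def read_cube_values(text):
    """Return (grid shape, flat list of density values) from cube-file text."""
    lines = text.splitlines()
    n_atoms = abs(int(lines[2].split()[0]))
    # Lines 4-6 of the file hold the number of grid points along each axis.
    shape = tuple(abs(int(lines[i].split()[0])) for i in (3, 4, 5))
    data_start = 6 + n_atoms  # header (6 lines) plus one line per atom
    values = [float(tok) for line in lines[data_start:] for tok in line.split()]
    if len(values) != shape[0] * shape[1] * shape[2]:
        raise ValueError("number of values does not match the grid shape")
    return shape, values


def count_points_above(values, thresholds):
    """Map each threshold to the number of grid points with value >= threshold."""
    return {t: sum(1 for v in values if v >= t) for t in thresholds}


# Toy 2 x 2 x 2 cube with a single atom (illustrative numbers only).
toy_cube = "\n".join([
    "toy cube",
    "electron density",
    "1 0.0 0.0 0.0",
    "2 0.4 0.0 0.0",
    "2 0.0 0.4 0.0",
    "2 0.0 0.0 0.4",
    "1 0.0 0.2 0.2 0.2",
    "0.50 0.20 0.05 0.01",
    "0.30 0.10 0.02 0.001",
])
shape, values = read_cube_values(toy_cube)
print(shape, count_points_above(values, [0.0, 0.05, 0.2]))
```

Raising the threshold keeps only the high-density region near the nuclei, so the point count shrinks as the cutoff grows.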

Table 1: Comparison of various databases in quantum chemistry. Compared with other datasets, EDBench provides the largest known ED benchmark dataset, characterized by its large scale and comprehensive content.

Tasks based on both molecular structures (MS) and ED

To enable rigorous benchmarking and model development, we further design an ED-centric benchmark suite covering three task categories: (i) quantum property prediction, comprising four core tasks: energy components prediction (ED5-EC), orbital energy estimation (ED5-OE), multipole moment regression (ED5-MM), and open-/closed-shell classification (ED5-OCS), which evaluate how well ED alone can serve as a sufficient descriptor for inferring fundamental quantum properties; (ii) cross-modal retrieval between molecular structures and ED (ED5-MER), designed to probe the mutual consistency and representational alignment between structural and density spaces, which is critical for density-based force field construction and virtual screening; and (iii) ED prediction from molecular structures (ED5-EDP), aimed at approximating DFT-level density fields at significantly reduced computational cost, thereby enabling scalable quantum-aware modeling. Finally, we evaluate several state-of-the-art deep learning models on the proposed benchmark, offering the first large-scale assessment of ED understanding in data-driven systems.
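As a rough illustration of how cross-modal retrieval between structures and ED might be scored, the sketch below computes recall@k from paired embeddings using cosine similarity. This is an assumption for illustration only: the metric actually used for ED5-MER is not stated here, and the function names and toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(query_embs, gallery_embs, k):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(query_embs):
        scores = [cosine(query, item) for item in gallery_embs]
        # Gallery indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Toy paired embeddings: structure i corresponds to density i.
structure_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
density_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]
for k in (1, 2, 3):
    print(k, recall_at_k(structure_embs, density_embs, k))
```

Swapping the two arguments gives the opposite retrieval direction (density to structure), which is how a mutual-consistency check between the two spaces could be set up.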


Table 2: Statistics of the 6 designed benchmarks under a scaffold split.
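For readers unfamiliar with scaffold splits, a minimal sketch of the usual greedy procedure follows. It assumes that a scaffold string (for example a Bemis-Murcko scaffold SMILES) has already been computed for every molecule, and it uses an 80/10/10 ratio purely as a placeholder; the ratios and exact procedure used for EDBench are not given here, and `scaffold_split` is a hypothetical name.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Greedy scaffold split: molecules sharing a scaffold stay in one subset."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest scaffold groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy input: ten molecules labelled by placeholder scaffold keys.
toy_scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train_idx, valid_idx, test_idx = scaffold_split(toy_scaffolds)
print(train_idx, valid_idx, test_idx)
```

Because whole scaffold groups are assigned together, the validation and test sets contain scaffolds never seen in training, which makes the evaluation harder than a random split.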

BibTeX

@misc{xiang2025edbenchlargescaleelectrondensity,
      title={EDBench: Large-Scale Electron Density Data for Molecular Modeling}, 
      author={Hongxin Xiang and Ke Li and Mingquan Liu and Zhixiang Cheng and Bin Yao and Wenjie Du and Jun Xia and Li Zeng and Xin Jin and Xiangxiang Zeng},
      year={2025},
      eprint={2505.09262},
      archivePrefix={arXiv},
      primaryClass={physics.chem-ph},
      url={https://arxiv.org/abs/2505.09262}, 
}