NeurIPS 2025 Dataset & Benchmark

EDBench: Large-Scale Electron Density Data for Molecular Modeling

A high-fidelity electron-density foundation for moving molecular learning from atom-level interactions toward continuous, physically grounded electronic-level modeling.

Hongxin Xiang1,5, Ke Li2, Mingquan Liu1, Zhixiang Cheng1, Bin Yao3, Wenjie Du4, Jun Xia5,6, Li Zeng1, Xin Jin7†, Xiangxiang Zeng1†
1College of Computer Science and Electronic Engineering, Hunan University;  2College of Life Sciences, East China Normal University;  3College of Materials Science and Engineering, Hunan University;  4University of Science and Technology of China;  5AIMS Lab, The Hong Kong University of Science and Technology (Guangzhou);  6The Hong Kong University of Science and Technology;  7Eastern Institute of Technology
NeurIPS 2025; AI4Science@NeurIPS 2025
Correspondence: jinxin@eitech.edu.cn, xzeng@hnu.edu.cn
3.36M molecules with ED
205K+ DFT core-hours
6 ED-centric benchmarks
3 task families

Poster. A visual overview of the EDBench dataset, benchmark tasks, and electronic-level molecular modeling motivation.

Core idea

From atoms to electrons.

EDBench reframes molecular learning around electron density ρ(r), a dense and continuous signal that directly reflects the spatial distribution of electrons and provides richer supervision for quantum-aware modeling.

Figure 1. EDBench moves MLFFs from discrete atomistic representations to continuous electron-density modeling. The figure also summarizes the dataset workflow and places the chosen DFT method on Jacob's ladder.

Abstract

A large-scale benchmark for electronic-level molecular understanding.

Existing molecular machine learning force fields (MLFFs) generally focus on atoms, molecules, and simple quantum chemical properties such as energy and force, but overlook the importance of electron density (ED) ρ(r) for accurately understanding molecular force fields. ED describes the probability of finding electrons at specific locations around atoms or molecules, and according to the Hohenberg-Kohn theorem, it uniquely determines all ground-state properties of interacting many-particle systems.
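One basic property behind this description is that ρ(r) integrates over all space to the number of electrons. The sketch below checks this numerically on a toy density. It is an illustration only: the single-Gaussian density, the function names, and the grid extent are assumptions and are not part of EDBench.

```python
import math

def gaussian_density(x, y, z, n_electrons=2.0, alpha=1.0):
    """Toy density: one normalized Gaussian carrying n_electrons (not a real molecule)."""
    norm = n_electrons * (alpha / math.pi) ** 1.5
    return norm * math.exp(-alpha * (x * x + y * y + z * z))

def integrate_density(density, spacing=0.4, half_width=6.0):
    """Riemann sum of a density over a cubic grid; all lengths in Bohr."""
    n = int(round(half_width / spacing))
    axis = [i * spacing for i in range(-n, n + 1)]
    voxel_volume = spacing ** 3
    total = 0.0
    for x in axis:
        for y in axis:
            for z in axis:
                total += density(x, y, z)
    return total * voxel_volume

# The integral should recover the electron count assigned to the toy density.
electron_count = integrate_density(gaussian_density)
```

The same sum applied to a real density grid gives a quick sanity check, although sharp features near nuclei make a coarse grid less accurate than it is for this smooth toy function.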

EDBench introduces a large-scale, high-quality ED dataset built upon PCQM4Mv2, covering about 3.36 million molecules. It also provides an ED-centric benchmark suite spanning prediction, retrieval, and generation. The results show that learning from EDBench is feasible, accurate, and can substantially reduce computational cost compared with traditional DFT calculations, laying a foundation for ED-driven drug discovery and materials science.

Dataset construction

3.36 million molecules, computed at serious scale.

We performed large-scale density functional theory (DFT) calculations on 3,359,472 molecules from the PCQM4Mv2 dataset using Psi4 1.7. The B3LYP hybrid functional was selected for strong empirical performance across diverse molecular systems. Closed-shell systems use an RHF reference, while open-shell systems use a UHF reference to allow independent optimization of alpha and beta spin orbitals.

Basis sets are assigned by elemental composition: 6-31G** for molecules without sulfur, and 6-31+G** for sulfur-containing molecules to better capture delocalized and polarizable electron distributions. After SCF convergence, cube files are generated with a 0.4 Bohr grid spacing, 4.0 Bohr padding, and 0.85 density fraction threshold.
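The selection rules above can be sketched as a small helper. This is a hedged illustration, not the authors' pipeline: the function name and the returned dictionary layout are hypothetical, treating multiplicity 1 as "closed-shell" is an assumption, and the three cube-related keys are assumed to correspond to Psi4 option names for grid spacing, padding, and the density fraction threshold.

```python
def choose_dft_settings(element_symbols, multiplicity):
    """Pick basis set and reference following the rules described in the text.

    element_symbols: list of element symbols, e.g. ["C", "H", "H", "H", "H"].
    multiplicity: spin multiplicity; 1 is treated here as closed-shell (assumption).
    """
    # Membership test on a list matches whole symbols, so "Si" does not count as sulfur.
    has_sulfur = "S" in element_symbols
    return {
        "method": "b3lyp",
        "basis": "6-31+G**" if has_sulfur else "6-31G**",
        "reference": "rhf" if multiplicity == 1 else "uhf",
        # Assumed option names; values are taken from the text (Bohr, Bohr, fraction).
        "cubic_grid_spacing": [0.4, 0.4, 0.4],
        "cubic_grid_overage": [4.0, 4.0, 4.0],
        "cubeprop_isocontour_threshold": 0.85,
    }

methane = choose_dft_settings(["C", "H", "H", "H", "H"], multiplicity=1)
thiyl_radical = choose_dft_settings(["C", "H", "H", "H", "S"], multiplicity=2)
```

Here `methane` gets 6-31G** with an RHF reference, while the sulfur-containing doublet gets 6-31+G** with a UHF reference.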

01 PCQM4Mv2 source molecules: drug-like chemical space for large-scale ED generation.
02 DFT / SCF pipeline: B3LYP + tailored basis sets with RHF/UHF references.
03 CUBE electron density output: dense continuous ED fields plus quantum chemical properties.
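For readers unfamiliar with the output of the last step, here is a minimal sketch of reading a cube file in plain Python. It assumes the common Gaussian cube layout (two comment lines, an atom count with grid origin, three axis lines, one line per atom, then values with the z index varying fastest) and a positive atom count. The function names and the toy file contents are made up for illustration and are not EDBench data.

```python
def parse_cube(text):
    """Parse a cube file given as a string (assumes the standard layout)."""
    lines = text.splitlines()
    # Lines 0 and 1 are free-form comments and are skipped.
    header = lines[2].split()
    n_atoms = int(header[0])
    origin = tuple(float(v) for v in header[1:4])
    shape, axes = [], []
    for line in lines[3:6]:
        parts = line.split()
        shape.append(int(parts[0]))
        axes.append(tuple(float(v) for v in parts[1:4]))
    atoms = []
    for line in lines[6:6 + n_atoms]:
        parts = line.split()
        # Columns: atomic number, charge, x, y, z.
        atoms.append((int(parts[0]), tuple(float(v) for v in parts[2:5])))
    values = [float(v) for line in lines[6 + n_atoms:] for v in line.split()]
    if len(values) != shape[0] * shape[1] * shape[2]:
        raise ValueError("voxel count does not match grid shape")
    return {"origin": origin, "shape": tuple(shape), "axes": axes,
            "atoms": atoms, "values": values}

def value_at(cube, i, j, k):
    """Density at grid index (i, j, k); z is the fastest-varying index."""
    _, ny, nz = cube["shape"]
    return cube["values"][(i * ny + j) * nz + k]

# Toy file: one hydrogen atom on a 2 x 2 x 2 grid with 0.4 Bohr spacing.
EXAMPLE_CUBE = "\n".join([
    "toy cube file",
    "values are made up for illustration",
    "1 0.0 0.0 0.0",
    "2 0.4 0.0 0.0",
    "2 0.0 0.4 0.0",
    "2 0.0 0.0 0.4",
    "1 0.0 0.0 0.0 0.0",
    "0.10 0.20 0.30 0.40",
    "0.50 0.60 0.70 0.80",
])
cube = parse_cube(EXAMPLE_CUBE)
```

Keeping the values as a flat list mirrors the file order; a reshape into a 3D array is a natural next step when an array library is available.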

Figure 2. Distributions of molecular lengths, heavy atom counts, ED vector lengths, and per-molecule mean ED values.

Figure 3. Electron density visualization of a molecule under different thresholds, showing how density filtering changes the point distribution.
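To make the effect of a threshold concrete, the sketch below shows one plausible reading of a density fraction threshold: keep the highest-density grid points that together account for a given fraction of the total density. Whether EDBench applies its threshold exactly this way is an assumption, and both function names are hypothetical.

```python
def density_fraction_cutoff(values, fraction=0.85):
    """Return the density cutoff such that points at or above it hold `fraction` of the total."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted((v for v in values if v > 0.0), reverse=True)
    if not ordered:
        raise ValueError("no positive density values")
    target = fraction * sum(ordered)
    running = 0.0
    for v in ordered:
        running += v
        if running >= target:
            return v
    # Guard against floating-point round-off when fraction is 1.0.
    return ordered[-1]

def filter_points(points, values, cutoff):
    """Keep (point, value) pairs whose density is at least `cutoff`."""
    return [(p, v) for p, v in zip(points, values) if v >= cutoff]

toy_values = [4.0, 3.0, 2.0, 1.0]
toy_points = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
cutoff = density_fraction_cutoff(toy_values, fraction=0.85)
kept = filter_points(toy_points, toy_values, cutoff)
```

Lowering the fraction raises the cutoff and leaves fewer points, which matches the qualitative behavior the figure describes.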

Positioning

The largest known ED benchmark dataset in quantum chemistry.

Compared with classical quantum chemistry, molecular dynamics, pharmaceutical, and materials datasets, EDBench is designed to combine large-scale coverage with directly available electron density and comprehensive benchmark labels.

Table 1. Comparison of quantum chemistry databases. EDBench provides ED in CUBE format along with quantum chemical properties at unprecedented molecular scale.

Benchmarks

Six ED-centric tasks across prediction, retrieval, and generation.

EDBench tests whether models can understand and use electronic information, not merely atom-level molecular graphs. The suite covers quantum property prediction, cross-modal retrieval between molecular structures and ED, and ED prediction from molecular structures.

Quantum property prediction

ED5-EC, ED5-OE, ED5-MM, and ED5-OCS evaluate whether ED alone can infer energy components, orbital energies, multipole moments, and open-/closed-shell states.

MS ↔ ED retrieval

ED5-MER probes representational alignment between molecular structures and electron-density fields for virtual screening and electron-aware representation learning.
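As a rough picture of what cross-modal retrieval involves, the sketch below ranks ED embeddings by cosine similarity to a structure embedding and reports recall at k. The similarity measure, the metric, the function names, and the tiny embeddings are all assumptions for illustration; the benchmark's actual protocol is not specified here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query, candidates, k=1):
    """Indices of the k candidates most similar to the query."""
    ranked = sorted(range(len(candidates)),
                    key=lambda i: cosine_similarity(query, candidates[i]),
                    reverse=True)
    return ranked[:k]

def recall_at_k(queries, candidates, k=1):
    """Fraction of queries whose paired candidate (same index) appears in the top k."""
    hits = sum(1 for i, q in enumerate(queries)
               if i in retrieve_top_k(q, candidates, k))
    return hits / len(queries)

# Made-up embeddings: row i of each list is assumed to describe the same molecule.
structure_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ed_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
recall = recall_at_k(structure_embeddings, ed_embeddings, k=1)
```

Swapping the two lists gives the reverse direction (ED to structure), which is why the task is written with a two-way arrow.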

ED generation

ED5-EDP predicts DFT-level electron density from molecular structures, enabling scalable quantum-aware modeling at far lower computational cost.

Table 2. Statistical information of the six benchmark datasets with scaffold split.
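A scaffold split keeps molecules that share a core scaffold on the same side of the split, so test molecules are structurally distinct from training ones. The sketch below shows the general idea with a simple two-way split. It assumes scaffold strings have already been computed (for example Murcko scaffold SMILES from a cheminformatics toolkit); the function name, the greedy largest-first assignment, and the example IDs are hypothetical and may differ from the procedure used for EDBench.

```python
from collections import defaultdict

def scaffold_split(scaffold_by_id, train_fraction=0.8):
    """Split molecule IDs into train/test so that no scaffold appears in both."""
    groups = defaultdict(list)
    for mol_id, scaffold in scaffold_by_id.items():
        groups[scaffold].append(mol_id)
    # Largest scaffold groups first; ties broken by scaffold string for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    train_budget = train_fraction * len(scaffold_by_id)
    train, test = [], []
    for _, members in ordered:
        if len(train) + len(members) <= train_budget:
            train.extend(members)
        else:
            test.extend(members)
    return train, test

# Made-up IDs mapped to precomputed scaffold SMILES.
example = {
    "a": "c1ccccc1", "b": "c1ccccc1", "c": "c1ccccc1",
    "d": "C1CCCCC1", "e": "c1ccncc1",
}
train_ids, test_ids = scaffold_split(example, train_fraction=0.8)
```

Because whole groups are assigned at once, the realized train fraction can fall short of the requested one when scaffold groups are large.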

Citation

BibTeX

@inproceedings{xiang2025edbench,
  title         = {EDBench: Large-Scale Electron Density Data for Molecular Modeling},
  author        = {Hongxin Xiang and Ke Li and Mingquan Liu and Zhixiang Cheng and Bin Yao and Wenjie Du and Jun Xia and Li Zeng and Xin Jin and Xiangxiang Zeng},
  booktitle     = {The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year          = {2025},
  url           = {https://openreview.net/forum?id=pAd7qVrYPG}
}

@misc{xiang2025edbenchlargescaleelectrondensity,
  title         = {EDBench: Large-Scale Electron Density Data for Molecular Modeling},
  author        = {Hongxin Xiang and Ke Li and Mingquan Liu and Zhixiang Cheng and Bin Yao and Wenjie Du and Jun Xia and Li Zeng and Xin Jin and Xiangxiang Zeng},
  year          = {2025},
  eprint        = {2505.09262},
  archivePrefix = {arXiv},
  primaryClass  = {physics.chem-ph},
  url           = {https://arxiv.org/abs/2505.09262}
}