Van-Thinh To; Phước-Chung Van Nguyễn; Gia-Bao Truông; Tuyệt-Minh Phan; Tiêu-Long Phan; Rolf Fagerberg; Peter F. Stadler and Tuyển Ngọc Trường
Molecular property prediction has become essential in accelerating advancements in drug discovery and materials science. Graph neural networks (GNNs) have recently demonstrated remarkable success in molecular representation learning; however, their broader adoption is impeded by two significant challenges: (1) data scarcity and constrained model generalization, owing to the expensive and time-consuming acquisition of labeled data, and (2) inadequate initial node and edge features that fail to incorporate comprehensive chemical domain knowledge, notably orbital information. To address these limitations, we introduce a Knowledge-Guided Graph (KGG) framework that employs self-supervised learning to pretrain models on orbital-level features, mitigating the reliance on extensive labeled data sets. In addition, we propose novel representations for atomic hybridization and bond types that explicitly account for orbital engagement. Our pretraining strategy is cost-efficient, using approximately 250,000 molecules from the ZINC15 data set, in contrast to contemporary approaches that typically require between two and ten million molecules, thereby also reducing the risk of data contamination. Extensive evaluations on diverse downstream molecular property data sets demonstrate that our method significantly outperforms state-of-the-art baselines. Complementary analyses, including t-SNE visualizations and comparisons with traditional molecular fingerprints, further validate the effectiveness and robustness of the proposed KGG approach. The key advantages of KGG are its data efficiency and architectural versatility, driven by orbital-informed representations. By distilling essential chemical knowledge from modest corpora, it avoids extensive pretraining and excels in low-data fine-tuning, providing a robust and chemically meaningful foundation for diverse GNN architectures.
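To make the notion of orbital-informed node and edge features concrete, the sketch below encodes an atom's hybridization state as its fractional s/p/d orbital composition and a bond type as its σ/π orbital engagement. This is a minimal illustrative encoding based on standard hybridization theory, not the paper's exact KGG featurization; the dictionary names and function signatures are hypothetical.

```python
# Illustrative sketch of orbital-informed features (assumed encoding,
# not the exact KGG representation from the paper).

# Fraction of s, p, and d character for common hybridization states,
# e.g. sp3 mixes one s with three p orbitals -> (1/4 s, 3/4 p, 0 d).
HYBRID_ORBITAL_FRACTIONS = {
    "sp":    (1 / 2, 1 / 2, 0.0),
    "sp2":   (1 / 3, 2 / 3, 0.0),
    "sp3":   (1 / 4, 3 / 4, 0.0),
    "sp3d":  (1 / 5, 3 / 5, 1 / 5),
    "sp3d2": (1 / 6, 3 / 6, 2 / 6),
}

# Sigma/pi orbital engagement per bond type; an aromatic bond is modeled
# here with a fractional, delocalized pi contribution.
BOND_ORBITAL_ENGAGEMENT = {
    "single":   (1.0, 0.0),
    "double":   (1.0, 1.0),
    "triple":   (1.0, 2.0),
    "aromatic": (1.0, 0.5),
}

def atom_feature(hybridization: str) -> tuple[float, float, float]:
    """Return (s, p, d) orbital fractions for a hybridization label."""
    return HYBRID_ORBITAL_FRACTIONS[hybridization]

def bond_feature(bond_type: str) -> tuple[float, float]:
    """Return (sigma, pi) orbital engagement for a bond type label."""
    return BOND_ORBITAL_ENGAGEMENT[bond_type]
```

Features of this form can serve as initial node and edge attributes for any message-passing GNN, which is what makes the representation architecture-agnostic.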