Rethinking Genomic Modeling Through Optical Character Recognition

Hongxin Xiang*, Pengsen Ma*, Yunkang Cao, Di Yu, Haowen Chen, Xinyu Yang, Xiangxiang Zeng,
Hunan University & Yuelushan Laboratory
arXiv 2026

*Indicates Equal Contribution. †Indicates Co-Corresponding Author.
Motivation

OCR-inspired genomic modeling enables efficient, reconstructible visual compression.

Abstract

Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as OCR-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision–language model with a visual DNA encoder and a document decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly 20× fewer effective tokens, and surpasses models with up to 985× more activated parameters while tuning only 256k trainable parameters.
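To make the rendering idea concrete, here is a minimal sketch that lays a DNA string out as fixed-width text on white pages and records a bounding box per line, in the spirit of the paper's rendering step. It is an illustration, not the released pipeline: render_dna and every layout constant (page size, characters per line, font) are assumptions.

from PIL import Image, ImageDraw, ImageFont

# All layout choices below are illustrative assumptions.
PAGE_W, PAGE_H = 640, 640            # page resolution in pixels
CHARS_PER_LINE, LINES_PER_PAGE = 64, 32
CHAR_W, LINE_H = 9, 18               # rough cell size for the default font
MARGIN = 16

def render_dna(seq):
    """Render seq into pages; returns a list of (image, line_boxes)."""
    font = ImageFont.load_default()
    per_page = CHARS_PER_LINE * LINES_PER_PAGE
    pages = []
    for p in range(0, len(seq), per_page):
        img = Image.new("RGB", (PAGE_W, PAGE_H), "white")
        draw = ImageDraw.Draw(img)
        boxes = []
        chunk = seq[p:p + per_page]
        for i in range(0, len(chunk), CHARS_PER_LINE):
            line = chunk[i:i + CHARS_PER_LINE]
            x, y = MARGIN, MARGIN + (i // CHARS_PER_LINE) * LINE_H
            draw.text((x, y), line, fill="black", font=font)
            # (x0, y0, x1, y1) box covering this line of bases
            boxes.append((line, (x, y, x + CHAR_W * len(line), y + LINE_H)))
        pages.append((img, boxes))
    return pages

pages = render_dna("ACGT" * 4096)     # 16,384 bases
print(len(pages))                     # 8 pages of 32 annotated lines each

Under this assumed layout a page holds 2,048 bases, so a 450k-base input spans roughly 220 pages; the encoder then compresses each page into a small set of visual tokens rather than spending one token per base.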

Method

Overview of OpticalDNA. (a) Render a 1D genomic sequence into a multi-page DNA document with bounding-box annotations. (b) Construct six OCR-style prompted genomic tasks. (c) Pretrain a visual encoder–document decoder under prompt supervision.
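The prompted tasks in (b) can be sketched as prompt/target pairs built over the rendered lines, as in the code below. The templates cover the four primitives named in the abstract, while the paper itself defines six tasks; make_examples, the prompt wording, and the <mask> token are all hypothetical.

import random

def make_examples(line_text, box, rng=random.Random(0)):
    """Build hypothetical (prompt, target) pairs for one rendered line."""
    x0, y0, x1, y1 = box
    # 1) Reading: transcribe the bases in a region.
    read = ("Read the sequence in this region.", line_text)
    # 2) Region grounding: locate a query subsequence; here we answer
    #    with the containing line's box as a simplification.
    s = rng.randrange(0, len(line_text) - 8)
    query = line_text[s:s + 8]
    ground = (f"Where does {query} occur?", f"({x0},{y0},{x1},{y1})")
    # 3) Subsequence retrieval: return the bases inside a given box.
    retrieve = (f"Which bases lie in ({x0},{y0},{x1},{y1})?", line_text)
    # 4) Masked span completion: reconstruct a hidden span from context.
    m = rng.randrange(0, len(line_text) - 6)
    masked = line_text[:m] + "<mask>" + line_text[m + 6:]
    complete = (f"Fill the masked span: {masked}", line_text[m:m + 6])
    return [read, ground, retrieve, complete]

for prompt, target in make_examples("ACGTTGCAACGTACGTTGCA", (16, 16, 196, 34)):
    print(prompt, "->", target)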

Results

Fig. 1. AUROC performance for eQTL tasks on DNALONGBench. The best results are bolded and the second-best are underlined.

Fig. 2. Generalization performance across Rice subspecies, from in-domain to far-OOD evaluation (Accuracy / AUROC).

Fig. 3. Grad-CAM visualization of multi-page fusion for a donor case (eight pages). Purple boxes indicate donor splice sites; numbers denote page-level mean attribution.
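For readers who want to reproduce this style of analysis, the sketch below computes page-level mean attributions with Grad-CAM, using a generic torchvision backbone as a stand-in for the paper's visual DNA encoder; the target layer, input resolution, and scalar score are assumptions.

import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()   # stand-in for the visual encoder
acts, grads = {}, {}
layer = model.layer4                    # assumed target layer for Grad-CAM
layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

pages = torch.randn(8, 3, 224, 224)     # stand-in for 8 rendered DNA pages
score = model(pages).sum()              # stand-in scalar (e.g. a class logit)
score.backward()

w = grads["g"].mean(dim=(2, 3), keepdim=True)         # channel weights
cam = F.relu((w * acts["a"]).sum(dim=1)).detach()     # (8, H, W) heatmaps
page_attr = cam.flatten(1).mean(dim=1)                # page-level mean
print((page_attr / page_attr.sum()).tolist())         # one number per page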

BibTeX

@misc{xiang2026rethinkinggenomicmodelingoptical,
      title={Rethinking Genomic Modeling Through Optical Character Recognition},
      author={Hongxin Xiang and Pengsen Ma and Yunkang Cao and Di Yu and Haowen Chen and Xinyu Yang and Xiangxiang Zeng},
      year={2026},
      eprint={2602.02014},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.02014},
}