Biological Inductive Biasing of LLM for Aussie Crops Properties Prediction

This project embeds biological inductive bias into transformer LLMs to improve genomic prediction of agronomic traits in Australian crops. Using chemistry, SNP, and protein structure rules, the models aim to predict yield, flowering time, and gene function from sparse data.

Project outline

This project investigates how large language models (LLMs) can be enhanced with biological inductive bias to improve prediction of key agronomic traits in Australian crops. While LLMs have shown remarkable success across many domains, their performance on genomic and biological data is often limited by a lack of built‑in biological structure. This research addresses that gap by embedding chemical, physical, and structural biological rules directly into transformer‑based genomic language models.

The project will develop novel methods for extracting and incorporating biological inductive bias—such as k‑mer chemistry, SNP structure, and protein 3D constraints—into LLMs, enabling more accurate prediction and generation of genomic sequences and crop traits. Using CSIRO datasets for major Australian crops including wheat, chickpea, and canola, the models will be applied to downstream tasks such as yield prediction, flowering time estimation, gene expression inference, and functional region identification, even in data‑sparse settings.

By combining transformer architectures, custom activation functions, and biologically informed constraints, this research aims to deliver robust, interpretable AI tools for crop property prediction, supporting future advances in crop breeding, resilience, and food security under changing environmental conditions.

What the student will learn

Through this project, the student will gain hands‑on experience designing and implementing transformer‑based LLMs from scratch, with a focus on inductive bias and biological interpretability. They will develop skills in genomic data processing, tokenisation strategies, and multimodal modelling using real CSIRO crop datasets. Working within a collaborative ANU–CSIRO research environment, the student will also strengthen their research communication skills while applying advanced AI techniques to real‑world challenges in agricultural genomics and crop improvement.

More detail on this project can be found here

Questions about this project can be directed to shannon.dillon@csiro.au

To register an expression of interest, click here. You will need to outline why you have selected the research project and how your skills, experience and/or knowledge meet the project requirements.