News Release

Groundbreaking research compares prompt styles and LLMs for structured data generation, unveiling key trade-offs for real-world AI applications

Peer-Reviewed Publication

ELSP

Evaluates six prompt styles (JSON, CSV, Prefix, YAML, Function, Hybrid) across three LLMs (ChatGPT-4o, Claude, Gemini) on datasets (Stories, Medical, Receipts), measuring Accuracy, Token Cost, and Time to reveal model trade-offs and optimal prompt choices


Credit: Ashraf Elnashar, Jules White/Vanderbilt University, Douglas C. Schmidt/William & Mary

Nashville, TN & Williamsburg, VA – 24 Nov 2025 – A new study published in Artif. Intell. Auton. Syst. delivers the first systematic cross-model analysis of prompt engineering for structured data generation, offering actionable guidance for developers, data scientists, and organizations leveraging large language models (LLMs) in healthcare, e-commerce, and beyond. Led by Ashraf Elnashar from Vanderbilt University, alongside co-authors Jules White (Vanderbilt University) and Douglas C. Schmidt (William & Mary), the research benchmarks six prompt styles across three leading LLMs to solve a critical challenge: balancing accuracy, speed, and cost in structured data workflows.

Structured data—from medical records and receipts to business analytics—powers essential AI-driven tasks, but its quality and efficiency depend heavily on how prompts are designed. “Prior research only scratched the surface, testing a limited set of prompts on single models,” said Elnashar, the study’s corresponding author and a researcher in Vanderbilt’s Department of Computer Science. “Our work expands the horizon by evaluating six widely used prompt formats across ChatGPT-4o, Claude, and Gemini, revealing clear trade-offs that let practitioners tailor their approach to real-world needs.”

Key Findings: Accuracy vs. Efficiency—A Clear Choice for Every Use Case

The team’s rigorous experiment, conducted across three datasets (personal stories, medical records, and receipts), measured accuracy, token cost (a key driver of API expenses), and generation time for each combination of prompt style and LLM. The results uncovered distinct strengths in each model:

  • Claude emerged as the accuracy leader (85% overall), excelling with hierarchical prompt formats like JSON and YAML—ideal for complex, high-stakes tasks such as medical record generation where data integrity is non-negotiable.
  • ChatGPT-4o stood out for efficiency, delivering the lowest token usage (under 100 tokens for lightweight formats) and fastest processing times (4–6 seconds on average), making it perfect for cost-sensitive or real-time applications like e-commerce receipt processing.
  • Gemini offered a balanced middle ground, with solid performance across all metrics—though it showed variability with mixed-format prompts like Hybrid CSV/Prefix.

“Hierarchical formats like JSON and YAML boost accuracy but come with higher token costs, while lightweight options like CSV and simple prefixes cut latency without sacrificing much precision,” Elnashar explained. “For example, a healthcare provider handling patient data might prioritize Claude + JSON for accuracy, while an e-commerce platform could opt for ChatGPT-4o + CSV to process thousands of receipts efficiently.”
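To make that trade-off concrete, the sketch below contrasts a hierarchical JSON-style prompt with a lightweight CSV-style prompt for a receipt-processing task. The wording and field names are illustrative assumptions for this release, not the exact templates used in the study (the authors' templates are available in their GitHub repository).

```python
# Illustrative only: hypothetical prompt templates contrasting a hierarchical
# JSON-style format with a lightweight CSV-style format for receipt data.
# The study's actual templates are published in the authors' repository.

# Hierarchical JSON-style prompt: an explicit nested schema costs more tokens,
# but the added structure tends to improve field-level accuracy.
JSON_STYLE_PROMPT = """\
Generate a receipt as JSON matching this schema exactly:
{
  "merchant": "<string>",
  "date": "<YYYY-MM-DD>",
  "items": [{"name": "<string>", "qty": <int>, "price": <float>}],
  "total": <float>
}
Return only valid JSON, no extra text.
"""

# Lightweight CSV-style prompt: far fewer tokens and faster responses,
# at the cost of less explicit structure for nested fields.
CSV_STYLE_PROMPT = """\
Generate a receipt as CSV with the header:
merchant,date,item,qty,price,total
Return only CSV rows, no extra text.
"""
```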

The study also highlighted a universal challenge: all LLMs struggled with narrative-style unstructured data (e.g., personal stories), with accuracy dropping to ~40% across prompt styles—underscoring the need for tailored approaches for different data types.

Practical Tools for Developers: Reusable Resources to Accelerate AI Workflows

Beyond insights, the research provides tangible value for the AI community. The team has made datasets, prompt templates, validation scripts, and design guidelines publicly available on GitHub (https://github.com/elnashara/EfficientStructuringMethods/tree/main), enabling reproducibility and immediate adoption.
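As a rough illustration of how such validation resources can be used, the minimal sketch below checks whether a model's JSON output parses and contains a set of expected fields; it is an assumption about the general approach, not the repository's actual validation script, and the schema is hypothetical.

```python
import json

# Minimal sketch (assumption, not the repository's actual validation script):
# verify that an LLM's JSON output parses and contains the expected top-level
# fields, the kind of field-level check that underlies an accuracy metric.

REQUIRED_FIELDS = {"merchant", "date", "items", "total"}  # hypothetical schema


def validate_receipt_json(raw_output: str) -> bool:
    """Return True if raw_output is valid JSON containing all required fields."""
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and REQUIRED_FIELDS.issubset(record)


if __name__ == "__main__":
    sample = '{"merchant": "Acme", "date": "2025-11-24", "items": [], "total": 0.0}'
    print(validate_receipt_json(sample))  # True
```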

“We wanted to move beyond theory—these resources let developers skip the trial-and-error and directly apply our findings to their pipelines,” said Jules White, co-author and professor at Vanderbilt’s Department of Computer Science. “Whether you’re building a medical data system or an e-commerce analytics tool, our work gives you a roadmap to choose the right prompt style and LLM.”

Looking Ahead: Expanding the Boundaries of Prompt Engineering

The study builds on the authors’ prior work focused on GPT-4o, now generalized to multiple models and prompt formats. Future research will explore LLMs’ robustness to noisy instructions, missing fields, and unseen schemas—critical considerations for real-world deployments. “As AI becomes more integrated into critical systems, we need to understand how these models perform when faced with the messiness of real data,” noted Schmidt, a professor in William & Mary’s Department of Computer Science.

This research was conducted without specific grant funding. The authors acknowledge the use of the LLMs ChatGPT-4o, Claude, and Gemini for code generation, data visualization, and comparative evaluation.

About the Authors

  • Ashraf Elnashar: Department of Computer Science, Vanderbilt University (ashraf.elnashar@vanderbilt.edu)
  • Jules White: Department of Computer Science, Vanderbilt University
  • Douglas C. Schmidt: Department of Computer Science, William & Mary

About the Publication

Title: Prompt engineering for structured data: a comparative evaluation of styles and LLM performance

Journal: Artif. Intell. Auton. Syst.

DOI: 10.55092/aias2025009

License: Creative Commons Attribution 4.0 International License


Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.