xiand.ai
John Rush Uses AI to Analyze 25 Years of Egg Receipts

John Rush utilized artificial intelligence to process 11,345 receipts spanning 25 years, aiming to track egg prices and test automation limits. The project demonstrated how specialized AI models can overcome OCR challenges in unstructured personal archives. Results showed high accuracy in extracting financial data from degraded thermal prints.

La Era

John Rush completed a large data analysis project, using artificial intelligence to review 11,345 personal receipts spanning 25 years. The hobbyist developer set out to test the limits of AI coding agents and modern OCR technology on unstructured financial data. His primary objective was to extract egg purchase prices from the 25-year archive of thermal prints and PDFs. The archive ranged from email attachments to flatbed scans.

The project required 14 days of processing time and consumed approximately 1.6 billion tokens across multiple models. Rush used AI coding agents such as Codex and Claude to plan the extraction workflow and manage the complex file system. Across 15 interactive sessions, these agents identified forgotten databases and generated project plans within hours of the first interaction. The workflow combined human direction with long stretches of autonomous agent execution.

A significant technical hurdle involved segmenting flatbed scans where receipts and scanner beds shared similar white backgrounds. Traditional computer vision approaches failed to detect receipt boundaries with sufficient accuracy. Meta’s SAM3 segmentation model eventually resolved the issue with confidence scores ranging from 0.92 to 0.98 per scan. This capability allowed the processing of 1,873 receipts from 760 multi-receipt pages.

Optical character recognition proved difficult because of degraded thermal prints and inconsistent document orientations. Initial attempts with Tesseract produced misread text and dropped decimal points on older receipts. Rush switched to PaddleOCR-VL, a local vision-language model optimized for Apple Silicon hardware. Slicing tall receipts dynamically, based on their aspect ratio, resolved the problems with long documents.
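The aspect-ratio slicing described above can be sketched as follows. This is a minimal illustration, not Rush's code: the function name `slice_boxes`, the `max_ratio` cutoff, and the `overlap` fraction are assumptions chosen for the example. It splits a tall image into overlapping horizontal bands so that no band is more than `max_ratio` times taller than it is wide.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def slice_boxes(width: int, height: int,
                max_ratio: float = 2.0,
                overlap: float = 0.1) -> List[Box]:
    """Return crop boxes that cover a tall image in overlapping bands.

    max_ratio and overlap are illustrative defaults, not values
    reported in the article.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if max_ratio <= 0 or not 0 <= overlap < 1:
        raise ValueError("max_ratio must be > 0 and overlap in [0, 1)")

    # Tallest band allowed for this image width.
    band_height = max(1, int(width * max_ratio))
    if height <= band_height:
        return [(0, 0, width, height)]  # short enough to OCR in one pass

    # Advance by less than a full band so neighbouring bands overlap.
    step = max(1, int(band_height * (1 - overlap)))
    boxes: List[Box] = []
    top = 0
    while True:
        bottom = min(top + band_height, height)
        boxes.append((0, top, width, bottom))
        if bottom == height:
            break
        top += step
    return boxes


if __name__ == "__main__":
    for box in slice_boxes(400, 2000):
        print(box)
```

Each box could then be passed to an image library's crop call before OCR. The overlap is there so that a line of text cut by one band boundary is likely to appear whole in the neighbouring band.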

Structured extraction required moving beyond simple regular expressions to handle store-specific abbreviations and OCR artifacts. Regex matching failed to identify items labeled with codes like STO LRG BRUNN or truncated descriptions such as EDGS. Rush pivoted to sending every receipt through Codex for comprehensive structured data extraction. This approach handled variations like LG EGO 12 CT caused by OCR mangling.
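A short sketch shows why pattern matching falls short on the labels quoted above. The pattern, the similarity threshold, and the function names are assumptions made for illustration. Fuzzy matching is used here only as a contrast and is not the method the project adopted: it can recover single-character OCR slips, but it has no way to know that a store code refers to eggs, which is the gap the LLM-based extraction was brought in to close.

```python
import re
from difflib import SequenceMatcher

# Hypothetical literal pattern of the kind a first attempt might use.
EGG_PATTERN = re.compile(r"\bEGGS?\b")
TARGETS = ("EGG", "EGGS")


def regex_match(line: str) -> bool:
    """True if the line contains the literal word EGG or EGGS."""
    return EGG_PATTERN.search(line.upper()) is not None


def fuzzy_match(line: str, threshold: float = 0.6) -> bool:
    """True if any token is similar enough to EGG or EGGS.

    The 0.6 threshold is an illustrative assumption.
    """
    for token in line.upper().split():
        for target in TARGETS:
            if SequenceMatcher(None, token, target).ratio() >= threshold:
                return True
    return False


if __name__ == "__main__":
    for line in ("LARGE EGGS 12 CT", "LG EGO 12 CT", "EDGS", "STO LRG BRUNN"):
        print(f"{line!r}: regex={regex_match(line)} fuzzy={fuzzy_match(line)}")
```

In this sketch the literal pattern matches only the clean label, fuzzy matching also catches the two OCR-damaged labels, and neither catches STO LRG BRUNN.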

Automation scripts frequently crashed during long-running overnight jobs, losing data before they completed. Rush implemented a fix that starts a fresh process for each batch and adds checkpointing, so an interrupted run resumes from cached results. This adjustment reduced the estimated processing time from 12 hours to just three hours per run. The processes were launched in a tmux session to prevent session timeout failures.
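The checkpoint-and-resume idea can be sketched in a few lines. This is an assumed design, not the project's actual script: the function names, the JSON checkpoint file, and the batch size are all hypothetical, and spawning a fresh process per batch is left out for brevity. Results are saved after every batch, so a crash costs at most one batch of work and the next invocation skips everything already done.

```python
import json
import os
from pathlib import Path
from typing import Callable, Dict, List


def load_checkpoint(path: Path) -> Dict[str, str]:
    """Return previously saved results, or an empty dict on a first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_checkpoint(path: Path, done: Dict[str, str]) -> None:
    """Write results to a temp file, then swap it in atomically."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(done), encoding="utf-8")
    os.replace(tmp, path)  # a crash mid-write leaves the old file intact


def run_resumable(ids: List[str],
                  process: Callable[[str], str],
                  checkpoint: Path,
                  batch_size: int = 50) -> Dict[str, str]:
    """Process ids in batches, skipping any already in the checkpoint."""
    done = load_checkpoint(checkpoint)
    pending = [rid for rid in ids if rid not in done]
    for start in range(0, len(pending), batch_size):
        for rid in pending[start:start + batch_size]:
            done[rid] = process(rid)
        save_checkpoint(checkpoint, done)  # persist after every batch
    return done


if __name__ == "__main__":
    results = run_resumable(["r1", "r2", "r3"], str.upper,
                            Path("checkpoint.json"), batch_size=2)
    print(results)
```

Rerunning the same command after a failure picks up where the last completed batch left off, which is what makes unattended overnight runs tolerable.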

Validation involved hand-labeling 375 receipts to establish a ground truth for model accuracy. The final classifier achieved over 99% accuracy after incorporating edge cases into few-shot examples. A custom labeling tool built in 22 minutes facilitated this manual verification process. Human QA confirmed that the AI model correctly identified items even when initial labels were incorrect.
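The accuracy check itself is simple arithmetic, sketched below with hypothetical names and labels. If the figure was measured over all 375 hand labels, "over 99%" would allow at most 3 disagreements, since 372/375 = 99.2% while 371/375 is about 98.9%. The sketch also returns the disagreeing receipt IDs, because the article notes that some mismatches turned out to be errors in the hand labels rather than in the model.

```python
from typing import Dict, List, Tuple


def accuracy_report(predicted: Dict[str, str],
                    truth: Dict[str, str]) -> Tuple[float, List[str]]:
    """Compare model labels to hand labels on the receipts both cover.

    Returns (accuracy, sorted IDs where the two disagree). The label
    values are free-form strings; nothing here is specific to eggs.
    """
    shared = sorted(set(predicted) & set(truth))
    if not shared:
        raise ValueError("no receipts in common to compare")
    disagreements = [rid for rid in shared if predicted[rid] != truth[rid]]
    accuracy = (len(shared) - len(disagreements)) / len(shared)
    return accuracy, disagreements


if __name__ == "__main__":
    # Made-up labels for illustration only.
    model = {"r1": "eggs", "r2": "other", "r3": "eggs", "r4": "other"}
    hand = {"r1": "eggs", "r2": "other", "r3": "other", "r4": "other"}
    acc, to_review = accuracy_report(model, hand)
    print(f"accuracy={acc:.2%}, review={to_review}")
```

Each ID in the review list is a candidate for a second human look, where either the model output or the original hand label gets corrected.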

The experiment demonstrated that specialized models outperform generalist language models in specific technical tasks. While LLMs excelled at orchestration and tool generation, SAM3 and PaddleOCR handled image segmentation and text recognition. Rush noted that the agents could not segment images or replace the OCR engine directly. This hybrid approach represents a practical pipeline for processing complex historical archives.

Rush plans to extend the dataset to cover 30 years of consumer price tracking in the near future. The project highlights the growing capability of AI agents to manage complex data engineering pipelines autonomously. It suggests that personal archives can yield valuable economic insights through automated analysis. The findings indicate a new era where hobbyists can access deep historical data without manual entry.
