DharmaOCR introduces Direct Preference Optimization (DPO) to combat text degeneration in OCR models. The second training stage reduced degeneration rates by an average of 59.4%, addressing a significant limitation of supervised fine-tuning.
DharmaOCR, a specialized structured OCR model, was released in April and is available on Hugging Face. A subsequent paper detailed its development methodology and reported benchmark results on quality and cost efficiency.
The paper benchmarked various vision-language model families, comparing both open-source and commercial options for structured document extraction tasks. The key metric reported was the text degeneration rate, which measures the frequency of repetition loops during transcription of Brazilian Portuguese text.
Degeneration rates varied widely among tested models, from below 1% to above 33% for vanilla models.
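To make the metric concrete, here is a minimal sketch of how a degeneration rate could be computed over a batch of transcriptions. The detection rule is an assumption for illustration: it flags an output whose tail is the same whitespace-separated token sequence repeated back to back. The function names and the `min_repeats` and `max_period` thresholds are hypothetical, and the paper's actual criterion may differ.

```python
from typing import Sequence


def has_repetition_loop(text: str, min_repeats: int = 4, max_period: int = 50) -> bool:
    """Heuristic (assumed, not the paper's rule): True if the output ends in a
    token sequence repeated back to back at least `min_repeats` times."""
    tokens = text.split()
    for period in range(1, max_period + 1):
        # Not enough tokens left for `min_repeats` copies of a unit this long.
        if len(tokens) < period * min_repeats:
            break
        unit = tokens[-period:]
        repeats = 1
        pos = len(tokens) - 2 * period
        # Walk backwards one unit at a time while the same unit keeps appearing.
        while pos >= 0 and tokens[pos:pos + period] == unit:
            repeats += 1
            pos -= period
        if repeats >= min_repeats:
            return True
    return False


def degeneration_rate(outputs: Sequence[str]) -> float:
    """Fraction of outputs flagged as degenerate (0.0 for an empty batch)."""
    if not outputs:
        return 0.0
    return sum(has_repetition_loop(o) for o in outputs) / len(outputs)


# Hypothetical transcriptions, not data from the paper.
sample_outputs = [
    "Nota fiscal emitida em São Paulo",
    "Total: " + "R$ 10,00 " * 6,
    "Valor total do pedido",
]
print(degeneration_rate(sample_outputs))
```

A tail-only check is cheap and matches the typical failure pattern, where a model falls into a loop and stays there until it hits the token limit. It would miss a loop that the model later escapes from.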
Supervised fine-tuning (SFT) typically improved degeneration rates, yet rarely brought them down to acceptable production levels. This points to a structural limitation: SFT trains the model to reproduce reference transcriptions but never explicitly penalizes degeneration, so there is a ceiling on how much it can mitigate this failure.
Direct Preference Optimization (DPO), applied as a second training stage after SFT, reduced degeneration in every model tested, by 59.4% on average and by up to 87.6%. DPO learns from binary preference pairs, with the correct transcription as the chosen output and a degeneration loop as the rejected one, which makes it well suited to OCR, where the correct output is not a matter of subjective judgment.
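The preference signal described above can be illustrated with the standard DPO objective for a single pair. This is a sketch of the generic published loss, not DharmaOCR's training code: the example pair, the log-probability values, and `beta=0.1` are assumptions, and the summary does not say which hyperparameters or DPO variant were used.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or the frozen SFT model (ref).
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical preference pair: clean transcription vs. a repetition loop.
pair = {
    "chosen": "Total: R$ 10,00",
    "rejected": "Total: " + "R$ 10,00 " * 50,
}

# Before any DPO update the policy equals the reference, so the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Raising the chosen output's likelihood relative to the loop lowers the loss.
print(dpo_loss(-5.0, -20.0, -10.0, -10.0))
```

The loss falls only when the model raises the likelihood of the clean transcription relative to the loop, measured against the frozen SFT reference. That gives the degenerate output an explicit penalty, which a plain SFT objective does not provide.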
The exact reasons why SFT struggles to reduce degeneration are still under investigation, with the leading theories pointing to a loss granularity issue. Applying DPO to OCR is a significant advance in output quality for these types of models.