Overview
A document classification system that mirrors real-world document routing workflows used in title and escrow processing. The system leverages OCR to extract text from scanned images, then employs a Bidirectional LSTM (BiLSTM) recurrent neural network to analyze word sequences and classify documents into four categories: letters, forms, emails, and invoices. Built on the RVL-CDIP dataset, this text-only approach demonstrates how sequential models can understand document structure through natural language patterns.
Technical Highlights
OCR Pipeline: Pytesseract (Tesseract wrapper) integration converts scanned images to text, preserving document structure through sequential word order.
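The OCR step can be sketched as follows. This is a minimal illustration, assuming pytesseract and the Tesseract binary are installed; the `raw` string below is a hardcoded stand-in for what `pytesseract.image_to_string(image)` would return for a scanned page.

```python
# Stand-in for: raw = pytesseract.image_to_string(Image.open("page.png"))
raw = "Dear  Ms. Jones,\n\nPlease find the enclosed\ninvoice for March.\n"

def normalize_ocr_text(text: str) -> str:
    """Collapse line breaks and repeated spaces while preserving word order,
    since the sequential word order is what the downstream BiLSTM consumes."""
    return " ".join(text.split())

print(normalize_ocr_text(raw))
# "Dear Ms. Jones, Please find the enclosed invoice for March."
```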
BiLSTM Architecture: Bidirectional LSTM (128 hidden units) reads text both forward and backward to capture contextual relationships in document headers, lines, and signatures.
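A toy illustration of the bidirectional idea (not the Keras layer itself; in Keras this is roughly `layers.Bidirectional(layers.LSTM(128))`, whose forward and backward 128-unit states concatenate into a 256-dim output):

```python
def left_context(tokens):
    """What a forward reader has seen at each position (inclusive)."""
    return [tokens[: i + 1] for i in range(len(tokens))]

def right_context(tokens):
    """What a backward reader has seen at each position (inclusive)."""
    return [tokens[i:] for i in range(len(tokens))]

tokens = ["sincerely", "yours", "john"]
fwd = left_context(tokens)   # position 0 sees only ["sincerely"]
bwd = right_context(tokens)  # position 0 sees the whole closing signature
both = [f + b for f, b in zip(fwd, bwd)]  # BiLSTM-style concatenation
```

Combining both directions is why tokens near the top of a document (headers) can still be interpreted in light of what appears at the bottom (signatures), and vice versa.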
Text Processing Pipeline: Keras TextVectorization layer with 20K vocabulary, 300-token sequences, and padding/truncation for consistent input dimensions across all documents.
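The vectorization step above can be sketched in pure Python. This mirrors what Keras `TextVectorization` does (index 0 reserved for padding, index 1 for out-of-vocabulary, right-padding/truncation to a fixed length); the tiny vocabulary and the 6-token length are illustrative only.

```python
MAX_TOKENS = 300  # the pipeline above uses 300-token sequences

def vectorize(text: str, vocab: dict, max_len: int = MAX_TOKENS) -> list:
    """Lowercase/split, map words to ids, then truncate or right-pad."""
    ids = [vocab.get(w, 1) for w in text.lower().split()]  # 1 = [UNK]
    ids = ids[:max_len]                                    # truncate
    return ids + [0] * (max_len - len(ids))                # 0 = padding

vocab = {"invoice": 2, "total": 3, "due": 4}
seq = vectorize("Invoice total due immediately", vocab, max_len=6)
# → [2, 3, 4, 1, 0, 0]  ("immediately" is out-of-vocab; two pad slots)
```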
Training Pipeline: 4000 training samples across 4 balanced classes, trained for 3 epochs with the AdamW optimizer, reaching ~83% accuracy on a 476-sample held-out test set.
Real-Time Prediction: predict_image() function accepts unseen scanned pages, performs OCR + tokenization, and returns class + confidence scores for production deployment.
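The tail end of that flow can be sketched as below. The OCR, vectorization, and model call are assumed to have already produced raw logits; this shows only how `predict_image()`-style code would turn them into a class plus confidence via softmax (class names follow the four categories above).

```python
import math

CLASSES = ["letter", "form", "email", "invoice"]

def softmax(logits):
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def to_prediction(logits):
    """Map raw model outputs to (class, confidence)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]

label, conf = to_prediction([0.2, 0.1, 0.3, 2.4])
# label == "invoice"; conf is the softmax probability of that class
```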
Production-Ready Dataset: RVL-CDIP streaming from Hugging Face with per-class sampling (1000 train, 120 test per category) for balanced model performance.
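The per-class sampling can be sketched against a streaming iterator. Field names and the fake stream are illustrative; with Hugging Face this would wrap something like `load_dataset(..., streaming=True)`.

```python
from collections import Counter

def balanced_sample(stream, per_class, classes):
    """Take the first `per_class` examples of each class from a stream,
    stopping as soon as every class quota is filled."""
    counts = Counter()
    for example in stream:
        label = example["label"]
        if label in classes and counts[label] < per_class:
            counts[label] += 1
            yield example
        if all(counts[c] >= per_class for c in classes):
            break

# Usage with a fake stream standing in for the RVL-CDIP iterator:
classes = ["letter", "form", "email", "invoice"]
fake = [{"label": l, "i": i} for i, l in enumerate(classes * 5)]
subset = list(balanced_sample(fake, per_class=2, classes=classes))
# 8 examples, exactly 2 per class
```

Stopping early matters with streaming datasets: the iterator never materializes the full corpus, so quotas (1000 train / 120 test per class here) keep memory and download costs bounded.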
Features
Multi-Class Classification: Accurately distinguishes between 4 document types with softmax activation for probabilistic predictions.
RNN-Based Understanding: BiLSTM captures sequential patterns in document text, understanding context that simple bag-of-words models miss.
Real-World Dataset: Trained on RVL-CDIP, a public dataset of real scanned documents, ensuring practical applicability.
Inference API: predict_image() function accepts PIL images, performs OCR, tokenization, and returns class prediction with confidence scores.
Validation Metrics: Comprehensive evaluation with accuracy tracking and confusion-matrix support for error analysis.
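A small pure-Python helper shows the shape of that error analysis; the labels below are made-up examples, not results from the trained model.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows = true class, columns = predicted class; counts per pair."""
    m = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

y_true = ["letter", "letter", "email", "invoice"]
y_pred = ["letter", "email",  "email", "invoice"]
cm = confusion_matrix(y_true, y_pred, ["letter", "form", "email", "invoice"])
accuracy = sum(cm[c][c] for c in cm) / len(y_true)  # 0.75 here
# cm["letter"]["email"] == 1 flags the one letter misrouted as an email
```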
Why RNN/Text-Only?
OCR extracts structured text that LSTMs can process sequentially, capturing document flow (headers → body → signatures). This text-only approach keeps the model simple and interpretable without requiring complex computer vision transformers, while still achieving strong classification performance by leveraging the inherent sequential nature of document content.