Extracting Invoice Data Using Regex and LLMs

Learn how to extract invoice numbers, amounts, dates, and vendor information from PDFs using regular expressions combined with AI language models.

Published: November 15, 2025

The Invoice Data Entry Problem

Manual invoice entry is one of the most time-consuming bookkeeping tasks. Processing 100 invoices per month can consume 15-20 hours of manual data entry. By combining regex with an AI language model, you can reduce this to under 1 hour while improving accuracy.

Critical Invoice Fields to Extract

Most invoices contain the same core data points, and regex can identify many of them reliably:

1. Invoice Number Patterns

Common Formats:

  • INV-12345 → Pattern: INV-\d{5}
  • Invoice #A-2025-001 → Pattern: Invoice #[A-Z]-\d{4}-\d{3}
  • #2025110001 → Pattern: #\d{10}
  • SI-Nov-2025-123 → Pattern: SI-[A-Za-z]{3}-\d{4}-\d+
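
As a rough illustration, the four formats above can be tried in order until one matches. This is a minimal Python sketch; the function name find_invoice_number is an assumption made for the example, and real invoices will likely need more patterns.

```python
import re

# The four example formats from the list above, tried in order.
INVOICE_PATTERNS = [
    r"INV-\d{5}",
    r"Invoice #[A-Z]-\d{4}-\d{3}",
    r"#\d{10}",
    r"SI-[A-Za-z]{3}-\d{4}-\d+",
]

def find_invoice_number(text):
    """Return the first invoice number matching a known format, or None."""
    for pattern in INVOICE_PATTERNS:
        match = re.search(pattern, text)
        if match:
            return match.group(0)
    return None

print(find_invoice_number("Ref: INV-12345 dated 11/15/2025"))  # INV-12345
```

Because the patterns are tried in list order, put the most specific formats first.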

2. Date Extraction

Invoices use various date formats. Regex patterns for each:

Format         | Pattern                   | Example
MM/DD/YYYY     | \d{2}/\d{2}/\d{4}         | 11/15/2025
DD-MM-YYYY     | \d{2}-\d{2}-\d{4}         | 15-11-2025
Month DD, YYYY | [A-Za-z]+ \d{1,2}, \d{4}  | November 15, 2025
YYYY-MM-DD     | \d{4}-\d{2}-\d{2}         | 2025-11-15
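
A regex only confirms that text looks like a date, so it helps to parse each match into a real date afterwards. The sketch below pairs each pattern with a strptime format. It assumes slash dates are month-first and dash dates are day-first, as in the examples, and it adds \b word boundaries that are not part of the patterns above. The name find_dates is hypothetical.

```python
import re
from datetime import datetime

# Each regex is paired with the strptime format used to parse its matches.
# Assumption: slashes mean MM/DD/YYYY and dashes mean DD-MM-YYYY.
DATE_FORMATS = [
    (r"\b\d{2}/\d{2}/\d{4}\b", "%m/%d/%Y"),
    (r"\b\d{2}-\d{2}-\d{4}\b", "%d-%m-%Y"),
    (r"\b[A-Za-z]+ \d{1,2}, \d{4}\b", "%B %d, %Y"),
    (r"\b\d{4}-\d{2}-\d{2}\b", "%Y-%m-%d"),
]

def find_dates(text):
    """Return every date found in text as a datetime.date object."""
    found = []
    for pattern, fmt in DATE_FORMATS:
        for raw in re.findall(pattern, text):
            try:
                found.append(datetime.strptime(raw, fmt).date())
            except ValueError:
                # Looked like a date but is not a real one, e.g. 99/99/2025.
                continue
    return found
```

Results come back grouped by pattern rather than by position in the text, so sort them if order matters.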

3. Amount Extraction

Currency amounts come in many formats:

  • With currency symbol: \$[\d,]+\.\d{2} → $1,234.56
  • Without symbol: \b\d+\.\d{2}\b → 1234.56
  • With thousands separator: \$?[\d,]+\.\d{2} → $1,234.56 or 1,234.56
  • Total line: (?i)total:?\s*\$?([\d,]+\.\d{2})
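
Once the total is matched, it is usually converted to a number for later checks. The sketch below uses the total-line pattern with one assumed change: a leading \b so that "Subtotal" is not mistaken for "Total". Decimal is used instead of float to avoid rounding surprises with currency. The name extract_total is hypothetical.

```python
import re
from decimal import Decimal

# Assumption: \b is added so the pattern skips "Subtotal".
TOTAL_PATTERN = re.compile(r"(?i)\btotal:?\s*\$?([\d,]+\.\d{2})")

def extract_total(text):
    """Return the invoice total as a Decimal, or None if no total line is found."""
    match = TOTAL_PATTERN.search(text)
    if match is None:
        return None
    # Strip thousands separators before converting.
    return Decimal(match.group(1).replace(",", ""))

print(extract_total("Subtotal: $1,100.00\nTotal: $1,234.56"))  # 1234.56
```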

4. Vendor Information

Extract vendor details:

  • Company name: Often in the first few lines, in all caps
  • Address: Street, City, State ZIP pattern
  • Tax ID: \d{2}-\d{7} (EIN format)
  • Website: www\.[a-z0-9-]+\.(com|net|org)
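
Here is one way these hints might be combined, as a sketch rather than a complete vendor parser. It assumes the company name is the first all-caps line among the first five lines, which will not hold for every layout. The function name and the dictionary keys are made up for the example.

```python
import re

EIN_PATTERN = re.compile(r"\b\d{2}-\d{7}\b")
WEBSITE_PATTERN = re.compile(r"www\.[a-z0-9-]+\.(?:com|net|org)", re.IGNORECASE)

def extract_vendor_hints(text):
    """Collect vendor details that simple patterns can find."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Assumption: the company name is the first all-caps line near the top.
    name = next((line for line in lines[:5] if line.isupper()), None)
    ein = EIN_PATTERN.search(text)
    site = WEBSITE_PATTERN.search(text)
    return {
        "company_name": name,
        "tax_id": ein.group(0) if ein else None,
        "website": site.group(0) if site else None,
    }
```

Addresses vary too much for a single short pattern, so they are left to the AI step in this sketch.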

AI + Regex Workflow for Invoices

Step-by-Step Process

  1. Convert PDF to text

    Use OCR or a PDF extraction tool (many AI platforms include this)

  2. Apply regex pre-extraction

    Pull out obvious patterns: dates, amounts, invoice numbers

  3. AI prompt with extracted data

    "From this invoice text, I've extracted:
    - Invoice number: [regex result]
    - Date: [regex result]
    - Total: [regex result]

    Please verify these are correct and extract:
    1. Vendor name
    2. Billing address
    3. Line items with descriptions and amounts
    4. Tax amount
    5. Payment terms

    Return in JSON format."

  4. AI processes and structures data

    Returns clean JSON with all invoice fields

  5. Regex validation of AI output

    Verify that amounts match the expected pattern, dates are valid, and the invoice number format is correct

  6. Import to accounting system

    Direct API integration or CSV import
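
Steps 2 through 5 can be sketched in Python as follows. This is an outline under stated assumptions, not a finished pipeline: call_llm is a hypothetical function you supply that takes a prompt and returns the model's JSON reply as a string, the field names are invented for the example, and only one invoice number format and one date format are handled.

```python
import json
import re

def pre_extract(text):
    """Step 2: pull the obvious fields with regex (None when a pattern misses)."""
    def first(pattern):
        match = re.search(pattern, text)
        return match.group(1) if match else None
    return {
        "invoice_number": first(r"\b(INV-\d{5})\b"),
        "date": first(r"\b(\d{2}/\d{2}/\d{4})\b"),
        "total": first(r"(?i)\btotal:?\s*\$?([\d,]+\.\d{2})"),
    }

def build_prompt(text, fields):
    """Step 3: hand the regex results to the model for verification."""
    return (
        "From this invoice text, I've extracted:\n"
        f"- Invoice number: {fields['invoice_number']}\n"
        f"- Date: {fields['date']}\n"
        f"- Total: {fields['total']}\n\n"
        "Please verify these are correct and extract the vendor name, "
        "billing address, line items, tax amount, and payment terms.\n"
        "Return in JSON format.\n\n"
        f"Invoice text:\n{text}"
    )

def validate_reply(reply):
    """Step 5: parse the model's JSON and re-check the total's format."""
    data = json.loads(reply)
    total = str(data.get("total", ""))
    if not re.fullmatch(r"\$?[\d,]+\.\d{2}", total):
        raise ValueError(f"unexpected total format: {total!r}")
    return data

def process_invoice(text, call_llm):
    # call_llm is a hypothetical callable: prompt string in, JSON string out.
    fields = pre_extract(text)
    reply = call_llm(build_prompt(text, fields))
    return validate_reply(reply)
```

Passing call_llm in as a parameter keeps the sketch independent of any particular AI platform and makes it easy to test with a fake reply.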

Advanced Extraction Patterns

Line Items

Extract individual line items from invoices:

Pattern (use multiline mode so ^ and $ anchor to each line): ^(.+?)\s+(\d+)\s+\$?([\d,]+\.\d{2})\s+\$?([\d,]+\.\d{2})$

Matches:
Office Supplies    5    $10.00    $50.00
Consulting Hours  10   $150.00  $1,500.00

Groups:
1. Description
2. Quantity
3. Unit price
4. Line total
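
In Python, the line-item pattern needs the re.MULTILINE flag so that ^ and $ match at the start and end of each line. The sketch below turns each match into a dictionary; the key names and the function name parse_line_items are chosen for the example.

```python
import re
from decimal import Decimal

LINE_ITEM = re.compile(
    r"^(.+?)\s+(\d+)\s+\$?([\d,]+\.\d{2})\s+\$?([\d,]+\.\d{2})$",
    re.MULTILINE,
)

def parse_line_items(text):
    """Return one dict per line item found in the invoice text."""
    items = []
    for description, quantity, unit_price, line_total in LINE_ITEM.findall(text):
        items.append({
            "description": description.strip(),
            "quantity": int(quantity),
            "unit_price": Decimal(unit_price.replace(",", "")),
            "line_total": Decimal(line_total.replace(",", "")),
        })
    return items
```

With the numbers parsed, checking that quantity times unit price equals the line total is a quick way to catch OCR mistakes.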

Tax Amounts

Find sales tax or VAT:

Pattern: (?i)((?:sales?\s+)?tax|vat):?\s*\$?([\d,]+\.\d{2})

Matches:
Sales Tax: $45.67
VAT $123.45
Tax $5.00
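
A short Python sketch of tax extraction follows. The pattern used here makes the "sales" prefix optional, so a bare "Tax" label is caught along with "Sales Tax" and "VAT". The function name find_tax_lines is hypothetical.

```python
import re
from decimal import Decimal

# "sales" is optional, so "Tax $5.00" matches as well as "Sales Tax: $45.67".
TAX_PATTERN = re.compile(
    r"((?:sales?\s+)?tax|vat):?\s*\$?([\d,]+\.\d{2})",
    re.IGNORECASE,
)

def find_tax_lines(text):
    """Return (label, amount) pairs for each tax line found."""
    return [
        (label, Decimal(amount.replace(",", "")))
        for label, amount in TAX_PATTERN.findall(text)
    ]
```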

Payment Terms

Pattern: (?i)(net|due in)\s+(\d+)(?:\s+(days?|months?))?

Matches:
Net 30 days
Due in 15 days
Net 60
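
The sketch below parses payment terms into a count and a unit. The unit is treated as optional so that "Net 60" matches, and a missing unit is assumed to mean days, which is the usual reading but still an assumption. The function name parse_payment_terms is hypothetical.

```python
import re

TERMS_PATTERN = re.compile(
    r"\b(net|due in)\s+(\d+)(?:\s+(days?|months?))?",
    re.IGNORECASE,
)

def parse_payment_terms(text):
    """Return (count, unit) such as (30, "days"), or None if no terms are found."""
    match = TERMS_PATTERN.search(text)
    if match is None:
        return None
    # Assumption: a bare number such as "Net 60" means days.
    unit = (match.group(3) or "days").lower()
    if not unit.endswith("s"):
        unit += "s"  # normalize "day" and "month" to plural
    return int(match.group(2)), unit
```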

Real-World Success Story

Case Study: Construction Company

Challenge: 200 vendor invoices monthly, each requiring manual entry
Time: 25 hours per month

Solution: Regex + AI extraction system
Results:
• 95% of fields auto-extracted
• Time reduced to 2 hours (92% savings)
• Error rate dropped from 5% to 0.5%
• ROI: $2,000+ monthly in saved labor

Best Practices

1. Vendor-Specific Patterns

Create custom patterns for your top 20 vendors, since they likely represent 80% of invoice volume.
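
One simple way to organize vendor-specific patterns is a lookup table with a generic fallback. The vendor names and the formats assigned to them below are invented for illustration.

```python
import re

# Hypothetical vendors and formats, for illustration only.
VENDOR_PATTERNS = {
    "ACME SUPPLY": re.compile(r"\bINV-\d{5}\b"),
    "NORTHWIND": re.compile(r"\bSI-[A-Za-z]{3}-\d{4}-\d+\b"),
}
# Used when the vendor has no custom pattern.
GENERIC_PATTERN = re.compile(r"#\d{10}\b")

def invoice_number_for(vendor, text):
    """Find the invoice number using the vendor's own pattern when one exists."""
    pattern = VENDOR_PATTERNS.get(vendor.upper(), GENERIC_PATTERN)
    match = pattern.search(text)
    return match.group(0) if match else None
```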

2. Validation Regex

After extraction, validate:

  • Amount format is correct
  • Date is not in the future
  • Invoice number is unique
  • Subtotal + tax = total (AI can calculate, regex can validate format)
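
These four checks might look like this in Python. The invoice is assumed to be a dictionary with the keys number, date (a datetime.date), subtotal, tax, and total (amounts as strings without a currency symbol). That shape and the function name are choices made for this sketch, not a standard.

```python
import re
from datetime import date
from decimal import Decimal

# Accepts 1234.56 or 1,234.56 but rejects misplaced commas such as 12,34.56.
AMOUNT_FORMAT = re.compile(r"\d{1,3}(?:,\d{3})*\.\d{2}|\d+\.\d{2}")

def to_decimal(amount):
    return Decimal(amount.replace(",", ""))

def validate_invoice(invoice, seen_numbers, today=None):
    """Return a list of problems; an empty list means the invoice passed."""
    today = today or date.today()
    problems = []
    amounts_ok = True
    for field in ("subtotal", "tax", "total"):
        if not AMOUNT_FORMAT.fullmatch(invoice[field]):
            problems.append(f"{field} has an unexpected format")
            amounts_ok = False
    if invoice["date"] > today:
        problems.append("date is in the future")
    if invoice["number"] in seen_numbers:
        problems.append("duplicate invoice number")
    # Only do the arithmetic when all three amounts parsed cleanly.
    if amounts_ok:
        expected = to_decimal(invoice["subtotal"]) + to_decimal(invoice["tax"])
        if expected != to_decimal(invoice["total"]):
            problems.append("subtotal + tax does not equal total")
    return problems
```

Returning a list of problems instead of stopping at the first one lets a reviewer see everything wrong with an invoice in one pass.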

3. Confidence Scoring

Have the AI rate its confidence in each extracted field. Flag any field rated below 90% for manual review.
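
As a sketch, suppose the AI returns each field with a confidence between 0 and 1. That response shape is an assumption; check what your AI platform actually returns. Fields with no confidence value are treated as needing review.

```python
REVIEW_THRESHOLD = 0.90

def fields_needing_review(extraction, threshold=REVIEW_THRESHOLD):
    """Return the names of fields whose confidence is below the threshold.

    extraction maps field name -> {"value": ..., "confidence": float}
    (a hypothetical shape used for this example).
    """
    return sorted(
        name
        for name, field in extraction.items()
        # A missing confidence counts as 0.0, so the field gets flagged.
        if field.get("confidence", 0.0) < threshold
    )
```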

Common Pitfalls to Avoid

  • ❌ Making patterns too specific (won't catch variations)
  • ❌ Making patterns too broad (false positives)
  • ❌ Not testing on sample invoices first
  • ❌ Forgetting case-insensitive matching
  • ✅ Start with vendor-specific patterns
  • ✅ Use AI to suggest patterns based on examples
  • ✅ Validate extracted data before importing






© 2025 by Joseph Stacy. All rights reserved.