Why Data Cleaning Comes First
The old programming adage "garbage in, garbage out" applies doubly to AI-assisted bookkeeping. AI models work best with clean, structured data. Feed them messy, inconsistent data, and you'll get unreliable results—no matter how advanced the AI.
Regular expressions are the bookkeeper's power tool for data cleaning. Before sending your financial data to an LLM for analysis, use regex to remove noise, standardize formats, and ensure quality.
Common Data Cleaning Tasks
1. Remove Extra Whitespace
Bank exports often have irregular spacing:
Before: "VENDOR   NAME     $500.00"
Pattern: \s+
Replace: " " (single space)
After: "VENDOR NAME $500.00"
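As a rough sketch, the same substitution in Python looks like this. The function name `collapse_whitespace` is made up for illustration, and the extra `.strip()` (which removes leading and trailing spaces) goes slightly beyond the pattern shown above.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space, then trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

print(collapse_whitespace("VENDOR   NAME     $500.00"))  # VENDOR NAME $500.00
```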
2. Standardize Currency Symbols
Before: "USD 100.00", "US$ 100.00", "100.00 USD"
Pattern (prefix forms): ^(?:USD\s*\$?|US\$)\s*
Replace: "$"
Pattern (suffix form): ^([\d,]+\.\d{2})\s*USD$
Replace: "$" followed by group 1
After: "$100.00"
Makes all amounts consistent for AI processing
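One possible way to handle both the prefix forms ("USD 100.00", "US$ 100.00") and the suffix form ("100.00 USD") in Python is to run two passes. This is only a sketch: the names `PREFIX`, `SUFFIX`, and `standardize_usd` are illustrative, and it assumes each input string holds a single amount.

```python
import re

# "USD", "USD$", or "US$" in front of the number, with optional spaces.
PREFIX = re.compile(r"^\s*(?:USD\s*\$?|US\$)\s*(?=\d)")
# "USD" trailing the number, e.g. "100.00 USD".
SUFFIX = re.compile(r"^\s*([\d,]+(?:\.\d{2})?)\s*USD\s*$")

def standardize_usd(amount: str) -> str:
    """Rewrite common USD notations as a leading "$" with no space."""
    amount = SUFFIX.sub(r"$\1", amount)   # "100.00 USD" -> "$100.00"
    amount = PREFIX.sub("$", amount)      # "USD 100.00" -> "$100.00"
    return amount.strip()

for raw in ["USD 100.00", "US$ 100.00", "100.00 USD"]:
    print(standardize_usd(raw))  # $100.00 each time
```

In Python's `re.sub`, a `$` in the replacement string is literal; other tools may require escaping it.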
3. Remove Non-Printable Characters
Pattern: [^\x20-\x7E]+
Replace: ""
Removes: Control characters (tabs, line breaks) and every non-ASCII character, including accented letters and symbols such as €
Leaves: Only printable ASCII characters
Critical for clean CSV imports
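Because this pattern deletes tabs and line breaks outright, neighboring words can merge. A hedged Python sketch that first turns those characters into spaces and then applies the pattern might look like this (the helper name `strip_non_printable` is illustrative, and the whitespace-to-space step is an addition to the pattern above):

```python
import re

NON_PRINTABLE = re.compile(r"[^\x20-\x7E]+")

def strip_non_printable(text: str) -> str:
    """Keep only printable ASCII; tabs and line breaks become spaces first."""
    # Convert tabs/line breaks to a space so "ACME\tCORP" does not become "ACMECORP".
    text = re.sub(r"[\t\r\n]+", " ", text)
    return NON_PRINTABLE.sub("", text)

print(strip_non_printable("ACME\tCORP\x00 $5.00"))  # ACME CORP $5.00
```

Note that accented letters are dropped too, so a vendor like "Café" becomes "Caf". If that matters for your data, transliterate before stripping.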
4. Normalize Account Numbers
Before: "Account #1234", "ACCT 1234", "Acct# 1234"
Pattern: (?i)acc(?:oun)?t\.?\s*#?\s*(\d+)
Extract: Group 1 (just the number)
After: "1234"
Standardized for matching and lookups
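A minimal Python sketch of this extraction follows. It uses a pattern that accepts both the abbreviated "Acct" and the spelled-out "Account"; the function name `extract_account_number` is hypothetical.

```python
import re
from typing import Optional

ACCOUNT = re.compile(r"acc(?:oun)?t\.?\s*#?\s*(\d+)", re.IGNORECASE)

def extract_account_number(text: str) -> Optional[str]:
    """Return the digits following an account label, or None if absent."""
    match = ACCOUNT.search(text)
    return match.group(1) if match else None

for raw in ["Account #1234", "ACCT 1234", "Acct# 1234"]:
    print(extract_account_number(raw))  # 1234 each time
```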
Pre-Processing for AI Analysis
Clean Transaction Descriptions
Bank transaction descriptions contain clutter that confuses AI:
Example:
Raw: "SQ *COFFEE SHOP 123 MAIN ST CA SN:AB12CD34 CARD 1234"
Cleaning patterns:
1. Remove card info: CARD\s+\d{4} → ""
2. Remove serial: SN:[A-Z0-9]+ → ""
3. Extract vendor: SQ \*(.+?)(?:\s+\d+|$) → "COFFEE SHOP"
Clean: "COFFEE SHOP"
Now AI can accurately categorize without noise!
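The three cleaning patterns above can be chained in Python roughly as follows. This is a sketch, not a general-purpose parser: `clean_description` is an illustrative name, and falling back to the whole cleaned string when there is no "SQ *" prefix is an assumption.

```python
import re

def clean_description(raw: str) -> str:
    """Strip card and serial noise, then pull the vendor out of a Square line."""
    text = re.sub(r"CARD\s+\d{4}", "", raw)     # 1. remove card info
    text = re.sub(r"SN:[A-Z0-9]+", "", text)    # 2. remove serial
    match = re.search(r"SQ \*(.+?)(?:\s+\d+|$)", text)  # 3. extract vendor
    vendor = match.group(1) if match else text  # assumed fallback for non-Square rows
    return re.sub(r"\s+", " ", vendor).strip()

raw = "SQ *COFFEE SHOP 123 MAIN ST CA SN:AB12CD34 CARD 1234"
print(clean_description(raw))  # COFFEE SHOP
```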
Standardizing Field Formats
Phone Numbers
Before: "(760) 249-7680", "760-249-7680", "7602497680"
Pattern: \(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})
Replace: "$1-$2-$3"
After: "760-249-7680"
Consistent format for AI to process contact info
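In Python, the same replacement could be sketched like this. Python writes group references as `\1` rather than `$1`, and `normalize_phone` is an illustrative name.

```python
import re

PHONE = re.compile(r"\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})")

def normalize_phone(text: str) -> str:
    """Rewrite 10-digit US phone numbers as NNN-NNN-NNNN."""
    return PHONE.sub(r"\1-\2-\3", text)

for raw in ["(760) 249-7680", "760-249-7680", "7602497680"]:
    print(normalize_phone(raw))  # 760-249-7680 each time
```

The pattern has no boundaries, so it can also match inside longer digit strings. Apply it only to fields you know contain phone numbers.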
Email Addresses
Extract valid emails:
Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
Validate: The pattern checks format only; use AI to flag likely domain typos (e.g., "gmial.com")
Clean: Convert to lowercase for consistency
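Extraction and lowercasing can be combined in a few lines of Python. The sketch below only checks the shape of an address, not whether it exists, and `extract_emails` is an illustrative name.

```python
import re
from typing import List

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def extract_emails(text: str) -> List[str]:
    """Return every email-shaped substring, lowercased for consistency."""
    return [found.lower() for found in EMAIL.findall(text)]

print(extract_emails("Contact Billing@Example.COM or ap@vendor.co.uk."))
# ['billing@example.com', 'ap@vendor.co.uk']
```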
EINs (Employer Identification Numbers)
Before: "12-3456789", "123456789", "12 3456789"
Pattern: (\d{2})[-\s]?(\d{7})
Replace: "$1-$2"
After: "12-3456789"
Properly formatted for IRS forms
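Here is a hedged Python version of the EIN replacement. The `\b` word boundaries are an addition not present in the pattern above; they keep the pattern from matching inside longer numbers. `format_ein` is an illustrative name.

```python
import re

# \b on both sides so a 10-digit number is not mistaken for an EIN.
EIN = re.compile(r"\b(\d{2})[-\s]?(\d{7})\b")

def format_ein(text: str) -> str:
    """Rewrite 9-digit EINs as NN-NNNNNNN."""
    return EIN.sub(r"\1-\2", text)

for raw in ["12-3456789", "123456789", "12 3456789"]:
    print(format_ein(raw))  # 12-3456789 each time
```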
Removing Duplicates
Duplicate Transaction Detection
Use regex to build a comparison key for each transaction:
Create key from:
- Date: \d{2}/\d{2}/\d{4} (then convert to YYYY-MM-DD)
- Vendor: ^[A-Z\s]+
- Amount: \$[\d,]+\.\d{2} (then strip "$" and commas)
Combine: "2025-11-15|AMAZON|1234.56"
AI can then: "Find all transactions with identical keys.
These are likely duplicates. Flag for review."
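The key (or "hash") described above can also be built and grouped locally before anything goes to an LLM. The following is a sketch under stated assumptions: dates are US-style MM/DD/YYYY, each row is a (date, vendor, amount) tuple, and the names `dedup_key` and `find_duplicates` are invented for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

def dedup_key(date: str, vendor: str, amount: str) -> str:
    """Build a normalized "date|vendor|amount" key for one transaction."""
    iso_date = datetime.strptime(date, "%m/%d/%Y").strftime("%Y-%m-%d")  # assumes MM/DD/YYYY
    vendor_key = re.sub(r"\s+", " ", vendor).strip().upper()
    amount_key = re.sub(r"[$,]", "", amount)
    return f"{iso_date}|{vendor_key}|{amount_key}"

def find_duplicates(rows):
    """Map each key that occurs more than once to the row indexes sharing it."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[dedup_key(*row)].append(index)
    return {key: indexes for key, indexes in groups.items() if len(indexes) > 1}

rows = [
    ("11/15/2025", "AMAZON", "$1,234.56"),
    ("11/15/2025", "amazon ", "$1234.56"),
    ("11/16/2025", "AMAZON", "$1,234.56"),
]
print(find_duplicates(rows))  # {'2025-11-15|AMAZON|1234.56': [0, 1]}
```

Matching keys are candidates only. Two genuine purchases from the same vendor for the same amount on the same day produce the same key, so review before deleting.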
Special Characters and Encoding
Remove Problem Characters
// Smart quotes to straight quotes
Pattern: [“”]
Replace: "
// Em dash to hyphen
Pattern: —
Replace: -
// Degree symbol to word
Pattern: °
Replace: deg
// Non-breaking space to regular space
Pattern: \u00A0
Replace: " "
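These four substitutions are all single-character swaps, so in Python a translation table may be simpler than separate regex passes. This is a sketch; `REPLACEMENTS` and `normalize_special` are illustrative names.

```python
# Each key is one problem character; each value is its plain-ASCII stand-in.
REPLACEMENTS = {
    "\u201C": '"',   # left smart quote
    "\u201D": '"',   # right smart quote
    "\u2014": "-",   # em dash
    "\u00B0": "deg", # degree symbol
    "\u00A0": " ",   # non-breaking space
}
TABLE = str.maketrans(REPLACEMENTS)

def normalize_special(text: str) -> str:
    """Swap smart quotes, em dashes, degree signs, and NBSPs for ASCII."""
    return text.translate(TABLE)

print(normalize_special("\u201CRate\u201D\u00A0\u2014 72\u00B0"))  # "Rate" - 72deg
```

Run this step before any pass that strips non-ASCII characters; otherwise these characters are deleted rather than converted.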
Google Sheets Cleaning Functions
Comprehensive Cleaning Formula
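=TRIM(
  REGEXREPLACE(
    REGEXREPLACE(
      REGEXREPLACE(A2, "\s+", " "),
      "[^\x20-\x7E]", ""
    ),
    "^\s+|\s+$", ""
  )
)
Chains three regex operations, innermost first:
1. Collapse each whitespace run (including tabs and line breaks) to one space, so words do not merge in the next step
2. Remove any remaining characters outside printable ASCII
3. Trim whitespace from the edges
The outer TRIM then removes any repeated spaces left behind by step 2.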
AI-Guided Data Quality Checks
After regex cleaning, use AI to verify quality:
Quality Check Prompt:
"I cleaned this data using regex patterns. Validate quality:
1. All amounts match ^\$[\d,]+\.\d{2}$ ✓
2. All dates match ^\d{4}-\d{2}-\d{2}$ ✓
3. No duplicate whitespace ✓
Now check for:
- Logical inconsistencies
- Unlikely amounts (e.g., $0.00 transactions)
- Missing required fields
- Dates in wrong fiscal period"
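The three format claims in this prompt are cheap to verify locally before you send it, so the AI can spend its effort on the judgment calls. Below is a minimal sketch, assuming rows are (date, description, amount) tuples; `format_problems` is a hypothetical helper name.

```python
import re

AMOUNT = re.compile(r"^\$[\d,]+\.\d{2}$")
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
DOUBLE_SPACE = re.compile(r"\s{2,}")

def format_problems(rows):
    """Return (row_index, field) pairs for every value that breaks a format rule."""
    problems = []
    for index, (date, description, amount) in enumerate(rows):
        if not ISO_DATE.fullmatch(date):
            problems.append((index, "date"))
        if DOUBLE_SPACE.search(description):
            problems.append((index, "whitespace"))
        if not AMOUNT.fullmatch(amount):
            problems.append((index, "amount"))
    return problems

rows = [
    ("2025-11-15", "COFFEE SHOP", "$4.50"),
    ("11/15/2025", "COFFEE  SHOP", "4.5"),
]
print(format_problems(rows))  # [(1, 'date'), (1, 'whitespace'), (1, 'amount')]
```

An empty result means the checkmarks in the prompt are true statements rather than hopes.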
Real-World Cleaning Workflow
Excel/CSV Import Preparation
- Export from bank (often messy format)
- Regex cleaning:
  - Remove extra spaces: \s+ → " "
  - Standardize amounts: Add $ and .00 where missing
  - Fix dates: Convert all to YYYY-MM-DD
  - Clean vendor names: Remove transaction codes
- AI validation: "Check this cleaned data for any remaining issues"
- Import to QuickBooks with confidence
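The regex cleaning step of this workflow might be sketched in Python as below. Everything here is an assumption made for illustration: the row layout (date, description, amount), US-style MM/DD/YYYY input dates, non-negative amounts, and the function names. The vendor cleanup only removes the card and serial codes from the earlier example.

```python
import re
from datetime import datetime
from decimal import Decimal

def clean_amount(raw: str) -> str:
    """Add "$" and two decimal places where missing, e.g. "500" -> "$500.00"."""
    digits = re.sub(r"[$,\s]", "", raw)
    return f"${Decimal(digits):.2f}"

def clean_date(raw: str) -> str:
    """Convert MM/DD/YYYY (assumed) or YYYY-MM-DD input to YYYY-MM-DD."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {raw!r}")

def clean_vendor(raw: str) -> str:
    """Remove card/serial codes, then collapse extra spaces."""
    text = re.sub(r"CARD\s+\d{4}|SN:[A-Z0-9]+", "", raw)
    return re.sub(r"\s+", " ", text).strip()

def clean_row(date: str, description: str, amount: str):
    return (clean_date(date), clean_vendor(description), clean_amount(amount))

print(clean_row("11/15/2025", "  AMAZON   MKTPLACE CARD 1234 ", "1,234.5"))
# ('2025-11-15', 'AMAZON MKTPLACE', '$1234.50')
```

`Decimal` is used instead of `float` so that amounts are not altered by binary rounding.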
Best Practices
- Clean early: Don't wait until reconciliation time
- Document patterns: Save regex for reuse
- Test thoroughly: Run on historical data first
- Validate with AI: Double-check cleaning didn't corrupt data
- Keep originals: Always maintain raw data backup
Conclusion
Data cleaning is the unglamorous but essential foundation of AI-assisted bookkeeping. By using regex to systematically clean and standardize your financial data before AI analysis, you ensure accurate results, save debugging time, and build reliable automated workflows.
Remember: Clean data in = accurate insights out!