A Python tool I built to profile enterprise transaction data, detect fraud and anomalies, and let you ask questions about your data in plain English — without needing a data warehouse.
| Feature | Description |
|---|---|
| 📊 Data Profiling | Automatically reports rows, columns, types, missing values, duplicates, and date range from any CSV |
| 🚨 Anomaly / Fraud Detection | Flags suspicious transactions using 6 rule-based detectors — each with a plain-English reason |
| 💬 Natural-Language Q&A | Ask "which 5 accounts spent the most?" → the tool generates SQL → DuckDB runs it → returns a table |
- Amount outliers — transactions far above the statistical norm (mean + 3σ)
- Large round numbers — suspiciously clean amounts like $5,000 or $10,000
- Velocity bursts — same account with 3+ transactions within 1 hour
- Duplicate charges — same account + amount + day, charged twice
- High-risk countries — transactions from known fraud-prone regions
- Off-hours activity — transactions between 11 PM and 5 AM
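
The six rules above can be sketched as boolean masks over a Pandas DataFrame. This is an illustrative version under stated assumptions, not the contents of `detective/detect_anomalies.py`: the `detect` function, the column names, the "multiple of 1,000" definition of a round number, and the placeholder country code `XX` are all hypothetical.

```python
import pandas as pd

HIGH_RISK = {"XX"}  # placeholder code; the tool's real country list is not shown here


def detect(df: pd.DataFrame) -> pd.DataFrame:
    """Return df with a 'reasons' column; an empty string means no rule fired."""
    df = df.sort_values(["account_id", "timestamp"]).copy()
    amount, ts = df["amount"], df["timestamp"]
    per_account = df.groupby("account_id")["timestamp"]
    hour = pd.Timedelta(hours=1)

    checks = {
        "amount outlier": amount > amount.mean() + 3 * amount.std(),
        # Assumed definition: at least 1,000 and an exact multiple of 1,000.
        "large round number": (amount >= 1000) & (amount % 1000 == 0),
        # A row is in a burst if it is the first, middle, or last of three
        # same-account transactions that span at most one hour.
        "velocity burst": ((ts - per_account.shift(2)) <= hour)
        | ((per_account.shift(-2) - ts) <= hour)
        | ((per_account.shift(-1) - per_account.shift(1)) <= hour),
        "duplicate charge": df.assign(day=ts.dt.date).duplicated(
            ["account_id", "amount", "day"], keep=False
        ),
        "high-risk country": df["country"].isin(HIGH_RISK),
        "off-hours activity": (ts.dt.hour >= 23) | (ts.dt.hour < 5),
    }
    flags = pd.DataFrame(checks)
    # Join the names of every rule that fired into one plain-English reason string.
    df["reasons"] = flags.apply(
        lambda row: "; ".join(name for name, hit in row.items() if hit), axis=1
    )
    return df


rows = [
    # 20 ordinary transactions, one per account, at midday
    *[(f"A{i:02d}", 40 + 1.37 * i, f"2024-01-{i + 1:02d} 12:00", "US") for i in range(20)],
    ("B1", 9731.42, "2024-02-01 12:00", "US"),  # amount outlier
    ("B2", 5000.00, "2024-02-02 12:00", "US"),  # large round number
    ("B3", 20.10, "2024-02-03 12:00", "US"),    # velocity burst (3 within 1 hour)
    ("B3", 21.20, "2024-02-03 12:20", "US"),
    ("B3", 22.30, "2024-02-03 12:40", "US"),
    ("B4", 60.25, "2024-02-04 10:00", "US"),    # duplicate charge
    ("B4", 60.25, "2024-02-04 15:00", "US"),
    ("B5", 33.33, "2024-02-05 12:00", "XX"),    # placeholder high-risk country
    ("B6", 44.44, "2024-02-06 02:30", "US"),    # off-hours activity
]
df = pd.DataFrame(rows, columns=["account_id", "amount", "timestamp", "country"])
df["timestamp"] = pd.to_datetime(df["timestamp"])

result = detect(df)
flagged = result[result["reasons"] != ""]
print(flagged[["account_id", "amount", "timestamp", "reasons"]])
```

One caveat on the mean + 3σ rule: extreme values inflate both the mean and the standard deviation, so a very small dataset or several large outliers can hide one another. In this sample the 5,000 charge stays below the threshold and is caught only by the round-number rule.
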
Python · Pandas · DuckDB · OpenAI API · Streamlit · Statistical Anomaly Detection
| Layer | Technology |
|---|---|
| Data handling | Python + Pandas |
| In-process SQL | DuckDB (no warehouse needed) |
| Natural-language Q&A | OpenAI API |
| Web UI | Streamlit |
enterprise-data-detective/
├── data/
│ ├── generate_sample_data.py
│ └── transactions.csv
├── detective/
│ ├── profile_data.py
│ ├── detect_anomalies.py
│ └── ask_data.py
├── app.py
└── requirements.txt
pip install -r requirements.txt
python data/generate_sample_data.py
python detective/detect_anomalies.py data/transactions.csv
streamlit run app.py

For the natural-language Q&A feature, set your API key:

export OPENAI_API_KEY=your-key-here

I wanted to build something that works the way analysts actually think: profile the data first, find what's wrong, then answer questions about it in plain English instead of writing SQL every time.
📌 The sample data is synthetic: `data/generate_sample_data.py` generates it and deliberately injects suspicious transactions. It contains no real customer data.