4 changes: 3 additions & 1 deletion .github/workflows/tests.yml
@@ -41,4 +41,6 @@ jobs:
- name: Run proxyScraper
run: python3 proxyScraper.py -p http
- name: Run proxyChecker
run: python3 proxyChecker.py -t 20 -s google.com -l output.txt
run: python3 proxyChecker.py -t 20 -s google.com -l output.txt
- name: Run proxyGeolocation
run: python3 proxyGeolocation.py -i 8.8.8.8
52 changes: 51 additions & 1 deletion .gitignore
@@ -138,4 +138,54 @@ dmypy.json
# Cython debug symbols
cython_debug/

output.txt
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be added to the global gitignore or merged into this project gitignore
.idea/

# VS Code
.vscode/
*.code-workspace

# Project specific files
.github/copilot-instructions.md
output.txt
test_small.txt
test_local.py
*.txt
!requirements.txt
!dev_requirements.txt
!user_agents.txt
!README.txt

# Temporary files
*.tmp
*.temp
*.swp
*.swo
*~

# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Logs
*.log
logs/

# Backup files
*.bak
*.backup

# IDE files
*.sublime-project
*.sublime-workspace

# Poetry/PDM (modern Python package managers)
poetry.lock
.pdm.toml
225 changes: 189 additions & 36 deletions README.md
@@ -1,85 +1,238 @@
# Proxy Scraper and Checker
# Proxy Scraper & Checker

[![Tests](https://github.com/iw4p/proxy-scraper/actions/workflows/tests.yml/badge.svg)](https://github.com/iw4p/proxy-scraper/actions/workflows/tests.yml)
[![Downloads](https://static.pepy.tech/badge/proxyz)](https://pepy.tech/project/proxyz)

Scrape more than 1K HTTP - HTTPS - SOCKS4 - SOCKS5 proxies in less than 2 seconds.
**Fast, reliable proxy scraper that collects 30K+ HTTP/HTTPS/SOCKS proxies from 24+ sources in seconds.**

Scraping fresh public proxies from different sources:
✨ **Features:**
- ⚡ **Fast scraping** - All sources scraped concurrently
- 🛡️ **Smart filtering** - Automatically removes CDN/bad IPs (Cloudflare, etc.)
- 🌍 **Global coverage** - Proxies from Asia, Europe, Americas
- 🔧 **Easy to use** - Simple CLI interface
- ✅ **Quality checked** - Built-in proxy validation

- [sslproxies.org](http://sslproxies.org) (HTTP, HTTPS)
- [free-proxy-list.net](http://free-proxy-list.net) (HTTP, HTTPS)
- [us-proxy.org](http://us-proxy.org) (HTTP, HTTPS)
- [socks-proxy.net](http://socks-proxy.net) (Socks4, Socks5)
- [proxyscrape.com](https://proxyscrape.com) (HTTP, Socks4, Socks5)
- [proxy-list.download](https://www.proxy-list.download) (HTTP, HTTPS, Socks4, Socks5)
- [geonode.com](https://geonode.com) (HTTP, HTTPS, Socks4, Socks5)
## Installation & Setup

## Installation
### 📦 Option 1: Install from PyPI (Recommended)

You can install the package directly from PyPI using `pip`:

```bash
pip install proxyz
```

**Verify installation:**
```bash
proxy_scraper --help
proxy_checker --help
```

### 🔧 Option 2: Install from Source Code

Alternatively, you can install dependencies manually if you're working from the source code:

```bash
# Clone the repository
git clone https://github.com/iw4p/proxy-scraper.git
cd proxy-scraper

# Install dependencies
pip3 install -r requirements.txt

# Test the installation
python proxyScraper.py --help
python proxyChecker.py --help
```

## Usage
### 🐍 Python Requirements
- **Python 3.9+** (3.9, 3.10, 3.11, 3.12 supported)
- **Dependencies:** httpx, beautifulsoup4, pysocks

### Using the Command-Line Interface
## Quick Start Tutorial

Once installed via `pip`, you can use the command-line tools `proxy_scraper` and `proxy_checker` directly.
### Step 1: Scrape Proxies
```bash
# Get HTTP proxies (basic)
proxy_scraper -p http

# Get SOCKS5 proxies with detailed output
proxy_scraper -p socks5 -v

#### For Scraping Proxies:
# Save to custom file
proxy_scraper -p http -o my_proxies.txt -v
```

### Step 2: Check Proxy Quality
```bash
proxy_scraper -p http
# Test scraped proxies (basic)
proxy_checker -l output.txt -t 10

# Test against specific site with verbose output
proxy_checker -l output.txt -s https://google.com -v

# Use random user agents for testing
proxy_checker -l output.txt -r -v
```

### Step 3: Complete Workflow Example
```bash
# 1. Scrape HTTP proxies
proxy_scraper -p http -v -o fresh_proxies.txt

# 2. Check their quality
proxy_checker -l fresh_proxies.txt -t 15 -v

# 3. Result: output.txt contains only working proxies
```

- With `-p` or `--proxy`, you can choose your proxy type. Supported proxy types are: **HTTP - HTTPS - Socks (Both 4 and 5) - Socks4 - Socks5**.
- With `-o` or `--output`, specify the output file name where the proxies will be saved. (Default is **output.txt**).
- With `-v` or `--verbose`, increase output verbosity.
- With `-h` or `--help`, show the help message.
## Supported Proxy Types
- **HTTP** - Web traffic
- **HTTPS** - Secure web traffic
- **SOCKS4** - TCP connections
- **SOCKS5** - TCP + UDP connections

#### For Checking Proxies:
## Proxy Sources

We collect proxies from **24 sources**:

**🌐 Direct Websites (11 sources)**
- spys.me, free-proxy-list.net, proxyscrape.com, geonode.com
- sslproxies.org, us-proxy.org, socks-proxy.net
- proxy-list.download, proxyscan.io, proxyspace.pro
- freeproxy.lunaproxy.com

**📦 GitHub Repositories (13 sources)**
- proxifly/free-proxy-list, monosans/proxy-list, TheSpeedX/PROXY-List
- jetkai/proxy-list, roosterkid/openproxylist, mmpx12/proxy-list
- ShiftyTR/Proxy-List, clarketm/proxy-list, sunny9577/proxy-scraper
- zloi-user/hideip.me, almroot/proxylist, aslisk/proxyhttps
- proxy4parsing/proxy-list

## Advanced Usage

### CLI Options

**Scraping:**
```bash
proxy_checker -p http -t 20 -s https://google.com -l output.txt
proxy_scraper -p <type> [-o output.txt] [-v]

Options:
-p, --proxy Proxy type: http, https, socks, socks4, socks5
-o, --output Output file (default: output.txt)
-v, --verbose Show detailed statistics
```

- With `-t` or `--timeout`, set the timeout in seconds after which the proxy is considered dead. (Default is **20**).
- With `-p` or `--proxy`, check HTTPS, HTTP, SOCKS4, or SOCKS5 proxies. (Default is **HTTP**).
- With `-l` or `--list`, specify the path to your proxy list file. (Default is **output.txt**).
- With `-s` or `--site`, check proxies against a specific website like google.com. (Default is **https://google.com**).
- With `-r` or `--random_agent`, use a random user agent per proxy.
- With `-v` or `--verbose`, increase output verbosity.
- With `-h` or `--help`, show the help message.
**Checking:**
```bash
proxy_checker [-l input.txt] [-t timeout] [-s site] [-r] [-v]

Options:
-l, --list Input proxy file (default: output.txt)
-t, --timeout Timeout in seconds (default: 20)
-s, --site Test site (default: https://google.com)
-r, --random_agent Use random user agents
-v, --verbose Show detailed progress
```

### From Source Code
```bash
# Clone repository
git clone https://github.com/iw4p/proxy-scraper
cd proxy-scraper

# Install dependencies
pip install -r requirements.txt

# Run scraper
python proxyScraper.py -p http -v

### Running Directly from Source
# Check proxies
python proxyChecker.py -l output.txt -v
```

## Quality & Performance

If you prefer running the scripts directly from the source code, you can use the following commands:
- ✅ **Automatic filtering** - Removes bad IPs (Cloudflare, CDNs, private ranges)
- 📊 **Source statistics** - See which sources provide the best proxies
- ⚡ **Fast concurrent scraping** - All sources scraped simultaneously
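
The automatic filtering described above can be pictured with the standard-library `ipaddress` module. This is only a minimal sketch: the function names `is_bad_ip` and `filter_proxies` and the single CDN range are assumptions made for illustration, not the project's actual code or blocklist.

```python
import ipaddress

# Illustrative CDN range only (one of Cloudflare's published IPv4 blocks);
# the project's real blocklist is not shown in this diff.
CDN_NETWORKS = [ipaddress.ip_network("104.16.0.0/13")]


def is_bad_ip(ip_text: str) -> bool:
    """Return True for unparsable, non-public, or CDN-owned addresses."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return True  # not an IP address at all
    if ip.is_private or ip.is_loopback or ip.is_reserved or ip.is_multicast:
        return True
    return any(ip in net for net in CDN_NETWORKS)


def filter_proxies(proxies):
    """Keep only `ip:port` entries whose IP passes the checks above."""
    kept = []
    for entry in proxies:
        ip_text, _, port = entry.strip().partition(":")
        if port.isdigit() and not is_bad_ip(ip_text):
            kept.append(f"{ip_text}:{port}")
    return kept


if __name__ == "__main__":
    sample = ["8.8.8.8:80", "192.168.1.1:8080", "104.16.1.31:80", "garbage"]
    print(filter_proxies(sample))
```

Checking membership against `ip_network` objects keeps the blocklist readable as CIDR strings, so adding a range is a one-line change.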

#### For Scraping:

## Example Output
```bash
python3 proxyScraper.py -p http
Scraping proxies using 24 sources...
📊 Source Statistics:
--------------------------------------------------
ProxyScrapeScraper: 18769 valid, 16408 bad IPs filtered
PlainTextScraper: 13516 valid, 5515 bad IPs filtered
GitHubScraper: 1767 valid, 739 bad IPs filtered
...
Total filtered: 22177 bad IPs (CDN/etc), 1 invalid format
Found 30938 unique valid proxies
```
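
The output above reports a count per source and a de-duplicated total. One way such a run could be structured is sketched below with `asyncio.gather`. The `scrape_all` helper and the stand-in sources are hypothetical, and the real scrapers would issue HTTP requests instead of returning fixed lists.

```python
import asyncio


async def scrape_all(scrapers):
    """Run every scraper coroutine concurrently and merge the results."""
    results = await asyncio.gather(
        *(scraper() for scraper in scrapers.values()),
        return_exceptions=True,  # one failing source should not abort the rest
    )
    stats, unique = {}, set()
    for name, result in zip(scrapers, results):
        if isinstance(result, Exception):
            stats[name] = 0  # record the failed source with zero proxies
            continue
        stats[name] = len(result)
        unique.update(result)  # the set drops proxies seen in several sources
    return stats, unique


# Stand-in sources for demonstration (documentation-range addresses).
async def fake_source_a():
    await asyncio.sleep(0.01)
    return ["192.0.2.1:80", "192.0.2.2:8080"]


async def fake_source_b():
    await asyncio.sleep(0.01)
    return ["192.0.2.2:8080", "203.0.113.5:3128"]


async def broken_source():
    raise RuntimeError("source unavailable")


if __name__ == "__main__":
    sources = {"a": fake_source_a, "b": fake_source_b, "c": broken_source}
    stats, unique = asyncio.run(scrape_all(sources))
    for name, count in stats.items():
        print(f"{name}: {count} proxies")
    print(f"Found {len(unique)} unique proxies")
```

Passing `return_exceptions=True` lets a single unreachable source be counted as empty while the remaining sources still contribute.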

#### For Checking:
## 🌍 Proxy Geolocation & Analysis

The project includes a powerful geolocation tool to analyze proxy origins and track sources:

### Features
- **🔍 IP Geolocation** - Get country, city, ISP, and organization info
- **☁️ CDN Detection** - Automatically identifies Cloudflare and other CDNs
- **🏢 Datacenter Detection** - Flags hosting providers and datacenters
- **📊 Source Tracking** - Maps proxies back to their original sources
- **💾 JSON Export** - Save analysis results for further processing

### Usage Examples

**Analyze single IP:**
```bash
python proxyGeolocation.py -i 104.16.1.31
```

**Analyze proxy file:**
```bash
python proxyGeolocation.py -f output.txt -l 50
```

**Track proxy sources:**
```bash
python proxyGeolocation.py -f output.txt -s --limit 100
```

**Export to JSON:**
```bash
python proxyGeolocation.py -f output.txt -o analysis.json
```

### Sample Output
```bash
python3 proxyChecker.py -p http -t 20 -s https://google.com -l output.txt
🔍 Proxy Geolocation Analysis Results
==================================================

📊 Summary:
Total proxies analyzed: 50
Proxies with geolocation data: 45
Cloudflare proxies: 8
Datacenter proxies: 12

🌎 Countries:
United States (US): 15
Germany (DE): 8
Singapore (SG): 6
...

📋 Detailed Results:
────────────────────────────────────────────────────────────────
☁️ 104.16.1.31:80 - San Francisco, United States | Cloudflare Inc.
🌍 45.79.143.52:3128 - Tokyo, Japan | Linode LLC
🏢 159.203.61.169:3128 - New York, United States | DigitalOcean
```

## Good to Know

- Dead proxies will be removed, and only alive proxies will remain in the output file.
- This script is capable of scraping SOCKS proxies, but `proxyChecker` currently only checks HTTP(S) proxies.
- The proxy checker supports all proxy types: **HTTP, HTTPS, SOCKS4, and SOCKS5**.
- Use random user agents (`-r` flag) for better success rates when checking proxies.
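
As a rough illustration of the first point, the sketch below rewrites a proxy list so that only live entries remain. It assumes the file is rewritten in place, and both `keep_alive_proxies` and the `is_alive` callback are hypothetical names introduced here; the actual checker performs a network request per proxy.

```python
from pathlib import Path


def keep_alive_proxies(list_path, is_alive):
    """Rewrite `list_path` so it holds only proxies for which `is_alive` is true."""
    path = Path(list_path)
    # Read non-empty lines, ignoring surrounding whitespace.
    proxies = [line.strip() for line in path.read_text().splitlines() if line.strip()]
    alive = [proxy for proxy in proxies if is_alive(proxy)]
    # Overwrite the file with one proxy per line (empty file if none survive).
    path.write_text("\n".join(alive) + ("\n" if alive else ""))
    return alive
```

Taking the liveness check as a callback keeps the file handling testable without any network access.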

## Star History

16 changes: 8 additions & 8 deletions dev_requirements.txt
@@ -1,8 +1,8 @@
flake8==4.0.1
flake8-black==0.3.1
flake8-bugbear==22.4.25
flake8-builtins==1.5.3
flake8-commas==2.1.0
flake8-isort==4.1.1
flake8-polyfill==1.0.2
pep8-naming==0.12.1
flake8>=4.0.1,<8.0.0
flake8-black>=0.3.1,<1.0.0
flake8-bugbear>=22.4.25,<25.0.0
flake8-builtins>=1.5.3,<3.0.0
flake8-commas>=2.1.0,<5.0.0
flake8-isort>=4.1.1,<7.0.0
flake8-polyfill>=1.0.2,<2.0.0
pep8-naming>=0.12.1,<1.0.0