65 changes: 43 additions & 22 deletions .github/workflows/links.yml
@@ -67,7 +67,7 @@ jobs:

# Download initial sitemap and process
echo "Downloading sitemap..."
-SITEMAP=$(wget --compression=auto -qO- "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; }
+SITEMAP=$(curl --compressed -fsSL "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; }

Member


💡 MEDIUM: Switching the sitemap fetches from wget to plain curl removes retry behavior here, so a transient network/TLS/5xx failure now aborts the whole job before the parallel download step even starts. Please add retries to this fetch, and mirror the same change for the sub-sitemap download below.

Suggested change:

-SITEMAP=$(curl --compressed -fsSL "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; }
+SITEMAP=$(curl --compressed --retry 3 --retry-all-errors -fsSL "https://${{ matrix.website }}/sitemap.xml") || { echo "Failed to download sitemap"; exit 1; }
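
For readers less familiar with curl's retry flags, the behavior this suggestion asks for can be sketched in Python. This is an illustration only and not part of the workflow: `fetch_with_retries`, its parameters, and the injected `sleep` callable are hypothetical names, and the doubling delay is a simplification of curl's documented backoff (start at one second, double on each retry).

```python
import time
from typing import Callable


def fetch_with_retries(
    fetch: Callable[[], str],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(); on failure, retry up to `retries` more times with a doubling delay.

    Hypothetical helper: `fetch` stands in for the sitemap download and is
    expected to raise OSError (which urllib.error.URLError subclasses) on failure.
    """
    delay = base_delay
    for attempt in range(retries + 1):  # one initial attempt plus `retries` retries
        try:
            return fetch()
        except OSError as exc:
            if attempt == retries:
                raise  # out of retries: surface the last error to the caller
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
            sleep(delay)
            delay *= 2
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

As with `--retry 3`, the sketch allows up to three retries after the first failed attempt. Note also that `--retry-all-errors` was added in curl 7.71.0, so the runner's curl needs to be at least that version for the suggested command to work.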

echo "$SITEMAP" | parse_sitemap > urls.txt

# Process any subsitemaps if they exist
@@ -77,7 +77,7 @@ jobs:
grep -v 'sitemap' urls.txt > urls.tmp || true
while read -r submap; do
echo "Processing submap: $submap"
-SUBMAP_CONTENT=$(wget --compression=auto -qO- "$submap") || { echo "Failed to download submap: $submap"; continue; }
+SUBMAP_CONTENT=$(curl --compressed -fsSL "$submap") || { echo "Failed to download submap: $submap"; continue; }
echo "$SUBMAP_CONTENT" | parse_sitemap >> urls.tmp
done < subsitemaps.txt
mv urls.tmp urls.txt || true
@@ -90,26 +90,47 @@ jobs:
- name: Download Website
continue-on-error: true
run: |
-# Set higher wait seconds for discourse community to avoid 429 rate limit errors
-if [ "${{ matrix.website }}" = "community.ultralytics.com" ]; then
-  WAIT=1
-else
-  WAIT=0.001
-fi
-
-# Download all URLs
-wget \
-  --compression=auto \
-  --adjust-extension \
-  --reject "*.jpg*,*.jpeg*,*.png*,*.gif*,*.webp*,*.svg*,*.txt" \
-  --input-file=urls.txt \
-  --no-clobber \
-  --no-parent \
-  --wait=$WAIT \
-  --random-wait \
-  --tries=3 \
-  --no-verbose \
-  --force-directories
+# Download all URLs as decompressed local HTML while using Brotli/gzip over the wire.
+python - <<'PY'
+from pathlib import Path
+from urllib.parse import urlsplit
+
+reject_suffixes = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".txt")
+
+def quote(value: str) -> str:
+    return value.replace("\\", "\\\\").replace('"', '\\"')
+
+count = 0
+with open("urls.txt", encoding="utf-8") as urls, open("curl-downloads.txt", "w", encoding="utf-8") as config:
+    for url in (line.strip() for line in urls):
+        if not url:
+            continue
+        parsed = urlsplit(url)
+        path = f"{parsed.netloc}{parsed.path}"

Member


💡 MEDIUM: This path builder misclassifies bare-origin URLs like https://docs.ultralytics.com because path becomes just the hostname, and the '.' in Path(path).name check then treats the hostname dots as a file extension. That saves the root page as docs.ultralytics.com instead of docs.ultralytics.com/index.html, which changes the local base path and can throw off relative-link resolution for the homepage.

Suggested change:

-path = f"{parsed.netloc}{parsed.path}"
+path = f"{parsed.netloc}{parsed.path or '/'}"
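
To make the effect of this one-line change concrete, here is a minimal standalone sketch of the URL-to-local-path mapping with the `or '/'` fallback applied. The function name `local_output` is hypothetical, and `PurePosixPath` is swapped in for the workflow's `Path` so the result does not depend on the host operating system; the branching otherwise mirrors the diff.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def local_output(url: str) -> str:
    """Map a page URL to the local file path it would be saved under.

    Hypothetical helper mirroring the workflow's inline logic, with the
    suggested fallback so an empty URL path is treated as "/".
    """
    parsed = urlsplit(url)
    # urlsplit returns an empty path for a bare origin such as
    # https://docs.ultralytics.com, so fall back to "/" explicitly.
    path = f"{parsed.netloc}{parsed.path or '/'}"
    if path.endswith("/"):
        return f"{path}index.html"  # directory-style URL
    if "." in PurePosixPath(path).name:
        return path  # last segment already looks like a file name
    return f"{path}.html"  # extensionless page


if __name__ == "__main__":
    for sample in (
        "https://docs.ultralytics.com",
        "https://docs.ultralytics.com/",
        "https://docs.ultralytics.com/quickstart",
    ):
        print(sample, "->", local_output(sample))
```

Without the fallback, the first sample would hit the second branch, because the last path segment would be the hostname itself, which contains dots.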

+        if path.endswith("/"):
+            output = f"{path}index.html"
+        elif "." in Path(path).name:
+            output = path
+        else:
+            output = f"{path}.html"
+        if output.lower().endswith(reject_suffixes) or Path(output).exists():
+            continue
+        config.write(f'url = "{quote(url)}"\noutput = "{quote(output)}"\n')
+        count += 1
+print(f"Prepared {count} page downloads")
+PY
+
+curl --compressed \
+  --fail \
+  --silent \
+  --show-error \
+  --location \
+  --retry 3 \
+  --retry-all-errors \
+  --create-dirs \
+  --parallel \
+  --parallel-max 16 \
+  --config curl-downloads.txt || true
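
The generated `curl-downloads.txt` relies on curl's config-file syntax, where each `url` entry is paired with the `output` entry that follows it and quoted values use backslash escapes. The sketch below isolates that step so the file format is easy to inspect. `quote` is copied from the diff; `config_entry` is a hypothetical wrapper added here for illustration.

```python
def quote(value: str) -> str:
    """Escape backslashes and double quotes for a quoted curl config value."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def config_entry(url: str, output: str) -> str:
    """Return one url/output pair in the form the workflow writes to curl-downloads.txt.

    Hypothetical helper: the workflow writes this string inline rather than
    through a function.
    """
    return f'url = "{quote(url)}"\noutput = "{quote(output)}"\n'


if __name__ == "__main__":
    # Prints two lines: the url entry followed by its output entry.
    print(
        config_entry(
            "https://docs.ultralytics.com/quickstart",
            "docs.ultralytics.com/quickstart.html",
        ),
        end="",
    )
```

Backslashes are escaped before double quotes so that the backslash introduced for an escaped quote is not doubled a second time.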

- name: Check image sizes
if: github.event_name != 'workflow_dispatch' || inputs.check_images