
wiki_dumps: strip the 86 page from public dumps #547

Merged
nthmost merged 1 commit into master from nthmost/wiki-dump-exclude-86 on Jun 30, 2026

nthmost commented Jun 24, 2026

Member

Why

The daily public dump at https://dumps.noisebridge.net/ is still leaking the 86 page.

dumpBackup.php --public exports every publicly-readable page, and the 86 page has no read restriction on the wiki — so it lands in latest.xml.gz.

Worse, index.json actively invites bots and AI scrapers to ingest the dump ("Please use these dumps instead of hitting the live site").

What

  • files/dump_filter.py — streams the gzipped dump and drops any <page> whose title is an excluded base, a subpage (86/…), or a Talk page (Talk:86, Talk:86/…), then rewrites a clean gzip (namespaces preserved, no ns0: prefixes).
  • files/wiki_dump.sh — exports to .raw.gz, filters into the final file. If the filter fails the script aborts (set -e) rather than publishing an unfiltered dump, so latest.xml.gz keeps pointing at the last good dump.
  • tasks/main.yml — deploys the filter and passes EXCLUDE_TITLES into the cron job.
  • defaults/main.yml sets wiki_dumps_exclude_titles: ["86"], with a comment to keep it in sync with roles/mediawiki/files/robots.txt.
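
The title-matching rule described for files/dump_filter.py can be sketched roughly as follows. This is an illustration of the rule as stated above, not the code in the PR; the function name `is_excluded` and its parameters are hypothetical.

```python
def is_excluded(title, excluded_bases):
    """Return True if `title` is an excluded base, a subpage of one,
    or the Talk page of either.

    Hypothetical helper: the name and signature are assumptions made
    for illustration, not taken from dump_filter.py.
    """
    # Treat "Talk:86" and "Talk:86/..." like "86" and "86/...".
    if title.startswith("Talk:"):
        title = title[len("Talk:"):]
    # Match the exact base or "Base/..." only, so that a title such as
    # "868 HAYES" is not caught by a bare prefix test against "86".
    return any(
        title == base or title.startswith(base + "/")
        for base in excluded_bases
    )
```

Comparing against `base + "/"` rather than `base` alone is what avoids the false-prefix match on titles that merely begin with the same characters.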

Testing

Ran the filter against a synthetic export: it stripped 86, 86/2023, and Talk:86 while correctly keeping Main Page and 868 HAYES (no false-prefix match). Output XML stayed byte-clean.
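
A check along these lines could look like the sketch below. It is a simplified, in-memory stand-in (the real filter is described as streaming), and the namespace URI, `synthetic_export`, and `filter_dump` are assumptions made for the example rather than names from the PR.

```python
import gzip
import xml.etree.ElementTree as ET

# Assumed export namespace; a real dump declares its own versioned URI.
NS = "http://www.mediawiki.org/xml/export-0.11/"


def is_excluded(title, excluded_bases):
    # Hypothetical rule: base title, its subpages, and their Talk pages.
    if title.startswith("Talk:"):
        title = title[len("Talk:"):]
    return any(title == b or title.startswith(b + "/") for b in excluded_bases)


def synthetic_export(titles):
    """Build a tiny gzipped export containing one <page> per title."""
    pages = "".join(f"<page><title>{t}</title></page>" for t in titles)
    xml = f'<mediawiki xmlns="{NS}">{pages}</mediawiki>'
    return gzip.compress(xml.encode("utf-8"))


def filter_dump(raw_gz, excluded_bases):
    """Drop excluded pages and return a new gzipped export (in memory)."""
    # Register the export namespace as the default one so the serializer
    # emits <page> rather than <ns0:page>.
    ET.register_namespace("", NS)
    root = ET.fromstring(gzip.decompress(raw_gz))
    # findall() returns a list, so removing while looping over it is safe.
    for page in root.findall(f"{{{NS}}}page"):
        title = page.findtext(f"{{{NS}}}title", default="")
        if is_excluded(title, excluded_bases):
            root.remove(page)
    return gzip.compress(ET.tostring(root, encoding="utf-8", xml_declaration=True))
```

Calling `filter_dump(synthetic_export([...]), ["86"])` with the titles from the test above and decompressing the result allows the same kind of assertion: excluded titles absent, unrelated titles present, and no `ns0:` prefixes in the output.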

Notes

  • Takes effect on the next 2 AM dump after deploy. To scrub sooner, run /usr/local/sbin/wiki_dump by hand on the host after the playbook runs.

The public dump invites bots and AI scrapers to ingest the whole wiki
(see index.json's note_to_bots), but dumpBackup.php exports the 86 page
along with everything else. That undoes the robots.txt Disallow/Noindex
on /wiki/86 -- the dump becomes the larger exposure vector.

Add a post-export filter that drops excluded base titles plus their
subpages (Base/...) and Talk pages (Talk:Base, Talk:Base/...) before
publishing. Titles are configurable via wiki_dumps_exclude_titles
(default: 86), kept in sync with roles/mediawiki/files/robots.txt. The
EXCLUDE_TITLES env var is passed through the systemd service unit.

If the filter fails the dump aborts (set -e) rather than publishing an
unfiltered file, so latest.xml.gz keeps pointing at the last good dump.
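
The fail-closed behaviour described here is implemented in the shell script via set -e. As a rough Python analogue of the same idea (not the PR's code), the sketch below writes the filtered output to a temporary path and only replaces the published file once the filter has succeeded; `publish_filtered` and `filter_fn` are hypothetical names.

```python
import os


def publish_filtered(raw_path, final_path, filter_fn):
    """Run filter_fn(raw_path, tmp_path) and publish only on success.

    Hypothetical helper illustrating the fail-closed idea; filter_fn is
    any callable that reads raw_path and writes the filtered result.
    """
    tmp_path = final_path + ".tmp"
    try:
        filter_fn(raw_path, tmp_path)
    except Exception:
        # Fail closed: discard any partial output and leave the
        # previously published file untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # Swap the filtered file into place in a single rename.
    os.replace(tmp_path, final_path)
```

If the filter raises, the previously published file stays as it was, which mirrors the "last good dump" guarantee described above.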
nthmost force-pushed the nthmost/wiki-dump-exclude-86 branch from 56459f6 to 53be3ee on June 24, 2026 02:09

nthmost commented Jun 24, 2026

Member Author

@ElanHR @SuperQ I wasn't sure if there was an easier or more efficient way to do this. It's a little hacky.

SuperQ commented Jun 24, 2026

Collaborator

Hmm, yea, without having some wiki database attributes that can be used to filter, hacks may be required.

I took a quick look at the backup PHP code and don't see any obvious exclude-list features.

nthmost requested review from ElanHR, SuperQ and jetpham on June 24, 2026 08:00
nthmost merged commit 885f2a3 into master on Jun 30, 2026
1 check passed
nthmost deleted the nthmost/wiki-dump-exclude-86 branch on June 30, 2026 00:37
3 participants