forge/docs/disaster-recovery.md
Jason Hall 4dc1b58f2f initial commit
Signed-off-by: Jason Hall <imjasonh@gmail.com>
2026-05-07 20:02:59 -04:00

Disaster recovery

What to do when things go wrong, in rough order of severity.

Pre-requisite: verify backups are real

Verify backups before you need them. Run this monthly:

./scripts/test-restore.sh

This pulls the latest GCS backup, boots Forgejo against it in a throwaway local container, and probes the API. If it fails, fix backups before you have an actual incident.
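When a run fails, the probe step can be approximated by hand. The sketch below is not the contents of scripts/test-restore.sh. It assumes the throwaway container publishes Forgejo on a local port (3000 is a placeholder) and polls the REST API's version endpoint until it answers. The function name, retry count, and delay are illustrative.

```shell
# Sketch only: poll a Forgejo instance until its API answers, or give up.
# base_url, tries, and delay are assumptions, not values from test-restore.sh.
wait_for_api() {
  local base_url="$1" tries="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$tries" ]; do
    # curl -f turns HTTP errors (4xx/5xx) into a non-zero exit status.
    if curl -fsS "$base_url/api/v1/version" >/dev/null 2>&1; then
      echo "API answered after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "API did not answer after $tries attempts" >&2
  return 1
}
```

Example: wait_for_api http://localhost:3000 30 2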

VM is unreachable but the disk is fine

Symptoms: Forgejo doesn't load, gcloud compute ssh ... --tunnel-through-iap times out, but forgejo-data disk and forgejo-ip static IP both still exist.
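To confirm the last part of that diagnosis before touching Terraform, something like the following can help. It is a sketch that assumes forgejo-data and forgejo-ip are the actual GCP resource names and that the address is regional. The zone and region are whatever your terraform.tfvars uses, and the function name is illustrative.

```shell
# Sketch: check that the data disk and static IP still exist.
# Assumes resource names forgejo-data / forgejo-ip and a regional address.
resources_intact() {
  local zone="$1" region="$2" status=0
  if ! gcloud compute disks describe forgejo-data --zone="$zone" >/dev/null 2>&1; then
    # A failed describe can also mean missing credentials or permissions.
    echo "not found or not readable: disk forgejo-data"
    status=1
  fi
  if ! gcloud compute addresses describe forgejo-ip --region="$region" >/dev/null 2>&1; then
    echo "not found or not readable: address forgejo-ip"
    status=1
  fi
  return "$status"
}
```

Example: resources_intact us-central1-a us-central1 (placeholder zone and region). A non-zero exit means this is not the "disk is fine" case.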

Recovery:

cd terraform
terraform apply -replace=google_compute_instance.forgejo

The data disk is protected by prevent_destroy = true, so the replacement leaves it untouched and reattaches it to the new VM; cloud-init then re-bootstraps the stack against the existing data. The static IP is preserved, so DNS keeps working.
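Before running the apply, it can be worth checking that the plan touches only the VM. The helper below is a rough sketch: it scans the human-readable output of terraform plan for any planned action on the data disk. That text format is not a stable interface, so treat a clean result as a hint and still read the plan. The function name is illustrative.

```shell
# Sketch: succeed if `terraform plan -no-color` output (on stdin) mentions
# any planned action on the data disk resource.
disk_in_plan() {
  # Plan output introduces each affected resource with "# <address> ...".
  grep -Eq '^[[:space:]]*# google_compute_disk\.forgejo_data[[:space:]]'
}
```

Example: terraform plan -no-color -replace=google_compute_instance.forgejo | disk_in_plan && echo "plan touches the data disk, stop"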

Persistent disk is corrupted or accidentally deleted

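  1. (If still present and corrupt) remove prevent_destroy from google_compute_disk.forgejo_data, then run terraform apply -replace=google_compute_disk.forgejo_data to destroy and recreate it. Removing the lifecycle rule alone produces no changes. Re-add prevent_destroy immediately afterward.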
  2. SSH to the VM.
  3. sudo /var/lib/forgejo/restore.sh <latest-backup>.tar.gz — restores from GCS into the fresh disk.

Whole GCP project is lost

Worst case. The backup bucket lives inside the project and goes away with it, so this is recoverable only if you copied a backup out before the project was deleted.

  1. Before deleting the old project: copy the latest backup to durable storage you control.
    gsutil cp gs://OLD_PROJECT-forgejo-backups/forgejo-LATEST.tar.gz ~/Backups/
    
  2. Create a new GCP project, enable APIs.
  3. ./scripts/bootstrap-secrets.sh — this generates new SECRET_KEY and INTERNAL_TOKEN. If you saved the originals to a password manager, manually upload those instead so encrypted DB fields survive (see below).
  4. Update project_id in terraform.tfvars.
  5. terraform apply.
  6. Upload the saved tarball to the new bucket: gsutil cp ~/Backups/forgejo-LATEST.tar.gz gs://NEW_PROJECT-forgejo-backups/.
  7. SSH to the VM and run restore.sh.
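Steps 1 and 6 can be far apart in time, so it may help to record a checksum when the tarball is saved and verify it before uploading. This is a sketch, assuming GNU sha256sum is available. The function names and the .sha256 sidecar file are illustrative and not part of this repo.

```shell
# Sketch: record and later verify a SHA-256 checksum next to a saved backup.
save_checksum() {
  local file="$1"
  # Run inside the file's directory so the sidecar stores a relative name.
  ( cd "$(dirname "$file")" && sha256sum "$(basename "$file")" > "$(basename "$file").sha256" )
}

verify_checksum() {
  local file="$1"
  # Exit status is 0 only if the file still matches the recorded checksum.
  ( cd "$(dirname "$file")" && sha256sum -c "$(basename "$file").sha256" >/dev/null 2>&1 )
}
```

Example: run save_checksum ~/Backups/forgejo-LATEST.tar.gz after step 1, then verify_checksum ~/Backups/forgejo-LATEST.tar.gz before step 6.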

Preserving SECRET_KEY across projects

Forgejo uses SECRET_KEY to encrypt some DB fields (2FA tokens, OAuth tokens, mirror credentials). Rotating it leaves repos and accounts intact but breaks those features.

For bit-exact recovery, save the secrets to a password manager when you first create them:

gcloud secrets versions access latest --secret=forgejo-secret-key
gcloud secrets versions access latest --secret=forgejo-internal-token

To restore them in a new project, skip bootstrap-secrets.sh and create the secrets manually with the saved values:

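# printf '%s' writes the value with no trailing newline and, unlike echo,
# does not treat leading dashes or backslashes in the value specially.
printf '%s' "OLD_SECRET_KEY_VALUE" | gcloud secrets create forgejo-secret-key \
  --replication-policy=automatic --data-file=-
printf '%s' "OLD_INTERNAL_TOKEN_VALUE" | gcloud secrets create forgejo-internal-token \
  --replication-policy=automatic --data-file=-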
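After creating the secrets, one way to confirm the stored value matches what you saved is to compare them directly. This is a sketch. The function name is illustrative, and it reads the saved value from stdin so it does not end up on a command line.

```shell
# Sketch: compare the latest version of a secret with a saved value on stdin.
# Returns 0 on match, 1 on mismatch, 2 if the secret could not be read.
secret_matches() {
  local name="$1" saved current
  saved=$(cat)
  current=$(gcloud secrets versions access latest --secret="$name") || return 2
  # Command substitution strips trailing newlines from both values.
  [ "$current" = "$saved" ]
}
```

Example: secret_matches forgejo-secret-key < saved-secret-key.txt, where saved-secret-key.txt is a hypothetical export from your password manager.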

Backup itself is corrupt

This is what scripts/test-restore.sh exists to catch before an incident.

If the latest is corrupt, list older versions:

gsutil ls -l gs://YOUR_PROJECT-forgejo-backups/

Backups are kept 30 days (lifecycle rule in backups.tf). Within that window, fall back to an earlier nightly tarball.
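To pick a fallback, you can download the candidates and test each archive locally. The sketch below assumes the tarball names embed a sortable date, so reverse name order means newest first. That naming is an assumption, so check the actual names in the bucket. It only proves the archive is readable end to end, not that the Forgejo data inside is consistent.

```shell
# Sketch: print the newest tarball in a directory that tar can read in full.
# Assumes file names sort chronologically (e.g. a date in the name).
newest_intact_backup() {
  local dir="$1" f
  printf '%s\n' "$dir"/*.tar.gz | sort -r | while IFS= read -r f; do
    # Listing the archive forces gzip and tar to read the whole file.
    if tar -tzf "$f" >/dev/null 2>&1; then
      printf '%s\n' "$f"
      break
    fi
  done
}
```

Example: newest_intact_backup ~/Backups prints one path, or nothing if every candidate fails.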

If all backups in the bucket are corrupt: there is no recovery beyond what's still on the data disk. This is why monthly verification matters.

Domain / DNS lost

The static IP (google_compute_address.forgejo) is reserved separately from the VM and persists across VM replacements. You only lose it if you run terraform destroy or release it manually.

To re-point: set your registrar's A record (or Cloud DNS if manage_dns = true) to the value of terraform output static_ip.

Caddy will re-issue a Let's Encrypt cert automatically once DNS resolves and ports 80/443 are reachable. ACME state lives on the data disk (/mnt/disks/forgejo-data/caddy), so existing certs survive VM replacements within their validity period.
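To check whether DNS has caught up before waiting on Caddy, compare the record with the reserved address. This is a sketch that assumes dig is installed. The hostname is a placeholder and the function name is illustrative.

```shell
# Sketch: succeed if one of the host's A records equals the expected IP.
dns_points_at() {
  local host="$1" want="$2"
  # -F fixed string, -x whole-line match, -q quiet: exit status only.
  dig +short A "$host" | grep -Fxq "$want"
}
```

Example: dns_points_at git.example.com "$(terraform output -raw static_ip)" && echo "DNS is pointing at the static IP"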

Compromise / suspected intrusion

  1. Cut public network access immediately:
    gcloud compute firewall-rules update allow-https --disabled
    
    (Or terraform it: temporarily set source_ranges to your IP only.)
  2. SSH in via IAP, snapshot evidence: docker logs forgejo > /tmp/forensics.log, copy /mnt/disks/forgejo-data/forgejo aside.
  3. Rotate every secret: forgejo-secret-key, forgejo-internal-token, all Forgejo user passwords + PATs, your Google account password.
  4. Review gcloud logging read 'resource.type=gce_instance' for unexpected access.
  5. If unsure of the compromise vector, treat the disk as tainted: nuke the VM and restore from a backup taken before the suspected breach.
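For step 4, narrowing the log query to the suspected window keeps the output reviewable. The helper below is a sketch that only builds the filter string. The function name and the timestamp are illustrative, and the filter uses the Cloud Logging query syntax for timestamp comparisons.

```shell
# Sketch: build a Cloud Logging filter for GCE instance logs since a timestamp.
# The argument is an RFC 3339 timestamp, e.g. 2026-05-01T00:00:00Z.
gce_log_filter() {
  printf 'resource.type=gce_instance AND timestamp>="%s"' "$1"
}
```

Example: gcloud logging read "$(gce_log_filter 2026-05-01T00:00:00Z)" --limit=200 --format=json > /tmp/gce-logs.json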