Jason Hall 4dc1b58f2f initial commit

Signed-off-by: Jason Hall <imjasonh@gmail.com>

2026-05-07 20:02:59 -04:00

4.6 KiB

Disaster recovery

What to do when things go wrong, in rough order of severity.

Prerequisite: verify backups are real

Check your backups before you need them. Run this monthly:

./scripts/test-restore.sh

This pulls the latest GCS backup, boots Forgejo against it in a throwaway local container, and probes the API. If it fails, fix backups before you have an actual incident.
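For scheduled runs (cron, a systemd timer, a CI job), it can help to wrap the script so each run leaves a one-line log entry and a clear exit status. A minimal sketch, assuming test-restore.sh exits non-zero when the restore test fails; the function name `restore_check` and the default log path are hypothetical.

```shell
# Hypothetical wrapper around scripts/test-restore.sh for scheduled runs.
# Assumption: the script exits non-zero when the restore test fails.
restore_check() {
  local script="${1:-./scripts/test-restore.sh}"
  local log="${2:-./restore-check.log}"
  local status
  if "$script"; then status=PASS; else status=FAIL; fi
  # Append one line per run: UTC timestamp, then PASS or FAIL.
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status" >> "$log"
  # Return success only when the restore test passed.
  [ "$status" = PASS ]
}
```

A scheduler entry can then call `restore_check` and alert on a non-zero exit.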

VM is unreachable but the disk is fine

Symptoms: Forgejo doesn't load and gcloud compute ssh ... --tunnel-through-iap times out, but the forgejo-data disk and the forgejo-ip static IP both still exist.

Recovery:

cd terraform
terraform apply -replace=google_compute_instance.forgejo

Only the VM is replaced. The data disk (protected by prevent_destroy = true) is left in place and reattached to the new instance, and cloud-init re-bootstraps the stack against the existing data. The static IP is preserved, so DNS keeps working.
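Before replacing the VM, it is worth confirming that the symptoms really match this scenario. The sketch below checks that the disk and the static IP still exist. The function name is hypothetical, and the zone and region arguments are placeholders for your deployment's values.

```shell
# Sketch: confirm the data disk and static IP exist before replacing the VM.
# Zone and region are passed in; the values in the usage note are examples only.
precheck_replace() {
  local zone="$1" region="$2"
  gcloud compute disks describe forgejo-data --zone="$zone" --format='value(name)' \
    || { echo "data disk not found: follow the disk recovery steps instead" >&2; return 1; }
  gcloud compute addresses describe forgejo-ip --region="$region" --format='value(address)' \
    || { echo "static IP not found: DNS will need re-pointing" >&2; return 1; }
}

# Usage (example zone and region):
# precheck_replace us-central1-a us-central1 && \
#   (cd terraform && terraform apply -replace=google_compute_instance.forgejo)
```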

Persistent disk is corrupted or accidentally deleted

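1. If the disk is still present but corrupt, remove prevent_destroy from google_compute_disk.forgejo_data, then run terraform apply -replace=google_compute_disk.forgejo_data to destroy and recreate it (a plain terraform apply does not recreate a resource whose configuration is unchanged). Re-add prevent_destroy immediately afterward. If the disk was deleted outright, terraform apply alone recreates it.
2. SSH to the VM.
3. Run sudo /var/lib/forgejo/restore.sh <latest-backup>.tar.gz to restore from GCS onto the fresh disk.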
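To find the latest backup for the restore step, one option is to sort the bucket listing by upload time. This sketch reads `gsutil ls -l` output, whose object lines have the form size, ISO-8601 timestamp, URL. The function name is hypothetical. Whether restore.sh expects the bare object name or the full gs:// URL is not shown here, so check the script before passing the result along.

```shell
# Sketch: pick the newest object from `gsutil ls -l` output on stdin.
# Only lines whose third field is a gs:// URL are kept, which skips the
# trailing "TOTAL:" summary line. ISO-8601 timestamps sort lexically.
latest_backup() {
  awk '$3 ~ /^gs:\/\// { print $2, $3 }' | sort | tail -n 1 | cut -d' ' -f2
}

# Usage (bucket name is a placeholder):
# gsutil ls -l gs://YOUR_PROJECT-forgejo-backups/ | latest_backup
```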

Whole GCP project is lost

Worst case, but recoverable from GCS-side backups if you copied them out before deleting the project.

1. Before deleting the old project, copy the latest backup to durable storage you control:
   ```
   gsutil cp gs://OLD_PROJECT-forgejo-backups/forgejo-LATEST.tar.gz ~/Backups/
   ```
2. Create a new GCP project and enable the required APIs.
3. Run ./scripts/bootstrap-secrets.sh, which generates a new SECRET_KEY and INTERNAL_TOKEN. If you saved the originals to a password manager, skip the script and upload those manually instead so encrypted DB fields survive (see below).
4. Update project_id in terraform.tfvars.
5. Run terraform apply.
6. Upload the saved tarball to the new bucket: gsutil cp ~/Backups/forgejo-LATEST.tar.gz gs://NEW_PROJECT-forgejo-backups/.
7. SSH to the VM and run restore.sh against the uploaded tarball.
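Because the copied tarball becomes the only surviving backup once the old project is deleted, it is worth checking that the local copy is readable first. A minimal sketch with a hypothetical function name: it confirms only that the archive decompresses and lists cleanly, not that Forgejo will boot from it (scripts/test-restore.sh covers that).

```shell
# Sketch: sanity-check a saved backup tarball before relying on it.
# Checks gzip integrity, then that tar can list the archive end to end.
verify_tarball() {
  local file="$1"
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  gzip -t "$file" 2>/dev/null || { echo "gzip check failed: $file" >&2; return 1; }
  tar -tzf "$file" >/dev/null 2>&1 || { echo "tar listing failed: $file" >&2; return 1; }
  echo "ok: $file"
}

# Usage:
# verify_tarball ~/Backups/forgejo-LATEST.tar.gz
```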

Preserving SECRET_KEY across projects

Forgejo uses SECRET_KEY to encrypt some DB fields (2FA tokens, OAuth tokens, mirror credentials). Rotating it leaves repos and accounts intact but breaks those features.

For bit-exact recovery, save the secrets to a password manager when you first create them:

gcloud secrets versions access latest --secret=forgejo-secret-key
gcloud secrets versions access latest --secret=forgejo-internal-token

To restore them in a new project, skip bootstrap-secrets.sh and create the secrets manually with the saved values:

# printf avoids echo's shell-dependent handling of -n and backslashes
printf '%s' "OLD_SECRET_KEY_VALUE" | gcloud secrets create forgejo-secret-key \
  --replication-policy=automatic --data-file=-
printf '%s' "OLD_INTERNAL_TOKEN_VALUE" | gcloud secrets create forgejo-internal-token \
  --replication-policy=automatic --data-file=-
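Typing secret values on the command line leaves them in shell history. One alternative is to read each value from a file. The sketch below does that and strips trailing newlines, since a stray newline would change the stored bytes and the key would no longer match the original. The function name and file paths are hypothetical.

```shell
# Sketch: create a secret from a file without putting the value on the
# command line. Command substitution drops trailing newlines (which editors
# often add), so the stored bytes match the original value.
create_secret_from_file() {
  local name="$1" file="$2"
  printf '%s' "$(cat "$file")" | gcloud secrets create "$name" \
    --replication-policy=automatic --data-file=-
}

# Usage (file paths are hypothetical; delete the files afterward):
# create_secret_from_file forgejo-secret-key ./secret-key.txt
# create_secret_from_file forgejo-internal-token ./internal-token.txt
```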

Backup itself is corrupt

This is what scripts/test-restore.sh exists to catch before an incident.

If the latest backup is corrupt, list the older ones:

gsutil ls -l gs://YOUR_PROJECT-forgejo-backups/

Backups are kept for 30 days (lifecycle rule in backups.tf). Within that window, fall back to an earlier nightly tarball.

If all backups in the bucket are corrupt, there is no recovery beyond what's still on the data disk. This is why monthly verification matters.

Domain / DNS lost

The static IP (google_compute_address.forgejo) is reserved separately from the VM and persists across VM replacements. You only lose it if you run terraform destroy or manually release it.

To re-point: set your registrar's A record (or Cloud DNS if manage_dns = true) to the value of terraform output static_ip.

Caddy will re-issue a Let's Encrypt cert automatically once DNS resolves and ports 80/443 are reachable. ACME state lives on the data disk (/mnt/disks/forgejo-data/caddy), so existing certs survive VM replacements within their validity period.
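Since certificate issuance depends on DNS resolving to the right address, a quick check after re-pointing can save a wait. A minimal sketch, assuming `dig` is installed; the function name and domain are placeholders.

```shell
# Sketch: check that a hostname's A record matches the reserved static IP.
# Assumes `dig` is installed. grep -Fx requires an exact whole-line match,
# so 203.0.113.1 does not match 203.0.113.10.
dns_points_at() {
  local domain="$1" expected_ip="$2"
  dig +short A "$domain" | grep -Fxq "$expected_ip"
}

# Usage, run from the terraform directory (domain is a placeholder):
# dns_points_at git.example.com "$(terraform output -raw static_ip)" \
#   && echo "DNS ok" || echo "DNS does not point at the static IP yet"
```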

Compromise / suspected intrusion

1. Cut public network access immediately:
   ```
   gcloud compute firewall-rules update allow-https --disabled
   ```
   (Or do it in Terraform: temporarily set source_ranges to your IP only.)
2. SSH in via IAP and preserve evidence: run docker logs forgejo > /tmp/forensics.log 2>&1 (the redirect captures stderr as well as stdout), then copy /mnt/disks/forgejo-data/forgejo aside.
3. Rotate every secret: forgejo-secret-key, forgejo-internal-token, all Forgejo user passwords and personal access tokens (PATs), and your Google account password.
4. Review the output of gcloud logging read 'resource.type=gce_instance' for unexpected access.
5. If you are unsure of the compromise vector, treat the disk as tainted: nuke the VM and restore from a backup taken before the suspected breach.
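Choosing a backup taken before the suspected breach can be done from the bucket listing. The sketch below filters `gsutil ls -l` output by a cutoff timestamp; the function name is hypothetical. The listed timestamp is the object's upload time, not the moment the data was captured, so leave some margin when picking the cutoff.

```shell
# Sketch: from `gsutil ls -l` output on stdin, print the newest object whose
# timestamp is strictly earlier than a cutoff. Timestamps are ISO-8601 UTC,
# so string comparison orders them correctly (the "" forces string compare).
backup_before() {
  local cutoff="$1"   # e.g. 2026-05-04T00:00:00Z
  awk -v cutoff="$cutoff" \
    '$3 ~ /^gs:\/\// && ($2 "") < (cutoff "") { print $2, $3 }' \
    | sort | tail -n 1 | cut -d' ' -f2
}

# Usage (bucket name and cutoff are placeholders):
# gsutil ls -l gs://YOUR_PROJECT-forgejo-backups/ | backup_before 2026-05-04T00:00:00Z
```

An empty result means no backup in the retention window predates the cutoff.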
