initial commit

Signed-off-by: Jason Hall <imjasonh@gmail.com>

Commit 4dc1b58f2f (20 changed files with 1398 additions and 0 deletions)

docs/disaster-recovery.md (new file, 101 lines)

# Disaster recovery

What to do when things go wrong, in rough order of severity.

## Pre-requisite: verify backups are real

Verify backups before you need them. Run monthly:

```bash
./scripts/test-restore.sh
```

This pulls the latest GCS backup, boots Forgejo against it in a throwaway local container, and probes the API. If it fails, fix backups before you have an actual incident.

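A calendar reminder works; equally, a workstation cron entry can run the check monthly. A sketch (the repo path is a placeholder for wherever this repo is checked out):

```bash
# Hypothetical crontab entry: run the restore test at 09:00 on the 1st of each month.
# /path/to/repo is a placeholder; substitute your local checkout.
0 9 1 * * cd /path/to/repo && ./scripts/test-restore.sh >> "$HOME/test-restore.log" 2>&1
```
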
## VM is unreachable but the disk is fine

Symptoms: Forgejo doesn't load, `gcloud compute ssh ... --tunnel-through-iap` times out, but the `forgejo-data` disk and `forgejo-ip` static IP both still exist.

Recovery:

```bash
cd terraform
terraform apply -replace=google_compute_instance.forgejo
```

The data disk has `prevent_destroy = true` and is reattached; cloud-init re-bootstraps the stack against the existing data. The static IP is preserved, so DNS keeps working.

## Persistent disk is corrupted or accidentally deleted

1. (If still present and corrupt) remove `prevent_destroy` from `google_compute_disk.forgejo_data`, then `terraform apply` to destroy and recreate. **Re-add `prevent_destroy` immediately afterward.**
2. SSH to the VM.
3. Copy `scripts/restore.sh` from the repo to the VM (see the runbook), then run `sudo bash /tmp/restore.sh <latest-backup>.tar.gz` to restore from GCS onto the fresh disk.

## Whole GCP project is lost

Worst case, but recoverable from GCS-side backups *if* you copied them out before deleting the project.

1. **Before deleting the old project**: copy the latest backup to durable storage you control.

   ```bash
   gsutil cp gs://OLD_PROJECT-forgejo-backups/forgejo-LATEST.tar.gz ~/Backups/
   ```

2. Create a new GCP project, enable APIs.
3. `./scripts/bootstrap-secrets.sh` — this generates *new* `SECRET_KEY` and `INTERNAL_TOKEN`. If you saved the originals to a password manager, manually upload those instead so encrypted DB fields survive (see below).
4. Update `project_id` in `terraform.tfvars`.
5. `terraform apply`.
6. Upload the saved tarball to the new bucket: `gsutil cp ~/Backups/forgejo-LATEST.tar.gz gs://NEW_PROJECT-forgejo-backups/`.
7. SSH to the VM and run `restore.sh`.

### Preserving SECRET_KEY across projects

Forgejo uses `SECRET_KEY` to encrypt some DB fields (2FA tokens, OAuth tokens, mirror credentials). Rotating it leaves repos and accounts intact but breaks those features.

For bit-exact recovery, save the secrets to a password manager when you first create them:

```bash
gcloud secrets versions access latest --secret=forgejo-secret-key
gcloud secrets versions access latest --secret=forgejo-internal-token
```

To restore them in a new project, *skip* `bootstrap-secrets.sh` and create the secrets manually with the saved values:

```bash
echo -n "OLD_SECRET_KEY_VALUE" | gcloud secrets create forgejo-secret-key \
  --replication-policy=automatic --data-file=-
echo -n "OLD_INTERNAL_TOKEN_VALUE" | gcloud secrets create forgejo-internal-token \
  --replication-policy=automatic --data-file=-
```

## Backup itself is corrupt

This is what `scripts/test-restore.sh` exists to catch *before* an incident.

If the latest backup is corrupt, list older versions:

```bash
gsutil ls -l gs://YOUR_PROJECT-forgejo-backups/
```

Backups are kept for 30 days (lifecycle rule in `backups.tf`). Within that window, fall back to an earlier nightly tarball.

If all backups in the bucket are corrupt, there is no recovery beyond what's still on the data disk. This is why monthly verification matters.

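Before falling back to an older tarball, a quick local integrity check can save a wasted restore attempt. A minimal sketch (the `verify_tarball` helper is hypothetical; the demo archive is throwaway):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: confirm a backup tarball is readable before trusting it.
verify_tarball() {
  local f="$1"
  gzip -t "$f" &&                        # CRC check catches truncation and bit rot
    [ "$(tar -tzf "$f" | wc -l)" -gt 0 ] # and the archive lists at least one entry
}

# Self-contained demo on a throwaway archive; in practice, point it at a
# tarball downloaded from gs://YOUR_PROJECT-forgejo-backups/.
tmp="$(mktemp -d)"
echo demo > "$tmp/app.ini"
tar -czf "$tmp/demo.tar.gz" -C "$tmp" app.ini
verify_tarball "$tmp/demo.tar.gz" && echo "tarball OK"
```
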
## Domain / DNS lost

The static IP (`google_compute_address.forgejo`) is reserved separately from the VM and persists across VM replacements. You only lose it if you `terraform destroy` or manually release it.

To re-point: set your registrar's A record (or Cloud DNS if `manage_dns = true`) to the value of `terraform output static_ip`.

Caddy will re-issue a Let's Encrypt cert automatically once DNS resolves and ports 80/443 are reachable. ACME state lives on the data disk (`/mnt/disks/forgejo-data/caddy`), so existing certs survive VM replacements within their validity period.

## Compromise / suspected intrusion

1. Cut public network access immediately:

   ```bash
   gcloud compute firewall-rules update allow-https --disabled
   ```

   (Or via Terraform: temporarily set `source_ranges` to your IP only.)

2. SSH in via IAP and snapshot evidence: `docker logs forgejo > /tmp/forensics.log`, and copy `/mnt/disks/forgejo-data/forgejo` aside.
3. Rotate every secret: `forgejo-secret-key`, `forgejo-internal-token`, all Forgejo user passwords and PATs, and your Google account password.
4. Review `gcloud logging read 'resource.type=gce_instance'` for unexpected access.
5. If unsure of the compromise vector, treat the disk as tainted: nuke the VM and restore from a backup taken *before* the suspected breach.

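The "restrict instead of disable" option from step 1 can be sketched like this (the rule name `allow-https` comes from step 1; the address is a documentation-range example you would replace with your own):

```bash
# Restrict HTTPS to a single admin address instead of disabling it outright.
# 203.0.113.7 is an example address; substitute your current public IP.
ADMIN_IP="203.0.113.7"
RANGE="${ADMIN_IP}/32"
echo "$RANGE"
# Then apply it (commented out here):
# gcloud compute firewall-rules update allow-https --source-ranges="$RANGE"
```
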
docs/runbook.md (new file, 110 lines)

# Runbook

Common operations against the running Forgejo VM.

## Admin SSH

Public port 22 is closed. Use IAP tunneling:

```bash
gcloud compute ssh forgejo --zone=us-east1-b --tunnel-through-iap
```

Your Google account needs:

- `roles/iap.tunnelResourceAccessor` on the instance (granted by Terraform via `var.admin_email`)
- `roles/compute.osLogin` on the project (same)
- 2FA on the Google account (manual, but strongly recommended — IAP is only as strong as your login)

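For day-to-day use, a small wrapper in your shell profile saves retyping the flags. The function name `fssh` is just a suggestion; the instance name and zone are this repo's defaults:

```bash
# Convenience wrapper around IAP SSH; extra args pass through (e.g. --command=...).
fssh() {
  gcloud compute ssh forgejo --zone=us-east1-b --tunnel-through-iap "$@"
}
```
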
## Inspect the stack

```bash
docker ps                                  # caddy, forgejo, watchtower expected
docker logs --tail 200 forgejo
docker logs --tail 200 caddy
docker logs --tail 200 watchtower
journalctl -u forgejo-stack.service -n 200
journalctl -u forgejo-backup.service -n 50
systemctl list-timers forgejo-backup.timer
```

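The "expected containers" check can be scripted. A minimal sketch with a hypothetical `check_stack` helper, demonstrated on sample input rather than a live `docker ps`:

```bash
#!/usr/bin/env bash
set -uo pipefail

# Hypothetical helper: read container names on stdin and report any of the
# three expected containers that are missing.
check_stack() {
  local running missing=0
  running="$(cat)"
  for c in caddy forgejo watchtower; do
    grep -qx "$c" <<<"$running" || { echo "MISSING: $c"; missing=1; }
  done
  return "$missing"
}

# On the VM: docker ps --format '{{.Names}}' | check_stack && echo "stack OK"
printf 'caddy\nforgejo\nwatchtower\n' | check_stack && echo "stack OK"
```
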
## Restart the stack

```bash
sudo systemctl restart forgejo-stack.service
```

To restart a single container only:

```bash
docker restart forgejo
```

## Update containers immediately

Watchtower pulls new images at 04:00 UTC by default. To force an update now:

```bash
# Signal watchtower from the host (its image has no shell, so `docker exec` won't work):
docker kill --signal=SIGHUP watchtower
# or, manually:
docker pull codeberg.org/forgejo/forgejo:11
sudo systemctl restart forgejo-stack.service
```

## Run a backup on demand

```bash
sudo /var/lib/google/forgejo/backup.sh
gsutil ls gs://YOUR_PROJECT-forgejo-backups/
```

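Backups appear to be named `forgejo-<UTC timestamp>.tar.gz` (the restore example elsewhere in this runbook uses `forgejo-20260507T033000Z.tar.gz`), so the object name for a backup taken right now can be predicted:

```bash
# Predict the object name for a backup taken now (UTC, second precision).
stamp="$(date -u +%Y%m%dT%H%M%SZ)"
name="forgejo-${stamp}.tar.gz"
echo "$name"
# Then confirm it landed: gsutil ls "gs://YOUR_PROJECT-forgejo-backups/${name}"
```
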
## Restore from a backup

`scripts/restore.sh` is in the repo, not on the VM. Copy it over and run:

```bash
gcloud compute scp scripts/restore.sh forgejo:/tmp/restore.sh \
  --zone=us-east1-b --tunnel-through-iap
gcloud compute ssh forgejo --zone=us-east1-b --tunnel-through-iap \
  --command='sudo bash /tmp/restore.sh forgejo-20260507T033000Z.tar.gz'
```

For a clean-environment dry run, use `scripts/test-restore.sh` from your workstation — it pulls the latest backup, boots Forgejo against it in a throwaway container, and probes the API.

## Forgejo major version upgrade

1. Read the [release notes](https://codeberg.org/forgejo/forgejo/releases) for breaking changes.
2. Take a manual backup (`sudo /var/lib/google/forgejo/backup.sh`).
3. Bump `forgejo_image` in `terraform.tfvars` (e.g. `codeberg.org/forgejo/forgejo:12`).
4. `terraform apply` — replaces the VM. The data disk persists; first boot runs DB migrations.
5. Watch `docker logs forgejo` to confirm migrations and startup.

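Step 5 can be narrowed to the interesting lines. A sketch with a hypothetical `filter_migrations` helper, demonstrated on sample log lines rather than live output:

```bash
# Hypothetical filter: surface migration progress and errors from startup logs.
filter_migrations() { grep -iE 'migrat|error'; }

# On the VM: docker logs -f forgejo 2>&1 | filter_migrations
printf 'ORM engine init\nMigration: add column\nServer listening on :3000\n' \
  | filter_migrations
```
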
## Resize the data disk

GCP supports online disk growth:

```bash
gcloud compute disks resize forgejo-data --zone=us-east1-b --size=40
```

Then, on the VM:

```bash
sudo resize2fs /dev/disk/by-id/google-forgejo-data
```

Update `size = 40` in `terraform/main.tf` afterward so the Terraform config matches reality.

## Rotate secrets

```bash
# Add a new version (the latest is read at boot):
openssl rand -hex 32 | gcloud secrets versions add forgejo-secret-key --data-file=-
sudo systemctl restart forgejo-stack.service
```

Rotating `SECRET_KEY` invalidates 2FA tokens and other encrypted DB fields. Read the Forgejo docs before rotating.

## Cost / billing watch

- Set a project budget alert at $10/month in Cloud Billing (manual; not in Terraform by design — the budget API requires the billing-account-admin role).
- Skim the billing report monthly. Egress is the most likely surprise.
