forge/docs/runbook.md
Jason Hall 15ea287728 add budget alert and nightly OS-update reboot
- $10/month project budget via google_billing_budget, alerts to admin_email
- forgejo-reboot.timer at 04:30 UTC applies staged COS updates
- relocate cloud-init scripts to /var/lib/google/forgejo (COS noexec on /var)
- runbook: updated zone, script paths, added "How updates work" section

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 20:35:58 -04:00

# Runbook
Common operations against the running Forgejo VM.
## Admin SSH
Public port 22 is closed. Use IAP tunneling:
```bash
gcloud compute ssh forgejo --zone=us-east1-b --tunnel-through-iap
```
Your Google account needs:
- `roles/iap.tunnelResourceAccessor` on the instance (granted by Terraform via `var.admin_email`)
- `roles/compute.osLogin` on the project (same)
- 2FA on the Google account (manual, but strongly recommended — IAP is only as strong as your login)
## Inspect the stack
```bash
docker ps # caddy, forgejo, watchtower expected
docker logs --tail 200 forgejo
docker logs --tail 200 caddy
docker logs --tail 200 watchtower
journalctl -u forgejo-stack.service -n 200
journalctl -u forgejo-backup.service -n 50
systemctl list-timers forgejo-backup.timer
```
## Restart the stack
```bash
sudo systemctl restart forgejo-stack.service
```
Single container only:
```bash
docker restart forgejo
```
## How updates work
| Layer | Mechanism | Schedule |
|---|---|---|
| Host OS (COS) | `cos-update-strategy=update_enabled` stages updates onto the inactive A/B partition; reboot applies them. | Applied on the nightly reboot below. |
| Forgejo & Caddy patch updates | Watchtower pulls new image digests for the pinned tags (`forgejo:11`, `caddy:2-alpine`). | 04:00 UTC daily (inside the watchtower container; cron `0 0 4 * * *`). |
| Forgejo major version (e.g. 11→12) | Bump `var.forgejo_image` in tfvars and `terraform apply` — VM is replaced, data disk persists, first boot runs DB migrations. | Manual / deliberate. |
| Watchtower itself | Runs `containrrr/watchtower` with no tag (so `latest`, not pinned); self-updates and removes old images via `--cleanup`. | 04:00 UTC daily. |
| Backups | `forgejo-backup.service` via timer. | 03:30 UTC daily. |
| Reboot to apply COS updates | `forgejo-reboot.service` runs `shutdown -r +0`. Containers come back via `forgejo-stack.service` + `--restart=unless-stopped`. | 04:30 UTC daily. ~30-60s downtime. |
Nightly order: backup at 03:30 → container update check at 04:00 → reboot at 04:30. The backup always lands before the update check and the reboot, so a bad update can be rolled back from GCS.
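The ordering in the table can be sanity-checked with a few lines of shell. This is a minimal sketch: the helper names `to_minutes` and `in_order` are hypothetical, and the times are copied from the table by hand rather than read from the live timers.

```bash
#!/usr/bin/env bash

# Convert an HH:MM string to minutes since midnight.
to_minutes() {
  local hh=${1%%:*} mm=${1##*:}
  # 10# forces base 10 so "08" and "09" are not parsed as octal.
  echo $(( 10#$hh * 60 + 10#$mm ))
}

# Succeed only if the given HH:MM times are strictly increasing.
in_order() {
  local prev=-1 cur t
  for t in "$@"; do
    cur=$(to_minutes "$t")
    (( cur > prev )) || return 1
    prev=$cur
  done
}

# Schedule from the table: backup, Watchtower check, reboot (all UTC).
if in_order 03:30 04:00 04:30; then
  echo "schedule ok"
else
  echo "schedule out of order" >&2
fi
```

If any of the three schedules is changed, rerunning the check with the new times shows whether the backup still comes first.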
### Disable the nightly reboot
If the reboot ever causes trouble, turn it off without affecting backups or container updates:
```bash
gcloud compute ssh forgejo --zone=us-east1-b --tunnel-through-iap \
--command='sudo systemctl disable --now forgejo-reboot.timer'
```
Re-enable with `enable --now` instead of `disable --now`. Cloud-init will re-enable it on the next VM replacement regardless.
## Update containers immediately
Watchtower pulls new images at 04:00 UTC by default. To force now:
```bash
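# One-off Watchtower run. SIGHUP is not a documented update trigger, and the
# stock image ships without a shell or `kill`, so `docker exec ... kill` fails.
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
  containrrr/watchtower --run-once --cleanup
# or, manually:
docker pull codeberg.org/forgejo/forgejo:11
sudo systemctl restart forgejo-stack.service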
```
## Run a backup on demand
```bash
sudo /var/lib/google/forgejo/backup.sh
gsutil ls gs://YOUR_PROJECT-forgejo-backups/
```
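To confirm that a fresh backup actually landed, the timestamp embedded in the object name can be turned into an age. The sketch below assumes backups are named like `forgejo-20260507T033000Z.tar.gz` (the pattern used in the restore example in this runbook) and that GNU `date` is available. `backup_age_hours` is a hypothetical helper, not part of the repo.

```bash
#!/usr/bin/env bash

# Print the age, in whole hours, of a backup named
# forgejo-YYYYMMDDTHHMMSSZ.tar.gz (assumed naming pattern).
# Usage: backup_age_hours NAME_OR_GS_URL [NOW_EPOCH]
backup_age_hours() {
  local name=${1##*/}                  # strip any gs://bucket/ prefix
  local now=${2:-$(date -u +%s)}       # default to the current time
  local re='^forgejo-([0-9]{4})([0-9]{2})([0-9]{2})T([0-9]{2})([0-9]{2})([0-9]{2})Z\.tar\.gz$'

  if [[ ! $name =~ $re ]]; then
    echo "unrecognized backup name: $name" >&2
    return 1
  fi

  local stamp="${BASH_REMATCH[1]}-${BASH_REMATCH[2]}-${BASH_REMATCH[3]}"
  stamp+=" ${BASH_REMATCH[4]}:${BASH_REMATCH[5]}:${BASH_REMATCH[6]}"

  local taken
  taken=$(date -u -d "$stamp" +%s) || return 1   # GNU date syntax
  echo $(( (now - taken) / 3600 ))
}
```

For example, `backup_age_hours "$(gsutil ls gs://YOUR_PROJECT-forgejo-backups/ | sort | tail -n 1)"` prints the age of the newest object, assuming the bucket holds only backup archives. With a daily 03:30 UTC backup, a value well above 24 suggests the timer has stopped firing.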
## Restore from a backup
`scripts/restore.sh` is in the repo, not on the VM. Copy it over and run:
```bash
gcloud compute scp scripts/restore.sh forgejo:/tmp/restore.sh \
--zone=us-east1-b --tunnel-through-iap
gcloud compute ssh forgejo --zone=us-east1-b --tunnel-through-iap \
--command='sudo bash /tmp/restore.sh forgejo-20260507T033000Z.tar.gz'
```
For a clean-environment dry run, use `scripts/test-restore.sh` from your workstation — it pulls the latest backup, boots Forgejo against it in a throwaway container, and probes the API.
## Forgejo major version upgrade
1. Read the [release notes](https://codeberg.org/forgejo/forgejo/releases) for breaking changes.
2. Take a manual backup (`sudo /var/lib/google/forgejo/backup.sh`).
3. Bump `forgejo_image` in `terraform.tfvars` (e.g. `codeberg.org/forgejo/forgejo:12`).
4. `terraform apply` — replaces the VM. The data disk persists; first boot runs DB migrations.
5. Watch `docker logs forgejo` to confirm migrations and startup.
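Before step 3, it can help to check how far the new tag jumps. The sketch below extracts the major version from an image reference and reports whether the change is a single major step. Both helper names are hypothetical, and whether skipping a major is safe is a question for the release notes, not something this check decides.

```bash
#!/usr/bin/env bash

# Print the leading major version of an image reference such as
# codeberg.org/forgejo/forgejo:12 (the digits right after the last colon).
image_major() {
  local tag=${1##*:}
  tag=${tag%%[!0-9]*}        # keep leading digits only: "12.0.1" -> "12"
  [[ -n $tag ]] || return 1  # e.g. "latest" has no numeric major
  echo "$tag"
}

# Succeed only if the second image is exactly one major ahead of the first.
is_single_major_step() {
  local from to
  from=$(image_major "$1") || return 1
  to=$(image_major "$2") || return 1
  (( 10#$to == 10#$from + 1 ))
}

if is_single_major_step codeberg.org/forgejo/forgejo:11 codeberg.org/forgejo/forgejo:12; then
  echo "single major step"
else
  echo "review the release notes for every skipped major" >&2
fi
```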
## Resize the data disk
GCP supports online disk growth:
```bash
gcloud compute disks resize forgejo-data --zone=us-east1-b --size=40
```
Then on the VM:
```bash
sudo resize2fs /dev/disk/by-id/google-forgejo-data
```
Update `size = 40` in `terraform/main.tf` afterward to keep state in sync.
## Rotate secrets
```bash
# Add a new version (the latest is read at boot):
openssl rand -hex 32 | gcloud secrets versions add forgejo-secret-key --data-file=-
sudo systemctl restart forgejo-stack.service
```
Rotating `SECRET_KEY` invalidates 2FA and some encrypted DB fields. Read the Forgejo docs before rotating.
## Cost / billing watch
- A $10/month project budget is managed by `terraform/budget.tf`. Email alerts at 50%, 90%, 100% (current spend) and 100% (forecasted) go to `admin_email`. Adjust the budget amount via `budget_amount_usd` in tfvars.
- Skim the billing report monthly. Egress is the most likely surprise.
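The alert percentages translate into dollar amounts as follows. This is a small arithmetic sketch with a hypothetical helper, `alert_amounts`, that assumes a whole-dollar budget. It does not read the real budget from Terraform or GCP.

```bash
#!/usr/bin/env bash

# Print the spend (in USD) at which each percentage threshold is crossed.
# Usage: alert_amounts BUDGET_USD PERCENT...
alert_amounts() {
  local budget_cents=$(( $1 * 100 )) pct cents
  shift
  for pct in "$@"; do
    cents=$(( budget_cents * pct / 100 ))
    printf '%d%% -> $%d.%02d\n' "$pct" $(( cents / 100 )) $(( cents % 100 ))
  done
}

# Thresholds from this runbook against the $10 budget.
alert_amounts 10 50 90 100
```

With the default budget, the current-spend alerts fire at $5.00, $9.00 and $10.00. Rerun with a new first argument after changing `budget_amount_usd`.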