Files
ProxmoxVE/misc/error_handler.func
CanbiZ (MickLesk) c9ecb1ccca core: smart recovery for failed installs | extend exit_codes (#11221)
* feat(build.func): smart error recovery menu for failed installations

Replace simple Y/n removal prompt with interactive recovery menu:

- Option 1: Remove container and exit (default, auto after 60s timeout)
- Option 2: Keep container for debugging
- Option 3: Retry installation with verbose mode enabled
- Option 4: Retry with 1.5x RAM and +1 CPU core (OOM errors only)

Improvements:
- Detect OOM errors (exit codes 137, 243) and offer resource increase
- Show human-readable error explanation using explain_exit_code()
- Recursive rebuild preserves ALL settings from advanced/app.vars/default.vars
- Settings preserved: Network (IP, Gateway, VLAN, MTU, Bridge), Features
  (Nesting, FUSE, TUN, GPU), Storage, SSH keys, Tags, Hostname, etc.
- Show rebuild summary before retry (old→new CTID, resources, network)
- New container ID generated automatically for rebuilds

This helps users recover from transient failures without re-running
the entire script manually.

* fix(api.func): fix duplicate exit codes and add missing error codes

Exit code fixes:
- Remove duplicate definitions for codes 243, 254 (Node.js vs DB)
- Reassign MySQL/MariaDB to 240-242, 244 (was 241-244)
- Reassign MongoDB to 250-253 (was 251-254)

New exit codes added (based on GitHub issues analysis):
- 6: curl couldn't resolve host (DNS failure)
- 7: curl failed to connect (network unreachable)
- 22: curl HTTP error (404, 429 rate limit, 500)
- 28: curl timeout (very common in download failures)
- 35: curl SSL error
- 102: APT lock held by another process
- 124: Command timeout
- 141: SIGPIPE (broken pipe)

Also update OOM detection to include exit code 134 (SIGABRT)
which is commonly seen in Node.js heap overflow issues.

Fixes based on analysis of ~500 GitHub issues.

* fix(exit-codes): sync error_handler.func and api.func with conflict-free code ranges

- Add curl error codes (6, 7, 22, 28, 35)
- Add APT lock code (102), timeout (124), signals (134, 141)
- Move Python codes: 210-212 → 160-162 (avoid Proxmox conflict)
- Move PostgreSQL codes: 231-234 → 170-173
- Move MySQL/MariaDB codes: 241-244 → 180-183
- Move MongoDB codes: 251-254 → 190-193
- Keep Node.js at 243-249, Proxmox at 200-231
- Both files now synchronized with identical mappings

* feat(exit-codes): add systemd and build error codes (150-154)

- 150: Systemd service failed to start
- 151: Systemd service unit not found
- 152: Permission denied (EACCES)
- 153: Build/compile failed (make/gcc/cmake)
- 154: Node.js native addon build failed (node-gyp)

Based on issue analysis: 57 service failures, 25 build failures, 22 node-gyp issues

* fix(build): restore smart recovery and add OOM/DNS retry paths

* feat(build): APT in-place repair, exit 1 subclassification, new exit codes

- Add APT/DPKG in-place recovery: detects exit 100/101/102/255 and exit 1
  with APT log patterns, offers to repair dpkg state and re-run install
  script without destroying the container
- Add exit 1 subclassification: analyzes combined log to identify root
  cause (APT, OOM, network, command-not-found) and routes to appropriate
  recovery option
- Add exit 10 hint: shows privileged mode / nesting suggestion
- Add exit 127 hint: extracts missing command name from logs
- Refactor recovery menu: use named option variables (APT_OPTION,
  OOM_OPTION, DNS_OPTION) instead of hardcoded option numbers, supports
  up to 6 dynamic options cleanly
- Map missing exit codes in api.func: curl 27/36/45/47/55, signals
  129 (SIGHUP) / 131 (SIGQUIT), npm 239

* feat(api+build): map 25 more exit codes, add SIGHUP trap, network/perm hints

api.func:
- Map 25+ new exit codes that were showing as 'Unknown' in telemetry:
  curl: 3, 16, 18, 24, 26, 32-34, 39, 44, 46, 48, 51, 52, 57, 59, 61,
  63, 79, 92, 95; signals: 125, 132, 144, 146
- Update code 8 description (FTP + apk untrusted key)
- Update header comment with full supported ranges

build.func:
- Add SIGHUP trap: reports 'failed/129' to API when terminal is closed,
  should significantly reduce the 2841 stuck 'installing' records
- Add exit 52 (empty reply) and 57 (poll error) to network issue
  detection for DNS override recovery option
- Add exit 125/126 hint: suggests privileged mode for permission errors

* fix: sync error_handler fallback, Alpine APK repair, retry limit

error_handler.func:
- Sync fallback explain_exit_code() with api.func: add 25+ codes that
  were missing (curl 16/18/24/26/27/32-34/36/39/44-48/51/52/55/57/59/
  61/63/79/92/95, signals 125/129/131/132/144/146, npm 239, code 3/8)
- Ensures consistent error descriptions even when api.func isn't loaded

build.func:
- Alpine APK repair: detect var_os=alpine and run 'apk fix && apk
  cache clean && apk update' instead of apt-get/dpkg commands
- Show 'Repair APK state' instead of 'APT/DPKG' in menu for Alpine
- Retry safety counter: OOM x2 retry limited to max 2 attempts
  (prevents infinite RAM doubling via RECOVERY_ATTEMPT env var)
- Show attempt count in rebuild summary

* fix(build): preserve exit code in ERR trap to prevent false exit_code=0

The ERR trap called ensure_log_on_host before post_update_to_api,
which reset \True to 0 (success). This caused ~15-20 records/day to be
reported as 'failed' with exit_code=0 instead of the actual error code.

Root cause chain:
1. Command fails with exit code N → ERR trap fires (\True = N)
2. ensure_log_on_host succeeds → \True becomes 0
3. post_update_to_api 'failed' '\True' → sends 'failed/0' (wrong!)
4. POST_UPDATE_DONE=true → EXIT trap skips the correct code

Fix: capture \True into _ERR_CODE before ensure_log_on_host runs.

* Implement telemetry settings and repo source detection

Add telemetry configuration and repository source detection function.
2026-02-17 12:14:46 +01:00

18 KiB