core: smart recovery for failed installs | extend exit_codes (#11221)

* feat(build.func): smart error recovery menu for failed installations

Replace simple Y/n removal prompt with interactive recovery menu:

- Option 1: Remove container and exit (default, auto after 60s timeout)
- Option 2: Keep container for debugging
- Option 3: Retry installation with verbose mode enabled
- Option 4: Retry with 1.5x RAM and +1 CPU core (OOM errors only)

Improvements:
- Detect OOM errors (exit codes 137, 243) and offer resource increase
- Show human-readable error explanation using explain_exit_code()
- Recursive rebuild preserves ALL settings from advanced/app.vars/default.vars
- Settings preserved: Network (IP, Gateway, VLAN, MTU, Bridge), Features
  (Nesting, FUSE, TUN, GPU), Storage, SSH keys, Tags, Hostname, etc.
- Show rebuild summary before retry (old→new CTID, resources, network)
- New container ID generated automatically for rebuilds

This helps users recover from transient failures without re-running
the entire script manually.

* fix(api.func): fix duplicate exit codes and add missing error codes

Exit code fixes:
- Remove duplicate definitions for codes 243, 254 (Node.js vs DB)
- Reassign MySQL/MariaDB to 240-242, 244 (was 241-244)
- Reassign MongoDB to 250-253 (was 251-254)

New exit codes added (based on GitHub issues analysis):
- 6: curl couldn't resolve host (DNS failure)
- 7: curl failed to connect (network unreachable)
- 22: curl HTTP error (404, 429 rate limit, 500)
- 28: curl timeout (very common in download failures)
- 35: curl SSL error
- 102: APT lock held by another process
- 124: Command timeout
- 141: SIGPIPE (broken pipe)

Also update OOM detection to include exit code 134 (SIGABRT)
which is commonly seen in Node.js heap overflow issues.

Fixes based on analysis of ~500 GitHub issues.

* fix(exit-codes): sync error_handler.func and api.func with conflict-free code ranges

- Add curl error codes (6, 7, 22, 28, 35)
- Add APT lock code (102), timeout (124), signals (134, 141)
- Move Python codes: 210-212 → 160-162 (avoid Proxmox conflict)
- Move PostgreSQL codes: 231-234 → 170-173
- Move MySQL/MariaDB codes: 241-244 → 180-183
- Move MongoDB codes: 251-254 → 190-193
- Keep Node.js at 243-249, Proxmox at 200-231
- Both files now synchronized with identical mappings

* feat(exit-codes): add systemd and build error codes (150-154)

- 150: Systemd service failed to start
- 151: Systemd service unit not found
- 152: Permission denied (EACCES)
- 153: Build/compile failed (make/gcc/cmake)
- 154: Node.js native addon build failed (node-gyp)

Based on issue analysis: 57 service failures, 25 build failures, 22 node-gyp issues

* fix(build): restore smart recovery and add OOM/DNS retry paths

* feat(build): APT in-place repair, exit 1 subclassification, new exit codes

- Add APT/DPKG in-place recovery: detects exit 100/101/102/255 and exit 1
  with APT log patterns, offers to repair dpkg state and re-run install
  script without destroying the container
- Add exit 1 subclassification: analyzes combined log to identify root
  cause (APT, OOM, network, command-not-found) and routes to appropriate
  recovery option
- Add exit 10 hint: shows privileged mode / nesting suggestion
- Add exit 127 hint: extracts missing command name from logs
- Refactor recovery menu: use named option variables (APT_OPTION,
  OOM_OPTION, DNS_OPTION) instead of hardcoded option numbers, supports
  up to 6 dynamic options cleanly
- Map missing exit codes in api.func: curl 27/36/45/47/55, signals
  129 (SIGHUP) / 131 (SIGQUIT), npm 239

* feat(api+build): map 25 more exit codes, add SIGHUP trap, network/perm hints

api.func:
- Map 25+ new exit codes that were showing as 'Unknown' in telemetry:
  curl: 3, 16, 18, 24, 26, 32-34, 39, 44, 46, 48, 51, 52, 57, 59, 61,
  63, 79, 92, 95; signals: 125, 132, 144, 146
- Update code 8 description (FTP + apk untrusted key)
- Update header comment with full supported ranges

build.func:
- Add SIGHUP trap: reports 'failed/129' to API when terminal is closed,
  should significantly reduce the 2841 stuck 'installing' records
- Add exit 52 (empty reply) and 57 (poll error) to network issue
  detection for DNS override recovery option
- Add exit 125/126 hint: suggests privileged mode for permission errors

* fix: sync error_handler fallback, Alpine APK repair, retry limit

error_handler.func:
- Sync fallback explain_exit_code() with api.func: add 25+ codes that
  were missing (curl 16/18/24/26/27/32-34/36/39/44-48/51/52/55/57/59/
  61/63/79/92/95, signals 125/129/131/132/144/146, npm 239, code 3/8)
- Ensures consistent error descriptions even when api.func isn't loaded

build.func:
- Alpine APK repair: detect var_os=alpine and run 'apk fix && apk
  cache clean && apk update' instead of apt-get/dpkg commands
- Show 'Repair APK state' instead of 'APT/DPKG' in menu for Alpine
- Retry safety counter: OOM x2 retry limited to max 2 attempts
  (prevents infinite RAM doubling via RECOVERY_ATTEMPT env var)
- Show attempt count in rebuild summary

* fix(build): preserve exit code in ERR trap to prevent false exit_code=0

The ERR trap called ensure_log_on_host before post_update_to_api,
which reset \True to 0 (success). This caused ~15-20 records/day to be
reported as 'failed' with exit_code=0 instead of the actual error code.

Root cause chain:
1. Command fails with exit code N → ERR trap fires (\True = N)
2. ensure_log_on_host succeeds → \True becomes 0
3. post_update_to_api 'failed' '\True' → sends 'failed/0' (wrong!)
4. POST_UPDATE_DONE=true → EXIT trap skips the correct code

Fix: capture \True into _ERR_CODE before ensure_log_on_host runs.

* Implement telemetry settings and repo source detection

Add telemetry configuration and repository source detection function.
This commit is contained in:
CanbiZ (MickLesk)
2026-02-17 12:14:46 +01:00
committed by GitHub
parent d274a269b5
commit c9ecb1ccca
3 changed files with 387 additions and 23 deletions

View File

@@ -37,21 +37,47 @@ if ! declare -f explain_exit_code &>/dev/null; then
case "$code" in
1) echo "General error / Operation not permitted" ;;
2) echo "Misuse of shell builtins (e.g. syntax error)" ;;
3) echo "General syntax or argument error" ;;
10) echo "Docker / privileged mode required (unsupported environment)" ;;
4) echo "curl: Feature not supported or protocol error" ;;
5) echo "curl: Could not resolve proxy" ;;
6) echo "curl: DNS resolution failed (could not resolve host)" ;;
7) echo "curl: Failed to connect (network unreachable / host down)" ;;
8) echo "curl: FTP server reply error" ;;
8) echo "curl: Server reply error (FTP/SFTP or apk untrusted key)" ;;
16) echo "curl: HTTP/2 framing layer error" ;;
18) echo "curl: Partial file (transfer not completed)" ;;
22) echo "curl: HTTP error returned (404, 429, 500+)" ;;
23) echo "curl: Write error (disk full or permissions)" ;;
24) echo "curl: Write to local file failed" ;;
25) echo "curl: Upload failed" ;;
26) echo "curl: Read error on local file (I/O)" ;;
27) echo "curl: Out of memory (memory allocation failed)" ;;
28) echo "curl: Operation timeout (network slow or server not responding)" ;;
30) echo "curl: FTP port command failed" ;;
32) echo "curl: FTP SIZE command failed" ;;
33) echo "curl: HTTP range error" ;;
34) echo "curl: HTTP post error" ;;
35) echo "curl: SSL/TLS handshake failed (certificate error)" ;;
36) echo "curl: FTP bad download resume" ;;
39) echo "curl: LDAP search failed" ;;
44) echo "curl: Internal error (bad function call order)" ;;
45) echo "curl: Interface error (failed to bind to specified interface)" ;;
46) echo "curl: Bad password entered" ;;
47) echo "curl: Too many redirects" ;;
48) echo "curl: Unknown command line option specified" ;;
51) echo "curl: SSL peer certificate or SSH host key verification failed" ;;
52) echo "curl: Empty reply from server (got nothing)" ;;
55) echo "curl: Failed sending network data" ;;
56) echo "curl: Receive error (connection reset by peer)" ;;
57) echo "curl: Unrecoverable poll/select error (system I/O failure)" ;;
59) echo "curl: Couldn't use specified SSL cipher" ;;
61) echo "curl: Bad/unrecognized transfer encoding" ;;
63) echo "curl: Maximum file size exceeded" ;;
75) echo "Temporary failure (retry later)" ;;
78) echo "curl: Remote file not found (404 on FTP/file)" ;;
79) echo "curl: SSH session error (key exchange/auth failed)" ;;
92) echo "curl: HTTP/2 stream error (protocol violation)" ;;
95) echo "curl: HTTP/3 layer error" ;;
64) echo "Usage error (wrong arguments)" ;;
65) echo "Data format error (bad input data)" ;;
66) echo "Input file not found (cannot open input)" ;;
@@ -69,15 +95,21 @@ if ! declare -f explain_exit_code &>/dev/null; then
101) echo "APT: Configuration error (bad sources.list, malformed config)" ;;
102) echo "APT: Lock held by another process (dpkg/apt still running)" ;;
124) echo "Command timed out (timeout command)" ;;
125) echo "Command failed to start (Docker daemon or execution error)" ;;
126) echo "Command invoked cannot execute (permission problem?)" ;;
127) echo "Command not found" ;;
128) echo "Invalid argument to exit" ;;
130) echo "Terminated by Ctrl+C (SIGINT)" ;;
129) echo "Killed by SIGHUP (terminal closed / hangup)" ;;
130) echo "Aborted by user (SIGINT)" ;;
131) echo "Killed by SIGQUIT (core dumped)" ;;
132) echo "Killed by SIGILL (illegal CPU instruction)" ;;
134) echo "Process aborted (SIGABRT - possibly Node.js heap overflow)" ;;
137) echo "Killed (SIGKILL / Out of memory?)" ;;
139) echo "Segmentation fault (core dumped)" ;;
141) echo "Broken pipe (SIGPIPE - output closed prematurely)" ;;
143) echo "Terminated (SIGTERM)" ;;
144) echo "Killed by signal 16 (SIGUSR1 / SIGSTKFLT)" ;;
146) echo "Killed by signal 18 (SIGTSTP)" ;;
150) echo "Systemd: Service failed to start" ;;
151) echo "Systemd: Service unit not found" ;;
152) echo "Permission denied (EACCES)" ;;
@@ -123,6 +155,7 @@ if ! declare -f explain_exit_code &>/dev/null; then
224) echo "Proxmox: PBS storage is for backups only" ;;
225) echo "Proxmox: No template available for OS/Version" ;;
231) echo "Proxmox: LXC stack upgrade failed" ;;
239) echo "npm/Node.js: Unexpected runtime error or dependency failure" ;;
243) echo "Node.js: Out of memory (JavaScript heap out of memory)" ;;
245) echo "Node.js: Invalid command-line option" ;;
246) echo "Node.js: Internal JavaScript Parse Error" ;;