* Standardize exit codes and add mappings
Replace generic exit 1 usages with specific numeric exit codes and add corresponding explanations to the error lookup. This commit updates multiple misc/* scripts to return distinct codes for validation, Proxmox/LXC, networking, download and curl errors (e.g. 103-123, 64, 107-120, 206, 0 for explicit user cancels). It also updates curl error handling to propagate the original curl exit code and adds new entries in explain_exit_code and the error handler to improve diagnostics.
* Set exit code 115 for update_os errors
Change exit status from 6 to 115 in misc/alpine-install.func's update_os() error handlers when failing to download tools.func or when the expected functions are missing. This gives a distinct exit code for these specific failure cases.
* Add tools/addon exit codes and use them
Introduce exit codes 232-238 for Tools & Addon scripts in misc/api.func and misc/error_handler.func. Update addon scripts (tools/addon/adguardhome-sync.sh, tools/addon/copyparty.sh, tools/addon/cronmaster.sh) to return specific codes instead of generic exit 1: 238 for unsupported OS and 233 when the application is not installed/upgrade prerequisites are missing. This makes failures more descriptive and aligns scripts with the central error explanations.
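For orientation, a minimal Bash sketch of a lookup for the addon range. The function name `explain_addon_exit_code` and the message texts are hypothetical; the meanings are paraphrased from how the commits in this message use each code, not copied from explain_exit_code in misc/api.func.

```shell
# Hypothetical lookup; texts paraphrase this commit message, not the real table.
explain_addon_exit_code() {
  local code="$1"
  case "$code" in
    233) echo "Application not installed or upgrade prerequisites missing" ;;
    234) echo "No containers available" ;;
    235) echo "Backup or conversion failed" ;;
    236) echo "No matching network interfaces" ;;
    237) echo "Required tool installation failed" ;;
    238) echo "Unsupported operating system" ;;
    *) echo "Unknown error" ;;
  esac
}

explain_addon_exit_code 238
```

Keeping the texts in one function lets callers and telemetry share a single description per code.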
* Standardize exit codes in exporter addons
Unify exit codes across exporter addon scripts: return 238 for unsupported OS detections and 233 when an update is requested but the exporter is not installed. Applied to nextcloud-exporter.sh, pihole-exporter.sh, prometheus-paperless-ngx-exporter.sh, and qbittorrent-exporter.sh to make failure modes distinguishable for callers/automation.
* Use specific exit codes in addon scripts
Replace generic exit 1 with distinct exit codes across multiple addon scripts to enable finer-grained error handling in automation. Exit codes introduced: 10 for Docker/Compose missing or user-declined Docker install, 233 for "nothing to update" cases, and 238 for unsupported OS cases. Affected files: tools/addon/arcane.sh, coolify.sh, dockge.sh, dokploy.sh, filebrowser-quantum.sh, filebrowser.sh, immich-public-proxy.sh, jellystat.sh, runtipi.sh.
* Use specific exit codes in addon scripts
Replace generic exit 1 with specific exit codes across multiple addon scripts to improve error signaling and handling. Files updated: tools/addon/add-netbird-lxc.sh (exit 238 on unsupported distro), tools/addon/add-tailscale-lxc.sh (treat user cancel as exit 0), tools/addon/glances.sh (exit 233 when not installed), tools/addon/komodo.sh (distinct exits for missing compose, legacy DB, backup/download failures, docker checks), tools/addon/netdata.sh (distinct exits for unsupported PVE versions, OS/codename detection, repo lookups), and tools/addon/phpmyadmin.sh (distinct exits for unsupported OS, network/download issues, package install/start failures, and invalid input). These changes make failures easier to identify and automate recovery or reporting.
* Use specific exit codes in PVE scripts
Replace generic exit 1 with distinct exit codes across tools/pve scripts to provide clearer failure signals for callers. post-pve-install.sh now returns 105 for unsupported Proxmox versions; pve-privilege-converter.sh uses 104 for non-root, 234 when no containers, and 235 for backup/conversion failures; update-apps.sh maps backup failures to 235, missing containers/selections to 234 (and UI cancellations to 0), missing backup storage to 119, and returns the actual container update exit code on failure. These changes improve diagnostics and allow external tooling to react to specific error conditions.
* Standardize exit codes and behaviors
Adjust exit codes and abort handling across multiple PVE helper scripts to provide clearer outcomes for automation and interactive flows. Changes include:
- container-restore-from-backup.sh, core-restore-from-backup.sh: return 235 when no backups found (was 1).
- fstrim.sh: treat user cancellation of non-ext4 warning as non-error (exit 0 instead of 1).
- kernel-clean.sh: treat no selection or user abort as non-error (exit 0 instead of 1).
- lxc-delete.sh: return 234 when no containers are present; treat no selection as non-error (exit 0).
- nic-offloading-fix.sh: use specific non-zero codes for root check and tool install failures (exit 104, 237) and 236 when no matching interfaces (was 1).
- pbs_microcode.sh, post-pmg-install.sh, post-pbs-install.sh: use distinct exit codes (232 and 105) for detected VM/PVE/unsupported distro conditions instead of generic 1.
These modifications make scripts return distinct codes for different failure modes and ensure user-initiated aborts or benign conditions exit with 0 where appropriate.
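The distinction between a benign abort and a real failure could be sketched as below. The helper name `selection_exit_code` and its arguments are illustrative assumptions; only the codes 234 and 0 come from the changes listed above.

```shell
# Hypothetical helper: decide the outcome of a selection step.
# $1 = number of containers found, $2 = user's selection (empty when cancelled)
selection_exit_code() {
  local count="$1" selection="$2"
  if (( count == 0 )); then
    echo 234       # nothing to operate on: a real failure for automation
  elif [[ -z "$selection" ]]; then
    echo 0         # user aborted or selected nothing: benign
  else
    echo proceed   # a selection exists, continue with the work
  fi
}

selection_exit_code 3 ""
```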
* Use exit 105 for unsupported PVE versions
Standardize error handling by replacing generic exit 1 with exit 105 in pve_check() across multiple VM template scripts to indicate unsupported Proxmox VE versions. Also add API exit code 226 message for "Proxmox: VM disk import or post-creation setup failed" in misc/api.func. Affected files include misc/api.func and various vm/*-vm.sh scripts.
* Use specific exit codes in VM scripts
Replace generic exit 1 with distinct exit codes across vm/*.sh to make failures more actionable for callers. Changes include: use 226 for missing imported-disk references, 237 for pv installation failures, 115 for download/extract/ISO-related failures, 214 for insufficient disk space during FreeBSD decompression, and 119 for missing storage detection. Updated scripts: archlinux-vm.sh, docker-vm.sh, haos-vm.sh, openwrt-vm.sh, opnsense-vm.sh, truenas-vm.sh, umbrel-os-vm.sh.
Remove host-side tee capture of lxc-attach output and PIPESTATUS handling; lxc-attach is now invoked directly and the exit code is taken from $?. Simplify install log retrieval by pulling /root/.install-<SESSION_ID>.log directly and removing the fallback that used the host-captured terminal log, related temp-file size checks, and timeout logic. Remove terminal-state restores and input-draining (stty/dd) and stop redirecting reads from /dev/tty so interactive reads use standard input; similar simplifications applied to the retry flow. Also remove cleanup of the discarded capture log. These changes reduce complexity and terminal manipulation, at the cost of losing the previous terminal-capture fallback for installs that failed to produce a container-side log.
Restore and sanitize terminal state before prompting by draining stale input from /dev/tty (dd iflag=nonblock) and adding a short sleep, then perform timed reads from /dev/tty in misc/build.func and misc/error_handler.func. Also make _REPO_CACHE a global associative array (declare -gA) with fallbacks in misc/tools.func so the cache survives when tools.func is sourced inside a function scope.
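A small Bash sketch of why the `-g` flag matters for the cache. `load_cache` stands in for tools.func being sourced inside a function, and the key and value are made up for illustration.

```shell
# Stand-in for tools.func being sourced inside a function scope.
load_cache() {
  # Without -g, declare inside a function creates a local that vanishes on return.
  declare -gA _REPO_CACHE
  _REPO_CACHE[example/repo]="v1.0.0"   # hypothetical key and value
}

load_cache
echo "${_REPO_CACHE[example/repo]}"    # still visible after the function returns
```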
Replace pre-opened _RECOVERY_TTY handling with direct reads from /dev/tty in misc/build.func and misc/error_handler.func. The change opens /dev/tty at prompt time (with stty sane) so the prompt reads aren't affected by tty state corruption from lxc-attach|tee, simplifies the read logic by using a local response variable with a timeout, and removes the pre-open/close bookkeeping for _RECOVERY_TTY.
Two critical bugs fixed:
1. Install scripts (80+) using 'read' for interactive prompts all fail because
lxc-attach stdin was redirected from /dev/null. Change to /dev/tty so install
scripts like immich, elementsynapse, etc. can prompt the user interactively.
2. Recovery menu read gets 'Input/output error' from /dev/tty after the
lxc-attach|tee pipeline corrupts the terminal state. Pre-open a separate
file descriptor to /dev/tty BEFORE the pipeline starts. This fd survives
any tty corruption and is used as fallback for the recovery menu read.
Fixes the 'command not found' issue where user input falls through to the
parent shell.
Both build.func (main install + APT retry) and error_handler.func (fallback
cleanup prompt) are updated with the same pattern.
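The pre-opened descriptor idea from point 2 might look roughly like this in Bash. The names `open_recovery_fd`, `read_recovery_choice`, and `RECOVERY_FD` are hypothetical, and the path is a parameter only so the sketch can be exercised without a terminal; this is not the code from build.func.

```shell
# Open a read descriptor on the terminal BEFORE any pipeline can disturb it.
open_recovery_fd() {
  local path="${1:-/dev/tty}"
  RECOVERY_FD=""
  # The braces keep the stderr redirect temporary; the new fd stays open.
  if ! { exec {RECOVERY_FD}<"$path"; } 2>/dev/null; then
    RECOVERY_FD=""   # no usable terminal: callers must handle this case
  fi
}

# Read one line from the saved descriptor, giving up after 60 seconds.
read_recovery_choice() {
  local response=""
  if [[ -n "$RECOVERY_FD" ]]; then
    read -r -t 60 -u "$RECOVERY_FD" response || true
  fi
  echo "$response"
}
```

A caller would run `open_recovery_fd` once before the install pipeline starts and `read_recovery_choice` when the recovery menu is shown.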
* fix(zammad): configure Elasticsearch for LXC container startup
- Set discovery.type: single-node (required for single-node ES)
- Set xpack.security.enabled: false (not needed in local LXC)
- Set bootstrap.memory_lock: false (fails in unprivileged LXC)
- Add startup wait loop (up to 60s) to ensure ES is ready before
Zammad installation continues
Fixes recurring Elasticsearch startup failures related to #12301
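A bounded wait loop of the kind described above can be sketched as follows. `wait_until_ready` is a hypothetical helper, and the probe command in the usage comment (including the port) is an assumption rather than what the install script runs.

```shell
# wait_until_ready MAX_SECONDS CMD...: poll CMD once per second until it succeeds.
wait_until_ready() {
  local max="$1" i
  shift
  for ((i = 0; i < max; i++)); do
    "$@" >/dev/null 2>&1 && return 0
    sleep 1
  done
  return 1   # still not ready after MAX_SECONDS attempts
}

# Assumed usage: wait_until_ready 60 curl -fsS http://localhost:9200
```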
* refactor(api): eliminate duplicate traps, harden error handling & telemetry
Phase 1 - Structural:
- Remove api_exit_script() and 5 inline traps from build.func
- error_handler.func is now the sole trap owner via catch_errors()
- Update api.func comment reference (api_exit_script -> on_exit)
Phase 2 - Quality:
- Add stop_spinner() + cursor restore to error_handler(), on_interrupt(),
on_terminate(), on_hangup() to prevent spinner/cursor artifacts
- Enhance _send_abort_telemetry() with error text (last 20 log lines),
duration calculation, and 2 retry attempts (was fire-and-forget)
- Harden json_escape() to also strip DEL (0x7F) character
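As an illustration of the escaping step, a minimal Bash sketch that escapes backslashes, quotes, newlines, and tabs, then strips the remaining control characters plus DEL. The name `json_escape_sketch` is hypothetical and the real json_escape() may differ in detail.

```shell
json_escape_sketch() {
  local s="$1"
  s=${s//\\/\\\\}        # backslash -> \\
  s=${s//\"/\\\"}        # double quote -> \"
  s=${s//$'\n'/\\n}      # newline -> \n
  s=${s//$'\t'/\\t}      # tab -> \t
  # Drop whatever control characters remain (0x00-0x1F) and DEL (0x7F).
  printf '%s' "$s" | tr -d '\000-\037\177'
}

json_escape_sketch $'say "hi"\x7f'
```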
* fix(build): show spinner during post_update_to_api to prevent Ctrl+Z abort
post_update_to_api can take up to 33 seconds worst-case (3 curl attempts
x 10s timeout + sleep delays). Without any terminal output during this
time, users think the script is stuck and press Ctrl+Z, which prevents
the recovery menu from ever appearing.
Add msg_info spinner before both post_update_to_api calls in the failure
path (initial report + final force retry after recovery menu).
* fix(build): prevent SIGTSTP from killing recovery dialog
- Replace msg_info/stop_spinner with plain echo for telemetry reporting
The background spinner process in non-interactive shells (bash -c)
can trigger SIGTSTP, stopping the entire process group before the
recovery dialog appears. Plain echo avoids this.
- Add trap '' TSTP at failure path entry to ignore suspension signals
Prevents Ctrl+Z or terminal-related SIGTSTP from interrupting the
recovery menu. Restored with trap - TSTP before exit.
- Root cause: msg_info starts a background process (spinner &) that
is not properly detached in non-interactive shells where job control
(set -m) is OFF. The disown builtin has no effect without job
control, leaving the spinner in the same process group. This can
cause terminal I/O conflicts during the 33-second post_update_to_api
retry window, resulting in [2]+ Stopped.
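The ignore-and-restore pattern for SIGTSTP can be sketched like this. `run_failure_path` is a hypothetical wrapper; in build.func the two trap calls sit directly in the failure path.

```shell
run_failure_path() {
  trap '' TSTP    # ignore Ctrl+Z while telemetry and the recovery menu run
  "$@"            # stand-in for the failure-path work
  local rc=$?
  trap - TSTP     # restore default handling before exiting
  return "$rc"
}

# Prints the ignore-trap that is active while the wrapped command runs.
run_failure_path trap -p TSTP
```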
* fix(test): initialize colors and remove illegal local in test harness
- Call load_functions() after sourcing core.func to initialize
color/formatting/icon variables (RD, GN, YW, CL, TAB, etc.)
- Remove 'local' keyword from top-level scope (not inside function)
- Default REPO_SOURCE to ref_api instead of main
* chore: remove test-recovery-dialog.sh from branch
* Revert "fix(zammad): configure Elasticsearch for LXC container startup"
This reverts commit 10e450b72f.
* fix(build): show telemetry status only in verbose mode
Telemetry reporting is an implementation detail that doesn't help
the user during failure recovery. Wrap echo statements with
VERBOSE check so they only appear when verbose mode is enabled.
* Enhance telemetry, signal handling, and logs
Improve failure telemetry and signal handling across the installer:
- Add get_full_log() to collect, strip, and truncate install logs and include them in API payloads, with a truncated retry.
- Add a CONTAINER_INSTALLING flag around lxc-attach and stop containers on abort to avoid orphaned "installing/configuring" records.
- Introduce the _send_abort_telemetry() (curl fallback for container context) and _stop_container_if_installing() helpers.
- Centralize and simplify the EXIT/ERR/INT/TERM/HUP traps and handlers, including a new on_hangup handler.
- Update VM scripts to report numeric exit codes.
Also ensure best-effort log collection is performed and tweak error categorization for certain signals.
* Include full log in error telemetry
Use get_full_log (up to 120KB) to populate the error telemetry field so the API receives the full installation trace; fall back to get_error_text (last ~20 lines) if the full log is empty. Removed collection and inclusion of a separate install_log field from the JSON payloads and simplified the retry payloads/comments accordingly. The change ensures error reports contain the complete trace while avoiding duplicate large log fields and keeps graceful failure handling (get_full_log || true).
* Anonymize IP addresses in get_full_log
Mask IPv4 addresses in logs when collecting full log output: added a sed step that replaces the last two octets with "x.x" to avoid exposing full IPs (GDPR). Also updated the comment to reflect anonymization; existing steps that strip carriage returns and ANSI escape sequences remain in place before truncating with head -c.
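A sketch of the masking step, assuming an extended-regex sed. The function name `anonymize_ipv4` is hypothetical and the exact expression in get_full_log may differ. Note that this simple pattern also rewrites any other dotted four-number token, such as a four-part version string.

```shell
# Replace the last two octets of anything that looks like an IPv4 address.
anonymize_ipv4() {
  sed -E 's/([0-9]{1,3}\.[0-9]{1,3})\.[0-9]{1,3}\.[0-9]{1,3}/\1.x.x/g'
}

echo 'inet 192.168.1.42/24' | anonymize_ipv4
```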
Prevent host-side error_handler from being triggered during in-container install/recovery by delaying re-enabling set -Eeuo pipefail and the ERR trap in misc/build.func until after install/recovery completes; add explanatory comments. Update misc/error_handler.func to fall back to BUILD_LOG if container-internal log path is unavailable, show the last 20 log lines when present, refine container vs host detection (check INSTALL_LOG file and /root), copy INSTALL_LOG into /root and write a .failed flag with the exit code for host-side detection, and ensure full-log output and container removal prompt are shown appropriately in host context. Tweak misc/core.func silent() output to include a "Full log" path and adjust formatting.
Introduce post_progress_to_api() in misc/api.func — a non-blocking, fire-and-forget curl ping (gated by DIAGNOSTICS and RANDOM_UUID) that updates telemetry status to "configuring". Wire this progress ping into multiple scripts (alpine-install.func, install.func, build.func, core.func) at key milestones (container start, network ready, customization, creation, cleanup) and replace/deduplicate some earlier post_to_api calls. Also update error_handler.func to always report failures immediately via post_update_to_api to ensure failures are captured even before/after container lifecycle.
Prevent hangs when pulling logs from containers by wrapping pct pull calls with timeout (8s) and running ensure_log_on_host under timeout (10s). Always send telemetry (post_update_to_api) before attempting best-effort log collection so status is reported even if log retrieval blocks. Update EXIT/ERR/SIGHUP/SIGINT/SIGTERM traps and consolidate error/interrupt handlers to use the new timeout-wrapped log collection. Changes in misc/build.func and misc/error_handler.func.
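The timeout wrapping can be sketched with the coreutils `timeout` command. `best_effort` is a hypothetical helper, and the variables in the usage comment are placeholders. Because `timeout` executes external commands, a shell function such as ensure_log_on_host would need a different invocation than the one shown here.

```shell
# best_effort SECONDS CMD...: run CMD under a time limit and never fail the caller.
best_effort() {
  local secs="$1" rc=0
  shift
  timeout "$secs" "$@" 2>/dev/null || rc=$?
  # timeout exits with 124 when it had to kill the command.
  (( rc == 124 )) && echo "timed out after ${secs}s" >&2
  return 0
}

# Assumed usage: best_effort 8 pct pull "$CTID" "$container_log" "$host_log"
```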
* fix: send telemetry BEFORE log collection in signal handlers
- Swap ensure_log_on_host/post_update_to_api order in on_interrupt, on_terminate, api_exit_script, and inline SIGHUP/SIGINT/SIGTERM traps
- For signal exits (>128): send telemetry immediately, then best-effort log collection
- Add 2>/dev/null || true to all I/O in signal handlers to prevent SIGPIPE
- Fix on_exit: exit_code=0 now reports 'done' instead of 'failed 1'
- Root cause: pct pull hangs on dying containers blocked telemetry updates, leaving 595+ records stuck in 'installing' daily
* feat: add execution_id to all telemetry payloads
- Generate EXECUTION_ID from RANDOM_UUID in variables()
- Export EXECUTION_ID to container environment
- Add execution_id field to all 8 API payloads in api.func
- Add execution_id to post_progress_to_api in install.func and alpine-install.func
- Fallback to RANDOM_UUID when EXECUTION_ID not set (backward compat)
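The backward-compatible fallback amounts to a nested default expansion. A minimal sketch, with `resolve_execution_id` as a hypothetical helper name:

```shell
# Prefer EXECUTION_ID; fall back to RANDOM_UUID when it is unset or empty.
resolve_execution_id() {
  echo "${EXECUTION_ID:-${RANDOM_UUID:-}}"
}
```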
* fix: correct telemetry type values for PVE and addon scripts
- PVE scripts (tools/pve/*): change type 'tool' -> 'pve'
- Addon scripts (tools/addon/*): fix 4 scripts that wrongly used 'tool' -> 'addon'
(netdata, add-tailscale-lxc, add-netbird-lxc, all-templates)
- api.func: post_tool_to_api sends type='pve', default fallback 'pve'
- Aligns with PocketBase categories: lxc, vm, pve, addon
* fix: persist diagnostics opt-in inside containers for addon telemetry
- install.func + alpine-install.func: create /usr/local/community-scripts/diagnostics
inside the container when DIAGNOSTICS=yes (from build.func export)
- Enables addon scripts running later inside containers to find the opt-in
- Update init_tool_telemetry default type from 'tool' to 'pve'
* refactor: clean up diagnostics/telemetry opt-in system
- diagnostics_check(): deduplicate heredoc (was 2x 22 lines), improve whiptail
text with clear what/what-not collected, add telemetry + privacy links
- diagnostics_menu(): better UX with current status, clear enable/disable
buttons, note about existing containers
- variables(): change DIAGNOSTICS default from 'yes' to 'no' (safe: no
telemetry before user consents via diagnostics_check)
- install.func + alpine-install.func: persist BOTH yes AND no in container
so opt-out is explicit (not just missing file = no)
- Fix typo 'menue' -> 'menu' in config file comments
* fix: no pre-selection in telemetry dialog, link to telemetry-service README
- Add --defaultno so 'No, opt out' is focused by default (user must Tab to Yes)
- Change privacy link from discussions/1836 to telemetry-service#privacy--compliance
* fix: use radiolist for telemetry dialog (no pre-selection)
- Replace --yesno with --radiolist: user must actively SPACE-select an option
- Both options start as OFF (no pre-selection)
- Cancel/Exit defaults to 'no' (opt-out)
* simplify: inline telemetry dialog text like other whiptail dialogs
* improve: telemetry dialog with more detail, link to PRIVACY.md
- Add what we collect / don't collect sections back to dialog
- Link to telemetry-service/docs/PRIVACY.md instead of README anchor
- Update config file comment with same link
* Ensure API update is sent on script exit
Add exit-time telemetry handling across scripts to avoid orphaned "installing" records. Introduce local exit_code capture in api_exit_script and cleanup handlers and, when POST_TO_API_DONE is true but POST_UPDATE_DONE is not, post a final status (marking failures on non-zero exit codes, or marking done/failed in VM cleanups based on exit code). Changes touch misc/build.func, misc/vm-core.func and various vm/*-vm.sh cleanup functions to reliably send post_update_to_api on normal or early exits.
* Update api.func
* fix(telemetry): add missing exit codes to explain_exit_code()
- Add curl error codes: 4, 5, 8, 23, 25, 30, 56, 78
- Add code 10: Docker/privileged mode required (used in ~15 scripts)
- Add code 75: Temporary failure (retry later)
- Add BSD sysexits.h codes: 64-77
- Sync error_handler.func fallback with canonical api.func
* fix(telemetry): improve error reporting with structured error strings and better categorization
- Add build_error_string() that creates structured format:
'exit_code=N | description\n---\n<last 20 log lines>'
- Fix categorize_error() to map ALL known exit codes:
  - Added: shell(1,2), proxmox(200-231), service(150-154),
    database(170-193), runtime(243-249), signal(139,141,143)
  - Split timeout from network (28 was in both)
  - Added DPKG(255) to dependency category
- Update all API functions to use build_error_string():
post_update_to_api, post_update_to_api_extended,
post_tool_to_api, post_addon_to_api
- Add ensure_log_on_host() calls to on_exit, on_interrupt,
on_terminate handlers to prevent race condition where
telemetry reports before container log is pulled to host
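A rough Bash sketch of the structured error string and the category mapping listed above. Both function names carry a `_sketch` suffix because they are illustrations; the ranges come from this commit message, while the `unknown` fallback and the argument order are assumptions.

```shell
# Build 'exit_code=N | description', a '---' line, then the last 20 log lines.
build_error_string_sketch() {
  local code="$1" desc="$2" log_file="${3:-}" tail_text=""
  [[ -n "$log_file" && -r "$log_file" ]] && tail_text=$(tail -n 20 "$log_file")
  printf 'exit_code=%s | %s\n---\n%s' "$code" "$desc" "$tail_text"
}

# Map a numeric exit code to one of the categories named in this commit.
categorize_error_sketch() {
  local code="$1"
  case "$code" in
    1 | 2) echo shell ;;
    28) echo timeout ;;
    139 | 141 | 143) echo signal ;;
    255) echo dependency ;;
    *)
      if (( code >= 150 && code <= 154 )); then echo service
      elif (( code >= 170 && code <= 193 )); then echo database
      elif (( code >= 200 && code <= 231 )); then echo proxmox
      elif (( code >= 243 && code <= 249 )); then echo runtime
      else echo unknown   # assumed fallback for unmapped codes
      fi
      ;;
  esac
}
```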
* fix(ui): improve error output formatting and remove redundant log paths
- error_handler: Use msg_info/msg_ok/msg_warn for container cleanup
instead of raw echo with manual ANSI codes
- error_handler: Add ❓ icon before 'Remove broken container?' prompt
- error_handler: Indent log output with TAB for visual consistency
- build.func: Use msg_custom for installation log path display
- build.func: Use msg_info → msg_ok for container removal flow
- build.func: Use msg_warn for 'kept for debugging' message
- core.func/vm-core.func: Remove redundant container-internal log
path display (📋 View full log) since combined log on host is
the canonical location shown after failure
Enhance post_update_to_api to support a "force" mode and robust retry logic: add a 3rd-arg bypass to duplicate suppression, capture a short error summary, and perform up to three POST attempts (full payload, shortened error payload, minimal payload) with HTTP code checks and small backoffs. Mark POST_UPDATE_DONE on success (or after three attempts) to avoid infinite retries. Also invoke post_update_to_api with the "force" flag from cleanup paths in build.func and error_handler.func so a final status update is attempted after cleanup.
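The shrinking-payload retry can be sketched as below. `post_with_fallbacks` and its callback argument are hypothetical: the callback takes a payload and prints an HTTP status code, standing in for the curl call in post_update_to_api.

```shell
# post_with_fallbacks SEND_FN PAYLOAD...: try each payload in order until one
# gets a 2xx response; payloads are expected to go from full to minimal.
post_with_fallbacks() {
  local send_fn="$1" payload code attempt=0
  shift
  for payload in "$@"; do
    attempt=$((attempt + 1))
    code=$("$send_fn" "$payload") || code="000"
    if [[ "$code" =~ ^2[0-9][0-9]$ ]]; then
      echo "ok attempt=$attempt"
      return 0
    fi
    sleep 1   # small backoff before the next, smaller payload
  done
  echo "gave up after $attempt attempts"
  return 1
}
```

In the real function, POST_UPDATE_DONE is set on success or once all three attempts are spent, so the caller never loops forever.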
The catch_errors() function in CT scripts overrides the API telemetry
traps set by build.func. This caused on_exit, on_interrupt, and
on_terminate to never call post_update_to_api, leaving telemetry
records permanently stuck on 'installing'.
Changes:
- on_exit: Report orphaned 'installing' records on ANY exit where
post_to_api was called but post_update_to_api was not
- on_interrupt: Call post_update_to_api('failed', '130') before exit
- on_terminate: Call post_update_to_api('failed', '143') before exit
All calls are guarded by POST_UPDATE_DONE flag to prevent duplicates.
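The guard logic can be sketched as follows. `final_report` and its callback parameter are hypothetical stand-ins for the handler bodies and for post_update_to_api; the two flag names come from this commit message.

```shell
POST_TO_API_DONE=${POST_TO_API_DONE:-false}
POST_UPDATE_DONE=${POST_UPDATE_DONE:-false}

# final_report EXIT_CODE REPORT_FN: send one last status if none was sent yet.
final_report() {
  local exit_code="$1" report_fn="$2"
  # Only act when a start record exists and no final update has gone out.
  [[ "$POST_TO_API_DONE" == true && "$POST_UPDATE_DONE" != true ]] || return 0
  if (( exit_code == 0 )); then
    "$report_fn" "done" 0
  else
    "$report_fn" "failed" "$exit_code"
  fi
  POST_UPDATE_DONE=true   # prevents duplicate reports from later handlers
}
```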
* Remove Go API and extend misc/api.func
Delete the Go-based API (api/main.go, api/go.mod, api/go.sum, api/.env.example) and significantly enhance misc/api.func. The shell telemetry file now includes telemetry configuration, repo source detection, GPU/CPU/RAM detection, expanded explain_exit_code mappings, and refactored post_to_api/post_to_api_vm to send non-blocking telemetry to telemetry.community-scripts.org while respecting DIAGNOSTICS/DEV_MODE and adding richer metadata (cpu/gpu/ram/repo_source). Also updates header/author info and improves privacy/robustness and error handling.
* Start install timer and refine error reporting
Call start_install_timer during build startup and overhaul exit/error reporting.
Changes:
- Invoke start_install_timer early in misc/build.func to track install duration.
- Update api_exit_script comments to reference PocketBase/api.func and adjust ERR/SIGINT/SIGTERM traps to post numeric exit codes (use $? / 130 / 143) instead of command strings.
- Replace the previous explain_exit_code implementation with a conditional fallback: only define explain_exit_code if not already provided (api.func is the canonical source). Expanded and reorganized exit code mappings (curl, timeout, systemd, Node/Python/Postgres/MySQL/MongoDB, Proxmox, etc.).
- In error_handler: stop echoing the container log path (host shows combined log), and post a "failed" update to the API with the exit code before offering container cleanup.
Rationale: these changes make telemetry more consistent and robust (numeric codes), provide a safe fallback for exit descriptions when api.func isn't loaded, and ensure failures are reported to the API prior to any automatic cleanup.
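The conditional fallback can be expressed with `declare -F`, which succeeds only when a function of that name already exists. A minimal sketch with a deliberately tiny table; the message texts are illustrative, not the canonical ones from api.func.

```shell
# Define a fallback only when api.func has not already provided the function.
if ! declare -F explain_exit_code >/dev/null 2>&1; then
  explain_exit_code() {
    case "$1" in
      1) echo "General error" ;;
      130) echo "Interrupted (SIGINT)" ;;    # 128 + signal 2
      143) echo "Terminated (SIGTERM)" ;;    # 128 + signal 15
      *) echo "Unknown error" ;;
    esac
  }
fi

explain_exit_code 143
```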
* Report install start/failure to telemetry API
Add telemetry hooks in misc/build.func: call post_to_api at installation start to capture early or immediately-failing installs, and call post_update_to_api with status "failed" and the install exit code when a container installation fails. This improves visibility into install failures for monitoring/telemetry.
* core: enhance storage type validation and error codes
Improve storage validation for LXC container creation with
explicit checks for unsupported storage types.
Changes:
- Add validation for storage types that don't support containers:
  - iscsidirect (exit 212) - VMs only
  - iscsi/zfs (exit 213) - no rootdir support
  - cephfs (exit 219) - use RBD instead
  - pbs (exit 224) - backups only
- Add connectivity check for network storage (linstor, rbd, nfs, cifs)
- Simplify storage content validation using pvesm status
- Reorganize Proxmox error codes (200-231) for consistency
- Update error messages to be more descriptive and actionable
This helps users identify storage compatibility issues early
before container creation fails with cryptic errors.
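The type-to-exit-code mapping above could be sketched as a single case statement. `storage_type_exit_code` is a hypothetical helper that prints the code instead of exiting, and the real check in the core functions may be structured differently. In Proxmox, the `zfs` type is ZFS over iSCSI, which is distinct from the local `zfspool` type.

```shell
# Print the exit code for a Proxmox storage type; 0 means no objection here.
storage_type_exit_code() {
  case "$1" in
    iscsidirect) echo 212 ;;   # VMs only
    iscsi | zfs) echo 213 ;;   # no rootdir support
    cephfs) echo 219 ;;        # use RBD instead
    pbs) echo 224 ;;           # backups only
    *) echo 0 ;;
  esac
}

storage_type_exit_code cephfs
```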
* Update build.func
* Refactor Core
Refactored misc/alpine-install.func to improve error handling, network checks, and MOTD setup. Added misc/alpine-tools.func and misc/error_handler.func for modular tool installation and error management. Enhanced misc/api.func with detailed exit code explanations and telemetry functions. Updated misc/core.func for better initialization, validation, and execution helpers. Removed misc/create_lxc.sh as part of cleanup.
* Delete config-file.func
* Update install.func
* Refactor stop_all_services function and variable names
Refactor service stopping logic and improve variable handling
* Refactor installation script and update copyright
Updated copyright information and adjusted package installation commands. Enhanced IPv6 disabling logic and improved container customization process.
* Update install.func
* Update license comment format in install.func
* Refactor IPv6 handling and enhance MOTD and SSH
Refactor IPv6 handling and update OS function. Enhance MOTD with additional details and configure SSH settings.
* big core refactor
* Enhance IPv6 configuration menu options
Updated IPv6 Address Management menu options for clarity and added a new option for fully disabling IPv6.
* Update default Node.js version to 24 LTS
* Update misc/alpine-tools.func
Co-authored-by: Michel Roegl-Brunner <73236783+michelroegl-brunner@users.noreply.github.com>
* indentation
* remove debugf and duplicate codes
* Update whiptail backtitles and error codes
Removed '[dev]' from whiptail --backtitle strings for consistency. Refactored custom exit codes in build.func and error_handler.func: updated Proxmox error codes, shifted MySQL/MariaDB codes to 260-263, and removed unused MongoDB code. Updated error descriptions to match new codes.
* comments
* Refactor error handling and clean up debug comments
Standardized bash variable checks, removed unnecessary debug and commented code, and clarified error handling logic in container build and setup scripts. These changes improve code readability and maintainability without altering functional behavior.
* Update build.func
* feat: Improve LXC network checks and LINSTOR storage handling
Enhanced LXC container network setup to check for both IPv4 and IPv6 addresses, added connectivity (ping) tests, and provided troubleshooting tips on failure. Updated storage validation to support LINSTOR, including cluster connectivity checks and special handling for LINSTOR template storage.
---------
Co-authored-by: Michel Roegl-Brunner <73236783+michelroegl-brunner@users.noreply.github.com>