core: Execution ID & Telemetry Improvements (#12041)

* fix: send telemetry BEFORE log collection in signal handlers

- Swap ensure_log_on_host/post_update_to_api order in on_interrupt, on_terminate, api_exit_script, and inline SIGHUP/SIGINT/SIGTERM traps
- For signal exits (>128): send telemetry immediately, then best-effort log collection
- Add 2>/dev/null || true to all I/O in signal handlers to prevent SIGPIPE
- Fix on_exit: exit_code=0 now reports 'done' instead of 'failed 1'
- Root cause: pct pull hangs on dying containers blocked telemetry updates, leaving 595+ records stuck in 'installing' daily

* feat: add execution_id to all telemetry payloads

- Generate EXECUTION_ID from RANDOM_UUID in variables()
- Export EXECUTION_ID to container environment
- Add execution_id field to all 8 API payloads in api.func
- Add execution_id to post_progress_to_api in install.func and alpine-install.func
- Fallback to RANDOM_UUID when EXECUTION_ID not set (backward compat)

* fix: correct telemetry type values for PVE and addon scripts

- PVE scripts (tools/pve/*): change type 'tool' -> 'pve'
- Addon scripts (tools/addon/*): fix 4 scripts that wrongly used 'tool' -> 'addon' (netdata, add-tailscale-lxc, add-netbird-lxc, all-templates)
- api.func: post_tool_to_api sends type='pve', default fallback 'pve'
- Aligns with PocketBase categories: lxc, vm, pve, addon

* fix: persist diagnostics opt-in inside containers for addon telemetry

- install.func + alpine-install.func: create /usr/local/community-scripts/diagnostics inside the container when DIAGNOSTICS=yes (from build.func export)
- Enables addon scripts running later inside containers to find the opt-in
- Update init_tool_telemetry default type from 'tool' to 'pve'

* refactor: clean up diagnostics/telemetry opt-in system

- diagnostics_check(): deduplicate heredoc (was 2x 22 lines), improve whiptail text with clear what/what-not collected, add telemetry + privacy links
- diagnostics_menu(): better UX with current status, clear enable/disable buttons, note about existing containers
- variables(): change DIAGNOSTICS default from 'yes' to 'no' (safe: no telemetry before user consents via diagnostics_check)
- install.func + alpine-install.func: persist BOTH yes AND no in container so opt-out is explicit (not just missing file = no)
- Fix typo 'menue' -> 'menu' in config file comments

* fix: no pre-selection in telemetry dialog, link to telemetry-service README

- Add --defaultno so 'No, opt out' is focused by default (user must Tab to Yes)
- Change privacy link from discussions/1836 to telemetry-service#privacy--compliance

* fix: use radiolist for telemetry dialog (no pre-selection)

- Replace --yesno with --radiolist: user must actively SPACE-select an option
- Both options start as OFF (no pre-selection)
- Cancel/Exit defaults to 'no' (opt-out)

* simplify: inline telemetry dialog text like other whiptail dialogs

* improve: telemetry dialog with more detail, link to PRIVACY.md

- Add what we collect / don't collect sections back to dialog
- Link to telemetry-service/docs/PRIVACY.md instead of README anchor
- Update config file comment with same link
2026-02-25 06:25:56 +01:00 · 2026-02-18 10:24:06 +01:00
parent b4a5d28957
commit b439960222
37 changed files with 183 additions and 137 deletions
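
The commit message says payloads fall back to RANDOM_UUID when EXECUTION_ID is not set. The api.func hunks are not shown here, so the following is only a minimal sketch of how such a fallback can be written with nested parameter expansion. The helper name `resolve_execution_id` and the UUID values are hypothetical placeholders, not code from the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from api.func): prefer EXECUTION_ID and
# fall back to RANDOM_UUID when it is unset or empty.
resolve_execution_id() {
  printf '%s' "${EXECUTION_ID:-${RANDOM_UUID:-}}"
}

# Placeholder values for illustration only.
RANDOM_UUID="11111111-2222-3333-4444-555555555555"

# Older caller that never set EXECUTION_ID: the fallback applies.
unset EXECUTION_ID
echo "execution_id=$(resolve_execution_id)"

# Newer caller that exports EXECUTION_ID: it takes precedence.
EXECUTION_ID="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
echo "execution_id=$(resolve_execution_id)"
```

Using `:-` rather than `-` means an empty EXECUTION_ID is treated the same as an unset one, which is one plausible reading of "backward compat" for scripts that export the variable without a value.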
--- a/misc/error_handler.func
+++ b/misc/error_handler.func
@@ -329,6 +329,8 @@ error_handler() {
 # - Cleans up lock files if lockfile variable is set
 # - Exits with captured exit code
 # - Always runs on script termination (success or failure)
+# - For signal exits (>128): sends telemetry FIRST before log collection
+#   to prevent pct pull hangs from blocking status updates
 # ------------------------------------------------------------------------------
 on_exit() {
  local exit_code=$?
@@ -337,14 +339,24 @@ on_exit() {
  # post_to_api was called ("installing" sent) but post_update_to_api was never called
  if [[ "${POST_TO_API_DONE:-}" == "true" && "${POST_UPDATE_DONE:-}" != "true" ]]; then
    if declare -f post_update_to_api >/dev/null 2>&1; then
-      # Ensure log is accessible on host before reporting
-      if declare -f ensure_log_on_host >/dev/null 2>&1; then
-        ensure_log_on_host
-      fi
-      if [[ $exit_code -ne 0 ]]; then
-        post_update_to_api "failed" "$exit_code"
+      if [[ $exit_code -gt 128 ]]; then
+        # Signal exit: send telemetry IMMEDIATELY (container may be dying, pct pull could hang)
+        post_update_to_api "failed" "$exit_code" 2>/dev/null || true
+        # Then try log collection (non-critical, best-effort)
+        if declare -f ensure_log_on_host >/dev/null 2>&1; then
+          ensure_log_on_host 2>/dev/null || true
+        fi
      else
-        post_update_to_api "failed" "1"
+        # Normal exit: collect logs first for better error details
+        if declare -f ensure_log_on_host >/dev/null 2>&1; then
+          ensure_log_on_host 2>/dev/null || true
+        fi
+        if [[ $exit_code -ne 0 ]]; then
+          post_update_to_api "failed" "$exit_code"
+        else
+          # exit_code=0 is never an error — report as success
+          post_update_to_api "done" "0"
+        fi
      fi
    fi
  fi
@@ -356,22 +368,26 @@ on_exit() {
 # on_interrupt()
 #
 # - SIGINT (Ctrl+C) trap handler
+# - Reports to telemetry FIRST (time-critical: container may be dying)
 # - Displays "Interrupted by user" message
 # - Exits with code 130 (128 + SIGINT=2)
+# - Output redirected to /dev/null fallback to prevent SIGPIPE on closed terminals
 # ------------------------------------------------------------------------------
 on_interrupt() {
-  # Ensure log is accessible on host before reporting
-  if declare -f ensure_log_on_host >/dev/null 2>&1; then
-    ensure_log_on_host
-  fi
-  # Report interruption to telemetry API (prevents stuck "installing" records)
+  # CRITICAL: Send telemetry FIRST before any cleanup or output
+  # If ensure_log_on_host hangs (e.g. pct pull on dying container),
+  # the status update would never be sent, leaving records stuck in "installing"
  if declare -f post_update_to_api >/dev/null 2>&1; then
-    post_update_to_api "failed" "130"
+    post_update_to_api "failed" "130" 2>/dev/null || true
+  fi
+  # Best-effort log collection (non-critical after telemetry is sent)
+  if declare -f ensure_log_on_host >/dev/null 2>&1; then
+    ensure_log_on_host 2>/dev/null || true
  fi
  if declare -f msg_error >/dev/null 2>&1; then
-    msg_error "Interrupted by user (SIGINT)"
+    msg_error "Interrupted by user (SIGINT)" 2>/dev/null || true
  else
-    echo -e "\n${RD}Interrupted by user (SIGINT)${CL}"
+    echo -e "\n${RD}Interrupted by user (SIGINT)${CL}" 2>/dev/null || true
  fi
  exit 130
 }
@@ -380,23 +396,27 @@ on_interrupt() {
 # on_terminate()
 #
 # - SIGTERM trap handler
+# - Reports to telemetry FIRST (time-critical: process being killed)
 # - Displays "Terminated by signal" message
 # - Exits with code 143 (128 + SIGTERM=15)
 # - Triggered by external process termination
+# - Output redirected to /dev/null fallback to prevent SIGPIPE on closed terminals
 # ------------------------------------------------------------------------------
 on_terminate() {
-  # Ensure log is accessible on host before reporting
-  if declare -f ensure_log_on_host >/dev/null 2>&1; then
-    ensure_log_on_host
-  fi
-  # Report termination to telemetry API (prevents stuck "installing" records)
+  # CRITICAL: Send telemetry FIRST before any cleanup or output
+  # Same rationale as on_interrupt: ensure status gets reported even if
+  # ensure_log_on_host hangs or terminal is already closed
  if declare -f post_update_to_api >/dev/null 2>&1; then
-    post_update_to_api "failed" "143"
+    post_update_to_api "failed" "143" 2>/dev/null || true
+  fi
+  # Best-effort log collection (non-critical after telemetry is sent)
+  if declare -f ensure_log_on_host >/dev/null 2>&1; then
+    ensure_log_on_host 2>/dev/null || true
  fi
  if declare -f msg_error >/dev/null 2>&1; then
-    msg_error "Terminated by signal (SIGTERM)"
+    msg_error "Terminated by signal (SIGTERM)" 2>/dev/null || true
  else
-    echo -e "\n${RD}Terminated by signal (SIGTERM)${CL}"
+    echo -e "\n${RD}Terminated by signal (SIGTERM)${CL}" 2>/dev/null || true
  fi
  exit 143
 }
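
The ordering introduced in the on_exit hunk above can be checked in isolation. The sketch below assumes stub versions of `post_update_to_api` and `ensure_log_on_host` that only record the order in which they are called; `report_exit` and `CALLS` are illustrative names that do not exist in error_handler.func, and the lock-file cleanup and POST_TO_API_DONE guards of the real handler are left out.

```shell
#!/usr/bin/env bash
# Stubs standing in for the real helpers; each appends a marker to CALLS.
CALLS=""
post_update_to_api() { CALLS="${CALLS}api:$1:$2 "; }
ensure_log_on_host() { CALLS="${CALLS}log "; }

# Illustrative reduction of the on_exit decision logic from the diff.
report_exit() {
  rc=$1
  if [ "$rc" -gt 128 ]; then
    # Signal exit (128 + signal number): send the status update first,
    # then attempt log collection as a best-effort step.
    post_update_to_api "failed" "$rc" 2>/dev/null || true
    ensure_log_on_host 2>/dev/null || true
  else
    # Normal exit: collect logs first so the report can carry details.
    ensure_log_on_host 2>/dev/null || true
    if [ "$rc" -ne 0 ]; then
      post_update_to_api "failed" "$rc"
    else
      post_update_to_api "done" "0"
    fi
  fi
}

# Print the recorded call order for a clean exit, a plain failure,
# SIGINT (130) and SIGTERM (143).
for code in 0 1 130 143; do
  CALLS=""
  report_exit "$code"
  echo "exit $code -> $CALLS"
done
```

In this sketch the `api:` marker appears before `log` only for codes above 128, which is the property the commit relies on: a hanging log collection step can no longer prevent the status update from being sent.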