I'm going to assume that you have already thought this through, and that you have strong reason to believe your program will be more stable if it retries after SIGSEGV. Keep in mind that segfaults point to problems with dangling pointers and other abuses that may also overwrite unpredictable locations in your process's address space without segfaulting.
Since you have considered this with extreme care, and you have determined (somehow) that the particular way your application segfaults cannot mask corruption of the bookkeeping data used to cancel and restart the threads, and that you have perfect cancellation logic for these threads (also highly unusual), let's get on with solving the problem.
On Linux, the SIGSEGV handler executes in the thread of the faulting instruction (man 7 signal). We avoid calling pthread_self() in the handler, since it has not traditionally been listed as async-signal-safe, but the internet at large seems to agree that syscall (man 2 syscall) is safe, so we can get the thread ID by calling syscall with SYS_gettid. Therefore, we will maintain a mapping from pthread_t (pthread_self()) to pid_t (gettid()). Since write() is also safe, we can trap SEGV, write the current thread ID down a pipe, and then pause until pthread_cancel terminates us.
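The handler side of that idea can be sketched on its own, before the full program. This is a minimal sketch, not the full program: `report_pipe`, `report_tid_handler` and `demo_report_tid` are hypothetical names, and the signal is delivered with raise() rather than by a real fault, so that the example can be run safely.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical pipe: [0] is read by the monitor, [1] is written by the handler. */
static int report_pipe[2];

/* Uses only syscall() and write(), which the text above assumes are safe here. */
static void report_tid_handler(int signum)
{
    int saved_errno = errno;
    pid_t tid = (pid_t) syscall(SYS_gettid);
    ssize_t n = write(report_pipe[1], &tid, sizeof(tid));
    (void) n;
    (void) signum;
    errno = saved_errno;
}

/* Installs the handler for signum, raises the signal in the calling thread,
 * and returns the thread ID read back from the pipe, or -1 on failure. */
static pid_t demo_report_tid(int signum)
{
    struct sigaction sa;
    pid_t reported = -1;

    if (pipe(report_pipe) == -1)
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = report_tid_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signum, &sa, NULL) == -1)
        return -1;

    raise(signum); /* the handler runs in this thread before raise() returns */

    if (read(report_pipe[0], &reported, sizeof(reported)) != (ssize_t) sizeof(reported))
        reported = -1;
    close(report_pipe[0]);
    close(report_pipe[1]);
    return reported;
}
```

Calling `demo_report_tid(SIGUSR1)` from any thread should return that thread's own kernel thread ID. A real SIGSEGV handler must not simply return as this one does, since the faulting instruction would be re-executed.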
We also need a monitor thread to keep an eye out for when things go pear-shaped. The monitor thread watches the read end of the pipe for information about the terminated thread and can then restart it.
Because I think pretending to handle SIGSEGV is daft, I'm going to give the structures that do so names like daft_thread_t. someone_please_fix_me represents your broken code. The monitor thread is main(). When a thread segfaults, the signal handler catches it and writes the thread's ID down the pipe; the monitor reads the pipe, cancels the thread with pthread_cancel and pthread_join, and restarts it.
    #include <assert.h>
    #include <errno.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define MAX_DAFT_THREADS (1024) // arbitrary

    // For calls that return -1 and set errno.
    #define CHECK_OSCALL(call, onfail) { \
        if ((call) == -1) { \
            char buf[512]; \
            strerror_r(errno, buf, sizeof(buf)); \
            fprintf(stderr, "%s@%d failed: %s\n", __FILE__, __LINE__, buf); \
            onfail; \
        } \
    }

    // For pthread_* calls, which return an error number instead of setting errno.
    #define CHECK_PTHREAD(call, onfail) { \
        int rc_ = (call); \
        if (rc_ != 0) { \
            char buf[512]; \
            strerror_r(rc_, buf, sizeof(buf)); \
            fprintf(stderr, "%s@%d failed: %s\n", __FILE__, __LINE__, buf); \
            onfail; \
        } \
    }

    // Defined before first use; newer glibc versions may also declare gettid().
    pid_t gettid(void)
    {
        return (pid_t) syscall(SYS_gettid);
    }

    /*********************** daft thread accounting *****************/

    typedef void* (*threadproc_t)(void* arg);

    struct daft_thread_t {
        threadproc_t start_routine;
        void* start_routine_arg;
        pthread_t pthread;
        pid_t tid;
    };

    struct daft_thread_accounting_info_t {
        int monitor_pipe[2];
        pthread_mutex_t info_lock;
        size_t daft_thread_count;
        struct daft_thread_t daft_threads[MAX_DAFT_THREADS];
    };

    static struct daft_thread_accounting_info_t g_thread_accounting;

    void daft_thread_accounting_info_init(struct daft_thread_accounting_info_t* inf)
    {
        memset(inf, 0, sizeof(*inf));
        pthread_mutex_init(&inf->info_lock, NULL);
        CHECK_OSCALL(pipe(inf->monitor_pipe), abort());
    }

    struct daft_thread_wrapper_data_t {
        struct daft_thread_t* thread_info;
    };

    static void* daft_thread_wrapper(void* arg)
    {
        struct daft_thread_t* wrapper = arg;

        // Record the kernel thread ID so the monitor can map it back to us.
        pthread_mutex_lock(&g_thread_accounting.info_lock);
        wrapper->tid = gettid();
        pthread_mutex_unlock(&g_thread_accounting.info_lock);

        return (*wrapper->start_routine)(wrapper->start_routine_arg);
    }

    static void start_daft_thread(threadproc_t proc, void* arg)
    {
        struct daft_thread_t* info;

        pthread_mutex_lock(&g_thread_accounting.info_lock);
        assert (g_thread_accounting.daft_thread_count < MAX_DAFT_THREADS);
        info = &g_thread_accounting.daft_threads[g_thread_accounting.daft_thread_count++];
        pthread_mutex_unlock(&g_thread_accounting.info_lock);

        info->start_routine = proc;
        info->start_routine_arg = arg;
        CHECK_PTHREAD(pthread_create(&info->pthread, NULL, daft_thread_wrapper, info), abort());
    }

    static struct daft_thread_t* find_thread_by_tid(pid_t thread_id)
    {
        size_t k;
        struct daft_thread_t* info = NULL;

        pthread_mutex_lock(&g_thread_accounting.info_lock);
        for (k = 0; k < g_thread_accounting.daft_thread_count; ++k) {
            if (g_thread_accounting.daft_threads[k].tid == thread_id) {
                info = &g_thread_accounting.daft_threads[k];
                break;
            }
        }
        pthread_mutex_unlock(&g_thread_accounting.info_lock);
        return info;
    }

    static void restart_daft_thread(struct daft_thread_t* info)
    {
        void* unused;

        CHECK_PTHREAD(pthread_cancel(info->pthread), abort());
        CHECK_PTHREAD(pthread_join(info->pthread, &unused), abort());
        info->tid = 0;
        CHECK_PTHREAD(pthread_create(&info->pthread, NULL, daft_thread_wrapper, info), abort());
    }

    /************* signal handling stuff **************/

    struct sigdeath_notify_info {
        int signum;
        pid_t tid;
    };

    static void sigdeath_handler(int signum, siginfo_t* info, void* ctx)
    {
        ssize_t z;
        struct sigdeath_notify_info inf = {
            .signum = signum,
            .tid = gettid()
        };
        (void) info;
        (void) ctx;

        z = write(g_thread_accounting.monitor_pipe[1], &inf, sizeof(inf));
        assert (z == (ssize_t) sizeof(inf)); // or else SIGABRT. Are we handling that too? Hope not.

        // Returning doesn't do us any good; wait here until we are cancelled.
        for (;;) {
            pause();
        }
    }

    static void register_signal_handlers(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sigemptyset(&sa.sa_mask);
        sa.sa_sigaction = sigdeath_handler;
        sa.sa_flags = SA_SIGINFO;
        CHECK_OSCALL(sigaction(SIGSEGV, &sa, NULL), abort());
        CHECK_OSCALL(sigaction(SIGBUS, &sa, NULL), abort());
    }

    /** This is the code that segfaults randomly. Kwality with a 'k'. */
    static void* someone_please_fix_me(void* arg)
    {
        char* i_think_this_address_looks_nice = (char*) 42;
        (void) arg;

        sleep(1 + rand() % 200);
        i_think_this_address_looks_nice[0] = 'q'; // ugh
        return NULL;
    }

    // main() will serve as the monitor thread here
    int main(void)
    {
        int k;
        struct sigdeath_notify_info death;

        daft_thread_accounting_info_init(&g_thread_accounting);
        register_signal_handlers();

        for (k = 0; k < 200; ++k) {
            start_daft_thread(someone_please_fix_me, (void*) (intptr_t) k);
        }

        while (read(g_thread_accounting.monitor_pipe[0], &death, sizeof(death)) == sizeof(death)) {
            struct daft_thread_t* info = find_thread_by_tid(death.tid);
            if (info == NULL) {
                fprintf(stderr, "*** thread_id %d not found\n", (int) death.tid);
                continue;
            }
            fprintf(stderr, "Thread %d (%d) died of %d, restarting.\n",
                    (int) death.tid, (int) (intptr_t) info->start_routine_arg, death.signum);
            restart_daft_thread(info);
        }

        fprintf(stderr, "Shouldn't get here.\n");
        return 0;
    }
If you have not thought this through: attempting to recover from SIGSEGV is extremely risky, and I strongly advise against it. Threads share an address space. A thread that segfaults may also have corrupted other threads' data or global bookkeeping data, such as malloc()'s. A safer approach, assuming the faulty code is irreparably broken but must be used, is to quarantine it in a separate process, for example by calling fork() before invoking the broken code. You then have to catch SIGCLD and handle the child crashing or exiting normally, along with a number of other pitfalls, but at least you need not worry about random corruption. Of course, the best option is to fix the bloody code so that you do not see segfaults at all.
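The quarantine idea might look roughly like this. It is a minimal sketch under some assumptions: `run_quarantined`, `risky_fn_t`, `well_behaved` and `crashes` are hypothetical names, the parent blocks in waitpid() instead of installing a SIGCLD handler, and the crash is simulated with raise(SIGSEGV) rather than a real bad pointer. Getting results back from the child (via a pipe or shared memory, say) is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*risky_fn_t)(void); /* hypothetical signature for the broken code */

/* Runs fn in a child process so that a crash cannot corrupt the parent.
 * Returns 0 if the child exited normally with status 0, the signal number
 * if the child was killed by a signal, and -1 for any other outcome. */
static int run_quarantined(risky_fn_t fn)
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        fn();
        _exit(0); /* skip atexit handlers and stdio flushing inherited from the parent */
    }

    /* Parent: wait for the child, retrying if interrupted by a signal. */
    while (waitpid(child, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}

/* Stand-ins for real code: one that behaves, one that dies of SIGSEGV. */
static void well_behaved(void) { }
static void crashes(void) { raise(SIGSEGV); }
```

With these stand-ins, `run_quarantined(well_behaved)` returns 0 and `run_quarantined(crashes)` returns SIGSEGV, while the parent carries on with its own memory untouched. Be careful calling fork() from a multithreaded parent: only the calling thread exists in the child, so the child should restrict itself to async-signal-safe work or exec a separate program.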