Error handling in eConsoleAppContainer
prl 13 Nov 2018
There are some problems with error handling in eConsoleAppContainer.
- If the execvp() call in bidirpipe() returns (e.g. fails with ENOENT or EPERM), the child process prints "[eConsoleAppContainer] Finished <cmdname>" to the debug log when it was actually not possible to run the command, and exits the child process with status 0, when it should exit the child with non-zero status. Exiting with status 127 would be compatible with the status returned from "sh -c" when the command isn't found.
- If the child is terminated with a signal, the appClosed() PSignal1 is invoked with a retval of 0, unless the process was terminated by a SIGKILL that was sent by eConsoleAppContainer::kill(). If the child is terminated by a SIGKILL from any other source, retval is 0. Similarly retval is 0 if the child is terminated by SIGINT, even if the SIGINT was sent by eConsoleAppContainer::sendCtrlC().
I propose that if the execvp() call in bidirpipe() returns, the message printed to the debug lok should be something like "[eConsoleAppContainer] <cmdname> failed to execute: <perror string>", and the call of _exit() shoud be changed from _exit(0) to _exit(127).
I think that the retval argument to the appClosed() PSignal1 should be the child's exit status if the child exited normally, and -signalvalue if the child was terminated by a signal. I have run through most of the uses of eConsoleAppContainer in the Beyonwiz enigma2 code and I can't see any problem with this change (retval is mostly ignored or tested for being non-zero). The only issue I see is if there is code that explicitly tests for retval == -1 to detect termination of the child by eConsoleAppContainer::kill().
betacentauri 15 Nov 2018
To 1: Why return 127? I see no added value to use 127. Just return that an error occured.
I would propose something like this (untested!!):
diff --git a/lib/base/console.cpp b/lib/base/console.cpp index 3e97200..74acf73 100644 --- a/lib/base/console.cpp +++ b/lib/base/console.cpp @@ -15,6 +15,7 @@ int bidirpipe(int pfd[], const char *cmd , const char * const argv[], const char int pfdout[2]; /* from parent to child */ int pfderr[2]; /* stderr from child to parent */ int pid; /* child's pid */ + int ret; if ( pipe(pfdin) == -1 || pipe(pfdout) == -1 || pipe(pfderr) == -1) return(-1); @@ -41,8 +42,13 @@ int bidirpipe(int pfd[], const char *cmd , const char * const argv[], const char if (cwd && chdir(cwd) < 0) eDebug("[eConsoleAppContainer] failed to change directory to %s (%m)", cwd); - execvp(cmd, (char * const *)argv); + ret = execvp(cmd, (char * const *)argv); /* the vfork will actually suspend the parent thread until execvp is called. thus it's ok to use the sh + if (ret == -1) + { + // print error message + return ret; + } eDebug("[eConsoleAppContainer] Finished %s", cmd); _exit(0); } @@ -115,7 +121,6 @@ int eConsoleAppContainer::execute(const char *cmdline, const char * const argv[] pid = bidirpipe(fd, cmdline, argv, m_cwd.empty() ? 0 : m_cwd.c_str()); if ( pid == -1 ) { - eDebug("[eConsoleAppContainer] failed to start %s", cmdline); return -3; }
betacentauri 15 Nov 2018
Is the return value important? Why not just use WIFSIGNALED to know that a signal terminated the child and then return -1 to be compatible with the current code?
Erik Slagter 15 Nov 2018
Is it relevant to know whether the proces was aborted due to a signal or due to an exit? Personally I don't think so, the only thing that's relevant is whether it succeeded or not. Please note that stuff like segmentation faults and division by zero conditions are also delivered as signal (although they mostly mean there is a bug in the program).
I agree the current code can't work, it won't detect a process resulting in an error. It will only detect a failed fork (e.g. due to insufficient resources). I don't know what bidirpipe does here though. Does it only create a bidirectional pipe, like the name suggests, or does it also system/exec/etc a new process? That would be strange, because that's already done at this point? Why then is the "cmdline" and "argv" passed?
BTW the Linux manual says: The use of vfork() is unnecessary on Linux and also prone to all sorts of race conditions and undefined behaviour. I'd say, get rid of it and use fork() instead and secure defined behaviour. The parent will need to fetch the pid of the child though, and wait() for it.
MiLo 15 Nov 2018
BTW the Linux manual says: The use of vfork() is unnecessary on Linux and also prone to all sorts of race conditions and undefined behaviour. I'd say, get rid of it and use fork() instead and secure defined behaviour. The parent will need to fetch the pid of the child though, and wait() for it.
The reason that vfork is being used here is that fork() will typically run out of address space. Enigma2, if it has been running for a few weeks, has allocated a huge amount of memory pages, and a fork() at that point will run out of address space because of that. That's why python's os.system() is also unreliable, it also uses the normal fork, and thus it usually works when you quickly test it, but it'll invariably fail on systems that have been up and running for a long time.
littlesat 15 Nov 2018
Edited by littlesat, 15 November 2018 - 19:49.
WanWizard 15 Nov 2018
It takes some effort (and sometimes abuse) to do the right thing. But in the end, we're in charge.
littlesat 15 Nov 2018
Erik Slagter 15 Nov 2018
BTW the Linux manual says: The use of vfork() is unnecessary on Linux and also prone to all sorts of race conditions and undefined behaviour. I'd say, get rid of it and use fork() instead and secure defined behaviour. The parent will need to fetch the pid of the child though, and wait() for it.
The reason that vfork is being used here is that fork() will typically run out of address space. Enigma2, if it has been running for a few weeks, has allocated a huge amount of memory pages, and a fork() at that point will run out of address space because of that. That's why python's os.system() is also unreliable, it also uses the normal fork, and thus it usually works when you quickly test it, but it'll invariably fail on systems that have been up and running for a long time.
Following your reasoning enigma2 has some huge memory leaks. I guess we'll have to address that then.
Read the manual, it says Linux uses copy-on-write so only the descriptors are copied, not the pages themselves. Linux has been doing that for ages. So as long as you follow the fork() by an immediate exec(), the netto result is the same, but then without the great risk on ondefined behaviours and race conditions. The use of vfork() is discouraged on Linux.
Erik Slagter 15 Nov 2018
That is right... but at the end I can completely redo it... (like e.g. flash image via the ui....)
No, they should do it, not you.
littlesat 15 Nov 2018
Now on topic... is there really something bad about the eConsoleAppContainer
Edited by littlesat, 15 November 2018 - 21:18.
samsamsam 15 Nov 2018
Edited by samsamsam, 15 November 2018 - 22:11.
prl 16 Nov 2018
To 1: Why return 127? I see no added value to use 127. Just return that an error occured.
I would propose something like this (untested!!):
diff --git a/lib/base/console.cpp b/lib/base/console.cpp index 3e97200..74acf73 100644 --- a/lib/base/console.cpp +++ b/lib/base/console.cpp @@ -15,6 +15,7 @@ int bidirpipe(int pfd[], const char *cmd , const char * const argv[], const char int pfdout[2]; /* from parent to child */ int pfderr[2]; /* stderr from child to parent */ int pid; /* child's pid */ + int ret; if ( pipe(pfdin) == -1 || pipe(pfdout) == -1 || pipe(pfderr) == -1) return(-1); @@ -41,8 +42,13 @@ int bidirpipe(int pfd[], const char *cmd , const char * const argv[], const char if (cwd && chdir(cwd) < 0) eDebug("[eConsoleAppContainer] failed to change directory to %s (%m)", cwd); - execvp(cmd, (char * const *)argv); + ret = execvp(cmd, (char * const *)argv); /* the vfork will actually suspend the parent thread until execvp is called. thus it's ok to use the sh + if (ret == -1) + { + // print error message + return ret; + } eDebug("[eConsoleAppContainer] Finished %s", cmd); _exit(0); } @@ -115,7 +121,6 @@ int eConsoleAppContainer::execute(const char *cmdline, const char * const argv[] pid = bidirpipe(fd, cmdline, argv, m_cwd.empty() ? 0 : m_cwd.c_str()); if ( pid == -1 ) { - eDebug("[eConsoleAppContainer] failed to start %s", cmdline); return -3; }
That code appears to forget that inside the
if (pid == 0) /* child process */
statement, you're running in the child process, which is still a copy of the enigma2 process, since the execvp() failed. Unix exec calls *only* return if they fail. The correct thing to do IMO is to replace:
eDebug("[eConsoleAppContainer] Finished %s", cmd); _exit(0);
with:
eDebug("[eConsoleAppContainer] Failed to execute %s", cmd); _exit(127);
That allows the parent enigma2 process's waitpid() in eConsoleAppContainer::readyRead() to be able to detect that the exec failed. I'm not terribly fussed about the return code other than that it should be non-zero. I suggested 127 because that's what the shell uses as a return code in similar circumstances, i.e. the return code of the shell when you run
sh -c noSuchProgram
is 127:
prl$ sh -c noSuchProgram; echo $? sh: noSuchProgram: command not found 127 prl$
There is no need to check the return value of execvp(). If it returns, the execvp() failed. From The Fine Manual (execvp(3)): "The exec() functions return only if an error has occurred." It would make sense to also check errno and add the string name for the system error to the eDebug().
prl 16 Nov 2018
Is it relevant to know whether the proces was aborted due to a signal or due to an exit? Personally I don't think so, the only thing that's relevant is whether it succeeded or not. Please note that stuff like segmentation faults and division by zero conditions are also delivered as signal (although they mostly mean there is a bug in the program).
I agree the current code can't work, it won't detect a process resulting in an error. It will only detect a failed fork (e.g. due to insufficient resources). I don't know what bidirpipe does here though. Does it only create a bidirectional pipe, like the name suggests, or does it also system/exec/etc a new process? That would be strange, because that's already done at this point? Why then is the "cmdline" and "argv" passed?
The code at the moment fails to deliver any error indication in the appClosed() PSignal if the execvp() call fails (i.e. it tried to execute a non-existent program, or at least one that's not in $PATH) or if the program terminated abnormally by a signal.
I know that signals are used for both inter-process communication and for program internal errors like SIGSEGV and SIGFPE. But if they are not caught and handled by the program, they are still failures to run to completion, and should be flagged as an error in appClosed(). Currently an error on signal termination is only flagged ifthe child process is killed using eConsoleAppContainer::kill(), though not if it gets a SIGKILL from aiywhere else.
The birdirpipe() is an example of the classical Unix pattern: create pipes / fork / close unused pipe fds / exec named program in child process. The cmd and cmdline arguments respectively name the program to run and provide its runtime arguments.
If you're unsure how it all hangs together: vfork() (and fork()) duplicate the running process and so creates a new process (but vfork() does it in a more memory-efficient manner in most cases). Both processes remain runnable and executing the same code: initially they only differ by the return value of vfork() (or fork()).
The execvp() call (indeed all variants of exec()) then replaces the image of the currently running program (normally the child process) with a new program image, in the file named in the cmd argument in this case, and starts it running with its "commandline" arguments (passed in cmdline in this case).
If the execvp() call fails, the correct action almost always is to exit the child process with a non-zero exit status as soon as possible (after perhaps printing some sort of error message).
The issues I'm trying to address have nothing at all to do with whether the parent process uses vfork() or fork() to create the new process. The choice to use vfork() is the correct one, but for reasons that have nothing to do with what I'm trying to fix.
prl 16 Nov 2018
... is there really something bad about the eConsoleAppContainer
Yes: it fails to return an error status if the program can't be run, or if the program terminates because of a signal, other than SIGKILL sent by eConsoleAppContainer::kill().
prl 16 Nov 2018
To 2: You mean using WIFSIGNALED and WTERMSIG to get the signal and return this?
Yes, precisely (well, -WTERMSIG(stat)).
Is the return value important? Why not just use WIFSIGNALED to know that a signal terminated the child and then return -1 to be compatible with the current code?
The reason I suggest using -signum rather than -1 is that there seems to be no good reason to hide the signal number from the recipient(s) of the appClosed() PSignal.
Old code that does
def onAppClosed(self, retval): if retval: # error in child process else: # child process succeeded
Will still work.
Most uses of eConsoleAppContainer::execute() that I've looked at simply ignore the retval in the callback anyway.
The potential issue with using -signum stems from the fact that some code does things like:
retval = self.container.execute(... command arguments ...) if retval: self.onAppClosed(retval)
Where it's assumed that the values passed to onAppClosed() via the PSignal method don't overlap in incompatible ways with values returned by eConsoleAppContainer::execute().
littlesat 16 Nov 2018
Yes: it fails to return an error status if the program can't be run, or if the program terminates because of a signal, other than SIGKILL sent by eConsoleAppContainer::kill().
Thanks... sounds like this indeed need to be fixed... I even prefer that is always 'goes' through onAppClosed... an not using the retval at the execute command at al...
betacentauri 16 Nov 2018
To 1 again: Yes, I overlooked it.
But I still think using 127 is not a good idea. If somebody looks at it in 1 year he can't understand why 127 is used. An execvp failure can be caused by several issues like binary not there, binary is not executable, filesystem is mounted with noexec,... . Either return e.g. the errno or just use:
exit(EXIT_FAILURE)
To 2: For me ok. As far as I can see the extra information won't hurt.