About ctserv

This document describes the "ctserv" server for managing cross-threading jobs at BMERC. It is very much under construction, and is liable to be inconsistent, incoherent, and incomplete.

Why would users at BMERC who are not involved in cross-threading jobs want to know about this server? Because it offers the general user a way of suspending a running cross-threading job temporarily. While the job is suspended, one can get real work done without competing with Lisp for memory and CPU time. (Believe me, you would find it hard to compete with Lisp on that score.) You don't need to know the thread password, and we don't lose time or data (or not much, anyway).

Table of Contents

  1. About ctserv
  2. Table of Contents
  3. Using ctserv
    1. The job state machine (simplified)
    2. Suspending and resuming
    3. Sending requests to the server
  4. Configuring ctserv
    1. Security issues
    2. Directory structure
    3. Starting and stopping jobs
    4. Host file format
      1. Standard user-defined host options
      2. System-defined host options
    5. Job file format
      1. Standard user-defined job options
      2. Options defined by convention
      3. System-defined job options
    6. Creating a new job
    7. The job state machine (detailed)
    8. Substitution forms
    9. Script files and Lisp
  5. Maintaining ctserv
    1. Starting and stopping ctserv
  6. Known ctserv bugs

Using ctserv

Conceptually, ctserv is quite simple. For each host known to it, ctserv runs a series of jobs. Each job specifies a script along with other options; ctserv runs this script until it completes, and then moves on to the next. To record what is happening for each job, ctserv maintains a state machine. Each job (and hence each state machine) operates independently.

The job state machine (simplified)

The full state machine has around ten states. (Most of the hair, and therefore most of the additional state machine complexity, is a consequence of determining when the script has completed, when it has encountered a transient error and needs to be restarted, etc.) Only three states are of general interest. Transitions between these states occur principally due to explicit requests from users, sent via one of several suspension commands, in either emacs or the shell. But note that not all transitions are instantaneous. Requests that depend in turn on being in a certain state may therefore make the state machine internals visible.

Note that each host may have at most one "current" job. All other jobs assigned to the host will be idle (or something equivalent); they cannot be running or suspended. [This restriction may eventually be lifted, but it doesn't really make sense to try to run more than one Lisp job at a time, even on a multiprocessor box. Lisp jobs are intensely memory-hungry. -- rgr, 31-Jan-96.]
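
As a rough illustration, the three states of general interest can be modeled as a small transition table. This is a Python sketch, not the server's emacs Lisp code. The state names (idle, running, suspended) and the request verbs are taken from this document; the TRANSITIONS table and the next_state function are hypothetical, the effect of stop on a suspended job is an assumption, and the transient states of the full machine are deliberately left out.

```python
# Simplified three-state model of a ctserv job (illustrative only).
# Keys are (current state, request); values are the resulting state.
TRANSITIONS = {
    ("idle", "start"): "running",
    ("running", "suspend"): "suspended",
    ("suspended", "resume"): "running",
    ("running", "stop"): "idle",
    ("suspended", "stop"): "idle",   # assumption: stop also applies while suspended
}


def next_state(state, request):
    """Return the new state, or the old one if the request does not apply."""
    return TRANSITIONS.get((state, request), state)


state = "idle"
for request in ["start", "suspend", "resume", "stop"]:
    state = next_state(state, request)
    print(request, "->", state)
```

The sketch ignores the fact that several users may hold suspensions at once, so a single resume is treated as if it were the last one.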

Suspending and resuming

A suspend request for a given host causes that host's current threading job (if any) to be stopped temporarily. The suspension period has been defined by executive fiat as three hours; there is no way to request a different duration, though one can revoke a suspension before it expires by sending a resume request. After this period, the job is automatically restarted. The rationale behind automatic expiration after a fixed period is that it prevents the "I forgot to restart it on my way out the door" error, which can cause a whole night's (or weekend's) worth of work to be lost. Three hours is a compromise between requiring requests to be sent too frequently and losing many hours of potential computation if the interval is set too long.

More than one user may have a suspension request in effect at a time; the job will resume when the last request expires (or is revoked). (You could say that all sentences are served concurrently.) Requests are kept track of by email address. The rationale behind maintaining requests from multiple users is that one should not have to know whether one is the last person to leave before resuming. If other people are using that machine, then let them suspend it themselves.

15 minutes before a suspension request expires, you will receive a notice by email to that effect, reminding you to renew it if you so desire. It is better for all concerned to renew before the end of the suspension, so that the cross-threading job is not fruitlessly restarted and users are not pointlessly annoyed by being thrust into the background by a thrashing, paging behemoth.
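
The bookkeeping described above (one request per email address, a fixed three-hour period, a reminder 15 minutes before expiry, and a job that resumes only when the last request is gone) might be sketched as follows. This is a Python illustration under stated assumptions, not the server's implementation: the HostSuspensions class and its method names are hypothetical, the email addresses are placeholders, and treating a repeated suspend from the same user as a renewal is an assumption.

```python
from datetime import datetime, timedelta

SUSPENSION_PERIOD = timedelta(hours=3)   # fixed period described above
NOTICE_LEAD = timedelta(minutes=15)      # reminder goes out this long before expiry


class HostSuspensions:
    """Suspension requests for one host, keyed by requester email address."""

    def __init__(self):
        self.expiry_by_user = {}

    def suspend(self, email, now):
        # Assumption: a repeated request from the same user renews it.
        self.expiry_by_user[email] = now + SUSPENSION_PERIOD

    def resume(self, email):
        # Revoking affects only this user's own request.
        self.expiry_by_user.pop(email, None)

    def expire(self, now):
        # Drop requests whose period has run out.
        self.expiry_by_user = {
            user: when for user, when in self.expiry_by_user.items() if when > now
        }

    def is_suspended(self, now):
        # The host stays suspended while any request is still in effect.
        self.expire(now)
        return bool(self.expiry_by_user)

    def users_to_notify(self, now):
        # Users whose request expires within the next 15 minutes.
        return sorted(
            user for user, when in self.expiry_by_user.items()
            if now < when <= now + NOTICE_LEAD
        )


t0 = datetime(1996, 1, 31, 9, 0)
host = HostSuspensions()
host.suspend("alice@example.org", t0)
host.suspend("bob@example.org", t0 + timedelta(hours=1))
host.resume("bob@example.org")
# Alice's request is still in effect two hours in, so the host stays suspended.
print(host.is_suspended(t0 + timedelta(hours=2)))
```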

Since suspensions apply to the host, and not the job, it is possible to reassign that job to another host, and even to assign another job in its place, during a suspension. Such operations do not affect the status or duration of suspension requests. In particular, starting a job on a host with pending suspensions sends the state directly from idle to suspended.

Despite the automatic expiration feature, if you should happen to leave earlier than you expect, we would be grateful if you would send an explicit resume request to start the cross-threading job up again.

A word to the wise: The server logs all transactions, so anti-social uses of the server are readily detectable. We know where your office is.

Sending requests to the server

There are several ways of submitting requests to ctserv:

The M-x ctserv-request emacs command prompts for one or more requests on a single line in the minibuffer. Requests are of the form "verb noun" (though one could argue that "status" is not a verb). Any number of requests may be submitted at once, separated by semicolons.

For each invocation of any of the commands described above, a response message is generated and sent by email. Each individual request within the mail message has its response delimited by "*** verb noun" lines.
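
To make the request syntax concrete, here is a small Python sketch of how a request line could be split into individual requests, and how the "*** verb noun" delimiter for each response could be built. It is an illustration only: parse_requests and response_delimiter are hypothetical names, skipping empty requests is an assumption, and the real server does this in emacs Lisp.

```python
def parse_requests(line):
    """Split a request line into (verb, arguments) pairs."""
    requests = []
    for part in line.split(";"):
        words = part.split()
        if not words:          # assumption: empty requests are simply skipped
            continue
        requests.append((words[0], words[1:]))
    return requests


def response_delimiter(verb, args):
    """Build the '*** verb noun' line that introduces one response."""
    return " ".join(["***", verb] + args)


for verb, args in parse_requests("suspend sewall; status sewall"):
    print(response_delimiter(verb, args))
```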

NB: No request should take more than a few minutes to answer. If you do not get a response to your request after this time, it probably means that the server has crashed. Please send mail to Bob Rogers <rogers@darwin.bu.edu> to get it restarted (and please do not flood the queue with messages!).

The following requests are supported:

The suspend and resume requests always operate on hosts, though they may affect the hosts' current jobs. The start and stop requests operate on jobs, except that if a host is specified, the job is determined implicitly.

Configuring ctserv

This section is for those users in charge of keeping a cross-threading run going, and who therefore need to create and modify job files.

Security issues

These are summarized simply:
  1. Make sure that job and script files are owned by and may only be written by the thread user. This means that the directories in which they reside must also be writable only by thread. This goes especially for the jobs directory; the ability to create files in that directory is the ability to run arbitrary jobs under thread on our machines. [As a future enhancement, the server may insist on this before accepting a start request. -- rgr, 31-Jan-96.]
  2. Similarly, the top-level ctserv directory must be owned by thread, and not permit writing by anyone else. Otherwise, random users could start/stop the server (see the "Starting and stopping ctserv" section for more details).
  3. Make sure that, when writing a substitution form, you do not rescan any piece of text (even an email address) that came from a user request. Doing so opens up a trojan horse, by which a rogue user could cause the server to evaluate arbitrary code.
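
The first two points amount to an ownership and permission check on each file and on the directory that contains it. The following Python sketch shows one way such a check could look on a Unix system. The function names are hypothetical, the numeric uid of the thread user must be supplied by the caller, and nothing here is taken from the server's code.

```python
import os
import stat


def writable_only_by(path, owner_uid):
    """True if path is owned by owner_uid and not group- or world-writable."""
    info = os.stat(path)
    others_can_write = bool(info.st_mode & (stat.S_IWGRP | stat.S_IWOTH))
    return info.st_uid == owner_uid and not others_can_write


def job_file_is_safe(job_path, owner_uid):
    """Check both the job file and the directory that contains it."""
    directory = os.path.dirname(os.path.abspath(job_path))
    return (writable_only_by(job_path, owner_uid)
            and writable_only_by(directory, owner_uid))
```

Checking the directory as well as the file matters because the ability to create files in the jobs directory is itself the ability to run jobs.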

Directory structure

[to be filled in. -- rgr, 26-Jan-96.]

Starting and stopping jobs

The start request accepts a job or host name, or both, but it must have at least one, and it must be able to find the other from whichever argument it is given. (Both files are checked to see if a new version has appeared on disk, and are reverted as early as possible in the startup sequence so as to reflect their new contents.)
  1. If given both a host and a job, then ctserv attempts to start the indicated job on that host.
  2. If given only a job, then the following ways of finding a host are tried in this order:
    1. The job may have a current host (i.e. if it was recently run).
    2. Otherwise, if the job appears on the job-queue option of exactly one host, then it is considered to `belong' to that host.
    3. [It used to be that the job could specify the host option explicitly. I de-supported this because it was a little cheesy; the host option is normally bashed by ctserv, so requiring the user to specify it could be confusing. -- rgr, 8-Mar-96.]
    4. Otherwise, the request is in error.
  3. If given only a host, then ctserv first looks for a new host file on disk, reverting if found. Then, the following sources of potential jobs are checked in this order:
    1. If the host has a current job that is in either the idle or error state, it uses the current job.
    2. Else, it looks for the next idle job queued for that host (ignoring any queued jobs in the error state).
    3. Otherwise, the request is in error.
Once ctserv has a job and host, the following conditions must be met in order for the request to succeed:
  1. The host must not have another current job, or if it does, that job must not be running (i.e. not in the running or halting or suspended or suspending states).
  2. The job must not be running (by the same definition) on any other host.
  3. If there is a new version of the job file on disk (and there must be a new version if the job is currently in one of the completed or specification-error states), it is reverted, and its new values must be legitimate (i.e. reverting must not have put the job into the specification-error state).
If any suspension requests are in effect for the host, the job does not start right then, but transitions to the suspended state, from which it will be started subsequently when all suspensions expire.
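
The lookup rules above can be summarized in a short Python sketch. It is a model of the documented behavior rather than the server's code: the dictionaries used for hosts and jobs, the RequestError class, and the function names are all hypothetical, and the reverting of changed files and the handling of pending suspensions are left out.

```python
class RequestError(Exception):
    """Raised when a start request cannot be resolved or is not allowed."""


# States that count as "running" for the purposes of a start request.
ACTIVE_STATES = {"running", "halting", "suspended", "suspending"}


def find_host(job, jobs, hosts):
    """Find the host for a job that was given without a host."""
    current = jobs[job].get("host")
    if current:                                   # 1. the job's current host
        return current
    owners = [h for h, info in hosts.items() if job in info["job_queue"]]
    if len(owners) == 1:                          # 2. queued on exactly one host
        return owners[0]
    raise RequestError("cannot determine a host for job " + job)


def find_job(host, jobs, hosts):
    """Find the job for a host that was given without a job."""
    current = hosts[host].get("current_job")
    if current and jobs[current]["state"] in ("idle", "error"):
        return current                            # 1. restartable current job
    for name in hosts[host]["job_queue"]:
        if jobs[name]["state"] == "idle":         # 2. next idle queued job
            return name
    raise RequestError("no job available for host " + host)


def resolve_start(job, host, jobs, hosts):
    """Return the (job, host) pair a start request refers to, or raise."""
    if job is None and host is None:
        raise RequestError("start needs a job or a host")
    if host is None:
        host = find_host(job, jobs, hosts)
    elif job is None:
        job = find_job(host, jobs, hosts)
    current = hosts[host].get("current_job")
    if current and current != job and jobs[current]["state"] in ACTIVE_STATES:
        raise RequestError(current + " is still running on " + host)
    if jobs[job]["state"] in ACTIVE_STATES and jobs[job].get("host") != host:
        raise RequestError(job + " is running on " + jobs[job]["host"])
    return job, host


hosts = {"sewall": {"current_job": None, "job_queue": ["foo", "bar"]}}
jobs = {"foo": {"state": "completed", "host": None},
        "bar": {"state": "idle", "host": None}}
# foo is completed, so the next idle job queued for sewall is chosen.
print(resolve_start(None, "sewall", jobs, hosts))
```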

Stopping a job is simpler. If a host is specified, then its current job is clearly meant. [If I ever extend this to handle more than one job on a host at a time, then stopping a host will mean stop all jobs on that host. -- rgr, 1-Feb-96.]

When a job is explicitly stopped, ctserv does not try to start the next job on the job queue.

Host file format

Host files must live in the host directory (currently ~thread/code/ctserv/hosts/) and have file names of the form "hostname.host". They look suspiciously like email messages; there is a header section, followed by a body (the body is ignored). There can be no blank lines before or within the header. Headers may appear in any order, and are of the form "identifier: value". In accordance with RFC 822, the option identifier can include any character other than whitespace or colon, and the value can be split over several lines provided that the continuation lines are indented (tab is conventional) and are not entirely blank. The identifier must start in column one; there can be no leading whitespace. The resulting values are all stored internally as strings (unless otherwise converted into something else), and have all line breaks removed (i.e. each newline and its following leading whitespace are turned into a single blank). This means that you are likely to get bizarre results if you try to include comments anywhere; there is no way to include comments in the options, alas. [finish. -- rgr, 31-Jan-96.]
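
The header rules above are simple enough to sketch in a few lines of Python. This is an illustration of the documented format only, not the server's parser: parse_options is a hypothetical name, the sample option values are made up, and the behavior for duplicate identifiers (the later one wins here) is an assumption.

```python
def parse_options(text):
    """Parse the header section of a host or job file into a dict of strings."""
    options = {}
    name = None
    for line in text.split("\n"):
        if not line.strip():
            break                      # the first blank line ends the header
        if line[0] in " \t":
            if name is None:
                raise ValueError("continuation line before any option")
            # The newline plus leading whitespace collapses to a single blank.
            options[name] += " " + line.strip()
        else:
            name, sep, value = line.partition(":")
            if not sep or not name or any(c.isspace() for c in name):
                raise ValueError("malformed option line: " + line)
            options[name] = value.strip()
    return options


# Hypothetical file contents; only the job-queue option name appears in this document.
sample = "job-queue: foo\n\tbar baz\nnote: a made-up option\n\nignored body\n"
print(parse_options(sample))
```

Note how the continuation line is folded into the job-queue value with a single blank, which is why anything resembling a comment would end up inside the value.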

Standard user-defined host options

These are supplied by the user in the host file. [We may want to add some standard options to describe Unix dependencies, such as which lisp to run, and to define "policy", such as for restricting users who can submit certain requests concerning that host. -- rgr, 6-Feb-96.]

System-defined host options

These can be used in substitution forms, with the limitations mentioned. ctserv defines these when needed; attempting to define them in the host file could have unexpected consequences.

Job file format

Job files must live in the job directory (currently ~thread/code/ctserv/jobs/), have file names of the form "job-name.job", and are in the same format as host files.

Standard user-defined job options

The set of standard user-defined option identifiers for job files is described below. Most of them have reasonable defaults. Other options may be included as well, but are ignored by the server. They may be used in substitution forms for scripts and reply messages. [Unfortunately, spelling errors in standard option names go undetected. -- rgr, 25-Jan-96.]

See also the set of mail message headers [ref? -- rgr, 6-Feb-96]. These may be used in constructing replies.

Options defined by convention

These are not touched by the server (and will remain undefined if not specified in the job file). Their documentation here is for the purpose of establishing a convention.

System-defined job options

These can be used in substitution forms, with the limitations mentioned. ctserv defines these when needed; attempting to define them in the job file could have unexpected consequences.

Creating a new job

The simplest way of doing this is to copy an old job file and then edit it appropriately. When you are done (be sure to change the value of the file-name-token option!), write it out, and send a "start jobname" request to the server.

If there is already a job running on the targeted host, you should edit that host file, changing its job-queue option so that the new job is enqueued (either by putting the job name at the end of the list, or after the currently running job). The server will notice that the host file has changed on disk, and will revert it before checking to see what the next job should be.

Note that, through the magic of substitution forms, you can reuse the same script template for a series of jobs, using the options to customize the template.

The job state machine (detailed)

[Ought to have a graphic here. -- rgr, 25-Jan-96.]

Note that not all state transitions are instantaneous. Transitions that record the consequences of user actions (like running -> suspending) are effectively instantaneous, in the sense that an immediate status query (included as part of the same request) will show the change. Transitions that happen as a consequence of external actions (principally the suspending -> suspended transition, which happens when the process exits) will not be seen until after the message is processed, and so will appear to happen later. One cannot resume from the suspending state, so trying to suspend and resume in the same request (pointless, but theoretically legal) will not work. The user gets the message
     The foo job on sewall is suspending, and cannot be resumed
Such request combinations therefore make the state machine details visible, which unfortunately dilutes the point of showing the simplified state machine.
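
The difference between the two kinds of transition can be shown with a small Python model. It is a sketch under stated assumptions: apply_request and process_exited are hypothetical names, only the suspend and resume requests are modeled, and multiple simultaneous suspensions are ignored. The error text is the one quoted in this section.

```python
def apply_request(state, verb, job="foo", host="sewall"):
    """Apply one user request; return (new_state, error_message_or_None)."""
    if verb == "suspend" and state == "running":
        # Recorded at once, although the process has not exited yet.
        return "suspending", None
    if verb == "resume" and state == "suspending":
        return state, ("The %s job on %s is suspending, and cannot be resumed"
                       % (job, host))
    if verb == "resume" and state == "suspended":
        return "running", None
    return state, None


def process_exited(state):
    """External event: the script's process has gone away."""
    return "suspended" if state == "suspending" else state


state = "running"
for verb in ["suspend", "resume"]:           # both sent in the same request
    state, error = apply_request(state, verb)
    print(verb, "->", state, error or "")

state = process_exited(state)                # seen only after the message is processed
state, error = apply_request(state, "resume")
print("resume ->", state)
```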

Substitution forms

An important aspect of the psa-server technology is the ability to create customized output files from a template or form file. In the form, all "boilerplate" is entered as plain text, interspersed with substitution forms that are evaluated as emacs Lisp expressions and then inserted (depending on the form and the result, as detailed below). Substitution forms are of the format
     $(foo)
The entire form including the dollar sign is deleted, plus the following newline if the resulting line is empty, and then the form is evaluated according to the rules below.

A substitution form can be any of the following:

[finish: psa-value, ctserv-value, etc. -- rgr, 31-Jan-96.]

Many conditional expressions are of the form

     $(if (condition)
          (insert "something or other")
          (delete-char -1))
The delete-char is needed because deleting the form leaves an empty line (assuming the $ is in the first column). If (condition) is true, the insertion leaves "something or other" on a line of its own. If it is false, the usual intent is to leave out the "something or other" line altogether, so (delete-char -1) joins the empty line to the previous line. [This is automatically handled in the case of comments, and should be generalized. But I have shied away from doing so for compatibility reasons. -- rgr, 31-Jan-96.]

When inserting buffers and files, use psa-insert-buffer and psa-insert-file rather than insert-buffer and insert-file; the psa versions leave point at the end of the insertion, where you want it.

Script files and Lisp

Scripts run under ctserv are expected to do the following things:

1. They must do something useful with the enable-file option in order to suspend when required to.

2. They must output to the file named by the output-file option, so that post-mortem analyses of unexpected exits can use this output to try to figure out what happened.

3. They must be csh scripts (there is a kludge that uses the "/bin/csh -f script-file-name" command line to find the process id). [more stuff . . . -- rgr, 4-Mar-96.]

An important thing to remember (at least for CMU Common Lisp) is that the Lisp process' stderr must be redirected somewhere (to /dev/null if not to a file). If not, then Lisp gets a SIGPIPE error when (e.g.) the warn function tries to write to the stderr that got closed when the server exited.

Maintaining ctserv

This section is intended for emergencies, i.e. the server goes haywire and I'm not around to maintain it. If you don't need to know this, you don't want to read this.

Starting and stopping ctserv

The server normally runs on gamow. To start it, invoke "start-server ~/code/ctserv" on gamow (or the machine on which you wish the server itself to run). To stop it, invoke "stop-server ~/code/ctserv" (on any machine). You must be logged in as thread to do either of these (but "su thread" also works). Also, the full path (without a trailing slash!) must be given in the "server" argument to these scripts; defaulting from the current directory will not do (really, this is a psa-request bug). The scripts may be more reliable if invoked on a Sparcstation; I haven't tested them on the Alphas.

At present, all jobs that do not actually have a process running come up in the idle state. It should treat those jobs that still have enable files as if they had randomly exited, and implicitly restart those that can be started. But the present behavior is user-friendly.

The server is implemented using psa-request technology, and uses code in the ~psa/bin/ directory. See the ~psa/bin/README file for more details on psa-request server maintenance.

Known ctserv bugs

Bugs that are still current are marked by "***"; other bugs are kept for historic reasons.
  1. The output from the server-status command comes out in the wrong order. I think this may be an emacs bug in nested I/O redirection. -- rgr, 26-Jan-96. [Wrong; it was an inappropriate use of with-output-to-temp-buffer, fixed by creating a temp buffer and explicitly binding standard-output to it. -- rgr, 15-Feb-96.]
  2. *** Transient restart for cross-threading jobs is still being implemented. At present, there is a hack that checks the error message, and even treats a segmentation violation as transient if there has been at least one successful threading since Lisp started. Parameterization is poor; the present solution has a great deal of CMU-CL dependence. I suspect this will be a matter of successive approximation. -- rgr, 26-Jan-96.
  3. *** If you run the server interactively (in testing mode), then quit and run it in the background, the interactive version will not notice any changes in job state if you then restart it. To work around this, one must delete all host and job buffers before restarting. (Probably not of interest to anyone but me.) -- rgr, 26-Jan-96.
  4. All suspension information for all hosts is lost when the server goes down. -- rgr, 26-Jan-96. [Fixed. -- rgr, 15-Feb-96.]
  5. *** One cannot resume from the suspending state. Since this is a short-lived transient state, this should not be much of a problem. Fixing it requires writing code to handle process exit in the suspending state. -- rgr, 26-Jan-96.
  6. If you start a completed job (that found its way into the idle state when the server came back up again) without giving it new work to do, it will spin its wheels uselessly. -- rgr, 26-Jan-96. [I think I have fixed this now. -- rgr, 13-Feb-96.]
  7. Piddling little text problems. [nothing pending. -- rgr, 31-Jan-96.]
  8. If a script dies immediately on startup, ctserv doesn't notice. The state goes to idle, the enable file still exists. [I think I've fixed this by always transitioning to running; subsequent failure to find the process should be detected as an error. -- rgr, 7-Feb-96.] [But is this really the best thing to do? (See the ctserv-start-internal function.) Consider an intermediate starting state to distinguish startup errors from errors during otherwise normal running . . .]
  9. There was no good way to enqueue job files. [fixed by the host file job-queue option. -- rgr, 7-Feb-96.]
  10. *** There is no way to include comments in the job file options. Furthermore, newlines within option values are lost, possibly leading to bizarre results in substitution forms. Both problems should be fixed together.
  11. *** The server should trash requests more than (say) an hour old when it starts up. Or, possibly just enter suspension requests at the earlier of current time or message time.
  12. Reinstate "status all". -- rgr, 12-Feb-96. [did this via the status-all command. -- rgr, 13-Feb-96.] [but the status commands should be combined; see below. -- rgr, 27-Mar-96.]
  13. Starting doesn't seem to notice suspensions. -- rgr, 13-Feb-96. [Fixed. -- rgr, 16-Feb-96.]
  14. *** It is difficult at present to estimate things like computational efficiency, since job statistics are not uniformly recorded. I should probably hack some sort of logging using the psa-request transaction database stuff. -- rgr, 15-Feb-96.
  15. The timer process sometimes quits for no apparent reason. This causes suspensions to stay on the books forever, and makes the process-exit-checker go away. -- rgr, 5-Mar-96. [Seem to have clobbered this for good. -- rgr, 27-Mar-96.]
  16. Doing start hostname fails to find the host's current job. [Actually, it was never documented to do this. But it turns out to be more intuitive to prefer the current job, so I fixed (and documented) it that way. -- rgr, 8-Mar-96.]
  17. Contrary to documentation, doing halt job1 host; start job2 host does not complain about the running job1, resulting in two jobs running on host at the same time. [Now fixed; I just hadn't bothered to check for this! -- rgr, 8-Mar-96.]
  18. *** The fact that suspended jobs become idle when the server gets restarted is no end of annoying. Maybe the server should write a suspension state summary when exiting. -- rgr, 8-Mar-96.
  19. *** The status commands should be combined, i.e. "status jobname server all-hosts" all at once. -- rgr, 27-Mar-96.

Bob Rogers <rogers@darwin.bu.edu>
BioMolecular Engineering Research Center
Boston University
36 Cummington St
Boston MA 02215
Last modified: Fri Nov 7 15:05:07 EST 1997