Comprehensive Structure Data Base

I am using this rather grand moniker to describe Temple's collection of protein structural information about the expanded set of proteins that can be used to discover statistical relationships that might be useful in an improved score function.

[First major modification: I have decided to store amino acids and PDB residue indices as internal numbers, rather than 3-letter abbreviations and PDB strings respectively. This should make it easier to compute environments and counts, without overly complicating report generation. -- rgr, 17-Jun-96.]

[I have now moved the scripts to ~thread/code/bin, where the other BMERC software tools under my care live, and have started to keep a separate development directory so I don't step on people's toes. -- rgr, 30-Jul-96.]

[The segment tables and related information are now in place, though not quite complete. -- rgr, 28-Oct-96.]

Table of Contents

  1. Comprehensive Structure Data Base
  2. Table of Contents
  3. Overview
  4. Relations
    1. Database fields and datatypes
      1. The aa_index datatype
      2. The aa_abbrev datatype
      3. The core total index
      4. The pdb_index datatype
      5. The pdbres datatype
      6. The window datatype
      7. The segment type
    2. Protein table
    3. Core segment table
    4. Singleton table
    5. Pairwise tables
  5. Scripts
    1. Setting up
    2. perl expressions
      1. Expression keyword arguments
    3. perl scripts
    4. Extracting singleton data
    5. Extracting pairwise data
      1. Pairwise extraction examples
    6. Extracting segment data
    7. Pairwise script examples
  6. Internals
    1. Database creation
  7. Known bugs/deficiencies


The principal object is to incorporate as much machine-usable data as possible about our set of 217 new cores (or however many it is this week). We can therefore expect to have huge datafiles that are not directly human-browsable, so an important collateral goal is to avoid getting buried by the data. Accordingly, data fields are given symbolic names, which the user passes to scripts which hide the internal structure. (Indeed, the data may be stored internally in a different form than the way it is presented.) The database can be rebuilt automatically with added information, possibly in a different storage format, but the script interface will remain the same.

At the same time, I have no wish to reinvent a whizzy English relational database interface. (I don't even want to reinvent SQL!) Users will have to write snippets of perl in order to do anything useful; my job is just to make it so we don't have to rewrite all our scripts if we want to add a few new fields.

As an example of what we are trying to accomplish, a simple query could be phrased in English as:

Count (i.e. build a histogram of) all residues that see a neighboring tryptophan on the same piece of secondary structure.
[I still can't quite phrase this in an executable form; it depends on having the segment stuff in place. -- rgr, 11-Jul-96.] A refinement might be:
Do the same, but build three separate histograms, one each for all alpha, all beta, and mixed alpha-beta proteins.
(The definition of "neighboring" may be qualitative, as in "has a pairwise entry in the database"; if quantitative, as in "has an interaction energy of at least X", then the threshold can be adjusted, as long as it is at least as stringent as the minimum value used to decide whether to put a pairwise entry in the database.)


Conceptually, the database has four relations (database tables):
  1. A table of information which depends only on the locus (e.g. sequence length, structural classification, number of core segments);
  2. A singleton table, of information that depends on locus and a single residue index (e.g. amino acid, exposure); and
  3. A pairwise table, of information that depends on locus and two residue indices (e.g. interaction energy).
  4. A core segment table, keyed by locus and segment number.
In practice, the singleton table is split by locus, in order to make it easier to filter on proteins meeting certain criteria. The pairwise table is further split on subclasses of data, since pairwise information tends to be sparse, and combining disparate pairwise sets is likely to make the database files balloon considerably.

It would probably be useful to treat the information stored as ATOM records in the PDB files as an additional relation keyed on locus, residue, and atom type (e.g. "CD1"). We could then do interesting things with coordinate information as well.

Database fields and datatypes

Field names are in lowercase, and include the relation name (the same as the suffix of the files they live in) as a prefix. In the descriptions, the names are followed by the field type in parentheses, which are hyperlinked to the field type descriptions below. If no type field appears, the value is a string, about which I have nothing further to say.

[I think it may be necessary to expose a little perl here. In particular, the names of computed fields should probably start with "&", and noncomputed fields with "$", in order to simplify implementation. -- rgr, 12-Jun-96.]

The aa_index datatype

Amino acids are assigned an internal code based on the alphabetical position of their standard single-letter abbreviation, starting with 0 for Alanine. This numbering was chosen to be identical with the internal encoding used by the stat and needle programs. Even though neither of these programs rely on sharing a common internal encoding, it will probably help to avoid human error to use the same convention.

The aa_abbrev datatype

This is the standard 3-letter amino acid abbreviation, in upper case to conform to PDB usage.

The core total index

The core total index, or CTI, is the 1-based index of a residue in a core model, counting only those residues that appear in the core. In other words, if the last residue of a segment has CTI = n, then the first residue of the next segment has CTI = n+1. The CTI numbering scheme is used in
environment files and other places that need to concisely refer to specific residues in a core file.

The pdb_index datatype

The residues that appear in the PDB ATOM records are numbered sequentially from 0. See the
pdbres datatype for a discussion of why this is necessary.

The pdbres datatype

This is an alphanumeric sequence number, exactly as if taken from (1-based) columns 23 through 27 of the ATOM field of the PDB file. In general, this will not be an integer, since there are sequence numbers like "235A". The first four characters are digits, with blank padding on the left, and the fifth and last character is either a letter suffix or blank. (This format definition should make it easier to extract the correct residues from the PDB entry. Note also that alphanumeric sorting will reproduce the order the residues appear in the PDB.)

The window datatype

[Need some explanation of Loredana's window numbering convention. I know that window 1 points back to the alpha carbon, leaving 2, 3, and 4 as the windows of interest, but that's all I know. -- rgr, 28-Jun-96.]

Note that 0 is not a valid window, and so is used to indicate "none." It is possible to get 0 as a window for amino acids that are normally beta-branched if the side chain was not resolved crystallographically. In other words, no CG1 atom means that the $sing_gamma1_window will be zero, regardless of whether the residue is alanine or valine.

The segment type

[Note: Other code calls this the "segment designator," which may be preferable; the terminology should be standardized. -- rgr, 21-Aug-96.]

Segment types are strings that convey the nature of the segment. All helices are of type "H"; all other segments are strands of various sorts. The labels fall into three classes:

  1. Helices are uniformly denoted by "H".
  2. Open-sheet strands are denoted "En", where n is the sheet number. (Sheet numbers are assigned arbitrarily in order of the first residue in the sheet.)
  3. Cyclical strands (i.e. strands comprising a barrel structure) are denoted by "Cxxyyzz", where xx is the number of this strand, yy is the number of the previous strand in cyclical order, and zz is the number of the next strand in cyclical order. Numbers are assigned to strands in the order of their first residue, whether or not they appear in a barrel.
Note that sheet indices and the strand indices used to name barrel segments are numbered for the PDB entry as a whole, and not strand-by-strand. This should work fine for most code, which only cares whether strand residues are part of the same sheet or not. (In principle, barrels are trickier, since one cannot assume that a core model includes only one barrel; one must trace the barrel around to find if two strands are on the same barrel. In practice, 1tie seems to be the only model with more than one barrel, and it appears to be pathological in any case. -- rgr, 21-Aug-96.)

[And many score functions care only whether a segment is a helix or a strand. (What code uses this, anyway? Crippen? Jernigan? MRF?). The code that generates these labels is also under development; the assignment is not always unique, so our heuristics may change. -- rgr, 31-Jul-96.]

Protein table

This contains data for whole protein structures, lives in the all.prot file, and is keyed only on $prot_locus.

[Note: The protein table is not implemented yet; waiting on segment definition. -- rgr, 2-Jul-96.]

[May need special fields for PDB locus vs chain id. -- rgr, 19-Jul-96.]

Core segment table

These live in the locus.seg files, with $seg_locus and $seg_segment constituting the key. There are special-purpose scripts to extract subsets of the core segments from singleton and pairwise tables. (And probably other scripts to extract loops . . . )

[Note: The segment table is not implemented yet; waiting on segment definition. -- rgr, 19-Jul-96.]

[Right now I generate segment files from Sophia's .core files. If we want to drive core generation from csdb, rather than the reverse at present, we should build .seg files from .sing DSSP information, then build the core files from either (or both). Until I get this figured out, I won't be able to fill in the segment-related fields. -- rgr, 28-Jun-96.]

Singleton table

These are stored in locus.sing files, with sing_locus and sing_pdbres making up the key. [Right now the key is implicit, composed of the first two fields; do I need an explicit key so I can more easily use join? Should I have the extraction script manufacture such a key on output? Or maybe that should just be a computed field that a user can ask for if needed? -- rgr, 11-Jun-96.]

Loredana's tetrahedral data (based on her descriptions):

Segment data. If available, the $segment_data_available_p variable will be true, and the following additional fields will be defined. [not yet implemented. -- rgr, 28-Oct-96.]

Pairwise tables

These are kept in locus.pair files, with $pair_locus, $pair_res1_pdb_index, and $pair_res2_pdb_index making up the key. The script should be used to access values in these files. In addition, the following "pairwise" fields beginning "pair_res1_*" and "pair_res2_*" are defined (via implicit joins) as the values for the corresponding singleton records. All singleton fields are defined, using the convention that "sing" in the singleton field name is replaced by "pair_res1" for the first residue, and "pair_res2" for the second.

Pairwise energy terms.

Pairwise segment terms. Note that $pair_res*_cti is really a singleton value, rather than a segment value. For loop residues, all of these values will be null strings, which is the boolean value for 'false' in perl.


Setting up

Before doing anything else, it is necessary to do

     source ~thread/code/bin/csdb-setup
in order to establish environment variables used by the other scripts. It puts the ~thread/code/bin/ directory on your PATH if it is not already there, which also makes
the core generation tools available.

perl expressions

[useful references? perl expressions can be tough to figure out for the uninitiated . . . -- rgr, 9-Jul-96.]

Expression keyword arguments

These keyword arguments are accepted by both the and scripts. They are listed in the order in which they are first evaluated. Any of these arguments can be repeated any number of times; all expressions are evaluated at the appropriate time in the order specified. At least one -e parameter must be supplied.

perl scripts

[notes on writing your own scripts -- skip on first reading. -- rgr, 12-Jul-96.]

In a script, only the &evaluate_expression subroutine needs to be defined. [Should perhaps be called &process_record or something similar. -- rgr, 12-Jul-96.]

Extracting singleton data takes one or more perl expressions that are supplied as initial keyword arguments, and which can include conditionals, print statements, random computation, etc. [ keyword . . . ] [ locus . . . ]
So far, the only keywords defined are the
expression keywords, which are evaluated at the appropriate times while processing the singleton records for the specified loci; the principal one of interest is the -e keyword, which is evaluated for each singleton record.

If no loci are supplied on the command line, they are read from the standard input, one per line. It is also possible to supply filenames, though the extension is ignored; this is convenient for shell wildcards.

At present, the data files live in the ~thread/structure/data/test/ directory (or wherever the $CSDB_DATA_DIR environment variable happens to point). This is assumed to be the current directory for the examples below (but please don't write anything there!). [bug: this location is obsolete. -- rgr, 20-Aug-97.]

For example, -e 'print(join("\t", $sing_locus, $sing_aa_index, \
	$secondary_structure_codes{$sing_dssp_ss}), "\n");' *.sing
produces a tab-delimited table of locus, amino acid internal index, and numeric code for the smoothed DSSP secondary structure assignments, for all residues in all proteins, on the standard output. The expression at the start of the second line is interesting: "$secondary_structure_codes{$sing_dssp_ss}" maps $sing_dssp_ss (a string containing a single character) to a numeric value.

If we only wanted data for DSSP helices, we could use: \
	-e 'if ($sing_dssp_ss eq "H") { \
		print(join("\t", $sing_locus, $sing_aa_index, \
			   $secondary_structure_codes{$sing_dssp_ss}), \
		      "\n"); }' \
The only real difference between this and the last example (besides the improved indentation) is the presence of the conditional "if ($sing_dssp_ss eq "H") {" at the start of the expression -- and of course the matching } at the end. Note that perl uses "eq" for string comparison (as distinguished from "==" for numeric comparison).

Extracting pairwise data

To examine or extract pairwise records from the database, use the script. It's usage is similar to that of the script: [ keyword . . . ] [ locus . . . ]
So far, the only keywords defined are expression keywords. The locus arguments are as for

Pairwise extraction examples

Here's an example that just counts the number of ALA-ALA pairs within 8.5 Ångstroms that see each other in 1aaj. We need two expressions here, one to count and one to print the result. Note that we are using the internal aa_index values for the amino acids.

    gamow% \
	     -e 'if ($pair_visible_p && $pair_cb_distance <= 8.5 \
		     && $pair_res1_aa_index == 0 && $pair_res2_aa_index == 0) \
		     { $count++ }' \
	     -finally 'print "$count\n";' 1aaj.pair
Two peculiarities of perl are immediately apparent. The nicer quirk is that we did not need a -initially expression to initialize $count. Somewhat less convenient is that fact that although the perl if statement resembles the C if statement, even the simple $count++ expression that we wanted evaluated when the test is true must be enclosed in curly braces. Also, to get the count value printed on a line of its own, it was necessary to include an explicit newline -- the "\n" inside the double quotes in the -finally expression.

So the answer is 4, but why should we really believe that? Let's list the PDB residue numbers for each case (since we know that the number of them is manageable). The command and resulting output now looks like this:

    gamow% \
	     -e 'if ($pair_visible_p && $pair_cb_distance <= 8.5 \
		     && $pair_res1_aa_index == 0 && $pair_res2_aa_index == 0) \
		     { print "$pair_res1_pdbres $pair_res2_pdbres\n"; }' \
      12    14 
      17    20 
      20    77 
      59    66 
Looking in the PDB file for 1aaj, we see that the indicated residues are all alanines. (The 'extra' spaces appear in the output because they are part of the pdbres fields.)

One way to build a histogram in perl is to increment the elements of an array, instead of a single count. (perl takes care of extending the array to the right size, and treats unitialized elements of the array as zeros or null strings, depending on context, just as for scalars.) The following example counts same-residue pairs (regardless of distance or any measure of interaction) by amino acid. All such pairs are accumulated into the @count array (which is distinct from the scalar variable $count, though one must use $count[$i] to access the $i'th element of @count).

    gamow% \
	     -e 'if ($pair_res1_aa_index == $pair_res2_aa_index) \
		     { $count[$pair_res1_aa_index]++; }' \
	     -finally 'print join(" ", @count), "\n";' 1aaj.pair
    26  1 9 1 8 6 3 7 5 10 1 11   3 13 44  6
Notice that the unitialized elements are still uninitialized, and show up as null strings rather than zero. To correct this, we can explicitly initialize them using the -initially keyword:

    gamow% \
	     -initially 'foreach $i (0..19) { $count[$i] = 0; }' \
	     -e 'if ($pair_res1_aa_index == $pair_res2_aa_index) \
		     { $count[$pair_res1_aa_index]++; }' \
	     -finally 'print join(" ", @count), "\n";' 1aaj.pair
    26 0 1 9 1 8 6 3 7 5 10 1 11 0 0 3 13 44 0 6
Alternatively, we could leave the @count array uninitialized, and then print a more readable table of only those values that are defined:

    gamow% \
	     -e 'if ($pair_res1_aa_index == $pair_res2_aa_index) \
		     { $count[$pair_res1_aa_index]++; }' \
	     -finally 'foreach $i (0..19) { \
			  if (defined($count[$i])) { \
			       print "$aa_3_letter_codes[$i]  $count[$i]\n"; \
			  } \
		       }' \
    ALA  26
    ASP  1
    GLU  9
    PHE  1
    GLY  8
    HIS  6
    ILE  3
    LYS  7
    LEU  5
    MET  10
    ASN  1
    PRO  11
    SER  3
    THR  13
    VAL  44
    TYR  6
(But since unitialized values are treated as zero, "if ($count[$i]) { . . . }" would have worked in the conditional as well.) Here's what it looks like for the whole database:

    ALA  1161
    CYS  365
    ASP  315
    GLU  235
    PHE  601
    GLY  728
    HIS  112
    ILE  991
    LYS  215
    LEU  2208
    MET  157
    ASN  308
    PRO  286
    GLN  177
    ARG  268
    SER  567
    THR  494
    VAL  1318
    TRP  92
    TYR  384

[An example where I turn some random counts into probabilities by normalizing would be nice to have, too. Probably easier to do this with a singleton example. On the other hand, I'd like to see what these numbers mean; why so many LEU-LEU, and so few CYS-CYS? -- rgr, 10-Jul-96.]

Extracting segment data

Like the script, takes one or more expression keywords supplied as the initial arguments, followed optionally by the loci of interest: [ keyword . . . ] [ locus . . . ]
The segment table fields are defined as described above for each successive evaluation of the -e keyword.

For example, the following invocation prints a histogram of segment lengths: -e '$hist[$seg_length]++;' \
	-finally 'for ($len = 0; $len <= $#hist; $len++) { \
		      if ($hist[$len]) { \
			  print "$len\t$hist[$len]\n"; } }' \
[Note that $#hist returns the index of the last element of the $hist[] array.]

[should also present the ~thread/code/csdb/scripts/ script. -- rgr, 23-Oct-96.]

Pairwise script examples

When the amount of perl code passed to grows past a certain limit, the script interface becomes cumbersome. It turned out to be somewhat self-defeating to break the last example over 9 lines in order to make it more readable; the added backslashes didn't help. Finally, keeping shell versus perl quoting conventions straight in order to make sure that the dollar signs are interpreted in the right place can be difficult and hard-to-read as well.

So let us recast our pairwise example as a pure perl script.


    # Sample pairwise perl script.

    push(@INC, $ENV{"CSDB_SCRIPTS"});
    require "";

    sub evaluate_expression {
	if ($pair_visible_p && $pair_cb_distance <= 8.5
	    && $pair_res1_aa_index == $pair_res2_aa_index) {

    sub finally {
	foreach $i (0..19) { 
	    if (defined($count[$i])) { 
		print "$aa_3_letter_codes[$i]  $count[$i]\n"; 

If we put this in a file called and change its permissions so that it is executable, the initial #! line will allow us to execute it directly. Blank lines have been inserted for readability, and a certain amount of "boilerplate" has been added at the top to load the supporting CSDB pairwise code. (If you have forgotten to source the csdb-setup script before invoking one of these scripts, perl will complain that it "Can't locate in @INC . . .".) The -e and -finally expressions have been turned into evaluate_expression and finally subroutines, respectively. And the backslashes in front of the newlines have been eliminated altogether.

Note that only evaluate_expression and finally subroutines are defined; the others (initially, before_protein, and after_protein) are not called if not defined (though &do_pairwise_records will refuse to proceed if evaluate_expression is not defined).

The final line, a call to the &do_pairwise_records subroutine, results in the same treatment of locus arguments as for Executing the script is simple, and happily produces the same result for 1aaj:

    gamow% 1aaj
    ALA  4
    HIS  2
    LYS  1
    LEU  3
    MET  2
    PRO  2
    SER  2
    THR  5
    VAL  9
    TYR  2
We can easily extend this example to print per-protein subtotals of the same-AA pairwise counts by defining the before_protein and after_protein subroutines. Using the tabular format of an earlier example, we need a loop in before_protein to reset a distinct per-protein array to zero, and an after_protein that prints the result:

    sub before_protein {
	foreach $i (0..19) { $protein_count[$i] = 0; }

    sub after_protein {
	print $pair_locus, "\t", join(" ", @protein_count), "\n";
Note that we've added $pair_locus followed by a tab character to the print statement so that each line gets a label.

Finally, in addition to this new code, we must modify evaluate_expression to update @protein_count:

    sub evaluate_expression {
	if ($pair_res1_aa_index == $pair_res2_aa_index) {
If we let the &finally subroutine stand as is, this is a truncated version of what we get by running this example on the full pairwise database:

    gamow% *.pair
    1aaj	4 0 0 0 1 0 2 0 1 3 2 0 2 0 0 2 5 9 0 2
    1abmA	3 0 0 1 2 4 8 5 1 14 0 4 2 1 0 0 1 1 3 3
    1add	2 0 2 6 5 1 4 8 3 12 1 2 4 1 1 2 2 15 0 1
    1aep	12 0 0 2 0 0 0 2 1 19 0 0 0 3 0 2 1 1 0 0
    1ahc	9 0 0 2 2 1 0 10 1 23 0 4 1 0 1 3 2 4 0 3
    1arb	6 3 1 0 0 9 2 8 0 8 0 3 1 1 1 11 8 2 1 1
    1ast	1 2 2 0 1 2 3 9 0 2 2 2 0 0 1 2 1 2 0 5
    1atr	16 0 3 2 10 8 0 8 3 14 0 3 2 1 2 3 4 23 0 0
    1avhA	2 0 2 3 5 2 0 3 2 31 1 0 0 2 4 3 3 3 0 3
    1babB	2 0 0 1 3 4 0 0 1 18 0 0 1 1 0 0 0 10 0 0
    1bbpA	0 2 2 0 0 1 1 0 0 1 0 2 1 0 0 3 1 13 1 2
    1bet	1 12 0 0 1 0 0 1 1 1 0 0 0 0 1 5 8 4 0 0
    1bfg	2 0 0 2 0 4 0 1 3 10 0 0 0 1 2 0 0 0 0 1
    1bllE	27 0 2 4 6 12 1 10 5 21 7 2 9 2 2 3 5 22 0 1
    1bmdA	28 0 2 2 4 5 0 4 0 23 1 4 2 0 3 2 3 14 0 0
    . . .
    7pti	2 2 0 0 1 3 0 0 0 0 0 0 1 0 0 0 0 0 0 0
    8acn	15 3 5 4 4 13 2 24 6 40 4 7 5 2 8 11 10 17 0 4
    8atcA	4 0 1 0 3 1 0 4 1 43 2 1 1 2 3 5 5 8 0 0
    8i1b	0 0 0 1 2 0 0 0 1 19 1 4 0 2 1 1 1 1 0 1
    8rxnA	0 6 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2
    8tlnE	11 0 3 1 3 6 2 8 0 5 0 1 1 0 0 4 6 11 0 9
    9ldtA	3 0 1 3 0 4 0 17 2 20 1 1 1 1 0 6 2 26 1 1
    ALA  1207
    CYS  366
    ASP  320
    GLU  241
    PHE  607
    GLY  737
    HIS  112
    ILE  1007
    LYS  216
    LEU  2236
    MET  158
    ASN  313
    PRO  288
    GLN  177
    ARG  272
    SER  572
    THR  501
    VAL  1348
    TRP  104
    TYR  385


Database creation

In the ~thread/code/csdb/scripts/ directory, there are a number of scripts that start with "make-". These are all "internal" in the sense that they are not expected to work anywhere but at BMERC; they are run when appropriate by the makefile in the data directory. Since these scripts are never copied into the ~thread/code/bin/ directory, ~thread/code/csdb/scripts/ must be on $PATH when running the makefile. There is a ~thread/code/csdb/scripts/csdb-test-setup file that can be "sourced" to do this.

Adding new database fields

[In progress. -- rgr, 8-Aug-97.]

In order to add a database field, the following steps must be taken. You should also search for "CHANGE-SINGLETON-FIELDS" or "CHANGE-PAIRWISE-FIELDS" or "CHANGE-SEGMENT-FIELDS" in all source files; either the code or the documentation may not be completely up-to-date.

  1. In, insert the new field name into the @csdb_singleton_fields or @csdb_segment_fields or @csdb_pairwise_fields) array, as appropriate, in the position it will take in the tuple.
  2. Update &destructure_singleton_record and &get_pairwise_singleton_records to reflect the new variable names.
  3. Recode the appropriate table construction function (e.g. to stuff data (or at least a placeholder) into the tuple at the right place.
  4. Rebuild the database.

Known bugs/deficiencies

Might it become necessary to support multiple PDB versions of a single locus? In that case we might want to always qualify the locus with the PDB version number . . . and might want our own "captive" copy of the PDB entry in any case. -- rgr, 11-Jun-96.

Bob Rogers <>
Last modified: Wed Aug 20 16:44:38 EDT 1997