Protein Engineering vol.11 no.12 pp.1129–1135, 1998
Systematic fold recognition analysis of the sequences encoded by
the genome of Mycoplasma pneumoniae
Rita Grandori1
Center for Applied Molecular Engineering, Institute for Chemistry and
Biochemistry, University of Salzburg, A-5020 Salzburg, Austria
1Present address: Institute of Chemistry, Johannes Kepler University,
Altenberger Strasse 69, A-4040 Linz, Austria.
E-mail: [email protected]
A robust tool for fold recognition was applied to the
systematic analysis of the sequences below 200 residues
encoded by the genome of Mycoplasma pneumoniae. The
goal was to determine the additional information gain
achievable in genome analysis by fold recognition, beyond
the intrinsic limits of homology studies. A list of 124
sequences encoding for soluble proteins or domains not
homologous to each other, or to proteins with known three-dimensional structure, was analyzed, resulting in significant
Z scores for the energy of the structural models in 12 of
these cases. This result indicates that systematic application
of fold recognition techniques to the analysis of structurally
unassigned soluble proteins can lead to high-confidence
structural predictions with an efficiency of about 10%, a
relevant contribution besides the complementary approach
of homology analysis. Four of the predictions presented
include mapping of the putative active site of the target
sequence and lead to the detection of probable catalytic
and binding residues. The data are discussed with reference
to the functional implications of the structural models and
to the results reported for the homologous genome of
Mycoplasma genitalium.
Keywords: active site analysis/genome analysis/knowledge-based potentials/protein function prediction/protein structure
prediction
Introduction
The development of highly efficient sequencing techniques
has recently made possible the completion of exhaustive
sequencing projects, already including the entire genomes of
several organisms (Fleischmann et al., 1995; Fraser et al.,
1995, 1997, 1998; Bult et al., 1996; Goffeau et al., 1996;
Himmelreich et al., 1996; Kaneko et al., 1996; Blattner et al.,
1997; Klenk et al., 1997; Kunst et al., 1997; Smith et al.,
1997; Tomb et al., 1997; Cole et al., 1998; Deckert et al.,
1998). Genome projects provide raw data with a tremendous
information potential that could bring important contributions
to our understanding of the metabolism, regulation and evolution of biological systems. The interpretation of DNA sequencing results is limited, at present, by our tools for the prediction
of structural and functional features of the encoded protein
products. This task has become relatively easy in the presence
of homology to already characterized proteins, but it represents
one of the central challenges of theoretical biology whenever
the prediction cannot rely on sequence similarity.
© Oxford University Press
At the present level of expansion of databases, homology
analysis leads, on average, to a functional assignment or at
least coarse-grained classification, for ~60% of the open
reading frames (ORFs) of newly sequenced genomes (see
references above), and also to a structural assignment for
10–15% of the cases (Rost, 1998). Complementary approaches
to protein structure prediction, like fold recognition methods,
are expected to push forward these limits to the information
we can extract from the results of genome projects. Indeed,
fold recognition methods try to detect structural similarities
that could not be detected on the sequence level, searching for
known protein structures that might represent a good template
for a given sequence of unknown native conformation.
A knowledge-based fold recognition method (Sippl, 1993a;
Flöckner et al., 1995; Sippl and Flöckner, 1996) was applied
in this work to the systematic analysis of the ORFs of the
genome of Mycoplasma pneumoniae (Himmelreich et al.,
1996). The method makes use of mean force potential functions
for residue-pair interactions and solvent accessibility to evaluate the fitness of a given sequence for alternative conformations.
The input sequence is threaded on to a database of different
known protein structures. The output is a list of sequence–
structure alignments ranked on the basis of their relative
energies, where low-energy conformations represent candidate
native-like models for that sequence. This method has been
demonstrated to be a robust tool for fold recognition and to
perform with high alignment accuracy (Flöckner et al., 1995,
1997; Levitt, 1997).
The choice of the M.pneumoniae genome was motivated by its
small size and by the availability of an exhaustive and accurate
sequence analysis (Himmelreich et al., 1996). Hence this
genome represents a good system to test how much new
structural and functional information we can gain by fold
recognition methods, beyond the limits of the most sophisticated analysis of sequence homology.
The M.pneumoniae genome has 677 predicted ORFs. Of
these, 376 have an assigned putative function on the basis of
sequence homology, 198 show sequence homology only to
proteins with unknown function and 103 show no significant
sequence homology to any sequence in the database (Himmelreich et al., 1996). This genome contains 155 sequences that
show significant sequence similarity (>25% identity for more
than 80 residues aligned; Sander and Schneider, 1991) to
proteins with known three-dimensional (3D) structure
(F.Domingues, personal communication). In this work an
attempt was made to identify new high-confidence templates
for all the structurally unassigned sequences below 200 amino
acids and to interpret the structural models with reference to
the putative function of the protein. During this project,
Eisenberg and co-workers performed a similar analysis on the
homologous genome of M.genitalium (Fischer and Eisenberg,
1997). The present results are discussed in comparison with
the reported data for M.genitalium.
Materials and methods
The amino acid sequences encoded by the genome of
M.pneumoniae were downloaded from the internet site
http://www.zmbh.uni-heidelberg.de/M pneumoniae and
sequences from M.genitalium were downloaded from the
internet site http://www.tigr.org/tdb/mdb/mgdb/mgdb.html.
Protein sequences are named in this work with the original
identifier of the corresponding ORFs. The PDB identifiers are
used for structures, followed when necessary by the letter
denoting the subunit chain and the number denoting one of
multiple structural models.
Only sequences of M.pneumoniae below 200 amino acids
were considered, in order to limit the analysis to the range of
highest reliability of fold recognition methods, owing to
the technical difficulties of detecting multi-domain structures
(Levitt, 1997). Only one sequence longer than 200 amino acids
was examined (A05_orf290). This sequence is 290 residues
long, but it belongs to a relatively large homology group
in the category of functionally unassigned sequences that
would not otherwise have been represented among the target sequences
of the present work. Since three putative domains shorter
than 200 residues could be recognized in it, the sequences
corresponding to these domains were included in the list of
target sequences in an attempt to identify new functions.
Sequences with predicted trans-membrane regions according
to the authors of the sequence (Himmelreich et al., 1996) and
to the program ProSEQ (M.Jaritz, unpublished work) were
considered only when at least one putative soluble domain
could be identified (a stretch longer than 50 residues without
trans-membrane regions). In those cases, the fragment corresponding to the putative soluble domain was used independently
for fold recognition. Every sequence was searched for the
presence of putative structural domains also by homology
search with FastA (Pearson and Lipman, 1988) against the
Swissprot database. Whenever a homology domain could be
detected within a sequence, the corresponding fragment was
analyzed independently, as well as the complementary one.
Sequence homology within the M.pneumoniae genome was
also taken into account; a group of homologous sequences is
considered as one fold recognition target. Sequences homologous (>25% identity, >80 residues aligned) to proteins with
known 3D structure were also filtered out from the list of
target proteins.
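The selection rules described above can be summarized in a short sketch. This is only an illustration of the stated cutoffs (size limit of 200 residues, soluble stretches longer than 50 residues, homology defined as >25% identity over >80 aligned residues); the function names, the input layout and the treatment of a sequence of exactly 200 residues are assumptions made for the example, and the special handling of A05_orf290 and of homology domains is not covered.

```python
MAX_LENGTH = 200          # size limit applied to the target sequences
MIN_SOLUBLE_STRETCH = 50  # a soluble stretch must be longer than this
MIN_IDENTITY = 25.0       # homology cutoff: > 25% identity ...
MIN_ALIGNED = 80          # ... over > 80 aligned residues


def soluble_fragments(length, tm_segments):
    """Return 1-based inclusive (start, end) stretches that contain no
    predicted trans-membrane segment and are longer than 50 residues."""
    fragments = []
    cursor = 1
    for tm_start, tm_end in sorted(tm_segments):
        # The stretch cursor..tm_start-1 has tm_start - cursor residues.
        if tm_start - cursor > MIN_SOLUBLE_STRETCH:
            fragments.append((cursor, tm_start - 1))
        cursor = max(cursor, tm_end + 1)
    if length - cursor + 1 > MIN_SOLUBLE_STRETCH:
        fragments.append((cursor, length))
    return fragments


def has_structural_homolog(hits):
    """hits: (percent_identity, aligned_residues) pairs obtained against
    proteins with known 3D structure (hypothetical input layout)."""
    return any(identity > MIN_IDENTITY and aligned > MIN_ALIGNED
               for identity, aligned in hits)


def fold_recognition_targets(length, tm_segments, hits):
    """Return the fragments of one sequence that would be used as targets."""
    # Assumption: the limit is treated as inclusive, because an ORF named
    # A19_orf200 appears among the targets of this work.
    if length > MAX_LENGTH or has_structural_homolog(hits):
        return []
    if not tm_segments:
        return [(1, length)]
    return soluble_fragments(length, tm_segments)
```

For instance, a hypothetical 180-residue sequence with one predicted trans-membrane segment at residues 60–80 would contribute two fragments, 1–59 and 81–180, each analyzed independently.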
The program ProFIT (Sippl, 1993a; Flöckner et al., 1995,
1997; Sippl and Flöckner, 1996) was used for fold recognition.
Each target sequence was threaded on to a database containing
1700 structures. This database is a representative subset of the
Brookhaven database, still containing redundancy for each
structural class. ProFIT scores (Sippl, 1993b; Sippl and Jaritz,
1994) were used for the ranking of the sequence–structure
alignments and were the main criterion for the evaluation of
the outputs. However, ProFIT outputs were also inspected by
the application of several other criteria of judgement: the
quality of the alignments (number and position of gaps), the
structural similarity among the highest scoring templates, the
agreement between the models and the secondary structure
prediction (when multiple homologous sequences were available), the energy profiles of the models as calculated by the
program ProSA (Sippl, 1993b) and the consistency of ProFIT
results for homologous sequences, when available. Only those
models that generated consistent evidence by these different
kinds of test were considered for further analysis. The program
ProSUP (Feng and Sippl, 1996) was used for structure comparison.
The goodness of the selected models was finally evaluated
by the calculation of the Z scores for their energy values by
the program ProSA (Sippl, 1993b). The raw models generated
by ProFIT were refined by the program Modeler (Šali and
Blundell, 1993) to add side-chains, close gaps and model
loops. The energy of the models was then calculated by means
of the combined residue-pair and surface potential functions
(Sippl, 1993a) and compared with the distribution of the energy
values obtained by threading the target sequence on to a
polyprotein of known structures. The Z scores were calculated
according to the equation Z = (E – Ē)/σ, where E is the energy
of the model and Ē and σ are the mean and the standard
deviation of the random distribution, respectively. A Z score
value of –5 was considered as the threshold for high-confidence models.
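The Z score calculation can be illustrated with a minimal sketch. It assumes that the energies of the alternative folds are available as a plain list of numbers and uses the population standard deviation; both choices, and the function names, are assumptions of the example rather than a description of how ProSA is implemented.

```python
from statistics import mean, pstdev

Z_THRESHOLD = -5.0  # threshold for high-confidence models used in this work


def z_score(model_energy, background_energies):
    """Z = (E - mean) / sigma, where mean and sigma describe the energies
    obtained for the same sequence in alternative conformations."""
    e_mean = mean(background_energies)
    sigma = pstdev(background_energies)  # assumption: population std. dev.
    if sigma == 0:
        raise ValueError("background energies have zero spread")
    return (model_energy - e_mean) / sigma


def is_high_confidence(z):
    """A model counts as high confidence when Z is at or below -5."""
    return z <= Z_THRESHOLD
```

With the toy background energies [-2, -1, 0, 1, 2] (mean 0, σ = √2) a model energy of –10 gives Z ≈ –7.07, which would pass the threshold. The threshold is treated as inclusive here because a model with Z = –5.0 is listed among the high-confidence results.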
Results
The list of ORFs from M.pneumoniae was filtered, as described
in Materials and methods, by size, sequence hydrophobicity
and homology relations, resulting in a total of 124 fold
recognition targets analyzed in this work. Of these 124
sequences, 57 have a putative function by homology, 40 have
uncharacterized homologous sequences and 27 are so far
unique sequences.
High confidence fold recognition results were obtained for
12 sequences of M.pneumoniae, as summarized in Table I. Six
of these proteins belong to the functionally assigned category,
five to the functionally unassigned category and one to the
group of unique sequences. The sequence identity between
each target and its predicted template is below 20%. The Z
scores for the energies of the models, calculated with reference
to the energy distribution of alternative folds for each target
sequence, range between –5 and –8.6, indicating significant
low energies of the proposed models (Sippl, 1993b).
The alignments obtained by fold recognition were inspected
for matching of functional residues of the template protein
and for conservation of sequence signatures that could reflect
similarities in the function of the two proteins. In four cases
of sequences belonging to the functionally assigned category,
these results bring some new insights about the putative
active site of the molecule, indicating probable catalytic and
binding residues.
Despite the extensive sequence similarity between the genomes of M.pneumoniae and M.genitalium, there is a very small
overlap between the predictions presented here and those
reported by Eisenberg and co-workers (Fischer and Eisenberg,
1997), which can be partially explained by the size limit
applied to the ORFs examined in this work. Indeed, almost
all the predictions reported by Eisenberg and co-workers refer
to proteins longer than 200 residues. Only two sequences are
given a structural assignment in both studies: B01_orf178 of
M.pneumoniae (MG030 of M.genitalium) and P01_orf193 of
M.pneumoniae (MG335 of M.genitalium). In the first case the
prediction is the same, whereas in the second case the predicted
folds belong to the same SCOP superfamily (Murzin et al.,
1995) but present some relevant differences.
The prediction results for those cases with functional
implications, which include B01_orf178 and P01_orf193, are
discussed below in greater detail. All the alignments and
models resulting from this work are available from the author.
Table I. Summary of the structural predictions

ORF           Domain  Homolog(a)  Model     Class(b)  % IDE(c)  Z score  Fct(d)
(function)
A19_orf200            MG264       1ukd.-.-  α/β       17        –7.1     +
B01_orf178            MG030       1hgx.A.-  α/β       14        –6.4     +
D02_orf152            MG396       1bmt.A.-  α/β       14        –6.7
F10_orf100B           MG232       1htt.A.-  α/β       15        –6.9
F11_orf133            MG276       1oro.A.-  α/β       20        –6.6     +
P01_orf193            MG335       1gua.A.-  α/β       18        –6.7     +
(no function)
A05_orf290    10–76   MG125       1ctf.-.-  α+β       17        –6.3
B01_orf108            MG029       1put.-.1  α+β       13        –6.4
C09_orf159            MG207       4fxn.-.-  α/β       11        –5.0
F10_orf153            MG230       4fxn.-.-  α/β       17        –6.9
P01_orf197            MG333       1qrd.A.-  α/β       18        –8.6
(unique)
GT9_orf113                        1arn.-.-  β         14        –5.1

(a) Closest homologous sequence.
(b) Structural class of the template according to the SCOP database (Murzin et al., 1995).
(c) Percentage of identical residues between the target sequence and the template, calculated on the alignments obtained by ProFIT (Sippl, 1993a).
(d) Functional implications discussed in this work.

B01_orf178 (uracil phosphoribosyltransferase)
A putative transmembrane region could be detected in this
protein by the authors of the sequence (Himmelreich et al.,
1996) but it was not confirmed by the program ProSEQ for
sequence analysis (M.Jaritz, unpublished work) using a multiple sequence alignment. Therefore, the whole sequence was
used as input for fold recognition. The result is 1hgx.A.- as a
template for this protein, in agreement with Eisenberg’s result;
1hgx.A.- corresponds to one subunit of the homodimeric
hypoxanthine–guanine–xanthine
phosphoribosyltransferase
(HGXPRTase) from Tritrichomonas foetus (Somoza et al.,
1996).
Phosphoribosyltransferases (PRTases) catalyze the reversible
transfer of a 5-phosphoribosyl group between pyrophosphate
and a nitrogenous base. B01_orf178 and HGXPRTase have
similar activities but different specificities, belonging to the
pyrimidine and purine salvage pathways, respectively. Recently
reported structures for enzymes of nucleotide synthesis show
that several PRTases with very little sequence homology and
different specificity share a common fold (Smith, 1995).
The common core of the PRTase fold (residues 10–131
of HGXPRTase) consists of a five-stranded parallel β-sheet
surrounded by three or four α-helices. Additional structural
elements at the N- and C-termini of the molecule build up the
so-called ‘hood’ subdomain, that carries the determinants for
the base specificity of these enzymes. PRTases show only one
region of sequence conservation, which follows the consensus
[LIVMFYWCTA]–[LIVM]–[LIVMA]–[LIVMFC]–[DE]–
D–[LIVMS]–[LIVM]–[STAVD]–[STAR]–[GAC]–X–[STAR]
(Bairoch, 1992) and which encompasses the 5′-phosphate-binding loop of the active site.
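Checking a sequence against this consensus amounts to a position-by-position comparison. The sketch below encodes the 13 positions of the pattern quoted above and counts the mismatches of a 13-residue window; the helper names and the toy peptides are hypothetical and serve only to illustrate the comparison.

```python
# One entry per position of the PRTase consensus; None stands for X (any residue).
PRTASE_SIGNATURE = [
    "LIVMFYWCTA", "LIVM", "LIVMA", "LIVMFC", "DE", "D",
    "LIVMS", "LIVM", "STAVD", "STAR", "GAC", None, "STAR",
]


def count_mismatches(window, signature=PRTASE_SIGNATURE):
    """Number of positions at which the window violates the consensus."""
    if len(window) != len(signature):
        raise ValueError("window and signature must have the same length")
    return sum(1 for residue, allowed in zip(window, signature)
               if allowed is not None and residue not in allowed)


def best_windows(sequence, max_mismatches=1, signature=PRTASE_SIGNATURE):
    """Return (start, end, mismatches) for every window, with 1-based
    inclusive coordinates, that has at most max_mismatches violations."""
    n = len(signature)
    hits = []
    for start in range(len(sequence) - n + 1):
        mismatches = count_mismatches(sequence[start:start + n], signature)
        if mismatches <= max_mismatches:
            hits.append((start + 1, start + n, mismatches))
    return hits
```

The made-up peptide LVLVDDVLSTGGT satisfies every position, whereas LVLVDDVLSTGGK violates only the last one and would therefore be reported as a single-mismatch hit.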
As shown in Figure 1A, sequence B01_orf178 is aligned to
both the core and the ‘hood’ of the template structure. A
ribbon drawing of the resulting structural model for B01_orf178
is shown in Figure 2A. The target sequence presents only one
mismatch to the consensus sequence of PRTases (Figure 1A),
suggesting that this region of B01_orf178 (residues 96–108)
is also involved in 5′-phosphate binding. In particular, on the
basis of the protein-GMP contacts in 1hgx (Somoza et al.,
1996), it can be predicted that the side chains of Thr 105 and
Thr 108 in B01_orf178 make hydrogen bonds with the
5′-phosphate group of the substrate, like the aligned threonine
residues of HGXPRTase and that Asp 100 is involved in the
binding of the ribosyl moiety, by analogy to Glu 87 of
HGXPRTase. The second acidic residue (Asp 88 in
HGXPRTase), usually conserved in PRTases, is not present in
this loop of B01_orf178. Furthermore, the residue corresponding to Thr 41 of HGXPRTase is always a lysine or an
arginine in the other PRTases and it is probably involved in
pyrophosphate binding (Somoza et al., 1996). The lack of this
basic residue in HGXPRTase is thought to account for the
high Km for pyrophosphate of this enzyme (Somoza et al.,
1996). Similarly to all the other known PRTases, B01_orf178
has an arginine residue at this position. Other active site
contacts suggested by the alignment on the basis of the GMP
complex of HGXPRTase are hydrophobic interactions of Met
102 and Tyr 162 with the base. These results suggest that
uracil-PRTases share the general fold and configuration of the
active site observed so far in other PRTases. By analogy to
other known PRTases (Eads et al., 1994; Scapin et al., 1995;
Somoza et al., 1996), it is expected that B01_orf178 will also
form dimers in solution. Nevertheless, the variability of the
position and nature of the residues making up the dimer
interface in PRTases (Somoza et al., 1996) does not allow one
to derive further support for this hypothesis from sequence
inspection. The same limitation applies to the interactions
between the ‘hood’ and the base, which involve main-chain
atoms.
F11_orf133 (adenine phosphoribosyltransferase)
F11_orf133 and B01_orf178 have similar putative functions,
despite their low sequence similarity (16% identity). The
prediction result for F11_orf133 (Figure 2D) is the orotate
phosphoribosyltransferase (OPRTase) from Escherichia coli
(1oro; Henriksen et al., 1996), another protein of the PRTases
family, structurally related to HGXPRTase. In the results
reported in Figure 1B, the sequence of F11_orf133 is aligned
to the core region of OPRTase, excluding the structural
elements of the ‘hood’ subdomain (residues 1–41 and 184–
213 of OPRTase). F11_orf133 perfectly matches the PRTase
consensus signature (residues 76–88 of F11_orf133, Figure
1B). On the basis of the active-site interactions reported for
OPRTases (Scapin et al., 1995; Henriksen et al., 1996),
including those with complexed 5-phosphoribosyl-1-pyrophosphate (Scapin et al., 1995), the alignment reported in
Figure 1B supports prediction of the 5′-phosphate binding site
of F11_orf133 (residues 80–88), and also detection of the
putative pyrophosphate-binding region (residues 17–20 and
39–43). In particular, all the basic residues of OPRTases known
to be required for catalysis (Grubmeyer et al., 1993) and/or to
make contacts with the pyrophosphate (Scapin et al., 1995)
are conserved in F11_orf133 (Arg19, Arg39, Lys40 and Lys43).
The two aspartates involved in ribose binding are also conserved (Asp80 and Asp81 of F11_orf133), whereas the two
threonines probably making hydrogen bonds to the 5′-phosphate (Thr85 and Thr88 of F11_orf133) are shifted by one
residue from the active Thr128 and Thr131 of OPRTase. On
the basis of these findings, it seems likely that F11_orf133
shares the fold and the active residues typical of the core of
PRTases. It would be of interest to investigate further the base
specificity of F11_orf133 and its structural determinants in
the absence of the ‘hood’ subdomain.
Fig. 1. Sequence–structure alignments obtained by the program ProFIT. The upper sequence corresponds to the target protein, the lower sequence corresponds
to the template. The secondary structure of the template, as calculated by the program ProFIT, is reported below its sequence (e, β-strand; h, α-helix). The
sequence signatures referred to in the text are highlighted in gray. Vertical bars indicate residue identities, pairs of dots indicate residue similarities (scores
higher than 0 according to the Blosum62 substitution matrix; Henikoff and Henikoff, 1992).
P01_orf193 (hypothetical GTP-binding protein)
In agreement with the hypothetical GTP-binding activity of
this protein (Himmelreich et al., 1996), the results presented
here indicate that this is a ras-related protein, with 1gua.A.- (rap1A; Nassar et al., 1995) being the highest scoring template
(Figure 2E). The prediction reported by Eisenberg and co-workers is, instead, 1gky (Stehle and Schulz, 1992), the yeast
guanylate kinase (GK). Rap1A and GK are both α/β proteins
but present different architectures. Whereas rap1A consists of
a single domain folded around a central β-sheet, GK folds
into two domains each centered on a distinct β-sheet. The
minor domain of GK, containing the GMP-binding site, has
no counterpart in the so-called G fold of the ras-like proteins
(Stehle and Schulz, 1992). The two proteins also catalyze
different reactions, where GTP and ATP are, respectively, the
hydrolyzable substrates. The major domain of GK shows some
structural similarities to the G fold, particularly in the
N-terminal βα motif that includes the P loop typical of both
GTP- and ATP-binding proteins (Saraste et al., 1990). The N-terminal region of P01_orf193 (residues 27–34) contains the
generalized consensus sequence for the P loop G–X(4)–G–K–
[TS] (Saraste et al., 1990), which is aligned to the corresponding region of rap1A (Figure 1C). Nonetheless, there is also
good sequence conservation in other regions involved in GTP
binding, that are typical of ras-related proteins (Valencia et al.,
1991) and not of guanylate kinases. These are the NKXD
and the SAK/L boxes (residues 136–139 and 168–170 of
P01_orf193, respectively), which are involved in the guanine
base binding and a DXXG box (residues 70–73 of P01_orf193)
shifted by three residues from the corresponding box in rap1A.
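Short consensus motifs of this kind can be located with regular expressions. The sketch below translates the P loop consensus G–X(4)–G–K–[TS] and the NKXD and DXXG boxes into patterns and reports their positions; the helper names and the toy sequence in the usage note are invented for illustration, and the SAK/L box is left out because its consensus is not spelled out in full here.

```python
import re

# Consensus strings written with '-' between positions; X means any residue.
GTPASE_MOTIFS = {
    "P loop": "G-X(4)-G-K-[TS]",
    "NKXD box": "N-K-X-D",
    "DXXG box": "D-X-X-G",
}


def consensus_to_regex(consensus):
    """Translate a consensus such as 'G-X(4)-G-K-[TS]' into a regex string."""
    parts = []
    for element in consensus.split("-"):
        wildcard = re.fullmatch(r"X(?:\((\d+)\))?", element)
        if wildcard:
            count = wildcard.group(1) or "1"
            parts.append("." if count == "1" else ".{%s}" % count)
        else:
            parts.append(element)  # single residue or bracketed class
    return "".join(parts)


def find_motifs(sequence, motifs=GTPASE_MOTIFS):
    """Return {motif name: [(start, end), ...]} with 1-based inclusive
    coordinates; a lookahead is used so overlapping matches are reported."""
    found = {}
    for name, consensus in motifs.items():
        pattern = re.compile("(?=(%s))" % consensus_to_regex(consensus))
        found[name] = [(m.start() + 1, m.start() + len(m.group(1)))
                       for m in pattern.finditer(sequence)]
    return found
```

Applied to the made-up sequence MTGAGGVGKSALDTAGQNKCD, the search reports a P loop at residues 3–10, a DXXG box at 13–16 and an NKXD box at 18–21. Matches to such short patterns are expected by chance, so a hit is only meaningful in the context of the sequence–structure alignment.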
Fig. 2. Ribbon representation of the structural models presented in this work. The raw models generated by ProFIT were refined by the program Modeler (Šali and
Blundell, 1993). The images were produced using the program RasMol (Sayle and Milner-White, 1995) and rendered using the program MolScript (Kraulis, 1991).
Terminal residues of the input sequence not aligned to the template were not included in the models. A, B01_orf178/1hgx.A.-; B, D02_orf152/1bmt.A.-; C,
F10_orf100B/1htt.A.-; D, F11_orf133/1oro.A.-; E, P01_orf193(res. 18–188)/1gua.A.-; F, A05_orf290(res. 10–76)/1ctf.-.-; G, A19_orf200(res. 1–187)/1ukd.-.-; H,
B01_orf108/1put.-.1; I, C09_orf159(res. 1–136)/4fxn.-.-; J, F10_orf153(res. 20–153)/4fxn.-.-; K, P01_orf197/1qrd.A.-; L, GT9_orf113/1arn.-.-.
The residue following the DXXG box (Gln61 in ras p21 and
Thr61 in rap1A) has been implicated in the catalytic mechanism
as the activator (via hydrogen bonding) of the water molecule
that performs the nucleophilic attack on the γ-phosphate of
GTP (Pai et al., 1990; Nassar et al., 1995). The presence of a
threonine residue at this position in rap1A accounts for its
reduced intrinsic GTPase activity relative to ras p21 (Frech
et al., 1990). It is interesting that the corresponding residue in
P01_orf193 would be a tyrosine (Tyr 74). In conclusion,
inspection of the sequence alignment strengthens the prediction
of 1gua as a good template for this sequence and allows at
the same time assignment of the corresponding functional
residues to P01_orf193. Therefore, P01_orf193 is with high
probability a regulatory GTP-binding protein. Nevertheless,
its sequence shows poor conservation of the effector region
of rap1A and ras p21 (residues 32–40 of rap1A; Nassar et al.,
1995), and also no match to the GTP-binding elongation
factors signature (Bairoch, 1992), suggesting that it plays a
distinct physiological role.
A19_orf200 (nucleotide-binding protein)
The predicted template for this sequence (Figure 2G) is
1ukd.-.-, the UMP/CMP kinase (UK) from slime mold
(Scheffzek et al., 1996). A glycine-rich motif typical of the
P loop of nucleotide-binding proteins had already been detected
in its N-terminal region (see Swissprot:y264_mycge). In the
alignment reported in Figure 1D the glycine-rich sequence of
A19_orf200 (residues 7–16) matches the corresponding region
of UK at the end of the first β strand. The high sequence
conservation observable in this region, including the lysine
and threonine side-chains (residues 13 and 15 of A19_orf200,
respectively) that make hydrogen bonds with the phosphoryl
groups of the competitive inhibitor in 1ukd, suggests that
A19_orf200 binds ATP in a similar fashion to UK. The lysine
of the P loop corresponding to Lys13 of A19_orf200 has been
shown to be essential for the catalytic activity of adenylate
kinases (Reinstein et al., 1990; Tian et al., 1990; Byeon et al.,
1995), a family of enzymes highly related to UK (Scheffzek
et al., 1996). Most of the hydrophobic residues of the
UMP-binding site of UK (residues 31, 35, 62, 63, 68, 89
of A19_orf200) appear to be conserved. Nevertheless,
A19_orf200 shows poor conservation of the sequence signature
of adenylate and UMP/CMP kinases (residues 85–96 of UK;
Bairoch, 1992), and also low sequence similarity to the region
corresponding to the ‘lid’ of UK (residues 130–140 of UK),
which carries several catalytically important residues. Therefore, A19_orf200 is with high probability a kinase, but its
functional equivalence to UK remains ambiguous.
Discussion
This work represents one of the first attempts at the systematic
application of fold recognition techniques to genome databases,
although the analysis is restricted by the size limit of 200
amino acids. The implementation of automated procedures for
fragmentation of the input sequence will allow one to extend
systematic fold recognition analysis to large protein sequences.
The genome of M.pneumoniae contains 144 non-homologous
sequences below 200 amino acids encoding soluble proteins.
Of these, only 20 (14%) show significant sequence similarity
to proteins with known 3D structure and can, therefore, be
assigned a structure by normal sequence analysis. High-confidence fold recognition results were obtained for another
12 sequences, representing an additional 8% (10% of the target
sequences) to the intrinsic limits of structural prediction by
sequence homology. Although the efficiency of structural
prediction might decrease for large protein sequences, these
results show that fold recognition techniques can make a
significant contribution to the developing new discipline of
functional genomics (Hieter and Boguski, 1997).
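The percentages quoted in this paragraph follow directly from the counts given in the text, as the short check below illustrates. The variable names are chosen for the example only.

```python
# Counts taken from the text.
non_homologous_soluble = 144      # sequences below 200 residues
with_structural_homolog = 20      # assigned by sequence similarity
high_confidence_predictions = 12  # assigned by fold recognition

# Sequences left as fold recognition targets after the homology filter.
targets = non_homologous_soluble - with_structural_homolog


def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


homology_coverage = percent(with_structural_homolog, non_homologous_soluble)
added_coverage = percent(high_confidence_predictions, non_homologous_soluble)
target_efficiency = percent(high_confidence_predictions, targets)
```

The 20 sequences with a structural homolog correspond to 14% of the 144 sequences, and removing them leaves the 124 targets; the 12 predictions are 8% of the 144 sequences and 10% of the 124 targets.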
The results obtained in this work also show that fold
recognition can provide, in several cases, useful information
about the function and the active site of the target protein.
Both processes of evolutionary divergence and convergence
can generate structurally related proteins containing similar
active residues in the absence of overall sequence similarity.
Detection of such cases is of great predictive importance
but cannot rely on sequence–sequence alignment; it requires
sequence–structure alignment methods. Inspection of residue conservation on the alignments generated by fold recognition led, in
this study, to the identification of putative active residues of
four proteins, providing heuristic models for future functional studies.
The results presented here were compared with those
reported for the homologous genome of M.genitalium (Fischer
and Eisenberg, 1997). With the exception of one, so far unique,
sequence, the proteins predicted in this work have a close
homolog in M.genitalium, which generates ProFIT results
consistent with those obtained with the corresponding sequence
from M.pneumoniae. Nevertheless, only two of these sequences
appear among the predictions reported for M.genitalium. The
data obtained in this work are in agreement with one of these
predictions (B01_orf178) but strongly suggest an alternative
model for the second one (P01_orf193). The comparison of
the two sets of results also indicates that two sequences
shorter than 200 residues (MG287/F11_orf84 and MG353/
G12_orf109) and reported among the predictions for M.genitalium are not included in the results presented here. In the case
of MG287/F11_orf84, the reason is that these sequences show
significant sequence similarity (~25% identity) to the acyl
carrier protein (1acp) and therefore were not targets of the
present work. The second sequence, instead, did not generate
high-confidence fold recognition results with the method
employed in this work. The reported prediction for MG353 is
1huea, the histone-like protein HU. It is interesting that HU
is not a compact globular protein; rather, it contains two
extended, partially disordered β ribbons (Tanaka et al., 1984).
It is therefore expected that the program ProFIT, based on
distance analysis of globular proteins, will fail to detect cases
of this kind.
A critical problem in fold recognition is the method for the
evaluation of the output. Extensive human inspection of the
results, as performed in this work, is an approach that allows
for high sensitivity, although it is time consuming. On the
other hand, fully automated fold-assignment procedures allow
for high efficiency but require the application of stringent score
thresholds, resulting in lower sensitivity. Highly automated
procedures will be required for efficient application of fold
recognition techniques to large databases, such as genome
sequences. At the same time, this work emphasizes the
importance of complex and articulated evaluation systems.
Therefore, the optimization of the scoring systems to maximize
both accuracy and sensitivity represents an urgent demand in
this field in order to develop more highly automated procedures for
functional genomics.
Note added in proof
After submission of this paper, a related paper, ‘Homology-based fold predictions for Mycoplasma genitalium proteins’
by M.Huynen, T.Doerks, F.Eisenhaber, C.Orengo, S.Sunyaev,
Y.Yuan and P.Bork, was published (J. Mol. Biol., 1998, 280,
323–326). By use of the program PSI-BLAST, the authors
obtained identical or very similar structural predictions to
those presented here for the sequences MG030, MG276,
MG335, MG264 and MG333. The authors also report a higher
efficiency (23% of the target residues) of homology-based
structural prediction for new protein sequences than previously
achieved by other methods.
Acknowledgments
Francisco Domingues (CAME, Salzburg) is thanked for useful discussions
and technical assistance, Manfred Sippl (CAME, Salzburg) for providing the
opportunity to carry out this study and Jannette Carey (Princeton University)
for critical reading of the manuscript. This work was supported by FWF grant
P11601GEN.
References
Bairoch,A. (1992) Nucleic Acids Res., 20, 2013–2018.
Blattner,F.R. et al. (1997) Science, 277, 1453–1474.
Bult,C.J. et al. (1996) Science, 273, 1058–1073.
Byeon,I.J.L., Shi,Z.T. and Tsai,M.-D. (1995) Biochemistry, 34, 3172–3182.
Cole,S.T. et al. (1998) Nature, 393, 537–544.
Deckert,G. et al. (1998) Nature, 392, 353–358.
Eads,J.C., Scapin,G., Xu,Y., Grubmeyer,C. and Sacchettini,J.C. (1994) Cell,
78, 325–334.
Feng,Z.-K. and Sippl,M.J. (1996) Folding Des., 1, 123–132.
Fischer,D. and Eisenberg,D. (1997) Proc. Natl Acad. Sci. USA, 94, 11929–
11934.
Fleischmann,R.D. et al. (1995) Science, 269, 496–512.
Flöckner,H., Braxenthaler,M., Lackner,P., Jaritz,M., Ortner,M. and Sippl,M.J.
(1995) Proteins, 23, 376–386.
Flöckner,H., Domingues,F. and Sippl,M.J. (1997) Proteins, Suppl. 1, 129–133.
Fraser,C.M. et al. (1995) Science, 270, 397–403.
Fraser,C.M. et al. (1997) Nature, 390, 580–586.
Fraser,C.M. et al. (1998) Science, 281, 375–388.
Frech,M., John,J., Pizon,V., Chardin,P., Tavitian,A., Clark,R., McCormick,F.
and Wittinghofer,A. (1990) Science, 249, 169–171.
Goffeau,A. et al. (1996) Science, 274, 546–567.
Grubmeyer,C., Segura,E. and Dorfman,R. (1993) J. Biol. Chem., 268,
20299–20304.
Henikoff,S. and Henikoff,J.G. (1992) Proc. Natl Acad. Sci. USA, 89,
10915–10919.
Henriksen,A., Aghajari,N., Jensen,K.F. and Gajhede,M. (1996) Biochemistry,
35, 3803–3809.
Hieter,P. and Boguski,M. (1997) Science, 278, 601–602.
Himmelreich,R., Hilbert,H., Plagens,H., Pirkl,E., Li,B.-C. and Herrmann,R.
(1996) Nucleic Acids Res., 24, 4420–4449.
Kaneko,T. et al. (1996) DNA Res., 3, 109–136.
Klenk,H.-P. et al. (1997) Nature, 390, 364–370.
Kraulis,P.J. (1991) J. Appl. Crystallogr., 24, 946–950.
Kunst,F. et al. (1997) Nature, 390, 249–256.
Levitt,M. (1997) Proteins, Suppl. 1, 92–104.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol.,
247, 536–540.
Nassar,N., Horn,G., Herrmann,C., Scherer,A., McCormick,F. and
Wittinghofer,A. (1995) Nature, 375, 554–560.
Pai,E.F., Krengel,U., Petsko,G.A., Goody,R.S., Kabsch,W. and Wittinghofer,A.
(1990) EMBO J., 9, 2351–2359.
Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444–
2448.
Reinstein,J., Schlichting,I. and Wittinghofer,A. (1990) Biochemistry, 29,
7451–7459.
Rost,B. (1998) Structure, 6, 259–263.
Šali,A. and Blundell,T.L. (1993) J. Mol. Biol., 234, 779–815.
Sander,C. and Schneider,R. (1991) Proteins, 9, 56–68.
Saraste,M., Sibbald,P.R. and Wittinghofer,A. (1990) Trends Biochem. Sci., 15,
430–434.
Sayle,R.A. and Milner-White,E.J. (1995) Trends Biochem. Sci., 20, 374.
Scapin,G., Ozturk,D.H., Grubmeyer,C. and Sacchettini,J.C. (1995)
Biochemistry, 34, 10744–10754.
Scheffzek,K., Kliche,W., Wiesmüller,L. and Reinstein,J. (1996) Biochemistry,
35, 9716–9727.
Sippl,M.J. (1993a) J. Comput.-Aided Mol. Des., 7, 473–501.
Sippl,M.J. (1993b) Proteins, 17, 335–362.
Sippl,M.J. and Flöckner,H. (1996) Structure, 4, 15–19.
Sippl,M.J. and Jaritz,M. (1994) In Bohr,H. and Brunak,S. (eds), Protein
Structure by Distance Analysis. IOS Press, Amsterdam, pp. 113–134.
Smith,D.R. et al. (1997) J. Bacteriol., 179, 7135–7155.
Smith,J.L. (1995) Curr. Opin. Struct. Biol., 5, 752–757.
Somoza,J.R., Chin,M.S., Focia,P.J., Wang,C.C. and Fletterick,R.J. (1996)
Biochemistry, 35, 7032–7040.
Stehle,T. and Schulz,G.E. (1992) J. Mol. Biol., 224, 1127–1141.
Tanaka,I., Appelt,K., Dijk,J., White,S.W. and Wilson,K.S. (1984) Nature, 310,
376–381.
Tian,G., Yan,H., Jiang,R.-T., Kishi,F., Nakazawa,A. and Tsai,M.-D. (1990)
Biochemistry, 29, 4296–4304.
Tomb,J.-F. et al. (1997) Nature, 388, 539–547.
Valencia,A., Kjeldgaard,M., Pai,E.F. and Sander,C. (1991) Proc. Natl Acad.
Sci. USA, 88, 5443–5447.
Received March 11, 1998; revised August 12, 1998; accepted August 17, 1998