Protein domains are the structural and functional units of proteins. The ability to parse proteins into their constituent domains is important for effective classification and for understanding protein structure, function and evolution. Several computational methods are available to identify domains in a sequence, but domain-finding algorithms often employ stringent thresholds to recognize sequence domains. Identifying additional domains can be tedious, involving intensive computation and manual intervention, but it can lead to a better understanding of overall biological function. In this context, the problem of identifying new domains in the unassigned regions of a protein sequence assumes crucial importance. We report the availability of a convenient server for domain prediction in unassigned regions in proteins (PURE), which can be accessed at http://caps.ncbs.res.in/PURE/
Introduction
Protein domains are the structural and functional units of proteins and represent one of the most useful levels at which to understand protein function. Analysis of proteins at the level of domain families has had a profound impact on the study of individual proteins. Information on additional co-existing domains can reveal the overall function of a protein. The word ‘domain’ was coined by Wetlaufer 1 to suggest the presence of compact substructures within protein folds: such domains are called ‘structural domains’. Soon afterwards, biochemical experiments showed that some domains are capable of acting as independent folding units: such domains are called ‘folding domains’. In several proteins involved in signal transduction, individual domains may each carry out a specific function, so that the whole gene product can multi-task in response to requirements and be recruited to particular biochemical pathways: such domains are called ‘functional domains’. In many instances, these domains and their boundaries coincide. Domains can be defined using multiple criteria, or combinations of criteria, including evolutionary conservation, discrete functionality and the ability to fold independently 2. Several reports have reviewed the domain architectures of members of protein families to suggest the overall function of whole proteins (for example, 3-6 for phosphatases).
Protein domain discovery using various computational approaches has been progressing steadily over the past 35 years 7. The identification of protein domains within a polypeptide chain can be achieved in several ways. Methods applied by classification databases such as the Dali Domain Dictionary 8, CATH 9, SCOP 10, DIAL 11 and HOMSTRAD 12 employ structural data to locate and assign domains. Such structural data can be queried using objective algorithms (for example, SCOP information is organized in the SUPERFAMILY database as Hidden Markov Models (HMMs) 13). Identification of domains at the sequence level most often relies on the detection of global and local sequence alignments between a given target sequence and domain sequences found in databases such as Pfam 14. These organized databases of sequence domain families can also be queried using objective HMM algorithms. Continuing efforts to improve domain identification have produced a wealth of different algorithms, such as the very recently developed DOMAC 15, DOMPRED 16 and DomainDiscovery 17 (see the Bioinformatics Link Directory for a comprehensive list of available servers in this field 18). However, difficulties in elucidating the domain content of a given sequence still arise when a search of the target sequence against sequence or structural databases yields no significant matches. For example, Mycoplasma genitalium has a small genome encoding 483 proteins, but only 386 protein sequences have known Pfam hits, with 56% residue coverage 19, emphasizing the need to explore other methods for domain assignment from sequence. Such gaps are largely due to high evolutionary divergence, distant similarities and incomplete or unequal representation of protein families in sequence space 20, 21.
Although similar approaches that integrate multiple sensitive database searches to detect distant homologues have been reported as successful in establishing remote homology 22, 23, we have recently shown that it is possible to enhance the prediction of domains by 25% through indirect connections, namely by consulting the domain architectures of sequence homologues 24.
In this paper, we report the availability of a bioinformatics protocol as a web server called PURE (domain Prediction in unassigned regions in proteins), which enhances domain prediction. The PURE protocol uses the concept of Intermediate Sequence Search (ISS) 22 to assign a functional domain to a given unassigned region with the help of connecting sequences. Whereas ISS is classically applied to sequence similarity at the whole-protein level 22, in this protocol we make use of the property that homologous sequences can adopt similar domain architectures; where sequence homology extends across different domains, extrapolating the domain architecture from homologue to query is plausible.
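To make the notion of an unassigned region concrete, the following minimal Python sketch lists the stretches of a sequence that are not covered by any existing domain assignment. It is an illustration only: the function name, the 1-based inclusive coordinates and the minimum length of 30 residues are assumptions made for this example and are not parameters documented for the PURE server.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def unassigned_regions(seq_len: int,
                       domains: List[Interval],
                       min_len: int = 30) -> List[Interval]:
    """Return stretches of a sequence not covered by any assigned domain.

    min_len is an illustrative cut-off (an assumption, not a PURE setting):
    gaps shorter than this are ignored as unlikely to hold a whole domain.
    """
    regions = []
    cursor = 1  # first residue not yet known to be covered
    for start, end in sorted(domains):
        # Residues cursor..start-1 form a gap of length (start - cursor).
        if start - cursor >= min_len:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)  # handles overlapping assignments
    # Trailing gap after the last assigned domain.
    if seq_len - cursor + 1 >= min_len:
        regions.append((cursor, seq_len))
    return regions


# A 300-residue protein with two overlapping assignments (made-up values).
print(unassigned_regions(300, [(50, 120), (100, 180)]))
# [(1, 49), (181, 300)]
```

Each returned interval would be a candidate query for the homology searches described next.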
Unassigned regions in proteins are searched against the non-redundant (NR) sequence database using PSI-BLAST to find homologues at stringent thresholds. Representative homologues identified for the unassigned region are traced back to their full-length sequences and subjected to a time-intensive hmmpfam search, which delineates the domain architectures of all homologues. Where a direct hmmpfam search of the query yields no relationship to a pre-existing protein domain family, we have earlier shown that as many as 25% of connections can be obtained by assigning domains indirectly; these are termed ‘indirect connections’. An indirect connection between the query and a distantly related domain is established through PSI-BLAST hits that are individually routed through a rigorous hmmpfam search against the Pfam database 25. In addition, the PURE server automatically reports other structurally relevant findings, such as predicted coiled coils and transmembrane helices.
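The logic of an indirect connection can be sketched in a few lines of Python. The sketch assumes that the PSI-BLAST and hmmpfam steps have already been run and their results parsed into plain records; it does not call either program. The `Hit` record, the function names and the rule that at least half of a homologue's domain must lie within the aligned segment are hypothetical choices made for illustration, not the criteria used by the PURE server.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Hit:
    """A homologue matched by the unassigned region (hypothetical record)."""
    homologue_id: str
    s_start: int  # aligned segment on the full-length homologue, 1-based
    s_end: int    # inclusive


Domain = Tuple[str, int, int]  # (domain name, start, end) on the homologue


def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of residues shared by two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def indirect_connections(hits: List[Hit],
                         architectures: Dict[str, List[Domain]],
                         min_fraction: float = 0.5) -> Counter:
    """Count, per domain family, the homologues supporting a transfer.

    A domain supports the query only if at least min_fraction of it lies
    inside the segment aligned to the unassigned region (assumed rule).
    """
    support = Counter()
    for hit in hits:
        supported = set()  # count each family once per homologue
        for name, d_start, d_end in architectures.get(hit.homologue_id, []):
            shared = overlap(hit.s_start, hit.s_end, d_start, d_end)
            if shared / (d_end - d_start + 1) >= min_fraction:
                supported.add(name)
        support.update(supported)
    return support


# Made-up hits and domain architectures, for illustration only.
hits = [Hit("H1", 200, 320), Hit("H2", 10, 130), Hit("H3", 5, 60)]
architectures = {
    "H1": [("DomA", 20, 180), ("DomB", 210, 300)],
    "H2": [("DomB", 15, 105)],
    "H3": [],  # homologue with no recognizable domain
}
print(indirect_connections(hits, architectures))
# Counter({'DomB': 2})
```

Here the unassigned region aligns to the DomB segment of two homologues, so DomB is the candidate assignment; DomA lies outside the aligned segment of H1 and is not transferred.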