ImmunoDB Support Page
- The main page allows you to select a gene family and view annotations and phylogenetic data for one or all species.
- For a given family, links to the phylogenetic data are given at the top of the page (including a previously published tree in most cases), followed by the tables of annotation data.
- Annotated phylogenetic trees are presented as PDF files for the 3-way Dmel-Agam-Aaeg analysis and, in most cases, for the 4-way analysis that also includes Cpip.
- Colour codes: Dmel - BLUE, Agam - RED, Aaeg - ORANGE, Cpip - PURPLE
- Links to files - some browsers will automatically ask whether you want to open or download the file; others require you to right-click on the link in order to download it.
- The trees are in Newick format. You will need a tree-viewing programme such as NJplot or TreeView in order to view them, and you should be able to set your browser to open the tree files automatically using one of these programmes.
- The 'Muscle Alignment' - may be viewed in any alignment viewer (e.g. ClustalX or BioEdit).
- The 'Gblock View' - an HTML page showing the full alignment (Muscle) and highlighting the conserved cores extracted by Gblocks.
- The 'Gblocked PhyML Tree' - the Maximum Likelihood (PhyML) tree computed from the conserved core extracted by Gblocks.
- The 'Gapless NJ Tree' - the Neighbour-Joining (ClustalW) tree computed from the full sequence alignment with all gap positions removed.
- The 'Tree-Puzzle Tree' - the consensus Maximum Likelihood (Tree-Puzzle) tree computed from the 1000 bootstrap-sample PhyML trees; all branches with less than 50% support are collapsed.
- NB: ImmunoDB IDs (IDB.IDs; the first column of the annotation tables) have usually been replaced in the tree files and annotated tree files with known or proposed gene names where they exist. Where no name exists, IDB.IDs may be displayed.
- You can export the cDNA and Protein sequences (where available) for a given family for a given species by clicking on the link above each annotation table 'Export cDNA or Protein Sequences'.
The strategy detailed below aims to identify genes or protein domains which are under similar evolutionary pressures and thereby indicate conserved functions. The domain-based trees may deviate from gene genealogies but are more likely to reflect the domain functional loads. The resulting trees should be interpreted with reference to the species tree.
Multiple sequence alignments were computed using Muscle. In a few cases, such as the PGRPs, TOLLs (TIR domains) and SCRAs (SRCR domains), domains rather than whole genes had to be aligned; in these cases the sequences were aligned to an HMM profile in order to optimise the local alignment of the domain.
Neighbour-Joining (NJ) trees were built from the gapless columns (all gapped columns excluded) of full-sequence alignments (and domain-based alignments where appropriate) using ClustalW.
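As a rough illustration of the gapless-column step described above, the following Python sketch drops every alignment column that contains a gap. It is not the ClustalW implementation; the function name, the toy sequences and the choice of gap characters are assumptions made for the example.

```python
def remove_gapped_columns(alignment, gap_chars="-."):
    """Return a copy of the alignment keeping only columns without any gap.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    if not alignment:
        return {}
    rows = list(alignment.values())
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column only if no sequence has a gap character at that position.
    keep = [i for i in range(length)
            if all(row[i] not in gap_chars for row in rows)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


# Toy alignment (hypothetical sequences, for illustration only).
toy = {
    "seqA": "MK-LVT",
    "seqB": "MKALV-",
    "seqC": "MRALVT",
}
gapless = remove_gapped_columns(toy)
for name, seq in gapless.items():
    print(name, seq)
```

Columns 3 and 6 of the toy alignment contain a gap in at least one sequence, so only the four remaining columns are kept.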
We used Gblocks to extract confidently aligned regions which show conservation across the family. All gap positions are excluded, as they cannot be properly modelled with respect to amino acid substitutions. A conserved block was required to contain at least three conserved positions (present in more than half of the sequences) and at most 10 non-conserved positions.
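To make the notion of a conserved position more concrete, the sketch below flags alignment columns in which a single residue is present in more than half of the sequences and no gap occurs. This is a simplified reading of the criterion stated above, not the Gblocks algorithm itself (which also applies block-length and flanking rules); the function name and toy sequences are assumptions.

```python
from collections import Counter


def conserved_columns(rows):
    """Return one boolean per column: True if the column is gap-free and
    its most common residue occurs in more than half of the sequences."""
    n = len(rows)
    flags = []
    for column in zip(*rows):
        if "-" in column:
            # Gap positions are excluded outright.
            flags.append(False)
            continue
        _, count = Counter(column).most_common(1)[0]
        flags.append(count > n / 2)
    return flags


# Toy alignment rows (hypothetical, for illustration only).
rows = ["MKALVT", "MRALIT", "MHGLVS", "MQ-LVT"]
flags = conserved_columns(rows)
print("".join("C" if f else "." for f in flags))
```

In the toy data the second column has four different residues and the third contains a gap, so neither is flagged as conserved.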
Maximum Likelihood (ML) allows for differential substitution rates between lineages and at different sites and is therefore well-suited to the analysis of distantly related sequences. The sequence relationships were estimated in terms of amino acid substitutions using PhyML.
The robustness of the NJ and PhyML trees was estimated by performing bootstrap analysis with 1000 samples. The PhyML trees were built from the non-gapped conserved core of the sequence alignments computed by Gblocks, and represent our best estimate of the functional relationships in these immune-related gene families. These trees suffer when the conserved core is small and highly conserved, which leaves only a few variable positions and leads to low bootstrap support.
In order to increase the size of the conserved cores, genes which disrupt the alignment with gaps (due to fragmented predictions) were removed. Branch points with low support should be considered as multifurcating points to indicate the uncertainty of the relationships and thereby avoid presenting misleading trees. Reconciliation of gene and species trees can suggest gene duplications and losses which may have occurred to produce a given gene tree; consulting these reconciliations may help to resolve unclear branch placements.
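The idea of treating weakly supported branch points as multifurcations can be sketched as follows. The code collapses every internal node whose support falls below a threshold (50% here, matching the collapsed consensus trees) by attaching its children directly to its parent. The `Node` class, the helper names and the toy tree are assumptions for illustration and do not reproduce the Tree-Puzzle or PhyML code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: Optional[str] = None          # leaf label (None for internal nodes)
    support: Optional[float] = None     # bootstrap support in percent
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self):
        return not self.children


def collapse_low_support(node, threshold=50.0):
    """Return a new tree in which internal nodes with support below
    `threshold` are removed and their children moved up to the parent."""
    if node.is_leaf():
        return node
    new_children = []
    for child in node.children:
        child = collapse_low_support(child, threshold)
        weak = (not child.is_leaf()
                and child.support is not None
                and child.support < threshold)
        if weak:
            new_children.extend(child.children)   # creates a multifurcation
        else:
            new_children.append(child)
    return Node(node.name, node.support, new_children)


def to_newick(node):
    """Serialise the tree topology with support values as node labels."""
    if node.is_leaf():
        return node.name
    inner = ",".join(to_newick(c) for c in node.children)
    label = "" if node.support is None else f"{node.support:g}"
    return f"({inner}){label}"


# Toy tree (hypothetical): ((A,B)35,(C,D)90)
tree = Node(children=[
    Node(support=35, children=[Node("A"), Node("B")]),
    Node(support=90, children=[Node("C"), Node("D")]),
])
collapsed = collapse_low_support(tree)
print(to_newick(collapsed) + ";")
```

The (A,B) grouping has only 35% support, so A and B end up attached directly to the root alongside the well-supported (C,D) clade.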
Gene Model Status Levels:
A) Confirmed with high confidence and expert-refined cDNA supplied.
The protein is confirmed as a member of the gene family, and the refined cDNA is supplied based on any additional evidence including ESTs and other experimental data that support the refined cDNA. Feedback at this level will be the most comprehensive and therefore the most useful.
B) Confirmed with high confidence, no refinement required.
The protein is confirmed as a member of the gene family with high confidence and there is no additional evidence to suggest that the cDNA requires refinement.
C) Confirmed with uncertainties.
Despite uncertainties regarding the gene model, and a lack of additional evidence upon which to base refinements, the protein is confirmed as a member of the gene family - for example where a family-defining domain is clearly present.
D) Confirmed but truncated (usually at 3' or 5' ends).
The protein may be confirmed as a member of the gene family; however, the family signature may be truncated, and there is a lack of additional evidence upon which to base refinements which could correct such a truncation.
E) Rejected - not a member of the gene family.
The protein is not a member of the gene family.
F) Under revision.
The gene model is still under revision.
The rules from the Anopheles Immunity Gene Family Analysis were as follows:
1. The names are mnemonic symbols, designed for easy recall. They do not aim to summarize all current information, which in any case is incomplete and subject to errors (orthology, function, chromosomal location).
2. To avoid errors in electronic communication all names consist exclusively of capital letters of the Latin alphabet and Arabic numerals; no punctuation marks, dashes etc. are used.
3. To minimize their length, the formal names do not include taxonomic initials. If similarly named genes of two organisms are being compared, taxonomic initials can be added for convenience, but do not constitute part of the name (e.g. AgTEP to be easily distinguished from DmTep).
4. Roman letters and numerals indicate protein, italics indicate gene or RNA.
5. The name is based on sequence similarities and carries no functional implications, which must be determined experimentally.
6. The name consists of two to three contiguous fields, as follows:
- The first field includes three to five letters and is an abbreviation of the highest sequence grouping used, usually a protein family, e.g. CLIP (for Clip-domain serine protease).
- The second field, if present, includes one or more letters identifying a subgroup such as subfamily (e.g. CLIPD), or class (e.g. SCRB).
- The third field enumerates each gene by using consecutive numerals (e.g. SCRB1,… 12).
- Sometimes the third field numeral can be preceded by letter(s) indicating gene types within a subgroup (e.g. SCRBQ1, for a gene belonging to the SCRB Class, and to the croquemort type).
- For historical reasons, in certain families, the third field can also enumerate by letters rather than numerals (e.g. PGRPLA, for gene A of the Long subfamily in the PGRP family).
7. It is recommended that names previously used in the literature or in database submissions be gradually replaced by systematic names, following consultation with the original author. Historical names or names that may be developed eventually to indicate experimentally verified function or orthology can be used as synonyms.
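Rule 2 and the start of rule 6 lend themselves to a simple automated check. The sketch below tests only that a symbol consists exclusively of capital Latin letters and Arabic numerals and begins with at least three letters; it does not attempt to split a name into its fields, which would require a list of known families. The function name and pattern are assumptions, not part of the published rules.

```python
import re

# Capital letters and digits only, starting with at least three letters
# (the shortest first field allowed by rule 6).
SYMBOL_PATTERN = re.compile(r"[A-Z]{3}[A-Z0-9]*")


def is_valid_symbol(name):
    """Return True if `name` passes the character and prefix checks."""
    return SYMBOL_PATTERN.fullmatch(name) is not None


for candidate in ["CLIPD1", "SCRBQ1", "PGRPLA", "AgTEP", "CLIP-D1"]:
    print(candidate, is_valid_symbol(candidate))
```

Names carrying taxonomic initials, such as AgTEP, fail the check because the initials contain a lowercase letter, which is consistent with rule 3: the initials are not part of the formal name.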
Additional nomenclature rules devised to undertake the naming of Aedes and Culex genes:
A. Where orthology is clear (robust bootstrap support) in the gene tree, Aedes and Culex genes will be named according to their Anopheles orthologues. Where orthology is tentative in the gene tree, it can be further investigated by examining the specific clade to decide whether there is enough confidence to assign orthology.
B. Where Aedes and/or Culex expansions relative to an Anopheles gene are clear (robust bootstrap support) in the gene tree, Aedes and/or Culex genes will be named using the number from the corresponding Anopheles gene, suffixed with uppercase letters A, B, C etc. E.g. AgHPX8 and AaHPX8A, AaHPX8B, AaHPX8C, etc.
C. The remaining Aedes and Culex genes which are neither clear orthologues (point A above), nor clear expansions (point B above) relative to Anopheles, will be named according to the rules defined in points 1-6 above, starting with the number following the highest number assigned to an Anopheles gene.
Family XYZ in Anopheles consists of nine members XYZ1-9.
Family XYZ in Aedes consists of eleven members to be named.
Five of the Aedes genes can be assigned names based on clear orthology to Anopheles XYZ1, 2, 3, 5 and 8.
The XYZ4 gene is clearly expanded by recent duplication in Aedes, giving two genes which will be named XYZ4A, and XYZ4B.
The remaining four Aedes genes will then be named XYZ10, 11, 12 and 13.
Thus XYZ1, 2, 3, 5 and 8 are 1:1 orthologues, XYZ4 is duplicated in Aedes to give 4A and 4B, while Anopheles XYZ6, 7 and 9 are specific to Anopheles and Aedes XYZ10, 11, 12 and 13 are specific to Aedes.
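The XYZ example can be reproduced with a short sketch of rules A to C. The function and its parameters are hypothetical; in practice the orthology and expansion assignments come from inspecting the gene trees, not from code.

```python
from string import ascii_uppercase


def name_genes(family, anopheles_numbers, orthologue_numbers,
               expansions, n_unplaced):
    """Propose names for the genes of a second species.

    orthologue_numbers: Anopheles gene numbers with a clear 1:1 orthologue (rule A)
    expansions: {Anopheles gene number: number of copies} (rule B)
    n_unplaced: genes that are neither orthologues nor expansions (rule C)
    """
    # Rule A: clear orthologues take the Anopheles number.
    names = [f"{family}{n}" for n in orthologue_numbers]
    # Rule B: expanded genes take the Anopheles number plus A, B, C, ...
    for number, copies in expansions.items():
        if copies > len(ascii_uppercase):
            raise ValueError("more copies than single-letter suffixes")
        names.extend(f"{family}{number}{ascii_uppercase[i]}"
                     for i in range(copies))
    # Rule C: remaining genes continue after the highest Anopheles number.
    next_number = max(anopheles_numbers) + 1
    names.extend(f"{family}{next_number + i}" for i in range(n_unplaced))
    return names


aedes_names = name_genes(
    family="XYZ",
    anopheles_numbers=range(1, 10),      # Anopheles XYZ1-9
    orthologue_numbers=[1, 2, 3, 5, 8],
    expansions={4: 2},                   # XYZ4 duplicated in Aedes
    n_unplaced=4,
)
print(aedes_names)
```

With these inputs the sketch yields the eleven Aedes names of the example: five orthologues, XYZ4A and XYZ4B, and XYZ10 to XYZ13.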
Insect Species Tree: view tree
ClustalW - Multiple Sequence Alignments: ClustalW website
Muscle - Multiple Sequence Alignments: Muscle website
HMMER - Aligning Domains To An HMM Profile: HMMER website
Gblocks - Extracting Conserved Cores: Gblocks website
PhyML - Building Maximum Likelihood Trees: PhyML website
Download TreeView To View And Manipulate Trees: TreeView website
Download NJplot To View And Manipulate Trees: NJplot website
Download ClustalX To View And Manipulate Alignments: ClustalX website
Download BioEdit To View And Manipulate Alignments: BioEdit website