Help
Simple Bank Query
This form makes direct queries to BactPepDB. Different parameters can be combined to focus the search:
- Source organism: the sequenced genome (organism) the peptides are predicted from. It can be used to restrict the search to a particular Genus, Species or organism. Clicking on the Genus list makes the list of Species available for this Genus appear. Clicking on the Species / Subspecies list then makes the list of available strains appear, so that the search can be restricted to a particular Strain / Substrain. Peptides predicted from plasmids can be included or excluded by selecting the corresponding option in the Include plasmids menu.
- Sequence features:
- Sequence length: restricts the search to a length range (in amino acids). Peptide length in BactPepDB is presently limited to between 10 and 80 amino acids.
- Regular expression: can be used to match a particular type of sequence motif. Be warned, however, that a Bank query performed with a regular expression takes considerably longer to process. More help about regular expressions can be found further down.
- Signal peptide sequence: requires the peptide to contain a signal peptide sequence according to the prediction performed by SignalP. Be warned, however, that the prediction is systematically performed for both Gram+ and Gram-.
- Transmembrane segment: requires the peptide to contain one or more transmembrane segments according to the prediction performed by TMHMM.
- Homologous structure available in the PDB: will limit the peptides returned to those having a match in the PDB according to BLASTp. It is also possible to select the e-value threshold to define the significance of the similarity to PDB files.
- Conserved among multiple species of a genus: will limit the peptides returned to those conserved across species of the same genus. Conservation is calculated using BLASTp among the peptides of a genus, using 50% sequence identity and 50% coverage values. Be warned, however, that this option is only relevant for genera containing more than one species.
- Peptide status: makes it possible to restrict the search to peptides already annotated in GenBank or to newly predicted peptides.
- Predicted peptide features: these features are only accessible when the Predicted peptides only box is checked. They make it possible to specify the location of the predicted gene (coding or intergenic region) and, for peptides in intergenic regions, whether or not it is a pseudogene (i.e. a gene in an intergenic region that is homologous to larger genes).
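As a rough illustration of how several of these criteria narrow a result set when combined, the Python sketch below applies a length range, an optional motif and two optional feature requirements to a few made-up records. The `Peptide` class, its field names and the assumption that criteria are combined with a logical AND are illustrative only; they are not taken from the BactPepDB implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class Peptide:
    # Hypothetical record layout, for illustration only.
    sequence: str
    has_signal_peptide: bool
    n_tm_segments: int


def passes_filters(p, min_len=10, max_len=80, motif=None,
                   need_signal=False, need_tm=False):
    """Return True when the peptide satisfies every requested criterion."""
    if not (min_len <= len(p.sequence) <= max_len):
        return False
    if motif is not None and not re.search(motif, p.sequence):
        return False
    if need_signal and not p.has_signal_peptide:
        return False
    if need_tm and p.n_tm_segments == 0:
        return False
    return True


# Made-up example records.
peptides = [
    Peptide("MKCAAAAACLLVAG", True, 0),
    Peptide("MKCAAC", False, 1),               # shorter than 10 residues
    Peptide("MLLVILILMVGTVVSVLLV", False, 1),
]

selected = [p.sequence for p in peptides if passes_filters(p, need_tm=True)]
print(selected)
```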
Regular expressions
BactPepDB allows the following regular expression metacharacters:
. | match any character |
? | match zero or one |
* | match zero or more |
+ | match one or more |
{n} | match n times |
{m,n} | match m through n times |
{n,} | match n or more times |
^ | beginning of line |
$ | end of line |
[abc] | match one of enclosed chars |
[^xyz] | match any char not enclosed |
For example:
C.{5}C
will return all peptides which contain two cysteines separated by 5 residues.
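The metacharacters listed above behave the same way in most regular expression engines, so a pattern can be tried locally before it is submitted. The sketch below uses Python's `re` module as a stand-in; the engine actually used by BactPepDB is not stated here and may differ in details, and the sequences are made up for illustration.

```python
import re

# Made-up peptide sequences, for illustration only.
peptides = {
    "pep1": "MKCAAAAACLL",   # two cysteines separated by five residues
    "pep2": "MKCAACLL",      # only two residues between the cysteines
    "pep3": "MCGGGGGGGCKK",  # seven residues between the cysteines
}

# The pattern from the example above.
motif = re.compile(r"C.{5}C")

# search() looks for the motif anywhere in the sequence.
with_motif = [name for name, seq in peptides.items() if motif.search(seq)]
print(with_motif)

# ^ and $ anchor the pattern to the start and end of the sequence:
# starts with M and ends with two basic residues (K or R).
anchored = re.compile(r"^M.*[KR]{2}$")
basic_tail = [name for name, seq in peptides.items() if anchored.search(seq)]
print(basic_tail)
```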
BLASTp Search
A simple BLASTp search against the content of the database is implemented. The parameters described above can be combined to focus the search.
Text area
Query sequence(s) to be used should be pasted in the 'Search' text area. Accepted formats are FASTA and bare sequence. It is possible to specify several sequences, in which case the FASTA format is mandatory.
1. FASTA
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. An example sequence in FASTA format is:
>gi|383782783|ref|YP_005467350.1| putative CsrA-like regulator [Actinoplanes missouriensis 431]
MLVLTRRAGESVMIGDDVVITVLEARGDVIRLGIQAPRDVQVHREEVYRELQDANREAASPTEDAVHALT
RMLEKSDPDE
Blank lines are not allowed in the middle of FASTA input. The accepted amino acid codes are:
A | alanine | P | proline |
B | aspartate/asparagine | Q | glutamine |
C | cysteine | R | arginine |
D | aspartate | S | serine |
E | glutamate | T | threonine |
F | phenylalanine | U | selenocysteine |
G | glycine | V | valine |
H | histidine | W | tryptophan |
I | isoleucine | Y | tyrosine |
K | lysine | Z | glutamate/glutamine |
L | leucine | X | any |
M | methionine | * | translation stop |
N | asparagine | - | gap of indeterminate length |
For a query with multiple sequences, the FASTA format is mandatory:
>sequence#1
LSKPLVRHAAHVENNEGHLRQQVTFNNGREPN
>sequence#2
MAVNGNNCHEAQSMPTNSDEVYIKKYIRRRISYLQT
>sequence#3
MKRSILLKDTHFHDFPFILPVIIFICRINTAFPSSST
Note that the * at the end of the sequences is not mandatory.
2. Bare sequence
This may be just lines of sequence data, without the FASTA definition line, e.g.:
MLVLTRRAGESVMIGDDVVITVLEARGDVIRLGIQAPRDVQVHREEVYRELQDANREAASPTEDAVHALT
RMLEKSDPDE
Blank lines are not allowed in the middle of bare sequence input.
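The input rules described in this section (FASTA or bare sequence, no blank lines in the middle, a fixed set of accepted codes) can be checked before pasting a query. The following Python sketch is one possible pre-check, not the validation performed by the server; the function name `parse_query` is hypothetical.

```python
# Accepted one-letter codes, taken from the table in this section.
VALID_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def parse_query(text):
    """Return a list of (description, sequence) pairs from FASTA or bare input."""
    lines = text.strip().splitlines()
    if any(not line.strip() for line in lines):
        raise ValueError("blank lines are not allowed in the middle of the input")

    records = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A defline starts a new record.
            records.append([line[1:], ""])
        else:
            if not records:
                # Bare sequence: no defline, so use an empty description.
                records.append(["", ""])
            records[-1][1] += line.upper()

    for description, sequence in records:
        unknown = set(sequence) - VALID_CODES
        if unknown:
            raise ValueError(f"unexpected characters: {sorted(unknown)}")
    return [(description, sequence) for description, sequence in records]


fasta = ">sequence#1\nLSKPLVRHAAHVENNEGHLRQQVTFNNGREPN\n>sequence#2\nMAVNGNNCHEAQSMPTNSDEVYIKKYIRRRISYLQT\n"
for description, sequence in parse_query(fasta):
    print(description, len(sequence))
```

Sequence lines belonging to the same record are concatenated, so a sequence wrapped over several lines is read as a single peptide.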
E-value
This setting specifies the statistical significance threshold for reporting matches against database sequences. A value of 10 means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.
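To make the meaning of the threshold more concrete, the Python sketch below uses the standard Karlin-Altschul relation E = K * m * n * exp(-lambda * S) and the Poisson estimate P = 1 - exp(-E) for the chance of at least one random match. The values given for K, lambda, the score and the database size are placeholders chosen for illustration, not the values used by BactPepDB, and the hit list is made up.

```python
import math


def expect_value(score, query_len, db_len, k, lam):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


def prob_at_least_one(e_value):
    """Chance of one or more random matches when e_value are expected (Poisson)."""
    return 1.0 - math.exp(-e_value)


def reported(hits, threshold):
    """Keep hits whose E-value does not exceed the threshold."""
    return [hit for hit in hits if hit[1] <= threshold]


# Placeholder parameters, for illustration only.
print(expect_value(score=40, query_len=63, db_len=1_000_000, k=0.04, lam=0.27))

# Made-up hits as (name, E-value) pairs.
hits = [("hitA", 1e-12), ("hitB", 0.005), ("hitC", 3.2)]
print(reported(hits, threshold=0.01))
print(prob_at_least_one(10), prob_at_least_one(0.01))
```

With a threshold of 10, a random match is almost certain to be among the reported hits, whereas with 0.01 the chance drops to about 1%.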
Search results
After running a search, the user is provided with a list of entries corresponding to the parameters used for this query.
Parameters used
A reminder of the different parameters used for the query is provided to keep track of the searches that were performed.
Search summary
A brief summary of the number of hits matching the different criteria is provided. The complete result set can also be downloaded as a CSV file. This file contains the numerical information (number of transmembrane segments, number of disulfide bonds, D-score for SignalP, etc.) that is not shown in the search results tab for the sake of clarity.
Search results
A list of the hits with the different values is returned. This list can be sorted interactively using each criterion as sort key. The different columns are the following:
- Peptide id: The BactPepDB database unique accession number (e.g. BPDB:0000001).
- Organism: The bacterial genome from which the gene was extracted.
- Molecule: The DNA molecule from which the gene was extracted. Can be a chromosome or a plasmid (the plasmid id will be shown).
- Sequence: The amino acid sequence of the peptide.
- Length (L): The peptide sequence length, in amino acids.
- Reference: A link to the GenBank entry, if available.
- Signal Peptide prediction (SP): Possible presence of a signal peptide sequence, as predicted by SignalP.
- Transmembrane Segments prediction (TM): Possible presence of (a) transmembrane segment(s), as predicted by TMHMM.
- Disulfide bonds prediction (SS): Possible presence of (a) disulfide bond(s), as predicted by DIpro.
- PDB homologs (PDB): Existence of Protein Data Bank entries homologous to the peptide according to BLASTp.
- Conservation (C): Peptide conservation among the different species of a genus. One icon marks peptides conserved in 2 species (weakly conserved), another marks peptides conserved in at least 3 species (strongly conserved).
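The two conservation levels of the Conservation (C) column can be expressed as a simple rule on the number of species in which a peptide is conserved. The Python sketch below is only an illustration of that rule; the function name and the returned labels are hypothetical, and whether the count includes the source species is an assumption not settled by this help page.

```python
def conservation_level(n_species):
    """Map a species count to the conservation level described above."""
    if n_species >= 3:
        return "strongly conserved"
    if n_species == 2:
        return "weakly conserved"
    return "not conserved"


for count in (1, 2, 3, 7):
    print(count, conservation_level(count))
```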
CSV file
The downloadable CSV file contains the following information:
- Peptide id: The BactPepDB database unique accession number (e.g. BPDB:0000001).
- Organism: The bacterial genome from which the gene was extracted.
- Reference: The GenBank entry reference, if available.
- Intergenic: Location of the predicted gene (true: intergenic region, false: coding region).
- Pseudogene: If the predicted gene was detected in an intergenic region, determines if it is a pseudogene or not.
- SignalP(gram+): Possible presence of a signal peptide sequence for Gram+ bacteria, as predicted by SignalP (yes / no).
- ypos: Cleavage position (for Gram+ bacteria). Further information can be found in the SignalP output documentation.
- delta: D-score for SignalP (for Gram+ bacteria). Further information can be found in the SignalP output documentation.
- cutoff: Cut-off value retained for D-score (for Gram+ bacteria). Further information can be found in the SignalP output documentation.
- SignalP(gram-): Possible presence of a signal peptide sequence for Gram- bacteria, as predicted by SignalP (yes / no).
- ypos: Cleavage position (for Gram- bacteria). Further information can be found in the SignalP output documentation.
- delta: D-score for SignalP (for Gram- bacteria). Further information can be found in the SignalP output documentation.
- cutoff: Cut-off value retained for D-score (for Gram- bacteria). Further information can be found in the SignalP output documentation.
- #TM: Number of transmembrane segments, according to TMHMM.
- #SSBonds: Number of disulfide bonds, according to DIpro.
- Conservation/#Species: Number of species the peptide is conserved in / number of species in the genus.
- BestPDBHit: Protein Data Bank structure homologous to the peptide sequence with the lowest (most significant) e-value.
- e-value: e-value for the best Protein Data Bank homolog.
- Length: Size of peptide sequence, in amino acids.
- Sequence: Peptide sequence, in amino acids.
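Because the ypos, delta and cutoff columns appear twice (once for Gram+ and once for Gram-), reading the file into a dictionary keyed by column name would silently drop one set. The Python sketch below shows one way to handle this by prefixing the repeated names with the preceding SignalP column. The comma delimiter, the exact header spelling and the two data rows are assumptions made for illustration and should be checked against a real download.

```python
import csv
import io

# Assumed header (names taken from the list above) and made-up rows.
SAMPLE = """Peptide id,Organism,Reference,Intergenic,Pseudogene,SignalP(gram+),ypos,delta,cutoff,SignalP(gram-),ypos,delta,cutoff,#TM,#SSBonds,Conservation/#Species,BestPDBHit,e-value,Length,Sequence
BPDB:0000001,Example organism A,,true,false,no,0,0.10,0.45,yes,21,0.62,0.57,1,0,2/5,,,12,MKKLLAVAGLLA
BPDB:0000002,Example organism B,,false,false,no,0,0.08,0.45,no,0,0.12,0.57,0,0,1/5,,,12,MSTNPKPQRKTK
"""


def read_results(handle):
    """Read the CSV into dictionaries with unambiguous column names."""
    reader = csv.reader(handle)
    header = next(reader)
    names, prefix = [], ""
    for column in header:
        column = column.strip()
        if column.startswith("SignalP"):
            prefix = column  # remember which SignalP block we are in
        if column in ("ypos", "delta", "cutoff"):
            column = f"{prefix}.{column}"
        names.append(column)
    return [dict(zip(names, row)) for row in reader]


rows = read_results(io.StringIO(SAMPLE))

# Example: Gram- signal peptide predicted and at least one TM segment.
selected = [row["Peptide id"] for row in rows
            if row["SignalP(gram-)"] == "yes" and int(row["#TM"]) >= 1]
print(selected)
```

For a downloaded file, `io.StringIO(SAMPLE)` would be replaced by an open file handle, for example `open(path, newline="")`.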
External Tools
Links to external resources that predict peptide bioactivity are provided. The Download sequences button can be used to download sequences in order to paste them into one of these external tools: PeptideRanker, AntiBP2 and CAMP.
BactPepDB entry content
Each entry contains several pieces of information:
Peptide summary
Peptide id
The BactPepDB database unique accession number (e.g. BPDB:0000001).
Status
Already annotated in GenBank or new predicted peptide.
Product
GenBank annotation.
Sequence
The amino acid sequence of the peptide.
Length
The peptide sequence length, in amino acids.
Homologous PDB
The Protein Data Bank entries homologous to the peptide according to BLASTp.
Identical to
Other BactPepDB entries having sequences identical to this peptide.
Similar to
Other BactPepDB entries in the same genus having sequences similar to this peptide.
Gene information
Source organism
The bacterial genome from which the gene was extracted.
Strand
Indicates if the feature is located on the plus strand or minus strand.
CDS
Coding sequence; region of nucleotides that corresponds with the sequence of amino acids in a protein (location includes start and stop codons).
RBS
Ribosome binding site location.
Miscellaneous
Other information about gene location.
DNA Sequence
The nucleotide coding sequence.
Predicted features
Transmembrane segments
Possible transmembrane segments, as predicted by TMHMM.
Signal peptide (gram+)
Possible signal peptide sequence, as predicted by SignalP for Gram+ bacteria. D-score represents the discrimination score. Cutoff for D-score is also provided.
Signal peptide (gram-)
Possible signal peptide sequence, as predicted by SignalP for Gram- bacteria.
Disulfide bonds
Possible disulfide bonds, as predicted by DIpro.
Predicted secondary structure
The secondary structure, as predicted by PsiPred, overlaid on the sequence. The following code is used:
Predicted local structure profile
A local structure or Structural Alphabet (SA) predicted profile. SA can be seen as a generalized secondary structure. The x axis of the profile corresponds to the amino acid sequence. The y axis corresponds to the predicted probabilities of each of the 27 local conformations of the structural alphabet. The conformations are sorted from the most helical (red - bottom) to the most extended (beta strands - green - top). Blue conformations correspond to coil. Each column corresponds to a fragment of 4 amino acids. The larger the sum of the probabilities in red (resp. green, blue), the more helical (resp. extended, coiled) the local conformation of the peptide is expected to be. In the example below, the region corresponding to the P P I sequence is predicted to be in extended conformation.
Note: each SA letter spans 4 residues, so the length of the profile equals the length of the amino acid sequence minus 3. For instance, the first bar of the profile shown above corresponds to the amino acid sequence GRCT, the second to RCTK, and so on.
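The relation between sequence length and profile length follows from sliding a window of 4 residues along the sequence one position at a time. The short Python sketch below lists these overlapping fragments; the function name is hypothetical and the five-residue input is just the start of the example quoted in the note.

```python
def sa_fragments(sequence, width=4):
    """Return the overlapping fragments covered by successive SA letters."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


fragments = sa_fragments("GRCTK")
print(fragments)

# A sequence of length L yields L - 3 fragments of 4 residues.
print(len(fragments) == len("GRCTK") - 3)
```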
Homologous sequences
List of homologous sequences found within the same genus.
Additional tool(s)
BLASTp search
A BLAST of the peptide sequence can be performed against the whole database.
Tutorials
These short tutorials will show you how to browse the database.
Tutorial #1
Filling the search form
First, open the Search page and check the Simple bank query radio button.
In the Source organism section, select Rhizobium in the Genus menu. This will populate the Species / Subspecies submenu with all the species available for this genus. Then, select etli in the Species / Subspecies menu. This will populate the Strain / Substrain menu with all the strains available for this species. Finally, select CFN 42 in the Strain / Substrain menu.
In the Peptide status section, check the Predicted peptides only radio button to get only peptides predicted by BactGeneSHOW.
In the Predicted peptide features section, check the Intergenic regions radio button. Then check the Non-pseudogene radio button.
In the Sequence features section, enter 30 in the Minimum Length field and 50 in the Maximum Length field to get all peptides between 30 and 50 residues long.
Leave the Other features section as default, then click Search. Please wait and do not hit the back button while the search is performed; this can take a while.
Browsing the results
You are provided with a list of entries corresponding to the parameters entered previously. Check that the parameters you entered are correct in the search summary section.
Click on the Show # entries menu and set it to 100 to show all results at once.
Click on the L column header until peptides are sorted by size from the smallest (30) to the largest (50).
Now it is time to learn how to use filters. Scroll to the bottom of the page and check the 2nd and the last (5th) checkboxes on the right to show only cross-species conserved peptides (C) for which a transmembrane segment (TM) was detected. This should narrow your search to only 13 peptides. Now click on the second menu under the Molecule column and select "symbiotic plasmid p42d", which should narrow the list to only one peptide.
Click on the BPDB:0051461 entry or on the MFVRVMQDAVASRFLLVILILMVGTVVSVLLVRPSDDSRTGRVERSL sequence to access the peptide entry.
Peptide entry content
You get access to all information gathered about this peptide (see BactPepDB entry content for help).
Scrolling down to the Homologous sequences found in genus Rhizobium section will show you all peptides of the Rhizobium genus that present 50% or more similarity to the peptide currently browsed. Click on the Id% column header until peptides are sorted by similarity from highest to lowest. Currently BPDB:0051461 has 2 homologous sequences belonging to 2 different strains. Clicking on the id or sequence of one of these peptides will bring you to its entry.
Tutorial #2
Using the BLASTp search
Open the Search page and click on the Reset this form button to reset all parameters to their default values. Then, check the BLASTp Search radio button. In the Enter query sequence(s) textbox, copy/paste the following sequence in FASTA format:
>gi|507519673|ref|YP_008041287.1| hypothetical protein AHML_00775 [Aeromonas hydrophila ML09-119]
MMPHLIEINSSLLFDEYLQSLGVPQTQLDQEQDIYLQERHLAAVRQIQGELKFYLRASALTRQ
Set the e-value to 0.01. Leave the other sections as default and click Search.
Browsing the BLASTp results
You are provided with a list of BLAST hits with their corresponding alignments. If you scroll down to the bottom of the page you will see the seqLogo image which shows you the weight of each letter at each position of the multiple alignment. You can download this multiple alignment by clicking the Download alignment button above.
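To explore a downloaded alignment outside the browser, the per-position letter composition can be recomputed with a few lines of Python. The sketch below only counts residue frequencies in each column and ignores gaps; the exact weighting used to draw the seqLogo image is not described here and may differ (sequence logos commonly scale letters by information content). The function name and the toy alignment are hypothetical.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return, for each alignment column, the frequency of each residue."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]  # ignore gap characters
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Toy alignment of three sequences, for illustration only.
toy = ["MKV", "MRV", "M-V"]
for position, freqs in enumerate(column_frequencies(toy), start=1):
    print(position, freqs)
```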