BactPepDB

Site Navigation

The Home page contains:
- General information about the data and tools used to create BactPepDB.
- General statistics about BactPepDB.
- References and contact information.
The Search page offers two ways to query BactPepDB: the BLASTp search identifies BactPepDB sequences similar to a query sequence, and the Simple Bank query form queries BactPepDB directly using multiple criteria.
The Help page contains detailed explanations of the information found in BactPepDB and of how to use the search engine.

Simple Bank Query

This form makes direct queries to BactPepDB. Different parameters can be combined to focus the search:

Source organism: the sequenced genome (organism) from which the peptides are predicted. It can be used to restrict the search to a particular Genus, Species or organism. Clicking on the Genus list displays the list of Species available for this Genus. Clicking on the Species / Subspecies list then displays the list of available strains, making it possible to restrict the search to a particular Strain / Substrain. Peptides predicted from plasmids can be included or excluded by selecting the appropriate option in the Include plasmids menu.
Sequence features:
- Sequence length: restricts the search to a length range (in amino acids). Peptide size in BactPepDB is presently limited to between 10 and 80 amino acids.
- Regular expression: can be used to match a particular type of sequence motif. Be warned, however, that a Bank query performed with a regular expression takes considerably more time to process. More help about regular expressions can be found further down.
- Signal peptide sequence: requires the peptide to contain a signal peptide sequence according to the prediction performed by SignalP. Be warned, however, that the prediction is systematically performed for both Gram+ and Gram-.
- Transmembrane segment: requires the peptide to contain transmembrane segment(s) according to the prediction performed by TMHMM.
- Homologous structure available in the PDB: limits the peptides returned to those having a match in the PDB according to BLASTp. It is also possible to select the e-value threshold that defines the significance of the similarity to PDB entries.
- Conserved among multiple species of a genus: limits the peptides returned to those conserved across species of the same genus. Conservation is calculated using BLASTp among the peptides of a genus, using thresholds of 50% sequence identity and 50% coverage. Be warned, however, that this option is only relevant for genera containing more than one species.
Peptide status: makes it possible to restrict the search to peptides already annotated in GenBank or to newly predicted peptides.
Predicted peptide features: these features are only accessible when the Predicted peptides only box is checked. They make it possible to specify the location of the predicted gene (coding or intergenic region) and, for peptides in intergenic regions, whether or not it is a pseudogene (i.e. a gene in an intergenic region homologous to larger genes).

Regular expressions

BactPepDB allows the following regular expression metacharacters:

.	match any character
?	match zero or one
*	match zero or more
+	match one or more
{n}	match n times
{m,n}	match m through n times
{n,}	match n or more times
^	beginning of line
$	end of line
[abc]	match one of enclosed chars
[^xyz]	match any char not enclosed

For example:

C.{5}C

will return all peptides which contain two cysteines separated by 5 residues.
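
As a rough offline illustration, the same pattern can be tried with Python's re module. This is only a sketch: the regular expression engine used by the BactPepDB server is not stated in this help page and may differ in details from Python's, and the peptide sequences and the helper name matching_ids are made up for the example.

```python
import re

# Pattern from the example above: two cysteines separated by exactly 5 residues.
MOTIF = re.compile(r"C.{5}C")

# Made-up peptide sequences, for illustration only.
peptides = {
    "pep1": "MKCAAAAACLL",  # C, five residues, C: matches
    "pep2": "MKCAAACLL",    # only three residues between the cysteines: no match
    "pep3": "MKLLVAG",      # no cysteine at all: no match
}


def matching_ids(seqs, pattern):
    """Return the ids of the sequences that contain the pattern anywhere."""
    return [pid for pid, seq in seqs.items() if pattern.search(seq)]


print(matching_ids(peptides, MOTIF))
```

Anchoring the pattern with ^ or $ would restrict the match to the start or the end of the peptide.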

BLASTp Search

A simple BLASTp search against the content of the database is implemented. The parameters described above can be combined to focus the search.

Text area

Query sequence(s) to be used should be pasted in the 'Search' text area. Accepted formats are FASTA and bare sequence. It is possible to specify several sequences, in which case the FASTA format is mandatory.

1. FASTA

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. An example sequence in FASTA format is:

>gi|383782783|ref|YP_005467350.1| putative CsrA-like regulator [Actinoplanes missouriensis 431]
MLVLTRRAGESVMIGDDVVITVLEARGDVIRLGIQAPRDVQVHREEVYRELQDANREAASPTEDAVHALT
RMLEKSDPDE

Blank lines are not allowed in the middle of FASTA input. The accepted amino acid codes are:

A  alanine               P  proline       
B  aspartate/asparagine  Q  glutamine      
C  cysteine              R  arginine
D  aspartate             S  serine      
E  glutamate             T  threonine      
F  phenylalanine         U  selenocysteine      
G  glycine               V  valine        
H  histidine             W  tryptophan        
I  isoleucine            Y  tyrosine
K  lysine                Z  glutamate/glutamine
L  leucine               X  any
M  methionine            *  translation stop
N  asparagine            -  gap of indeterminate length

For queries with multiple sequences, the FASTA format is mandatory:

>sequence#1
LSKPLVRHAAHVENNEGHLRQQVTFNNGREPN
>sequence#2
MAVNGNNCHEAQSMPTNSDEVYIKKYIRRRISYLQT
>sequence#3
MKRSILLKDTHFHDFPFILPVIIFICRINTAFPSSST

Note that the * at the end of the sequences is not mandatory.

2. Bare sequence

This may be just lines of sequence data, without the FASTA definition line, e.g.:

MLVLTRRAGESVMIGDDVVITVLEARGDVIRLGIQAPRDVQVHREEVYRELQDANREAASPTEDAVHALT
RMLEKSDPDE

Blank lines are not allowed in the middle of bare sequence input.
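
Before pasting a query, it can be checked locally against the rules above. The following is a minimal sketch, not the validation performed by the server: the function name parse_query is hypothetical, and converting lowercase input to uppercase is an assumption.

```python
# Accepted residue codes, taken from the table above.
ACCEPTED = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def parse_query(text):
    """Parse pasted query text given as FASTA or as a bare sequence.

    Returns a list of (description, sequence) pairs. Raises ValueError on
    blank lines inside the input or on characters outside ACCEPTED.
    """
    records = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            raise ValueError("blank lines are not allowed inside the input")
        if line.startswith(">"):
            # A defline starts a new record.
            records.append([line[1:].strip(), ""])
            continue
        if not records:
            # No defline seen so far: treat the input as a bare sequence.
            records.append(["", ""])
        residues = line.upper()  # assumption: lowercase input is tolerated
        bad = set(residues) - ACCEPTED
        if bad:
            raise ValueError(f"unexpected characters: {sorted(bad)}")
        records[-1][1] += residues
    return [(desc, seq) for desc, seq in records]


query = ">sequence#1\nLSKPLVRHAAHVENNEGHLRQQVTFNNGREPN\n>sequence#2\nMAVNGNNCHEAQSMPTNSDEVYIKKYIRRRISYLQT"
for description, sequence in parse_query(query):
    print(description, len(sequence))
```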

E-value

This setting specifies the statistical significance threshold for reporting matches against database sequences. A value of 10 means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.
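
The effect of the threshold can be sketched as a simple filter. The hit records and the function name filter_hits below are made up for illustration; they do not reflect the internal format used by BactPepDB.

```python
def filter_hits(hits, threshold=10.0):
    """Keep the hits whose e-value does not exceed the threshold, best first."""
    kept = [hit for hit in hits if hit["evalue"] <= threshold]
    return sorted(kept, key=lambda hit: hit["evalue"])


# Made-up hits, for illustration only.
hits = [
    {"id": "hitB", "evalue": 0.5},
    {"id": "hitC", "evalue": 25.0},
    {"id": "hitA", "evalue": 1e-12},
]

print([hit["id"] for hit in filter_hits(hits, threshold=10.0)])
print([hit["id"] for hit in filter_hits(hits, threshold=0.01)])
```

Lowering the threshold from 10 to 0.01 drops the weaker hit, which is the sense in which lower thresholds are more stringent.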

Search results

After running a search, the user is provided with a list of entries corresponding to the parameters used for this query.

Parameters used

A reminder of the parameters used for the query is provided to keep track of the searches performed.

Search summary

A brief summary of the number of hits matching the different criteria is provided. The user can also download the complete result set as a CSV file. This file contains the numerical information (number of transmembrane segments, number of disulfide bonds, D-score for SignalP, etc.) that is not shown in the search results tab for the sake of clarity.

Search results

A list of the hits with the different values is returned. This list can be sorted interactively using each criterion as sort key. The different columns are the following:

Peptide id: The BactPepDB database unique accession number (e.g. BPDB:0000001).
Organism: The bacterial genome from which the gene was extracted.
Molecule: The molecule of DNA from which the gene was extracted. Can be chromosomal or plasmidic (plasmid id will be shown).
Sequence: The amino acid sequence of the peptide.
Length (L): The peptide sequence length, in amino acids.
Reference: A link to the GenBank entry, if available.
Signal Peptide prediction (SP): Possible presence of a signal peptide sequence, as predicted by SignalP.
Transmembrane Segments prediction (TM): Possible presence of (a) transmembrane segment(s), as predicted by TMHMM.
Disulfide bonds prediction (SS): Possible presence of (a) disulfide bond(s), as predicted by DIpro.
PDB homologs (PDB): Existence of Protein Data Bank entries homologous to the peptide according to BLASTp.
Conservation (C): Peptide conservation among the different species of a genus. Two icons are used: one for peptides conserved in 2 species (weakly conserved) and one for peptides conserved in at least 3 species (strongly conserved).

CSV file

The downloadable CSV file contains the following information:

Peptide id: The BactPepDB database unique accession number (e.g. BPDB:0000001).
Organism: The bacterial genome from which the gene was extracted.
Reference: The GenBank entry reference, if available.
Intergenic: Location of the predicted gene (true: intergenic region, false: coding region).
Pseudogene: If the predicted gene was detected in an intergenic region, determines if it is a pseudogene or not.
SignalP(gram+): Possible presence of a signal peptide sequence for Gram+ bacteria, as predicted by SignalP (yes / no).
ypos: Cleavage position (for Gram+ bacteria). Further information about SignalP output is available in the SignalP documentation.
delta: D-score for SignalP (for Gram+ bacteria). Further information about SignalP output is available in the SignalP documentation.
cutoff: Cut-off value retained for D-score (for Gram+ bacteria). Further information about SignalP output is available in the SignalP documentation.
SignalP(gram-): Possible presence of a signal peptide sequence for Gram- bacteria, as predicted by SignalP (yes / no).
ypos: Cleavage position (for Gram- bacteria). Further information about SignalP output is available in the SignalP documentation.
delta: D-score for SignalP (for Gram- bacteria). Further information about SignalP output is available in the SignalP documentation.
cutoff: Cut-off value retained for D-score (for Gram- bacteria). Further information about SignalP output is available in the SignalP documentation.
#TM: Number of transmembrane segments, according to TMHMM.
#SSBonds: Number of disulfide bonds, according to DIpro.
Conservation/#Species: Number of species the peptide is conserved in / number of species in the genus.
BestPDBHit: Protein Data Bank structure homologous to the peptide sequence with the lowest (most significant) e-value.
e-value: e-value for the best Protein Data Bank homolog.
Length: Size of peptide sequence, in amino acids.
Sequence: Peptide sequence, in amino acids.
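
The downloaded file can be post-processed with a short script. The sketch below makes several assumptions that should be checked against a real download: a comma delimiter, a header row spelled as in the list above, and the function name read_rows, which is hypothetical. The sample rows use a reduced set of columns and made-up values.

```python
import csv
import io


def read_rows(handle, delimiter=","):
    """Read a result file into a list of dicts, keeping repeated column names.

    ypos, delta and cutoff appear twice (Gram+ then Gram-), so the second
    occurrence of a name is stored as "name_2" instead of overwriting the first.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    seen = {}
    names = []
    for name in header:
        name = name.strip()
        seen[name] = seen.get(name, 0) + 1
        names.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return [dict(zip(names, row)) for row in reader]


# Made-up sample with a reduced set of columns, for illustration only.
SAMPLE = (
    "Peptide id,SignalP(gram+),ypos,delta,cutoff,"
    "SignalP(gram-),ypos,delta,cutoff,#TM,Length\n"
    "BPDB:0000001,no,0,0.10,0.45,no,0,0.12,0.57,1,46\n"
    "BPDB:0000002,yes,25,0.70,0.45,yes,24,0.66,0.57,0,38\n"
)

rows = read_rows(io.StringIO(SAMPLE))
# Peptides with at least one predicted transmembrane segment.
with_tm = [row["Peptide id"] for row in rows if int(row["#TM"]) > 0]
print(with_tm)
```

For a real file, io.StringIO(SAMPLE) would be replaced by an open file handle, for example open(path, newline="").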

External Tools

Links to external resources that predict peptide bioactivity are provided. The Download sequences button can be used to download sequences so that they can be pasted into one of these external tools: PeptideRanker, AntiBP2 and CAMP.

BactPepDB entry content

Each entry contains several pieces of information:

Peptide summary

Peptide id

The BactPepDB database unique accession number (e.g. BPDB:0000001).

Status

Already annotated in GenBank or new predicted peptide.

Product

GenBank annotation.

Sequence

The amino acid sequence of the peptide.

Length

The peptide sequence length, in amino acids.

Homologous PDB

The Protein Data Bank entries homologous to the peptide according to BLASTp.

Identical to

Other BactPepDB entries having sequences identical to this peptide.

Similar to

Other BactPepDB entries in the same genus having sequences similar to this peptide.

Gene information

Source organism

The bacterial genome from which the gene was extracted.

Strand

Indicates if the feature is located on the plus strand or minus strand.

CDS

Coding sequence: the region of nucleotides that corresponds to the sequence of amino acids in a protein (the location includes start and stop codons).

RBS

Ribosome binding site location.

Miscellaneous

Other information about gene location.

DNA Sequence

The nucleotide coding sequence.
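
To check that a DNA Sequence corresponds to the peptide Sequence of an entry, the coding sequence can be translated locally. This is a sketch under stated assumptions, not the procedure used to build BactPepDB: it assumes the sequence is given 5' to 3' on the coding strand, includes the start and stop codons, and that the alternative bacterial start codons GTG and TTG are read as methionine. The function name translate_cds and the input sequences are made up.

```python
from itertools import product

# Standard codon table built in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate_cds(dna):
    """Translate a coding sequence, dropping the terminal stop codon."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    peptide = [CODON_TABLE[codon] for codon in codons]
    # Assumption: alternative start codons are read as methionine.
    if codons and codons[0] in ("GTG", "TTG"):
        peptide[0] = "M"
    # The stop codon is part of the CDS location but not of the peptide.
    if peptide and peptide[-1] == "*":
        peptide.pop()
    return "".join(peptide)


print(translate_cds("ATGAAATAA"))
```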

Predicted features

Transmembrane segments

Possible transmembrane segments, as predicted by TMHMM.

Signal peptide (gram+)

Possible signal peptide sequence, as predicted by SignalP for Gram+ bacteria. D-score represents the discrimination score. Cutoff for D-score is also provided.

Signal peptide (gram-)

Possible signal peptide sequence, as predicted by SignalP for Gram- bacteria.

Disulfide bonds

Possible disulfide bonds, as predicted by DIpro.

Predicted secondary structure

The secondary structure, as predicted by PsiPred (reference), overlaid on the sequence. The following code is used:

helix

strand

random coil

Predicted local structure profile

A local structure or Structural Alphabet (SA) predicted profile. SA can be seen as a generalized secondary structure. The x axis of the profile corresponds to the amino acid sequence. The y axis corresponds to the predicted probabilities of each of the 27 local conformations of the structural alphabet. The conformations are sorted from the most helical (red, bottom) to the most extended (beta strands, green, top). Blue conformations correspond to coil. Each column corresponds to a fragment of 4 amino acids. The larger the sum of the probabilities in red (resp. green, blue), the more helical (resp. extended, coiled) the local conformation of the peptide is expected to be. In the example below, the region corresponding to the P P I sequence is predicted to be in extended conformation.

Note: SA letters are 4 residues long, so the length of the profile is the length of the amino acid sequence minus 3. For instance, the first bar of the profile shown above corresponds to the amino acid sequence GRCT, the second to RCTK, etc.
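
The correspondence between profile columns and sequence fragments can be sketched as a sliding window of width 4. The function name sa_fragments is hypothetical, and only the first five residues of the example peptide (GRCTK) are taken from the note above.

```python
def sa_fragments(sequence, width=4):
    """Return the overlapping fragments covered by each column of the profile."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


# First residues of the example peptide mentioned in the note above.
print(sa_fragments("GRCTK"))

# A peptide of length n yields n - 3 columns.
print(len(sa_fragments("A" * 46)))
```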

Homologous sequences

List of intra-genus homologous sequences

A precomputed BLASTp search among all peptides of a genus is available. This search was performed using thresholds of 50% for both sequence identity and coverage.
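
The two thresholds can be sketched as a simple predicate on a pairwise alignment. This is an illustration only: the exact definitions used by BactPepDB are not given here, so identity is assumed to be the fraction of identical positions in the alignment, coverage is assumed to be the aligned length relative to the query length, and values equal to 50% are assumed to pass. The function name is_homolog is hypothetical.

```python
def is_homolog(identical, aligned_len, query_len,
               min_identity=50.0, min_coverage=50.0):
    """Return True if an alignment passes both percentage thresholds."""
    if aligned_len <= 0 or query_len <= 0:
        return False
    identity = 100.0 * identical / aligned_len  # assumed definition
    coverage = 100.0 * aligned_len / query_len  # assumed definition
    return identity >= min_identity and coverage >= min_coverage


# Made-up alignments against a 46-residue query.
print(is_homolog(identical=30, aligned_len=40, query_len=46))  # high identity, high coverage
print(is_homolog(identical=10, aligned_len=40, query_len=46))  # identity too low
print(is_homolog(identical=20, aligned_len=20, query_len=46))  # coverage too low
```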

Additional tool(s)

BLASTp search

A BLAST of the peptide sequence can be performed against the whole database.

Tutorials

These short tutorials will show you how to browse the database.

Tutorial #1

Filling the search form

First, open the Search page and check the Simple bank query radio button.

In the Source organism section, select Rhizobium in the Genus menu. This will populate the Species / Subspecies submenu with all the species available for this genus. Then, select etli in the Species / Subspecies menu. This will populate the Strain / Substrain menu with all the strains available for this species. Finally, select CFN 42 in the Strain / Substrain menu.

In the Peptide status section, check the Predicted peptides only radio button to get only peptides predicted by BactGeneSHOW.

In the Predicted peptide features section, check the Intergenic regions radio button. Then check the Non-pseudogene radio button.

In the Sequence features section, enter 30 in the Minimum Length and 50 in the Maximum Length to get all peptides of size between 30 and 50 residues.

Leave the Other features section as default, then click Search. Please wait and do not hit the back button while the search is performed; this can take a while.

Browsing the results

You are provided with a list of entries corresponding to the parameters entered previously. Check that the parameters you entered are correct in the search summary section.

Click on the Show # entries menu and set it to 100 to show all results at once.

Click on the L column header until peptides are sorted by size from the smallest (30) to the largest (50).

Now it is time to learn how to use filters. Scroll to the bottom of the page and check the 2nd and last (5th) checkboxes to the right in order to show only cross-species conserved peptides (C) for which a transmembrane segment (TM) was detected. This should narrow your search to only 13 peptides. Now click on the second menu under the Molecule column and select "symbiotic plasmid p42d", which should narrow the list to only one peptide.

Click on the BPDB:0051461 entry or the MFVRVMQDAVASRFLLVILILMVGTVVSVLLVRPSDDSRTGRVERSL sequence to access the peptide entry.

Peptide entry content

You get access to all information gathered about this peptide (see BactPepDB entry content for help).

Scrolling down to the Homologous sequences found in genus Rhizobium section shows all peptides of the Rhizobium genus that have 50% or more similarity to the peptide currently browsed. Click on the Id% column header until peptides are sorted by similarity from highest to lowest. Currently, BPDB:0051461 has 2 homologous sequences belonging to 2 different strains. Clicking on the id or sequence of one of these peptides will bring you to its entry.

Tutorial #2

Using the BLASTp search

Open the Search page and click on the Reset this form button to set back all parameters to their default values. Then, check the BLASTp Search radio button. In the Enter query sequence(s) textbox, copy/paste the following sequence in FASTA format:

>gi|507519673|ref|YP_008041287.1| hypothetical protein AHML_00775 [Aeromonas hydrophila ML09-119]
MMPHLIEINSSLLFDEYLQSLGVPQTQLDQEQDIYLQERHLAAVRQIQGELKFYLRASALTRQ

Set the e-value to 0.01. Leave the other sections as default and click Search.

Browsing the BLASTp results

You are provided with a list of BLAST hits with their corresponding alignments. If you scroll down to the bottom of the page you will see the seqLogo image which shows you the weight of each letter at each position of the multiple alignment. You can download this multiple alignment by clicking the Download alignment button above.
