Sequence Query

Introduction

The sequence query, in which one or more peptide molecular masses are combined with sequence, composition and fragment ion data, is potentially the most powerful search of all. The usual source of the sequence information is interpretation of an MS/MS spectrum. While it is very difficult to determine a complete and unambiguous peptide sequence from an MS/MS spectrum, it is often possible to find a series of peaks providing 3 or 4 residues of reliable sequence data.

This general approach was pioneered by Mann and co-workers at EMBL, who used the term "sequence tag" for a few residues of sequence data combined with molecular weight information [Mann, 1994]. They defined a sequence tag derived from an MS/MS spectrum as the mass of the precursor peptide, the mass of the first peak of the identified sequence ladder, a stretch of interpreted sequence, and the mass of the final peak of the ladder.

The sequence query mode of Mascot supports both standard and error tolerant sequence tags. It also allows arbitrary combinations of fragment ion mass values, amino acid sequence data and amino acid composition data to be searched. Mascot Distiller can be used to assist in the manual interpretation of sequence tags or to call them automatically (requires the Search Toolbox).

Although sequence queries and uninterpreted MS/MS data can be combined in a single search, the data set is more likely to consist of just one or more sequence tags. Even so, all of the material in the tutorial on searching uninterpreted MS/MS data applies equally to a sequence query search, except that the data set will usually be too small to estimate the false discovery rate (FDR) using a target/decoy search.

Syntax

Each line entered into the query window must consist of one experimental peptide mass value, optionally followed by qualifiers for that peptide:

M seq(…) comp(…) ions(…) tag(…) etag(…)

M is an experimental mass value, seq(…) is AA sequence information, comp(…) is AA composition information, ions(…) contains MS/MS fragment mass and (optionally) intensity values, tag(…) is a sequence tag, etag(…) is an error tolerant sequence tag.

A line may contain zero, one, or many qualifiers. If there are multiple sequence tag qualifiers, and one or more is error tolerant, then all tags are treated as error tolerant.

N.B. ions(…), tag(…), and etag(…) qualifiers are scored probabilistically. That is, the more qualifiers that match, the higher the score, but not all qualifiers are required to match. In contrast, seq(…) and comp(…) are treated as filters. If a seq(…) or comp(…) qualifier fails to match, then the entire query is discarded. Hence, only include seq(…) or comp(…) qualifiers which are known with a high degree of confidence. Note that using a seq(…) qualifier in a Mascot search is not equivalent to performing a BLAST search.
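The line syntax above can be illustrated with a small parser. This is only a sketch for preparing or checking query lines before submission; the function names parse_query_line and classify are hypothetical, and nothing here reflects how Mascot itself parses a query.

```python
import re

# The five qualifier names come from the syntax line above.
# "etag" is listed before "tag" so the longer name is tried first.
QUALIFIER = re.compile(r"\b(seq|comp|ions|etag|tag)\s*\(([^)]*)\)", re.IGNORECASE)

def parse_query_line(line):
    """Split a query line into its mass value and a list of (name, body) pairs."""
    parts = line.strip().split(None, 1)
    mass = float(parts[0])
    rest = parts[1] if len(parts) > 1 else ""
    qualifiers = [(name.lower(), body.strip()) for name, body in QUALIFIER.findall(rest)]
    return mass, qualifiers

def classify(qualifiers):
    """Separate hard filters (seq, comp) from probabilistically scored qualifiers."""
    filters = [q for q in qualifiers if q[0] in ("seq", "comp")]
    scored = [q for q in qualifiers if q[0] in ("ions", "tag", "etag")]
    # If any tag is error tolerant, every tag on the line is treated as error tolerant.
    if any(name == "etag" for name, _ in scored):
        scored = [("etag" if name == "tag" else name, body) for name, body in scored]
    return filters, scored
```

For instance, parse_query_line("1890.2 tag(1004.1, LSADTG, 1548.5)") returns the mass 1890.2 and a single tag qualifier whose body is the text between the parentheses.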

Sequence information

The sequence information should be given in standard one letter code. It should be preceded by a prefix as outlined in the table below, to indicate what type of sequence it is. If no prefix is specified, the default is b-.

Prefix   Meaning               Example
b-       N->C sequence         seq(b-DEFG)
y-       C->N sequence         seq(y-GFED)
*-       Orientation unknown   seq(*-DEFG)
n-       N terminal sequence   seq(n-ACDE)
c-       C terminal sequence   seq(c-FGHI)

The examples will all match to a peptide with the sequence ACDEFGHI.

Note that *-DEFG will search for both DEFG and GFED.

Note also that y-GFED is written C-term to N-term, whereas c-FGHI is written N-term to C-term.

Both lower and upper case characters may be used for amino acids. An unknown amino acid may be indicated by an ‘X’. More than one amino acid may be specified for a position by putting them between square brackets. A line may contain several sequence information qualifiers. For example, the following query will match to a peptide with the sequence ACDEFGHI:

1234 seq(n-AC[DHK]) seq(c-HI) seq(*-GF)
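The prefix rules can be checked against the ACDEFGHI example with a short sketch. It assumes that a seq(…) body can be translated into a regular expression (X becomes "any residue", square brackets become a character class); the helper names are hypothetical, and this illustrates the filter semantics described above rather than Mascot's implementation.

```python
import re

def reverse_tokens(seq):
    """Reverse a sequence while keeping each [..] position intact."""
    tokens = re.findall(r"\[[^\]]*\]|.", seq)
    return "".join(reversed(tokens))

def to_regex(seq):
    """X means any residue; [..] already reads as a regex character class."""
    return seq.upper().replace("X", ".")

def seq_matches(qualifier, peptide):
    """Return True if a seq(...) body such as 'y-GFED' is satisfied by the peptide."""
    m = re.fullmatch(r"([bBnNcCyY*])-(.+)", qualifier)
    # With no prefix, the default is b- (N->C sequence).
    prefix, body = (m.group(1).lower(), m.group(2)) if m else ("b", qualifier)
    peptide = peptide.upper()
    forward = to_regex(body)
    backward = to_regex(reverse_tokens(body))
    if prefix == "b":
        return re.search(forward, peptide) is not None
    if prefix == "y":  # written C-term to N-term, so reverse before searching
        return re.search(backward, peptide) is not None
    if prefix == "*":  # orientation unknown: try both directions
        return bool(re.search(forward, peptide) or re.search(backward, peptide))
    if prefix == "n":  # must sit at the N terminus
        return re.match(forward, peptide) is not None
    return re.search(forward + "$", peptide) is not None  # c-: at the C terminus
```

Applied to the query above, all three qualifiers (n-AC[DHK], c-HI and *-GF) return True for ACDEFGHI, so the peptide passes the filter.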

Composition Information

Composition should consist of a number, followed by the corresponding amino acid between square brackets. An asterisk means "one or more". For example:

comp(2[H]0[M]3[DE]*[K])

indicates a peptide which contains 2 histidines, no methionines, 3 acidic residues (glutamic or aspartic acid) and at least 1 lysine. Note that ‘X’ is not meaningful and so not allowed in a composition query.
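A composition filter of this kind can be sketched in a few lines. The sketch assumes that a number means an exact count of the bracketed residues taken together, which is how the example reads (0[M] meaning no methionines); comp_matches is a hypothetical helper, not part of Mascot.

```python
import re

def comp_matches(body, peptide):
    """Check a comp(...) body such as '2[H]0[M]3[DE]*[K]' against a peptide."""
    peptide = peptide.upper()
    for count, residues in re.findall(r"(\d+|\*)\[([A-Za-z]+)\]", body):
        # Count every occurrence of any residue listed inside the brackets.
        observed = sum(peptide.count(res) for res in residues.upper())
        if count == "*":
            if observed < 1:  # asterisk: one or more required
                return False
        elif observed != int(count):  # assumed to be an exact count
            return False
    return True
```

With the example above, a hypothetical peptide HDHEEK passes (2 H, no M, 3 acidic residues, 1 K), while HDHEEKM is rejected because it contains a methionine.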

Ions information

Mass and (optionally) intensity values from one or more ion series in the MS/MS spectrum of a peptide can be specified in an ions qualifier. Each ions qualifier can include a prefix to indicate what type of ion series the m/z values belong to.

Prefix   Meaning         Example
b-       b series ions   ions(b-m1:i1,m2:i2, …,mn:in)
y-       y series ions   ions(y-m1,m2, …,mn)
(none)   unassigned      ions(m1:i1,m2:i2, …,mn:in)

The inclusion of intensity values, separated from mass values by colons, is optional. If intensity values are not included, then the colons must also be omitted, as in the y series example. Mascot uses the intensity information to iteratively select sub-sets of the most intense peaks in order to optimise scoring discrimination.

Mass values do not need to be in order, or represent contiguous sequence ion ladders.

A line may contain several ions information qualifiers, for example:

1454.4 ions(b-610,707,804,1086) ions(y-2909) ions(2106,2632,2545)
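The body of an ions qualifier is easy to take apart programmatically. The following sketch (parse_ions is a hypothetical name) returns the series prefix, if any, and the list of peaks, with the intensity left as None when it was not supplied.

```python
def parse_ions(body):
    """Return (series, [(mass, intensity_or_None), ...]) for an ions(...) body."""
    series = None
    body = body.strip()
    if body[:2].lower() in ("b-", "y-"):
        series, body = body[0].lower(), body[2:]
    peaks = []
    for item in body.split(","):
        item = item.strip()
        if not item:
            continue
        # A colon, when present, separates the mass from its intensity.
        mass, _, intensity = item.partition(":")
        peaks.append((float(mass), float(intensity) if intensity else None))
    return series, peaks
```

For the example line above, the three qualifiers parse to a b series with four peaks, a y series with one peak, and three unassigned peaks.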

Standard Sequence Tag

The sequence tag qualifier consists of the observed mass of the first peak of an identified sequence ladder, a stretch of interpreted amino acid sequence, and the observed mass of the final peak of the ladder. For example:

1890.2 tag(1004.1, LSADTG, 1548.5)
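Because the two mass values bracket a sequence ladder, their difference should be close to the summed residue masses of the interpreted sequence. A quick sanity check can be sketched as follows, assuming singly charged fragment ions and standard monoisotopic residue masses; tag_gap_error is a hypothetical helper and handles plain sequences only (no brackets or X).

```python
# Standard monoisotopic residue masses in Da (not taken from this page).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def tag_gap_error(low, seq, high):
    """Observed gap between the two ladder peaks minus the summed residue masses."""
    expected = sum(MONO[aa] for aa in seq.upper())
    return (high - low) - expected

# The example tag: the gap is 544.4 Da against about 544.25 Da for LSADTG.
error = tag_gap_error(1004.1, "LSADTG", 1548.5)
```

An error of a few tenths of a dalton, as here, is the kind of discrepancy a fragment mass tolerance is meant to absorb; a much larger value suggests a misread residue or a wrong peak.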

Use of whitespace (tabs, spaces) inside the parentheses is optional, for readability. Case is not significant. Other qualifiers, including other sequence tags, may be included in the same query.

The syntax for the sequence string is similar to a seq(…) qualifier, but without the prefix. That is, [IL]SAXTG would be allowed. X means unknown, and is equivalent to [ACDEFGHIKLMNPQRSTUVWY]. There is little point in specifying X in a standard tag, but it can be useful in an error tolerant tag.

In a tag, the sequence syntax is extended to describe alternative dimers, trimers, etc. For example: LSA[DT|M|F]G. The pipe symbol divides alternatives, so that the defined possibilities in this case are LSADTG, LSAMG, LSAFG. This provides a convenient way to represent the ambiguities that are found when trying to interpret a spectrum. A term in square brackets without pipe symbols defaults to the original sense of a character class. That is [IL] is identical to [I|L]. Note that alternatives delimited by pipe symbols are sequences, not character classes. [DT|M|F] is not the same as [DT|TD|M|F].
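The way the pipe syntax expands into concrete sequences can be sketched as follows. expand_tag_sequence is a hypothetical helper that enumerates the possibilities exactly as described above; X is left as a literal placeholder rather than expanded.

```python
import re
from itertools import product

def expand_tag_sequence(seq):
    """List every concrete sequence described by a tag sequence string."""
    choices = []
    for group, single in re.findall(r"\[([^\]]*)\]|([A-Za-z])", seq):
        if single:
            choices.append([single.upper()])
        elif "|" in group:
            # Pipe-delimited alternatives are whole sequences (dimers, trimers, ...).
            choices.append([alt.upper() for alt in group.split("|")])
        else:
            # Without pipes, the brackets are a character class: one residue each.
            choices.append(list(group.upper()))
    return ["".join(parts) for parts in product(*choices)]
```

Calling expand_tag_sequence("LSA[DT|M|F]G") yields LSADTG, LSAMG and LSAFG, and [IL] and [I|L] expand to the same two single residues.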

A tag may run in either direction, but the mass values are ‘glued’ to the ends of the tag. Hence, tag(1004, LSADTG, 1548) is the same as tag(1548, GTDASL, 1004) but different to tag(1548, LSADTG, 1004).
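The equivalence between the two directions can be made concrete by normalising every tag so that the smaller mass comes first. This is a sketch only; canonical_tag and reverse_tag_sequence are hypothetical names, and pipe-delimited alternatives are reversed individually because they are sequences rather than character classes.

```python
import re

def reverse_tag_sequence(seq):
    """Reverse a tag sequence, keeping bracketed positions together."""
    out = []
    for tok in reversed(re.findall(r"\[[^\]]*\]|[A-Za-z]", seq)):
        if tok.startswith("[") and "|" in tok:
            # Each alternative is itself a sequence, so reverse it too.
            tok = "[" + "|".join(alt[::-1] for alt in tok[1:-1].split("|")) + "]"
        out.append(tok)
    return "".join(out)

def canonical_tag(mass_a, seq, mass_b):
    """Return the tag with the lower mass first; masses stay attached to their ends."""
    if mass_a <= mass_b:
        return (mass_a, seq.upper(), mass_b)
    return (mass_b, reverse_tag_sequence(seq).upper(), mass_a)
```

Under this normalisation, tag(1004, LSADTG, 1548) and tag(1548, GTDASL, 1004) reduce to the same tuple, whereas tag(1548, LSADTG, 1004) reduces to a different one.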

The observed fragment ion mass values can belong to any series, including doubly charged series if permitted by the precursor charge and instrument type. However, both fragment ion mass values must belong to the same series. That is, they can both be y or y++ or y-17 but one cannot be y and the other y-17.

If the tag includes an ambiguous sequence string and there are variable modifications or a wide peptide mass tolerance or no enzyme specificity, this may generate a very large number of possibilities. Such searches can take a long time to complete and are unlikely to give a high score.

It is not possible to mix ions(…) qualifiers and sequence tags in the same query.

Error Tolerant Sequence Tag

A sequence tag can match to a peptide despite there being an unsuspected modification or point mutation by allowing the mass values to ‘float’. For example, take the peptide GVQVETISPGDGR, MH+ = 1314.7 and the (b ion) sequence tag:

1314.7 tag(614.3,TISP,911.5)

If there was an unsuspected modification on the N-terminal side of the tag, which increased the mass by 100, this would affect both the fragment ion mass values in tandem. The tag interpreted from the spectrum would become:

1414.7 tag(714.3,TISP,1011.5)

On the other hand, if the unsuspected modification was on the C-terminal side of the tag, or if the fragment ions were y series ions, the fragment ion mass values would be unchanged, and the interpreted tag would be:

1414.7 tag(614.3,TISP,911.5)

By entering a sequence tag as an error tolerant sequence tag, using the keyword etag, you can have Mascot search for these possibilities automatically. When searching an etag, the peptide molecular weight constraint is relaxed and the fragment ion mass values must fit one of two possibilities. Either both values are unchanged or both values are shifted by the same amount as the peptide mass.
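The two permitted possibilities can be written down as a small check. This sketch uses the mass values from the example above; the function name, argument names and the 0.5 Da default tolerance are assumptions made for illustration, not Mascot parameters.

```python
def etag_fragments_match(obs_peptide, obs_pair, calc_peptide, calc_pair, tol=0.5):
    """Apply the etag rule: both fragment masses unchanged, or both shifted
    by the difference between observed and calculated peptide mass."""
    delta = obs_peptide - calc_peptide

    def close(a, b):
        return abs(a - b) <= tol

    unchanged = all(close(o, c) for o, c in zip(obs_pair, calc_pair))
    shifted = all(close(o, c + delta) for o, c in zip(obs_pair, calc_pair))
    return unchanged or shifted

# Modification on the N-terminal side of a b ion tag: both values move by 100.
n_side = etag_fragments_match(1414.7, (714.3, 1011.5), 1314.7, (614.3, 911.5))
# Modification on the C-terminal side: both values stay where they were.
c_side = etag_fragments_match(1414.7, (614.3, 911.5), 1314.7, (614.3, 911.5))
# Only one value shifted: not one of the two permitted possibilities.
mixed = etag_fragments_match(1414.7, (714.3, 911.5), 1314.7, (614.3, 911.5))
```

Here n_side and c_side are True and mixed is False, which mirrors the rule that the two fragment ion mass values must move together or not at all.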

Because an etag sacrifices most of the specificity of a standard sequence tag, it is not permitted to combine it with a very wide peptide mass tolerance (> 1% or > 10 Da) or no enzyme specificity. Also, because the constraint on the peptide mass is dropped, if one tag is error tolerant, then any other tags for the same query are also treated as error tolerant, even if they have been entered as standard tags. Finally, it is not possible to mix ions(…) qualifiers and sequence tags.

Other Qualifiers

peptol(tolerance,unit) may be used to specify a mass tolerance for an individual query, over-riding the search form default. For example, peptol(10,%) or peptol(2,Da).

If you re-Search a Sequence Query from the results page, you may notice two additional qualifiers which are used internally by Mascot:

from(mass,charge) is used to track the original mass and charge state of the peptide, after it has been converted to a neutral, Mr value. For example, if the peptide charge state was specified to be 1+, the query 1234.5 would become 1233.492 from(1234.5,1+)
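The conversion to a neutral Mr value can be sketched as below. The proton mass of 1.007276 Da is an assumption; it is consistent with the Mr(expt) values shown in the example report at the end of this page, but the exact constant and rounding used internally are not stated here.

```python
PROTON_MASS = 1.007276  # Da; assumed value, see the note above

def neutral_mass(observed_mz, charge):
    """Convert an observed m/z at a positive charge state to a neutral Mr."""
    return observed_mz * charge - charge * PROTON_MASS

mr_single = neutral_mass(1234.5, 1)  # about 1233.49, as in the from() example
mr_double = neutral_mass(687.3, 2)   # about 1372.585
```

The second call reproduces, to within rounding, the Mr(expt) reported for the 2+ query at 687.3 in the worked example.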

title(encoded title text) can be used to associate a text string with an individual query. If the text contains non-alphanumeric characters, these must be URL encoded by conversion to %nn, where nn is the hexadecimal ASCII code for the character. For example, Sample(1) becomes Sample%281%29.
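The encoding rule is simple enough to sketch directly. encode_title is a hypothetical helper that keeps ASCII letters and digits and converts everything else to %nn; encoding non-ASCII characters as their UTF-8 bytes is an assumption, since the text above only covers ASCII.

```python
def encode_title(text):
    """Percent-encode every character that is not an ASCII letter or digit."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            # Assumption: non-ASCII characters are written as UTF-8 bytes.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)

encoded = encode_title("Sample(1)")
```

The result for Sample(1) is Sample%281%29, matching the example above.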

Example

Load a Sequence Query form, paste the following search into the query window, and submit the search.

TAXONOMY=. . . . . . . . . . lobe-finned fish and tetrapod clade
REPTYPE=Peptide
TOL=0.03
TOLU=%
ITOL=0.5
ITOLU=Da
CHARGE=2+
INSTRUMENT=ESI-TRAP
877.4 tag(376.2, [IL][QK][IL], 730.2)
687.3 etag(782.3, NG[IL], 1066.1)

These two sequence tags are taken from the original paper of Mann and Wilm. You should find that both match to Lysozyme:

1.     Q7LZI3                  Mass: 14220    Score: 76     Matches: 2 (2)     Sequences: 2 (2)   
       Lysozyme C OS=Tragopan satyra GN=LYZ PE=1 SV=1 
       Check to include this hit in error tolerant search or archive report 
         
       Query   Observed    Mr(expt)    Mr(calc)       %    Miss  Score  Expect Rank Unique  Peptide  
           2   687.3000   1372.5854   1267.6019     8.2821   0     42     0.31   1    U     R.GYSLGNWVCAAK.F
           1   877.4000   1752.7854   1752.8278    -0.0024   0     35   0.0021   1    U     R.NTDGSTDYGILQINSR.W

The error tolerant tag has found a match by adjusting the peptide mass by 105 Da, corresponding to S-pyridylethylation of the cysteine residue.
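The figures in the report can be reproduced with a few lines of arithmetic. The sketch below recomputes the mass difference and the percentage error from the Mr(expt) and Mr(calc) columns; the pyridylethyl mass of 105.0578 Da (formula C7H7N) is an assumption brought in for comparison and does not appear in the report itself.

```python
def mass_error(mr_expt, mr_calc):
    """Return (absolute difference in Da, difference as a percentage of Mr(calc))."""
    delta = mr_expt - mr_calc
    return delta, 100.0 * delta / mr_calc

PYRIDYLETHYL = 105.0578  # Da; assumed monoisotopic mass of C7H7N

# Query 2, matched by the error tolerant tag.
delta, percent = mass_error(1372.5854, 1267.6019)
residual = delta - PYRIDYLETHYL  # what is left after accounting for the modification

# Query 1, matched by the standard tag.
delta_std, percent_std = mass_error(1752.7854, 1752.8278)
```

The error tolerant match differs from its calculated mass by roughly 105 Da (about 8.28%), while the standard tag match is within a few hundredths of a dalton, comfortably inside the 0.03% peptide tolerance set in the search.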