Searching uninterpreted MS/MS data

If you have no time to read this short tutorial, these are the most important do’s and don’ts:

  • You cannot search raw data; it must be converted into a peak list.
  • Search parameters are critical and should be determined by running a standard, such as a BSA digest.
  • If you are not sure which database to search, start with Swiss-Prot.
  • If you use a taxonomy filter, or search a single organism database, include a contaminants database in the search.
  • Only select very abundant modifications as variable.
  • If the protein was digested with an enzyme, choose this enzyme.
  • Use an error tolerant search to find post-translational modifications, SNPs, and non-specific cleavage products.
  • Peptide matches are only significant (reliable) if they have an expect value below 0.05 (a 5% chance of being false).
  • For important work, run a target-decoy search, set the peptide FDR to 1%, and filter the proteins in Report Manager by requiring significant matches to at least 2 distinct sequences.

Tutorial

The first requirement for database searching is a peak list; you cannot upload a raw data file. Raw data is converted into a peak list by a process called peak picking or peak detection. Often, the instrument data system takes care of this, and you can submit a Mascot search directly from the data system or save a peak list to a disk file for submission using the web browser search form. If not, or if you have a raw data file and no access to the data system, you’ll need to find a utility to convert it into a peak list. Peak lists are text files and come in various different formats. If you have a choice, MGF is recommended. Be careful with mzML, because this may contain either raw data or a peak list.
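As a rough illustration of what a peak list contains, the sketch below reads a tiny MGF-style text and counts the spectra in it. The two example spectra and the function name read_mgf are made up for illustration, and real files may carry additional header fields that this minimal reader simply stores as text.

```python
def read_mgf(text):
    """Parse MGF-style text into a list of spectra (one dict per spectrum)."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key.upper()] = value
            else:
                # Peak line: m/z followed by an optional intensity
                parts = line.split()
                intensity = float(parts[1]) if len(parts) > 1 else 0.0
                current["peaks"].append((float(parts[0]), intensity))
    return spectra


# Made-up example data, not taken from a real instrument
EXAMPLE_MGF = """\
BEGIN IONS
TITLE=example spectrum 1
PEPMASS=498.27
CHARGE=2+
175.12 1200.0
262.15 850.5
END IONS
BEGIN IONS
TITLE=example spectrum 2
PEPMASS=612.33
CHARGE=3+
147.11 430.0
END IONS
"""

spectra = read_mgf(EXAMPLE_MGF)
print(len(spectra), "spectra")
for s in spectra:
    print(s["params"]["TITLE"], s["params"]["CHARGE"], len(s["peaks"]), "peaks")
```

A quick check like this is also a simple way to confirm that a file really is a peak list, with a precursor m/z and a charge for each spectrum, before submitting it.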

A peak list, by itself, is not sufficient. There are also a number of search parameters that must be set appropriately. Follow this link to open the search form in a new browser tab. The labels for each control on the search form are also links to help topics. Note that you can set your own defaults for the web browser search form by following the link at the bottom of the Access Mascot Server page.

The form looks much the same whether you have your own Mascot server, in-house, or whether you are connected to the free, public Mascot Server. If you are using the free, public Mascot Server, there are some restrictions, one of which is that you have to provide a name and email address so that we can email a link to your search results if the connection is broken. A more important restriction is that searches are limited to a maximum of 1200 spectra. Whether you enter a search title is your choice. It is displayed at the top of the result report, and can be a useful way of identifying the search at a later date.

If at all possible, run a standard sample and use this to set all the search parameters. By standard sample, we mean something like a BSA digest, which will give strong matches and where you know what the answer is supposed to be. Trying to set search parameters on an unknown is much more difficult, especially if the sample was lost somewhere during the work-up or if the instrument has developed a fault.

The first choice you have to make, and one of the more difficult, is which database to search. The free public web site has just a few of the more popular public databases, but an in-house server may have a hundred or more. Some databases contain sequences from a single organism. Others contain entries from multiple organisms, but usually include the taxonomy for each entry, so that entries for a specific organism can be selected during a search using a taxonomy filter.

If you’re not sure what is in the sample, Swiss-Prot is a good starting point. The entries are all high quality and well annotated. Because Swiss-Prot is non-redundant, it is relatively small. The size of the database is one factor in the size of the search space – the number of peptide sequences that are compared with a spectrum to see which gives the best match. The smaller the search space, the easier it is to get a statistically significant match. This is a very important concept and other factors that affect the size of the search space will be highlighted as we come to them.

If you think you know what is in the sample, you may want to search an organism specific database. But you can never rule out contaminants. This can be a severe problem if you only have a handful of spectra. You might be interested in a human protein, so you search a human database, but your spectrum is for a peptide from a contaminant, so you get no match or a misleading match. When searching entries for a single organism, always include a database of common contaminants. This is important, even if you have a large dataset and no interest in proteins from anything other than your target organism. Otherwise, you may end up reporting that your sample is full of serum albumin when it is really BSA, or keratin when it is really sheep keratin from clothing. In the web browser form, to select two databases, first click on your target database then hold down the control key and click on a contaminants database. If your search uses a taxonomy filter, that’s not a problem because taxonomy is not configured for the contaminants databases, so all the entries will always be searched.

If your target organism is well characterised, such as human or mouse or yeast or Arabidopsis, there may be no need to look beyond Swiss-Prot. You can get a sense of how well your organism is represented in Swiss-Prot by looking at the release notes, which list the 250 best represented species. If you are interested in a bacterium or a plant, you may find that it is poorly represented in Swiss-Prot, and it would be better to try one of the comprehensive protein databases, which aim to include all known protein sequences. The two best known are NCBInr and UniRef100. If the genome of your organism hasn’t been sequenced, you may still be out of luck, and your best hope is to search a collection of ESTs (Expressed Sequence Tags are relatively short nucleic acid sequences). Follow this link to see the entry in the NCBI taxonomy browser for orange, the citrus fruit. This has just 94 entries in Swiss-Prot and only 760 in the whole of NCBInr. If this is your organism of interest, you’ll definitely want to search the ESTs, which number over 214,000. (All counts as of June 2013.)

Never choose a narrow taxonomy without looking at the counts of entries and understanding the classification. In the current Swiss-Prot, for example, there are 26,139 entries for rodentia, of which all but 1,602 are for mouse and rat. So, even if your target organism is hamster, it isn’t a good idea to choose ‘other rodentia’. Better to search rodentia and hope to get matches to homologous proteins from mouse and rat.

Swiss-Prot is a non-redundant database, where sequences that are very similar have been collapsed into a single entry. This means that the database entry will often differ slightly from the protein you analysed. Standard database searching requires the exact peptide sequence, so you may miss some matches due to SNPs and other variants. This would be another reason to search a large, comprehensive database. But, remember that NCBInr is 50 times the size of Swiss-Prot, so searches take proportionally longer and the search space is proportionally larger, meaning that you need higher quality data to get a significant match.

If your protein was digested using an enzyme, always choose this enzyme. Choosing a semi-specific enzyme or ‘None’, for non-specific cleavage, greatly increases the search time and the search space, which will almost certainly cause a net reduction in the number of matches. The error tolerant search, discussed below, is a better way of finding non-specific peptides. If you are studying endogenous peptides, such as MHC peptides, you have no choice, and enzyme ‘None’ will look for matches in all sub-sequences of all proteins. If you are doing top-down, to analyse intact proteins, choose NoCleave. Note that NoCleave is not the same as None; it is the exact opposite.
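To get a feel for the difference between these enzyme settings, the sketch below counts the candidate peptides that a single sequence yields under each one. The sequence is made up, the trypsin rule is simplified to "cleave after K or R unless the next residue is P" with no missed cleavages, and no length limits are applied, so the numbers are only indicative.

```python
def tryptic_peptides(seq):
    """Fully tryptic peptides with no missed cleavages (simplified rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def count_nonspecific(seq):
    """Number of contiguous sub-sequences (start/end pairs), as for 'None'."""
    n = len(seq)
    return n * (n + 1) // 2


# Hypothetical 39-residue sequence, for illustration only
SEQ = "MASEQKLPVTRDGHKPLLNWEAGRSTVYKDEFIMCQRAA"

counts = {
    "NoCleave": 1,  # the intact sequence is the only candidate
    "Trypsin": len(tryptic_peptides(SEQ)),
    "None": count_nonspecific(SEQ),
}
for enzyme, n in counts.items():
    print(f"{enzyme:8s} {n} candidate peptides")
```

Even for this short sequence, 'None' gives over a hundred times more candidates than trypsin, while NoCleave gives just one, which is why the two settings are opposites.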

When designing your experiment, be aware that an enzyme of low specificity, which digests proteins to a mixture of very short peptides, is not a good choice, because very short sequences occur in many database entries and so have low specificity. The longer the peptide, the easier it is to get a significant match and the more likely it is that the match will point to one particular protein. In most cases, it is best to use an enzyme of specificity equal to or greater than trypsin, and focus on peptides with masses between 1200 and 4000 Da.

The number of allowed missed cleavages should be set empirically, by running a standard with this set to a high value and looking at the significant matches to judge the extent of incomplete cleavage. Setting this value higher than necessary simply increases the size of the search space, which you will now recognise as being a ‘bad thing’.
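The effect of the missed cleavage setting on the search space can be sketched with a small in silico digest. This is an illustration only: the sequence is made up, and the cleavage rule (after K or R, not before P) is a simplification of trypsin specificity.

```python
def cleave(seq):
    """Split a sequence after K or R, unless the next residue is P."""
    fragments, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            fragments.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        fragments.append(seq[start:])
    return fragments


def digest(seq, missed_cleavages):
    """Peptides built from up to missed_cleavages + 1 adjacent fragments."""
    fragments = cleave(seq)
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


# Hypothetical sequence with four cleavage products
SEQ = "GASKLLVRTTEKWYFR"

for mc in range(4):
    print(f"missed cleavages = {mc}: {len(digest(SEQ, mc))} peptides")
```

Each extra allowed missed cleavage adds candidates for every protein in the database, whether or not the sample contains any incompletely cleaved peptides.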

Modifications in database searching are handled in two ways. First, there are the fixed or quantitative modifications. An example would be the efficient alkylation of cysteine. Since all cysteines are modified, this is effectively just a change in the mass of cysteine. It carries no penalty in terms of search speed or specificity.

In contrast, most post-translational modifications do not apply to all instances of a residue. For example, phosphorylation might affect just one serine in a protein containing many serines and threonines. These variable or non-quantitative modifications are expensive in the sense that they increase the search space. This is because the software has to permute out all the possible arrangements of modified and unmodified residues that fit to the peptide molecular mass. As more and more modifications are considered, the number of combinations and permutations increases geometrically, and we get a so-called combinatorial explosion.
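The growth in the number of arrangements is easy to demonstrate. The sketch below counts every way of placing variable modifications on a made-up peptide, assuming each site is either unmodified or carries one applicable modification. It ignores the precursor mass constraint, so it is an upper bound rather than the number of candidates a search engine would actually test.

```python
def count_arrangements(peptide, variable_mods):
    """Count modified/unmodified arrangements for one peptide.

    variable_mods maps a modification name to the residues it can modify.
    Each residue is either unmodified or carries one applicable modification.
    """
    total = 1
    for aa in peptide:
        choices = 1 + sum(aa in residues for residues in variable_mods.values())
        total *= choices
    return total


# Hypothetical peptide with several S, T, Y and M residues
PEPTIDE = "STSMSTYK"

settings = [
    {},
    {"Phospho (ST)": "ST"},
    {"Phospho (ST)": "ST", "Oxidation (M)": "M"},
    {"Phospho (ST)": "ST", "Oxidation (M)": "M", "Phospho (Y)": "Y"},
]
for mods in settings:
    names = ", ".join(mods) if mods else "no variable modifications"
    print(f"{names}: {count_arrangements(PEPTIDE, mods)} arrangements")
```

For this peptide, a single variable modification already turns one candidate into 32, and each further modification multiplies the count again.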

This makes it very important to be sparing with variable modifications. If the aim of the search is to identify as many proteins as possible, the best advice is to use a minimum of variable modifications, or none at all. Most post-translational modifications, such as phosphorylation, are rare and it is much more efficient to use an error tolerant search to find them.

You cannot select two fixed modifications with the same specificity. If you select variable modifications with the same specificity as a fixed modification, this excludes the possibility of an unmodified site. For example, if you choose Carbamidomethyl (C) as fixed and Propionamide (C) as variable, you can get matches to either of these but never to a peptide with free cysteine. Also, you will not get matches to a peptide modified with both carbamidomethyl and propionamide.
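The rule about shared specificity can be pictured as a list of the states a cysteine is allowed to take. The sketch below is only a way of illustrating the rule described above, not a description of how Mascot implements it; the function name is hypothetical, and the mass values are standard monoisotopic figures.

```python
# Monoisotopic residue mass of cysteine and modification deltas, in Da
CYS = 103.009185
DELTAS = {"Carbamidomethyl": 57.021464, "Propionamide": 71.037114}


def cysteine_states(fixed, variable):
    """Return the allowed states of a cysteine and the resulting residue mass.

    fixed:    name of a fixed modification on C, or None
    variable: list of names of variable modifications on C
    A fixed modification replaces the unmodified state altogether.
    """
    base = fixed if fixed else "unmodified"
    states = [base] + list(variable)
    return {name: round(CYS + DELTAS.get(name, 0.0), 6) for name in states}


# Carbamidomethyl fixed, Propionamide variable: free cysteine is not possible
print(cysteine_states("Carbamidomethyl", ["Propionamide"]))

# Carbamidomethyl variable only: free cysteine remains possible
print(cysteine_states(None, ["Carbamidomethyl"]))
```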

Making an estimate of the mass accuracy doesn’t have to be a guessing game. The Mascot result reports include graphs of mass errors. Just run a standard and look at the error graphs for the strong matches. Ignore outliers, which are likely to be chance matches, and you’ll normally see some kind of trend. Add on a safety margin and this is your error estimate. The graph for precursor mass error is in the Protein View report and the graph for MS/MS fragment mass error is in the Peptide View report. You can also use these graphs to decide whether Da or ppm is the best choice for the tolerance unit.
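The relationship between Da and ppm, and one possible way of turning observed errors into a tolerance, can be sketched as follows. The error values are made up, and the rule used here (discard the largest 10% of errors as presumed outliers, then multiply by 1.5) is an arbitrary assumption standing in for judging the graph by eye.

```python
def ppm_error(observed, calculated):
    """Mass error in parts per million."""
    return (observed - calculated) / calculated * 1e6


def within_tolerance(observed, calculated, tol, unit):
    """Check a mass against a tolerance given in 'ppm' or 'Da'."""
    if unit == "ppm":
        return abs(ppm_error(observed, calculated)) <= tol
    if unit == "Da":
        return abs(observed - calculated) <= tol
    raise ValueError("unit must be 'ppm' or 'Da'")


def suggest_tolerance(errors, keep_fraction=0.9, margin=1.5):
    """Drop the largest errors as presumed outliers, then add a margin."""
    ranked = sorted(abs(e) for e in errors)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[keep - 1] * margin


# The same absolute error is a different ppm error at different masses
print(round(ppm_error(1000.01, 1000.0), 2), "ppm at 1000 Da")
print(round(ppm_error(2000.01, 2000.0), 2), "ppm at 2000 Da")

# Hypothetical precursor errors (ppm) for strong matches from a standard
errors_ppm = [1.2, -0.8, 2.1, 1.5, -1.9, 0.4, 2.4, 1.1, -0.3, 48.0]
print("suggested tolerance:", round(suggest_tolerance(errors_ppm), 1), "ppm")
```

If the errors from the standard are roughly constant in Da across the mass range, Da is the natural unit; if they grow with mass, ppm is the better fit.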

Sometimes, peak picking chooses the 13C peak rather than the 12C, so the mass is out by 1 Da. In extreme cases, it may pick the 13C2 peak. The #13C control allows for this, enabling you to use a tight mass tolerance and still get a match. In general, it’s not advisable to combine #13C with deamidation because, if you have a high level of 13C precursors, it will be difficult to detect deamidation reliably. This is another setting that should be determined empirically, by running a standard.
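One way to see both points is to look at the numbers. The 13C isotope spacing is about 1.00335 Da and deamidation adds about 0.98402 Da, so the two differ by only about 0.019 Da. The sketch below is a simplified precursor check with made-up masses, not Mascot's matching logic: it tests whether an observed mass fits a calculated mass once a number of 13C shifts is allowed.

```python
C13_SHIFT = 1.0033548    # mass difference between 13C and 12C, in Da
DEAMIDATION = 0.984016   # mass added by deamidation, in Da


def matches_with_isotope_error(observed, calculated, tol_ppm, max_13c):
    """Return the number of 13C shifts that gives a match, or None."""
    for n in range(max_13c + 1):
        target = calculated + n * C13_SHIFT
        if abs(observed - target) / target * 1e6 <= tol_ppm:
            return n
    return None


calculated = 1500.7000   # hypothetical peptide mass

# Precursor picked on the 13C peak: only matches if one 13C is allowed
picked_13c = 1501.7034
print(matches_with_isotope_error(picked_13c, calculated, 10, max_13c=0))
print(matches_with_isotope_error(picked_13c, calculated, 10, max_13c=1))

# A deamidated peptide lies about 13 ppm from the 13C peak at this mass,
# so a 20 ppm tolerance cannot tell the two apart on precursor mass alone
deamidated = calculated + DEAMIDATION
print(matches_with_isotope_error(deamidated, calculated, 20, max_13c=1))
print(matches_with_isotope_error(deamidated, calculated, 10, max_13c=1))
```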

Most modern instruments produce monoisotopic mass values. You will only have average masses if the entire isotope distribution has been centroided into a single peak, which usually implies very low resolution. (If you get this setting wrong, the mass errors will be very large and show a strong trend, because the difference between an average and a monoisotopic mass for peptides and proteins is approximately 0.06%.)
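The size of the difference can be checked with a short calculation. The sketch below sums standard monoisotopic and average residue masses for a made-up peptide; only the residues needed for the example are included in the table.

```python
# Residue masses in Da as (monoisotopic, average), for a few amino acids only
RESIDUES = {
    "S": (87.03203, 87.0782),
    "A": (71.03711, 71.0788),
    "M": (131.04049, 131.1926),
    "P": (97.05276, 97.1167),
    "L": (113.08406, 113.1594),
    "E": (129.04259, 129.1155),
    "R": (156.10111, 156.1875),
}
WATER = (18.010565, 18.01528)


def peptide_masses(seq):
    """Return (monoisotopic, average) mass of a neutral peptide."""
    mono = sum(RESIDUES[aa][0] for aa in seq) + WATER[0]
    avg = sum(RESIDUES[aa][1] for aa in seq) + WATER[1]
    return mono, avg


mono, avg = peptide_masses("SAMPLER")   # hypothetical peptide
relative = (avg - mono) / mono
print(f"monoisotopic {mono:.4f} Da, average {avg:.4f} Da")
print(f"difference {avg - mono:.4f} Da ({relative * 100:.3f}%)")
```

A discrepancy of this size, growing steadily with mass, is the trend to look for in the error graphs if the wrong setting has been chosen.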

Peptide charge is a default, only used if no charge is specified in the peak list. Most peak lists specify a charge state for every spectrum, so this default is rarely used.

The instrument setting determines which fragment ion series will be considered in the search. Choose the description that best matches the type of instrument. If you follow the control label link, you’ll see that many of the instruments are very similar. The main problem is if you choose CID for ETD data or vice versa.

Report determines the maximum number of hits displayed in a search results report. Always choose AUTO to display all the protein hits containing one or more significant peptide matches.

The decoy checkbox enables you to estimate the peptide false discovery rate (FDR) as recommended by most journals. Mascot repeats the search, using identical search parameters, against a database in which the sequences have been reversed. You do not expect to get any real matches from the decoy database, so the number of matches observed is an excellent estimate of the number of false positives in the results from the target database. The result report gains a control that allows the significance threshold to be adjusted to a peptide FDR of 5% or 1% or whatever you believe is appropriate for your work. Note that this is peptide FDR, not protein FDR.
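The arithmetic behind the estimate is simple and can be sketched as follows. The scores are made up, and using a plain score cut-off is an assumption made for illustration; it is not meant to reproduce how Mascot adjusts its significance threshold.

```python
def peptide_fdr(target_scores, decoy_scores, threshold):
    """Estimated FDR: decoy matches divided by target matches at a cut-off."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return decoys / targets if targets else 0.0


def threshold_for_fdr(target_scores, decoy_scores, max_fdr):
    """Lowest observed score cut-off whose estimated FDR is within max_fdr."""
    for threshold in sorted(set(target_scores) | set(decoy_scores)):
        if peptide_fdr(target_scores, decoy_scores, threshold) <= max_fdr:
            return threshold
    return None


# Hypothetical peptide match scores from the target and decoy searches
target = [50, 45, 40, 35, 30, 25, 20, 15, 10, 5]
decoy = [22, 12, 6]

for cutoff in (5, 15, 25):
    print(f"cut-off {cutoff}: FDR = {peptide_fdr(target, decoy, cutoff):.3f}")
print("cut-off for 1% FDR:", threshold_for_fdr(target, decoy, 0.01))
```

With so few matches the estimate moves in large steps, which is one reason an FDR estimate is only meaningful for searches with a large number of spectra.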

As mentioned several times already, an error tolerant search is the most efficient way to discover most post-translational modifications, as well as non-specific peptides and sequence variants. This is a two pass search, the first pass being a simple search of the entire database with minimal modifications. The protein hits found in the first pass search are then selected for an exhaustive second pass search, during which we look for all possible modifications, sequence variants, and non-specific cleavage products. Because only a small number of entries are being searched, search time is not an issue. The matches from the first pass search, in the limited search space, are the evidence for the presence of the proteins, while the matches from the second pass search give increased coverage. If you see a very abundant modification, best to add this as a variable modification and then search again, because the error tolerant search only catches peptides with a single unsuspected modification. Error tolerant searching is not so useful for very heavily modified proteins, such as histones, or where there is only one peptide per protein, such as endogenous peptides.
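The two pass idea can be illustrated with a deliberately crude toy. Everything here is an assumption made for the sake of the example: matching uses precursor mass only, the database and masses are made up, and only three modification deltas are tried. A real search scores MS/MS fragment ions and considers far more possibilities.

```python
# A few monoisotopic modification deltas in Da
MOD_DELTAS = {"Phospho": 79.966331, "Oxidation": 15.994915, "Acetyl": 42.010565}


def first_pass(observed, database, tol):
    """Match observed masses to unmodified peptides in every protein."""
    hits = {}
    for mass in observed:
        for protein, peptides in database.items():
            for peptide, pep_mass in peptides.items():
                if abs(mass - pep_mass) <= tol:
                    hits.setdefault(protein, []).append((mass, peptide))
    return hits


def second_pass(unmatched, database, hit_proteins, tol):
    """Try one modification per peptide, but only in first pass proteins."""
    extra = []
    for mass in unmatched:
        for protein in hit_proteins:
            for peptide, pep_mass in database[protein].items():
                for mod, delta in MOD_DELTAS.items():
                    if abs(mass - (pep_mass + delta)) <= tol:
                        extra.append((mass, protein, peptide, mod))
    return extra


# Hypothetical database: protein -> {peptide label: mass in Da}
database = {
    "protA": {"pepA1": 1000.50, "pepA2": 1200.60},
    "protB": {"pepB1": 900.40},
}
# pepA1 unmodified, pepA2 phosphorylated, pepB1 oxidised
observed = [1000.50, 1280.5663, 916.3949]

hits = first_pass(observed, database, tol=0.01)
matched = {mass for matches in hits.values() for mass, _ in matches}
unmatched = [m for m in observed if m not in matched]
extra = second_pass(unmatched, database, hits.keys(), tol=0.01)

print("first pass:", hits)
print("second pass:", extra)
```

The oxidised peptide from protB is never found, because protB had no match in the first pass. This mirrors the limitation described above for cases where there is only one peptide per protein.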

Finally, if you are analysing proteins, you should search a peak list containing data for as many peptides as possible, because there are a host of reasons why any one spectrum may fail to give a match:

  • The exact peptide sequence isn’t in the database
  • The peptide is modified in an unexpected way
  • Non-specific enzyme cleavage
  • The precursor m/z or charge is wrong
  • The spectrum is very weak or noisy

If you don’t get any matches at all, you can only resort to changing the search parameters by trial and error, which is time consuming and carries the risk of ending up with a false positive. If you search many spectra, you have a much better chance that some of them match, and the search parameters can be modified systematically, or automatically, in an error tolerant search.