#import "../template.typ": *

= Exploring identification data
//    <keyword>identifications</keyword>
//    <keyword>peptide vs mass spectrum match</keyword>
//    <keyword>PSM</keyword>
//    <keyword>data exploration</keyword>

This chapter describes in detail the steps that the user goes through in
  a data exploration session. The general workflow starts with the
  protein identification results window and then drills down into the details of
  the various identifications listed in it. This latter task entails looking
  at the peptides that provided the protein identification and then at
  the mass spectrum that provided the peptide identification. The MS/MS
  spectrum view has features that help the user form
  an informed opinion on the validity of the peptide #vs; mass spectrum match
  (PSM) at hand. At any time, it is possible to invalidate a PSM; the
  identification results are then recomputed automatically, taking into account the
  modification entered by the user.






== The Protein List Window<sect_protein-list-window>

When identification results files are loaded, #i2mcq; automatically
    performs the protein inference process by using the configuration settings
    described in @sect_configuring-identification-results-loading-parameters.



=== The Protein List Table View<sect1_proteins-list-table-view>

When the protein inference process is finished, #i2mcq; displays the
      protein identifications list in a table view, as pictured in @fig:fig_xtpcpp-proteins-list-window.

#figure(
caption: "The protein list window",
[
#image("../assets/print-xtpcpp-protein-identification-list-window.png")
The protein identifications list window displays the proteins
            assembled into groups. Metadata about the
            identifications are shown in a number of columns, the contents of all
            of which are described in detail in the text.
]
)<fig_xtpcpp-proteins-list-window>

The columns that make up the protein list table view are detailed below:

/ Checked: if checked, the identified protein
            listed on the table row is set to an #quote[accepted] state.
            By default, all proteins are set to this accepted state. Unchecking
            a protein triggers a new run of the protein inference process, because
            disregarding a protein modifies the whole protein identification
            results set;
/ group: the group the protein belongs to;
/ accession: the accession number field of the protein database;
/ description: the description field in the protein database;
/ log(#evalue;): the $log_10$ of the protein #evalue;;
/ #evalue;: the protein #evalue;;
/ spectra: the number of spectra that identified the protein;
/ specific spectra: the number of spectra that
            identified #emph[only] this protein;
/ sequences: the number of peptidic sequences that
            can be assigned to this protein;
/ specific sequences: the number of peptidic
            sequences that can be assigned #emph[only] to this
            protein;
/ coverage: the percentage of the protein sequence
            covered by the peptides that identified it;
/ MW: the molecular weight of the protein M#sub[r];
/ PAI: #quote[Protein abundance index]. This index was defined as the
            #quote[number of peptides identified divided by the number of
            theoretically observable tryptic peptides]. See #linkext("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC186633/","https://www.ncbi.nlm.nih.gov/pmc/articles/PMC186633/");
/ emPAI: #quote[Exponentially modified protein abundance index]. This
            index was defined as $"emPAI" = 10^"PAI" − 1$.  See #linkext("https://pubmed.ncbi.nlm.nih.gov/15958392/","https://pubmed.ncbi.nlm.nih.gov/15958392/").
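
As a minimal sketch of the two abundance indices defined above, the following Python snippet computes PAI and emPAI from two peptide counts. The function names and the example counts are hypothetical, chosen for illustration only; this is not necessarily how #i2mcq; implements the computation.

```python
def pai(observed_peptides: int, observable_peptides: int) -> float:
    """Protein abundance index: number of identified peptides divided by
    the number of theoretically observable tryptic peptides."""
    if observable_peptides <= 0:
        raise ValueError("observable_peptides must be positive")
    return observed_peptides / observable_peptides


def em_pai(pai_value: float) -> float:
    """Exponentially modified PAI: 10^PAI - 1."""
    return 10.0 ** pai_value - 1.0


# Hypothetical counts, for illustration only.
example_pai = pai(observed_peptides=6, observable_peptides=12)
example_em_pai = em_pai(example_pai)
print(f"PAI = {example_pai:.2f}, emPAI = {example_em_pai:.3f}")
```

Note that a protein with no identified peptide has a PAI of 0 and therefore an emPAI of 0 as well.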

It is possible to select the columns that must be displayed in the table
      by checking or unchecking the corresponding item in the
      #guimenuitem[Columns] menu.

The #guimenuitem[Show only] menu allows one to select the kind
      of protein items to be shown:


/ Valid proteins: when checked, the program
            only shows valid proteins, that is, protein identifications that
            fulfill the restriction parameters, such as the protein #evalue;
            threshold. These parameters were set at protein identification results
            loading time but can be modified later;

/ Checked proteins: show only the proteins
            that are checked. This setting is useful when the user has
            unchecked a number of proteins and wants to regularly keep
            an eye on them. When proteins are unchecked, the protein inference
            process is run anew to compute a new grouping that does
            #emph[not] take into account the proteins that were
            disregarded;

/ Grouped proteins: only show the proteins
            that belong to a group.

The protein identifications list table view above shows greyed protein
      identities. These are proteins that, by current filter parameters (#evalue;
      threshold, for example), are considered #emph[not] valid.




=== Operations in the Protein List Window<sect1_operation-in-protein-list-window>

The #guilabel[Protein list] window houses a number of useful
      features that let the user scrutinize the protein identifications and also
      modify the results to suit either more or less stringent filtering
      parameters.
      


#formalpara("Searching data in the table view")[One interesting feature of the #guilabel[Protein list] window
        is the ability to search through the table's contents using the
        #guilabel[Search] item at the bottom of the window. A number
        of fields of the protein record, that is, columns in the table view,
        can be searched.]

#formalpara("Dynamic setting of the filter parameters")[#i2mcq; provides a rather high level of flexibility. Once a set of protein
        identification results files has been loaded and the protein
        inference process has completed, the resulting protein groups are
        displayed in the #guilabel[Protein list] window. At this point,
        the grouping has been performed using the parameters described in @sect_configuring-identification-results-loading-parameters.
        It is nonetheless possible to modify these parameters on the fly using
        the main program window's #guilabel[Filter parameters] tab, as
        pictured in @fig:fig_xtpcpp-main-window-filter-parameters-tab.
        
        
#figure(
caption: "Protein identification filter parameters tab of the main window",
[
#image("../assets/print-xtpcpp-main-window-filter-parameters-tab.png")
The filter parameters in this tab mirror the
                ones that one can set prior to loading protein identification
                results files. When modified, these parameters trigger a complete
                run of the protein inference process.
]
)<fig_xtpcpp-main-window-filter-parameters-tab>
]




#formalpara("Real time update of the false discovery rate")[The false discovery rate (FDR) is recalculated each time the protein
      inference process is run. The data regarding this quality assessment criterion are
      shown in @fig:fig_xtpcpp-main-window-fdr-tab.

#figure(
caption: "False discovery rate (FDR) data after a protein inference process is run",
[
#image("../assets/print-xtpcpp-main-window-fdr-tab.png")
This tab shows the data about the false discovery rate, which is
              computed each time a protein inference process is run. Note that it
              is possible to modify the #guilabel[Decoy settings], after
              which the #guilabel[Apply] button triggers the
              recalculation of the FDR.
]
)<fig_xtpcpp-main-window-fdr-tab>

              
]
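
The text above does not state which estimator the program uses to compute the FDR. As a hedged illustration only, the sketch below applies one common target-decoy estimate, the number of decoy hits divided by the number of target hits, to a list of accepted identifications. The function name and the input layout (one boolean flag per identification) are assumptions made for this example.

```python
def target_decoy_fdr(is_decoy_flags):
    """Estimate the FDR as decoy hits / target hits.

    is_decoy_flags holds one boolean per accepted identification:
    True when the hit comes from the decoy database.
    This is one common estimator, assumed here for illustration.
    """
    decoys = sum(1 for flag in is_decoy_flags if flag)
    targets = len(is_decoy_flags) - decoys
    if targets == 0:
        return 0.0  # no target hit: nothing to estimate
    return decoys / targets


# Hypothetical result set: 100 target hits and 2 decoy hits.
flags = [False] * 100 + [True] * 2
fdr = target_decoy_fdr(flags)
print(f"Estimated FDR: {fdr:.1%}")
```

Because the estimate depends on which identifications are accepted, any change in the filter parameters or in the set of checked proteins changes its value, which is why it is recomputed after each protein inference run.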




#formalpara("Distribution of mass errors on PSMs plotted in a histogram")[It is possible to visualize the distribution of the mass errors over the
      whole dataset, as pictured in @fig:fig_xtpcpp-main-window-mass-precision-tab. The histogram plots
      the number of mass spectra that could achieve a PSM against the mass error
      (mass delta), that is, the difference between the experimental peptide mass
      and the calculated peptide mass.


#figure(
caption: "Mass precision quality assessment",
[
#image("../assets/print-xtpcpp-main-window-mass-precision-tab.png")
The histogram plots the number of PSMs against the mass error
              calculated between the experimental mass of the peptide and the
              calculated mass.
]
)<fig_xtpcpp-main-window-mass-precision-tab>


The mass delta calculation involves only the peptides that successfully
      identified proteins that are currently checked in the protein
      identification list and that satisfy the filter parameters. The proteins
      identified in the decoy database are not processed.


The unit of the mass delta may be selected using the
      #guilabel[Unit] drop-down list. Two units are available: ppm
      (parts per million) or Dalton.]
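
The relation between the two units can be sketched as follows. The snippet computes the mass delta (experimental mass minus calculated mass) in Dalton or in ppm and bins the values into a histogram. The function names, the bin width and the example masses are hypothetical and only serve the illustration.

```python
import math
from collections import Counter


def mass_delta(experimental_mass, calculated_mass, unit="dalton"):
    """Return the mass error in the requested unit."""
    delta = experimental_mass - calculated_mass
    if unit == "dalton":
        return delta
    if unit == "ppm":
        # Relative error, scaled to parts per million.
        return delta / calculated_mass * 1e6
    raise ValueError(f"unknown unit: {unit}")


def histogram(deltas, bin_width):
    """Count the deltas per bin; bin k covers [k * width, (k + 1) * width)."""
    return dict(Counter(math.floor(d / bin_width) for d in deltas))


# Hypothetical (experimental, calculated) peptide masses.
pairs = [(1000.005, 1000.0), (1500.0045, 1500.0), (2000.0, 2000.01)]
deltas_ppm = [mass_delta(exp, calc, "ppm") for exp, calc in pairs]
bins = histogram(deltas_ppm, bin_width=2.0)
print(deltas_ppm, bins)
```

The same absolute error of 0.005 Da corresponds to about 5 ppm for a 1000 Da peptide but only 2.5 ppm for a 2000 Da peptide, which is why the choice of unit changes the shape of the histogram.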




#formalpara("Exporting the final protein identifications list to a spreadsheet")[Once all the proteins in the identifications list have been properly checked,
      the user can export the data set to an OpenDocument Format (ODF) spreadsheet
      file using the #guimenuitem[As ODS file] menu item of the
      main window's #guimenu[Export] menu.]



=== Delving Inside the Protein Identification Data<sect_delving-inside-protein-identification-data>

The protein list table view, as pictured in @fig:fig_xtpcpp-proteins-list-window, is actually an active matrix in
    which the user can easily display the data that yielded
    any protein identification listed in the table. This is simply done by
    clicking any cell of the table in the row matching the protein for
    which scrutiny of the data is desired.

Depending on the column in which the mouse click happens, one of two
    different windows shows up:

- The #guilabel[Protein details] window, showing the sequence
          of the protein, the matching peptides and other informational data
          bits, as pictured below:
  #figure(
caption: "Protein details window",
[
#image("../assets/print-xtpcpp-protein-details-window.png")
When one cell in the #guilabel[Accession],
                  #guilabel[Description] or
                  #guilabel[Coverage] column is clicked, this window
                  shows up and displays the sequence of the protein, the
                  coverage of the peptides and other useful data.
]
)<fig_xtpcpp-protein-details-window>

- When one cell in any one of the remaining columns is clicked, the
          window that shows up is the #guilabel[Peptide list] window
          showing a list of all the peptide identifications, to be described in
          the next section.


#block-tip[When a cell is clicked in a given column and row, the corresponding
      window shows up if one was not already open. If such a window is already
      open, no other window shows up, but the existing window has its data
      updated to match the protein row that was clicked.
      
It is possible to have multiple windows open at a time by clicking a new
      row while holding down the #keycap[Ctrl] key.  
]






== The Peptide List Window<sect_peptide-list-window>

The #guilabel[Peptide list] window displays all the data in a
    table view similar to the one used to display the protein list described in
    the previous sections.


=== The Peptide List Table View<sect1_peptide-list-table-view>

The #guilabel[Peptide list] table view has a large number
      of columns to display all the data about each peptide that identified a
      given protein. These columns are shown in the following figures.


#figure(
caption: "The peptide list window (first columns)",
[
#image("../assets/print-xtpcpp-peptide-identifications-list-window-1.png")
The #guilabel[Peptide list] table view has many columns
            (first columns).
]
)<fig_xtpcpp-peptides-list-window-1>


#figure(
caption: "Peptide list window (last columns)",
[
#image("../assets/print-xtpcpp-peptide-identifications-list-window-2.png")
The #guilabel[Peptide list] table view has many columns
            (last columns).
]
)<fig_xtpcpp-peptides-list-window-2>


The table's contents are described by the column headers, which
      are self-explanatory. When hovering over a column header with the mouse
      cursor, an explanatory tooltip is displayed.


Note that more columns might be part of the table view, depending
      on the protein identification data that were loaded. Indeed,
      the data to be displayed vary with the database search engine
      that was used for the protein
      identification. The whole list of columns
      that might be displayed in the table view is pictured in @fig:fig_xtpcpp-peptide-list-window-all-columns.




#figure(
caption: "Columns that populate the peptide list table view",
[
#image("../assets/print-xtpcpp-peptide-list-window-all-columns.png")
Depending on the origin of the protein identifications (the
              database search engine), the columns that are part of the table view
              differ. This full list is displayed when selecting the
              #guimenuitem[Columns] menu.
]
)<fig_xtpcpp-peptide-list-window-all-columns>  




=== Operations in the Peptide List Window<sect1_operation-in-peptide-list-window>

The #guilabel[Peptide list] window houses a number of
      useful features that let the user scrutinize the peptide details. 

#formalpara("Searching data in the table view")[One interesting feature of the #guilabel[Peptide list] window
        is the ability to search through the table's contents using the
        #guilabel[Search] item at the bottom of the window. A number
        of fields of the peptide record, that is, columns in the table view,
        can be searched.]




#formalpara("Exporting the final protein identifications list to a spreadsheet")[Once all the peptides in the identifications list have been properly checked,
        the user can export the data set to an OpenDocument Format (ODF) spreadsheet
        file using the #guimenuitem[As ODS file] menu item of the
        main window's #guimenu[Export] menu.]





=== Delving Inside the Peptide Identification Data<sect_delving-inside-peptide-identification-data>

The #guilabel[Peptide list] table view, as pictured in @fig:fig_xtpcpp-peptides-list-window-1, is actually an active matrix
      in which the user can easily display the data that
      yielded any peptide identification listed in the table. This is simply
      done by clicking any cell of the table in the row matching the
      peptide for which scrutiny of the data is desired.



==== The Peptide Details Window<sect_peptide-details-window>

When any one of the cells of the peptide list table view is clicked, a
        window shows up that details the various data elements for the peptide
        documented in the table row. The window is pictured in @fig:fig_xtpcpp-peptide-details-window.


#figure(
caption: "Peptide details window",
[
#image("../assets/print-xtpcpp-peptide-details-window.png")
This window displays the MS/MS spectrum that allowed identifying
                a peptide (that is, a PSM). A number of informational data bits
                are displayed, like the MS/MS scan number, the #evalue; for the
                peptide, along with its Hyperscore, for example (see text below
                for a thorough description).
]
)<fig_xtpcpp-peptide-details-window>


In @fig:fig_xtpcpp-peptide-details-window, the two graphs
        show the following:
- The top graph displays the mass spectrum of this PSM. This MS/MS
              spectrum has its recognized peaks in the #bseries; and #yseries;
              ion series labelled in blue and red respectively.  When the mouse
              cursor hovers over a mass peak, the details of that mass peak are
              printed in the status bar of the window (bottom line).
              
  Navigating the spectrum is straightforward: to zoom in or out of a
              given area of the spectrum, point the mouse cursor at the peak of
              interest and rotate the mouse wheel. To modify the
              ordinate intensity scale, click the axis and drag the mouse
              upwards or downwards.
              
- The bottom graph plots, for each matching MS/MS peak (that
              is, each peak of the #bseries; and #yseries; ion series), the mass difference
              (mass delta) between the ion's measured mass and its theoretical
              mass. In this example, we see that the #yseries; ion series is
              only moderately well matched (large error range).
              
  It is possible to set the #guilabel[MS/MS precision] to
              a given value and unit (Dalton, ppm or res).  The value
              entered in the spin box widget modifies the assignment of the
              fragmentation peaks.
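
To illustrate how a precision setting can change the assignment of fragmentation peaks, here is a minimal sketch, not the program's actual algorithm. It computes the singly charged b and y ions of a peptide from standard monoisotopic residue masses and assigns measured peaks that fall within the tolerance. All function names are hypothetical, only the Dalton and ppm units are handled, and the residue table is deliberately restricted to a few amino acids.

```python
# Standard monoisotopic masses (Da); subset of residues for illustration.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "P": 97.05276, "V": 99.06841, "L": 113.08406}
PROTON = 1.007276
WATER = 18.010565


def fragment_ions(sequence):
    """Return {label: m/z} for the singly charged b and y ions."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    ions = {}
    for i in range(1, len(masses)):
        ions[f"b{i}+"] = sum(masses[:i]) + PROTON
        ions[f"y{i}+"] = sum(masses[-i:]) + WATER + PROTON
    return ions


def within_tolerance(measured, theoretical, precision, unit):
    """Tell whether the mass delta fits the requested precision."""
    delta = abs(measured - theoretical)
    if unit == "dalton":
        return delta <= precision
    if unit == "ppm":
        return delta / theoretical * 1e6 <= precision
    raise ValueError(f"unsupported unit: {unit}")


def annotate(peaks_mz, sequence, precision, unit):
    """Return (label, measured m/z, mass delta) for each assigned peak."""
    matches = []
    for label, theoretical in fragment_ions(sequence).items():
        for mz in peaks_mz:
            if within_tolerance(mz, theoretical, precision, unit):
                matches.append((label, mz, mz - theoretical))
    return matches


# Hypothetical peak list for the dipeptide GA.
peaks = [58.03, 90.06, 300.0]
loose = annotate(peaks, "GA", 0.02, "dalton")
strict = annotate(peaks, "GA", 10.0, "ppm")
print([m[0] for m in loose], [m[0] for m in strict])
```

With a tolerance of 0.02 Da, both the b1 and y1 peaks are assigned; with 10 ppm, neither is, because their mass deltas are about 22 and 56 ppm respectively.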



#block-tip[The MS/MS spectrum mass peaks are annotated using the following
          naming convention:

       //   <!--b     = 0, ///< Nter acylium ions-->
      //    <!--bstar = 1, ///< Nter acylium ions + NH3 loss-->
      //    <!--bo    = 2, ///< Nter acylium ions + H2O loss-->
      //    <!--a     = 3, ///< Nter aldimine ions-->
       //   <!--astar = 4, ///< Nter aldimine ions + NH3 loss-->
       //   <!--ao    = 5, ///< Nter aldimine ions + H2O loss-->
       //   <!--bp    = 6,-->
      //    <!--c     = 7,  ///< Nter amino ions-->
      //    <!--y     = 8,  ///< Cter amino ions-->
      //    <!--ystar = 9,  ///< Cter amino ions + NH3 loss-->
       //   <!--yo    = 10, ///< Cter amino ions + H2O loss-->
       //   <!--z     = 11, ///< Cter carbocations-->
      //    <!--yp    = 12,-->
      //    <!--x     = 13 ///< Cter acylium ions-->
/ \*: neutral NH#sub[3] loss;
/ o: neutral H#sub[2]O loss;

The ion charge is displayed in the form of #quote[+] or #quote[++]
          text strings.]<ms-ms-peak-annotation-convention>




The right hand side margin of the window displays data about
        the PSM, such as the peptide #evalue;, the HyperScore, the ion charge, the
        theoretical and experimental masses, the difference between the two, the
        retention time at which this ion was detected#ellipsis; These
        informational data bits are self-explanatory.

The #guibutton[XIC] button at the top right corner of the
        window triggers the calculation of the extracted ion current
        chromatogram, as described in the section below.

        
        


==== The XIC Viewer Window for the Peptide Details<sect_xic-viewer-window-for-peptide-details>

One interesting feature of the #guilabel[Peptide details]
        window is the #guilabel[XIC] button (top right) that triggers
        the calculation of an extracted ion current chromatogram, as pictured in
        @fig:fig_xtpcpp-xic-viewer-for-peptide-details.


#block-tip(title: "What is a XIC chromatogram?")[


The notion of #emph[extracted ion current] chromatogram
          is best explained by describing the computation that yields that
          chromatogram. 


The user defines the #mz; value for which the chromatogram is to be
          determined. The program iterates over each MS spectrum (that is, full
          scan spectrum) and checks whether an ion of that #mz; value is present.
          If so, a variable holding the cumulated intensity of that ion is
          incremented for the retention time at which the mass spectrum was
          acquired. For example, if #mz; value 1254.25 is searched for, and an
          ion of that #mz; value is found in the mass spectrum acquired at
          retention time 2.5 min, then a tuple is stored like this:
          (2.5, intensity).  If another mass peak of that #mz; value is then found
          in the mass spectrum acquired at retention time 47 min, another
          tuple is created: (47, intensity). 



If the data are from ion mobility#sym.dash.em;mass spectrometry (IM-MS)
          experiments, there might be a large number of spectra acquired at a
          given retention time. For example, data from the #productname-tm[Waters Synapt2] instrument have 200 spectra
          acquired for any given retention time value (the spectra are
          drift-related spectra). In #productname-tm[Bruker
          timsTOF] data, there are more than 700 spectra acquired
          at any given retention time.  Thus, the searched #mz; value might be
          found more than once for a retention time value.  In this case, the
          tuple's intensity value is incremented by the intensity of each new
          peak of that #mz; value at that specific retention time value.


When the program has finished iterating over all the mass spectra of the
acquisition, it plots the XIC chromatogram as $"intensity"=f("retention time")$. This is the reason why it is considered a chromatogram.
]
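
The computation described in the tip above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program's implementation: the function name, the input layout (a list of retention time and peak list pairs) and the `tolerance` parameter are hypothetical. A tolerance is introduced because measured #mz; values never match the searched value exactly.

```python
from collections import defaultdict


def extract_xic(spectra, target_mz, tolerance):
    """Build an XIC as a sorted list of (retention_time, intensity) tuples.

    spectra: iterable of (retention_time, peaks) pairs for MS (full scan)
    spectra, where peaks is a list of (mz, intensity) tuples.
    """
    xic = defaultdict(float)
    for retention_time, peaks in spectra:
        for mz, intensity in peaks:
            if abs(mz - target_mz) <= tolerance:
                # Several spectra may share one retention time (IM-MS data):
                # their intensities are cumulated in the same tuple.
                xic[retention_time] += intensity
    return sorted(xic.items())


# Hypothetical acquisition: two spectra share retention time 2.5 min.
spectra = [
    (2.5, [(1254.25, 100.0), (800.0, 50.0)]),
    (2.5, [(1254.26, 40.0)]),
    (10.0, [(900.0, 10.0)]),
    (47.0, [(1254.24, 75.0)]),
]
xic = extract_xic(spectra, target_mz=1254.25, tolerance=0.02)
print(xic)
```

In this sketch, retention times at which the ion was not found simply produce no tuple, as in the description above; a plotting routine would treat them as zero intensity.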


#include "a5b10_xic_viewer.typ"



The #guilabel[XIC viewer] window displays the #quote[guts] of the MS spectrum of the precursor ion that was fragmented and that yielded a PSM. The XIC chromatogram (left plot panel) is actually a set of XIC chromatograms that are superimposed in the plot widget (see @fig:fig_xtpcpp-xic-viewer-for-peptide-details-zoomed-view). One of the traces (legend #guilabel[+0]) is for the first peak of the isotopic cluster of the searched ion; the second trace (legend #guilabel[+1]) is for the second peak of the isotopic cluster.
Likewise for the third trace.  In the typical informatics-oriented style of numbering, the first isotopic peak (only light isotopes enter in the composition of the peptidic ion) is #quote[isotope 0]; the second isotopic peak (one light isotope is substituted with a heavy one) is #quote[isotope 1] and, finally, the third isotopic peak (two light isotopes were replaced by heavy ones) is #quote[isotope 2].



The right panel is a bar plot showing, for each one of the isotopes, a comparison between the experimental peak area and the
        computed probability of the corresponding isotope peak. In the example,
        the match between the experimental and the theoretical cluster shape is
        perfect. This scrutiny of the data is very useful when one wants
        to double-check the quality of a protein identification on the basis of
        a given PSM.

#block-note[
The theoretical isotopic cluster peaks are calculated using the
          formula of the peptide that has been identified in the PSM for which
          the XIC chromatogram is being requested.
]


#include "a5b11_xic_viewer_zoomed.typ"



Another interesting bit of information is the #guilabel[Fraction of
        Isotopic distribution] (spin box widget in the top right corner of
        the window). This setting needs some background. Given a peptide
        formula and the peptidic ion charge, one can calculate the theoretical
        isotopic cluster corresponding to that specific ion. The calculation is
        CPU-intensive and sometimes one would like to limit its duration. This
        is possible by indicating that one is interested, for example, in only
        80~% of the total isotopic distribution that one would effectively find
        (even in minute amounts) in nature. This is exactly what the value specifies. The
        calculation displayed in the window, encompassing only 80~% of the whole
        natural span of the isotopic cluster, yields a calculated cluster made
        of only three isotopic peaks. If the user had set the value to 99~%,
        then, most probably, numerous other isotopic peaks of very low intensity
        would have been calculated on the right hand side of the isotopic
        cluster (heavier ions because more heavy isotopes are included in the
        computation).
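
The effect of this setting can be illustrated with a small sketch. It assumes that the isotopic peaks are accumulated in order of increasing mass until their cumulated probability reaches the requested fraction; the program may proceed differently. The function name and the probability values are hypothetical, and a real calculation would derive the probabilities from the peptide formula.

```python
def truncate_isotopic_cluster(probabilities, fraction):
    """Keep the leading isotopic peaks until their cumulated probability
    reaches the requested fraction of the whole distribution.

    probabilities: peak probabilities ordered as isotope 0, 1, 2, ...
    fraction: value in ]0, 1], for example 0.80 for 80 %.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in ]0, 1]")
    kept = []
    cumulated = 0.0
    for probability in probabilities:
        kept.append(probability)
        cumulated += probability
        if cumulated >= fraction:
            break
    return kept


# Hypothetical distribution for isotopes 0 to 6 (sums to 1).
distribution = [0.45, 0.32, 0.14, 0.05, 0.025, 0.01, 0.005]
cluster_80 = truncate_isotopic_cluster(distribution, 0.80)
cluster_99 = truncate_isotopic_cluster(distribution, 0.99)
print(len(cluster_80), len(cluster_99))
```

With these example values, a fraction of 80 % keeps three peaks whereas 99 % keeps six, the additional ones being the low-intensity peaks on the heavy side of the cluster.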


Remember to click the #quote[#guibutton[Histogram plot]] button next to the spin box widget for the new #guilabel[Fraction of Isotopic distribution] value to take effect.


As for the previous MS/MS spectrum plot, to zoom in/out regions of the
        XIC chromatogram plot widget, hover the mouse cursor over the region of
        interest and rotate the mouse wheel.

        

== Handling Phospho-Proteomics Data<sect_handling-phospho-proteomics-data>


#i2mcq; is able to cope with phospho-peptides. The mass spectrometric data
    are acquired exactly as usual with the mass spectrometer, but the sample
    preparation follows these steps:


- Separate digestion of the samples (when there is more than one);
- Labeling of the peptides, with each sample getting a different label;
- Pooling of the whole set of peptides into a single mixture;
- Separation of the peptides on a strong cation exchange (SCX) resin and
          collection of the fractions;
- Phospho-peptide enrichment using IMAC#footnote[Immobilized-metal affinity chromatography.]
          for each SCX fraction. The SCX fraction is loaded onto the IMAC resin
          and, following a wash step, the phospho-peptides are eluted (pH-based
          elution). There is thus a one-to-one relation between an SCX fraction
          and an IMAC-based purification fraction;
- Mass spectrometric analysis of each IMAC-based phospho-peptide-enriched
          fraction.


#xtandem; needs to be configured in such a manner that it can generate all
    the theoretical peptides (and fragments) that might bear the phosphoryl
    group. This process is described in the section below.
