#import "../template.typ": *

#counter(heading).update(0)

= Generalities

This chapter introduces some general concepts around the
  #i2mcq program, the reference to use when citing the software in
  publications, and the building and installation procedures.

== History of the project

#i2mcq is the successor of the #xtpjava project, which has undergone the following changes over the years:

- Full rewrite of the #xtpjava program from Java to #box[C++17]. The
          Java-based software program was published in #cite(<langella_xtandempipeline:_2017>, form:"full")
          #block-tip()[Before the integrations described below, the product of the
              rewrite was temporarily called #xtpcpp (or
              #application[xtpcpp]). That name might still appear in some
              places while the code and documentation are being revised to change
                its name to #i2mcq.
                ]

- Integration into the new software of the #mcq software project that
          was developed as a standalone #box[C++] software piece. #mcq is a software
          project that was developed to perform quantitative proteomics in a
          variety of modes (label-free or with labelling).

- Integration, not yet finalized, of the #mcqr project that was developed as a standalone project. #mcqr is a GNU~R project aimed at performing bio-statistical analyses on the quantification results produced by #mcq.

The #i2mcq project encompasses three main quantitative proteomics 
    fields of endeavour:

- Database search, peptide identification and protein inference. The
          database search is actually performed by #xtandem and is started
          seamlessly by #i2mcq. Protein grouping is performed by original code
            in #i2mcq.
- Quantitative proteomics, mainly based on area-under-the-curve
          processes (requires the full mass data set to extract ion current
          chromatograms, XIC).  This part was historically performed by the
            #mcq software program.

- Bio-statistical analysis of the quantification data. This part was
          historically performed by the #mcqr GNU~R-based package
          (software not yet published).



== What does #i2mcq stand for?

The #i2mcq software project aims at providing users with an integrated software solution for quantitative proteomics. As described in detail in another chapter of this book, quantitative proteomics involves a number of steps that are listed in sequence below:

- Search databases to connect MS/MS spectra to peptide sequences. This step is called _identification_;
- Apply logic to reliably identify proteins based on the peptides identified at the previous step. This step is called _inference_;
- Optionally perform quantification of the #emph[i]dentified peptides and #emph[i]nferred proteins. #i2mcq has area-under-the-curve quantitative proteomics capabilities that are based on precursor peptide ion current extraction from the mass spectrometric data. The extracted ion currents are then plotted as chromatograms: intensity as a function of retention time. This analytical process thus involves #quote[#emph[Mass Chro]matograms] for the #emph[Q]uantification.

From the sequence above, the #i2mcq name becomes self-explanatory!

#block-tip[It is however possible (and encouraged) to mentally read #i2mcq as #quote[#emph[I too MassChroQ !]]]
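
The area-under-the-curve principle mentioned above can be illustrated with a minimal Python sketch. It is not the code used by #i2mcq: the function names, the `tol_ppm` tolerance parameter and the in-memory spectrum layout are assumptions made only for this illustration. The sketch sums, for each MS spectrum, the intensities found around a precursor m/z value, which yields the extracted ion current as a function of retention time, and then integrates that trace with the trapezoidal rule.

```python
from typing import List, Tuple

# One MS spectrum: (retention time in seconds, list of (m/z, intensity) peaks).
# This layout is an assumption made for the example.
Spectrum = Tuple[float, List[Tuple[float, float]]]


def extract_xic(spectra: List[Spectrum], target_mz: float,
                tol_ppm: float = 10.0) -> List[Tuple[float, float]]:
    """Sum, for each MS spectrum, the intensities found around target_mz."""
    tol = target_mz * tol_ppm / 1e6  # absolute m/z tolerance
    xic = []
    for rt, peaks in spectra:
        current = sum(i for mz, i in peaks if abs(mz - target_mz) <= tol)
        xic.append((rt, current))
    return xic


def area_under_curve(xic: List[Tuple[float, float]]) -> float:
    """Integrate intensity over retention time with the trapezoidal rule."""
    area = 0.0
    for (rt0, i0), (rt1, i1) in zip(xic, xic[1:]):
        area += 0.5 * (i0 + i1) * (rt1 - rt0)
    return area
```

Calling `area_under_curve(extract_xic(spectra, 500.0))` on a list of spectra would give one quantitative value for the precursor ion at m/z 500. Real quantification additionally requires peak detection and retention time alignment, which this sketch leaves out.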


== Transitioning from #xtpcpp to #i2mcq

The previous #xtpcpp version of this software stored its configuration data
    in the #filename[PAPPSO/xtpcpp.conf] file of the local configuration
    directory. To preserve these configuration data after the transition from
    #xtpcpp to #i2mcq, rename that configuration file to
    #filename[PAPPSO/i2masschroq.conf].
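
The renaming can be done by hand in a file manager or scripted. The Python sketch below is one possible way to do it; the `migrate_config` function name is hypothetical, and the location of the local configuration directory is an assumption that depends on the platform, so it is passed as a parameter.

```python
from pathlib import Path


def migrate_config(config_dir: Path) -> bool:
    """Rename PAPPSO/xtpcpp.conf to PAPPSO/i2masschroq.conf under config_dir.

    Returns True if the file was renamed, False if nothing was done.
    """
    old = config_dir / "PAPPSO" / "xtpcpp.conf"
    new = config_dir / "PAPPSO" / "i2masschroq.conf"
    if not old.is_file() or new.exists():
        return False  # nothing to migrate, or never overwrite an existing file
    old.rename(new)
    return True
```

On a GNU/Linux system the call might look like `migrate_config(Path.home() / ".config")`, assuming that the configuration directory is `~/.config`; check the actual location on your system before running it.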
    

== General concepts and terminologies

This section describes the general concepts at the basis of the analysis of
    proteomics data that one needs to grok in order to properly assimilate the
    workings of the #i2mcq software.

=== Bottom-up Proteomics or Top-down Proteomics?

Proteomics is a mass spectrometry-based field of endeavour that is aimed at
      characterizing the #quote[protein complement] of a given genome. The
      protein complement of a genome is the set of proteins that are expressed at a
      given instant in the life of a cell, a tissue or an organ, for example.
      Characterizing that protein complement actually means identifying the
      proteins expressed by a given living cell or tissue or organ. Optionally, if
      feasible, the characterization of post-translational modifications might be
      desirable.

There are two main variants of proteomics: #quote[bottom-up]
    proteomics and #quote[top-down] proteomics:

- The first variant—bottom-up proteomics—identifies proteins on the
          basis of the identification of all the peptides obtained by first
          digesting all the proteins of the sample using an enzyme of known
          specificity. In this variant, the sample that is injected in the
          mass spectrometer is the resulting peptide mixture (first resolved
          by high performance liquid chromatography).  The identification of
          the proteins contained in the initial sample is performed in a
          number of steps that are actually the focus of #i2mcq. Indeed the
            #i2mcq software is a bottom-up-oriented software program.

- The second variant—top-down proteomics—identifies proteins on the
          basis of intact proteins
          directly injected in the mass spectrometer. Of course, it might be
          necessary to fragment the proteins in the mass spectrometer and to use
          the fragments to actually identify the protein. However, the fact that
          the protein is first detected and analyzed as one entity (and not as a
          set of peptides) allows for some very useful discoveries, like the
          identity and number of post-translational modifications, for example.

#block-note[At the moment, #i2mcq does not handle top-down proteomics data: it
        is a bottom-up proteomics software project.]

        


=== Typical cycle of a mass spectrometer data acquisition

Once the initial sample, containing all the proteins to identify, has
    been digested using a protease of known cleavage specificity (trypsin,
    typically), the peptidic mixture (that might be highly complex) needs to
    be resolved as much as possible using chromatography. In the vast
    majority of the proteomics experimental settings, the chromatography
    setup is connected to the mass spectrometer so that when the gradient is
    developed, all the peptides are immediately injected #quote[on line] into the mass spectrometer ion source.


    The mass spectrometer runs an analysis cycle that can be summarized as
    follows:

- Acquire a full scan mass spectrum of the whole set of ions at a given
          chromatography retention time. This kind of mass spectrum is
          called an MS spectrum;
          
- Enter a loop during which ions having the most intense signal are
          subjected in turn to collision-induced dissociation (CID), that
          is, are fragmented by accelerating them against gas molecules in a
          fragmentation cell. The mass spectra that are collected at each
          one of these fragmentation acquisitions are called MS/MS spectra
          because they are obtained after two mass analysis events: the
          first event is the measurement of the intact peptide ion's m/z
          value (full scan mass spectrum) and the second event is the
          measurement of all the obtained fragments' m/z values (MS/MS
          scan).
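
The cycle described above can be sketched in a few lines of Python. This is only an illustration of the principle and not the control logic of any actual instrument: the `top_n` parameter and the function names are assumptions, and the additional selection rules that real instruments may apply are ignored.

```python
from typing import List, Optional, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def select_precursors(ms_peaks: List[Peak], top_n: int = 3) -> List[float]:
    """Return the m/z values of the top_n most intense ions of an MS spectrum."""
    ranked = sorted(ms_peaks, key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in ranked[:top_n]]


def acquisition_cycle(ms_peaks: List[Peak],
                      top_n: int = 3) -> List[Tuple[str, Optional[float]]]:
    """List the scan events of one cycle: one MS scan, then the MS/MS scans."""
    events: List[Tuple[str, Optional[float]]] = [("MS", None)]
    for precursor_mz in select_precursors(ms_peaks, top_n):
        events.append(("MS/MS", precursor_mz))
    return events
```

Each `("MS/MS", precursor_mz)` event stands for the fragmentation of one precursor ion and the recording of the m/z values of its fragments.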
          
          
Each instrument records all the MS and MS/MS spectra in a raw data
    format file that is specific to the vendor. Free Software developers
    cannot know the internal structure of these files. To use the mass
    spectrometric data, they need to rely on specific software that
    performs the conversion from the raw data format to an open data format
  (mzML). That program is called #application[msconvert],
  from the #productname[ProteoWizard] project.
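
As an illustration, the conversion can be scripted. The Python sketch below wraps a typical #application[msconvert] invocation that uses its `--mzML` and `-o` (output directory) options; the function names and the file names are hypothetical, and the options best suited to a given data set should be checked against the #productname[ProteoWizard] documentation.

```python
import shutil
import subprocess
from typing import List


def build_msconvert_command(raw_file: str, out_dir: str) -> List[str]:
    """Build the msconvert command line that converts raw_file to mzML."""
    return ["msconvert", raw_file, "--mzML", "-o", out_dir]


def convert_to_mzml(raw_file: str, out_dir: str) -> bool:
    """Run msconvert if it is found on the PATH; return True on success."""
    if shutil.which("msconvert") is None:
        return False  # msconvert is not installed or not on the PATH
    result = subprocess.run(build_msconvert_command(raw_file, out_dir))
    return result.returncode == 0
```

For example, `convert_to_mzml("run1.raw", "converted")` would try to write the mzML version of the hypothetical `run1.raw` file into the `converted` directory.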

  
#block-note[Mass spectrometrists used to call ions that were analyzed in full scan
    mass spectra #quote[parent ions]. They also used to call fragment
    ions arising upon fragmentation of a parent ion #quote[daughter
    ions]. This terminology has been deprecated and has been replaced
    with #quote[precursor ion] and #quote[product ion],
    respectively. In our document, we thus use the new terminology.]


=== Outline of an #i2mcq working session

#i2mcq loads mzXML- and mzML-formatted files and needs access to all
    the MS and MS/MS spectra for its operations. Once data
    files have been loaded, #i2mcq allows the user to perform the
    following tasks, which will be detailed in later chapters:

- Configure the #xtandem database searching software (that is, the
          software, external to #i2mcq, that actually performs the
          peptide-mass spectrum matches);
- Run the #xtandem software and load its results;
- Display the results to the user in a way that they can be
          scrutinized and checked. The peptide identification results
          serve as the basis for another processing step that is
          entirely performed by #i2mcq: the #quote[protein
          inference]. That step aims at using the peptide
          identifications to actually craft a list of protein identities.
          The user is provided with various means to control that step.

- Optionally start the #mcq module to perform the quantitative
          proteomics on the identification data checked at the previous step.

- Optionally start the #mcqr module to perform the bio-statistical
          analysis of the quantitative proteomics data obtained at the
          previous step.

     


== Citing the #i2mcq software

Please cite the latest article:

#cite(<langella_full_2024>, form:"full")


The former citation was:

#cite(<langella_xtandempipeline:_2017>, form:"full")



== Installation of the software

The installation material is available at #linkext("http://pappso.inrae.fr/en/bioinfo/xtandempipeline/download/", "http://pappso.inrae.fr/en/bioinfo/xtandempipeline/download/").

=== Installation on MS~Windows and macOS systems

The installation of the software is extremely easy on the MS~Windows and
        macOS platforms. In both cases, the installation programs are standard and
        require no explanation.


=== Installation on Debian- and Ubuntu-based systems

The installation on Debian- and Ubuntu-based GNU/Linux platforms is
        also extremely easy (even easier than in the above situations).
        #i2mcq is indeed packaged and released in the official
        repositories of these distributions and the only command
        to run to install it is:

        
#code-prompt("sudo apt install <package_name>")

In the command above, #emph[package_name] is typically
    #filename[i2masschroq] for the program package and
    #filename[i2masschroq-doc] for the user manual package.

Once the package has been installed, the program shows up in the
    #emph[Science] menu. It can also be launched from the shell
    using the following command:


#code-prompt("i2masschroq")


