#import "../template.typ": *

= Specific procedures for the timsTOF line of instruments <chap_bruker-timstof-data-handling>

//    <keyword>Bruker</keyword>
//    <keyword>timsTOF</keyword>
//    <keyword>ion mobility</keyword>

This chapter describes the few specific procedures to carry out when the
  proteomics data at hand originate from the Bruker timsTOF line of instruments.


== General considerations <sect_timstof-general-considerations>

The mass spectrometers from the timsTOF line of instruments by Bruker
    implement ion mobility mass spectrometry by trapping ions and subjecting
    them to a gas flow that moves them in the trap according to their
    collisional cross section.  Interestingly, the outcome of that operation is
    the reverse of conventional drift tube-based ion mobility: large ions are
    released first and smaller ions are released last.

Whatever the elution order, ions entering the instrument are separated
    along a new dimension, the ion mobility dimension, that is orthogonal to
    the two other dimensions: retention time and #mz; ratio. This new
    dimension inevitably adds complexity and volume to the mass data.


In a historic move, Bruker has decided to publish the technical details of
    their data format. During the acquisition, data are stored in two separate
    files located in a data directory that bears the #filename_extension[.d] extension:

/ analysis.tdf: this file is a
          #application[SQLite3] relational database that contains
          all the metadata about the acquisition. The generic
          #emph[metadata] term defines data that describe the
          actual data. So this file contains data that explain how the data are
          organized in the actual data file below.

/ analysis.tdf_bin: this file is a binary-format
          file that holds the data in the form of a succession of numbers packed
          according to a specific scheme that Bruker has decided to make public.
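
The two-file layout described above can be verified before any processing is attempted. The following is a minimal Python sketch, not part of #i2mcq;: the function names are hypothetical and the check relies only on the two file names and the #filename_extension[.d] extension mentioned in the text.

```python
from pathlib import Path

# File names documented above for a Bruker timsTOF acquisition directory.
REQUIRED_FILES = ("analysis.tdf", "analysis.tdf_bin")


def missing_tdf_files(data_dir):
    """Return the names of the required files that are absent from data_dir."""
    directory = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]


def looks_like_tdf_directory(data_dir):
    """Tell whether data_dir is a .d directory holding both TDF files."""
    directory = Path(data_dir)
    return (
        directory.is_dir()
        and directory.suffix == ".d"
        and not missing_tdf_files(directory)
    )
```

Calling `missing_tdf_files` first gives a more helpful diagnostic than a plain yes/no answer, because it names the file that is absent.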

The #application[SQLite3] #filename[analysis.tdf]
    file contains a set of tables.  Most often, records in one table make
    reference (#emph[relate]) to other record(s) in other
    table(s). This is why this database file is said to be a
    #emph[relational database]. A view of the tables making up that
    relational database file is shown in @fig:fig_xtpcpp-sqlitebrowser-on-bruker-tdf-file.


#figure(
caption: [View of the relational database file],
[
#image("../assets/print-xtpcpp-sqlitebrowser-on-bruker-tdf-file.png")
The mass spectrometers of the timsTOF line of instruments by Bruker
            produce mass data that are stored in two files. This figure shows
            the table structure of the relational database
            #filename[analysis.tdf] file displayed in
            #application[SqliteBrowser] (a Free Software
            application).
]
)<fig_xtpcpp-sqlitebrowser-on-bruker-tdf-file>
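
The table listing shown in the figure can also be obtained without a graphical tool. The sketch below uses the `sqlite3` module of the Python standard library; the function name is hypothetical, and the query relies only on the `sqlite_master` catalog that every SQLite database provides, not on any Bruker-specific table name.

```python
import sqlite3
from pathlib import Path


def list_tdf_tables(tdf_path):
    """Return the sorted names of the tables stored in an SQLite database file."""
    # Open the file read-only so that the acquisition metadata
    # cannot be modified by accident.
    uri = Path(tdf_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

A call such as `list_tdf_tables("/full_path_to/sample.d/analysis.tdf")` (hypothetical path) returns the table names as a list of strings.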
 

When dealing with proteomics projects whose data originate from
    timsTOF instruments, some specific steps must be taken to inform
    #i2mcq; that specific handling is required. These steps are reviewed below
    in the order in which they are carried out when running #i2mcq;.



=== Running #xtandem; identifications with Bruker timsTOF data <sect_running-xtandem-identifications-bruker-timstof-data>

It is possible to load Bruker timsTOF data directly in the #i2mcq; program's
      graphical user interface, as shown in the window pictured in @fig:fig_xtpcpp-xtandem-configuration-window. The very first
      specific step to take in this case is to select the data files by clicking
      the #guibutton[Add Bruker timsTOF folders] button.

The #guibutton[Add Bruker timsTOF folders] button lets the user
      choose the Bruker data directory (#filename_extension[.d] extension) and then asks if the data to be
      loaded are in the TDF or MGF format (see @fig:fig_xtpcpp-bruker-data-format-selection-mgf-or-tdf).

This choice is offered because the MGF file generated by the Bruker software
      is automatically stored in the #filename_extension[.d] data
      directory.


#figure(
caption: [File format selection dialog for Bruker timsTOF data],
[
#image("../assets/print-xtpcpp-bruker-data-format-selection-mgf-or-tdf.png")
When handling Bruker timsTOF data, two file formats are available:
            MGF and TDF.  See text for details.
]
)<fig_xtpcpp-bruker-data-format-selection-mgf-or-tdf>


The #application[DataAnalysis] software from Bruker allows
      one to export proteomics MS/MS data into MGF format files (Mascot generic
      format). The native data format, though, is the TDF format. It is
      important to keep in mind that the MGF format only stores MS/MS spectral
      data, no MS data. When this format is used, #i2mcq; and #mcq; are not able
      to access MS data, which are required in a number of situations, for
      example when extracting ion currents for given #mz; ratios or when
      performing area-under-the-curve quantitative proteomics. It is thus
      recommended to use the native TDF format whenever it is available.
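
To make the MS/MS-only nature of the MGF format concrete: an MGF file is a plain-text list of fragmentation spectra, each enclosed between a `BEGIN IONS` line and an `END IONS` line. The following Python sketch (hypothetical function name, not part of #i2mcq;) counts these blocks; every spectrum it finds is an MS/MS spectrum, since the format has no provision for MS survey scans.

```python
def count_mgf_spectra(mgf_path):
    """Count the MS/MS spectra of an MGF file by counting its BEGIN IONS lines."""
    count = 0
    with open(mgf_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # Each spectrum block opens with a line holding only BEGIN IONS.
            if line.strip() == "BEGIN IONS":
                count += 1
    return count
```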

When loading Bruker timsTOF data right into #i2mcq; as described above,
      the software performs some under-the-hood operations that the user might
      want to be aware of. The hidden operations are unveiled in the following
      sections, as they involve command line programs shipped along with
      #i2mcq; that might be of interest to the user.




=== Converting Bruker timsTOF data to mzXML with mzxmlconverter <sect_mzxml-conversion-tdf-to-mzxml>

#xtandem; needs the mass spectrometric data that it uses for the database
      searches to be in the mzXML format.  For this very reason, #i2mcq; cannot
      hand Bruker timsTOF data over to #xtandem; as they are: these data need to
      be converted to an mzXML file before they can be fed to #xtandem;.

In order to store an mzXML file on disk, the user may convert
      timsTOF data files (TDF or MGF) to mzXML using the
      #application[mzxmlconverter] program that is shipped along
      with #i2mcq;. This is a command line program that takes a data
      file as input and writes an mzXML file as output. The command line
      syntax is simple, as pictured in @fig:fig_xtpcpp-mzxmlconverter-usage-message.

To obtain help about the program, run the following:

#code-prompt("<path_to>/mzxmlconverter --help ")
 // <footnote><para>The prompt character might be
    //  <keysym>%</keysym> in some shells, like
  //    #application[zsh].</para></footnote> 


#figure(
caption: [Converting mass data files to mzXML data files],
[
#image("../assets/print-xtpcpp-mzxmlconverter-usage-message.png")
Files of any format handled by
      #application[ProteoWizard] or files from Bruker's timsTOF
      line of instruments can be converted to mzXML using
      #application[mzxmlconverter]. Conversion from timsTOF format
      to mzXML is performed entirely by our own software.
]
)<fig_xtpcpp-mzxmlconverter-usage-message>




#block-warning(title: "The mzXML file format does not contain mobility data from the timsTOF data files")[
It is important to grasp that the mzXML file that is generated by
      #application[mzxmlconverter] does not contain all the mass
      data that are contained in Bruker's timsTOF data files. When loading
      the produced mzXML format files in #i2mcq;, the #xtandem; program will be
      able to perform peptide and protein identifications. Later, however, when
      the user tries to activate features in #i2mcq; that require the
      original data, such as XIC reports or ion mobility values, the software
      will not be able to provide the expected results because it won't have
      access to the original data.]
      
      
The #application[mzxmlconverter] program is practical because
    it allows storing the mzXML file on disk and loading it in #i2mcq; for
    #xtandem; to consume it for the identifications. However, as stated in the
    warning above, that mzXML file does not hold a full copy of the data found
    in the original mass spectrometry data file (be that an mzML, MGF, or TDF
    file). #i2mcq; has a solution for this problem: an integrated workflow that
    converts the original data file to mzXML, makes #xtandem; use it, writes out
    the #xtandem; results file and finally rewrites that file into a new version
    by adding a connection between the #xtandem; run results and the original
    mass data file. In this way, when the user activates features in
    #i2mcq; that need access to the original mass data file, the expected
    results are displayed. This process is described in detail
    in the next section.


=== Data conversion process with Bruker timsTOF TDF data and tandemwrapper <sect_tandemwrapper-conversion-tdf-to-mzxml>

In order to seamlessly use Bruker timsTOF data in the context of performing
  #xtandem; identifications and later #mcq;-based quantifications, the
  #application[tandemwrapper] program is made available to users
  who want to perform database searches using the command line interface. The
  #application[tandemwrapper] program performs an under-the-hood
  file format conversion as described in the previous section before
  automatically feeding the generated mzXML file to #xtandem;. After #xtandem;
  has produced its results file, that file is rewritten by
  #application[tandemwrapper] in such a manner that the connection
  with the original mass spectrometry data files is reinstated for further use
  in the #i2mcq; graphical interface.


The #application[tandemwrapper] program is a standalone command line
  program that is shipped along with #i2mcq;. It takes as input an XML
  configuration file that is very similar to the configuration file that
  #xtandem; uses.


To obtain help about the program, run the following:



#code-prompt("<path_to>/tandemwrapper --help ")


A typical #application[tandemwrapper] input configuration file is shown below:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<bioml label="example-tandemwrapper-mass-data-file.mzxml">
    <note type="heading">Paths</note>
    <note type="input" label="list path, default parameters">/full_path_to/xtandem-presets-file.xml</note>
    <note type="input" label="list path, taxonomy information">/full_path_to/database.xml</note>
    <note type="input" label="spectrum, path">/full_path_to/mass-data-file.mzxml</note>
    <note type="heading">Protein general</note>
    <note type="input" label="protein, taxon">usedefined</note>
    <note type="heading">Output</note>
    <note type="input" label="output, path">/full_path_to/tandemwrapper-output.xml</note>
</bioml>
```

The XML configuration file that is provided to
  #application[tandemwrapper] on the command line is a replicate
  of the file that #xtandem; itself expects.  That file is shown in @fig:fig_xtpcpp-tandemwrapper-input-config-file.


#figure(
caption: [Configuration file for tandemwrapper],
[
#image("../assets/print-xtpcpp-tandemwrapper-input-config-file.png")
The #application[tandemwrapper] program takes as input a
        configuration file that is most similar to the configuration file that
        is fed to #xtandem;.
]
)<fig_xtpcpp-tandemwrapper-input-config-file>


The following elements need an explanation:

/ default parameters: In the example, the
        #filename[/full_path_to/xtandem-presets-file.xml] file is the #xtandem;
        presets file, already discussed in @sect_xtandem-parameter-presets.
/ taxonomy information: In the example, the
        #filename[/full_path_to/database.xml] file is the file that
        configures the location of the FASTA protein database files that are
        searched by #xtandem;. This file is described below.
/ spectrum, path: In the example, the
        #filename[/full_path_to/mass-data-file.mzxml] file is the mass
        spectrometry data file in the mzXML format. 
 #block-tip[The mass spectrometry data file might be of any format that can be
            handled by ProteoWizard (open data formats only, particularly mzML)
            or of Bruker's timsTOF TDF format, which is handled by our own
            code.]
/ output, path: In the example, the
        #filename[/full_path_to/tandemwrapper-output.xml] file is the
        file in which the #xtandem; configuration file is written for immediate
        use by #xtandem;.
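
When many data files need to be searched, writing one configuration file per run by hand becomes tedious. The Python sketch below generates a file with the same notes as the example shown above. It is only an illustration: the function name is hypothetical, and using the data file name as the `bioml` label is an assumption (the label is treated here as free text).

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def build_tandemwrapper_config(presets, database, spectrum, output,
                               taxon="usedefined"):
    """Return the text of a configuration file modeled on the example above."""
    # Assumption: the bioml label is free text; the data file name is used here.
    root = ET.Element("bioml", {"label": Path(spectrum).name})

    def heading(text):
        ET.SubElement(root, "note", {"type": "heading"}).text = text

    def setting(label, value):
        note = ET.SubElement(root, "note", {"type": "input", "label": label})
        note.text = str(value)

    heading("Paths")
    setting("list path, default parameters", presets)
    setting("list path, taxonomy information", database)
    setting("spectrum, path", spectrum)
    heading("Protein general")
    setting("protein, taxon", taxon)
    heading("Output")
    setting("output, path", output)

    ET.indent(root, space="    ")  # available from Python 3.9 onwards
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"
```

The returned text can be written to disk with `Path("run.xml").write_text(text, encoding="utf-8")`, where `run.xml` is a placeholder name.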

The configuration file that indicates where the FASTA protein database files
  are located, and that is referenced in the
  #application[tandemwrapper] input configuration file, is shown in
  @fig:fig_xtpcpp-tandemwrapper-fasta-database-file.



#figure(
caption: [Configuration file pointing at the FASTA protein databases],
[
#image("../assets/print-xtpcpp-tandemwrapper-fasta-database-file.png")
This file tells #application[tandemwrapper] the location
        of the FASTA protein databases that #xtandem; requires to
        perform the searches.
]
)<fig_xtpcpp-tandemwrapper-fasta-database-file>
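
Such a file can also be produced by a script. The Python sketch below assumes that the file follows the usual #xtandem; taxonomy layout, that is, a `bioml` root holding one `taxon` element whose `file` children point at the FASTA files; the figure above remains the reference for the exact content. The function name is hypothetical, and the default taxon label simply repeats the `protein, taxon` value of the example configuration file.

```python
import xml.etree.ElementTree as ET


def build_taxonomy_file(fasta_paths, taxon="usedefined"):
    """Return the text of a taxonomy file listing the given FASTA files."""
    root = ET.Element("bioml", {"label": "x! taxon-to-file matching list"})
    # The taxon label must match the "protein, taxon" note of the
    # search configuration file.
    taxon_element = ET.SubElement(root, "taxon", {"label": taxon})
    for fasta in fasta_paths:
        ET.SubElement(taxon_element, "file",
                      {"format": "peptide", "URL": str(fasta)})
    ET.indent(root, space="  ")  # available from Python 3.9 onwards
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode") + "\n"
```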


The #application[tandemwrapper] program performs the following tasks
  in sequence:

- Convert the input mass data file to the mzXML file format that #xtandem;
        needs to perform the database searches. This step is only performed if
        the original mass spectrometry data file is not already in the mzXML
        format;
- Run #xtandem; on the mzXML file. #xtandem; produces an identification
        results file in a specific XML format;
- Rewrite the identification results file. The mzXML format that is
        consumed by #xtandem; is a fairly simple format that was not designed to
        store a large variety of data and metadata, such as ion mobility data.
        For this very reason, #application[tandemwrapper] reads the
        identification results file produced by #xtandem; and rewrites it to a
        new analogous file that has all the necessary connections to the
        original mass data file. In this way, when the new version of the
        #xtandem; identification results file is loaded in #i2mcq;, all the
        original mass data can be accessed to provide the user with data such as
        XIC chromatograms or ion mobility values. To load the #xtandem;
        identification results, see @sect_loading-protein-identification-results.
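
The sequence of tasks listed above can be summarized in a few lines of Python. This is a conceptual sketch only, not the actual #application[tandemwrapper] code: the three callables passed as arguments are hypothetical stand-ins for the conversion, search and rewriting steps, and detecting the mzXML format from the file extension is an assumption made for the sake of the illustration.

```python
from pathlib import Path


def is_mzxml(path):
    """Assumption: the mzXML format is recognized by the file extension."""
    return Path(path).suffix.lower() == ".mzxml"


def run_wrapped_search(data_path, convert, search, relink):
    """Chain the conversion, search and rewriting steps in order.

    convert(data_path) returns the path of an mzXML file,
    search(mzxml_path) returns the path of the results file,
    relink(results_path, data_path) returns the path of the rewritten
    results file that points back at the original data.
    """
    # The conversion is skipped when the data are already in mzXML format.
    mzxml_path = data_path if is_mzxml(data_path) else convert(data_path)
    results_path = search(mzxml_path)
    # The original data path, not the mzXML path, is recorded in the results.
    return relink(results_path, data_path)
```

Passing the original data path to the last step is what distinguishes this workflow from a plain conversion followed by a search: the rewritten results file refers to the full data, not to the reduced mzXML copy.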

