TRiFLe Tutorial

Pilar Junier

Thomas Junier

NEWS (2008-10-14): the TRiFLe paper has been published


1. Introduction

Terminal Restriction Fragment Length Polymorphism (TRFLP) is a technique for discriminating between DNA species based on variability in restriction sites. In a typical application, DNA from a sample (clinical, environmental, etc.) is amplified with suitable PCR primers. One or both of the primers may be marked with a fluorescent dye. The amplified fragments (amplicons) are then digested with one or more carefully chosen restriction enzymes. The restriction fragments, including the marked ones, are then separated by electrophoresis coupled with a spectrometer. Only the terminal fragments are detected, since only they contain the marked primer. If the lengths of those fragments are sufficiently variable, and if a reference set of lengths is available, the species in the mixture can be determined.

TRiFLe is a program for simulating TRFs and identifying bacteria by TRF profiling. It has two main functions:

  • to simulate TRF experiments
  • to predict the most probable species in a mixture

Simulations help design actual experiments, by allowing users to select the most appropriate restriction enzyme(s). Species prediction helps identify unknown species, including in a sample that contains several, and is thus useful in the study of environmental samples.

TRiFLe is available from http://cegg.unige.ch/trifle/trifle.jnlp (you also need Java WebStart, that's all). We intend to release it as Open Source. Since the paper has been published, the source code will be available here shortly (as soon as I add the GNU license to all the files).

This document is a tutorial that walks the user through a series of examples. TRiFLe is fairly simple to use and we believe the tutorial will be sufficient for most users.

2. Installing TRiFLe

Since TRiFLe is a Java WebStart application, you just need to click on its download link, and it will download the application to your computer and start it. Next time you start the app, it will check if there's a new version available. If this does not work, or for more information about Java WebStart, see Appendix A, Java WebStart.

TRiFLe uses BioJava, which relies on REBase for restriction enzyme data. TRiFLe is distributed with withrefm.801 (January 2008). As requested on http://rebase.neb.com/rebase/rebhelp.html, we emphasize, should it be necessary, that the file is distributed free of charge. If you want to install a newer version of REBase please see Appendix C, Installing REBase.

3. Simulating a TRF Experiment

In this part of the tutorial we're going to use TRiFLe to design a TRF experiment. We wish to conduct TRFLP experiments using the nifH1 gene, found in many bacteria. We have a set of primers that should work for many species - we know they should amplify a fragment of around 450 bp. We have a few restriction enzymes available, but we do not know which enzymes will be the most useful. The question we'll be asking is thus: which enzymes should I use? To answer this, we take a set of reference nifH1 sequences that we assume to be representative of the species we are likely to find in real samples, and use TRiFLe to see which enzymes cut, and where. This should tell us which are the most useful.
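The primers used in this tutorial contain IUPAC ambiguity codes (Y, W, R, S), so locating them in a reference sequence means matching every possible expansion. The following Python sketch shows one way to do this with regular expressions. It is not TRiFLe's code: the function names are hypothetical, only exact matches are considered, and the 'Stringency' setting is not modelled.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_to_regex(primer):
    """Turn a degenerate primer into a regex matching all its expansions."""
    return "".join(
        IUPAC[base] if len(IUPAC[base]) == 1 else "[" + IUPAC[base] + "]"
        for base in primer.upper()
    )


def reverse_complement(seq):
    """Reverse complement, keeping ambiguity codes consistent."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_amplicon(sequence, forward, reverse):
    """Return the amplicon (primers included), or None if a primer is not found.

    Assumption: exact matching only, first forward site, first reverse
    site downstream of it.
    """
    sequence = sequence.upper()
    fwd = re.search(primer_to_regex(forward), sequence)
    if fwd is None:
        return None
    rev_site = re.compile(primer_to_regex(reverse_complement(reverse)))
    rev = rev_site.search(sequence, fwd.end())
    if rev is None:
        return None
    return sequence[fwd.start():rev.end()]


# Toy example with short, made-up primers.
print(find_amplicon("CCAACGTTTTTGTAACC", "AAYG", "TTRC"))
```

The reverse primer is given 5' to 3' on the opposite strand, which is why the sketch searches for its reverse complement.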

Before we start, what do we mean by a 'useful' enzyme? We mean one which produces fragments of different sizes in different species. In the extreme case, if an enzyme produces fragments of the same size in all species, it will be useless for telling them apart.
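To make the idea of a terminal fragment concrete, here is a minimal Python sketch that computes the forward and reverse TRF lengths of one amplicon for the five enzymes used in this tutorial. The recognition sites and cut positions are the standard ones for these enzymes; the function name and the toy amplicon are hypothetical, and this is an illustration rather than TRiFLe's actual algorithm. When an enzyme has no site, the "TRF" is taken to be the whole amplicon.

```python
# Recognition site and cut offset on the top strand (e.g. HaeIII cuts GG^CC).
ENZYMES = {
    "AluI": ("AGCT", 2),
    "HaeIII": ("GGCC", 2),
    "MboI": ("GATC", 0),
    "MspI": ("CCGG", 1),
    "TaqI": ("TCGA", 1),
}


def terminal_fragments(amplicon, enzyme):
    """Return (forward TRF length, reverse TRF length) for one amplicon.

    The forward TRF runs from the 5' end of the top strand to the first
    cut. The reverse TRF lies on the bottom strand; because the sites are
    palindromic, the bottom-strand cut sits at len(site) - cut in
    top-strand coordinates.
    """
    site, cut = ENZYMES[enzyme]
    amplicon = amplicon.upper()
    first = amplicon.find(site)
    if first == -1:
        # No restriction site: the whole amplicon is detected.
        return len(amplicon), len(amplicon)
    last = amplicon.rfind(site)
    forward = first + cut
    reverse = len(amplicon) - (last + len(site) - cut)
    return forward, reverse


# Made-up 20-nt amplicon with one HaeIII site and one MspI site.
toy = "AAAAGGCCTTTTTTCCGGAA"
for name in ("HaeIII", "MspI", "AluI"):
    print(name, terminal_fragments(toy, name))
```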

  1. Download trifle_tut_1.dna. This file contains the reference sequences. There are three of them, each around 890 bp long.
  2. Launch TRiFLe, and select Simulation -> New. The application loads the enzymes from a REBase file; this can take a little while. Then it opens a dialog box. Fill it in as follows:
    • In References file, browse to the file you just saved.
    • In Forward Primer, put AAAGGYGGWATCGGYAARTCCACCAC
    • In Reverse Primer, put TTGTTSGCSGCRTACATSGCCATCAT
    • Make sure both checkboxes are checked, meaning that both primers are marked.
    • Set the 'Stringency' slider near the middle
    • Select 'AluI', 'HaeIII', 'MboI', 'MspI', and 'TaqI' from the enzyme list. You can move the selection with the arrow keys, and you can also type the beginning of an enzyme's name. To add the enzyme to the list, press Enter or double-click.

    The dialog box should look like this:

  3. Click 'OK'. The results for the first species, MlotiMAFF303099, appear.

    The horizontal line represents TRF sizes, in nucleotides. Each TRF appears as a vertical bar. Its color denotes the enzyme that produces it, as shown in the key on the right. Solid bars are TRFs that include a forward primer; dashed bars are TRFs that include a reverse primer. For example, the dashed yellow line a little after size 125 is a TRF made by TaqI with the reverse primer. If you click near it, more information will appear next to the species name - in this case, we see that the actual size is 130 bp. You can cycle through the species using the buttons in the lower panel (they mean 'first', 'previous', 'next', and 'last', respectively).

    Now click the 'Enzyme' tab. This shows you the same results, but organized by enzyme:

    Each vertical bar in the display is a TRF produced by the enzyme whose name appears in the display's upper left. Solid bars are TRFs that include a forward primer; dashed bars are TRFs that include a reverse primer. You can get more details about a particular TRF by clicking next to it. For example, if you click on the leftmost bar for MspI, you will see that this enzyme produces a TRF of length 31 on the reverse strand of the M. loti sequence:

    By clicking on the other bars, you will see that MspI also produces a direct-strand TRF of size 164 nt in B. japonicum, etc. You can cycle through the enzymes using the buttons in the lower panel (they mean 'first', 'previous', 'next', and 'last', respectively). You will notice that all enzymes except MspI also produce "TRFs" with a size of 458 bp. As we know, this is more or less the size of the whole, undigested fragment. This means that there are no restriction sites of the corresponding enzyme on those fragments.

The 'Table' tab has details on the size of each fragment. It will show you, for example, that MspI produces reverse-primer fragments of equal size (190) in two species.

 

We see that the least suitable enzyme is probably MboI, because of the small variation in fragment size.
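One rough way to turn "variation in fragment size" into a number is to count how many distinct TRF sizes an enzyme produces across the reference species, and how close the two nearest sizes are. The Python sketch below does this on invented data; the enzyme and species names, the sizes, and the scoring rule itself are all assumptions made for illustration, not a criterion used by TRiFLe.

```python
def discrimination(sizes_by_species):
    """Return (number of distinct sizes, smallest gap between two sizes)."""
    sizes = sorted(sizes_by_species.values())
    gaps = [b - a for a, b in zip(sizes, sizes[1:])]
    return len(set(sizes)), min(gaps) if gaps else 0


def rank_enzymes(trfs):
    """Order enzymes from most to least discriminating."""
    return sorted(trfs, key=lambda enzyme: discrimination(trfs[enzyme]), reverse=True)


# Hypothetical forward-TRF sizes (nt) for three species and three enzymes.
trfs = {
    "EnzA": {"sp1": 60, "sp2": 150, "sp3": 310},   # well separated
    "EnzB": {"sp1": 200, "sp2": 203, "sp3": 205},  # distinct but very close
    "EnzC": {"sp1": 458, "sp2": 458, "sp3": 458},  # no site in any species
}
print(rank_enzymes(trfs))
```

With a rule like this, an enzyme whose fragments differ by only a few nucleotides ranks below one whose fragments are well separated, which matches the reasoning applied to MboI above.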

4. Identification

4.1. Identification using a single run file without length correction

In this part of the tutorial, we'll try to identify species from an environmental sample. The inputs will be:

  • a set of TRF lengths, measured experimentally from a digestion made on DNA extracted from an environmental sample, amplified with suitable primers, and digested with HaeIII with the forward primer marked. This is a plain text file with lines like this:

    Peak    Time    Size    Height    Area    Data Point
    2B,2 5,67 46,26 4377 49447 2126
    ...

    We're interested in the third column, sizes. The other columns will be ignored. Lines which do not contain data, such as the header line, are also ignored.

  • a set of reference sequences for species that we expect to find in the sample (and maybe some that we don't expect, who knows?).
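If you want to inspect a run file outside TRiFLe, the format described above is easy to read. The Python sketch below extracts the size column; it assumes whitespace-separated fields and decimal commas, as in the sample line shown, and the function name is hypothetical. Lines whose size field is not a number, such as the header, are skipped.

```python
def read_sizes(lines, size_column=3):
    """Extract fragment sizes from run-file lines (size_column is 1-based)."""
    sizes = []
    for line in lines:
        fields = line.split()
        if len(fields) < size_column:
            continue  # too short to be a data line
        try:
            # The sample data uses a decimal comma, e.g. "46,26".
            sizes.append(float(fields[size_column - 1].replace(",", ".")))
        except ValueError:
            continue  # header or other non-data line
    return sizes


example = [
    "Peak    Time    Size    Height    Area    Data Point",
    "2B,2 5,67 46,26 4377 49447 2126",
]
print(read_sizes(example))
```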
  1. Download file trifle_tut_2_1.txt (experimental TRF length data, with HaeIII and the forward primer marked).
  2. Download file trifle_tut_2_3.dna (reference dataset)
  3. Launch TRiFLe
  4. Select Identification -> New. After some time for fetching the available enzymes, a dialog box appears. Fill it in as follows:
    • In the References File text area, browse the path to the references file (tutorial 2.3).
    • Fill the forward and reverse primers with the following sequences
      • In Forward Primer, put AAAGGYGGWATCGGYAARTCCACCAC
      • In Reverse Primer, put TTGTTSGCSGCRTACATSGCCATCAT
    • Set the stringency to medium (near the middle of the bar)
    • In Size column, enter the number of the column that contains the fragment sizes in your run file (column 3 for this tutorial)
  5. Click Add Sample Run. A text field marked Sample File should appear, as well as a combo box marked Enzyme and two radio buttons named Fwd and Rev.
  6. In the Sample File text area, browse the path to the HaeIII run file (tutorial 2.1). In the Enzyme combo box, select 'HaeIII', and make sure the Fwd radio button is set. The dialog should look like this:
  7. Click OK. It does not matter much whether you check the Kaplan-Kitts correction box or not, as the fragments are too small for this to have an effect.

In the table that appears, the following information is displayed: reference species (Sp), simulated T-RF for Sp, nearest experimental T-RF (Sp), distance (simulated-experimental T-RFs), and overall distance (this is meaningful when several run files are included).

The reference species appear ordered by similarity to the experimental data, allowing the user to infer which species are more likely to be present in the sample. In this tutorial, the first 4 species are

  1. Falni overall distance 0
  2. MlotiMAFF303099 overall distance 3
  3. BjaponicumUSDA110 overall distance 3
  4. RhizobiumNGR234 overall distance 5

Notice that the distance for Anabaena7120 is empty. This is because, with the default settings, long TRFs are excluded (more on this in Section 4.3, “Filtering TRFs by size”). For the other species in the reference set, the distance to the nearest peak is above 6. Since the simulated and experimental T-RFs that were compared are also shown in the results, the user can easily go back to the original data.
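The comparison behind such a table can be sketched in a few lines of Python: for each reference species, find the experimental peak closest to the simulated T-RF and use the absolute difference as the distance. This is an illustration under stated assumptions, not TRiFLe's implementation; the function names, species names, and sizes are invented, and any rounding TRiFLe may apply is not modelled.

```python
def nearest_peak(simulated, experimental):
    """Return the experimental size closest to the simulated T-RF size."""
    return min(experimental, key=lambda peak: abs(peak - simulated))


def rank_species(simulated_trfs, experimental):
    """Return rows (species, simulated, nearest peak, distance), best first."""
    rows = []
    for species, simulated in simulated_trfs.items():
        peak = nearest_peak(simulated, experimental)
        rows.append((species, simulated, peak, abs(simulated - peak)))
    return sorted(rows, key=lambda row: row[3])


# Hypothetical simulated T-RFs (nt) and experimental peak sizes.
simulated_trfs = {"spA": 100, "spB": 210, "spC": 64}
experimental = [61.0, 100.0, 215.0]
for row in rank_species(simulated_trfs, experimental):
    print(row)
```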

4.2. Identification using several run files

In this part of the tutorial, we'll use several run files in order to identify species from an environmental sample. The inputs will be as in Section 4.1, but with an additional run, digested with MspI. We'll assume that you have already downloaded trifle_tut_2_1.txt and trifle_tut_2_3.dna.

  1. Download file trifle_tut_2_2.txt (experimental data with MspI).
  2. Launch TRiFLe
  3. Select Identification -> New. A dialog box appears (maybe after a while, if the program has to find the available enzymes). Fill it in as follows:
    • In the References File text area, browse the path to the references file (tutorial 2.3).
    • Fill the forward and reverse primers with the following sequences:
      • In Forward Primer, put AAAGGYGGWATCGGYAARTCCACCAC
      • In Reverse Primer, put TTGTTSGCSGCRTACATSGCCATCAT
    • Set the stringency to medium (near the middle of the bar)
    • Make sure Kaplan & Kitts correction check box is not checked (we'll see what it does shortly)
    • Click Add Sample Run. A text field marked Sample File should appear, as well as a combo box marked Enzyme and two radio buttons named Fwd and Rev.
    • In the Sample File text area, browse the path to the HaeIII run file (trifle_tut_2_1.txt). In the Enzyme combo box, select 'HaeIII', and make sure the Fwd radio button is set.
    • Click again on Add Sample Run, and fill as above but with the MspI results file (trifle_tut_2_2.txt) and with 'MspI' selected in the combo box, and Fwd selected. The dialog should look like this:
  4. Click OK. There will be a warning about a sequence that does not amplify, and then the results appear.

The table columns show the following information.

  1. Species
  2. Predicted TRF size, HaeIII Forward
  3. Experimental TRF size, HaeIII Forward
  4. Difference between predicted and experimental size, HaeIII Forward
  5. Predicted TRF size, MspI Forward
  6. Experimental TRF size, MspI Forward
  7. Difference between predicted and experimental size, MspI Forward
  8. Average distance (over HaeIII and MspI)

The species are ordered by increasing average distance, meaning that the likeliest species are near the top.
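As an illustration of this ordering, the Python sketch below averages the per-run distances of each species and sorts by the result. The species names and distances are invented, and the function name is hypothetical; it only shows the arithmetic, not how TRiFLe computes it internally.

```python
from statistics import mean


def rank_by_average(distances_by_species):
    """Return (species, average distance) pairs, smallest average first."""
    averaged = {
        species: mean(distances)
        for species, distances in distances_by_species.items()
    }
    return sorted(averaged.items(), key=lambda item: item[1])


# Hypothetical distances: [HaeIII Forward, MspI Forward] for each species.
distances = {"spA": [0, 4], "spB": [3, 0], "spC": [5, 9]}
for species, average in rank_by_average(distances):
    print(species, average)
```

Note that a species that matches one run perfectly (spA here) can still rank below one that matches both runs reasonably well (spB).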

4.2.1. Using the Kaplan & Kitts length correction

Kaplan and Kitts (2003) have shown that theoretical and experimental TRF lengths diverge in a predictable way, and proposed a correction factor. To see its effect, redo the above procedure, but check the Kaplan-Kitts correction box.

Notice that the average distances are, on the whole, shorter than without the correction. Also, the order of the species has changed: this is because the correction has a larger effect on long fragments, so species with long TRFs score comparatively better with the correction on.

4.3. Filtering TRFs by size

By default, TRFs shorter than 35 nt or longer than 350 nt are ignored in the average distance. You can change this by adjusting the minimum and maximum peak sizes in the results table.

In our example, Anabaena7120 has a TRF by HaeIII - Forward of length 458. In the default settings, TRFs longer than 350 nt are ignored. We're going to change this.

  • Follow the steps above, and set the Max. peak size to 500 instead of 350.

The effect is that the Anabaena TRF is now included into the average distance for that species.
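The effect of the size window on the average can be sketched as follows in Python. The 35 and 350 nt defaults and the 458 nt TRF come from the text above; the distances, the function name, and the choice of inclusive bounds are assumptions made for illustration.

```python
def average_distance(trfs, min_size=35, max_size=350):
    """Average the distances of TRFs whose simulated size is in the window.

    trfs is a list of (simulated size, distance) pairs. Returns None when
    every TRF is filtered out, which corresponds to an empty cell.
    """
    kept = [dist for size, dist in trfs if min_size <= size <= max_size]
    return sum(kept) / len(kept) if kept else None


# Hypothetical species with one long TRF (458 nt) and one short one.
trfs = [(458, 12.0), (120, 2.0)]
print(average_distance(trfs))                # the 458 nt TRF is ignored
print(average_distance(trfs, max_size=500))  # the 458 nt TRF is included
```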

A. Java WebStart

Java WebStart is a technology that simplifies the distribution of Java applications, from the user's point of view at least. Once Java WebStart is installed, any Java WebStart application can be installed and/or launched by simply clicking on a link on a web page, and it also automatically updates your installation if any updates are available. We distribute TRiFLe via Java WebStart because it is more convenient (if it were distributed without Java WebStart, you'd have to manually download 8 different files, set the classpath, etc.).

1. Getting Java WebStart

First, you may not even need it. Many systems come with Java WebStart pre-installed (see the demos pages below). You need to do this only if the download link does not work. If you do need it, this is just a matter of downloading an archive from the Sun site, but some people find the page confusing, so here are detailed steps:

  1. Go to http://java.sun.com/products/javawebstart
  2. Click the Download now link (NOT the Download link)[1]
  3. Click the Download JRE 5.0 Update 6 link
  4. Click the Accept checkbox (you may even want to actually read the license terms)
  5. Select the appropriate download for your machine. http://java.sun.com/j2se/1.5.0/download-info.html has more info, but briefly, Windows users should try the online installer first, and Linux users should try the self-extracting archive first.
  6. Run the installer (double-click on Windows, from a shell on Linux)
  7. Java WebStart should now be working. You can check this by trying out the demos at http://java.sun.com/products/javawebstart/demos.html

 

2. What's this certificate thing?

Java WebStart requires applications to be digitally signed, at least when they access the disk (which TRiFLe does). You have to accept the certificate in order to download. By the way, it probably says it cannot trust us. That's because I didn't want to get myself one of those spiffy (but expensive) certificates from Verisign, etc., so I just signed it myself.

[1] You may wonder why I do not link directly to the archive. That's because the Sun site uses sessions, and there is no fixed link that can be provided.

B. FastA IDs

TRiFLe uses the BioJava libraries to perform some of its tasks, including reading sequences from a FastA file. What we mean by a valid FastA ID thus really means one that BioJava recognizes as such. I have been unable to find details about this in the BioJava documentation, but from a look at the source code it would seem that the ID is the first word of the line, i.e. the part between the > and the first whitespace character. Any characters after the first whitespace are ignored: they do no harm, but they won't be taken into account.

1. So what should I use?

Use alphanumeric characters and you should be fine. Don't use spaces, as the ID would end at the first space. Examples include:

>B_subtilis
>EcoliK12
>Sf123

Counter example: >B. subtilis - because of the space, the ID will be only B., so if you also have, say, >B. cereus, there will be a name conflict.
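To check a FastA file before loading it, you can apply the rule described above (the ID is whatever lies between the > and the first whitespace) and look for empty or repeated IDs. The Python sketch below does that; it mirrors the description in this appendix rather than BioJava's actual parser, and the function names are hypothetical.

```python
import re
from collections import Counter


def fasta_ids(lines):
    """Return the ID of every header line: the text between '>' and the
    first whitespace character (possibly empty)."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            ids.append(re.match(r">(\S*)", line).group(1))
    return ids


def id_problems(lines):
    """Return (number of empty IDs, sorted list of IDs used more than once)."""
    ids = fasta_ids(lines)
    counts = Counter(ids)
    duplicates = sorted(i for i, n in counts.items() if i and n > 1)
    return ids.count(""), duplicates


example = [
    ">B. subtilis", "ATTG",
    ">B. cereus", "GGCC",
    ">EcoliK12 some description", "ATGC",
    ">", "ATTGCCA",
]
print(fasta_ids(example))
print(id_problems(example))
```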

The examples page has examples of files known to work; you may wish to look at them.

C. Installing REBase

You can download an up-to-date version of the REBase enzyme data from http://rebase.neb.com/rebase/link_withref. (TODO: check that the following works on a user machine, with Java WebStart.) You must then configure your CLASSPATH so that TRiFLe finds the data. The location of the enzymes file is set in the Java properties file, somedir/org/biojava/bio/molbio/RestrictionEnzymeManager.properties, where somedir is a directory named in the CLASSPATH. In other words:

  1. there must be a file named RestrictionEnzymeManager.properties, and this file must specify the location of the REBase file as the value of the rebase.data.file property
  2. this file must be in a dir hierarchy org/biojava/bio/molbio
  3. the directory that contains org must be found in the CLASSPATH and it must appear there BEFORE the BioJava archive (otherwise the REBase file that comes with BioJava will be used instead)
  4. the value of the rebase.data.file property is the path to the REBase file itself (withrefm.xyz), and this path is relative to the CLASSPATH dir that contains org.

For example, we could create a directory named rebase, in which we create the subdir hierarchy org/biojava/bio/molbio. In that last dir (molbio), we put the REBase file itself, withrefm.801, as well as a Java properties file named RestrictionEnzymeManager.properties, which reads:

# Where the REBase file is located

rebase.data.file=/org/biojava/bio/molbio/withrefm.801

# end

Then we put the whole path to rebase in the CLASSPATH before the biojava jar, and it works.
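The directory layout from this example can also be created with a short script. The Python sketch below builds the org/biojava/bio/molbio hierarchy under a target directory, copies the REBase file into it, and writes the properties file. The function name and default directory are hypothetical, the script has not been checked against a Java WebStart installation, and you still have to add the target directory to the CLASSPATH yourself, before the BioJava jar.

```python
import shutil
from pathlib import Path


def install_rebase(rebase_file, target_dir="rebase"):
    """Lay out a REBase file and its properties file under target_dir.

    Returns the path of the properties file that was written.
    """
    rebase_file = Path(rebase_file)
    package_dir = Path(target_dir) / "org" / "biojava" / "bio" / "molbio"
    package_dir.mkdir(parents=True, exist_ok=True)

    # Copy the enzyme data (e.g. withrefm.801) next to the properties file.
    shutil.copyfile(rebase_file, package_dir / rebase_file.name)

    # The path is relative to the CLASSPATH directory that contains org.
    properties = package_dir / "RestrictionEnzymeManager.properties"
    properties.write_text(
        "# Where the REBase file is located\n"
        f"rebase.data.file=/org/biojava/bio/molbio/{rebase_file.name}\n"
    )
    return properties
```

For example, install_rebase("withrefm.801") would create rebase/org/biojava/bio/molbio/ in the current directory, assuming withrefm.801 has been downloaded there.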

D. Troubleshooting

1. Error Messages

1.1. Could not read sequence

This is typically due to a format problem in the FastA file. The most frequent case is a missing ID, e.g.

>
ATTGCCA...

It can also be caused by an unrecognized ID. See Appendix B, FastA IDs.

2. Other problems

2.1. My sequence does not appear, but there was no error message.

Maybe its identifier could not be recognized. See Appendix B, FastA IDs.