proTRAC QuickStart and Troubleshooting
QuickStart
- Download the proTRAC folder to the desired directory on your machine.
- Map your sequence reads (FASTA format) to a genome with SeqMap (Jiang and Wong 2008), using the option /output_all_matches. SeqMap is freely available here. Many genomes are available at NCBI or Ensembl. Map your reads with the following command (the leading 0 is the number of mismatches allowed):
  seqmap 0 reads.fas genome.fas ELAND3_output.txt /output_all_matches
- Start proTRAC and use ELAND3_output.txt as the input file.
- Adjust settings and start.
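The mapping step above can also be scripted. The following is a minimal Python sketch, assuming the seqmap executable is on your PATH. The function names build_seqmap_command and run_seqmap are illustrative only and are not part of proTRAC or SeqMap.

```python
import subprocess


def build_seqmap_command(reads, genome, output, mismatches=0):
    """Assemble the SeqMap call from the QuickStart as an argument list."""
    return [
        "seqmap",
        str(mismatches),        # number of mismatches allowed
        reads,                  # sequence reads in FASTA format
        genome,                 # genome in FASTA format
        output,                 # output file used as proTRAC input
        "/output_all_matches",  # report every hit, as proTRAC requires
    ]


def run_seqmap(reads, genome, output, mismatches=0):
    """Run SeqMap and raise CalledProcessError if it exits with an error."""
    command = build_seqmap_command(reads, genome, output, mismatches)
    subprocess.run(command, check=True)
```

Calling run_seqmap("reads.fas", "genome.fas", "ELAND3_output.txt") would issue the same command as shown in the QuickStart.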
Troubleshooting
My ELAND3 file is huge
Check your sequence dataset. Do not map homo- or dipolymeric stretches like GCGCGCGCGCGCGCGC or AAAAAAAAAAAAAAA to the genome. These sequences map to microsatellites and therefore produce millions of hits. In some cases it also helps to filter out sequence reads that correspond to rRNA, tRNA or similar sequences. Useful Perl scripts such as filter_simple_repeats.pl and map_sequences.pl are part of the "NGS tools for the novice". The whole toolkit is freely available here.
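To illustrate the idea of removing homo- and dipolymeric reads before mapping, here is a minimal Python sketch. It is not the logic of filter_simple_repeats.pl. The function names and the min_identity threshold of 0.9 are assumptions chosen for illustration.

```python
def is_simple_repeat(seq, max_unit=2, min_identity=0.9):
    """Return True if seq is (almost) a repeat of a 1- or 2-nt unit.

    For each unit length k, count the positions whose base equals the
    base k positions earlier. A perfect repeat scores 1.0.
    """
    seq = seq.upper()
    for k in range(1, max_unit + 1):
        if len(seq) <= k:
            continue
        matches = sum(1 for i in range(k, len(seq)) if seq[i] == seq[i - k])
        if matches / (len(seq) - k) >= min_identity:
            return True
    return False


def drop_simple_repeats(seqs):
    """Keep only the reads that are not flagged as simple repeats."""
    return [s for s in seqs if not is_simple_repeat(s)]
```

With these settings, the two example reads GCGCGCGCGCGCGCGC and AAAAAAAAAAAAAAA are flagged, while a read such as TTGTACTACTTCCATT is kept.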
I cannot execute proTRAC.pl
Executing Perl scripts requires an installed Perl distribution. You may also need to install additional Perl modules such as Tk and GD. Missing modules are listed in the error message. Modules are freely available at the Comprehensive Perl Archive Network (CPAN). If you are not on a Windows system, you can also try to run proTRAC.exe in an emulated Windows environment. We ran proTRAC.exe without any problems on a Mac with Windows XP emulated by Parallels Desktop®.
proTRAC returns an error message when reading the input file
Ensure that your input file is in ELAND3 format. Your file should look something like this:
trans_id  trans_coord  target_seq        probe_id  probe_seq         num_mismatch  strand
Chr1      2549081      TTGTACTACTTCCATT  3         TTGTACTACTTCCATT  0             -
Chr1      3743045      TGAGGCCATGTTTCA   1         TGAGGCCATGTTTCA   0             -
Chr1      3785722      TCAATTCTTGACTTCT  2         TCAATTCTTGACTTCT  0             +
Chr1      3797369      TTTCTTATCGTGCATG  1         TTTCTTATCGTGCATG  0             -
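A quick sanity check of the input file can help locate the offending line. The following Python sketch checks only what the sample above shows: seven whitespace-separated columns, a numeric coordinate, a numeric mismatch count and a strand of + or -. These rules are inferred from the sample, not taken from proTRAC itself, and the function name is hypothetical.

```python
EXPECTED_HEADER = [
    "trans_id", "trans_coord", "target_seq", "probe_id",
    "probe_seq", "num_mismatch", "strand",
]


def check_eland3_lines(lines):
    """Return (line_number, problem) tuples; an empty list means no problem was found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields or fields == EXPECTED_HEADER:
            continue  # skip blank lines and the header line
        if len(fields) != 7:
            problems.append((number, f"expected 7 columns, found {len(fields)}"))
            continue
        if not fields[1].isdigit():
            problems.append((number, "trans_coord is not a number"))
        if not fields[5].isdigit():
            problems.append((number, "num_mismatch is not a number"))
        if fields[6] not in ("+", "-"):
            problems.append((number, "strand is neither + nor -"))
    return problems
```

To check a file, pass the open file handle, for example check_eland3_lines(open("ELAND3_output.txt")).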
proTRAC does not assemble cluster candidates
Cluster candidates are assembled on the basis of hit density. Your ELAND3 input file may not contain any accumulation of hits that satisfies your settings. Try reducing p for hit density or the minimum hit density. If the assigned sliding window size seems too high (it is usually ~10), one of the simple settings or probabilistic settings is probably set too strictly.
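To get a feeling for whether your input contains any hit accumulation at all, you can count hits in a sliding window yourself. The sketch below is a generic Python illustration and not proTRAC's own algorithm. The window size in base pairs and the function name are assumptions.

```python
def max_hits_in_window(coords, window):
    """Return the largest number of hits whose coordinates fall within
    any half-open interval [start, start + window) on one chromosome."""
    coords = sorted(coords)
    best = 0
    right = 0
    for left in range(len(coords)):
        # Advance the right edge while hits still lie inside the window
        # that starts at the current left-most hit.
        while right < len(coords) and coords[right] < coords[left] + window:
            right += 1
        best = max(best, right - left)
    return best
```

If the maximum count per window is very low for every chromosome, relaxing the hit density settings as described above is more promising than changing other parameters.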
proTRAC assembles cluster candidates but verifies 0 clusters
The cluster candidates do not pass the requirements. Probably one or several of the simple settings or probabilistic settings are set too strictly. IMPORTANT: Keep in mind what you mapped to the genome. If all of your sequences are 26-32 nt in length, reads of typical length cannot be enriched when the typical length is set to 26-32. Likewise, if you mapped only piRNAs, do not expect loci starting with T to be enriched relative to the entirety of mapped reads. In this case, use the option based on random base composition.
Validation of results takes a hell of a long time
Probably quite a few clusters have been detected. You can abort the computation at this point: the detected clusters and the optional cluster summary file have already been saved.
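Before choosing these settings, it can help to look at the composition of the reads you mapped. The following Python sketch reports the fraction of reads within a length range and the fraction starting with T (or U). The function name and the default range of 26-32 nt are illustrative assumptions that mirror the example above.

```python
def summarize_reads(seqs, min_len=26, max_len=32):
    """Return (fraction of reads of typical length, fraction starting with T/U)."""
    total = len(seqs)
    if total == 0:
        return 0.0, 0.0
    typical = sum(1 for s in seqs if min_len <= len(s) <= max_len)
    starts_t = sum(1 for s in seqs if s[:1].upper() in ("T", "U"))
    return typical / total, starts_t / total
```

If nearly all mapped reads already have the typical length or start with T, clusters cannot show enrichment for that feature relative to the whole dataset.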
How to include transcriptional information in proTRAC?
This information must be included in the sequence file that is mapped to the genome. The FASTA title of each sequence must state the abundance (read count) of that sequence:
>73
GCTAGCTAGCGTAGCTAGCTGCGCTA
>2
AATGCGCTATATACGGCTCTTATAGCGCAT
>12
TCTCTAGAGATCTCTTTTTTAAGTC
A Perl script (discard_redundant_sequences.pl) that converts FASTA files with redundant sequences to the required format is part of the "NGS tools for the novice". The whole toolkit is freely available here.
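The conversion can be pictured as follows. This is a minimal Python sketch of collapsing identical reads into one entry whose title is the read count. It is not the code of discard_redundant_sequences.pl, and the function names are hypothetical.

```python
from collections import Counter


def read_fasta_sequences(lines):
    """Yield one sequence per FASTA record, ignoring the original titles."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if parts:
                yield "".join(parts)
                parts = []
        else:
            # Upper-casing means reads differing only in case are merged.
            parts.append(line.upper())
    if parts:
        yield "".join(parts)


def collapse_to_counts(lines):
    """Return FASTA text with one entry per distinct read, titled by its count."""
    counts = Counter(read_fasta_sequences(lines))
    out = []
    for seq, n in counts.most_common():  # most abundant reads first
        out.append(f">{n}")
        out.append(seq)
    return "\n".join(out) + "\n" if out else ""
```

For a file on disk, collapse_to_counts(open("reads.fas")) returns the collapsed FASTA text, which can then be written to a new file and mapped with SeqMap.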
Any more questions or problems?
Contact David Rosenkranz: rosenkrd@uni-mainz.de