
PESTAS - a web server for EST analysis and sequence mining


Publication

Seong-Hyeuk Nam, Dae-Won Kim, Tae-Sung Jung, Young-Sang Choi, Dong-Wook Kim, Han-Suk Choi, Sang-Haeng Choi and Hong-Seog Park (2009) PESTAS: a web server for EST analysis and sequence mining. Bioinformatics.




Overview

Pipeline for EST Analysis Service (PESTAS) is a web server for the high-throughput annotation of expressed sequence tags (ESTs) through an automated pipeline of 13 analytic services. All data sets generated by the pipeline are deposited in a MySQL database and transformed into three kinds of reports (preprocessing, assembling and annotation). All annotated information is made available to the scientist and can be downloaded through a web browser. To obtain more relevant functional annotation, a curation function lets biologists easily change the best-hit annotation. PESTAS also provides a gene chip function that helps reveal differences in expression patterns between libraries by comparing the per-accession read counts obtained from the BLAST results. Finally, PESTAS provides access to KEGG pathway information, which is useful for mapping the relationships among a whole network of annotated enzymes and is especially valuable for researchers interested in biological pathways.


Example screenshot from PESTAS



Key features of PESTAS

1. User-controlled EST analysis pipeline service
The thirteen analytic tools used by PESTAS have been wrapped as web services using web service technologies. Each analytic web service can be executed individually or as part of the user-controlled pipeline. The user-controlled pipeline gives scientists an efficient EST analysis service. The web service technologies adopted in PESTAS also make the system extensible and flexible, because the analytic web services are component-based and loosely coupled.

2. Report service
PESTAS provides three kinds of reports: pre-processing, assembling and annotation. The pre-processing report presents information generated in the cleansing step of the pipeline module (base calling, vector trimming, contamination trimming and repeat masking). This report contains a summary, graphs and tables for each sub-step of the cleansing step. The assembling report presents the status of contigs and singletons after the clustering and assembling step of the pipeline module. It also provides a contig viewer that compares the consensus sequence with each read sequence and shows the chromatogram of each read. The annotation report provides information about the functional annotation step of the pipeline module, including statistics, a summary, seven functional annotation reports and two reports from additional analysis services.

3. Curation service
The main function of the curation service is to give scientists more intuitive and effective information. In general, when EST consensus sequences are annotated, automated high-throughput EST analysis pipelines use BLAST algorithms to search for similar sequences in relevant databases and then assign a putative function by taking the description of the best hit. For species that have not been well studied, the best-hit description is often uninformative, for example "hypothetical transcript" or "hypothetical protein". To make sense of BLAST results and to offer intuitive annotation, we developed a versatile curation function that lets the user customize the annotation result for further research. In other words, after individual authentication, the user can replace the best hit, on the basis of the hit descriptions, with a lower-ranked but better-annotated hit from the BLAST output.
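
The idea behind curation can be illustrated with a minimal sketch, assuming the BLAST hits for one sequence are already available as (rank, accession, description) records. The `Hit` type, the `UNINFORMATIVE` keyword list, the function `suggest_hit` and the accession strings are hypothetical names chosen for illustration and are not part of PESTAS. In PESTAS the final choice is made by the user; this sketch only proposes a candidate.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    rank: int          # 1 = best hit by BLAST score
    accession: str     # placeholder accession strings are used below
    description: str


# Assumed keywords that mark a description as uninformative (illustrative only).
UNINFORMATIVE = ("hypothetical", "unknown", "unnamed protein product")


def suggest_hit(hits: Sequence[Hit]) -> Optional[Hit]:
    """Return the highest-ranked hit whose description looks informative.

    Falls back to the best hit when every description is uninformative,
    and returns None for an empty hit list.
    """
    ordered = sorted(hits, key=lambda h: h.rank)
    for hit in ordered:
        text = hit.description.lower()
        if not any(word in text for word in UNINFORMATIVE):
            return hit
    return ordered[0] if ordered else None


hits = [
    Hit(1, "ACC0001", "hypothetical protein"),
    Hit(2, "ACC0002", "heat shock protein 70"),
]
suggestion = suggest_hit(hits)
print(suggestion.accession, suggestion.description)
```

A keyword filter of this kind can only pre-select candidates; deciding whether a lower-ranked hit is a better annotation still needs the biologist's judgment, which is why the service keeps the user in control.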

4. Gene chip service
To aid the understanding of gene expression patterns, we included a gene chip module that detects gene expression differences between libraries by comparing the per-accession read counts obtained from the BLAST results. Each gene is graded on a scale of 14 colors according to the number of reads, counted by accession number, that make up its contigs. A newly identified gene expression profile could help predict how expression differs between libraries according to tissue and environment.
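
The counting and grading described here can be sketched as follows. The source states only that 14 colors are used; the linear scaling rule in `grade`, the function names, the library names and the accession strings are assumptions made for this illustration and may differ from what PESTAS actually does.

```python
from collections import Counter
from typing import Dict, List

N_GRADES = 14  # 14 colors per the text; the binning rule below is assumed


def grade(count: int, max_count: int, n_grades: int = N_GRADES) -> int:
    """Map a read count to a grade in 0..n_grades-1.

    Grade 0 means no reads; positive counts are scaled linearly (rounded up)
    so that the largest count receives the top grade.
    """
    if count <= 0 or max_count <= 0:
        return 0
    return (count * (n_grades - 1) + max_count - 1) // max_count


def grade_libraries(libraries: Dict[str, List[str]]) -> Dict[str, Dict[str, int]]:
    """Build an accession x library table of grades.

    `libraries` maps a library name to the accession hit by each of its reads.
    """
    counts = {name: Counter(hits) for name, hits in libraries.items()}
    max_count = max((c for ctr in counts.values() for c in ctr.values()), default=0)
    accessions = sorted({acc for ctr in counts.values() for acc in ctr})
    return {
        acc: {name: grade(ctr[acc], max_count) for name, ctr in counts.items()}
        for acc in accessions
    }


# Hypothetical input: one accession per read, grouped by library.
libraries = {
    "leaf": ["ACC1"] * 13 + ["ACC2"],
    "root": ["ACC2"] * 5,
}
table = grade_libraries(libraries)
for accession, row in table.items():
    print(accession, row)
```

Scaling against the largest count across all libraries keeps the grades comparable between libraries; normalizing by library size first would be a reasonable alternative when libraries differ greatly in depth.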

5. Pathway search service
By overlaying expression data on biological pathways, established and novel relationships among genes can be explored. These pathways give key information about the functional and metabolic organization of cellular and biological systems within organisms. PESTAS therefore provides KEGG pathway information: it extracts EC numbers from the descriptions of the UniProt results and maps these EC numbers to KEGG pathways. This is useful for mapping the relationships among a whole system of annotated enzymes, and the resulting maps are especially valuable for researchers interested in biological pathways.
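
The EC-number step can be sketched as a regular-expression scan over hit descriptions followed by a table lookup. The pattern, the function names and the small `EC_TO_PATHWAYS` table are assumptions for illustration: the table is a hand-written stand-in, not a complete or authoritative KEGG mapping, and the exact description format PESTAS parses is not given in the text.

```python
import re
from typing import Dict, Iterable, List

# An EC number has four dot-separated fields; trailing fields may be "-"
# for partially classified enzymes (for example 3.4.-.-).
EC_PATTERN = re.compile(r"\bEC[ =:]?\s*(\d+(?:\.(?:\d+|-)){3})")


def extract_ec_numbers(description: str) -> List[str]:
    """Return every EC number found in one hit description."""
    return EC_PATTERN.findall(description)


def map_to_pathways(
    descriptions: Iterable[str],
    ec_to_pathways: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Group the EC numbers found in the descriptions by pathway identifier."""
    grouped: Dict[str, set] = {}
    for description in descriptions:
        for ec in extract_ec_numbers(description):
            for pathway in ec_to_pathways.get(ec, []):
                grouped.setdefault(pathway, set()).add(ec)
    return {pathway: sorted(ecs) for pathway, ecs in grouped.items()}


# Illustrative stand-in for a KEGG lookup table (deliberately tiny and partial).
EC_TO_PATHWAYS = {
    "1.1.1.1": ["map00010"],
    "2.7.1.1": ["map00010"],
}

descriptions = [
    "Alcohol dehydrogenase 1 (EC 1.1.1.1)",
    "Hexokinase (EC 2.7.1.1)",
    "Hypothetical protein",
]
print(map_to_pathways(descriptions, EC_TO_PATHWAYS))
```

Descriptions without an EC number simply contribute nothing, so sequences annotated only as hypothetical proteins drop out of the pathway view.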

6. Organism-specific databases
The NCBI NT/NR databases are split into 12 divisions to improve BLAST search performance. The 12 divisions are bacteria, invertebrates, mammals, phages, plants, primates, rodents, synthetic, unassigned, viruses, vertebrates and environmental samples.




Thirteen analytic tools in PESTAS

1. Phred: Base calling
    Web site: http://www.phrap.org/phredphrapconsed.html
    Reference: http://genome.cshlp.org/content/8/3/175.full, http://genome.cshlp.org/content/8/3/186.full

2. Cross_match: Vector masking
    Web site: http://www.phrap.org/phredphrapconsed.html

3. SeqClean: Contamination trimming
    Web site: http://compbio.dfci.harvard.edu/tgi/software/

4. RepeatMasker: Repeat masking
    Web site: http://www.repeatmasker.org/

5. TGICL: Clustering and assembling
    Web site: http://compbio.dfci.harvard.edu/tgi/software/
    Reference: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/19/5/651

6. BLASTN: Search a nucleotide database using a nucleotide query (DeCypher (TimeLogic) HMM search)
    Web site: http://blast.ncbi.nlm.nih.gov/Blast.cgi
    Reference: http://nar.oxfordjournals.org/cgi/content/abstract/25/17/3389

7. BLASTX: Search protein database using a translated nucleotide query (DeCypher (TimeLogic) HMM search)
    Web site: http://blast.ncbi.nlm.nih.gov/Blast.cgi
    Reference: http://nar.oxfordjournals.org/cgi/content/abstract/25/17/3389

8. TBLASTX: Search translated nucleotide database using a translated nucleotide query (DeCypher (TimeLogic) HMM search)
    Web site: http://blast.ncbi.nlm.nih.gov/Blast.cgi
    Reference: http://nar.oxfordjournals.org/cgi/content/abstract/25/17/3389

9. UniProt: Protein knowledgebase
    Web site: http://www.uniprot.org/
    Reference: http://nar.oxfordjournals.org/cgi/content/abstract/33/suppl_1/D154

10. KEGG: Genomic information knowledgebase
    Web site: http://www.genome.jp/kegg/
    Reference: http://nar.oxfordjournals.org/cgi/content/abstract/28/1/27

11. InterProScan: Scan protein sequences against the protein signatures of the InterPro databases
    Web site: http://www.ebi.ac.uk/interpro/
    Reference: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/9/847

12. AutoSNP: Detect single nucleotide polymorphisms (SNPs)
    Web site: http://www.cerealsdb.uk.net/discover.htm
    Reference: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/19/3/421

13. Tandem Repeats Finder: Find tandem repeats
    Web site: http://tandem.bu.edu/trf/trf.html
    Reference: http://nar.oxfordjournals.org/cgi/content/abstract/27/2/573




Parameters used by the 13 analytic tools

■ Phred
   -cs, -trim_alt, -trim_cutoff (default value: 0.05), -trim_fasta, -qd, -sd
■ 1st cross_match
   -minmatch (default value: 7), -minscore (default value: 7), -screen
■ 2nd cross_match
   -minmatch (default value: 10), -minscore (default value: 14), -screen
■ 1st SeqClean
   -v (default value: UniVec), -s (default value: mito.nt,ecoli.nt,chloroplast), -c
■ RepeatMasker
   -pa, -nolow
■ 2nd SeqClean
   -o
■ TGICL
   -c, -l (default value: 50), -v (default value: 50), -q, -O '-b (default value: 65) -c (default value: 45) -o (default value: 50) -p (default value: 95)'
■ BLASTN
   -priority, -template tera-blastn, -query, -target (selected organism-specific database), -format tab, -qfilter f
■ BLASTX
   -priority, -template tera-blastx, -query, -target (selected organism-specific database), -format tab, -qfilter f
■ TBLASTX
   -priority, -template tera-tblastx, -query, -target (selected organism-specific database), -format tab, -qfilter f
■ KEGG
   -priority, -template tera-blastx, -query, -target (selected organism-specific database), -format tab, -qfilter f
■ UniProt
   -priority, -template tera-blastx, -query, -target (selected organism-specific database), -format tab, -qfilter f
■ InterProScan
   -cli, -i, -o, -seqtype n, -goterms, -iprlookup, -format html
■ AutoSNP
   Default command supplied by the AutoSNP provider.
■ Tandem Repeats Finder
   Default command supplied by the TRFinder provider.
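
As an illustration of how such defaults turn into command lines, the following sketch builds the argument lists for the two cross_match passes using the values listed above. The file names are hypothetical, the positional order (query file, then vector file) follows common cross_match usage and is an assumption here, and feeding the screened output of the first pass into the second is likewise assumed rather than stated by the source. The commands are only assembled and printed, not executed.

```python
import shlex
from typing import List


def cross_match_args(reads: str, vectors: str, minmatch: int, minscore: int) -> List[str]:
    """Assemble a cross_match argument list for vector screening."""
    return [
        "cross_match", reads, vectors,
        "-minmatch", str(minmatch),
        "-minscore", str(minscore),
        "-screen",
    ]


# Defaults taken from the parameter list; file names are placeholders.
first_pass = cross_match_args("reads.fasta", "vector.fasta", minmatch=7, minscore=7)
second_pass = cross_match_args("reads.fasta.screen", "vector.fasta", minmatch=10, minscore=14)

for args in (first_pass, second_pass):
    print(shlex.join(args))
```

Keeping the arguments as a list rather than a single string avoids shell-quoting problems if the list is later passed to a process launcher.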




ID structure

① EST project indicator (two letters)
② Organism type (a single letter)
③ Species type (three digits)
④ Tissue type (two letters)
⑤ Individual type (two letters)
⑥ Analysis count (two digits)
⑦ Read type (a single letter; C: contig, S: singleton)
⑧ Read ID (six digits)
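
The field layout above lends itself to a simple parser. The sketch below assumes that the eight fields are concatenated in the listed order without separators and that letters may be upper or lower case; neither detail is stated in the source. The pattern name, the function `parse_id` and the example identifier are made up for illustration.

```python
import re
from typing import Dict, Optional

# One named group per field, in the order listed above.
ID_PATTERN = re.compile(
    r"(?P<project>[A-Za-z]{2})"      # EST project indicator
    r"(?P<organism>[A-Za-z])"        # organism type
    r"(?P<species>\d{3})"            # species type
    r"(?P<tissue>[A-Za-z]{2})"       # tissue type
    r"(?P<individual>[A-Za-z]{2})"   # individual type
    r"(?P<analysis>\d{2})"           # analysis count
    r"(?P<read_type>[CS])"           # C: contig, S: singleton
    r"(?P<read_id>\d{6})"            # read ID
)


def parse_id(identifier: str) -> Optional[Dict[str, str]]:
    """Split an identifier into its fields, or return None if it does not fit."""
    match = ID_PATTERN.fullmatch(identifier)
    return match.groupdict() if match else None


# Made-up identifier that follows the assumed layout.
fields = parse_id("ESP001LFAA01C000123")
print(fields)
```

Returning `None` for a malformed identifier, rather than raising, makes it easy to filter a mixed list of identifiers in one pass.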