Today the re2c sources for the BLASTScanner application have been released as well. You can use them to modify the BLASTScanner source code. The Zip archive contains a makefile which should build on any computer architecture.
Direct download link src-release: BLASTScanner1.0-src.zip
Direct download link binary-release: BLASTScanner1.0.zip
On 23 December 2008 the first official release of BLASTScanner was published. BLASTScanner is a small, platform-independent console application which translates the output of the NCBI BLAST tool into database code. No installation is required: just download, unpack and run. A single C source file is compiled into a very fast and very small application (~30 KB). For instance, a 20 MB BLAST file can be translated in 2 seconds into database code suitable for piping into SQLite, MySQL or PostgreSQL databases. You can download binaries for Win32, Linux32, Linux64, Mac OS X, OSF1 and Solaris, as well as the C code for compilation on your platform, all together in a 79 KB Zip file from the SourceForge project page: http://sourceforge.net/projects/bioscanners/
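As a rough illustration of what "translating BLAST output into database code" can look like, the following Python sketch scans a few lines shaped like classic NCBI BLAST text output and prints SQL INSERT statements. It is not the BLASTScanner implementation, which is a single C file generated with re2c. The table name `blast_hits`, its columns, the regular expressions and the sample lines are all assumptions made for this example.

```python
import re

# Hypothetical patterns for three line types of classic BLAST text output.
QUERY_RE = re.compile(r"^Query=\s*(\S+)")
SUBJECT_RE = re.compile(r"^>\s*(\S+)")
SCORE_RE = re.compile(r"Score\s*=\s*([\d.]+)\s*bits.*Expect\s*=\s*(\S+)")


def sql_quote(text):
    """Quote a string as an SQL literal by doubling single quotes."""
    return "'" + text.replace("'", "''") + "'"


def blast_to_sql(lines):
    """Yield one INSERT statement per score line seen after a query and subject."""
    query = subject = None
    for line in lines:
        m = QUERY_RE.match(line)
        if m:
            # A new query starts; forget the previous subject.
            query, subject = m.group(1), None
            continue
        m = SUBJECT_RE.match(line)
        if m:
            subject = m.group(1)
            continue
        m = SCORE_RE.search(line)
        if m and query and subject:
            score = m.group(1)
            # The e-value is kept as text because forms such as "e-100"
            # are not valid numeric literals in SQL.
            evalue = m.group(2).rstrip(",")
            yield (
                "INSERT INTO blast_hits (query_id, subject_id, bit_score, evalue) "
                f"VALUES ({sql_quote(query)}, {sql_quote(subject)}, "
                f"{score}, {sql_quote(evalue)});"
            )


# Invented sample input, not taken from a real BLAST run.
sample = [
    "Query= seq1",
    "> gi|123|ref|NP_000001.1| hypothetical protein",
    " Score = 52.4 bits (124), Expect = 3e-07",
]
statements = list(blast_to_sql(sample))
for statement in statements:
    print(statement)
```

The printed statements could then be piped into a database client, which is the workflow described above.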
A problem with Java-based applications is the huge amount of memory required to run the scanner. We tested different settings of the maximum memory allocation pool using the command-line option -Xmx.
| Mode | 1 | 10 | 100 | 1000 | 10000 | Memory (MB) |
|---|---|---|---|---|---|---|
| java | 0.14 | 0.256 | 0.453 | 1.708 | 15.500 | 788 |
| java -Xmx16m | 0.174 | 0.260 | 0.436 | 1.728 | 15.598 | 290 |
| java -Xmx32m | 0.175 | 0.255 | 0.457 | 1.713 | 15.576 | 308 |
| java -Xmx64m | 0.175 | 0.254 | 0.456 | 1.713 | 15.460 | 339 |
| java -Xmx128m | 0.156 | 0.255 | 0.441 | 1.708 | 15.607 | 404 |
| java -Xmx256m | 0.156 | 0.253 | 0.421 | 1.714 | 15.520 | 532 |
| java -Xmx512m | 0.155 | 0.253 | 0.456 | 1.709 | 15.507 | 795 |
Comparison of 32- and 64-bit scanners generated with the tply lexer for Pascal (Free Pascal) and with the JFlex lexer for Java. The Java programs require about 800 MB of memory, whereas the Pascal programs require just 1 MB. However, the Java programs were faster on the complete Gene Ontology OBO file (about 250,000 lines).
| Mode/Lines | 1 | 10 | 100 | 1000 | 10000 | 100000 | 250000 |
|---|---|---|---|---|---|---|---|
| obo-plex32 | 0.001 | 0.001 | 0.003 | 0.015 | 0.142 | 1.334 | 3.623 |
| obo-plex64 | 0.001 | 0.001 | 0.003 | 0.015 | 0.160 | 1.448 | 4.038 |
| obo-java15_x64 | 0.140 | 0.135 | 0.133 | 0.206 | 0.623 | 1.297 | 2.241 |
| obo-java15 | 0.133 | 0.136 | 0.137 | 0.208 | 0.596 | 1.352 | 2.028 |
| obo-java16 | 0.134 | 0.132 | 0.134 | 0.215 | 0.465 | 1.117 | 1.943 |
Again, the same set of BLAST files was used to test a word-counting scanner. The Flex- and re2c-based scanners again performed best.
| Mode | 1 | 10 | 100 | 1000 | 10000 |
|---|---|---|---|---|---|
| wc-flex | 0.003 | 0.011 | 0.102 | 1.083 | 12.459 |
| wc-flexpp | 0.026 | 0.169 | 1.940 | 21.193 | 244.294 |
| wc-gcj-exe | 0.097 | 0.123 | 0.441 | 3.934 | 42.928 |
| wc-gcj | 0.087 | 0.307 | 2.875 | 30.163 | nd |
| wc-java14 | 0.153 | 0.259 | 0.481 | 1.748 | 15.965 |
| wc-java | 0.176 | 0.257 | 0.444 | 1.704 | 15.682 |
| wc-javaip14 | 0.122 | 0.345 | 2.774 | 28.982 | 329.265 |
| wc-javaip | 0.120 | 0.345 | 2.771 | 28.769 | 335.123 |
| wc-perl-hand | 0.006 | 0.018 | 0.155 | 1.590 | 18.132 |
| wc-perl-lex | 0.164 | 0.872 | 9.108 | 97.561 | nd |
| wc-plex64 | 0.008 | 0.044 | 0.476 | 5.106 | 58.264 |
| wc-plex | 0.006 | 0.043 | 0.433 | 4.589 | 55.773 |
| wc-re2c | 0.002 | 0.005 | 0.035 | 0.346 | 3.975 |
| wc-tcl8532 | 0.257 | 1.625 | 17.440 | 190.076 | nd |
| wc-tcl8564 | 0.183 | 1.071 | 11.550 | 126.854 | nd |
| wc-tcl | 0.401 | 2.327 | 25.212 | 274.777 | nd |
| wc-unix | 0.006 | 0.026 | 0.285 | 2.937 | 33.792 |
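For orientation, the kind of task measured here can be sketched as a small two-state scanner. The Python below only illustrates the counting logic (lines, words and characters, in the spirit of `wc`). It is not one of the benchmarked programs, and which counts those scanners actually reported is an assumption.

```python
def count_words(text):
    """Count lines, words and characters with a two-state scanner."""
    lines = words = chars = 0
    in_word = False  # state: are we currently inside a word?
    for ch in text:
        chars += 1
        if ch == "\n":
            lines += 1
        if ch.isspace():
            # Whitespace always ends the current word.
            in_word = False
        elif not in_word:
            # First non-space character after whitespace starts a new word.
            in_word = True
            words += 1
    return lines, words, chars


result = count_words("Query= seq1\nScore = 52.4 bits\n")
print(result)  # (2, 6, 30)
```

A generated scanner (Flex, re2c, JFlex) expresses the same two states as patterns for "word" and "whitespace" rather than as an explicit flag.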
We recently compared our newly generated BLAST scanners with currently available BLAST parsers from the BioJava project [1], the BioPerl project [2] and with the Zerg BLAST parser [3]. Those parsers were compared with our scanners, created either with C-based scanner generators like Re2c [4] and Flex [5] or with the Java-based scanner generator JFlex [6]. Whereas the parsers mentioned above require source code editing for parsing and analysing BLAST files, our scanners emit SQL code. The BLAST results can afterwards be analysed with a high-level language (SQL). Please note that the BioJava scanner does not work with current BLAST versions. The BLAST files were about 1 (small), 14 (medium) and 140 (large) MB in size.
| Mode | small | medium | large | Memory (MB) |
|---|---|---|---|---|
| blast-biojava | 2.019 | 8.055 | err | 1054 |
| blast-bioperl | 4.026 | 47.822 | nd | 21 |
| blast-flex | 0.051 | 0.567 | 4.237 | 20 |
| blast-jflex | 0.607 | 1.906 | 8.532 | 863 |
| blast-re2c | 0.027 | 0.283 | 2.094 | 19.7 |
| blast-zerg | 0.017 | 0.185 | 1.331 | 6.5 |
| blast-tclkit851 | 35.480 | nd | nd | 10.1 |
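To illustrate the point that emitted SQL can be analysed afterwards with plain SQL, the sketch below loads a few INSERT statements into an in-memory SQLite database and asks for the best hit per query. The `blast_hits` table, its columns and the sample rows are invented for this example and do not reflect the schema the scanners actually emit.

```python
import sqlite3

# Hypothetical scanner output: a table definition plus one INSERT per hit.
emitted_sql = """
CREATE TABLE blast_hits (query_id TEXT, subject_id TEXT, bit_score REAL, evalue REAL);
INSERT INTO blast_hits VALUES ('seq1', 'subjA', 52.4, 3e-07);
INSERT INTO blast_hits VALUES ('seq1', 'subjB', 210.0, 1e-55);
INSERT INTO blast_hits VALUES ('seq2', 'subjA', 33.1, 0.002);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(emitted_sql)

# Best hit (highest bit score) per query, expressed purely in SQL.
best_hits = conn.execute(
    """
    SELECT h.query_id, h.subject_id, h.bit_score
    FROM blast_hits AS h
    WHERE h.bit_score = (
        SELECT MAX(bit_score) FROM blast_hits WHERE query_id = h.query_id
    )
    ORDER BY h.query_id
    """
).fetchall()
conn.close()

for row in best_hits:
    print(row)
```

The same query should run largely unchanged on MySQL or PostgreSQL, since it uses only a correlated subquery and no SQLite-specific features.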
Sample: BLAST file with 1 to 10,000 result items
| Mode | 1 | 10 | 100 | 1000 | 10000 |
|---|---|---|---|---|---|
| flex | 0.003 | 0.015 | 0.199 | 1.979 | 25.548 |
| flex-tcl | 0.005 | 0.018 | 0.167 | 1.775 | 20.701 |
| gcj | 0.099 | 0.149 | 0.784 | 7.536 | 83.421 |
| gij | 0.109 | 0.482 | 4.917 | 51.939 | nd |
| java | 0.228 | 0.348 | 0.753 | 3.048 | 27.806 |
| javaip | 0.180 | 0.517 | 4.715 | 49.852 | nd |
| plex | 0.011 | 0.082 | 0.856 | 9.565 | 107.987 |
| perl | 0.031 | 0.050 | 0.235 | 2.280 | 23.916 |
| re2c | 0.004 | 0.012 | 0.076 | 0.765 | 8.438 |
| tcl | 1.702 | 12.524 | 140.249 | nd | nd |
The Re2c-based scanner is the fastest, but the setup and the coding are more complicated than for the other scanners.
Flex-based scanners are 2-3 times slower than Re2c-based scanners, regardless of whether an embedded Tcl interpreter is used for better string handling (flex-tcl). JFlex code executed with the Sun Java HotSpot virtual machine 1.5 (java), JFlex code compiled to machine code (java-gcj) and Plex-based scanners (sbs-plex = Pascal lex) are about 5 to 10 times slower than Re2c-based scanners. Interpreted Java code, executed either with the Sun interpreter (java-ip = "java -Xint") or with the GNU interpreter (java-gij), is about 50 times slower than Re2c code. The Tcl-based scanner is about 1000 times slower than the Re2c-based one. The Perl scanner is a line-based scanner and is therefore not able to do complicated scanning with more than two states or patterns on the same line.
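The factors quoted above can be rechecked against the timing table. The short Python sketch below divides each scanner's time by the re2c time. The numbers are copied from the 1000-item column of the table; choosing that column (the largest one for which every scanner except Tcl has a value) is an assumption made for this example, and the ratios shift somewhat for other columns.

```python
# Times for 1000 result items, copied from the timing table.
times_1000 = {
    "flex": 1.979, "flex-tcl": 1.775, "gcj": 7.536, "gij": 51.939,
    "java": 3.048, "javaip": 49.852, "plex": 9.565, "perl": 2.280,
    "re2c": 0.765,
}

# Slowdown factor of every scanner relative to the re2c baseline.
baseline = times_1000["re2c"]
slowdown = {name: t / baseline for name, t in times_1000.items()}

# Print the scanners from fastest to slowest.
for name, factor in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{name:9s} {factor:6.1f}x")
```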
