Our new project member, elchuxo, is currently working on a Perl lexer called pplex. The program will accept flex-like input files for scanner generation. Versions for the programming languages Tcl and Python will follow soon afterwards. The code for the Perl lexer is not yet released but is already in CVS: http://bioscanners.cvs.sourceforge.net/viewvc/bioscanners/lexer/perl/. The samples directory already contains a wc implementation. Comments and suggestions are welcome.
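For readers unfamiliar with flex-style input, a wc scanner boils down to a short list of pattern/action rules. The following is a minimal Python sketch of that idea only. It is not pplex code and does not show the pplex input syntax; the rule table and the function name are made up for illustration.

```python
import re

# Hypothetical rule table in the spirit of a flex input file:
# each entry is (pattern, name of the counter to increment or None).
RULES = [
    (r"\n", "lines"),
    (r"[^ \t\n]+", "words"),
    (r"[ \t]+", None),  # blanks only add to the character count
]

# One alternation with a capturing group per rule. The patterns are
# mutually exclusive, so first-match and longest-match behave the same here.
SCANNER = re.compile("|".join("(%s)" % pattern for pattern, _ in RULES))


def wc(text):
    """Count lines, words and characters by scanning the text token by token."""
    counts = {"lines": 0, "words": 0, "chars": 0}
    for match in SCANNER.finditer(text):
        counts["chars"] += len(match.group(0))
        # lastindex is the number of the group (rule) that matched
        counter = RULES[match.lastindex - 1][1]
        if counter:
            counts[counter] += 1
    return counts
```

Calling `wc("hello world\nfoo\n")` gives 2 lines, 3 words and 16 characters, the same figures the Unix wc tool reports for that input.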

D.

Today the re2c sources for the BLASTScanner application have been released as well. You can use them to modify the BLASTScanner source code. The Zip archive contains a makefile which should build on any computer architecture.

Direct download link src-release: BLASTScanner1.0-src.zip

Direct download link binary-release: BLASTScanner1.0.zip

On 23 December 2008 the first official release of BLASTScanner was published. BLASTScanner is a small, platform-independent console application which translates the output of the NCBI BLAST tool into database code. No installation is required: just download, unpack and run. A single C source file is compiled into a very fast and very small application (~30kb). For instance, a 20MB BLAST file can be translated in 2 seconds into database code suitable for piping into SQLite, MySQL or PostgreSQL databases. You can download binaries for Win32, Linux32, Linux64, Mac OS X, OSF1 and Solaris, as well as the C code for compilation on your own platform, all together in a 79kb Zip file from the SourceForge project page: http://sourceforge.net/projects/bioscanners/
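To illustrate what working with the generated database code can look like, here is a small Python sketch that loads SQL text into an in-memory SQLite database and queries it. The SQL actually emitted by BLASTScanner is not shown in this post, so the table name `hits`, its columns and the sample rows are purely hypothetical stand-ins.

```python
import sqlite3

# Hypothetical stand-in for scanner output; the real table layout
# produced by BLASTScanner may look quite different.
EMITTED_SQL = """
CREATE TABLE hits (query TEXT, subject TEXT, evalue REAL);
INSERT INTO hits VALUES ('q1', 's1', 1e-30);
INSERT INTO hits VALUES ('q1', 's2', 0.002);
INSERT INTO hits VALUES ('q2', 's1', 5e-10);
"""


def best_hits(sql_text, max_evalue):
    """Load the SQL text into SQLite and return hits at or below max_evalue."""
    con = sqlite3.connect(":memory:")
    con.executescript(sql_text)
    rows = con.execute(
        "SELECT query, subject FROM hits WHERE evalue <= ? ORDER BY evalue",
        (max_evalue,),
    ).fetchall()
    con.close()
    return rows
```

With the sample rows above, `best_hits(EMITTED_SQL, 1e-5)` returns the two hits with an e-value at or below the cutoff, best first. On the command line the same effect is usually achieved by piping the scanner output into the sqlite3 shell.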

There is the gplex scanner generator (http://plas.fit.qut.edu.au/gplex/) for the C# programming language. Thanks to the Mono runtime environment (http://www.mono-project.com/), C# code can be run on Win32, Linux and Mac OS X platforms. As our wc sample run below suggests, scanner code generated by gplex is slower than Java JIT-compiled code but faster than interpreted Java code. The Mono application requires much less memory.

Tool/Queries 1 10 100 1000 10000
wc-flex 0.005 0.011 0.087 0.912 10.426
wc-java 0.197 0.248 0.367 1.549 14.707
wc-java016 0.216 0.263 0.346 1.533 14.806
wc-java064 0.209 0.243 0.364 1.522 14.649
wc-java512 0.216 0.247 0.351 1.542 16.486
wc-javaip 0.193 0.422 2.808 28.353 331.753
wc-mono1.2 0.168 0.282 1.580 15.650 175.863
wc-mono2.0 0.147 0.267 1.676 15.870 187.633
wc-mono2.0-static 0.150 0.272 1.670 16.265 189.411
wc-re2c 0.004 0.006 0.038 0.382 4.301
wc-unix 0.010 0.044 0.453 4.782 54.704

wc-mono2.0-static means that an executable with the Mono runtime linked in was tested. The Mono frameworks tested were 1.2 and 2.0.1. Different options for Java memory were tested: java016 means that Java was run with the -Xmx16m option.

Mode/Lines 1 10 100 1000 10000 100000 250000 Memory (M
flex 0.001 0.001 0.001 0.004 0.015 0.182 0.313 2.
java15 0.132 0.135 0.132 0.242 0.623 1.273 2.130 79
java15_x64 0.137 0.133 0.133 0.233 0.581 1.240 2.288
java16 0.131 0.134 0.131 0.205 0.441 1.194 1.963 800
perl 0.031 0.031 0.031 0.032 0.066 0.384 0.961 7
plex2.0_32 0.001 0.001 0.002 0.011 0.109 1.011 2.765 0.8
plex2.2_32 0.001 0.001 0.003 0.012 0.123 1.136 3.136 1
plex2.2_64 0.002 0.001 0.003 0.016 0.155 1.479 3.808 1
tcl 0.090 0.089 0.092 0.123 0.439 3.392 8.724 6

A problem with Java-based applications is the huge amount of memory required to run the scanner. Different settings of the maximum memory allocation pool were tested using the command-line option -Xmx.

Mode 1 10 100 1000 10000 Memory (Mb)
java 0.14 0.256 0.453 1.708 15.500 788
java -Xmx16m 0.174 0.260 0.436 1.728 15.598 290
java -Xmx32m 0.175 0.255 0.457 1.713 15.576 308
java -Xmx64m 0.175 0.254 0.456 1.713 15.460 339
java -Xmx128m 0.156 0.255 0.441 1.708 15.607 404
java -Xmx256m 0.156 0.253 0.421 1.714 15.520 532
java -Xmx512m 0.155 0.253 0.456 1.709 15.507 795

Comparison of 64-bit and 32-bit scanners generated with the tply lexer for Pascal (Free Pascal) and with the JFlex lexer for Java. The Java programs require about 800Mb of memory, whereas the Pascal programs require just 1Mb. However, the Java programs were faster with the complete Gene Ontology OBO file (about 250000 lines).

Mode/Lines 1 10 100 1000 10000 100000 250000
obo-plex32 0.001 0.001 0.003 0.015 0.142 1.334 3.623
obo-plex64 0.001 0.001 0.003 0.015 0.160 1.448 4.038
obo-java15_x64 0.140 0.135 0.133 0.206 0.623 1.297 2.241
obo-java15 0.133 0.136 0.137 0.208 0.596 1.352 2.028
obo-java16 0.134 0.132 0.134 0.215 0.465 1.117 1.943

Again the same set of BLAST files was used to test a word-counting scanner. The Flex-based and re2c-based scanners again performed best.

Mode 1 10 100 1000 10000
wc-flex 0.003 0.011 0.102 1.083 12.459
wc-flexpp 0.026 0.169 1.940 21.193 244.294
wc-gcj-exe 0.097 0.123 0.441 3.934 42.928
wc-gcj 0.087 0.307 2.875 30.163 nd
wc-java14 0.153 0.259 0.481 1.748 15.965
wc-java 0.176 0.257 0.444 1.704 15.682
wc-javaip14 0.122 0.345 2.774 28.982 329.265
wc-javaip 0.120 0.345 2.771 28.769 335.123
wc-perl-hand 0.006 0.018 0.155 1.590 18.132
wc-perl-lex 0.164 0.872 9.108 97.561 nd
wc-plex64 0.008 0.044 0.476 5.106 58.264
wc-plex 0.006 0.043 0.433 4.589 55.773
wc-re2c 0.002 0.005 0.035 0.346 3.975
wc-tcl8532 0.257 1.625 17.440 190.076 nd
wc-tcl8564 0.183 1.071 11.550 126.854 nd
wc-tcl 0.401 2.327 25.212 274.777 nd
wc-unix 0.006 0.026 0.285 2.937 33.792

We recently compared our newly generated BLAST scanners with currently available BLAST parsers from the BioJava project [1], the BioPerl project [2] and with the Zerg BLAST parser [3]. Those parsers were compared with our scanners, created either with C-based scanner generators like Re2c [4] and Flex [5] or with the Java-based scanner generator JFlex [6]. Whereas the parsers mentioned above require source code editing for parsing and analysing BLAST files, our scanners emit SQL code. The BLAST results can afterwards be analysed with a high-level language (SQL). Please note that the BioJava scanner does not work with current BLAST versions. The BLAST files were about 1 (small), 14 (medium) and 140 (large) Mb in size.

Mode small medium large memory(Mb)
blast-biojava 2.019 8.055 err 1054
blast-bioperl 4.026 47.822 nd 21
blast-flex 0.051 0.567 4.237 20
blast-jflex 0.607 1.906 8.532 863
blast-re2c 0.027 0.283 2.094 19.7
blast-zerg 0.017 0.185 1.331 6.5
blast-tclkit851 35.480 nd nd 10.1

Sample: BLAST file with 1 to 10000 result items

Mode 1 10 100 1000 10000
flex 0.003 0.015 0.199 1.979 25.548
flex-tcl 0.005 0.018 0.167 1.775 20.701
gcj 0.099 0.149 0.784 7.536 83.421
gij 0.109 0.482 4.917 51.939 nd
java 0.228 0.348 0.753 3.048 27.806
javaip 0.180 0.517 4.715 49.852 nd
plex 0.011 0.082 0.856 9.565 107.987
perl 0.031 0.050 0.235 2.280 23.916
re2c 0.004 0.012 0.076 0.765 8.438
tcl 1.702 12.524 140.249 nd nd

The Re2c-based scanner is the fastest, but the setup and the coding are more complicated than for the other scanners.
Flex-based scanners are 2-3 times slower than Re2c-based scanners, regardless of whether there is an embedded Tcl interpreter for better string handling (flex-tcl). JFlex code executed with the Sun Java HotSpot virtual machine 1.5 (java), JFlex code compiled to machine code (java-gcj) and Plex-based scanners (sbs-plex = Pascal lex) are about 5 to 10 times slower than Re2c-based scanners. Interpreted Java code, executed either with the Sun interpreter (java-ip = “java -Xint”) or with the GNU interpreter (java-gij), is about 50 times slower than Re2c code. The Tcl-based scanner is about 1000 times slower than the Re2c-based one. The Perl scanner is a line-based scanner and is therefore not able to do complicated scanning with more than two states or patterns on the same line.
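The factors quoted above can be checked against the table. As a rough sketch, the following Python snippet divides each timing in the 1000-item column by the re2c timing; the tcl run has no value in that column and is left out. The exact factors shift somewhat depending on which column is used.

```python
# Timings copied from the 1000-item column of the table above.
TIMES_1000 = {
    "flex": 1.979,
    "flex-tcl": 1.775,
    "gcj": 7.536,
    "gij": 51.939,
    "java": 3.048,
    "javaip": 49.852,
    "plex": 9.565,
    "perl": 2.280,
    "re2c": 0.765,
}


def slowdown_vs(baseline, times):
    """Return how many times slower each scanner is than the baseline."""
    base = times[baseline]
    return {name: value / base for name, value in times.items()}


ratios = slowdown_vs("re2c", TIMES_1000)
for name, factor in sorted(ratios.items(), key=lambda item: item[1]):
    print("%-9s %6.1f" % (name, factor))
```

For this column the flex variants come out at roughly 2 to 3 times the re2c time, gcj and plex at roughly 10 to 13 times, and the two Java interpreters at well over 50 times.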