Tips and tricks for software engineering in bioinformatics (talk by Joel Dudley)
February 11, 2009
Joel Dudley, co-founder of MacResearch and a student in the Stanford Biomedical Informatics program, gave a quick seminar today on how to program effectively for bioinformatics. The main point is that to be a good bioinformatician, you need to build up your toolbox, be aware of what’s out there, and use and integrate existing tools to do more powerful work. To do this, he gives the following suggestions:
1. Learn UNIX. It’s quick, it’s powerful, it’s easy to learn. What often takes several lines to code in a scripting language can usually be reduced to a single line on the command line.
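As a rough illustration of tip 1 (my own example, not from the talk; the file name and column layout are hypothetical): counting how often each value appears in the first column of a tab-delimited file is a single shell pipeline, `cut -f1 hits.tsv | sort | uniq -c | sort -rn`, while a roughly equivalent script takes several lines.

```python
from collections import Counter


def count_first_column(lines):
    """Count occurrences of the first tab-separated field in each line."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        counts[line.split("\t")[0]] += 1
    # Most frequent first, like the trailing `sort -rn` in the pipeline
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical input: an identifier, then a score
    example = ["gene1\t0.9\n", "gene2\t0.8\n", "gene1\t0.7\n"]
    for identifier, n in count_first_column(example):
        print(n, identifier)
```

The script is not wrong, just longer; the one-liner is usually what you want for a quick look at a file.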
2. Be a jack of all trades, but master of ONE. That is, be familiar with most programming languages, but be really good at one of them. In the hierarchy of languages, VB and C are more “primitive” while Ruby and Python are more “advanced” – he recommends starting with one of the more advanced languages if you are new to programming. Out of Ruby and Python, Python will probably give you more bang for your buck, due to the smorgasbord of libraries available and its broad acceptance (e.g. academic labs, Google). In addition, there are lots of bridges between languages, such as Jython (Java and Python) and JRuby (Java and Ruby), so expert knowledge of one is usually sufficient for you to make a lot of things work practically everywhere.
3. Don’t reinvent the wheel. “Frameworks are your friends.” Take advantage of large existing projects like BioPython/Perl/Ruby/Java, Django, Rails, etc., which contain lots of ready-to-go code for practically everything. Use the internet to find existing code solutions – e.g. Koders is like a Google search for open source code on the web.
4. Learn one text editor really well. Take your pick of Emacs, vi, or a GUI-based editor like TextMate for Macs. The advantage of Emacs and vi is that they will be installed on pretty much any system you come across.
6. Don’t be afraid to use more than 3 letters to define a variable. Having short variable names won’t make the code run faster. It will, however, make the code more difficult for others (and you, 3 months from now) to understand!
7. Balance architecture and accomplishment. You may be tempted to create something that is complete, elegant, and perfectly structured. This will likely be a waste of time. It’s ok to sacrifice a little bit of structure to get something that actually works.
8. Automate documentation. Documentation is necessary, but it’s a pain to write. So come up with a convention for your headers and make it automatic. Use available tools like Doxygen, JavaDoc, and RDoc, many of which are free.
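To make tip 8 concrete, here is a minimal sketch in Python. The talk named Doxygen, JavaDoc, and RDoc; Python docstrings with the standard `pydoc` and `inspect` modules are my stand-in for the same idea, and the function and its header layout are hypothetical. The point is only that a consistent header convention lets a tool pull the documentation out for you.

```python
import inspect


def gc_content(sequence):
    """Return the fraction of G and C bases in a nucleotide sequence.

    Args:
        sequence: DNA string such as "ATGC"; case is ignored.

    Returns:
        A float between 0.0 and 1.0, or 0.0 for an empty sequence.
    """
    if not sequence:
        return 0.0
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


if __name__ == "__main__":
    # The same text that help(gc_content) or `python -m pydoc` would display
    print(inspect.getdoc(gc_content))
```

Once every function carries a header in the same shape, generating browsable documentation is a matter of running the tool rather than writing anything extra.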
The above are generic for academic-level software engineering. Some tips that more specifically address high-throughput biomedical computing:
9. Kill the flat file (sort of). This is the most common file format used in bioinformatics, but it hardly lends itself to efficient computation. A common task is to read in the data and store it keyed so that specific pieces can be looked up later. Hate databases? Cringe at SQL? If you can represent your data as key/value pairs, consider using an embeddable database like the open source BerkeleyDB (now licensed by Oracle), which requires no administration. If you don’t mind SQL, but hate the administration, SQLite allows you to create embedded, serverless databases. Other options that go beyond the relational database concept are CouchDB (“a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”) and Hypertable (“a high performance distributed data storage system”).
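For tip 9, a minimal sketch of the SQLite route using Python’s built-in `sqlite3` module. The table name, column names, and sample records are my own assumptions for illustration, not anything from the talk; the idea is just to load records parsed from a flat file once and then look them up by key.

```python
import sqlite3


def build_index(records, db_path=":memory:"):
    """Load (identifier, sequence) pairs into a keyed SQLite table."""
    connection = sqlite3.connect(db_path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sequences (id TEXT PRIMARY KEY, seq TEXT)"
    )
    # INSERT OR REPLACE keeps the last record seen for a duplicate identifier
    connection.executemany(
        "INSERT OR REPLACE INTO sequences (id, seq) VALUES (?, ?)", records
    )
    connection.commit()
    return connection


def lookup(connection, identifier):
    """Return the sequence stored under identifier, or None if absent."""
    row = connection.execute(
        "SELECT seq FROM sequences WHERE id = ?", (identifier,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    # Hypothetical records, as if parsed from a flat file
    db = build_index([("gene1", "ATGCCG"), ("gene2", "TTGACA")])
    print(lookup(db, "gene2"))
    db.close()
```

Passing a file path instead of `":memory:"` keeps the index on disk, so the flat file only has to be parsed once. There is still no server to administer.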
10. New ways to do parallel computing. Determine whether your tasks are loosely coupled (independent) or tightly coupled. Although personal computers and laptops are coming out with more cores, most programs only use one at a time. Find ways to utilize idle cores – e.g. there is a way to do this in R. Think in terms of MapReduce. Take advantage of cloud computing, like Amazon’s EC2. Use platforms like Hadoop and Disco to build parallel computing applications. A cool example of this is CloudBurst (the cloudburst-bio project), which uses MapReduce for massively parallel mapping of next-generation sequencing reads to a reference genome.
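To show what “thinking in terms of MapReduce” from tip 10 can look like, here is a small single-process sketch that counts k-mers. The function names and the k-mer task are my own choices for illustration; this is not Hadoop, Disco, or CloudBurst code, only the shape of the map, shuffle, and reduce steps that those platforms distribute for you.

```python
from collections import defaultdict


def map_kmers(sequence, k):
    """Map step: emit a (k-mer, 1) pair for every window in one sequence."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_kmers(sequences, k):
    # Each map_kmers call is independent of the others (loosely coupled),
    # so this loop is the part a framework could spread across cores or nodes.
    emitted = []
    for sequence in sequences:
        emitted.extend(map_kmers(sequence, k))
    return reduce_counts(shuffle(emitted))


if __name__ == "__main__":
    print(count_kmers(["ACGTAC", "GTACGT"], 2))
```

If a problem can be written this way, with a map step that looks at one record at a time, it is a good candidate for idle cores or a cluster.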
11. Embrace hardware. New (and old) hardware is available that can give you significant speedups in biomedical computation, notably graphics processing units (GPUs), which have been used to accelerate molecular dynamics. Hardware vendors like Nvidia are starting to respond; you can now get GPU workstations like Nvidia’s Tesla personal supercomputer offering speedups of several hundredfold over traditional workstations. So if you don’t want to utilize the cloud, you can get an affordable and powerful cluster that fits on top of your desk. Aside from GPUs, there are field-programmable gate arrays (FPGAs) – chips you can program after manufacturing.
12. Playing nice with others. Think a bit about data exchange formats – but definitely use them! Suggestions are JSON, YAML, and, of course, XML. When working in teams, use an “agile software development” strategy – mainly many fast iterations of the specification-development-feedback cycle. Use tools to automate the development process, such as unit testing and the granddaddy, “make”. Tools like Basecamp (and perhaps Science 2.0 versions like Laboratree) can help with the more general project management aspects.
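On the data exchange point in tip 12, a minimal sketch of a JSON round trip with Python’s standard `json` module. The record and its field names are hypothetical; the appeal is that whoever receives the file can parse it with a stock library instead of writing a custom parser.

```python
import json


def export_record(record):
    """Serialize a record to JSON text with stable key order."""
    return json.dumps(record, sort_keys=True, indent=2)


def import_record(text):
    """Parse JSON text back into Python dicts, lists, strings, and numbers."""
    return json.loads(text)


if __name__ == "__main__":
    # Hypothetical annotation record; field names are illustrative only
    record = {"gene": "gene1", "chromosome": "2", "positions": [101, 250]}
    text = export_record(record)
    print(text)
    # The round trip doubles as a tiny automated check
    assert import_record(text) == record
```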
Focus on the goal (biology or medicine).
Don’t be clever (you’ll trick yourself).
Value your time.
Outsource everything but genius.
Use tools available to you.
And have fun. ;)
Many thanks to Joel for the tips! He mentioned uploading the presentation to Slideshare so I’ll include a link to the slides once they’re up.
Update: slides for Joel’s presentation are up on Slideshare.