Tips and tricks for software engineering in bioinformatics (talk by Joel Dudley)

Photo by ladyada on Flickr

Joel Dudley, co-founder of MacResearch and a student in the Stanford Biomedical Informatics program, gave a quick seminar today on how to program effectively for bioinformatics. The main point is that to be a good bioinformatician, you need to build up your toolbox, be aware of what’s out there, and use and integrate existing tools to do more powerful work. To do this, he gives the following suggestions:

1. Learn UNIX. It’s quick, it’s powerful, it’s easy to learn. What often takes several lines to code in a scripting language can usually be reduced to a single line on the command line.

2. Be a jack of all trades, but master of ONE. That is, be familiar with most programming languages, but be really good at one of them. In the hierarchy of languages, VB and C are more “primitive” while Ruby and Python are more “advanced” – he recommends starting with one of the more advanced languages if you are new to programming. Out of Ruby and Python, Python will probably give you more bang for your buck, due to the smorgasbord of libraries available and its broad acceptance (e.g. academic labs, Google). In addition, there are lots of bridges between languages, such as Jython (Java and Python) and JRuby (Java and Ruby), so expert knowledge of one is usually sufficient to make a lot of things work practically everywhere.

3. Don’t reinvent the wheel. “Frameworks are your friends.” Take advantage of large existing projects like BioPython/Perl/Ruby/Java, Django, Rails, etc., which contain lots of ready-to-go code for practically everything. Use the internet to find existing code solutions – e.g. Koders is like a Google search for open source code on the web.

4. Learn one text editor really well. Take your pick of Emacs, vi, or a GUI-based editor like TextMate for Macs. The advantage of emacs and vi is that they will be installed on pretty much any system you come across.

5. “Don’t trust yourself”, i.e. use code versioning. Examples are Subversion, CVS, and git. You can even outsource your code hosting with GitHub. Combine this with project management in GForge.

6. Don’t be afraid to use more than 3 letters to define a variable. Having short variable names won’t make the code run faster. It will, however, make the code more difficult for others (and you, 3 months from now) to understand!

Photo by archeon on Flickr

7. Balance architecture and accomplishment. You may be tempted to create something that is complete, elegant, and perfectly structured. This will likely be a waste of time. It’s ok to sacrifice a little bit of structure to get something that actually works.

8. Automate documentation. Documentation is necessary, but it’s a pain to write. So come up with a convention for your headers and make it automatic. Use available tools like Doxygen, JavaDoc, and RDoc, many of which are free.
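
As a rough illustration of a header convention that tools can pick up, here is a minimal Python sketch. The function `gc_content` and its docstring layout are my own hypothetical example, not something from the talk; the talk named Doxygen, JavaDoc, and RDoc, and Python's built-in docstrings with the standard `pydoc` module play a similar role.

```python
import inspect


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence.

    Args:
        sequence: DNA string; case is ignored.

    Returns:
        A float between 0.0 and 1.0, or 0.0 for an empty sequence.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


# The same docstring that documents the code for a human reader can be
# pulled out programmatically, which is what documentation tools rely on.
print(inspect.getdoc(gc_content))
```

If this lived in a module, running `python -m pydoc` on that module would render the same text as a help page, so the only work is sticking to the convention.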

The above are generic for academic-level software engineering. Some tips that more specifically address high-throughput biomedical computing:

9. Kill the flat file (sort of). This is the most common file format used in bioinformatics, but it hardly lends itself to efficient computation. A common task is to read in the data and store it keyed so that specific pieces can be looked up later. Hate databases? Cringe at SQL? If you can represent your data as key/value pairs, consider using an embeddable database like the open source BerkeleyDB (now licensed by Oracle), which requires no administration. If you don’t mind SQL, but hate the administration, SQLite allows you to create embedded, serverless databases. Other options that go beyond the relational database concept are CouchDB (“a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”) and Hypertable (“a high performance distributed data storage system”).
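
To make the SQLite suggestion concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table layout, the function names, and the three sample records are assumptions made up for illustration; a real flat file would be read from disk rather than from a list.

```python
import sqlite3

# Illustrative tab-separated records: gene ID, symbol, chromosome.
FLAT_FILE_LINES = [
    "672\tBRCA1\t17",
    "675\tBRCA2\t13",
    "7157\tTP53\t17",
]


def build_gene_index(lines, db_path=":memory:"):
    """Load tab-separated records into a SQLite table keyed by gene ID."""
    connection = sqlite3.connect(db_path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS genes ("
        "gene_id TEXT PRIMARY KEY, symbol TEXT, chromosome TEXT)"
    )
    records = (line.rstrip("\n").split("\t") for line in lines)
    # The 'with' block commits the inserts as a single transaction.
    with connection:
        connection.executemany(
            "INSERT OR REPLACE INTO genes VALUES (?, ?, ?)", records
        )
    return connection


def lookup_gene(connection, gene_id):
    """Return (symbol, chromosome) for a gene ID, or None if it is absent."""
    return connection.execute(
        "SELECT symbol, chromosome FROM genes WHERE gene_id = ?", (gene_id,)
    ).fetchone()


connection = build_gene_index(FLAT_FILE_LINES)
print(lookup_gene(connection, "7157"))
```

Passing a file path instead of `":memory:"` keeps the index on disk, so the flat file only has to be parsed once and later lookups go through the primary key instead of rescanning the file.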

10. New ways to do parallel computing. Determine whether your tasks are loosely coupled (independent) or tightly coupled. Although personal computers and laptops are coming out with more cores, most programs only use one at a time. Find ways to utilize idle cores – e.g. there is a way to do this in R. Think in terms of MapReduce. Take advantage of cloud computing, like Amazon’s EC2. Use platforms like Hadoop and Disco to build parallel computing applications. A cool example of this is Cloudburst-Bio, a massively parallel project that uses MapReduce to map next-generation sequencing reads to a reference genome.
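
For loosely coupled tasks, thinking in map and reduce steps can be sketched on a single multi-core machine with Python's standard `multiprocessing` module. This is only a toy under my own assumptions (the function names and the base-counting task are hypothetical), not how Hadoop or Disco work internally.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool


def count_bases(sequence):
    """Map step: count each base in one sequence, independently of the others."""
    return Counter(sequence.upper())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def total_base_counts(sequences, processes=None):
    """Count bases across all sequences using a pool of worker processes.

    With processes=None the pool uses one worker per available core.
    """
    with Pool(processes=processes) as pool:
        partial_counts = pool.map(count_bases, sequences)
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    reads = ["ACGTAC", "GGGTTA", "acgtnn"]
    print(total_base_counts(reads))
```

Because each call to `count_bases` depends only on its own input, the work can be spread over idle cores without any coordination. For inputs this small the process startup cost outweighs the gain, so the pattern pays off only when each task is substantial.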

via PCNews

11. Embrace hardware. New (and old) hardware is available that can give you significant speedups in biomedical computation, notably graphics processing units (GPUs), which have been used to accelerate molecular dynamics. Hardware vendors like Nvidia are starting to respond; you can now get GPU workstations like Nvidia’s Tesla personal supercomputer, which offer speedups of hundreds of times over traditional workstations. So if you don’t want to utilize the cloud, you can get an affordable and powerful cluster that fits on top of your desk. Aside from GPUs, there are field programmable gate arrays (FPGAs) – chips you can program after manufacturing.

12. Playing nice with others. Think a bit about data exchange formats – but definitely use them! Suggestions are JSON, YAML, and, of course, XML. When working in teams, use an “agile software development” strategy – mainly many fast iterations of the specification-development-feedback cycle. Use tools to automate the development process, such as unit testing and the granddaddy, “make”. Tools like Basecamp (and perhaps Science 2.0 versions like Laboratree) can help with the more general project management aspects.
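
Two of these suggestions, a standard exchange format and unit testing, fit together in a few lines. The sketch below uses Python's standard `json` and `unittest` modules; the record fields and function names are hypothetical and chosen only for illustration.

```python
import json
import unittest


def record_to_json(record):
    """Serialize a record to JSON with sorted keys for stable output."""
    return json.dumps(record, sort_keys=True)


def record_from_json(text):
    """Parse a JSON string back into a Python dictionary."""
    return json.loads(text)


class RoundTripTest(unittest.TestCase):
    def test_round_trip_preserves_record(self):
        # A made-up record; any JSON-compatible dictionary would do.
        record = {"sample_id": "S1", "gene": "TP53", "expression": 2.5}
        self.assertEqual(record_from_json(record_to_json(record)), record)


if __name__ == "__main__":
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RoundTripTest)
    unittest.TextTestRunner().run(suite)
```

A collaborator working in Ruby or R can read the same JSON without knowing anything about the Python side, and the test catches the day someone adds a field that does not survive serialization.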

————————————————-

In summary:

Focus on the goal (biology or medicine).
Don’t be clever (you’ll trick yourself).
Value your time.
Outsource everything but genius.
Use tools available to you.
And have fun. ;)

Many thanks to Joel for the tips! He mentioned uploading the presentation to Slideshare so I’ll include a link to the slides once they’re up.

Update: slides for Joel’s presentation are up on Slideshare.

Update 2/27/09: Episode #13 of the Coast 2 Coast Bio podcast discusses these points in much more depth. Thanks, Deepak and Hari!

15 Responses to Tips and tricks for software engineering in bioinformatics (talk by Joel Dudley)

  1. nsaunders says:

    All excellent tips.

  2. Chris Lasher says:

    Nice post. I’d probably scratch #4 after being convinced by a friend that the amount of productivity you gain from learning Vim/Emacs is negligible compared to the amount of productivity you lose learning Vim/Emacs. (I say this as a hardcore Vim user.) I’d also scratch #11 in light of #10. Throw it up in the cloud. Forget the cost of maintaining your own hardware and the value lost in all the time you don’t actually use it.

    (Also, props to the photo from Lady Ada.)

  3. shwu says:

    Chris, sure, there are definitely arguments both ways. For more extensive file editing or testing it makes sense to use an editor with more features. But if I just want to change one part of one line, I can do this in vi in 2 seconds, which is really convenient given that I do most of my work in a shell anyway.

    Also, on #11 vs #10 – Seb Paquet just posted some caveats and suggestions for robust cloud computing. One problem with throwing everything into the cloud is that you no longer have control over it. Of course, this applies more to data than to hardware, which I agree you rarely want to deal with.

  4. Khader Shameer says:

    5/5: Short, but explained the heart of the matter in detail!!

  5. Nice post; very solid. I’m not entirely sold on frameworks yet – the concept is fine, but implementation often falls short. Though this may be too obvious to include, I also recommend learning and using an IDE, especially for compiled languages. It has been a huge time-saver for me.

  6. Pingback: Turulcsirip - attilacsordas

  7. neurospora says:

    Nice summary, but I’d add Perl to #2.

  8. Harry Mangalam says:

    Very good. I’d also add Perl to #2 (BioPerl still leads BioPython in list traffic and coverage) and a good grasp of the primitive but useful-in-a-Paleolithic-way bash.
    Also, the R project and the based-on-R BioConductor, tho some find it very difficult to learn, is very powerful and has some great built-in data visualization tools (ggobi, for one).

    Speaking of which, you completely left out Data/Information Visualization.
    Here’s a link to some of my favorite quick & dirty vis tools: tinyurl.com/c78oj6

  9. Harry Mangalam says:

    The URL should be:
    http://tinyurl.com/dyne5k

  10. shwu says:

    Thanks all for the comments!

    neurospora and Harry, I agree Perl is probably still the classic bioinformatics language, but which language you end up using the most is probably influenced most by what is dominant in your circle of colleagues (a combination of peer pressure and the fact that there will be plenty of people to help if you run into trouble). In Silicon Valley it is very much a Python-centric world (and becoming Ruby-ish). ;)

    And Harry, yes, much can be said about data visualization and should be – when done well it is amazing how it aids understanding, and when done poorly it is amazing how it obscures it. That’s probably best left for its own dedicated post, though. Your page on data analysis is a good resource!

  11. Pingback: Coast to Coast Bio Podcast » Blog Archive » Episode 13: Better engineering in Bioinformatics. Joel Dudley as blogged by Shirley Wu

  12. Chris Lasher says:

    Shirley, I forgot to mention that I produced two screencast series that would help with #1, working with the shell (http://showmedo.com/videos/series?id=94), and #5, revision control with Subversion (http://showmedo.com/videos/series?id=95).

  13. Pingback: Developing effective bioinformatics programming skills

  14. Pingback: 7 Tips for Managing Software Developers Effectively – Site Title
