Search This Blog

Tuesday, July 31, 2012

The bioinformatics sysadmin craftmanship: An EMBOSS 6.5 production server install: Part 1: Installing from sources

Every 15th of July, the EMBOSS team at EBI releases a fresh version of the European Molecular Biology Open Software Suite (EMBOSS). Started and shaped by the EMBnet community, EMBOSS is one of the most versatile systems to perform sequence analysis and a variety of bioinformatics pipeline tasks, as it copes with a variety of file formats and contains a plethora of applications. 

Most of the procedures outlined here are described in more detail by the 'EMBOSS User's Guide: Practical Bioinformatics' book, written by the EMBOSS authoring team. While this is an excellent publication, books quickly get out of date as software evolves. In addition, the on-line EMBOSS administration documentation is out of date. As a result, I felt that this two part article series (Part 2 covers the task of enabling data access in EMBOSS (including local flatfile database setup) will be a quick startup guide for those that have to administer EMBOSS installations.
This year the version clock has turned into 6.5. In this Part, I shall be going through an installation from the sources on a production Linux server, covering all aspects of the system configuration, including the formatting of databases.  There might be binary/prebuilt packages available for your Linux distribution. However, I always maintain the principle of building the latest binaries from the sources. This gives you the latest and the greatest with a little bit of extra effort.

Most of the steps below can be automated with simple scripts. However, the process of going through a manual installation of EMBOSS should make you aware of the different system components. Once you have an understanding of the system, it is then wise to automate/script these steps.  

What kind of hardware you will need 

EMBOSS is a  fairly modest system to install in terms of hardware requirements. The only thing that can draw the hardware envelope is how much data you would like to index. If your server should host/index the entire EMBL/Genbank databases, you will need plenty of disk space (I advise you to have at least 3-4 Tbytes to spare, yes you read right). 

Memory and CPU wise, 8 cores with 32-64 Gigs of RAM should be enough to keep most user loads happy (30-40 users) on a production server setup. What you do draws the map for the hardware requirements. If you are trying to do a global alignment of large sequences, you might easily eat up 64 Gigs of RAM. In contrast, basic sequence processing could also be performed on a dual core Laptop with 4 Gigs of RAM. By and large, the figures I suggest here should meet most requirements. If you have the task of specing an EMBOSS server, your best bet to get it right is to talk to your scientists and ask for what sort of operations they would be performing, to get an accurate picture of the hardware specs.     

The downloading of the sources

Prior starting, I ensure that my Linux system has most of the development libraries installed. Some EMBOSS applications can be sensitive to missing libraries like libpng, libjpeg, etc. You will also need to ensure that you have your C/C++ compilers installed (gcc/g++).

EMBOSS is a large system. Apart from the core EMBOSS packages, there is an entire array of third party applications that are bundled together with the EMBOSS core applications (some examples: PHYLIP, MEME, IPRSCAN). These are the EMBASSY tools. This is a detail for most users, who collectively refer to the entire package as EMBOSS. However, when you go to download the source EMBOSS tarball, it does not contain these additional packages. This means that if you want to have the full array of EMBOSS/EMBASSY applications, you will have to go through the following steps:

1)Go to the main EMBOSS FTP download server and I download the latest EMBOSS tarball (normally named emboss-latest.tar.gz). In my case, it points to the EMBOSS-6.5.7. 

2)After downloading this to my source dir, I unpack it by doing a:

tar -xvfz EMBOSS-6.5.7.tar.gz

3)I then cd to the EMBOSS-6.5.7 dir and at the top level of the sources, I do a:
mkdir embassy

4)Under the newly created embassy directory, I then download the tarballs of the EMBASSY packages (version info will vary, but the base name of each package should be more or less the same): CBSTOOLS, CLUSTALOMEGA, DOMAINATRIX, DOMALIGN, DOMSEARCH, EMNU, ESIM4, HMMER, IPRSCAN, MEME, MSE, PHYLIPNEW, SIGNATURE, STRUCTURE, TOPO, VIENNA .
I unpack each of the tarballs with the same command as step 2 under the embassy subdirectory. Once I am done, I can delete the remaining *.tar.gz files.

5)At this point, it might be wise to create a tarball with all the sources properly laid out under the embassy subdirectory by going above the EMBOSS-6.5.7 directory and doing a:
tar -cvf embossembassy65.tar EMBOSS-6.5.7/

This will create the file embossembassy65.tar. This is handy in case you wish to erase the whole source tree and start from scratch and/or repeating the installation on other systems by not having to go through the steps 1-4 again to assemble the source tree.

Configure and compile

We are now ready to start configuring the various packages and eventually compiling them into the EMBOSS/EMBASSY binary applications we shall be using. In my system, I choose that the directory holding the binaries and the produced libraries should be under:


You are free to choose what you wish on your system. 

6)Thus, I cd into the top level of the EMBOSS-6.5.7 directory and I issue a:
./configure --prefix=/usr/lsc/emboss; make; make install

In one sentence, this says to the config process where to place the produced files and instructs the system to compile and place the produced applications under that location.  Grub a cup of tea/coffee/beer as this will take some time. If it all goes well, and you see no errors in the terminal output, you should see the first installed binary applications under the /usr/lsc/emboss/bin directory. In my base, I verify that I have functioning applications by executing embossversion:
Report the current EMBOSS version number


This means that I am on good ground and can continue with the installation of the rest of the applications. 

One detail new to the process of installing EMBOSS as of version 6.5.x is the automatic kick in of the embossupdate application, which you note in the final output lines of a successful step 6 operation:
make[3]: Entering directory `/usr/lsc/sources/EMBOSS-6.5.7'
Checks for more recent updates to EMBOSS
EMBOSS is the latest version (


Basically, the EMBOSS install process will check for patches and updates to the source code, a process performed manually by EMBOSS admins before. This is a very welcome addition and eases the process of receiving up-to-date code, in order to address bug fixes and enhancements.

If you do not get to the point where you see the emboss applications and you see errors as part of the make process, the most likely scenario is that you are missing some development library or tool. You can get help by posting a request for help to the EMBOSS mailing list

What you need to do now is to repeat step 6 for every subdirectory under the embassy directory and watch gradually the new applications being added to the bin folder.

Post installation configuration

You should have installed by now all the applications of core EMBOSS and EMBASSY packages from source.  After this process, you should start configuring your system so you can make the applications available.

7)Make sure that the emboss bin folder is in a system wide path, to ensure that all users can reference the applications. For my systems, all the freshly compiled applications reside under the /usr/lsc/emboss/bin folder. Hence, this is the folder I enter into the system wide PATH. in my server /etc/profile.d/, there is a line that contains the following:  
export PATH=$PATH:/usr/lsc/emboss/bin 

8)Make sure you install all the application dependencies for the EMBOSS/EMBASSY applications you are going to use . There is a number of EMBOSS/EMBASSY applications that are wrappers around third party packages. This means that the EMBOSS/EMBASSY application will not function, unless you install its required dependencies. This is normally simple. I am not going to mention all the dependencies now, but  a few examples from my userbase are the following:
-emma which requires the installation of the Clustalw tool. 
-eiprscan which requires the installation of the iprscan tool. 
-ememe which requires the installation of the meme tool. 

Each of these installations might involve an entire set of separate procedures and instructions, but you get the picture.

Part 2 of this article will examine how to configure the EMBOSS databases. 

No comments:

Post a Comment