Search This Blog

Friday, January 6, 2012

The bioinformatics sysadmin craftmanship: Installing the MRS v5 platform: Part 2

In Part 1 of the article series, we examined the basics of what MRS is and its computer hardware requirements. It is about time we get our hands dirty and install a production MRS server.

A basic production setup


The image above illustrates a basic production setup for MRS. You do not have to follow this setup, you could have a single server to handle everything. However, the above setup has a number of advantages that I shall explain.

There are two servers here. The front-end one serves the user queries, whereas the back-end server is used for the MRS index build process. You will notice that the front end is more beefed up hardware-wise than the backend. This is because (as explained in Part 1) the MRS queries can scale in terms of CPU,  I/O and RAM. In contrast, that is not the case with the index building process, which beyond the 8 cores and the I/O it can create, will not scale to a large number of CPUs/cores. As a result, it makes sense to have the most capable machine at the query response end and keep an 8 core CPU with an adequate amount of RAM to crunch your datasets periodically.

The disk I/O setup reflects the same need/trend. I would recommend to place your disks at the front-end machine and have a capable disk controller (Directly Attached Storage SAS, Fiber Channel, Fiber Channel over Ethernet). The backend machine can access these disks to build the index by means of a well performing NFS setup over 10 Gigabit Ethernet. Plain Gigabit Ethernet should also be acceptable, however, I found that a "jumbo frame" enabled 10 Gigabit Ethernet in comparison to plain Gigabit Ethernet cuts the index generation time by 40-60% on average.

This setup is designed to achieve two things:
  • To place the performance where is mostly needed (MRS queries), especially if MRS is used as part of a pipeline (command-line or Galaxy based).  
  • To increase the impact of the index generation process on a busy/hard-working server that is hit by queries. 
The disadvantage is of course that you have to keep two MRS instances running, so what I describe below should be applied to both servers in order to keep things in sync. However, you will see that once you get a basic instance up and running, most of your attention will turn to post-installation issues and not really on keeping two instances in sync, installation-wise.

Software prerequisites

Before we get to the specifics of an MRS server installation, let us go through some important software requirements for installing on a RHEL 6 platform. If your distro is Redhat based (Fedora, CentOS, and Scientific Linux are some of the most well known free derivatives of RHEL), the instructions should carry you through to a functional MRS installation. If your distro is not RHEL based, you can at least have a good appreciation of what building blocks are required for the proper operation of the system. Here is a list of them:
Working as the root user, on a RHEL 6 platform, most of these components can be easily installed by the yum package manager with the exception of the Boost library and snarf:
yum install perl-XML-LibXSLT
yum libarchive libarchive-devel

Starting with the gcc compiler, due to some code optimization bug issues, there were issues when attempting to compile MRS and its prerequisites with a compiler more recent than a 4.4.x series gcc. By mid January 2012, this issue was addressed and is now possible to use more recent compilers than 4.4.x Nevertheless, the  RedHat default 4.4 gcc compiler (in my case it was 4.4.6 20110731 (Red Hat 4.4.6-3) ) is a stable choice.

At the time of writing, RHEL 6.2 (Santiago) is equipped with Boost version 1.41, as part of its default yum package repository. That´s too old for MRS and thus it means that we have to uninstall the yum related Boost packages and install the Boost libs from source.


yum remove boost boost-dev boost-date-time boost-filesystem boost-graph boost-system boost-iostreams boost-thread boost-regex boost-serialization boost-signals

Then grub a copy of the libboost 1.47 from:
http://sourceforge.net/projects/boost/files/boost/

and complete the boost lib install here in a fixed path (in my case /usr/lsc/libs)  by doing a:

tar xvfz boost_1_48_0.tar.gz
(Note:  Earlier versions of libzeep had a problem with boost version >1.47 and would not build. Around mid of January 2012,  it became possible to use boost version 1.48)
cd boost_1_47_0
./bootstrap.sh --prefix=/usr/lsc/libs
./b2 install

At that point, make sure that your shared library config (normally /etc/ld.so.conf should contain the /usr/lsc/libs/lib path and then you should do an ldconfig. Check with an ldconfig -p | grep /usr/lsc/lib/ to see that the boost shared libraries are in place.

For snarf, you need to install the utility in the system wide PATH. 

MRS installation

At this point, we should be ready to start installing MRS itself. Libzeep is the first part of installing MRS. It is a bespoke W3C compliant XML processor that enables MRS to talk the SOAP. This allows users to query an MRS server using web services. Still working as the root user, get the latest version (at the time of writing) 2.6.3 from the CMBI SVN server:

svn co https://svn.cmbi.ru.nl/libzeep/trunk


(revision 337)

Modify the makefile and set the following parameters, having in mind a prefix where you want the libzeep to install:

BOOST_LIB_DIR           = /usr/lsc/libs/lib
BOOST_INC_DIR           = /usr/lsc/libs/include


PREFIX                  ?= /usr/lsc/libs

Then issue a:
make; make install; ldconfig

Do verify that you can see with an ldconfig -p | grep libzeep that the boost shared libraries are in place.:


libzeep.so.2.6 (libc6,x86-64) => /usr/lsc/libs/lib/libzeep.so.2.6
   libzeep.so (libc6,x86-64) => /usr/lsc/libs/lib/libzeep.so

We are ready to install the actual MRS code now. Now let us install the MRS version. Grab the latest from the CMBI svn

svn co https://svn.cmbi.ru.nl/mrs/trunk

(checks out revision 1430)
Note: The MRS SVN repository is an active project and as such, the developers might be in the process of cleaning/modifying the code. It is possible that if you checkout the latest sources from the CMBI SVN server, that something might break/will not compile. When in doubt, please consult the MRS mailing list and verify the latest known working version. At the time of writing, you can be sure that revision 1430 is a working MRS version. If you wish to use it as a reference, you can issue the command:

svn co -r 1430 https://svn.cmbi.ru.nl/mrs/trunk

Then I shall make the directory under which I shall have the MRS binary utilities, as well as the directory where I am going to store the datasets (the large multi Tb volume we talked in Part 1 of the article series):

mkdir /usr/lsc/mrs
mkdir /storage/tools/mrsdata

Then, I am ready to initiate the configuration of the sources by selecting the prefix, the data directory location, as well pointing the location of the boost libraries which I installed from source, just to be safe and ensure the MRS coigure routine will find the right library paths, as shown below:

./configure --prefix=/usr/lsc/mrs --data-dir=/storage/tools/mrsdata --boost_lib=/usr/lsc/libs/lib --boost_inc=/usr/lsc/libs/include

Various checks will be performed and if no errors are returned at this stage, the configure command should be followed by a: 
make; make install; ldconfig

If the compilation stage finishes with no errors (you will see plenty of warnings and you can normally safely ignore them), you have just completed the MRS installation stage. Congratulations!

Post installation config check and orientation

In this section, we will discuss what you should see/check, prior using MRS for the first time. After completing the make install and ldconfig steps as described above, you should familiarize yourself with the directory layout of MRS. So, let us take a tour and show the various MRS directories.

First of all, under the installation prefix (it was /usr/lsc/mrs) you should see the following directories:
  • bin: This is where the MRS utilities reside: mrs-blast mrs-config-value mrs-mirror  mrs-run-and-log  mrs-build  mrs-lock-and-run  mrs-query mrs-test and mrs-update. All these are tools you will be using to configure and query the various MRS datasets.
  • lib: This directory is meant to hold MRS library modules, but it is empty on 64bit systems (x86_64).   
  • lib64: On 64-bit systems (x86_64), this contains the MRS.so shared library, as well as the MRS.pm Perl module, a vital module referenced by the dataset parsing scripts (share directory)
  • sbin: Here you should have the mrs-ws binary which is the SOAP web services MRS module.   
  • share: This directory contains a series of Perl parsers, one for each databank MRS supports.
Next, you should navigate your shell to the /usr/local/etc/mrs directory. Under the directory, you should find a series of important configuration files. I shall not go into details on the syntax of these files in this article, but very briefly:
  • databank.info: This file instructs MRS how to fetch (location and method) and generate index for various databanks you can offer/query under MRS.
  • mrs-config.xml: This XML formatted file (its DTD schema is in the mrs-config.dtd) controls various operational parameters of MRS such as the location of the various MRS directories (most of them are auto-generated by the configure step of the MRS sources), the location/path of externally used utilities (clustalw, NCBI BLAST), as well as the port number and URL location of the MRS SOAP web services servers. Latter articles will explain this parameters in more detail.
Both of the previously mentioned files have a sample you can use for reference (*.dist files). 

If you can see all of these things at this point, you are on good track to fire up MRS for the first time and check it out. We will do that by navigating back to the /usr/lsc/mrs/bin directory. We are going to fetch a simple databank and watch MRS generate the index so we can query the database. We shall do that by running the mrs-update utility:
./mrs-update enzyme

and if everything was compiled properly, MRS will issue the following output:

/usr/lsc/mrs/bin/mrs-run-and-log -r 5 -l /storage/tools/mrsdata/status/enzyme.fetch_log /usr/bin/make -f /usr/lsc/mrs/bin/mrs-update DATABANK=enzyme fetch
/usr/bin/make: success
/usr/lsc/mrs/bin/mrs-run-and-log -r 5 -l /storage/tools/mrsdata/status/enzyme.mrs_log /usr/bin/make -f /usr/lsc/mrs/bin/mrs-update DATABANK=enzyme mrs
/usr/bin/make: success
rm -f /storage/tools/mrsdata/flags/enzyme.fetch_done /storage/tools/mrsdata/flags/enzyme.mrs_done 

After that, we can navigate under the MRS data directory to gain a basic understanding of what happens every time MRS generates the index of a databank. Under the data directory (in my case as indicated by the above output /storage/tools/mrsdata), you will find the following sub-directories:
  • mrs: This is where the MRS index files are produced and stored. Each databank has a number of associated .cmp files, together with an associated dictionary file .dict. For the enzyme databank, the produced files are enzyme.cmp and enzyme.dict.
  • raw: This directory holds the flatfiles of the databanks. These are downloaded from the URL and method, as specified in the databank.info file. 
  • status: A useful directory for the MRS administrator, as it holds important logs about the status of the mrs-update process for the databanks. For each MRS hosted databank, you can see the fetch_log (whether the flatfile download procedure was completed), the mrs_log which outlines whether the MRS index generation was completed properly. Finally, if all was completed properly, an mrs_done file is created to indicate that MRS was successful in updating the databank. The logs for each databank auto-rotate. 
  • docroot: This directory holds the CSS/HTML and web content of the MRS HTTP server. We will describe how to fire-up this server shortly, together with the system SOAP functionality. 
  • flags: This directory is used internally by MRS to sync certain procedures of the databank fetching process.  
  • blast: This directory contains the NCBI BLAST database index for each databank, in order to BLAST databases via the MRS system.
Hence, every time you mrs-update a databank, the latest flatfiles are fetched automatically under the raw directory. After that, the mrs-build utility will attempt to invoke the MRS parsers and create the index under the mrs directory. 

If you wish to see which databanks you can fetch/update with the mrs-update utility, here is a list of them:
dbest embl_release embl_updates enzyme genbank_release gene go goa gpcrdb interpro omim oxford pdb pdbfinder2 dssp hssp pfam pmc prints prosite rebase refseq_release refseq_updates taxonomy unigene uniprot uniref50 uniref90 uniref100

I leave the mrs-index generation of them as an exercise to the reader with two hints:
  • Do not attempt to start multiple mrs-update processes in parallel. Remember, the index generation process does not scale.
  • Some of the largest databanks (embl_release, genbank, dbest, pdb, hssp) will require entire days to  download and index. Thus, what I tend to do is to issue something like: nohup ./mrs-update embl_release &  , to ensure that the process will not be interrupted by a terminal session timeout/disconnection.

Querying and firing up the MRS web and SOAP server

If you have followed all the previous instructions, you should have installed MRS and have indexed one or more databanks. What about querying them to ensure that MRS does indeed its job? After all that effort, you should really experience the power and simplicity of MRS searches. 

The most user intuitive way is to fire up the MRS web server. Before you fire up the MRS web server interface, you should consider making a non-privileged user. Up to this point, we have been working as root. However, opening an HTTP/SOAP port bound to a process with superuser credentials is not the best thing for the security of your server. What I do is to make a normal system user:
useradd -d /home/users/mrsuser mrsuser

I assign a secure password to the user. As root, I make sure that this user can have access to the /var/log/mrsws.log file which logs the queries that hit the MRS SOAP server:
touch /var/log/mrsws.log
chown mrsuser /var/log/mrsws.log

and also change the owneship of the /usr/lsc/mrs directories to the mrsuser recursively:
chown -R mrsuser /usr/lsc/mrs

After that, I switch to the mrsuser and start the MRS SOAP server by navigating to the following directory:
su - mrsuser
cd /usr/lsc/mrs/sbin
nohup ./mrs-ws &

This starts a number of mrs-ws servers with the credentials of the mrsuser and not the root account. Make sure you do not have a firewall between your desktop and the server and point a recent web browser version (Firefox 8, Chrome) to the IP of your server, following the URL convention:
http://IP_of_your_server:18080 


If you hit the Status tab, you will see the MRS web environment as shown above. You can enter your search terms in the top bar and search against all or specific databases.

There are other ways to search the databases that will be outlined in Part 3 of this article series. However, you have now the basic knowledge of how to kickstart MRS in a basic way. In the next article, we will discuss the production usage of MRS.