Search This Blog

Monday, December 26, 2011

The bioinformatics sysadmin craftsmanship: The MRS v5 platform: Part 1

I have always wanted to gather my thoughts on the process of properly installing Maarten's Retrieval System (MRS) on a RHEL server platform. This series of articles describes the procedure in detail and can serve as a guide for the system administrator and/or power user who wishes to install a production-grade MRS server. Although MRS is relatively simple to install, there are a few gotchas and complexities, especially when you do not install it on a Debian-based platform (including Ubuntu).

In this first part of the article series, I shall describe MRS version 5 in a few words and discuss a typical production setup you should consider in order to ensure that you have a reliable MRS engine running.

Please do not send me questions directly if you run into MRS issues beyond the setup stage (comments are welcome). Instead, subscribe to the mrs-user discussion list and ask your questions there; I normally participate in those discussions.


MRS: What is it and why do life scientists need it?

I have gathered some useful background information in these MRS lecture notes. A video of the course is also available. Here I shall just state the basics.

In the bioinformatics world, biological sequence, disease and genome repositories are important tools for the life scientist. Note that I use the term 'repository' and not another word such as 'database'; we have not got to the database business yet. What we do know is that the era of molecular and genomic medicine is here, and thus being able to search, reference and associate sequence/genome and disease information is important.

Now, think about your favorite search engine (Google, Yahoo, etc.) and then narrow your scope to life science information. This is the purpose of MRS: a simple system that allows you to search various life science information repositories. Most of these repositories are distributed in what we call 'flatfile' format: usually human-readable text with a consistent record format, but not enough structure to make the file(s) in question searchable and therefore useful to scientists.

An index is what we need to apply to these flatfiles to make them searchable. MRS does exactly that, among many other things, and we can then talk about life science databases. So, MRS is a set of tools that provides:

  • i) An engine that indexes the flatfiles and keeps them up to date.
  • ii) A set of tools presenting a simple web interface, so that you can search pretty much like you would in your web browser with your preferred search engine.
  • iii) A set of tools for programmatic searches, i.e. searches that can be issued repeatedly from a script/batch mode.
  • iv) BLAST and ClustalW functionality to perform biological sequence homology search and alignment from a single interface.
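To make point iii) concrete, here is a minimal sketch of what a scripted, repeatable search might look like. Note that the host name, URL path and parameter names below are placeholders I made up for illustration; the actual endpoint and parameters depend on your MRS installation, so check your server's documentation or the mrs-user list.

```shell
#!/bin/sh
# Sketch of a programmatic MRS search from a script.
# CAUTION: host, path and parameter names are hypothetical examples,
# not the real MRS API -- adjust them for your own installation.
MRS_HOST="mrs.example.org"

mrs_search_url() {
    db="$1"     # dataset to search, e.g. 'embl'
    term="$2"   # query term
    printf 'http://%s/search?db=%s&q=%s\n' "$MRS_HOST" "$db" "$term"
}

# A batch script could then fetch results repeatedly, e.g.:
#   curl -s "$(mrs_search_url embl insulin)"
mrs_search_url embl insulin
```

The point is simply that, unlike a manual browser session, such a call can be placed in a loop or a cron job and issued thousands of times against your own server.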

MRS is not the only system that gives you this kind of functionality. In fact, Entrez and SRS are two examples, a free and a commercial solution respectively, that are comprehensive and will probably suit most of your needs. In addition, a growing number of web services (as in SOAP/REST) facilitate easy access to biological databases. Examples include EBI's ENA browser and similarly crafted tools, which provide programmatic access to large datasets.

So, if other resources can provide free access to the relevant information, why should you invest effort and hardware to run MRS? The answer is along the following lines:

i) You have power bioinformaticians in-house who need persistent and concurrent programmatic access to large biological databases, AND/OR
ii) You need to provide simple web access to your own sequence data for life scientists.

MRS is one of the most computationally efficient engines to address both of these issues. In terms of issue i), programmatic access is not always freely available from public resources: there is usually a quota on how many queries you can issue against a public web service in a given amount of time, to prevent resource abuse. In addition, network bandwidth could restrict you from retrieving a large number of sequences/records. This is an important factor if you run a departmental/workgroup computing setup, where your local bioinformatician may issue several hundred thousand queries on datasets that can reach TiBs of information.


What kind of computing gear do you need to run MRS?

Although MRS is fairly efficient, running it on a dedicated server-grade machine is a must. This is especially true for the dataset indexing processes, where large amounts of RAM may be required to crunch a large flatfile repository such as the EMBL Nucleotide or GenBank datasets. The table below provides an overview of the minimum computing requirements for various aspects of MRS operation.




The hardware impact of MRS can be measured in terms of the:

  • Disk space: Direct Attached Storage (DAS) or network filesystem storage, used to store the flatfiles and indices of the various datasets.
  • RAM: The amount of RAM needed to perform the indexing operations and/or serve concurrent users (MRS queries). A query could be an index or full-text search on the datasets, an NCBI BLAST operation, or a CLUSTAL operation.
  • CPU cores: The number of CPU cores required by the various indexing/querying processes.


It is important to understand the disk space requirements for hosting the MRS datasets, especially the well-known/standard ones (it is also possible to build your own datasets). For example, in order to host the EMBL nucleotide dataset, you need to account for the space to download its compressed flatfiles plus the space required to generate and store the MRS EMBL index. At the time of writing, the release 110 compressed flatfiles are worth 177 GB. The index is worth approximately 980 GB. So immediately after crunching the EMBL dataset for the first time, you are writing off just over 1 TB of disk space.

Is that all? Well, not exactly. During the index generation stage you might also have to deal with:
  • lots of temporary files that are created and then merged into the main index;
  • NCBI BLAST indices, if your dataset should have them (MRS can take care of these automatically for you by running the formatdb process);
  • the old index, which you will need to keep on disk the next time you upgrade the version of EMBL. Your users will be using the server in the meantime, and as the download/indexing process takes at least 3-5 days, you might not have the option of deleting the old index and flatfiles and waiting for the new ones to download and be crunched.
Hence, a full production cycle for EMBL release 111 could consume at most:
old MRS index + new compressed flatfiles + new MRS index + temp files + BLAST indices =
980 + 190 + 1100 + 100 + 50 = 2420 GB ≈ 2.4 TB

At the end of the EMBL release 111 indexing process, only 1.3 TB will be left permanently on disk, which shows the efficiency of MRS: an uncompressed version of all the flatfiles would take more than that. That's why I state that a minimum of 2 TB of disk space is required for now. Factor in the yearly growth of the datasets and any hosting of your own datasets, and you will need to dedicate 9-10 TB on a decent hardware disk controller that supports nested RAID, on a single volume/partition/filesystem, in order to ensure that you remain production ready for the next couple of years. I would go for either RAID 50 or RAID 60.
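The back-of-the-envelope sum above can be turned into a quick pre-flight check before kicking off a re-indexing run. The figures below are the EMBL release 111 estimates from this article; the data volume path is just an example, so substitute your own values and mount point:

```shell
#!/bin/sh
# Rough peak disk budget for re-indexing EMBL release 111.
# Figures in GB, taken from the estimate above -- adjust for your datasets.
OLD_INDEX=980
NEW_FLATFILES=190
NEW_INDEX=1100
TEMP_FILES=100
BLAST_INDICES=50

NEEDED=$((OLD_INDEX + NEW_FLATFILES + NEW_INDEX + TEMP_FILES + BLAST_INDICES))
echo "Peak disk needed: ${NEEDED} GB"

# Compare against free space on the MRS data volume (example path):
#   AVAIL=$(df -BG /data/mrs | awk 'NR==2 {print $4}' | tr -dc '0-9')
#   [ "$AVAIL" -ge "$NEEDED" ] || echo "WARNING: not enough free space"
```

Running out of disk halfway through a 3-5 day crunch is far more painful than doing this arithmetic up front.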

Moving on to RAM requirements: the more RAM you have, the better. However, RAM is not always as inexpensive as disk space, so the very minimum you should have is 32 GB of RAM. Why? Well, take a look at the screenshot below.


This terminal 'top' screenshot shows the EMBL data indexing process (mrs-build) grabbing a 21 GB chunk of RAM, on a system with 32 GB of physical RAM. MRS large dataset indexing is a very RAM-hungry process. Thus, if you want the server to remain responsive to queries and/or crunch more than one index in parallel, you should really have more than 32 GB of RAM plus an adequate amount of swap.
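Instead of eyeballing top, you can watch the indexer's memory footprint from a script. A minimal sketch, assuming a Linux system with procps: the process name mrs-build is the one shown in the screenshot, so adjust it if your installation names the indexer differently.

```shell
#!/bin/sh
# Report the resident memory of the indexing process and free swap.
# 'mrs-build' is the indexer name from the top screenshot above;
# pass a different process name as the first argument if needed.
report_mem() {
    PROC="${1:-mrs-build}"
    # Sum the RSS (in KB) of all matching processes; 0 if none running.
    RSS_KB=$(ps -C "$PROC" -o rss= | awk '{s+=$1} END {print s+0}')
    echo "$PROC resident memory: $((RSS_KB / 1024)) MB"
    # Swap headroom matters too when crunching large datasets.
    awk '/^SwapFree/ {print "swap free: " int($2/1024) " MB"}' /proc/meminfo
}

report_mem "$@"
```

Run it from cron during a big indexing job and you will see when you are about to push the box into heavy swapping.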

Finally, 8 cores will do if you run one thing at a time. Again, as with RAM, the more the better; however, with at least 16 cores you should be able to serve a small group ( <= 20 ) of users adequately.

One final thing you should note about MRS is that the index generation process does not scale CPU- and I/O-wise: it can use at most 8 cores when building the index of a large dataset. This means that if you want to generate the indices of large datasets, it is not worth launching multiple MRS index building processes in parallel; this will not speed things up. In contrast, MRS queries do scale: user queries can be executed in parallel and dealt with effectively if your system is loaded with plenty of CPU cores, RAM and a capable disk controller.

In the next part (Part 2) of the article series, we will be setting up an MRS v5 server.
 

Sunday, December 25, 2011

A Christmas post

To the readers of the blog: Have a very nice Christmas holiday and recharge your batteries for 2012. Thank you for the blog hits!


Sunday, December 4, 2011

Can you crack it? GCHQ challenge

Yes, we can! We sent our GCHQ friends a nice cup of tea. Happy listening! :-)