
Tuesday, December 25, 2012

KVM hosted virtual servers using bridging: theory and practice

If you are a systems or network administrator who:
  • works in enterprise data centers, or
  • wants to deploy virtual servers on a newly acquired multi-core server using RHEL 6 and nothing more than the Linux KVM and RedHat's basic virt-manager application, and/or
  • wishes to gain an understanding of KVM's virtual networking architecture,
then this article/technical walkthrough is for you. Most of these techniques will also work on Linux distributions other than RHEL 6. Admittedly, there are more user-friendly free and commercial tools that allow you to deploy virtual machines. The usual suspects (VMware, RedHat, Oracle, Parallels) provide industrial-strength solutions with intuitive point-and-click interfaces that make setting up virtual machines an easy task.

However, I like to keep my production server software stack as simple as possible. Those of you who have had to troubleshoot VM performance or other problems and faced the 'ping-pong' between the virtualization and the OS vendors will know what I mean. Thus, I use KVM/qemu and virt-manager to cater for my VM needs. The downside is that these tools are less intuitive for the newcomer, but with a little bit of good documentation and practice, they can be effective. I draw this conclusion after looking around in various technical support threads and after browsing RedHat's documentation on the subject. The threads often confuse the various virtual switching modes and techniques, when things could be done more easily with interface bridging. The same can be said for RedHat's Virtualization Administration Guide, which does a fairly good job detailing the Routed, NAT and Isolated virtual networking modes (Chapter 18); however, it fails to mention how bridging can be used for hosting virtual servers. I am going to spend the rest of the article explaining this in detail.

The Theory

Let's be more specific now and explain what I mean when I say I need to deploy a fully networked virtual server. When you use the virt-manager application, it is easy to deploy a network-enabled guest OS by means of Network Address Translation (NAT). In fact, NAT (IP masquerading, a specific mode of NAT) is the default guest OS virtual networking mode, with the guests sharing the IP address of the physical host server.

 Figure 1

The figure above displays the networking data path from the VM guests all the way to the physical network/VLAN when using the default virtual networking mode (NAT). Starting at the bottom of the figure, each guest has been assigned a virtual network interface (vnetX). This is essentially a software implementation of an interface that is part of a virtual switch. At the other end of the virtual switch, a virtual bridge interface (virbr0) merges the traffic from the VMs and hands it to the iptables module, which performs the actual NAT. Finally, the eth0 physical interface carries the packets to the actual wire.
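If you want to see this default NAT setup for yourself, libvirt exposes it as the 'default' virtual network. On a stock RHEL 6 KVM host you can inspect it with the commands below; the exact XML will vary per installation, but it typically shows the virbr0 bridge, a NAT forward mode and a private DHCP range:

virsh net-list --all
virsh net-dumpxml default

and the dumped definition usually resembles something like:

<network>
  <name>default</name>
  <forward mode='nat'/>
  <bridge name='virbr0' stp='on' delay='0' />
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254' />
    </dhcp>
  </ip>
</network>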

In this scenario, your guest OS will have outbound network connectivity. Should you wish to enable inbound network connectivity, however, you will fail. It is possible to perform other tricks, such as port forwarding/SNAT/DNAT, to allow inbound connections, but this is cumbersome. As a result, my definition of deploying a proper virtual server requires that it resembles a true physical server in the following respects:
  • You have a physical MAC address tied to a network/VLAN broadcast domain
  • You can deal with that MAC address in any way you would deal with a true physical NIC: ARP, assign a static IP, (static) DHCP, etc.
  • You can have unrestricted outbound and inbound network access within that network/VLAN broadcast domain, a firm requirement for a server system.
In order to achieve this, we need to employ the technique of interface bridging. For references on bridges, you can consult a variety of sources such as:
i) The IEEE 802.1D standard
ii) The older (out of date but still useful) Ethernet Bridge + netfilter HOWTO from TLDP.
iii) A copy of A. S. Tanenbaum's Computer Networks classic textbook.
However, prior to explaining how this works, let's throw in a realistic production environment scenario.

Figure 2

Figure 2 displays the network topology of a production VM server scenario. There are two networks: one internal Class C (192.168.14.0/24), where hosts may or may not have outbound connectivity. Inbound connectivity to this network is blocked by the top server, which offers FTP, DMZ, FIREWALL, DHCP and DNS services on the INTERNAL net. The other network is a world-routable Class B (129.230/16).

The VM host server needs to serve a number of virtual servers that have different network access criteria:

  • Guest_01: Linux server running a LAMP stack, exposed to the external network/Internet.
  • Guest_02: Development Windows 7 box, which needs to be accessible via non-standard port ranges on the internal network, but also needs Internet access.
  • Guest_03: Legacy SCADA Windows XP based system which needs to be accessible only via the internal network.
Clearly, Guest_01 is the least restricted system, so it makes sense to place it on the INTERNET/EXTERNAL Class B net. Guest_02 needs some protection so that outside hosts cannot reach it; it should only reach the outside world by means of IP masquerading, using the publicly routable IP of the FTP/DMZ/FIREWALL/DHCP/DNS server (129.230.135.131). Thus, it is a candidate for the INTERNAL Class C net. The same goes for Guest_03, the most isolated environment we need to protect, accessible only by INTERNAL network hosts.
At this point, it is useful to modify Figure 1 to illustrate the virtual network data path of our new scenario.  
  Figure 3

Figure 3 above illustrates the virtual network data path of our production scenario (Figure 2). In this case, instead of virbr0 we have bridging modules bound to physical interfaces. Each physical interface is connected to the proper network/VLAN and has a bridge bound to it (we will illustrate how this is done). The role of the bridge is to create a data channel and forward traffic between the vnetX interfaces of the virtual switch and the physical interfaces. The objective is to expose the MAC addresses of the Guest_X machines on the actual physical network/VLAN, as stated earlier. As a result, via bridge br3 we connect the virtual servers Guest_02 and Guest_03 to the internal network, and via br4 we connect Guest_01 to the external world.

The practice

The previous section presented the theory. It's time now for the hands-on practical part. First of all, if you are dealing with a fresh installation, make sure you yum install the following groups, in order to have the full range of virtualization utilities needed to install your guests.

yum groupinstall Virtualization "Virtualization Client" "Virtualization Platform" "Virtualization Tools"
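If you are unsure whether these groups are already present on the system, a quick check along the lines of the following should tell you (the group names may differ slightly between RHEL point releases):

yum grouplist | grep -i virtualization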

You should also install the bridge utilities, as they are needed:

yum install bridge-utils

The next thing you should ensure is that you have enough physical network interfaces on your VM host server. To implement our production scenario, Figure 2 indicates clearly that we need four Ethernet NIC ports: two of them (eth2, eth3) give the server itself IP connectivity and routing on both networks, while eth4 and eth5 will be dedicated to carrying the virtual server traffic.
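If you are not sure which NIC ports the kernel has detected, you can list them all before you start cabling, for example with:

ip link show

or

ifconfig -a | grep -i hwaddr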

We will not need IP addresses for interfaces eth4 and eth5. They will be brought up only to carry the bridged VM traffic. Make sure you identify the NIC ports properly and connect them to the proper network/VLAN Ethernet switch ports. To do that, you can remove their network cables and use the ethtool command to blink the NIC lights on the server side by doing a:
ethtool -p eth4

and
ethtool -p eth5 

to identify the respective NIC ports. The next step is to connect them to the proper switch ports. In principle, once you identify the NIC port side with ethtool you should be OK. In practice, it is easy to make mistakes in messy/unlabelled network panels. Thus, after connecting the cables to the switch ports, one easy check is to put the interface into promiscuous mode and watch for traffic indicating that you are indeed on the right network/VLAN, by doing something like:
tcpdump -i eth4

and amongst the rest of the traffic, you would get something like the ARP or UDP broadcasts below confirming that eth4 is indeed on the internal network (Figures 2 and 3):

tcpdump: WARNING: eth4: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 65535 bytes
16:51:47.089529 ARP, Request who-has intfn1.internal.net tell esxfarm.internal.net, length 46
16:51:47.407363 STP 802.1d, Config, Flags [none], bridge-id 8005.00:1e:14:e6:48:80.800a, length 43
16:51:49.936209 IP winsys01.internal.net.17500 > 255.255.255.255.17500: UDP, length 119
16:51:49.936588 IP winsys02.internal.net.17500 > 192.168.14.255.17500: UDP, length 119


Now that the cables are connected properly, we can start configuring the Ethernet bridges. A bridge is just another network interface, and the best way to configure one on a RHEL 6 system is by getting your hands dirty. Go under the /etc/sysconfig/network-scripts directory and use your favourite text editor (vim, nano, Emacs) to create two files, one for each bridge interface device:

ifcfg-br3 with the following contents:
DEVICE=br3
BOOTPROTO=none
TYPE=Bridge
ONBOOT=yes
DELAY=0


ifcfg-br4 with the following contents:
DEVICE=br4
BOOTPROTO=none
TYPE=Bridge
ONBOOT=yes
DELAY=0



This takes care of the bridge interface declaration. What's left is to associate the newly defined bridges with the right physical interface. Thus, under the same directory (/etc/sysconfig/network-scripts), we create two more files:

ifcfg-eth4 with the following contents:
DEVICE=eth4
HWADDR=00:10:18:31:5A:5B
NM_CONTROLLED=no
ONBOOT=yes
BRIDGE=br3


ifcfg-eth5 with the following contents:
DEVICE=eth5
HWADDR=00:10:18:19:4F:5C
NM_CONTROLLED=no
ONBOOT=yes
BRIDGE=br4


In short, with these four files we have a persistent configuration in which all interfaces (bridges and physical ones) come up on boot, with br3 bound to eth4 and br4 bound to eth5 (Figure 3). Fans of the brctl utility can achieve the same runtime result (note that, unlike the ifcfg files, this is not persistent across reboots) by doing a:

brctl addbr br3
brctl addif br3 eth4
brctl addbr br4
brctl addif br4 eth5
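
If you go the brctl route, the interfaces also need to be brought up by hand, with something like:

ifconfig eth4 up
ifconfig eth5 up
ifconfig br3 up
ifconfig br4 up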


At that point, it is good to issue a:

service network stop; service network start

and check that the bridges and physical interfaces are up and available by issuing an ifconfig command. If all is well, you should see output like the one below (I have excluded some of the non-relevant output for brevity):

br3       Link encap:Ethernet  HWaddr 00:10:18:31:5A:5B 
          inet6 addr: fe80::210:18ff:fe31:5a4b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:386265 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:46672357 (44.5 MiB)  TX bytes:578 (578.0 b)

br4       Link encap:Ethernet  HWaddr 00:10:18:19:4F:5C
          inet6 addr: fe80::210:18ff:fe19:4f33/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:616409 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:58946648 (56.2 MiB)  TX bytes:578 (578.0 b)
...

eth4      Link encap:Ethernet  HWaddr 00:10:18:31:5A:5B
          inet6 addr: fe80::210:18ff:fe31:5a4b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:600933 errors:0 dropped:0 overruns:0 frame:0
          TX packets:128158 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:270119283 (257.6 MiB)  TX bytes:10497306 (10.0 MiB)
          Interrupt:16

eth5      Link encap:Ethernet  HWaddr 00:10:18:19:4F:5C
          inet6 addr: fe80::210:18ff:fe19:4f33/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:708614 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9547 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:96954226 (92.4 MiB)  TX bytes:986694 (963.5 KiB)
          Interrupt:16

...

Note that all relevant interfaces are up and do not have an IPv4 address. The second thing you should note is that each bridge interface has the same MAC address as the physical interface it is associated with.

If you have reached this point, you are almost done. What you need to do now is build your virtual machines. I assume you are familiar with how to build VMs with virt-manager; if not, I have written a quick summary of the procedures. Alternatively, if you have existing VMs, you can reconfigure their networking to use the bridge interfaces.
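For existing guests, one way to switch them over without the GUI is to edit the guest definition with virsh edit (the change takes effect the next time the guest boots) and point its network interface at the bridge. A minimal sketch of the relevant <interface> element, assuming Guest_02 should land on br3 and using the MAC address it ends up with later in this walkthrough, looks something like:

virsh edit Guest_02

<interface type='bridge'>
  <mac address='52:54:00:28:23:af'/>
  <source bridge='br3'/>
</interface>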

Figure 4

Figure 4 above illustrates the network configuration for Guest_02. Make sure that the 'Source device' is one of the available interfaces that connects to br3, and apply the changes. You can do the same for the rest of the virtual server VMs. When you are done, you can check the final configuration with the brctl utility by doing a:

brctl show

and you should get output similar to the one below:

Figure 5

Note the interfaces column, which should correctly list all the physical and vnet interfaces associated with each bridge. When you fire up any of the virtual servers, you should be able to see its MAC address on the physical network/VLAN it is bridged into. Let's take Guest_02 as an example. From our VM host server console, we type:

[root@vmserver ~]# ping win01
PING win01.internal.net (192.168.14.23) 56(84) bytes of data.
64 bytes from win01.internal.net (192.168.14.23): icmp_seq=1 ttl=128 time=2.13 ms
64 bytes from win01.internal.net (192.168.14.23): icmp_seq=2 ttl=128 time=0.518 ms
^C
--- win01.internal.net ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1360ms
rtt min/avg/max/mdev = 0.518/1.324/2.131/0.807 ms
[root@vmserver ~]# arp -a | grep win01
win01.internal.net (192.168.14.23) at 52:54:00:28:23:af [ether] on eth2


Note Guest_02's MAC address from Figure 4. That is the one replying, bridged into the internal network. This means that, for all intents and purposes, Guest_02 is just another server on the internal network. Mission accomplished.

Happy KVM-powered virtual server hosting!

Friday, August 3, 2012

The bioinformatics sysadmin craftsmanship: An EMBOSS 6.5 production server install: Part 2: EMBOSS database access setup

EMBOSS Database configuration

Part 1 of this article series covered a basic installation of EMBOSS from sources. The configuration of EMBOSS databases merits a separate Part, as it requires some knowledge of the indexing process and of the various mechanisms to download and index flatfile databases. Correspondence on the EMBOSS mailing list shows that this is a topic that frequently confuses users and admins. Thus, we are going to take a detailed look at it.

Remote data access methods and the emboss.default file

If you would like a recap of what a flatfile database is and what EMBOSS can do for you in terms of accessing indexed flatfile databases, you might like to take a look at some of the lectures I have given on the subject (slides, video). EMBOSS is not the fastest or most efficient way to index your flatfile databases; you should look at something like MRS and similar systems for a more efficient way to index and perform comprehensive queries on flatfile databases. In fact, EMBOSS can access MRS-indexed databases and, in my opinion, this is better than a pure EMBOSS index in many respects (speed of indexing/querying the index, storage efficiency, etc.). Nevertheless, EMBOSS does its job, and this section describes only the process of indexing flatfile databases using exclusively EMBOSS utilities.

One thing you need to understand is that in order to have access to indexed flatfile databases, you do not always have to index them locally. The EMBOSS applications support a variety of remote data retrieval methods for many useful datasets. Amongst the most popular of them we have:
  • MRS methods (mrs, mrs3 and mrs4): These allow you to search an MRS based index on a local or remote server.

To understand how to engage/activate these different data access methods, you will need to become familiar with the 'emboss.default' file. Part 1 of this article mentioned that the EMBOSS installation directory was /usr/lsc/emboss. You will need to navigate to the following directory:

/usr/lsc/emboss/share/EMBOSS

When you install EMBOSS for the first time on your system, you will see, amongst others, two files:

  • The 'emboss.default.template' file: a sample configuration file which shows the EMBOSS admin how to define databases. We will explain more as we go, but you can use this file as a reference to see many examples of how to properly configure various types of EMBOSS databases.
  • The 'emboss.standard' file: this file also contains valid EMBOSS database configuration entries; unlike the template, however, these definitions are included by default in your current setup.
In other words, the default entries in the emboss.standard file are already part of your database list. So, if on your shell you issue a:

showdb

you will immediately get the following list of database entries by default:

Display information on configured databases
# Name          Type     ID  Qry All Comment
# ============= ======== ==  === === =======
taxon           Taxonomy OK  OK  OK  -
drcat           Resource OK  OK  OK  -
chebi           Obo      OK  OK  OK  -
eco             Obo      OK  OK  OK  -
edam            Obo      OK  OK  OK  -
edam_data       Obo      OK  OK  OK  -
edam_format     Obo      OK  OK  OK  -
edam_identifier Obo      OK  OK  OK  -
edam_operation  Obo      OK  OK  OK  -
edam_topic      Obo      OK  OK  OK  -
go              Obo      OK  OK  OK  -
go_component    Obo      OK  OK  OK  -
go_function     Obo      OK  OK  OK  -
go_process      Obo      OK  OK  OK  -
pw              Obo      OK  OK  OK  -
ro              Obo      OK  OK  OK  -
so              Obo      OK  OK  OK  -
swo             Obo      OK  OK  OK  -



If you wish to define any additional databases beyond this default list, you should create an emboss.default file, using the 'emboss.default.template' file as your reference (we will explain how shortly).

For now, let's focus on the default databases defined by the emboss.standard file. They are a good example of how the new EMBOSS 6.5 enables remote data access to a variety of global public servers out of the box (I assume your Internet connection is working, right?). Let's use the EDAM ontology to retrieve data about an identifier. To do that, I choose the ontotext EMBOSS application and type:
ontotext edam_data:0849

The resulting file (0849.ontotext) contains the info which is retrieved from available servers. Let's take a look at the emboss.standard file to see how the edam_data database is defined:

DB edam_data [
  type:   "obo"
  format: "obo"
  method: "emboss"
  dbalias: "edam"
  namespace: "data|identifier"
  indexdirectory: "$emboss_standard/index"
  directory:      "$emboss_standard/data"
  field: "id ! identifier without the prefix"
  field: "acc ! full name and any alternate identifier(s)"
  field: "nam ! words in the name"
  field: "isa ! parent identifier from is_a relation(s)"
  field: "des ! words in the description"
  field: "ns ! namespace"
  field: "hasattr ! identifier(s) from has_attribute relation(s)"
  field: "hasin ! identifier(s) from has_input relation(s)"
  field: "hasout ! identifier(s) from has_output relation(s)"
  field: "isid ! identifier(s) from is_identifier_of relation(s)"
  field: "isfmt ! identifier(s) from is_format_of relation(s)"
  field: "issrc ! identifier(s) from is_source_of relation(s)"
]

...

RES edamresource [
  type: "Index"
  fields: "id acc nam isa des ns hasattr hasin hasout
           isid isfmt issrc"
  acclen: "80"
  namlen: "32"
  deslen: "30"
  accpagesize: "8192"
  despagesize: "4096"
]
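
If you are curious which access methods and query fields a configured database actually exposes, showdb can report the full details for a single database; a command along the lines of:

showdb -full edam_data

should print its defined access methods, fields and comments.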

 

In general, an EMBOSS database definition has two main parts:
  • The DB definition part: it defines the name, type, format, access method and the various fields of the database record.
  • The RES (resource definition) part: where the length of the various record fields in the index is defined (note that RES definitions are normally found towards the end of the file).
The DB and RES parts go together for each database definition. In addition, for remote data access methods, a SERVER definition might be necessary to enable access to remote information repositories.
Step 9: The 'emboss.default' file does not yet exist, so create it in the directory where the emboss.default.template file resides. From now on, you will be editing the emboss.default file to define all aspects of the EMBOSS database configuration. Start with a minimal file like the one below:

#############################################
# EMBOSS environment variables
#############################################

SET emboss_tempdata /usr/lsc/emboss/share/EMBOSS/test

DB martensembl [
    method: "biomart"
    type: "P"
    url: "http://www.biomart.org:80/biomart/martservice"
    dbalias: "hsapiens_gene_ensembl"
    format: "biomart"
    filter: "chromosome_name=13"
    sequence: "peptide"
    return: "ensembl_gene_id,description,external_gene_id,chromosome_name"
]

 

Here, we have defined the database 'martensembl', which retrieves entries remotely from the Homo sapiens Ensembl gene repository. Save the file and go back to your shell. You can repeat the 'showdb' command and verify that you can see the newly defined martensembl database. Now, test it by typing:


seqret martensembl:ENST00000380152

The resulting FASTA file should contain the information you require, retrieved all the way from the remote BioMart server. Congratulations, you have just set up your first remote database access in EMBOSS!

Browsing remote access repositories is a good idea and the EMBOSS team was right to enable the functionality in EMBOSS. However, accessing remote datasets does not always work very well if:

  • You go into a place where Internet availability is sketchy or of limited bandwidth capacity.
  • The datasets you need to access involve millions of sequences or Gigabytes of information.
In these cases, your only reliable option is to set up a database locally and build a flatfile database index. This is explained in the next section.



How to define a local flatfile database index 

What was said in the previous section about the main parts of an EMBOSS database definition in the emboss.standard file also applies to the emboss.default file. Let's give an example of how you can index the latest UniProtKB/Swiss-Prot database, in three steps:


  • Step A: Download and uncompress the latest file into your flatfile index area, a directory where you have plenty of space to hold your flatfiles and the produced indices of your datasets. The file lies here (EBI FTP server). On the command line, you could do a: 
wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
 followed by a:
gunzip uniprot_sprot.dat.gz

  • Step B: Update the 'emboss.default' file by adding a database definition, as well as a resource definition, as shown below:
SET emboss_database_dir /storage/tools/embossdbs
SET emboss_db_dir /storage/tools/embossdbs
DB sprot [
        type: P
        method: emboss
        release: "57.1"
        format: swiss
        fields: "id acc sv des key org"
        directory: $emboss_db_dir/uniprotsprotfiles
        file: *.dat
        indexdirectory: $emboss_db_dir/uniprotsprotfiles
        comment: "UniProtKB/Swiss-Prot Latest Release "
]

RES sprot [
   type: Index
   idlen:  15
   acclen: 15
   svlen:  20
   keylen: 85
   deslen: 75
   orglen: 75
]

The first two lines are optional and provide an alias for the directory locations where you have uncompressed the flatfile and where the index will be produced. After that, you have the database definition (DB sprot). It is a protein sequence database (type: P). The fields specification is important: it lists all the indices that are going to be produced. So, we know that we will be able to search the database by Swiss-Prot ID (id), accession number (acc), sequence version (sv), descriptive text from the sequence header (des), keyword (key) and taxonomy information (org).

Each of these index fields has a defined length as part of the associated RES (resource definition) entry. Note that it is important to define both the DB and the RES blocks. If, for example, you forget to define the RES record, the EMBOSS applications will complain with an error message similar to the one below until you resolve the issue:
EMBOSS An error in ajnam.c at line 9126:
unknown resource 'sprot'

For now, save the file and do a showdb to verify that you can see the 'sprot' database. If you have omitted or misconfigured any important parts of the definition, the command should complain with informative errors.

  • Step C: Produce the index. Go to the directory where you have your uncompressed flatfile (.dat) (in my case, this is under /storage/tools/embossdbs/uniprotsprotfiles) and type the following EMBOSS command:

dbxflat -outfile uniprotsprotout -directory /storage/tools/embossdbs/uniprotsprotfiles -idformat SWISS -filenames '*.dat' -fields id,acc,sv,des,key,org -compressed N -dbname sprot -dbresource sprot -release 2012_07 -date 03/08/12

You will need to wait a bit, as the system takes its time to crunch the index.

If all goes well, you should see the following index files in your directory where your flatfile lies:

-rw-r--r--. 1 root root  103 2012-08-03 19:28 sprot.ent
-rw-r--r--. 1 root root  299 2012-08-03 19:36 sprot.pxac
-rw-r--r--. 1 root root  301 2012-08-03 19:36 sprot.pxde
-rw-r--r--. 1 root root  295 2012-08-03 19:36 sprot.pxid
-rw-r--r--. 1 root root  297 2012-08-03 19:36 sprot.pxkw
-rw-r--r--. 1 root root  295 2012-08-03 19:36 sprot.pxsv
-rw-r--r--. 1 root root  299 2012-08-03 19:36 sprot.pxtx
-rw-r--r--. 1 root root  63M 2012-08-03 19:36 sprot.xac
-rw-r--r--. 1 root root 259M 2012-08-03 19:36 sprot.xde
-rw-r--r--. 1 root root  40M 2012-08-03 19:36 sprot.xid
-rw-r--r--. 1 root root 161M 2012-08-03 19:36 sprot.xkw
-rw-r--r--. 1 root root  38M 2012-08-03 19:36 sprot.xsv
-rw-r--r--. 1 root root 264M 2012-08-03 19:36 sprot.xtx
-rw-r--r--. 1 root root 2,5G 2012-08-03 19:26 uniprot_sprot.dat
-rw-r--r--. 1 root root  758 2012-08-03 19:36 uniprotsprotout


and you should be able to test your new database. For instance, to obtain all sequences that have the word influenza in the description index from your current sprot release, you could type:

seqret sprot-des:influenza

The same procedure can be used for nucleotide databases (type: N). Remember, you have emboss.default.template as your guide. I hope you now have a better understanding of how to set up local databases in EMBOSS.
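For instance, a minimal sketch of what a local nucleotide definition might look like, assuming EMBL-format flatfiles sitting under a hypothetical $emboss_db_dir/emblfiles directory (it would need its own matching RES block and a dbxflat run with -idformat EMBL, analogous to the sprot example above):

DB embl_local [
        type: N
        method: emboss
        format: embl
        fields: "id acc sv des key org"
        directory: $emboss_db_dir/emblfiles
        file: *.dat
        indexdirectory: $emboss_db_dir/emblfiles
        comment: "Locally indexed EMBL flatfiles (example)"
]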

Tuesday, July 31, 2012

The bioinformatics sysadmin craftsmanship: An EMBOSS 6.5 production server install: Part 1: Installing from sources



Every 15th of July, the EMBOSS team at EBI releases a fresh version of the European Molecular Biology Open Software Suite (EMBOSS). Started and shaped by the EMBnet community, EMBOSS is one of the most versatile systems to perform sequence analysis and a variety of bioinformatics pipeline tasks, as it copes with a variety of file formats and contains a plethora of applications. 

Most of the procedures outlined here are described in more detail in the 'EMBOSS User's Guide: Practical Bioinformatics' book, written by the EMBOSS authoring team. While this is an excellent publication, books quickly get out of date as software evolves, and the on-line EMBOSS administration documentation is also out of date. As a result, I felt that this two-part article series (Part 2 covers the task of enabling data access in EMBOSS, including local flatfile database setup) would be a quick start-up guide for those who have to administer EMBOSS installations.
 
This year the version clock has turned to 6.5. In this Part, I shall go through an installation from the sources on a production Linux server, covering all aspects of the system configuration (the configuration of databases is covered in Part 2). There might be binary/prebuilt packages available for your Linux distribution; however, I always maintain the principle of building the latest binaries from the sources. This gives you the latest and the greatest with a little bit of extra effort.

Most of the steps below can be automated with simple scripts. However, the process of going through a manual installation of EMBOSS should make you aware of the different system components. Once you have an understanding of the system, it is then wise to automate/script these steps.  

What kind of hardware you will need 

EMBOSS is a fairly modest system to install in terms of hardware requirements. The only thing that can stretch the hardware envelope is how much data you would like to index. If your server is to host/index the entire EMBL/GenBank databases, you will need plenty of disk space (I advise you to have at least 3-4 TBytes to spare; yes, you read that right).

Memory- and CPU-wise, 8 cores with 32-64 GB of RAM should be enough to keep most user loads happy (30-40 users) on a production server setup. What you do with the system draws the map for the hardware requirements: if you are trying to do a global alignment of large sequences, you might easily eat up 64 GB of RAM, whereas basic sequence processing can be performed on a dual-core laptop with 4 GB of RAM. By and large, the figures I suggest here should meet most requirements. If you have the task of speccing an EMBOSS server, your best bet is to talk to your scientists and ask what sort of operations they will be performing, to get an accurate picture of the hardware specs.


The downloading of the sources

Prior to starting, I ensure that my Linux system has most of the development libraries installed. Some EMBOSS applications can be sensitive to missing libraries such as libpng, libjpeg, etc. You will also need to ensure that you have your C/C++ compilers installed (gcc/g++).

EMBOSS is a large system. Apart from the core EMBOSS packages, there is an entire array of third-party applications that are bundled together with the EMBOSS core applications (some examples: PHYLIP, MEME, IPRSCAN). These are the EMBASSY tools. This is a detail for most users, who collectively refer to the entire package as EMBOSS. However, the source EMBOSS tarball you download does not contain these additional packages. This means that if you want the full array of EMBOSS/EMBASSY applications, you will have to go through the following steps:


1) Go to the main EMBOSS FTP download server and download the latest EMBOSS tarball (normally named emboss-latest.tar.gz). In my case, it points to EMBOSS-6.5.7.

2) After downloading this to my source dir, I unpack it by doing a:

tar -xzvf EMBOSS-6.5.7.tar.gz


3) I then cd into the EMBOSS-6.5.7 dir and, at the top level of the sources, do a:
mkdir embassy


4) Under the newly created embassy directory, I then download the tarballs of the EMBASSY packages (version info will vary, but the base name of each package should be more or less the same): CBSTOOLS, CLUSTALOMEGA, DOMAINATRIX, DOMALIGN, DOMSEARCH, EMNU, ESIM4, HMMER, IPRSCAN, MEME, MSE, PHYLIPNEW, SIGNATURE, STRUCTURE, TOPO, VIENNA.
I unpack each of the tarballs with the same command as in step 2, under the embassy subdirectory. Once I am done, I delete the remaining *.tar.gz files.


5) At this point, it might be wise to create a tarball with all the sources properly laid out under the embassy subdirectory, by going one level above the EMBOSS-6.5.7 directory and doing a:
tar -cvf embossembassy65.tar EMBOSS-6.5.7/

This will create the file embossembassy65.tar, which is handy in case you wish to erase the whole source tree and start from scratch, and/or repeat the installation on other systems without having to go through steps 1-4 again to assemble the source tree.


Configure and compile

We are now ready to start configuring the various packages and eventually compiling them into the EMBOSS/EMBASSY binary applications we shall be using. On my system, I chose to place the binaries and the produced libraries under:

/usr/lsc/emboss

You are free to choose what you wish on your system. 

6) Thus, I cd into the top level of the EMBOSS-6.5.7 directory and issue a:
./configure --prefix=/usr/lsc/emboss; make; make install

In one sentence, this tells the configure process where to place the produced files and instructs the system to compile and install the produced applications under that location. Grab a cup of tea/coffee/beer, as this will take some time. If all goes well and you see no errors in the terminal output, you should see the first installed binary applications under the /usr/lsc/emboss/bin directory. In my case, I verify that I have functioning applications by executing embossversion:
./embossversion
Report the current EMBOSS version number
6.5.7.0

 

This means that I am on good ground and can continue with the installation of the rest of the applications. 

One detail new to the process of installing EMBOSS as of version 6.5.x is the automatic invocation of the embossupdate application, which you will notice in the final output lines of a successful step 6 operation:
...
make[3]: Entering directory `/usr/lsc/sources/EMBOSS-6.5.7'
/usr/lsc/emboss/bin/embossupdate
Checks for more recent updates to EMBOSS
EMBOSS 6.5.7.0 is the latest version (6.5.0.0)

 

Basically, the EMBOSS install process now checks for patches and updates to the source code, a process previously performed manually by EMBOSS admins. This is a very welcome addition and eases the process of receiving up-to-date code, in order to address bug fixes and enhancements.

If you do not get to the point where you see the EMBOSS applications, and you see errors as part of the make process, the most likely scenario is that you are missing some development library or tool. You can get help by posting to the EMBOSS mailing list.


What you need to do now is repeat step 6 for every subdirectory under the embassy directory and gradually watch the new applications being added to the bin folder.
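
If you prefer not to repeat this by hand for each package, a minimal sketch of a shell loop over the embassy subdirectories (the exact directory names depend on the package versions you downloaded, and the source path below matches my layout) is:

cd /usr/lsc/sources/EMBOSS-6.5.7/embassy
for pkg in */ ; do
    ( cd "$pkg" && ./configure --prefix=/usr/lsc/emboss && make && make install )
done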




Post installation configuration

By now you should have installed all the applications of the core EMBOSS and EMBASSY packages from source. The next step is to configure your system so that the applications are available to your users.


7) Make sure that the emboss bin folder is in a system-wide path, to ensure that all users can reference the applications. On my systems, all the freshly compiled applications reside under the /usr/lsc/emboss/bin folder; hence, this is the folder I add to the system-wide PATH. In my server's /etc/profile.d/bash_login.sh, there is a line that contains the following:
export PATH=$PATH:/usr/lsc/emboss/bin 
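
A quick sanity check, from a fresh login shell so that the profile script has been sourced, is something like:

which embossversion
embossversion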


8) Make sure you install all the application dependencies for the EMBOSS/EMBASSY applications you are going to use. There are a number of EMBOSS/EMBASSY applications that are wrappers around third-party packages. This means that an EMBOSS/EMBASSY application will not function unless you install its required dependencies. This is normally simple. I am not going to mention all the dependencies now, but a few examples from my userbase are the following:
-emma which requires the installation of the Clustalw tool. 
-eiprscan which requires the installation of the iprscan tool. 
-ememe which requires the installation of the meme tool. 

Each of these installations might involve an entire set of separate procedures and instructions, but you get the picture.

Part 2 of this article will examine how to configure the EMBOSS databases.