Thoughts of a technocrat as letter sequences: 2012

Tuesday, December 25, 2012

KVM hosted virtual servers using bridging: theory and practice

If you are a systems or network administrator who:

works in enterprise data centers, or
wants to deploy virtual servers on a newly acquired multi-core server using RHEL 6 and nothing more than the Linux KVM and Red Hat's basic virt-manager application, and/or
wishes to gain an understanding of KVM's virtual networking architecture

then this article/technical walkthrough is for you. Most of these techniques will also work on Linux distributions other than RHEL 6. Admittedly, there are more user-friendly free and commercial tools for deploying virtual machines. The usual suspects include VMware, Red Hat, Oracle and Parallels, which provide industrial-strength solutions with intuitive point-and-click interfaces that make setting up virtual machines an easy task.

However, I like to keep my production server software stack as simple as possible. Those of you who have had to troubleshoot VM performance or other problems and faced the 'ping-pong' between the virtualization and the OS vendors will know what I mean. Thus, I use KVM/qemu and virt-manager to cater for my VM needs. The downside is that these tools are less intuitive for the newcomer, but with a little bit of good documentation and practice, they can be effective. I draw this conclusion after looking through various technical support threads and browsing Red Hat's documentation on the subject. The threads seem to confuse the various virtual switching modes and techniques when things could be done more easily with interface bridging. The same can be said for Red Hat's Virtualization Administration Guide, which does a fairly good job of detailing the Routed, NAT and isolated virtual networking modes (Chapter 18), but fails to mention how bridging could be used for hosting virtual servers. I am going to spend the rest of the article explaining this in detail.

The Theory

Let's be more specific now and explain what I mean when I say I need to deploy a fully networked virtual server. When you use the virt-manager application, it's easy to deploy a network-enabled guest OS by means of Network Address Translation (NAT). In fact, NAT (more precisely IP Masquerading, a specific mode of NAT) is the default guest OS virtual networking mode: the guests reach the network using the IP address of the physical host server.

Figure 1

The figure above displays the network data path from the VM guests all the way to the physical network/VLAN when using the default virtual networking mode (NAT). Starting at the bottom of the figure, each guest is assigned a virtual network interface (vnetx). This is essentially a software implementation of an interface which is part of a virtual switch. At the other end of the virtual switch, a virtual bridge interface (virbr0) aggregates the traffic from the VMs and hands it over to the iptables module, which performs the actual NAT. Finally, the eth0 physical interface carries the packets to the actual wire.

In this scenario, your guest OS will have outbound network connectivity. Inbound connections, however, will not work out of the box. It is possible to enable them with additional tricks (port forwarding/SNAT/DNAT rules), but this is cumbersome. As a result, my definition of a proper virtual server mirrors the following aspects of a true physical server:

You have a physical MAC address tied to a network/VLAN broadcast domain
You can deal with that MAC address in any way you would deal with a true physical NIC: ARP, assign a static IP, (static) DHCP, etc.
You can have unrestricted outbound and inbound network access within that network/VLAN broadcast domain, a must for a server system.

In order to achieve this, we need to employ the technique of interface bridging. For references on bridges, you can consult a variety of sources such as:
i) The IEEE 802.1D standard
ii) The older (out of date but still useful) Ethernet Bridge + netfilter HOWTO from TLDP.
iii) A copy of A. S. Tanenbaum's classic Computer Networks textbook.
However, prior to explaining how this works, let's throw in a realistic production environment scenario.

Figure 2

Figure 2 displays the network topology of a production VM server scenario. There are two networks. One is an internal Class C (192.168.14/24), where hosts may or may not have outbound connectivity. Inbound connectivity to this network is prohibited by the top server, which offers FTP, DMZ, FIREWALL, DHCP and DNS services on the INTERNAL net. The other network is a world-routable Class B (129.230/16).

The VM host server needs to serve a number of virtual servers that have different network access criteria:

Guest_01: Linux server to run a LAMP stack, exposed on the external (Internet) network.
Guest_02: Development Windows 7 box, which needs to be accessible via non-standard port ranges on the internal network, but also needs Internet access.
Guest_03: Legacy SCADA Windows XP based system which needs to be accessible only via the internal network.

Clearly, Guest_01 is the least restricted system, so it makes sense to place it on the INTERNET/EXTERNAL Class B net. Guest_02 needs some protection: outside hosts must not be able to reach it, but it should still reach the outside world by means of IP Masquerading, using the publicly routable IP of the FTP/DMZ/FIREWALL/DHCP/DNS server (129.230.135.131). Thus, it's a candidate for the INTERNAL Class C net. The same goes for Guest_03, the most isolated environment we need to protect, which is accessible only by INTERNAL network hosts.

At this point, it is useful to modify Figure 1 to illustrate the virtual network data path of our new scenario.

Figure 3

Figure 3 above illustrates the virtual network data path of our production scenario (Figure 2). In this case, instead of the virbr0 we have bridging modules bound to physical interfaces. Each physical interface is connected to the proper network/VLAN and has a bridge bound to it (we will illustrate how this is done). The role of the bridge is to create a data channel and forward traffic between the vnetx interfaces of the virtual switch and the physical interfaces. The objective is to enable the MAC address of the Guest_X machines to connect to the actual physical network/VLAN, as stated earlier. As a result, via bridge br3, we enable the virtual servers Guest_02 and Guest_03 for the internal network and via br4, we connect Guest_01 to the external world.

The practice

The previous section presented the theory. It's time now for the hands-on practical part. First of all, if you are dealing with a fresh installation, make sure you yum install the following groups, in order to have the full range of virtualization utilities and install your guests.

yum groupinstall Virtualization "Virtualization Client" "Virtualization Platform" "Virtualization Tools"

You should also install the bridge utilities, as they are needed:

yum install bridge-utils

The next thing you should ensure is that you have enough physical network interfaces on your VM host server. In order to implement our production scenario, Figure 2 indicates clearly that we need four Ethernet NIC ports: Two of them (eth2, eth3) are used to enable the server to have IP connectivity and routing on both networks. In contrast, eth4 and eth5 will be dedicated to carry the virtual server traffic.

We will not need IP addresses for interfaces eth4 and eth5. They will be brought up only to carry the bridged VM traffic. Make sure you identify the NIC ports properly and connect them to the proper network/VLAN Ethernet switch ports. To do that, you can remove their network cables and use the ethtool command to blink the NIC lights on the server side by doing a:


ethtool -p eth4

and


ethtool -p eth5

to respectively identify the proper NIC ports. The next step is to connect them to the proper switch ports. In principle, once you identify the NIC port side with ethtool you should be OK. In practice, it is easy to make mistakes in messy/unlabelled network panels. Thus, after connecting the cables to the switch ports, one easy check is to bring the interface to promiscuous mode and watch for traffic indicating you are indeed on the right network/VLAN, by doing things like:


tcpdump -i eth4

and amongst the rest of the traffic, you would get something like the ARP or UDP broadcasts below confirming that eth4 is indeed on the internal network (Figures 2 and 3):


tcpdump: WARNING: eth4: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 65535 bytes
16:51:47.089529 ARP, Request who-has intfn1.internal.net tell esxfarm.internal.net, length 46
16:51:47.407363 STP 802.1d, Config, Flags [none], bridge-id 8005.00:1e:14:e6:48:80.800a, length 43
16:51:49.936209 IP winsys01.internal.net.17500 > 255.255.255.255.17500: UDP, length 119
16:51:49.936588 IP winsys02.internal.net.17500 > 192.168.14.255.17500: UDP, length 119

Now that the cables are connected properly we can start configuring the Ethernet bridges. A bridge is just another interface and the best way to configure this on a RHEL 6 system is by getting your hands dirty. Go right under the /etc/sysconfig/network-scripts directory and use your favourite text editor (vim, nano, Emacs) to make two files, one for each bridge interface device

ifcfg-br3 with the following contents:


DEVICE=br3
BOOTPROTO=none
TYPE=Bridge
ONBOOT=yes
DELAY=0

ifcfg-br4 with the following contents:


DEVICE=br4
BOOTPROTO=none
TYPE=Bridge
ONBOOT=yes
DELAY=0

This takes care of the bridge interface declaration. What's left is to associate the newly defined bridges with the right physical interface. Thus, under the same directory (/etc/sysconfig/network-scripts), we create two more files:

ifcfg-eth4 with the following contents:


DEVICE=eth4
HWADDR=00:10:18:31:5A:5B
NM_CONTROLLED=no
ONBOOT=yes
BRIDGE=br3

ifcfg-eth5 with the following contents:


DEVICE=eth5
HWADDR=00:10:18:19:4F:5C
NM_CONTROLLED=no
ONBOOT=yes
BRIDGE=br4
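The four files differ only in device names, MAC addresses and the bridge they refer to, so they lend themselves to being generated. The sketch below is one possible way to do that; render_bridge_ifcfg and render_port_ifcfg are hypothetical helper names (not part of RHEL), and the script only prints the file contents so that nothing is written until you redirect the output yourself.

```shell
#!/bin/sh
# Hypothetical helpers: print ifcfg contents using the same keys as the files above.

# Contents for a bridge device such as br3 or br4.
render_bridge_ifcfg() {
    # $1 = bridge name
    printf 'DEVICE=%s\nBOOTPROTO=none\nTYPE=Bridge\nONBOOT=yes\nDELAY=0\n' "$1"
}

# Contents for a physical NIC port that is attached to a bridge.
render_port_ifcfg() {
    # $1 = NIC name, $2 = NIC MAC address, $3 = bridge name
    printf 'DEVICE=%s\nHWADDR=%s\nNM_CONTROLLED=no\nONBOOT=yes\nBRIDGE=%s\n' "$1" "$2" "$3"
}

# Dry run: show the br3/eth4 pair on the terminal.
render_bridge_ifcfg br3
render_port_ifcfg eth4 00:10:18:31:5A:5B br3
```

To apply the result, you would redirect each call (as root) into the matching file, for example render_bridge_ifcfg br3 > /etc/sysconfig/network-scripts/ifcfg-br3.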

In short, these four files give us a persistent configuration in which all interfaces (bridges and physical ones) come up on boot, with eth4 attached to br3 and eth5 attached to br4 (Figure 3). Fans of the brctl utility could set up the same bridges by hand (a runtime-only change that does not survive a reboot) by doing a:


brctl addbr br3

brctl addif br3 eth4

brctl addbr br4

brctl addif br4 eth5

At that point, it is good to issue a:


service network stop; service network start

and check that the bridges and physical interfaces are up and available by issuing an ifconfig command. If all is well, you should see output like the one below (I have left out some of the irrelevant output to keep it short):


br3       Link encap:Ethernet  HWaddr 00:10:18:31:5A:5B
          inet6 addr: fe80::210:18ff:fe31:5a4b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:386265 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:46672357 (44.5 MiB)  TX bytes:578 (578.0 b)

br4       Link encap:Ethernet  HWaddr 00:10:18:19:4F:5C
          inet6 addr: fe80::210:18ff:fe19:4f33/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:616409 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:58946648 (56.2 MiB)  TX bytes:578 (578.0 b)

...

eth4      Link encap:Ethernet  HWaddr 00:10:18:31:5A:5B
          inet6 addr: fe80::210:18ff:fe31:5a4b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:600933 errors:0 dropped:0 overruns:0 frame:0
          TX packets:128158 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:270119283 (257.6 MiB)  TX bytes:10497306 (10.0 MiB)
          Interrupt:16

eth5      Link encap:Ethernet  HWaddr 00:10:18:19:4F:5C
          inet6 addr: fe80::210:18ff:fe19:4f33/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:708614 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9547 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:96954226 (92.4 MiB)  TX bytes:986694 (963.5 KiB)
          Interrupt:16

...

Note that all relevant interfaces are up and do not have an IPv4 address (only the automatically assigned IPv6 link-local one). The second thing you should note is that each bridge interface has the same MAC address as the physical interface it is associated with.

If you have reached this point, you are almost done. What you need to do now is build your virtual machines. I assume you are familiar with how to build VMs in virt-manager. If not, I have written a quick summary of the procedures. Alternatively, if you already have existing VMs, you could reconfigure their networking to use the bridge interfaces.

Figure 4

Figure 4 above illustrates the network config for Guest_02. Make sure that the 'Source device' is one of the available vnet interfaces that connect to br3 and apply the changes. You can do the same for the rest of the virtual server VMs. When you are done, you can check the final configuration with the brctl utility by doing a:


brctl show

and you should get output similar to the one below:

Figure 5
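For scripted checks, the same bridge-to-interface mapping can be pulled out of the brctl show text. The sketch below assumes the usual four-column layout of that command and runs against an illustrative sample: the bridge ids and vnet numbers in it are made up for the example and are not copied from Figure 5. list_bridge_ports and sample_brctl_show are hypothetical helper names.

```shell
#!/bin/sh
# Illustrative stand-in for the output of "brctl show" (values are assumptions).
sample_brctl_show() {
    cat <<'EOF'
bridge name     bridge id               STP enabled     interfaces
br3             8000.001018315a5b       no              eth4
                                                        vnet1
                                                        vnet2
br4             8000.001018194f5c       no              eth5
                                                        vnet0
EOF
}

# Read "brctl show" style text on stdin and print one "bridge interface" pair per line.
list_bridge_ports() {
    awk '
        NR == 1 { next }                          # skip the header row
        NF >= 4 { bridge = $1; print bridge, $4; next }   # bridge row with its first port
        NF == 3 { bridge = $1; next }             # bridge row without any port
        NF == 1 { print bridge, $1 }              # continuation row: one more port
    '
}

sample_brctl_show | list_bridge_ports
```

On the VM host itself you would feed it the real data with brctl show | list_bridge_ports and then grep for the bridge you care about.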

Note the interfaces column, which should correctly list all the physical and vnet interfaces associated with each bridge. When you fire up any of the virtual servers, you should be able to see it on the network with the MAC address of its vnet interface. Let's take Guest_02 as an example. From our VM host server console, we type:


[root@vmserver ~]# ping win01 

PING win01.internal.net (192.168.14.23) 56(84) bytes of data.
64 bytes from win01.internal.net (192.168.14.23): icmp_seq=1 ttl=128 time=2.13 ms
64 bytes from win01.internal.net (192.168.14.23): icmp_seq=2 ttl=128 time=0.518 ms
^C
--- win01.internal.net ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1360ms
rtt min/avg/max/mdev = 0.518/1.324/2.131/0.807 ms
[root@vmserver ~]# arp -a | grep win01

win01.internal.net (192.168.14.23) at 52:54:00:28:23:af [ether] on eth2

Note Guest_02's MAC address from Figure 4. That's the one replying and bridged into the internal network. This means that for all intents and purposes, Guest_02 is just another server on the internal network. Mission accomplished.
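If you want to make the same check from a script, one rough heuristic is to look at the MAC prefix: libvirt generates MAC addresses starting with 52:54:00 for QEMU/KVM guests by default, as seen in the arp output above. The sketch below relies on that default, so a guest with a manually assigned MAC address would not be recognised. mac_from_arp_line and is_kvm_mac are hypothetical helper names.

```shell
#!/bin/sh
# Extract the MAC address from one line of "arp -a" output read on stdin.
mac_from_arp_line() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "at") print $(i + 1) }'
}

# Succeed if the MAC address carries the default libvirt QEMU/KVM prefix 52:54:00.
is_kvm_mac() {
    case "$(printf '%s' "$1" | tr 'A-F' 'a-f')" in
        52:54:00:*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Example with the arp line shown above.
mac=$(printf '%s\n' 'win01.internal.net (192.168.14.23) at 52:54:00:28:23:af [ether] on eth2' | mac_from_arp_line)
if is_kvm_mac "$mac"; then
    echo "$mac looks like a libvirt-generated guest MAC"
fi
```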

Happy KVM sponsored virtual server hosting!

Friday, August 3, 2012

The bioinformatics sysadmin craftmanship: An EMBOSS 6.5 production server install: Part 2: EMBOSS database access setup

EMBOSS Database configuration

Part 1 of this article series covered a basic installation of EMBOSS from sources. The configuration of EMBOSS databases merits a separate part of the series, as it requires some knowledge of the indexing process and of the various mechanisms to download and index flatfile databases. Correspondence on the EMBOSS mailing list shows that this is a topic that frequently confuses users and admins. Thus, we are going to take a detailed look at it.

Remote data access methods and the emboss.default file

If you would like a recap of what a flatfile database is and what EMBOSS can do for you in terms of accessing indexed flatfile databases, you might like to take a look at some of the lectures I have given on the subject (slides, video). EMBOSS is not the fastest or most efficient way to index your flatfile databases. You should look at something like MRS and similar systems for a more efficient way to index and perform comprehensive queries on flatfile databases. In fact, EMBOSS can access MRS indexed databases and, in my opinion, this is better than a pure EMBOSS index system in many respects (speed of indexing/querying the index, storage efficiency etc). Nevertheless, EMBOSS does its job, and this section describes only the process of indexing flatfile databases by using exclusively EMBOSS utilities.

One thing you need to understand is that in order to have access to indexed flatfile databases, you do not always have to index them locally. The EMBOSS applications support a variety of remote data retrieval methods to many useful datasets. Amongst the most popular of them we have:

MRS methods (mrs, mrs3 and mrs4): These allow you to search an MRS based index on a local or remote server.

DBFETCH method (dbfetch): Supported by servers at EBI.

WSDBFETCH method (wsdbfetch): A SOAP based EBI service similar to the DBFETCH method.

BIOMART method (biomart): Using the Biomart service.

To understand how to engage/activate these different data access methods, you will need to become familiar with the 'emboss.default' file. Part 1 of this article mentioned that the EMBOSS installation directory was under: /usr/lsc/emboss . You will need to navigate to the following directory:

/usr/lsc/emboss/share/EMBOSS

When you install EMBOSS for the first time in your system, you will see amongst others two files:

The 'emboss.default.template' file: This is a sample configuration file which shows the EMBOSS admin how to define databases. We will explain more in the process, but you can use this file as a reference to see many examples of how to configure properly various types of EMBOSS databases.
The emboss.standard file: This file also contains valid EMBOSS database configuration entries. However, the database definitions are included by default in your current setup.

The idea is that you have some default entries in the emboss.standard file which are included in your database list. So, if on your shell you issue a:

showdb

you will immediately get the following list of database entries by default:

Display information on configured databases
# Name          Type     ID Qry All Comment
# ============= ======== == === === =======
taxon           Taxonomy OK OK OK -
drcat           Resource OK OK OK -
chebi           Obo      OK OK OK -
eco             Obo      OK OK OK -
edam            Obo      OK OK OK -
edam_data       Obo      OK OK OK -
edam_format     Obo      OK OK OK -
edam_identifier Obo      OK OK OK -
edam_operation  Obo      OK OK OK -
edam_topic      Obo      OK OK OK -
go              Obo      OK OK OK -
go_component    Obo      OK OK OK -
go_function     Obo      OK OK OK -
go_process      Obo      OK OK OK -
pw              Obo      OK OK OK -
ro              Obo      OK OK OK -
so              Obo      OK OK OK -
swo             Obo      OK OK OK -

If you wish to define any additional databases beyond this default list, you should create an emboss.default file, using the file 'emboss.default.template' as your reference (we are going to explain how shortly).

For now let's focus on these default databases defined by the emboss.standard file. They are a good example of how the new EMBOSS 6.5 enables remote data access from a variety of global public servers out of the box (I assume your Internet connection is working, right?). Let's use the EDAM ontology to retrieve data about an identifier. To do that I choose the ontotext EMBOSS application and I type:

ontotext edam_data:0849

The resulting file (0849.ontotext) contains the info which is retrieved from available servers. Let's take a look at the emboss.standard file to see how the edam_data database is defined:

DB edam_data [
type: "obo"
format: "obo"
method: "emboss"
dbalias: "edam"
namespace: "data|identifier"
indexdirectory: "$emboss_standard/index"
directory: "$emboss_standard/data"
field: "id ! identifier without the prefix"
field: "acc ! full name and any alternate identifier(s)"
field: "nam ! words in the name"
field: "isa ! parent identifier from is_a relation(s)"
field: "des ! words in the description"
field: "ns ! namespace"
field: "hasattr ! identifier(s) from has_attribute relation(s)"
field: "hasin ! identifier(s) from has_input relation(s)"
field: "hasout ! identifier(s) from has_output relation(s)"
field: "isid ! identifier(s) from is_identifier_of relation(s)"
field: "isfmt ! identifier(s) from is_format_of relation(s)"
field: "issrc ! identifier(s) from is_source_of relation(s)"
]
...

RES edamresource [
type: "Index"
fields: "id acc nam isa des ns hasattr hasin hasout
isid isfmt issrc"
acclen: "80"
namlen: "32"
deslen: "30"
accpagesize: "8192"
despagesize: "4096"
]

In general, an EMBOSS database definition has two main parts:

The DB definition part: It defines the name, type, format, access method and various fields of the database record.
The RES (resource definition) part: Where the length of the various record fields is defined in the index. (note that RES definitions are normally found towards the end of the file).

The DB and RES parts go together for each database definition. In addition, for remote data access methods, a SERVER definition might be necessary to enable access to remote information repositories.

Step 9: The 'emboss.default' file does not exist yet, so create it in the same directory as the emboss.default.template file. From now on, you will be editing the emboss.default file to define all aspects of the EMBOSS database configuration. Start with a minimal file like the one below:

#############################################
# EMBOSS environment variables
#############################################

SET emboss_tempdata /usr/lsc/emboss/share/EMBOSS/test

DB martensembl [
    method: "biomart"
    type: "P"
    url: "http://www.biomart.org:80/biomart/martservice"
    dbalias: "hsapiens_gene_ensembl"
    format: "biomart"
    filter: "chromosome_name=13"
    sequence: "peptide"
    return: "ensembl_gene_id,description,external_gene_id,chromosome_name"
]

As shown here, we have defined the database 'martensembl', which can remotely retrieve entries from the Homo sapiens Ensembl gene repository. Save the file and go back to your shell. You can repeat the 'showdb' command and verify that you can see the newly defined martensembl database. Now, test it by typing:

seqret martensembl:ENST00000380152

The resulting fasta file should contain the info you require, retrieved all the way from the remote Biomart server. Congratulations, you have just set up your first remote database access in EMBOSS!

Browsing remote access repositories is a good idea and the EMBOSS team was right to enable the functionality in EMBOSS. However, accessing remote datasets does not always work very well if:

You go into a place where Internet availability is sketchy or of limited bandwidth capacity.
The datasets you need to access involve millions of sequences or Gigabytes of information.

In these cases, your only reliable option is to set up a database locally and make a flatfile database index. This is explained in the next section.

How to define a local flatfile database index

What was said in the previous section about the main parts of an EMBOSS database definition in the emboss.standard file also applies to the emboss.default file. As an example, let's see how you can format the latest Uniprot/sprot database, in three steps:

Step A: Download and uncompress the latest file into your flatfile index area, a directory where you should have plenty of space to hold your flatfiles and the produced indices of your datasets. The file lies here (EBI FTP server). On the command line, you could do a:

wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

followed by a:

gunzip uniprot_sprot.dat.gz

Step B: Update the 'emboss.default' file by adding a database definition, as well as a resource definition, as shown below:

SET emboss_database_dir /storage/tools/embossdbs
SET emboss_db_dir /storage/tools/embossdbs

DB sprot [
        type: P
        method: emboss
        release: "57.1"
        format: swiss
        fields: "id acc sv des key org"
        directory: $emboss_db_dir/uniprotsprotfiles
        file: *.dat
        indexdirectory: $emboss_db_dir/uniprotsprotfiles
        comment: "UniProtKB/Swiss-Prot Latest Release "
]

RES sprot [
   type: Index
   idlen: 15
   acclen: 15
   svlen: 20
   keylen: 85
   deslen: 75
   orglen: 75
]

The first two lines are optional and provide an alias for the directory locations where you have uncompressed the flatfile and you are going to produce the index. After that you have the database (DB sprot) definition. It is a protein sequence database (type: P). The fields specification is important. It lists all the indices that are going to be produced. So, we know that we will be able to search the database by sprot IDs (id), accession number (acc), sequence version (sv), descriptive text from the sequence header (des), keyword (key) and taxonomy info (org).

Each of these index fields has a defined length in the associated RES (resource definition) entry. Note that it is important to define both the DB and the RES blocks. If you forget to define the RES record, for example, the EMBOSS applications will keep complaining with an error message similar to the one below until you resolve the issue:

EMBOSS An error in ajnam.c at line 9126:
unknown resource 'sprot'
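A quick pre-flight check can catch this before any EMBOSS application does. The sketch below is a rough heuristic rather than a real parser of the emboss.default syntax: it assumes, as in the sprot example, that each locally indexed database (method: emboss) has a RES block with the same name as its DB block. Definitions that name their resource differently, as edam_data and edamresource do in emboss.standard, would be reported spuriously. find_dbs_without_res is a hypothetical helper name.

```shell
#!/bin/sh
# Read an emboss.default style file on stdin and print the names of DB blocks
# that use "method: emboss" but have no RES block of the same name.
find_dbs_without_res() {
    awk '
        $1 == "DB"  { cur = $2; next }                 # entering a DB block
        $1 == "RES" { res[$2] = 1; cur = ""; next }    # remember every RES name
        cur != "" && $1 == "method:" {
            m = $2
            gsub(/"/, "", m)                           # method may or may not be quoted
            if (m == "emboss") need[cur] = 1
        }
        $1 == "]"   { cur = "" }                       # leaving the current block
        END { for (n in need) if (!(n in res)) print n }
    ' | sort
}
```

A typical call would be find_dbs_without_res < /usr/lsc/emboss/share/EMBOSS/emboss.default, where empty output means every locally indexed database has a matching resource.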

For now, save the file and do a showdb to verify that you can see the 'sprot' database. If you have omitted or misconfigured any important parts of the definition, the command should complain with informative errors.

Step C: Produce the index. Go to the directory where you have your uncompressed flatfile (.dat) (in my case this is under /storage/tools/embossdbs/uniprotsprotfiles) and type the following EMBOSS command:

dbxflat -outfile uniprotsprotout -directory /storage/tools/embossdbs/uniprotsprotfiles -idformat SWISS -filenames '*.dat' -fields id,acc,sv,des,key,org -compressed N -dbname sprot -dbresource sprot -release 2012_07 -date 03/08/12

You will need to wait a bit, as the system takes its time to crunch the index.
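Since the dbxflat invocation packs many options into one line, it can help to assemble it from a handful of values when you index several releases. The sketch below is a dry run only: build_dbxflat_cmd is a hypothetical helper that prints the command for review instead of running it, it reuses exactly the options from Step C, and it assumes the database and resource share one name.

```shell
#!/bin/sh
# Print (do not run) a dbxflat command line with the same options as in Step C.
build_dbxflat_cmd() {
    # $1 = flatfile directory, $2 = database name (also used as the resource name),
    # $3 = release label, $4 = index date, $5 = name of the dbxflat output file
    printf "dbxflat -outfile %s -directory %s -idformat SWISS -filenames '*.dat' \
-fields id,acc,sv,des,key,org -compressed N -dbname %s -dbresource %s -release %s -date %s\n" \
        "$5" "$1" "$2" "$2" "$3" "$4"
}

# Reproduces the command used in this article.
build_dbxflat_cmd /storage/tools/embossdbs/uniprotsprotfiles sprot 2012_07 03/08/12 uniprotsprotout
```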

If all goes well, you should see the following index files in your directory where your flatfile lies:

-rw-r--r--. 1 root root 103 2012-08-03 19:28 sprot.ent
-rw-r--r--. 1 root root 299 2012-08-03 19:36 sprot.pxac
-rw-r--r--. 1 root root 301 2012-08-03 19:36 sprot.pxde
-rw-r--r--. 1 root root 295 2012-08-03 19:36 sprot.pxid
-rw-r--r--. 1 root root 297 2012-08-03 19:36 sprot.pxkw
-rw-r--r--. 1 root root 295 2012-08-03 19:36 sprot.pxsv
-rw-r--r--. 1 root root 299 2012-08-03 19:36 sprot.pxtx
-rw-r--r--. 1 root root 63M 2012-08-03 19:36 sprot.xac
-rw-r--r--. 1 root root 259M 2012-08-03 19:36 sprot.xde
-rw-r--r--. 1 root root 40M 2012-08-03 19:36 sprot.xid
-rw-r--r--. 1 root root 161M 2012-08-03 19:36 sprot.xkw
-rw-r--r--. 1 root root 38M 2012-08-03 19:36 sprot.xsv
-rw-r--r--. 1 root root 264M 2012-08-03 19:36 sprot.xtx
-rw-r--r--. 1 root root 2,5G 2012-08-03 19:26 uniprot_sprot.dat
-rw-r--r--. 1 root root 758 2012-08-03 19:36 uniprotsprotout

and you should be able to test your new database. For instance, to obtain all sequences that have the word influenza in the description index from your current sprot release, you could type:

seqret sprot-des:influenza
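Before pointing users at a freshly built index, it can be worth confirming that every index file from the listing above is actually present. The sketch below hard-codes the thirteen file suffixes seen in that listing for the fields id, acc, sv, des, key and org; a different field selection would presumably produce a different set, so treat the list as an assumption. missing_index_files is a hypothetical helper name.

```shell
#!/bin/sh
# Print the name of every expected index file that is missing from a directory.
missing_index_files() {
    # $1 = index directory, $2 = database name used with -dbname
    for suffix in ent pxac pxde pxid pxkw pxsv pxtx xac xde xid xkw xsv xtx; do
        [ -e "$1/$2.$suffix" ] || printf '%s\n' "$2.$suffix"
    done
}
```

For the setup in this article the call would be missing_index_files /storage/tools/embossdbs/uniprotsprotfiles sprot, with empty output meaning that all index files are in place.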

The same procedure could be used for nucleotide databases (type: N). Remember, you have the emboss.default.template as your guide. I hope you now have a better understanding of how you can set up local databases in EMBOSS.

Tuesday, July 31, 2012

The bioinformatics sysadmin craftmanship: An EMBOSS 6.5 production server install: Part 1: Installing from sources

Every 15th of July, the EMBOSS team at EBI releases a fresh version of the European Molecular Biology Open Software Suite (EMBOSS). Started and shaped by the EMBnet community, EMBOSS is one of the most versatile systems to perform sequence analysis and a variety of bioinformatics pipeline tasks, as it copes with a variety of file formats and contains a plethora of applications.

Most of the procedures outlined here are described in more detail in the 'EMBOSS User's Guide: Practical Bioinformatics' book, written by the EMBOSS authoring team. While this is an excellent publication, books quickly get out of date as software evolves. In addition, the on-line EMBOSS administration documentation is out of date. As a result, I felt that this two-part article series (Part 2 covers the task of enabling data access in EMBOSS, including local flatfile database setup) would be a quick startup guide for those who have to administer EMBOSS installations.

This year the version clock has turned into 6.5. In this Part, I shall be going through an installation from the sources on a production Linux server, covering all aspects of the system configuration, including the formatting of databases. There might be binary/prebuilt packages available for your Linux distribution. However, I always maintain the principle of building the latest binaries from the sources. This gives you the latest and the greatest with a little bit of extra effort.

Most of the steps below can be automated with simple scripts. However, the process of going through a manual installation of EMBOSS should make you aware of the different system components. Once you have an understanding of the system, it is then wise to automate/script these steps.

What kind of hardware you will need

EMBOSS is a fairly modest system to install in terms of hardware requirements. The only thing that can push the hardware envelope is how much data you would like to index. If your server should host/index the entire EMBL/Genbank databases, you will need plenty of disk space (I advise you to have at least 3-4 Tbytes to spare; yes, you read that right).

Memory and CPU wise, 8 cores with 32-64 Gigs of RAM should be enough to keep most user loads happy (30-40 users) on a production server setup. What you do draws the map for the hardware requirements. If you are trying to do a global alignment of large sequences, you might easily eat up 64 Gigs of RAM. In contrast, basic sequence processing could also be performed on a dual core laptop with 4 Gigs of RAM. By and large, the figures I suggest here should meet most requirements. If you have the task of speccing an EMBOSS server, your best bet to get it right is to talk to your scientists and ask what sort of operations they will be performing, to get an accurate picture of the hardware specs.

The downloading of the sources

Before starting, I ensure that my Linux system has most of the development libraries installed. Some EMBOSS applications can be sensitive to missing libraries like libpng, libjpeg, etc. You will also need to ensure that you have your C/C++ compilers installed (gcc/g++).

EMBOSS is a large system. Apart from the core EMBOSS packages, there is an entire array of third party applications that are bundled together with the EMBOSS core applications (some examples: PHYLIP, MEME, IPRSCAN). These are the EMBASSY tools. This is a detail for most users, who collectively refer to the entire package as EMBOSS. However, when you go to download the source EMBOSS tarball, it does not contain these additional packages. This means that if you want to have the full array of EMBOSS/EMBASSY applications, you will have to go through the following steps:

1)I go to the main EMBOSS FTP download server and download the latest EMBOSS tarball (normally named emboss-latest.tar.gz). In my case, it points to EMBOSS-6.5.7.

2)After downloading this to my source dir, I unpack it by doing a:

tar -xvzf EMBOSS-6.5.7.tar.gz

3)I then cd to the EMBOSS-6.5.7 dir and at the top level of the sources, I do a:

mkdir embassy

4)Under the newly created embassy directory, I then download the tarballs of the EMBASSY packages (version info will vary, but the base name of each package should be more or less the same): CBSTOOLS, CLUSTALOMEGA, DOMAINATRIX, DOMALIGN, DOMSEARCH, EMNU, ESIM4, HMMER, IPRSCAN, MEME, MSE, PHYLIPNEW, SIGNATURE, STRUCTURE, TOPO, VIENNA .
I unpack each of the tarballs with the same command as step 2 under the embassy subdirectory. Once I am done, I can delete the remaining *.tar.gz files.
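With sixteen packages, unpacking by hand gets tedious. A small loop along the following lines is one way to do it; unpack_tarballs is a hypothetical helper name, and the sketch assumes gzip-compressed tarballs and a tar that understands the -C option (GNU tar on Linux does).

```shell
#!/bin/sh
# Unpack every *.tar.gz found in a directory into that same directory.
unpack_tarballs() {
    # $1 = directory holding the downloaded tarballs, e.g. EMBOSS-6.5.7/embassy
    for tarball in "$1"/*.tar.gz; do
        [ -e "$tarball" ] || continue            # nothing matched the pattern
        tar -xzf "$tarball" -C "$1" || return 1  # stop at the first failure
    done
}
```

From the top level of the sources, the call would be unpack_tarballs embassy. The tarballs themselves are left in place, so you can inspect the result before deleting them.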

5)At this point, it might be wise to create a tarball with all the sources properly laid out under the embassy subdirectory by going above the EMBOSS-6.5.7 directory and doing a:

tar -cvf embossembassy65.tar EMBOSS-6.5.7/

This will create the file embossembassy65.tar. This is handy in case you wish to erase the whole source tree and start from scratch, or repeat the installation on other systems without having to go through steps 1-4 again to assemble the source tree.
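The unpack-and-clean-up part of step 4 can be sketched as a small shell helper. This is only a minimal sketch, assuming the EMBASSY tarballs have already been downloaded into the embassy directory; the function name unpack_tarballs is my own invention and not part of EMBOSS.

```shell
#!/bin/sh
# Sketch: unpack every *.tar.gz found in a directory, then delete the tarball.
# unpack_tarballs is a hypothetical helper name, not an EMBOSS tool.
unpack_tarballs() {
    dir="$1"
    for tarball in "$dir"/*.tar.gz; do
        # When nothing matches, the glob stays literal, so skip non-files.
        [ -f "$tarball" ] || continue
        # -C extracts into $dir regardless of the current directory.
        tar -xzf "$tarball" -C "$dir" || return 1
        # Remove the tarball only after a clean unpack.
        rm -f "$tarball"
    done
}

# Usage, from the top level of the EMBOSS-6.5.7 source tree:
#   unpack_tarballs embassy
```

The helper stops at the first tarball that fails to unpack, so a corrupt download is noticed before the source tree is archived in step 5.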

Configure and compile

We are now ready to start configuring the various packages and eventually compiling them into the EMBOSS/EMBASSY binary applications we shall be using. On my system, I chose to place the binaries and the produced libraries under:

/usr/lsc/emboss

You are free to choose what you wish on your system.

6)Thus, I cd into the top level of the EMBOSS-6.5.7 directory and I issue a:

./configure --prefix=/usr/lsc/emboss && make && make install

In one line, this tells the configure process where to place the produced files and instructs the system to compile the applications and install them under that location. Grab a cup of tea/coffee/beer, as this will take some time. If all goes well and you see no errors in the terminal output, you should see the first installed binary applications under the /usr/lsc/emboss/bin directory. In my case, I verify that I have functioning applications by executing embossversion:
./embossversion
Report the current EMBOSS version number
6.5.7.0

This means that I am on good ground and can continue with the installation of the rest of the applications.

One detail new to the EMBOSS installation process as of version 6.5.x is that the embossupdate application runs automatically, as you can see in the final output lines of a successful step 6 operation:
...
make[3]: Entering directory `/usr/lsc/sources/EMBOSS-6.5.7'
/usr/lsc/emboss/bin/embossupdate
Checks for more recent updates to EMBOSS
EMBOSS 6.5.7.0 is the latest version (6.5.0.0)

Basically, the EMBOSS install process will check for patches and updates to the source code, a process performed manually by EMBOSS admins before. This is a very welcome addition and eases the process of receiving up-to-date code, in order to address bug fixes and enhancements.

If you do not get to the point where you see the emboss applications and you see errors as part of the make process, the most likely scenario is that you are missing some development library or tool. You can get help by posting a request for help to the EMBOSS mailing list.

What you need to do now is to repeat step 6 for every subdirectory under the embassy directory and watch the new applications gradually being added to the bin folder.
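Repeating step 6 by hand for sixteen packages gets tedious, so here is a minimal sketch of a loop that does it. The function name build_embassy and the DRY_RUN and EMBOSS_PREFIX variables are my own, hypothetical names; the sketch assumes each unpacked EMBASSY package carries an executable configure script at its top level.

```shell
#!/bin/sh
# Sketch: repeat the step 6 build for each EMBASSY package directory.
# build_embassy, DRY_RUN and EMBOSS_PREFIX are hypothetical names.
EMBOSS_PREFIX=/usr/lsc/emboss

build_embassy() {
    embassy_dir="$1"
    for pkg in "$embassy_dir"/*/; do
        # Skip anything that does not look like a configurable source tree.
        [ -x "${pkg}configure" ] || continue
        echo "Building ${pkg}"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only show what would be run.
            echo "(cd ${pkg} && ./configure --prefix=${EMBOSS_PREFIX} && make && make install)"
        else
            # The subshell keeps the cd local; stop at the first failure.
            (cd "$pkg" && ./configure --prefix="$EMBOSS_PREFIX" && make && make install) || return 1
        fi
    done
}

# Usage, from the top level of the EMBOSS-6.5.7 source tree:
#   DRY_RUN=1 build_embassy embassy    # preview the commands
#   build_embassy embassy              # actually build and install
```

Stopping at the first failing package makes it easier to spot which missing library or tool caused the problem.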

Post installation configuration

You should have installed by now all the applications of core EMBOSS and EMBASSY packages from source. After this process, you should start configuring your system so you can make the applications available.

7)Make sure that the emboss bin folder is in a system-wide path, so that all users can reference the applications. On my systems, all the freshly compiled applications reside under the /usr/lsc/emboss/bin folder. Hence, this is the folder I add to the system-wide PATH. On my server, /etc/profile.d/bash_login.sh contains the following line:
export PATH=$PATH:/usr/lsc/emboss/bin
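A plain export like the one above appends the folder again every time the profile script is sourced. A slightly more defensive variant, given here only as a sketch, appends the folder once; the function name path_append is hypothetical.

```shell
# Sketch for a file under /etc/profile.d/: append a directory to PATH
# only when it is not already there. path_append is a hypothetical name.
path_append() {
    case ":${PATH}:" in
        *":$1:"*) ;;                  # already present, nothing to do
        *) PATH="${PATH}:$1" ;;       # append at the end
    esac
}

path_append /usr/lsc/emboss/bin
export PATH
```

Appending at the end means that system binaries keep precedence over anything in the EMBOSS bin folder with the same name.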

8)Make sure you install all the dependencies of the EMBOSS/EMBASSY applications you are going to use. A number of EMBOSS/EMBASSY applications are wrappers around third party packages. This means that the EMBOSS/EMBASSY application will not function unless you install its required dependencies. This is normally simple. I am not going to list all the dependencies here, but a few examples from my userbase are the following:
-emma which requires the installation of the Clustalw tool.
-eiprscan which requires the installation of the iprscan tool.
-ememe which requires the installation of the meme tool.

Each of these installations might involve an entire set of separate procedures and instructions, but you get the picture.
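To see at a glance which third party tools are still missing from the PATH, something along these lines can help. It is a rough sketch: check_tools is a hypothetical helper, and the binary names in the usage line are assumptions, since the actual executable names depend on how each third party package was installed.

```shell
# Sketch: report which external tools are visible in the PATH.
# check_tools is a hypothetical helper name.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Non-zero exit status when at least one tool was not found.
    return "$missing"
}

# Usage (binary names are assumptions, adjust them to your installs):
#   check_tools clustalw iprscan meme
```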

Part 2 of this article will examine how to configure the EMBOSS databases.

Saturday, July 14, 2012

The price of German economic success

Translation note: This is a translation of the main article titled "Lavtløne og fattige betaler regninge" (The low-paid and the poor pay the bill) by the Norwegian journalist Ingrid Brekke. The front page carries the headline "Må betale for Tysklands suksess" ("Germany's success comes at a price"). It was published in the reputable Norwegian newspaper Aftenposten on June 11, 2012 (pages 17-19).

The article accurately describes the dark side of the German economic machine and the experiences of a German journalist who lived first-hand through slave-like working conditions inside Germany. The translation is mine, the comments are yours.

---

THE DARK SIDE OF GERMANY

The largest and most powerful country in Europe holds the further development of the continent in its hands. A few decades earlier, Germany was the sick man of Europe. Everyone then admired it for the measures it took to stimulate its economy and reduce unemployment. Today, Germany prescribes the same recipe for the Eurozone crisis: spending cuts and belt-tightening.

However, the German success has a dark side, one that is particularly disappointing for Europe's left. The measures for bringing the Eurozone countries back into order come mainly at the expense of the bottom of the ladder. Wage earners have not only seen significant reductions in their real income since 2000, but have also seen the share of low-paid workers rise. At the same time, the rich are growing in number. Last year, Germany had more than one hundred billionaires (in Euro) for the first time.

Towards the end of the 1990s, a growing polarization of German income is recorded, according to researcher Markus Grabka of the German Institute for Economic Research (DIW), speaking to Der Spiegel. "Almost exclusively" the rich gained from the economic growth of recent years. He continues: "This trend will most probably continue".

German wages are sometimes so low that people cannot live off their work, even though many of them work more than 50 hours a week.

The journalist Günter Wallraff comes with new revelations about slave-labour contracts and the harsh conditions at the bottom of the income ladder in German society.

THE LOW-PAID AND THE POOR PAY THE BILL

Andy Fischer is 28 years old and starts work every day at 5 in the morning. He begins by loading the parcels into the delivery van, 230 in number (some of them weigh up to 50 kilos). 130 delivery stops without a break. Around 7 in the evening, he finishes work.

For these 14 hours of daily work, five days a week, Fischer makes 10000 Norwegian kroner (1340 Euro) a month, gross.

This is everyday life for many in rich Germany. Andy Fischer is one of the characteristic examples mentioned in the recent documentary by the world-famous journalist Günter Wallraff. Posing as an ordinary worker, Wallraff worked for several months for the company GLS, owned by the British Royal Mail.

THE WAGE SLIDE

Low-paid workers are defined as those earning less than 60% of the average wage. In 2010, this meant a gross hourly wage below 9.5 Euro.
A recent study shows that 25% of the low-paid work at least 50 hours a week.
22% of German employees are low-paid. The corresponding figure in the mid-1990s was 15%.

Germany is now the most stable economy in Europe. It rid itself of massive unemployment through tough reforms by the Social Democrat Chancellor Gerhard Schröder in the first half of the 2000s.

Restrictions in pensions, unemployment benefits and social welfare went hand in hand with effective «Kurzarbeit» measures, such as the introduction of a shorter working day in industry to create flexibility and prevent layoffs. German employees saw their wages fall by 5% compared to the year 2000.

There was also an acceptance of a practice whereby wage earners are paid so little that they cannot survive even on a full-time job, and a system of social assistance was set up to deal with the low-paid. These emergency social assistance measures tend to become permanent. Thus, despite the drop in unemployment (it is now around 7 percent, the lowest in the last 20 years), poverty levels have not changed, according to a related survey by the newspaper Die Welt. This is because most new jobs are created at low wages, in the service sectors and in some health service sectors.

12 million Germans live in fear of reaching the poverty line, a situation that many fear will begin to be associated with specific social groups, such as students. Particularly worrying is the situation in the Ruhr, where the poverty rate in many cities has exceeded 20%.

The International Labour Organization (ILO) believes that the German low-wage policy has contributed to shaping the Eurozone crisis.
German wages were so low that other Eurozone countries found it impossible to compete with them. Germany has imported very little from other Eurozone countries, while it has exported a great deal to them. For these reasons, the ILO believes that an end to the low-wage policy will have a positive effect on defusing the Eurozone crisis.

MINIMUM WAGE

Even though the conservative Chancellor Angela Merkel and her party, the CDU, are now trying to introduce minimum wages, the political pressure for real change is low. Through Norwegian eyes, the lack of social solidarity towards the low-paid is astonishing. This lack is partly explained by the fact that the Social Democrats and the trade unionists took part in establishing the low wage ceiling, which led to the plunge in unemployment.

The newspaper Süddeutsche recently pointed out that Germany's insistence on prescribing the medicine for the Eurozone crisis does not take into account that Germany has a huge middle class, that is, innovative small and medium-sized enterprises, which think long-term and are responsible for a large share of exports. Such a middle class does not exist in the crisis-hit countries of the European South, and therefore the recipe will not be successful.

INVISIBLE

Wallraff writes that he was afraid his real identity would be exposed, but he quickly discovered that his position was wrapped in a cloak of invisibility. The people in low-paid jobs, who take part in every aspect of everyday life, are very tired and have neither the time nor the money (to be visible). They are the ones who annoy us every day by parking next to the bicycle stands and leaving parcels with random neighbours of ours, writes Wallraff, fully explaining their working conditions in his report.

At the same time as his report was published, Wallraff was called as a witness in a trial concerning the appalling working conditions he had earlier exposed at a bakery. The employees there worked non-stop, and had no right to halt production even if their blood dripped onto the bread rolls.

Monday, May 7, 2012

The review and interpretation of the election result

I see a bunch of politicians who cannot swim in the deep waters of the exploratory mandate. On the one hand they declare themselves ready to save the Greeks, and on the other, when the voice of reason calling for consensus knocks on their door, they will not step over the threshold and drink water when the tap is right next to them. It seems that the need to govern a country ranks lower than their political principles and their image. Even when the country is almost at the edge of the cliff.

I think of those who voted for some scum aspiring to enter the Greek parliament. The vote is their right. They thereby show that Democracy requires an Education they lack.

It is positive that PASOK sank, a party whose fiscal audit tactics brought the IMF into Europe, not only into our country. I was very pleased that a party led for the first time by E. Venizelos sank. A personality who, in the name of negotiation, said one thing and did another, and left many other things undone (spending restraint), with disastrous consequences for the people. On the other (supposedly) bank, the likewise university professor Mrs. Katseli, one of the key figures at the core of the "there is money" line, hit rock bottom. Mrs. Katseli, with previous service in the field of Economics at Yale, Birkbeck College and other international and notable institutions, did not see the storm that was coming and proposed to G. Papandreou a fiscal adjustment that could not be implemented, because instead of weakening statism, it strengthened it. However, statism is one thing and excessive party statism is another, which Mrs. Katseli did not see.

For New Democracy I am sincerely sorry. First, because the election result is due to tragic mistakes by A. Samaras. The tragedy of the matter is that, although he was right from the start that the terms of the memorandum were unworkable, he did not make his position clear from the start. He wavered between anti-memorandum rhetoric and the need for negotiation, with inconsistencies and sloppiness (Zappeio 1) and a change of course with threats to his parliamentary group, yielding to German pressure. As a result, the Kammenos core broke away, with all that followed. Samaras did not pay for his participation in the memorandum government, as some claim. The loss would have been at PASOK levels if that were the case. He paid for his inconsistency and his inability to put things in order within New Democracy, depriving the country of a necessary core of parliamentary stability.

As for Mr. Karatzaferis, I have to say that to stay in politics, it is not enough just to know the tricks of the mass media and to project a National-Christian model of ultra-conservatism. The rest is known.

I now come to the "winners" of the elections, or more correctly to the new "forces". I use the quotation marks because as long as the country has no plan and no government, nobody is a winner or a force. What the people are telling them is NOT to take the country out of the Euro. The popular message is clear, strong and very difficult to implement: political pluralism, no power vested in a single-party mentality, and renegotiation of the terms. With Hollande at the helm of France, this does not mean a Greece that gets back on its feet. Greece will get back on its feet when the people speak and the politicians reach an understanding. So reach an understanding, because the people have spoken. Anti/pro-memorandum populism is over. Here is Rhodes, here is the jump.

Friday, January 6, 2012

The bioinformatics sysadmin craftmanship: Installing the MRS v5 platform: Part 2

In Part 1 of the article series, we examined the basics of what MRS is and its computer hardware requirements. It is about time we get our hands dirty and install a production MRS server.

A basic production setup

The image above illustrates a basic production setup for MRS. You do not have to follow this setup; you could have a single server handle everything. However, the above setup has a number of advantages that I shall explain.

There are two servers here. The front-end one serves the user queries, whereas the back-end server is used for the MRS index build process. You will notice that the front end is more beefed up hardware-wise than the backend. This is because (as explained in Part 1) the MRS queries can scale in terms of CPU, I/O and RAM. In contrast, the index building process does not scale to a large number of CPUs/cores beyond about 8 cores and the I/O it can generate. As a result, it makes sense to have the most capable machine at the query response end and keep an 8 core CPU with an adequate amount of RAM to crunch your datasets periodically.

The disk I/O setup reflects the same need/trend. I recommend placing your disks at the front-end machine and having a capable disk controller (Directly Attached Storage SAS, Fiber Channel, Fiber Channel over Ethernet). The backend machine can access these disks to build the index by means of a well performing NFS setup over 10 Gigabit Ethernet. Plain Gigabit Ethernet should also be acceptable; however, I found that a "jumbo frame" enabled 10 Gigabit Ethernet cuts the index generation time by 40-60% on average in comparison to plain Gigabit Ethernet.

This setup is designed to achieve two things:

To place the performance where it is most needed (MRS queries), especially if MRS is used as part of a pipeline (command-line or Galaxy based).
To reduce the impact of the index generation process on a busy/hard-working server that is hit by queries.

The disadvantage is of course that you have to keep two MRS instances running, so what I describe below should be applied to both servers in order to keep things in sync. However, you will see that once you get a basic instance up and running, most of your attention will turn to post-installation issues and not really on keeping two instances in sync, installation-wise.

Software prerequisites

Before we get to the specifics of an MRS server installation, let us go through some important software requirements for installing on a RHEL 6 platform. If your distro is Redhat based (Fedora, CentOS, and Scientific Linux are some of the most well known free derivatives of RHEL), the instructions should carry you through to a functional MRS installation. If your distro is not RHEL based, you can at least have a good appreciation of what building blocks are required for the proper operation of the system. Here is a list of them:

gcc 4.4.x compiler or more recent versions (see comment below)
PERL version 5.10 or more recent versions
perl-XML-LibXSLT module
The Boost C++ library, versions >= 1.42 and <= 1.48
The libarchive interface
A copy of snarf.

Working as the root user, on a RHEL 6 platform, most of these components can be easily installed by the yum package manager with the exception of the Boost library and snarf:
yum install perl-XML-LibXSLT
yum install libarchive libarchive-devel

Starting with the gcc compiler: due to some code optimization bugs, there were problems when attempting to compile MRS and its prerequisites with a compiler more recent than the 4.4.x gcc series. By mid-January 2012, this issue was addressed and it is now possible to use more recent compilers than 4.4.x. Nevertheless, the RedHat default 4.4 gcc compiler (in my case it was 4.4.6 20110731 (Red Hat 4.4.6-3)) is a stable choice.

At the time of writing, RHEL 6.2 (Santiago) is equipped with Boost version 1.41, as part of its default yum package repository. That's too old for MRS, which means that we have to uninstall the yum-provided Boost packages and install the Boost libs from source.

yum remove boost boost-devel boost-date-time boost-filesystem boost-graph boost-system boost-iostreams boost-thread boost-regex boost-serialization boost-signals
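If you are unsure which Boost version a given include tree carries (the yum one before removal, or your own build afterwards), the version.hpp header records it in the BOOST_LIB_VERSION macro. The following is a minimal sketch that reads it and checks it against the 1.42 to 1.48 window from the prerequisites list; the two function names are hypothetical.

```shell
# Sketch: read BOOST_LIB_VERSION from a Boost version.hpp header and
# check it against the 1.42 to 1.48 window. Function names are hypothetical.

# Print the version string as recorded in the header, e.g. "1_41".
boost_lib_version() {
    sed -n 's/^#define BOOST_LIB_VERSION "\([0-9_]*\)".*/\1/p' "$1"
}

# Succeed only when the header reports 1.42 <= version <= 1.48.
boost_version_ok() {
    ver=$(boost_lib_version "$1")
    major=${ver%%_*}        # text before the first underscore
    rest=${ver#*_}          # text after the first underscore
    minor=${rest%%_*}       # drop a trailing patch level, if any
    [ "$major" = "1" ] && [ "$minor" -ge 42 ] && [ "$minor" -le 48 ]
}

# Usage:
#   boost_lib_version /usr/include/boost/version.hpp
#   boost_version_ok /usr/lsc/libs/include/boost/version.hpp && echo "Boost is fine for MRS"
```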

Then grab a copy of Boost 1.48 from:
http://sourceforge.net/projects/boost/files/boost/

and complete the Boost install in a fixed path (in my case /usr/lsc/libs) by doing a:

tar xvfz boost_1_48_0.tar.gz
cd boost_1_48_0
./bootstrap.sh --prefix=/usr/lsc/libs
./b2 install

(Note: Earlier versions of libzeep had a problem with Boost versions newer than 1.47 and would not build. Around mid-January 2012, it became possible to use Boost version 1.48.)

At that point, make sure that your shared library config (normally /etc/ld.so.conf) contains the /usr/lsc/libs/lib path and then run ldconfig. Check with an ldconfig -p | grep /usr/lsc/libs/lib to see that the Boost shared libraries are in place.
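Instead of editing /etc/ld.so.conf by hand, the library path can go into a drop-in file. This is a sketch under the assumption that your /etc/ld.so.conf includes the ld.so.conf.d directory, as the RHEL default does; the function name register_libdir and the file name lsc-libs.conf are hypothetical.

```shell
# Sketch: register a library directory in a drop-in config file, once.
# register_libdir and the default file name lsc-libs.conf are hypothetical.
register_libdir() {
    libdir="$1"
    conf="${2:-/etc/ld.so.conf.d/lsc-libs.conf}"
    # Do nothing when the path is already listed on a line of its own.
    if [ -f "$conf" ] && grep -qxF "$libdir" "$conf"; then
        return 0
    fi
    echo "$libdir" >> "$conf"
}

# Usage, as root:
#   register_libdir /usr/lsc/libs/lib && ldconfig
#   ldconfig -p | grep /usr/lsc/libs/lib
```

Running it twice is harmless, which is convenient when the same setup script is applied to both the front-end and the back-end server.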

For snarf, you need to install the utility in the system wide PATH.

MRS installation

At this point, we should be ready to start installing MRS itself. Libzeep is the first part of installing MRS. It is a bespoke W3C compliant XML processor that enables MRS to speak SOAP. This allows users to query an MRS server using web services. Still working as the root user, get the latest version (2.6.3 at the time of writing) from the CMBI SVN server:

svn co https://svn.cmbi.ru.nl/libzeep/trunk

(revision 337)

Modify the makefile and set the following parameters, bearing in mind the prefix where you want libzeep to be installed:

BOOST_LIB_DIR = /usr/lsc/libs/lib
BOOST_INC_DIR = /usr/lsc/libs/include

PREFIX ?= /usr/lsc/libs

Then issue a:
make && make install && ldconfig

Do verify with an ldconfig -p | grep libzeep that the libzeep shared libraries are in place:

libzeep.so.2.6 (libc6,x86-64) => /usr/lsc/libs/lib/libzeep.so.2.6
libzeep.so (libc6,x86-64) => /usr/lsc/libs/lib/libzeep.so

We are now ready to install the actual MRS code. Grab the latest version from the CMBI SVN server:

svn co https://svn.cmbi.ru.nl/mrs/trunk

(checks out revision 1430)

Note: The MRS SVN repository is an active project and as such, the developers might be in the process of cleaning/modifying the code. It is possible that if you check out the latest sources from the CMBI SVN server, something might break or fail to compile. When in doubt, please consult the MRS mailing list and verify the latest known working version. At the time of writing, you can be sure that revision 1430 is a working MRS version. If you wish to use it as a reference, you can issue the command:

svn co -r 1430 https://svn.cmbi.ru.nl/mrs/trunk

Then I shall make the directory under which I shall have the MRS binary utilities, as well as the directory where I am going to store the datasets (the large multi-TB volume we talked about in Part 1 of the article series):

mkdir /usr/lsc/mrs

mkdir /storage/tools/mrsdata

Then, I am ready to configure the sources by selecting the prefix and the data directory location, as well as pointing to the location of the Boost libraries which I installed from source, just to be safe and ensure the MRS configure routine will find the right library paths, as shown below:

./configure --prefix=/usr/lsc/mrs --data-dir=/storage/tools/mrsdata --boost_lib=/usr/lsc/libs/lib --boost_inc=/usr/lsc/libs/include

Various checks will be performed and if no errors are returned at this stage, the configure command should be followed by a:

make && make install && ldconfig

If the compilation stage finishes with no errors (you will see plenty of warnings and you can normally safely ignore them), you have just completed the MRS installation stage. Congratulations!

Post installation config check and orientation

In this section, we will discuss what you should see/check prior to using MRS for the first time. After completing the make install and ldconfig steps as described above, you should familiarize yourself with the directory layout of MRS. So, let us take a tour of the various MRS directories.

First of all, under the installation prefix (it was /usr/lsc/mrs) you should see the following directories:

bin: This is where the MRS utilities reside: mrs-blast mrs-config-value mrs-mirror mrs-run-and-log mrs-build mrs-lock-and-run mrs-query mrs-test and mrs-update. All these are tools you will be using to configure and query the various MRS datasets.
lib: This directory is meant to hold MRS library modules, but it is empty on 64bit systems (x86_64).
lib64: On 64-bit systems (x86_64), this contains the MRS.so shared library, as well as the MRS.pm Perl module, a vital module referenced by the dataset parsing scripts (share directory)
sbin: Here you should have the mrs-ws binary which is the SOAP web services MRS module.
share: This directory contains a series of Perl parsers, one for each databank MRS supports.
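The directory tour above can be turned into a quick sanity check. The sketch below simply tests that the five directories listed exist under a given prefix; check_mrs_layout is a hypothetical helper name, and the list of subdirectories assumes a 64-bit install as described.

```shell
# Sketch: verify that the expected MRS directories exist under a prefix.
# check_mrs_layout is a hypothetical helper name.
check_mrs_layout() {
    prefix="$1"
    status=0
    for sub in bin lib lib64 sbin share; do
        if [ -d "$prefix/$sub" ]; then
            echo "ok:      $prefix/$sub"
        else
            echo "missing: $prefix/$sub"
            status=1
        fi
    done
    # Non-zero exit status when at least one directory is absent.
    return "$status"
}

# Usage:
#   check_mrs_layout /usr/lsc/mrs
```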

Next, you should navigate your shell to the /usr/local/etc/mrs directory. Under the directory, you should find a series of important configuration files. I shall not go into details on the syntax of these files in this article, but very briefly:

databank.info: This file instructs MRS how to fetch (location and method) and generate index for various databanks you can offer/query under MRS.
mrs-config.xml: This XML formatted file (its DTD schema is in the mrs-config.dtd) controls various operational parameters of MRS such as the location of the various MRS directories (most of them are auto-generated by the configure step of the MRS sources), the location/path of externally used utilities (clustalw, NCBI BLAST), as well as the port number and URL location of the MRS SOAP web services servers. Later articles will explain these parameters in more detail.

Both of the previously mentioned files have a sample you can use for reference (*.dist files).

If you can see all of these things at this point, you are on the right track to fire up MRS for the first time and check it out. We will do that by navigating back to the /usr/lsc/mrs/bin directory. We are going to fetch a simple databank and watch MRS generate the index so we can query the database. We shall do that by running the mrs-update utility:

./mrs-update enzyme

and if everything was compiled properly, MRS will issue the following output:

/usr/lsc/mrs/bin/mrs-run-and-log -r 5 -l /storage/tools/mrsdata/status/enzyme.fetch_log /usr/bin/make -f /usr/lsc/mrs/bin/mrs-update DATABANK=enzyme fetch

/usr/bin/make: success

/usr/lsc/mrs/bin/mrs-run-and-log -r 5 -l /storage/tools/mrsdata/status/enzyme.mrs_log /usr/bin/make -f /usr/lsc/mrs/bin/mrs-update DATABANK=enzyme mrs

/usr/bin/make: success

rm -f /storage/tools/mrsdata/flags/enzyme.fetch_done /storage/tools/mrsdata/flags/enzyme.mrs_done

After that, we can navigate under the MRS data directory to gain a basic understanding of what happens every time MRS generates the index of a databank. Under the data directory (in my case as indicated by the above output /storage/tools/mrsdata), you will find the following sub-directories:

mrs: This is where the MRS index files are produced and stored. Each databank has a number of associated .cmp files, together with an associated dictionary file .dict. For the enzyme databank, the produced files are enzyme.cmp and enzyme.dict.
raw: This directory holds the flatfiles of the databanks. These are downloaded from the URL and method, as specified in the databank.info file.
status: A useful directory for the MRS administrator, as it holds important logs about the status of the mrs-update process for the databanks. For each MRS hosted databank, you can see the fetch_log (whether the flatfile download procedure was completed), the mrs_log which outlines whether the MRS index generation was completed properly. Finally, if all was completed properly, an mrs_done file is created to indicate that MRS was successful in updating the databank. The logs for each databank auto-rotate.
docroot: This directory holds the CSS/HTML and web content of the MRS HTTP server. We will describe how to fire-up this server shortly, together with the system SOAP functionality.
flags: This directory is used internally by MRS to sync certain procedures of the databank fetching process.
blast: This directory contains the NCBI BLAST database index for each databank, in order to BLAST databases via the MRS system.

Hence, every time you mrs-update a databank, the latest flatfiles are fetched automatically under the raw directory. After that, the mrs-build utility will attempt to invoke the MRS parsers and create the index under the mrs directory.

If you wish to see which databanks you can fetch/update with the mrs-update utility, here is a list of them:

dbest embl_release embl_updates enzyme genbank_release gene go goa gpcrdb interpro omim oxford pdb pdbfinder2 dssp hssp pfam pmc prints prosite rebase refseq_release refseq_updates taxonomy unigene uniprot uniref50 uniref90 uniref100

I leave generating their MRS indexes as an exercise to the reader, with two hints:

Do not attempt to start multiple mrs-update processes in parallel. Remember, the index generation process does not scale.

Some of the largest databanks (embl_release, genbank, dbest, pdb, hssp) will require entire days to download and index. Thus, what I tend to do is to issue something like: nohup ./mrs-update embl_release & , to ensure that the process will not be interrupted by a terminal session timeout/disconnection.
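Both hints can be combined in a small wrapper that updates several databanks strictly one after the other. It is only a sketch: update_databanks and the MRS_UPDATE variable are hypothetical names, and the default path assumes the /usr/lsc/mrs prefix used in this article.

```shell
# Sketch: update several databanks sequentially, never in parallel.
# update_databanks and MRS_UPDATE are hypothetical names.
MRS_UPDATE="${MRS_UPDATE:-/usr/lsc/mrs/bin/mrs-update}"

update_databanks() {
    for db in "$@"; do
        echo "updating $db"
        # Stop at the first databank that fails to update.
        "$MRS_UPDATE" "$db" || { echo "failed: $db" >&2; return 1; }
    done
}

# Usage: save this in a file (say update-all.sh, a made-up name) with a
# final line reading: update_databanks "$@"
# and start it detached from the terminal session:
#   nohup sh update-all.sh enzyme prosite pfam > update-all.log 2>&1 &
```

Because each mrs-update call only starts once the previous one has returned, the index generation never competes with itself for cores and I/O.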

Querying and firing up the MRS web and SOAP server

If you have followed all the previous instructions, you should have installed MRS and have indexed one or more databanks. What about querying them to ensure that MRS does indeed do its job? After all that effort, you should really experience the power and simplicity of MRS searches.

The most intuitive way is to fire up the MRS web server. Before you fire up the MRS web server interface, you should consider making a non-privileged user. Up to this point, we have been working as root. However, opening an HTTP/SOAP port bound to a process with superuser credentials is not the best thing for the security of your server. What I do is to make a normal system user:

useradd -d /home/users/mrsuser mrsuser

I assign a secure password to the user. As root, I make sure that this user can have access to the /var/log/mrsws.log file which logs the queries that hit the MRS SOAP server:

touch /var/log/mrsws.log

chown mrsuser /var/log/mrsws.log

and also change the ownership of the /usr/lsc/mrs directories to the mrsuser recursively:

chown -R mrsuser /usr/lsc/mrs

After that, I switch to the mrsuser and start the MRS SOAP server by navigating to the following directory:

su - mrsuser

cd /usr/lsc/mrs/sbin

nohup ./mrs-ws &

This starts a number of mrs-ws servers with the credentials of the mrsuser and not the root account. Make sure you do not have a firewall between your desktop and the server and point a recent web browser version (Firefox 8, Chrome) to the IP of your server, following the URL convention:

http://IP_of_your_server:18080
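Before reaching for the browser, the port can be probed from a shell on your desktop. This is a rough sketch that assumes curl is installed; mrs_url and mrs_check are hypothetical helper names and 192.0.2.10 is a placeholder address.

```shell
# Sketch: build the MRS web URL and probe it with curl.
# mrs_url and mrs_check are hypothetical names.

# Print the URL for a host and an optional port (default 18080).
mrs_url() {
    echo "http://$1:${2:-18080}/"
}

# Print the HTTP status code returned by the server, waiting at most
# 10 seconds; curl typically reports 000 when nothing answers.
mrs_check() {
    curl -s -o /dev/null -m 10 -w '%{http_code}\n' "$(mrs_url "$@")"
}

# Usage (placeholder address):
#   mrs_check 192.0.2.10
```

A status of 200 suggests that mrs-ws is up and that no firewall sits in the way.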

If you hit the Status tab, you will see the MRS web environment as shown above. You can enter your search terms in the top bar and search against all or specific databases.

There are other ways to search the databases that will be outlined in Part 3 of this article series. However, you now have the basic knowledge of how to kickstart MRS in a basic way. In the next article, we will discuss the production usage of MRS.
