Thoughts of a technocrat as letter sequences: August 2012

EMBOSS Database configuration

Part 1 of this article series covered a basic installation of EMBOSS from sources. The configuration of EMBOSS databases merits a separate part, as it requires some knowledge of the indexing process and of the various mechanisms to download and index flatfile databases. Correspondence on the EMBOSS mailing list shows that this topic frequently confuses users and admins, so we are going to take a detailed look at it.

Remote data access methods and the emboss.default file

If you would like a recap of what a flatfile database is and what EMBOSS can do for you in terms of accessing indexed flatfile databases, you might like to take a look at some of the lectures I have given on the subject (slides, video). EMBOSS is not the fastest or most efficient way to index your flatfile databases. For more efficient indexing and comprehensive queries on flatfile databases, you should look at MRS and similar systems. In fact, EMBOSS can access MRS indexed databases and, in my opinion, this is better than a pure EMBOSS index system in many respects (speed of indexing/querying the index, storage efficiency etc). Nevertheless, EMBOSS does its job, and this section describes only the process of indexing flatfile databases using EMBOSS utilities exclusively.

One thing you need to understand is that, in order to have access to indexed flatfile databases, you do not always have to index them locally. The EMBOSS applications support a variety of remote data retrieval methods for many useful datasets. Amongst the most popular are:

MRS methods (mrs, mrs3 and mrs4): These allow you to search an MRS based index on a local or remote server.

DBFETCH method (dbfetch): Supported by servers at EBI.

WSDBFETCH method (wsdbfetch): A SOAP based EBI service similar to the DBFETCH method.

BIOMART method (biomart): Using the Biomart service.

To understand how to activate these different data access methods, you will need to become familiar with the 'emboss.default' file. Part 1 of this article series mentioned that the EMBOSS installation directory was /usr/lsc/emboss. You will need to navigate to the following directory:

/usr/lsc/emboss/share/EMBOSS

When you install EMBOSS for the first time on your system, you will see two files there, amongst others:

The 'emboss.default.template' file: This is a sample configuration file which shows the EMBOSS admin how to define databases. We will explain more as we go, but you can use this file as a reference for many examples of how to properly configure various types of EMBOSS databases.
The emboss.standard file: This file also contains valid EMBOSS database configuration entries. Unlike the template, its database definitions are included by default in your current setup.

The idea is that you have some default entries in the emboss.standard file which are included in your database list. So, if on your shell you issue a:

showdb

you will immediately get the following list of database entries by default:

Display information on configured databases
# Name          Type     ID Qry All Comment
# ============= ======== == === === =======
taxon           Taxonomy OK OK OK -
drcat           Resource OK OK OK -
chebi           Obo      OK OK OK -
eco             Obo      OK OK OK -
edam            Obo      OK OK OK -
edam_data       Obo      OK OK OK -
edam_format     Obo      OK OK OK -
edam_identifier Obo      OK OK OK -
edam_operation  Obo      OK OK OK -
edam_topic      Obo      OK OK OK -
go              Obo      OK OK OK -
go_component    Obo      OK OK OK -
go_function     Obo      OK OK OK -
go_process      Obo      OK OK OK -
pw              Obo      OK OK OK -
ro              Obo      OK OK OK -
so              Obo      OK OK OK -
swo             Obo      OK OK OK -
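
If you want to process this listing in a script, for instance to check which databases are defined on a server, the table can be split into records. The following is only a minimal sketch: the function name parse_showdb is made up for this example, and it assumes the column layout shown above, with a single-word Type column.

```python
def parse_showdb(text):
    """Parse the table printed by showdb into a list of dicts.

    Assumes the layout shown in the article: name, type, three
    status columns (ID, Qry, All) and a free-text comment.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the title line and the '#' header lines.
        if not line or line.startswith("#") or line.startswith("Display"):
            continue
        parts = line.split(None, 5)
        if len(parts) < 5:
            continue  # not a table row
        rows.append({
            "name": parts[0],
            "type": parts[1],
            "id": parts[2],
            "qry": parts[3],
            "all": parts[4],
            "comment": parts[5] if len(parts) > 5 else "",
        })
    return rows
```

The text to parse could come from running showdb through Python's subprocess module and reading its output; that step is left out here so that the sketch has no dependency on a working EMBOSS installation.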

If you wish to define any additional databases beyond this default list, you should create an emboss.default file, using the file 'emboss.default.template' as your reference (we are going to explain how shortly).

For now let's focus on these default databases defined by the emboss.standard file. They are a good example of how the new EMBOSS 6.5 enables remote data access from a variety of global public servers out of the box (I assume your Internet connection is working, right?). Let's use the EDAM ontology to retrieve data about an identifier. To do that I choose the ontotext EMBOSS application and I type:

ontotext edam_data:0849

The resulting file (0849.ontotext) contains the info which is retrieved from available servers. Let's take a look at the emboss.standard file to see how the edam_data database is defined:

DB edam_data [
type: "obo"
format: "obo"
method: "emboss"
dbalias: "edam"
namespace: "data|identifier"
indexdirectory: "$emboss_standard/index"
directory: "$emboss_standard/data"
field: "id ! identifier without the prefix"
field: "acc ! full name and any alternate identifier(s)"
field: "nam ! words in the name"
field: "isa ! parent identifier from is_a relation(s)"
field: "des ! words in the description"
field: "ns ! namespace"
field: "hasattr ! identifier(s) from has_attribute relation(s)"
field: "hasin ! identifier(s) from has_input relation(s)"
field: "hasout ! identifier(s) from has_output relation(s)"
field: "isid ! identifier(s) from is_identifier_of relation(s)"
field: "isfmt ! identifier(s) from is_format_of relation(s)"
field: "issrc ! identifier(s) from is_source_of relation(s)"
]
...

RES edamresource [
type: "Index"
fields: "id acc nam isa des ns hasattr hasin hasout
isid isfmt issrc"
acclen: "80"
namlen: "32"
deslen: "30"
accpagesize: "8192"
despagesize: "4096"
]

In general, an EMBOSS database definition has two main parts:

The DB definition part: It defines the name, type, format, access method and various fields of the database record.
The RES (resource definition) part: Where the length of the various record fields is defined in the index. (note that RES definitions are normally found towards the end of the file).

The DB and RES blocks go together for each database definition. In addition, for remote data access methods, a SERVER definition might be necessary to enable access to remote information repositories.
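
To make the block structure more concrete, here is a rough sketch of how such definitions could be read by a script. It is not the parser EMBOSS itself uses. The names parse_definitions, BLOCK_RE and ATTR_RE are invented for this example, and the code only handles the simple syntax seen in the excerpts above: a keyword, a name, and 'key: value' attributes between square brackets, with no ']' inside a value.

```python
import re

# KEYWORD name [ ... ]  (assumes no ']' appears inside attribute values)
BLOCK_RE = re.compile(r"\b(DB|RES|SERVER)\s+(\S+)\s*\[(.*?)\]", re.S)
# key: "quoted value"  or  key: bare_value
ATTR_RE = re.compile(r'(\w+):\s*(?:"([^"]*)"|(\S+))')


def parse_definitions(text):
    """Return a list of (kind, name, attributes) tuples.

    attributes maps each key to a list of values, because keys such
    as 'field' can be repeated inside one block.
    """
    # Drop full-line comments before matching blocks.
    body = "\n".join(
        line for line in text.splitlines()
        if not line.lstrip().startswith("#")
    )
    blocks = []
    for kind, name, inner in BLOCK_RE.findall(body):
        attrs = {}
        for key, quoted, bare in ATTR_RE.findall(inner):
            value = quoted if quoted else bare
            # Collapse line breaks inside multi-line quoted values.
            attrs.setdefault(key, []).append(" ".join(value.split()))
        blocks.append((kind, name, attrs))
    return blocks
```

Run on the excerpt above, this would yield one DB entry named edam_data with a list of twelve 'field' values, and one RES entry named edamresource whose two-line 'fields' value is joined into a single string.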

Step 9: The 'emboss.default' file does not yet exist, so create it in the same directory as the emboss.default.template file. From now on, you will be editing the emboss.default file to define all aspects of the EMBOSS database configuration. Start with a minimal file like the one below:

#############################################
# EMBOSS environment variables
#############################################

SET emboss_tempdata /usr/lsc/emboss/share/EMBOSS/test

DB martensembl [
    method: "biomart"
    type: "P"
    url: "http://www.biomart.org:80/biomart/martservice"
    dbalias: "hsapiens_gene_ensembl"
    format: "biomart"
    filter: "chromosome_name=13"
    sequence: "peptide"
    return: "ensembl_gene_id,description,external_gene_id,chromosome_name"
]

Here, we have defined the database 'martensembl', which can remotely retrieve entries from the Homo sapiens Ensembl gene repository. Save the file and go back to your shell. You can repeat the 'showdb' command and verify that you can see the newly defined martensembl database. Now, test it by typing:

seqret martensembl:ENST00000380152

The resulting FASTA file should contain the info you require, retrieved all the way from the remote Biomart server. Congratulations, you just set up your first remote database access in EMBOSS!
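
If you end up maintaining several definitions of this kind, you may prefer to generate the DB blocks from a script rather than typing them by hand. The sketch below is only an illustration: render_db_block is a name invented here, it quotes every value as in the martensembl example, and it assumes that no value contains a double quote.

```python
def render_db_block(name, attributes):
    """Render a DB definition in the style of the martensembl example.

    attributes is a list of (key, value) pairs so that the order is
    preserved and a key may appear more than once.
    """
    lines = [f"DB {name} ["]
    for key, value in attributes:
        lines.append(f'    {key}: "{value}"')
    lines.append("]")
    return "\n".join(lines)
```

For example, render_db_block("martensembl", [("method", "biomart"), ("type", "P")]) returns a four-line block that could be appended to the emboss.default file.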

Browsing remote access repositories is a good idea and the EMBOSS team was right to enable the functionality in EMBOSS. However, accessing remote datasets does not always work very well if:

You go into a place where Internet availability is sketchy or of limited bandwidth capacity.
The datasets you need to access involve millions of sequences or Gigabytes of information.

In these cases, your only reliable option is to set up a database locally and build a flatfile database index. This is explained in the next section.

How to define a local flatfile database index

What was said in the previous section about the main parts of an EMBOSS database definition in the emboss.standard file also applies to the emboss.default file. As an example, let's index the latest Uniprot/sprot database in three steps:

Step A: Download and uncompress the latest file into your flatfile index area, a directory with plenty of space to hold your flatfiles and the indices produced from your datasets. The file lies here (EBI FTP server). On the command line, you could do a:

wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

followed by a:

gunzip uniprot_sprot.dat.gz
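
If you would rather script Step A, for example to refresh the flatfile from a scheduled job, the two commands above could be approximated with the Python standard library. This is a sketch under assumptions: the function names download and gunzip are made up for this example, and unlike the command line gunzip, this version leaves the .gz file in place.

```python
import gzip
import shutil
import urllib.request

# URL taken from the wget command above.
SPROT_URL = (
    "ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/"
    "knowledgebase/complete/uniprot_sprot.dat.gz"
)


def download(url, dest_path):
    """Fetch url and write it to dest_path (the wget step)."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


def gunzip(gz_path, out_path):
    """Stream-decompress gz_path into out_path (the gunzip step).

    The data is copied in chunks, so the multi-gigabyte flatfile is
    never held in memory as a whole.
    """
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

A typical call sequence would be download(SPROT_URL, "uniprot_sprot.dat.gz") followed by gunzip("uniprot_sprot.dat.gz", "uniprot_sprot.dat"), run from inside your flatfile index area.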

Step B: Update the 'emboss.default' file by adding a database definition, as well as a resource definition, as shown below:

SET emboss_database_dir /storage/tools/embossdbs
SET emboss_db_dir /storage/tools/embossdbs

DB sprot [
        type: P
        method: emboss
        release: "57.1"
        format: swiss
        fields: "id acc sv des key org"
        directory: $emboss_db_dir/uniprotsprotfiles
        file: *.dat
        indexdirectory: $emboss_db_dir/uniprotsprotfiles
        comment: "UniProtKB/Swiss-Prot Latest Release "
]

RES sprot [
   type: Index
   idlen: 15
   acclen: 15
   svlen: 20
   keylen: 85
   deslen: 75
   orglen: 75
]

The first two lines are optional and provide an alias for the directory locations where you have uncompressed the flatfile and you are going to produce the index. After that you have the database (DB sprot) definition. It is a protein sequence database (type: P). The fields specification is important. It lists all the indices that are going to be produced. So, we know that we will be able to search the database by sprot IDs (id), accession number (acc), sequence version (sv), descriptive text from the sequence header (des), keyword (key) and taxonomy info (org).

Each of these index fields has a defined length as part of the associated RES (resource definition) entry. Note that it is important to define both the DB and the RES blocks. If, for example, you forget to define the RES record, the EMBOSS applications will complain with an error message similar to this one until you resolve the issue:

EMBOSS An error in ajnam.c at line 9126:
unknown resource 'sprot'

For now, save the file and do a showdb to verify that you can see the 'sprot' database. If you have omitted or misconfigured any important parts of the definition, the command should complain with informative errors.
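
As a small aid for Step B, the pairing between the index fields of the DB block and the '<field>len' entries of the RES block can be checked with a few lines of code. This is a sketch only: the function name fields_without_length is invented here, and the naming rule (field name plus 'len') is read off the sprot example above. Note that the edamresource excerpt from emboss.standard sets lengths for only some of its fields, which suggests that a missing length is not necessarily an error, so treat the result as a hint rather than a verdict.

```python
def fields_without_length(fields, res_attributes):
    """Return the index fields that have no '<field>len' entry.

    fields is the space-separated string from the DB block, and
    res_attributes is a dict of the attributes of the RES block.
    """
    return [
        field for field in fields.split()
        if field + "len" not in res_attributes
    ]
```

With the sprot definition above, the call fields_without_length("id acc sv des key org", {...}) returns an empty list when the dict holds all six lengths, and lists 'key' if keylen has been left out.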

Step C: Produce the index. Go to the directory where you have your uncompressed flatfile (.dat) (in my case this is under /storage/tools/embossdbs/uniprotsprotfiles) and type the following EMBOSS command:

dbxflat -outfile uniprotsprotout -directory /storage/tools/embossdbs/uniprotsprotfiles -idformat SWISS -filenames '*.dat' -fields id,acc,sv,des,key,org -compressed N -dbname sprot -dbresource sprot -release 2012_07 -date 03/08/12
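
Since the dbxflat invocation above is long and has to be repeated for every new release, it may be convenient to assemble it in a script. The sketch below uses only the options that appear in the command above. The function name dbxflat_command and its parameter names are assumptions made for this example, and the resource name defaults to the database name as in the sprot case.

```python
import shlex


def dbxflat_command(dbname, directory, fields, release, date, outfile,
                    idformat="SWISS", filenames="*.dat", resource=None):
    """Build the dbxflat argument list used in Step C."""
    return [
        "dbxflat",
        "-outfile", outfile,
        "-directory", directory,
        "-idformat", idformat,
        "-filenames", filenames,
        "-fields", ",".join(fields),
        "-compressed", "N",
        "-dbname", dbname,
        "-dbresource", resource or dbname,
        "-release", release,
        "-date", date,
    ]


cmd = dbxflat_command(
    dbname="sprot",
    directory="/storage/tools/embossdbs/uniprotsprotfiles",
    fields=["id", "acc", "sv", "des", "key", "org"],
    release="2012_07",
    date="03/08/12",
    outfile="uniprotsprotout",
)
# shlex.join quotes '*.dat' so the line can be pasted into a shell.
command_line = shlex.join(cmd)
```

The list form can be handed to subprocess.run(cmd, check=True, cwd=...) without going through a shell, in which case the '*.dat' pattern reaches dbxflat unexpanded, just as the quoted form does on the command line.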

You will need to wait a bit, as the system takes its time to crunch the index.

If all goes well, you should see the following index files in the directory where your flatfile lies:

-rw-r--r--. 1 root root 103 2012-08-03 19:28 sprot.ent
-rw-r--r--. 1 root root 299 2012-08-03 19:36 sprot.pxac
-rw-r--r--. 1 root root 301 2012-08-03 19:36 sprot.pxde
-rw-r--r--. 1 root root 295 2012-08-03 19:36 sprot.pxid
-rw-r--r--. 1 root root 297 2012-08-03 19:36 sprot.pxkw
-rw-r--r--. 1 root root 295 2012-08-03 19:36 sprot.pxsv
-rw-r--r--. 1 root root 299 2012-08-03 19:36 sprot.pxtx
-rw-r--r--. 1 root root 63M 2012-08-03 19:36 sprot.xac
-rw-r--r--. 1 root root 259M 2012-08-03 19:36 sprot.xde
-rw-r--r--. 1 root root 40M 2012-08-03 19:36 sprot.xid
-rw-r--r--. 1 root root 161M 2012-08-03 19:36 sprot.xkw
-rw-r--r--. 1 root root 38M 2012-08-03 19:36 sprot.xsv
-rw-r--r--. 1 root root 264M 2012-08-03 19:36 sprot.xtx
-rw-r--r--. 1 root root 2,5G 2012-08-03 19:26 uniprot_sprot.dat
-rw-r--r--. 1 root root 758 2012-08-03 19:36 uniprotsprotout

and you should be able to test your new database. For instance, to obtain all sequences that have the word influenza in the description index from your current sprot release, you could type:

seqret sprot-des:influenza
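
If such a query fails, one thing worth checking is whether all the index files from the listing above are really present. The following sketch does that. The function name missing_index_files is invented for this example, and the mapping from field names to file suffixes is simply read off the directory listing (note key becoming kw and org becoming tx), so it is an assumption that holds for this sprot example rather than a documented rule.

```python
from pathlib import Path

# Field name -> index file suffix, inferred from the listing above.
INDEX_SUFFIXES = {
    "id": "id",
    "acc": "ac",
    "sv": "sv",
    "des": "de",
    "key": "kw",
    "org": "tx",
}


def missing_index_files(index_dir, dbname, fields):
    """Return the expected index file names that are not in index_dir."""
    expected = [f"{dbname}.ent"]
    for field in fields:
        suffix = INDEX_SUFFIXES[field]
        expected.append(f"{dbname}.x{suffix}")
        expected.append(f"{dbname}.px{suffix}")
    root = Path(index_dir)
    return [name for name in expected if not (root / name).is_file()]
```

For the setup in this article, you would call missing_index_files("/storage/tools/embossdbs/uniprotsprotfiles", "sprot", ["id", "acc", "sv", "des", "key", "org"]) and expect an empty list after a successful dbxflat run.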

The same procedure could be used for nucleotide databases (type: N). Remember, you have the emboss.default.template as your guide. I hope you now have a better understanding of how you can set up local databases in EMBOSS.

Friday, August 3, 2012

The bioinformatics sysadmin craftmanship: An EMBOSS 6.5 production server install: Part 2: EMBOSS database access setup
