EMBOSS Database configuration
Part 1 of this article series covered a basic installation of EMBOSS from sources. The configuration of EMBOSS databases merits a separate part, as it requires some knowledge of the indexing process and of the various mechanisms for downloading and indexing flatfile databases. Correspondence on the EMBOSS mailing list shows that this topic frequently confuses users and admins, so we are going to take a detailed look at it.
Remote data access methods and the emboss.default file
If you would like a recap of what a flatfile database is and what EMBOSS can do for you in terms of accessing indexed flatfile databases, you might like to take a look at some of the lectures I have given on the subject (slides, video). EMBOSS is not the fastest or most efficient way to index your flatfile databases. Systems such as MRS index flatfile databases more efficiently and support comprehensive queries on them. In fact, EMBOSS can access MRS-indexed databases and, in my opinion, this is better than a pure EMBOSS index system in many respects (speed of indexing and querying the index, storage efficiency, etc.). Nevertheless, EMBOSS does its job, and this section describes only the process of indexing flatfile databases using EMBOSS utilities exclusively.
One thing you need to understand is that in order to have access to indexed flatfile databases, you do not always have to index them locally. The EMBOSS applications support a variety of remote data retrieval methods for many useful datasets. Among the most popular are:
- MRS methods (mrs, mrs3 and mrs4): These allow you to search an MRS based index on a local or remote server.
- DBFETCH method (dbfetch): Supported by servers at EBI.
- WSDBFETCH method (wsdbfetch): A SOAP based EBI service similar to the DBFETCH method.
- BIOMART method (biomart): Using the Biomart service.
To understand how to activate these different data access methods, you will need to become familiar with the 'emboss.default' file. Part 1 of this article series mentioned that the EMBOSS installation directory was /usr/lsc/emboss. You will need to navigate to the following directory:
/usr/lsc/emboss/share/EMBOSS
When you install EMBOSS for the first time on your system, you will see, among others, two files:
- The 'emboss.default.template' file: This is a sample configuration file which shows the EMBOSS admin how to define databases. We will explain more as we go, but you can use this file as a reference for many examples of how to properly configure various types of EMBOSS databases.
- The 'emboss.standard' file: This file also contains valid EMBOSS database configuration entries. Unlike the template, its database definitions are included by default in your current setup.
If you type:
showdb
you will immediately get the following list of database entries by default:
Display information on configured databases
# Name            Type     ID  Qry All Comment
# =============   ======== ==  === === =======
taxon             Taxonomy OK  OK  OK  -
drcat             Resource OK  OK  OK  -
chebi             Obo      OK  OK  OK  -
eco               Obo      OK  OK  OK  -
edam              Obo      OK  OK  OK  -
edam_data         Obo      OK  OK  OK  -
edam_format       Obo      OK  OK  OK  -
edam_identifier   Obo      OK  OK  OK  -
edam_operation    Obo      OK  OK  OK  -
edam_topic        Obo      OK  OK  OK  -
go                Obo      OK  OK  OK  -
go_component      Obo      OK  OK  OK  -
go_function       Obo      OK  OK  OK  -
go_process        Obo      OK  OK  OK  -
pw                Obo      OK  OK  OK  -
ro                Obo      OK  OK  OK  -
so                Obo      OK  OK  OK  -
swo               Obo      OK  OK  OK  -
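If you only need part of this listing, the table can be filtered with standard shell tools. The following is a minimal sketch, assuming that showdb writes the table to standard output in the layout shown above; dbs_of_type is a hypothetical helper name chosen for this illustration, not an EMBOSS program.

```shell
# Print the names of configured databases whose Type column matches $1.
# Reads a showdb-style table from standard input. The banner line and the
# two header lines (which start with '#') are skipped.
# NOTE: dbs_of_type is a hypothetical helper, not part of EMBOSS.
dbs_of_type() {
    awk -v want="$1" '
        /^#/ || /^Display/ || NF < 2 { next }   # skip banner, headers, blank lines
        $2 == want { print $1 }                 # column 1 = Name, column 2 = Type
    '
}
```

With the function loaded in your shell, something like `showdb | dbs_of_type Obo` should print only the ontology databases, one name per line.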
If you wish to define any additional databases beyond this default list, you should create an emboss.default file, using the file 'emboss.default.template' as your reference (we are going to explain how shortly).
For now let's focus on these default databases defined by the emboss.standard file. They are a good example of how the new EMBOSS 6.5 enables remote data access from a variety of global public servers out of the box (I assume your Internet connection is working, right?). Let's use the EDAM ontology to retrieve data about an identifier. To do that I choose the ontotext EMBOSS application and I type:
ontotext edam_data:0849
The resulting file (0849.ontotext) contains the info which is retrieved from available servers. Let's take a look at the emboss.standard file to see how the edam_data database is defined:
DB edam_data [
type: "obo"
format: "obo"
method: "emboss"
dbalias: "edam"
namespace: "data|identifier"
indexdirectory: "$emboss_standard/index"
directory: "$emboss_standard/data"
field: "id ! identifier without the prefix"
field: "acc ! full name and any alternate identifier(s)"
field: "nam ! words in the name"
field: "isa ! parent identifier from is_a relation(s)"
field: "des ! words in the description"
field: "ns ! namespace"
field: "hasattr ! identifier(s) from has_attribute relation(s)"
field: "hasin ! identifier(s) from has_input relation(s)"
field: "hasout ! identifier(s) from has_output relation(s)"
field: "isid ! identifier(s) from is_identifier_of relation(s)"
field: "isfmt ! identifier(s) from is_format_of relation(s)"
field: "issrc ! identifier(s) from is_source_of relation(s)"
]
...
RES edamresource [
type: "Index"
fields: "id acc nam isa des ns hasattr hasin hasout
isid isfmt issrc"
acclen: "80"
namlen: "32"
deslen: "30"
accpagesize: "8192"
despagesize: "4096"
]
In general, an EMBOSS database definition has two main parts:
- The DB definition part: It defines the name, type, format, access method and various fields of the database record.
- The RES (resource definition) part: It defines the length of the various record fields in the index (note that RES definitions are normally found towards the end of the file).
The DB and RES parts go together for each database definition. In addition, for remote data access methods, a SERVER definition might be necessary to enable access to remote information repositories.
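To get an overview of which definition blocks a configuration file contains, you can list the lines that open a block. This is only a sketch: it assumes each block opens on a single line of the form "KIND name [", as in the excerpts above, and list_blocks is a hypothetical helper name.

```shell
# List the definition blocks of an EMBOSS configuration file ($1) as
# "<KIND> <name>" pairs, for example "DB edam_data" or "RES edamresource".
# Assumption: a block opens with a line of the form: KIND name [
# NOTE: list_blocks is a hypothetical helper, not part of EMBOSS.
list_blocks() {
    awk '
        ($1 == "DB" || $1 == "RES" || $1 == "SERVER") && $3 == "[" {
            print $1, $2
        }
    ' "$1"
}
```

For example, `list_blocks /usr/lsc/emboss/share/EMBOSS/emboss.standard` would print one line per DB, RES and SERVER block, which makes it easier to see which resource definitions are available.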
Step 9: The 'emboss.default' file does not yet exist, so create it in the directory where the emboss.default.template file resides. From now on, you will be editing the emboss.default file to define all aspects of the EMBOSS database configuration. Start with a minimal file like the one below:
#############################################
# EMBOSS environment variables
#############################################
SET emboss_tempdata /usr/lsc/emboss/share/EMBOSS/test
DB martensembl [
method: "biomart"
type: "P"
url: "http://www.biomart.org:80/biomart/martservice"
dbalias: "hsapiens_gene_ensembl"
format: "biomart"
filter: "chromosome_name=13"
sequence: "peptide"
return: "ensembl_gene_id,description,external_gene_id,chromosome_name"
]
As shown here, we have defined the database 'martensembl', which can remotely retrieve entries from the Homo sapiens Ensembl gene repository. Save the file and go back to your shell. You can repeat the 'showdb' command and verify that you can see the newly defined martensembl database. Now, test it by typing:
seqret martensembl:ENST00000380152
The resulting FASTA file should contain the information you requested, retrieved all the way from the remote Biomart server. Congratulations, you have just set up your first remote database access in EMBOSS!
Browsing remote repositories is a good idea, and the EMBOSS team was right to enable this functionality. However, accessing remote datasets does not always work very well if:
- You are somewhere with unreliable or limited-bandwidth Internet access.
- The datasets you need to access involve millions of sequences or gigabytes of information.
How to define a local flatfile database index
What was said in the previous section about the main parts of an EMBOSS database definition in the emboss.standard file also applies to the emboss.default file. As an example, let's index the latest UniProt/Swiss-Prot (sprot) database in three steps:
- Step A: Download and uncompress the latest file into your flatfile index area, a directory with plenty of space to hold both your flatfiles and the indices produced from them. The file is on the EBI FTP server. On the command line, you could run:
wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
followed by:
gunzip uniprot_sprot.dat.gz
- Step B: Update the 'emboss.default' file by adding a database definition, as well as a resource definition, as shown below:
SET emboss_db_dir /storage/tools/embossdbs
DB sprot [
type: P
method: emboss
release: "57.1"
format: swiss
fields: "id acc sv des key org"
directory: $emboss_db_dir/uniprotsprotfiles
file: *.dat
indexdirectory: $emboss_db_dir/uniprotsprotfiles
comment: "UniProtKB/Swiss-Prot Latest Release "
]
RES sprot [
type: Index
idlen: 15
acclen: 15
svlen: 20
keylen: 85
deslen: 75
orglen: 75
]
The SET line at the top is optional: it provides an alias for the directory where you have uncompressed the flatfile and where you are going to produce the index. After that you have the database (DB sprot) definition. It is a protein sequence database (type: P). The fields specification is important, because it lists all the indices that are going to be produced. So, we know that we will be able to search the database by sprot ID (id), accession number (acc), sequence version (sv), descriptive text from the sequence header (des), keyword (key) and taxonomy info (org).
Each of these index fields has a defined length in the associated RES (resource definition) entry. Note that it is important to define both the DB and the RES blocks. If, for example, you forget to define the RES record, the EMBOSS applications will complain with an error message similar to this one until you resolve the issue:
EMBOSS An error in ajnam.c at line 9126:
unknown resource 'sprot'
For now, save the file and do a showdb to verify that you can see the 'sprot' database. If you have omitted or misconfigured any important parts of the definition, the command should complain with informative errors.
- Step C: Produce the index. Go to the directory where you have your uncompressed flatfile (.dat) (in my case, /storage/tools/embossdbs/uniprotsprotfiles) and type the following EMBOSS command:
dbxflat -outfile uniprotsprotout -directory /storage/tools/embossdbs/uniprotsprotfiles -idformat SWISS -filenames '*.dat' -fields id,acc,sv,des,key,org -compressed N -dbname sprot -dbresource sprot -release 2012_07 -date 03/08/12
If all goes well, you should see the following index files in the directory where your flatfile lies:
-rw-r--r--. 1 root root 103 2012-08-03 19:28 sprot.ent
-rw-r--r--. 1 root root 299 2012-08-03 19:36 sprot.pxac
-rw-r--r--. 1 root root 301 2012-08-03 19:36 sprot.pxde
-rw-r--r--. 1 root root 295 2012-08-03 19:36 sprot.pxid
-rw-r--r--. 1 root root 297 2012-08-03 19:36 sprot.pxkw
-rw-r--r--. 1 root root 295 2012-08-03 19:36 sprot.pxsv
-rw-r--r--. 1 root root 299 2012-08-03 19:36 sprot.pxtx
-rw-r--r--. 1 root root 63M 2012-08-03 19:36 sprot.xac
-rw-r--r--. 1 root root 259M 2012-08-03 19:36 sprot.xde
-rw-r--r--. 1 root root 40M 2012-08-03 19:36 sprot.xid
-rw-r--r--. 1 root root 161M 2012-08-03 19:36 sprot.xkw
-rw-r--r--. 1 root root 38M 2012-08-03 19:36 sprot.xsv
-rw-r--r--. 1 root root 264M 2012-08-03 19:36 sprot.xtx
-rw-r--r--. 1 root root 2,5G 2012-08-03 19:26 uniprot_sprot.dat
-rw-r--r--. 1 root root 758 2012-08-03 19:36 uniprotsprotout
and you should be able to test your new database. For instance, to obtain all sequences that have the word influenza in the description index from your current sprot release, you could type:
seqret sprot-des:influenza
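Note that the field list appears twice in the steps above: once in the DB block of Step B (fields: "id acc sv des key org") and once on the dbxflat command line of Step C (-fields id,acc,sv,des,key,org). To keep the two in sync, you could derive the second from the first. This is a minimal sketch, assuming the fields value sits on a single line as in the sprot example; db_fields_csv is a hypothetical helper name.

```shell
# Print the fields of DB block $2 in configuration file $1 as a
# comma-separated list, the form used with dbxflat's -fields option above.
# Assumption: the "fields:" value is on one line, as in the sprot example.
# NOTE: db_fields_csv is a hypothetical helper, not part of EMBOSS.
db_fields_csv() {
    awk -v db="$2" '
        $1 == "DB" && $2 == db { inside = 1; next }   # entered the wanted block
        inside && $1 == "]"    { exit }               # reached the end of the block
        inside && $1 == "fields:" {
            sub(/^[^:]*:[ \t]*/, "")   # drop the "fields:" key
            gsub(/"/, "")              # drop the quotes
            sub(/[ \t]+$/, "")         # drop trailing whitespace
            gsub(/[ \t]+/, ",")        # turn separating spaces into commas
            print
            exit
        }
    ' "$1"
}
```

In Step C you could then pass `-fields "$(db_fields_csv /usr/lsc/emboss/share/EMBOSS/emboss.default sprot)"` instead of typing the list by hand.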
The same procedure can be used for nucleotide databases (type: N). Remember, you have the emboss.default.template file as your guide. I hope you now have a better understanding of how to set up local databases in EMBOSS.