Setup Guide

Introduction

Crawl provides a standardised API, in the form of web services, for displaying genomic data. It can work directly off a Chado database, or off an ElasticSearch cluster, which uses Lucene as a back end and can be indexed from a combination of data sources.

It is currently used by Web-Artemis as an AJAX back end.

Chado

Chado is a standard relational database, typically deployed on a PostgreSQL database. It has a very flexible schema, allowing all sorts of genomic features to be modelled. It is the underlying data store for GeneDB, FlyBase, ParameciumDB, VectorBase, and GNPAnnot.

The Crawl API exposes the contents of these databases via SQL queries.

ElasticSearch

Data can be indexed and stored using ElasticSearch, a distributed search engine that runs on top of Lucene. There are two kinds of connections to ElasticSearch: local and transport. Local connections are useful to get going with, as an ElasticSearch cluster is automatically created and managed in the same process as Crawl; the only configuration usually needed for this is the location of the indices. Transport connections connect to a separately instantiated ElasticSearch cluster, which can be on the same or a remote machine; the configuration needed for this is a host and port.
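
For illustration, the connection-related parts of the two kinds of property file might look like this (a minimal sketch using the property names listed in the table at the end of this guide; the paths, host, port and cluster name are placeholders) :

# local : crawl creates and manages the cluster in-process
resource.type=elasticsearch-local
resource.elasticsearch.local.pathdata=/tmp/crawl-es/data
resource.elasticsearch.local.pathlogs=/tmp/crawl-es/logs

# transport : connect to a separately started cluster
resource.type=elasticsearch-remote
resource.elasticsearch.address.host=localhost
resource.elasticsearch.address.port=9300
resource.elasticsearch.cluster.name=elasticsearch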

The indices can be populated from various data sources: valid GFF3, Chado, and (with support still being worked on) DAS. It might seem odd to index a Chado database when Crawl can run directly off it, but this can be useful when you want to merge annotations from Chado with other sources (such as DAS or GFF3).

The Crawl API exposes the contents of these indices via ElasticSearch (Lucene) queries.

Next-Gen sequencing data

Crawl can retrieve segments from Next-Generation sequencing alignments stored in SAM/BAM files, for overlaying on top of genomic features. It can also do the same for VCF/BCF files, for overlaying variation data.

Setup, build and test

This is all done with Gradle. You don't need to install it, though, because an executable wrapper ('gradlew') is provided in the root folder of the checkout.

The Essential build

The initial build might take a while because it will download dependencies, but you should only ever have to run it all once. After that, any updates to dependencies should be incremental. To set up the build, run :

./gradlew build

in the crawl checkout root folder. This can take 5-10 minutes depending on your internet connection.

Also, because the build step involves downloading dependencies, if you're behind a proxy you may initially have to supply proxyHost and proxyPort Java settings, e.g. :

./gradlew build -Dhttp.proxyHost=wwwcache.sanger.ac.uk -Dhttp.proxyPort=3128

This will perform all the build steps, including dependency download and running the test harness. You shouldn't worry about configurations for the test harness, because the default configurations should just work out of the box. The reason for that will be explained later, but the gist of it is that it connects to the GeneDB public snapshot for its database tests, and creates a local ElasticSearch cluster for its indexing tests.

If it's all worked fine, you can safely skip the next section and move on to indexing.

Optional build steps

Fetching dependencies from the GeneDB repository

Because of the unreliability of some external repositories, the dependencies are also hosted on the genedb developer site. If you find the resolve step fails, you can enable downloads from this site by adding the genedb developer https certificate. On a Mac (in bash), this certificate can be downloaded like this :

openssl s_client -connect developer.genedb.org:443 2>&1 | \
    sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' \
    > /tmp/developer.genedb.org

And imported as follows :

sudo keytool -import -alias developer.genedb.org \
    -keystore /Library/Java/Home/lib/security/cacerts \
    -file /tmp/developer.genedb.org 

The keytool password is 'changeit'.

Finally, you will have to uncomment the repositories in the build.gradle file whose url starts with :

https://developer.genedb.org/nexus/content/repositories/

Customizing eclipse

If you want to be able to browse and edit the code inside eclipse, you can run :

./gradlew eclipse

which will configure eclipse to be aware of crawl's dependencies.

If you have any classpath issues, try running

./gradlew cleanEclipse build eclipse -x test

Unit testing

Crawl is unit tested every time it's built. However, sometimes you want to run only a single test case :

./gradlew -Dtest.single=ClientTest clean test

and sometimes you don't want to run the tests at all :

./gradlew build -x test

Installation

If you need to be able to run (or, more likely, have your users run) the crawl.jar somewhere outside the project checkout, then you can run :

./gradlew install -Pdir=/usr/local/crawl

The folder specified here will be populated with a bin folder (containing scripts) and a lib folder (containing jars).

Please note that property files will not be copied over. If you're only ever going to run things as yourself, this step may be entirely superfluous.
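
For example, a user could copy the property file they need next to the installed scripts and run the installed script directly (a sketch only, assuming the crawl script ends up in the bin folder of the install directory used above; the GFF path is a placeholder) :

cp resource-elasticsearch-local.properties /usr/local/crawl/
/usr/local/crawl/bin/crawl gff2es \
    -pe /usr/local/crawl/resource-elasticsearch-local.properties \
    -g /path/to/annotation.gff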

Indexing

Indexing is done using shell scripts. If you are using a stand-alone ElasticSearch cluster, it should be started before running these steps. The examples here use a local cluster, which is instantiated by crawl itself. All examples here use bash.

Creating an organism manually

An organism is created by supplying its JSON using the -o option. Here's an example adding Plasmodium falciparum :

./crawl org2es -pe resource-elasticsearch-local.properties -o '{
    "ID":27,
    "common_name":"Pfalciparum",
    "genus":"Plasmodium",
    "species":"falciparum",
    "translation_table":11,
    "taxonID":5833
}'

Here's an example adding Trypanosoma brucei 927 :

./crawl org2es -pe resource-elasticsearch-local.properties -o '{
    "ID":19,
    "common_name":"Tbruceibrucei927",
    "genus":"Trypanosoma",
    "species":"brucei brucei, strain 927",
    "translation_table":11,
    "taxonID":185431
}'

In this case, the ID field is taken from the GeneDB public repository, but you can choose whatever convention you like.

Note : any JSON can be passed either as a path to a file or as a string on the command line.
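
For example, the Plasmodium falciparum organism above could be written to a file and passed by path instead (the file name here is arbitrary) :

cat > Pfalciparum.json <<'EOF'
{
    "ID":27,
    "common_name":"Pfalciparum",
    "genus":"Plasmodium",
    "species":"falciparum",
    "translation_table":11,
    "taxonID":5833
}
EOF

./crawl org2es -pe resource-elasticsearch-local.properties -o Pfalciparum.json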

Indexing GFF files

GFF files do not contain information about the organism, so this must be supplied. Here is an example of a small Plasmodium falciparum chromosome 1 :

./crawl gff2es -g src/test/resources/data/Pf3D7_01.gff.gz  \
    -o '{"common_name":"Pfalciparum"}' \
    -pe resource-elasticsearch-local.properties 

Here is an example with a larger Trypanosoma brucei chromosome 11 :

./crawl gff2es -g src/test/resources/data/Tb927_11_01_v4.gff.gz  \
    -o '{"common_name":"Tbruceibrucei927"}' \
    -pe resource-elasticsearch-local.properties 

You only need to supply a unique identifier for the organism at this stage, which is either its ID or common name.

The -g flag (gffs) is used to specify a path to a GFF file or a folder containing them. If they are gzipped, the files will automatically be unzipped by the parser.
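
For example, the Plasmodium falciparum run above could point at a folder of (possibly gzipped) GFF files and identify the organism by its ID rather than by its common name (a sketch; the folder path is a placeholder) :

./crawl gff2es -g /path/to/gff_folder \
    -o '{"ID":27}' \
    -pe resource-elasticsearch-local.properties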

Here is an example of a Bacterium :

./crawl gff2es -pe resource-elasticsearch-local.properties -o '{
    "ID":999,
    "common_name":"Spneumoniae",
    "genus":"Streptococcus",
    "species":"pneumoniae",
    "translation_table":1,
    "taxonID":1313
}' -g src/test/resources/data/Spn23f.gff

Please note how the organism can be created at the same time.

Indexing both GFF3 & organisms together from reference blocks specified in a JSON

The above example of creating an organism with one command-line parameter and supplying its annotation file(s) with another is fine if your repository has only a few organisms. However, if you have many, and/or are adding them all the time, it might be easier to generate a single JSON describing all these GFF/organism combinations.

We call this combination of organism + annotation a reference sequence, since it's likely that alignments and variants from sub-strains will be plotted against this.

A references block can be provided using the -r parameter. It is a collection of one or more organism+annotation entries, and can be included in the same JSON file along with the alignments and variant blocks (see below).

There are several ways to then tell crawl which references to index. The references file can be referred to using the "alignments" property in the property file :

./crawl ref2es -pe resource-elasticsearch-vrtrack-gv1.properties
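
For reference, the relevant entry in such a property file would simply point at a JSON file containing the references block, along the lines of the following sketch (the path is illustrative, not the actual contents of resource-elasticsearch-vrtrack-gv1.properties) :

alignments=src/test/resources/alignments-vrtrack.json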

Alternatively, the file can be referred to on the command line :

./crawl ref2es -pe resource-elasticsearch-vrtrack-gv1.properties \
    -r src/test/resources/alignments-vrtrack.json

So in the above example, the "alignments" setting in the property file is ignored for the purposes of identifying the references, and the file passed using the -r option is used instead.

Finally, the references block can be supplied as a string :

./crawl ref2es -pe resource-elasticsearch-vrtrack-gv1.properties -r '{
    "references": [{
        "file":"src/test/resources/data/Spn23f_bodged.gff",
        "organism": {
            "ID":"999",
            "common_name":"Streptococcus_pneumoniae_23F_FM211187",
            "genus":"Streptococcus",
            "species":"pneumoniae_23F_FM211187",
            "translation_table":"1",
            "taxonID":"1313"
        }
    }]
}'

So to summarise: if you don't specify a -r parameter, crawl will look inside the property file for an "alignments" property and try to find a "references" block in there. If, on the other hand, you do supply a -r parameter, it can be either a path to a JSON file or a JSON string, and it should contain a "references" block as well.

Ontologies and controlled vocabularies

Crawl can query controlled vocabularies and ontology files in OBO format. To index the gene ontology, you can do :


./crawl cv2es -cv src/test/resources/cv/gene_ontology_ext.obo \
    -pe resource-elasticsearch-local.properties \
    -vn go -ns biological_process -ns molecular_function -ns cellular_component

Indexing DAS sources

For this example, we are going to overlay features available from an external DAS source onto the reference sequence and annotations provided in a GFF file.

./crawl gff2es -pe resource-elasticsearch-local.properties -o '{
    "ID":29,
    "common_name":"Pberghei",
    "genus":"Plasmodium",
    "species":"berghei",
    "translation_table":11,
    "taxonID":5821
}' -g src/test/resources/data/berg01.gff.gz

./crawl das2es -pe resource-elasticsearch-local.properties \
    -u http://das.sanger.ac.uk/das \
    -s pbg \
    -r berg01 \
    -o '{"common_name":"Pberghei"}' 

In this example the features overlaid are of type 'clone_genomic_insert'.

The first step should be familiar by now: it indexes GFF files as before, with the organism creation/update step done in the same run. The next step uses the das2es script to query a remote DAS server and populate the indices.

Please note that DAS support is currently quite minimal: only features, their locations (on segments/regions) and their types are indexed.

DAS capability requirements

DAS allows quite a bit of flexibility - a DAS source does not have to implement the full specification. Crawl, however, has a few minimum requirements for it to be able to find what it's looking for :

Indexing Chado

Crawl can serve up information from Chado directly, but there are situations when one might want to merge data that is in Chado with other kinds of data sources. Also, certain searches are likely to benefit from a Lucene rather than a database approach.

Indexing options

As indexing from Chado to ElasticSearch requires both kinds of connection details, you will need to specify two property files.

There are a few options applicable to all Chado indexing strategies :

There is a default types list, which currently is :

["gene", "pseudogene", "match_part", "repeat_region", "repeat_unit", "direct_repeat", 
"EST_match", "region", "polypeptide", "mRNA", "pseudogenic_transcript", "nucleotide_match", 
"exon", "pseudogenic_exon", "gap", "contig", "ncRNA", "tRNA", 
"five_prime_UTR", "three_prime_UTR", "polypeptide_motif"]

To exclude a list of types means that Crawl will ignore those types when indexing. To include a list of types means the opposite: Crawl will ignore all other types. As the default is exclude=false, Crawl will by default index only the feature types in the types list and ignore all others.

Indexing by organism

As with all examples of copying data from Chado, two properties files are specified: one for Chado connection details, and the other for ElasticSearch. This example connects to the GeneDB public Chado snapshot. To just copy one organism entry (without its features), you can do :

./crawl chado2es -pc resource-chado-public.properties \
    -pe resource-elasticsearch-local.properties \
    -o Tbruceibrucei927

To copy all of them, omit the -o organism option :

./crawl chado2es -pc resource-chado-public.properties \
    -pe resource-elasticsearch-local.properties 

Use the -f option to index an organism as well as its features :

./crawl chado2es -pc resource-chado-public.properties \
    -pe resource-elasticsearch-local.properties \
    -t '["gene", "pseudogene", "mRNA", "exon", "polypeptide"]' \
    -o Tbruceibrucei927 -f
    

Please note that for scalability reasons the -f option won't work if you try to index all the organisms at once. Also, the -t option is used here to selectively index only certain types.

Indexing regions from Chado

Use the -r option to specify a region :

./crawl chado2es -pc resource-chado-public.properties \
    -pe resource-elasticsearch-local.properties \
    -r Pf3D7_01

As with organisms, you have to specify the -f option to index the region's features :

./crawl chado2es -pc resource-chado-public.properties \
    -pe resource-elasticsearch-local.properties \
    -r Pf3D7_01 -f 

Incremental indexing

This would typically be run periodically, e.g. in a cron job, after an initial bulk index has been performed. Incremental indexing relies on the timelastmodified stamp. Using the -s option, you can index everything that has changed, as follows :

./crawl chado2es -pc resource-chado-public.properties \
    -pe resource-elasticsearch-local.properties -s 

You can filter on organism:

./crawl chado2es -pc resource-chado-public.properties \
    -pe resource-elasticsearch-local.properties -o Lmajor -s  

If the audit schema exists in this Chado database, then an attempt will be made to remove deleted features from the index.
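
A crontab entry for such a periodic run might look like this (a sketch only; the install path, log file and schedule are placeholders, assuming crawl has been installed as described in the Installation section) :

# incremental Chado-to-ElasticSearch index, every night at 2am
0 2 * * * /usr/local/crawl/bin/crawl chado2es -pc /usr/local/crawl/resource-chado-public.properties -pe /usr/local/crawl/resource-elasticsearch-local.properties -s >> /var/log/crawl-incremental.log 2>&1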

Configuration for Next-Gen sequencing data

External alignment files are configured using a JSON file, which contains information about SAM/BAM alignments, VCF/BCF files, and optionally reference sequence alias names (explained below).

Pointing to the alignments is fairly straightforward - an example configuration file is provided in etc/alignments.json. All the files listed there are available on GeneDB, so this file should work as an example to get you going without having to configure your own.

Both the alignment file and index properties can be specified using http urls (as in the example files) or unix file paths.

This file can get quite long for large BAM repositories, and in that case you are not expected to craft it manually. The intention is that it would be auto-generated from a tracking database, a LIMS, or a filesystem folder hierarchy. For that reason it is a JSON file, to make it easy to generate.
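
As a rough illustration of such auto-generation, the bash sketch below builds a minimal alignments block from a folder of BAM files; the directory, the organism name, and the assumption that each BAM has a sibling .bam.bai index are all hypothetical :

#!/bin/bash
# Hypothetical generator : emit an alignments.json entry for every BAM in a folder.
BAMDIR=/data/bams          # placeholder folder of BAM files
ORGANISM=Pfalciparum       # placeholder organism common_name

echo '{'
echo '  "alignments" : ['
first=true
for bam in "$BAMDIR"/*.bam; do
    $first || echo '    ,'
    first=false
    printf '    { "file" : "%s", "index" : "%s.bai", "organism" : "%s" }\n' \
        "$bam" "$bam" "$ORGANISM"
done
echo '  ]'
echo '}'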

Aliasing sequences

Sometimes the FASTA sequence names in the alignment files do not correspond to the sequence names in the annotation repository. The alignments.json file therefore maps the sequence names present in the BAMs to the chromosomes/contigs in the repository (see the sequences array in the example below).

Alignments file format

Below is an example of an alignments file. It is a hash of three arrays: sequences, variants and alignments. The sequences array defines any sequence aliases (as described above). The alignments and variants arrays are used to define SAM/BAM and VCF/BCF next gen sequence data files.

{
    "sequences" : [
        {
            "alignment" : "MAL1",
            "reference" : "Pf3D7_01"
        }
    ],
    "variants" : [
        {
            "file" : "http://www.genedb.org/artemis/NAR/Spneumoniae/4882_6_10_variant.bcf",
            "organism" : "Spneumoniae"
        }
    ],
    "alignments" : [
        {
            "file" : "http://www.genedb.org/workshops/Lisbon2010/data/Malaria_RNASeq/MAL_0h.bam",
            "index" : "http://www.genedb.org/workshops/Lisbon2010/data/Malaria_RNASeq/MAL_0h.bam.bai",
            "organism" : "Pfalciparum"
        }
    ]
}

Deploying

Packaging the war

Here you must specify a config property file. Several examples are bundled in the top-level folder. For example :

./gradlew -Pconfig=resource-elasticsearch-local.properties war

Packaging the war and including Web-Artemis

This requires git on your system.

./gradlew -Pconfig=resource-elasticsearch-local.properties -PpullWebArtemis=true war

Extra Web-Artemis cloning parameters

If you have your own checkout and branch of Web-Artemis, or if you want to change the default chromosome, these extra parameters can help :


./gradlew -Pconfig=resource-elasticsearch-local.properties -PpullWebArtemis=true \
    -PwebArtemisGitUri=/Users/gv1/git/Web-Artemis/ \
    -PwebArtemisGitBranch=extra_excludes \
    -PwebArtemisInitialChromosome=FM211187  \
    jettyRunWar 

Where webArtemisGitUri is the uri to the checkout, webArtemisGitBranch is the branch you want to clone, and webArtemisInitialChromosome lets you set the starting chromosome.

Testing the war using Jetty-runner

This can be used to run crawl without having to configure a servlet container. This is not suitable for production use; only use it for testing.

./gradlew -Pconfig=resource-elasticsearch-local.properties -PpullWebArtemis=true jettyRunWar

Now point your browser to http://localhost:8080/services/index.html.

Deploying the war to a J2EE container

For production, or even beta test sites, it should be deployed to a container as follows.

./gradlew -Pconfig=resource-elasticsearch-local.properties deploy

Cleaning up

Cleaning up builds

Run this whenever method signatures change and you want to spot external references to them :

./gradlew clean

Cleaning up local ElasticSearch data

This task is only useful for non-transport (i.e. local) clusters, and requires the correct config :

./gradlew -Pconfig=resource-elasticsearch-local.properties cleanes

Property files

The purpose of these is to specify configuration settings used by crawl. Sometimes only one is used, as when building a war; sometimes two are used, as when indexing from one data source to another. The following table describes what they are for :

Property | Description | Configurations
resource.type | Currently either chado-postgres, elasticsearch-local or elasticsearch-remote | all
deploy.dir | The Tomcat webapps folder to which any war built using this file is deployed | all
deploy.name | What the war will be called, which will be reflected in the URL path | all
alignments | The path to an alignments.json file. This is where BAM alignments are specified. An example alignments.json file is included in the etc folder | all
showParameters | Whether or not to always return request parameters in the response | all
dbhost | The Chado database host | chado-postgres
dbport | The Chado database port | chado-postgres
dbname | The Chado database name | chado-postgres
dbuser | The Chado database user | chado-postgres
dbpassword | The Chado database password | chado-postgres
resource.elasticsearch.index | The name of the index | all ElasticSearch configurations
resource.elasticsearch.regionType | The name of the type representing regions | all ElasticSearch configurations
resource.elasticsearch.featureType | The name of the type representing features | all ElasticSearch configurations
resource.elasticsearch.organismType | The name of the type representing organisms | all ElasticSearch configurations
resource.elasticsearch.address.host | The ElasticSearch host | elasticsearch-remote
resource.elasticsearch.address.port | The ElasticSearch port | elasticsearch-remote
resource.elasticsearch.cluster.name | The name of the cluster you wish to connect to | elasticsearch-remote
resource.elasticsearch.local.pathlogs | The ElasticSearch log path | elasticsearch-local
resource.elasticsearch.local.pathdata | The ElasticSearch data path | elasticsearch-local

Not all of these parameters are used in every property file; the third column specifies in what context they are applicable.