Using a Reference Database
There are several ways to classify reference genomes using a reference database. The classify_wf command automates several steps, including downloading a reference database (if one is not found in the default directory), creating a new project, and controlling the daemon. Depending on the number of genomes and whether a reference database is already available, this method can take a significant amount of time, and the terminal is not usable while the process is running. It is therefore best to download the database ahead of time: in a tmux session if using MiGA on AWS, or as a submitted job if using a cluster. Note also that several steps typically included in processing genomes are skipped when using classify_wf.
Alternatively, a reference database can be attached to a project when (or after) creating it and the taxonomy step will be included when the project is run (or re-run). In either of these cases, it is necessary to first obtain the reference database.
Download a Reference Database
To save time, this tutorial uses the smaller database Phyla_Lite. If you have not already installed it, see the instructions in the section Add Reference Databases from the Command Line.
Add classification to an existing project
To classify the reference genomes in a previously created project, use the edit command with the -m flag to link the project to a reference database, and then start the daemon. For example, to classify the reference genomes in the pseudo project, enter the commands below (with MiGA running, of course!).
This exercise takes approximately 90 minutes to run interactively. The same exercise is included under Submitting MiGA Jobs if you would rather run it that way.
Monitor progress in the usual way:
If you started the daemon with the --shutdown_when_done flag, you can also check that the project is finished by listing the files in the project daemon directory:
If a file ending with pid is present, the daemon is still running.
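The pid-file check described above can also be scripted. The following is a minimal sketch, not part of MiGA itself: daemon_running is a hypothetical helper name, and the directory to inspect is passed in by the caller because the exact location of the project daemon directory depends on your setup.

```shell
# Hypothetical helper (not a MiGA command): succeeds if a file whose name
# ends in "pid" exists in the given directory, i.e. the daemon appears to
# still be running according to the rule described in this tutorial.
# Usage: daemon_running DAEMON_DIR
daemon_running() {
  dir=$1
  for f in "$dir"/*pid; do
    # When the glob matches nothing, "$f" is the literal pattern and
    # the existence test fails, so the loop falls through.
    [ -e "$f" ] && return 0
  done
  return 1
}
```

For example, after setting DAEMON_DIR to your project's daemon directory, running daemon_running "$DAEMON_DIR" && echo "still running" prints a message only while a pid file is present.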
After the project finishes, stop the daemon and view the results:
This should return something like:
P. aeruginosa was not classified because it was submitted as a query genome. Only reference genomes are classified by this method, and only down to the lowest taxonomic rank for which the p-value is significant. Results for individual reference genomes can be examined for more detailed information:
Include classification in a new project
Create a new project and add reference genomes to it. Then, BEFORE starting the daemon, edit the project to include a link to the reference database. The remaining steps are as above:
Monitor progress in the usual way:
If you used the --shutdown_when_done flag, you can also check whether the project is done by listing the contents of the daemon directory:
If a file ending in pid is present, the daemon is still running.
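Rather than listing the directory by hand, you may prefer to wait until the pid file disappears. The sketch below polls a directory at a fixed interval. It is an illustration only: wait_for_daemon is a hypothetical name, the directory is supplied by the caller, and the 60-second default interval is an arbitrary choice.

```shell
# Hypothetical helper (not a MiGA command): block until no file ending in
# "pid" remains in the given directory, checking every INTERVAL seconds.
# Usage: wait_for_daemon DAEMON_DIR [INTERVAL]
wait_for_daemon() {
  dir=$1
  interval=${2:-60}   # assumed default; adjust to taste
  while :; do
    found=0
    for f in "$dir"/*pid; do
      # An unmatched glob stays literal, so -e is false and found stays 0.
      if [ -e "$f" ]; then found=1; break; fi
    done
    [ "$found" -eq 0 ] && break
    sleep "$interval"
  done
  echo "No pid file in $dir"
}
```

Because the function blocks the terminal while it waits, it is most useful inside a tmux session or a submitted job.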
After the job finishes, view the results with:
The results are the same as for the existing project above.