picardmetrics - Run Picard tools and collate multiple metrics files
picardmetrics run [-f FILE] [-o DIR] [-r] [-k] <file.bam>
picardmetrics collate PREFIX DIR
picardmetrics refFlat <file.gtf[.gz]>
picardmetrics rRNA <file.gtf[.gz]>
# Example with provided data:
picardmetrics run -r -o out data/project1/sample1/sample1.bam
picardmetrics run -r -o out data/project1/sample2/sample2.bam
picardmetrics collate out/project1 out
Picardmetrics is a Bash script that simplifies calling Picard tools and collates the different output files generated by Picard. It also has functions for generating the two input files required by CollectRnaSeqMetrics.
In order, picardmetrics run will do the following:
After running the tools, use picardmetrics collate to merge all of the generated metrics from multiple BAM files into tab-delimited files. Additionally, all of these tab-delimited files are consolidated into a single file with all metrics from all BAM files and all Picard tools.
These are the tools called by picardmetrics:
Read about the meaning of each metric: Picard metrics definitions.
picardmetrics run [-f FILE] [-o DIR] [-r] [-k] <file.bam>
-f FILE
The configuration file with specified variables. In order of preference, picardmetrics will use (1) the specified file, (2) a file named picardmetrics.conf in the current directory, or (3) a file named picardmetrics.conf in the user's HOME directory. If no file is found, picardmetrics reports an error and exits.
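The lookup order above can be sketched as a small Bash function. This is an illustration, not the script's actual code: find_config is a hypothetical helper name, and the sketch assumes that a missing -f file falls through to the next candidate, which the text does not state.

```shell
# Sketch of the documented lookup order; find_config is a hypothetical name.
# Assumption: if the file given with -f does not exist, the next candidate is tried.
find_config() {
  local explicit="${1:-}"
  local candidate
  for candidate in "$explicit" "./picardmetrics.conf" "$HOME/picardmetrics.conf"; do
    # Skip the empty slot left when no -f file was given.
    if [ -n "$candidate" ] && [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "ERROR: no picardmetrics.conf found" >&2
  return 1
}
```

Calling find_config with no argument prints the first configuration file found, or returns a non-zero status if none exists.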
-o DIR
Write up to 21 output files in this directory (20 without option -k). By default, write to the current directory. The output files include:
(with option -k) 1 BAM file sorted by coordinate and with duplicates marked
6 log files for the Picard tools
4 PDF files with plots generated by Picard tools
10 text files with metrics and histograms
-r
The BAM file is from an RNA-seq experiment. By default, this is not true. When this option is used, the CollectRnaSeqMetrics Picard tool is run. This option requires the variables REF_FLAT and RIBOSOMAL_INTERVALS to be set. You should specify them in the configuration file.
-k
Keep the output BAM file. By default, delete it after metrics are created.
picardmetrics collate PREFIX DIR
Collate output metrics files in DIR into one file with all metrics from all Picard tools and all BAM files:
PREFIX-all-metrics.tsv
Also write 5 collated histogram files:
PREFIX-base-distribution-by-cycle-histogram.tsv
PREFIX-gc-bias-histogram.tsv
PREFIX-insert-size-histogram.tsv
PREFIX-library-complexity-histogram.tsv
PREFIX-quality-histogram.tsv
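To confirm that a collate run produced everything listed above, the six file names can be generated from the prefix and checked. This is a minimal sketch; collate_outputs and check_collate_outputs are hypothetical helpers and are not part of picardmetrics.

```shell
# Print the six files that `picardmetrics collate PREFIX DIR` is documented to write.
# collate_outputs is a hypothetical helper name.
collate_outputs() {
  local prefix="$1"
  local suffix
  for suffix in \
    all-metrics \
    base-distribution-by-cycle-histogram \
    gc-bias-histogram \
    insert-size-histogram \
    library-complexity-histogram \
    quality-histogram
  do
    printf '%s-%s.tsv\n' "$prefix" "$suffix"
  done
}

# Report each expected file that is missing; return non-zero if any is absent.
check_collate_outputs() {
  local missing=0
  local f
  while IFS= read -r f; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done <<EOF
$(collate_outputs "$1")
EOF
  return "$missing"
}
```

For the example data, check_collate_outputs out/project1 would be run after picardmetrics collate out/project1 out.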
picardmetrics refFlat <file.gtf[.gz]>
Create <file.refFlat> for the REF_FLAT argument of the CollectRnaSeqMetrics tool. Run this command on your optionally gzipped GTF file, and the output file will be written to the same directory as the GTF file.
picardmetrics run will automatically create the .refFlat file for you if you define the GTF variable in the configuration file.
picardmetrics rRNA <file.gtf[.gz]>
Create <file.rRNA.list> for the RIBOSOMAL_INTERVALS argument of the CollectRnaSeqMetrics tool. Run this command on your optionally gzipped GTF file, and the output file will be written to the same directory as the GTF file.
picardmetrics run will automatically create the .rRNA.list file for you if you define the GTF variable in the configuration file.
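The <file.refFlat> and <file.rRNA.list> notation suggests that each output name is the GTF name with its .gtf or .gtf.gz suffix replaced. Assuming that reading is correct (it is not confirmed here), the expected paths can be derived as follows, for example to fill in REF_FLAT and RIBOSOMAL_INTERVALS by hand. gtf_output_name is a hypothetical helper and the paths are placeholders.

```shell
# Hypothetical helper: strip an optional .gz, then .gtf, and append a new extension.
# Assumption: picardmetrics names its outputs this way; verify against your own files.
gtf_output_name() {
  local gtf="$1"
  local ext="$2"
  local base="${gtf%.gz}"
  base="${base%.gtf}"
  printf '%s.%s\n' "$base" "$ext"
}

gtf_output_name /path/to/genes.gtf.gz refFlat   # prints /path/to/genes.refFlat
gtf_output_name /path/to/genes.gtf rRNA.list    # prints /path/to/genes.rRNA.list
```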
The picardmetrics.conf file must define the following variables:
TEMP_DIR
A directory where temporary files will be written.
The sequence dictionary is taken from the BAM header and written to a .list file in this directory. This file is used as the header of the RIBOSOMAL_INTERVALS file passed to CollectRnaSeqMetrics.
A copy of the input BAM file is written to the temporary directory. Then ReorderSam, SortSam, and MarkDuplicates are run on this copy. By default, the copy is deleted after picardmetrics is done. Use option -k to move the sorted, duplicate-marked BAM file to the output directory (option -o) instead.
NICENESS
A number between 0 and 20 specifying the niceness to use for all jobs. Use a number greater than 0 to avoid interrupting interactive jobs such as vim or emacs.
PICARD_JAR
The full path to a downloaded picard.jar file. Get the file here: https://broadinstitute.github.io/picard/index.html
PICARD
Your preferred way to invoke Java to call Picard. For example:
PICARD="java -Xms5g -Xmx5g -jar $PICARD_JAR"
REFERENCE_SEQUENCE
The full path to the organism's genome sequence in FASTA format. Required for: CollectMultipleMetrics, CollectRnaSeqMetrics, CollectGcBiasMetrics.
GTF
Full path to a .gtf or .gtf.gz file with gene annotations. picardmetrics will use this to automatically create a .refFlat file and .rRNA.list file.
REF_FLAT (overrides GTF)
Full path to a text file with annotations of all gene features in UCSC format. Can be generated from a GFF or GTF file. Required for: CollectRnaSeqMetrics.
RIBOSOMAL_INTERVALS (overrides GTF)
Full path to a text file with genomic coordinates of all ribosomal RNA genes in Picard format. Required for CollectRnaSeqMetrics.
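Putting the variables above together, a configuration file might look like the sketch below. It assumes the file is read as Bash variable assignments, which the PICARD example (it expands $PICARD_JAR) suggests. Every path is a placeholder and the NICENESS value is arbitrary. It also assumes that REF_FLAT and RIBOSOMAL_INTERVALS may be left unset when GTF is defined, since they are described as overriding GTF.

```shell
# picardmetrics.conf (example values only; every path below is a placeholder)
TEMP_DIR="/tmp/picardmetrics"
NICENESS=10
PICARD_JAR="$HOME/src/picard/picard.jar"
PICARD="java -Xms5g -Xmx5g -jar $PICARD_JAR"
REFERENCE_SEQUENCE="/path/to/genome.fa"
GTF="/path/to/genes.gtf.gz"
# Assumption: these two are optional when GTF is set. Uncomment them to use
# existing files instead of the ones generated from the GTF file.
# REF_FLAT="/path/to/genes.refFlat"
# RIBOSOMAL_INTERVALS="/path/to/genes.rRNA.list"
```

PICARD_JAR must be assigned before PICARD, because PICARD expands it at the point of assignment.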
Here are three examples of how you can run the program:
Run picardmetrics sequentially in a for loop on multiple BAM files.
Run in parallel with GNU parallel, using multiple processors or multiple servers.
Run in parallel with an LSF queue, distributing jobs to multiple servers.
Run picardmetrics on the provided example BAM files:
for f in data/project1/sample?/sample?.bam; do
  picardmetrics run -r -o out "$f"
done
Collate the generated metrics files:
picardmetrics collate out/project1 out
Next, use the file out/project1-all-metrics.tsv to explore the metrics.
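A first step in exploring the collated file is to see which columns it contains. The sketch below prints each column name from the header line with its position. list_columns is a hypothetical helper, and it assumes only that the file is tab-delimited with a header in its first line.

```shell
# Hypothetical helper: print "index<TAB>name" for each column in the header line
# of a tab-delimited file.
list_columns() {
  awk -F '\t' 'NR == 1 { for (i = 1; i <= NF; i++) printf "%d\t%s\n", i, $i; exit }' "$1"
}
```

For the example data, run list_columns out/project1-all-metrics.tsv and then select columns of interest with cut -f.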
Run 2 jobs in parallel:
parallel -j2 \
  picardmetrics run -o /path/to/out -r {} ::: data/project1/sample?/sample?.bam
If you have many files, or if you want to run jobs on multiple servers, it's a good idea to put the full paths in a text file.
Here, we have ssh access to server1 and server2. We're launching 16 jobs on server1 and 8 jobs on server2. You'll have to make sure that picardmetrics is in your PATH on all servers.
ls /full/path/to/data/project1/sample*/sample*.bam > bams.txt
parallel -S 16/server1,8/server2 \
  picardmetrics run -r -o /path/to/out {} :::: bams.txt
I recommend you install and use asub (see below) to submit jobs easily. This command will submit a job for each BAM file to the myqueue LSF queue.
cat bams.txt | xargs -I {} echo picardmetrics run -r -o /path/to/out {} \
  | asub -j picardmetrics_jobs -q myqueue
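Before submitting, it can help to preview the exact commands that will be queued. The sketch below builds one command line per BAM path in a list file. make_commands is a hypothetical helper, and it assumes the paths contain no whitespace.

```shell
# Hypothetical helper: print one picardmetrics command per BAM path in a list file.
# Assumption: paths contain no spaces, since the output is not shell-quoted.
make_commands() {
  local list="$1"
  local outdir="$2"
  local bam
  while IFS= read -r bam; do
    # Skip blank lines.
    [ -n "$bam" ] || continue
    printf 'picardmetrics run -r -o %s %s\n' "$outdir" "$bam"
  done < "$list"
}
```

Running make_commands bams.txt /path/to/out prints the job list for inspection. The same output can then be piped to asub -j picardmetrics_jobs -q myqueue.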
Find the source code here:
https://github.com/slowkow/picardmetrics
Please report issues here:
https://github.com/slowkow/picardmetrics/issues
Kamil Slowikowski from Harvard University wrote picardmetrics. Many developers at the Broad Institute wrote Picard. Heng Li from the Sanger Institute wrote samtools. Aaron Quinlan from the University of Utah wrote stats.
Picard
samtools
stats
GNU parallel
LSF
asub