picardmetrics(1) - Run Picard tools and collate multiple metrics files

NAME

picardmetrics - Run Picard tools and collate multiple metrics files

SYNOPSIS

picardmetrics run [-f FILE] [-o DIR] [-r] [-k] <file.bam>

picardmetrics collate PREFIX DIR

picardmetrics refFlat <file.gtf[.gz]>

picardmetrics rRNA <file.gtf[.gz]>

# Example with provided data:

picardmetrics run -r -o out data/project1/sample1/sample1.bam

picardmetrics run -r -o out data/project1/sample2/sample2.bam

picardmetrics collate out/project1 out

DESCRIPTION

Picardmetrics is a Bash script that simplifies calling Picard tools and collates the different output files generated by Picard. It also has functions for generating the two input files required by CollectRnaSeqMetrics.

In order, picardmetrics run will do the following:

Automatically create a sequence dictionary using your reference sequence.
Create a new temporary BAM file that you can keep with option -k.
Reorder the header of the BAM file to match the reference.
Sort the reads in the BAM file by coordinate.
Mark duplicates in the BAM file and report duplicate metrics.
Run up to 8 additional Picard tools.

After running the tools, use picardmetrics collate to merge all of the generated metrics from multiple BAM files into tab-delimited files. Additionally, all of these tab-delimited files are consolidated into a single file with all metrics from all BAM files and all Picard tools.

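The tools called by picardmetrics include ReorderSam, SortSam, MarkDuplicates, CollectMultipleMetrics, CollectRnaSeqMetrics, and CollectGcBiasMetrics.

Read about the meaning of each metric in the Picard metrics definitions.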

COMMANDS AND OPTIONS

run

picardmetrics run [-f FILE] [-o DIR] [-r] [-k] <file.bam>

-f FILE

The configuration file that defines the required variables. In order of preference, picardmetrics will use (1) the specified file, (2) a file named picardmetrics.conf in the current directory, or (3) a file named picardmetrics.conf in the user's HOME directory. If no file is found, picardmetrics reports an error and exits.
-o DIR

Write up to 21 output files in this directory. By default, write to the current directory. The output files include:

(with option -k) 1 BAM file sorted by coordinate and with duplicates marked
6 log files for the Picard tools
4 PDF files with plots generated by Picard tools
10 text files with metrics and histograms
-r

The BAM file is from an RNA-seq experiment. By default, this is not true. When this option is used, the CollectRnaSeqMetrics Picard tool is run. This option requires the variables REF_FLAT and RIBOSOMAL_INTERVALS to be set. You should specify them in the configuration file.
-k

Keep the output BAM file. By default, delete it after metrics are created.
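
The lookup order described under -f can be sketched in Bash as follows. This is an illustration only, not the script's actual code: the function name find_conf is hypothetical, and the sketch assumes that a missing -f file falls through to the next candidate rather than failing immediately.

```shell
# Hypothetical helper: print the configuration file that would be chosen,
# following the documented order of preference.
find_conf() {
  local explicit="${1:-}"   # value given with -f, may be empty
  local candidate
  for candidate in "$explicit" "./picardmetrics.conf" "$HOME/picardmetrics.conf"; do
    # Skip empty candidates and paths that are not regular files.
    if [ -n "$candidate" ] && [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "error: no configuration file found" >&2
  return 1
}
```

Calling find_conf with an empty argument mimics running picardmetrics without -f.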

collate

picardmetrics collate PREFIX DIR

Collate output metrics files in DIR into one file with all metrics from all Picard tools and all BAM files:

PREFIX-all-metrics.tsv

Also write 5 collated histogram files:

PREFIX-base-distribution-by-cycle-histogram.tsv
PREFIX-gc-bias-histogram.tsv
PREFIX-insert-size-histogram.tsv
PREFIX-library-complexity-histogram.tsv
PREFIX-quality-histogram.tsv

refFlat

picardmetrics refFlat <file.gtf[.gz]>

Create <file.refFlat> for the REF_FLAT argument of the CollectRnaSeqMetrics tool. Run this command on your optionally gzipped GTF file, and the output file will be written to the same directory as the GTF file.

picardmetrics run will automatically create the .refFlat file for you if you define the GTF variable in the configuration file.

rRNA

picardmetrics rRNA <file.gtf[.gz]>

Create <file.rRNA.list> for the RIBOSOMAL_INTERVALS argument of the CollectRnaSeqMetrics tool. Run this command on your optionally gzipped GTF file, and the output file will be written to the same directory as the GTF file.

picardmetrics run will automatically create the .rRNA.list file for you if you define the GTF variable in the configuration file.
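
As a rough guide to where the refFlat and rRNA commands leave their output, the sketch below derives the expected file names from a GTF path. It assumes the naming implied by <file.refFlat> and <file.rRNA.list> above, that is, the GTF name with .gtf or .gtf.gz replaced. The function name expected_outputs and the example path are hypothetical.

```shell
# Hypothetical helper: print the output paths expected for a given GTF file,
# assuming the .gtf[.gz] suffix is replaced by .refFlat and .rRNA.list.
expected_outputs() {
  local gtf="$1"
  local base="${gtf%.gz}"   # drop an optional .gz suffix
  base="${base%.gtf}"       # drop the .gtf suffix
  printf '%s\n' "${base}.refFlat" "${base}.rRNA.list"
}

expected_outputs /path/to/genes.gtf.gz
```

Under that assumption, the call prints /path/to/genes.refFlat and /path/to/genes.rRNA.list, both in the same directory as the GTF file.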

CONFIGURATION FILE

The picardmetrics.conf file must define the following variables:

TEMP_DIR

A directory where temporary files will be written.

The sequence dictionary is taken from the BAM header and written to a .list file in this directory. This file is used as the header of the RIBOSOMAL_INTERVALS file passed to CollectRnaSeqMetrics.

A copy of the input BAM file is written to the temporary directory. Then ReorderSam, SortSam, and MarkDuplicates are run on this copy. By default, the copy is deleted after picardmetrics is done. Use option -k to move the sorted, duplicate-marked BAM file to the output directory given with -o instead.
NICENESS

A number between 0 and 20 specifying the niceness to use for all jobs. Use a number greater than 0 to avoid interrupting interactive jobs such as vim or emacs.
PICARD_JAR

The full path to a downloaded picard.jar file. Get the file here: https://broadinstitute.github.io/picard/index.html
PICARD

Your preferred way to invoke Java to call Picard. For example:

PICARD="java -Xms5g -Xmx5g -jar $PICARD_JAR"
REFERENCE_SEQUENCE

The full path to the organism's genome sequence in FASTA format. Required for: CollectMultipleMetrics, CollectRnaSeqMetrics, CollectGcBiasMetrics.
GTF

Full path to a .gtf or .gtf.gz file with gene annotations. picardmetrics will use this to automatically create a .refFlat file and a .rRNA.list file.
REF_FLAT (overrides GTF)

Full path to a text file with annotations of all gene features in UCSC format. Can be generated from a GFF or GTF file. Required for: CollectRnaSeqMetrics.
RIBOSOMAL_INTERVALS (overrides GTF)

Full path to a text file with genomic coordinates of all ribosomal RNA genes in Picard format. Required for: CollectRnaSeqMetrics.
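
A minimal picardmetrics.conf might look like the sketch below. It assumes the file is read as Bash variable assignments, as the PICARD example above suggests. Every path and the NICENESS value are placeholders to replace with your own. Leaving REF_FLAT and RIBOSOMAL_INTERVALS commented out assumes you want them derived from GTF.

```shell
# Example picardmetrics.conf. All paths below are placeholders (assumptions).
TEMP_DIR="/tmp/picardmetrics"
NICENESS=10
PICARD_JAR="/path/to/picard.jar"
PICARD="java -Xms5g -Xmx5g -jar $PICARD_JAR"
REFERENCE_SEQUENCE="/path/to/genome.fa"
GTF="/path/to/genes.gtf.gz"
# Optional: uncomment to override the files derived from GTF.
# REF_FLAT="/path/to/genes.refFlat"
# RIBOSOMAL_INTERVALS="/path/to/genes.rRNA.list"
```

PICARD_JAR must be assigned before PICARD, because PICARD expands $PICARD_JAR at the moment it is assigned.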

EXAMPLES

Here are three examples of how you can run the program:

Run picardmetrics sequentially in a for loop on multiple BAM files.
Run in parallel with GNU parallel, using multiple processors or multiple servers.
Run in parallel with an LSF queue, distributing jobs to multiple servers.

Example 1: Sequential

Run picardmetrics on the provided example BAM files:

for f in data/project1/sample?/sample?.bam; do
  picardmetrics run -r -o out "$f"
done

Collate the generated metrics files:

picardmetrics collate out/project1 out

Next, use the file out/project1-all-metrics.tsv to explore the metrics.
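
One simple way to start exploring is to list the column names of the collated file. The sketch below assumes the first line of the file is a header row. The function name list_columns is hypothetical.

```shell
# Hypothetical helper: print each column name of a tab-delimited file,
# prefixed with its column number. Assumes the first line is a header row.
list_columns() {
  head -n 1 "$1" | tr '\t' '\n' | awk '{ print NR ": " $0 }'
}
```

For example, run list_columns out/project1-all-metrics.tsv, then pass the column numbers of interest to a tool such as cut -f.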

Example 2: GNU parallel

Run 2 jobs in parallel:

parallel -j2 \
  picardmetrics run -o /path/to/out -r {} ::: data/project1/sample?/sample?.bam

If you have many files, or if you want to run jobs on multiple servers, it's a good idea to put the full paths in a text file.

Here, we have ssh access to server1 and server2. We're launching 16 jobs on server1 and 8 jobs on server2. You'll have to make sure that picardmetrics is in your PATH on all servers.

ls /full/path/to/data/project1/sample*/sample*.bam > bams.txt
parallel -S 16/server1,8/server2 \
  picardmetrics run -r -o /path/to/out {} :::: bams.txt

Example 3: LSF

I recommend you install and use asub to submit jobs easily. This command will submit one job per BAM file listed in bams.txt to the myqueue LSF queue.

xargs -I {} echo picardmetrics run -r -o /path/to/out {} < bams.txt \
  | asub -j picardmetrics_jobs -q myqueue

SOURCE CODE

Find the source code here:
https://github.com/slowkow/picardmetrics

BUGS

Please report issues here:
https://github.com/slowkow/picardmetrics/issues

AUTHOR

Kamil Slowikowski from Harvard University wrote picardmetrics. Many developers at the Broad Institute wrote Picard. Heng Li from the Sanger Institute wrote samtools. Aaron Quinlan from the University of Utah wrote stats.