CLI Reference¶
This page provides documentation for CapCruncher command line tools.
capcruncher¶
An end to end solution for processing: Capture-C, Tri-C and Tiled-C data.
Usage:
capcruncher [OPTIONS] COMMAND [ARGS]...
Options:
--version Show the version and exit.
--help Show this message and exit.
alignments¶
Alignment annotation, identification and deduplication.
Usage:
capcruncher alignments [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
annotate¶
Annotates a bed file with other bed files using bedtools intersect.
Whilst bedtools intersect can annotate intervals with either interval names or counts, this command can annotate intervals with both at the same time. As the pipeline allows for empty bed files, this command has built-in support for blank or malformed bed files and will return default N/A values for them.
Prior to interval annotation, the bed file to be intersected is validated and duplicate entries/multimapping reads are removed to ensure consistent annotations and prevent issues with reporter identification.
Usage:
capcruncher alignments annotate [OPTIONS] SLICES
Options:
-a, --actions [get|count] Determines if the overlaps are counted or if
the name should just be reported
-b, --bed_files TEXT Bed file(s) to intersect with slices
-n, --names TEXT Names to use as column names for the output
tsv file.
-f, --overlap_fractions FLOAT The minimum overlap required for an
intersection between two intervals to be
reported.
-t, --dtypes TEXT Data type for column
-o, --output TEXT Path for the annotated slices to be output.
--duplicates [remove] Method to use for reconciling duplicate
slices (i.e. multimapping). Currently only
'remove' is supported.
-p, --n_cores INTEGER Intersections are performed in parallel; set
this to the number of intersections required
--invalid_bed_action [ignore|error]
Method to deal with invalid bed files e.g.
blank or incorrectly formatted. Setting this
to 'ignore' will report default N/A values
(either '.' or 0) for invalid files
--blacklist TEXT Regions to remove from the BAM file prior to
annotation
--prioritize-cis-slices Attempts to prevent slices on the most
common chromosome in a fragment (ideally cis
to the viewpoint) being removed by
deduplication
--priority-chroms TEXT A comma separated list of chromosomes to
prioritize during deduplication
--help Show this message and exit.
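As a minimal sketch of how these options combine, the call below annotates a slices file with two bed files, reporting the overlapping interval name for one and the overlap count for the other. All file names and output column names are hypothetical, and the sketch assumes that `-b`, `-a` and `-n` are repeated once per bed file. The guard simply skips the call when `capcruncher` is not on the PATH.

```shell
# Hypothetical input and annotation files; substitute your own paths.
SLICES=sample.slices.bed
FRAGMENTS=genome.digest.bed
VIEWPOINTS=viewpoints.bed
OUT=sample.annotated.tsv

if command -v capcruncher >/dev/null 2>&1; then
    # One -b/-a/-n triple per annotation (assumed to be repeatable options).
    capcruncher alignments annotate "$SLICES" \
        -b "$FRAGMENTS" -a get -n restriction_fragment \
        -b "$VIEWPOINTS" -a count -n capture_count \
        --invalid_bed_action ignore \
        -p 2 \
        -o "$OUT"
fi
```

With `--invalid_bed_action ignore`, an empty bed file should yield the default N/A values described above rather than an error.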
filter¶
Removes unwanted aligned slices and identifies reporters.
Parses a BAM file and merges this with a supplied annotation to identify unwanted slices. Filtering can be tuned for Capture-C, Tri-C and Tiled-C data to ensure optimal filtering.
Usage:
capcruncher alignments filter [OPTIONS] {capture|tri|tiled}
Options:
-b, --bam TEXT Bam file to process [required]
-a, --annotations TEXT Annotations for the bam file that must contain
the required columns, see description.
[required]
--custom-filtering TEXT Custom filtering to be used. This must be
supplied as a path to a yaml file.
-o, --output_prefix TEXT Output prefix for deduplicated fastq file(s)
--statistics TEXT Output path for stats file
--sample-name TEXT Name of sample e.g. DOX_treated_1
--read-type [flashed|pe] Type of read
--fragments / --no-fragments Determines if read fragment aggregations are
produced
--help Show this message and exit.
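A possible invocation for Capture-C data is sketched here. The BAM file, annotation table, output prefix and statistics path are all hypothetical names; only the flags come from the option list above.

```shell
# Hypothetical inputs; the assay must be one of capture, tri or tiled.
ASSAY=capture
BAM=sample.bam
ANNOTATIONS=sample.annotated.tsv
PREFIX=filtered/sample

if command -v capcruncher >/dev/null 2>&1; then
    capcruncher alignments filter "$ASSAY" \
        -b "$BAM" \
        -a "$ANNOTATIONS" \
        -o "$PREFIX" \
        --statistics stats/sample.filter \
        --sample-name DOX_treated_1 \
        --read-type flashed \
        --no-fragments
fi
```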
fastq¶
Fastq splitting, deduplication and digestion.
Usage:
capcruncher fastq [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
deduplicate¶
Identifies PCR duplicate fragments from Fastq files.
PCR duplicates are very commonly present in Capture-C/Tri-C/Tiled-C data and must be removed for accurate analysis. This command attempts to identify and remove duplicate reads/fragments from fastq file(s) to speed up downstream analysis.
Usage:
capcruncher fastq deduplicate [OPTIONS]
Options:
-1, --fastq1 TEXT Read 1 FASTQ files [required]
-2, --fastq2 TEXT Read 2 FASTQ files [required]
-o, --output-prefix TEXT Output prefix for deduplicated FASTQ files
--sample-name TEXT Name of sample e.g. DOX_treated_1
-s, --statistics TEXT Statistics output file name
--shuffle Shuffle reads before deduplication
--help Show this message and exit.
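For illustration, a paired-end deduplication call might look like the following. The FASTQ names, output prefix and statistics file name are assumptions made for this sketch.

```shell
# Hypothetical paired-end FASTQ files and sample name.
R1=sample_R1.fastq.gz
R2=sample_R2.fastq.gz
SAMPLE=DOX_treated_1

if command -v capcruncher >/dev/null 2>&1; then
    capcruncher fastq deduplicate \
        -1 "$R1" \
        -2 "$R2" \
        -o "deduplicated/${SAMPLE}" \
        --sample-name "$SAMPLE" \
        -s "stats/${SAMPLE}.deduplication.csv"
fi
```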
digest¶
Performs in silico digestion of one or a pair of fastq files.
Usage:
capcruncher fastq digest [OPTIONS] FASTQS...
Options:
-r, --restriction_enzyme TEXT Restriction enzyme name or sequence to use
for in silico digestion. [required]
-m, --mode [flashed|pe] Digestion mode. Combined (Flashed) or non-
combined (PE) read pairs. [required]
-o, --output_file TEXT
--minimum_slice_length INTEGER
--statistics TEXT Output path for stats file
--sample-name TEXT Name of sample e.g. DOX_treated_1. Required
for correct statistics.
--help Show this message and exit.
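The sketch below digests a non-combined read pair in `pe` mode. The enzyme name `dpnii`, the minimum slice length of 18 and every file name are illustrative choices rather than documented defaults, and the output file extension is a guess.

```shell
# Hypothetical read pair and output path.
R1=sample_R1.fastq.gz
R2=sample_R2.fastq.gz
OUT=digested/sample.digested.fastq.gz

if command -v capcruncher >/dev/null 2>&1; then
    capcruncher fastq digest "$R1" "$R2" \
        -r dpnii \
        -m pe \
        -o "$OUT" \
        --minimum_slice_length 18 \
        --statistics stats/sample.digestion \
        --sample-name DOX_treated_1
fi
```

For combined (flashed) reads, a single FASTQ file would be passed with `-m flashed` instead.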
split¶
Splits fastq file(s) into equal chunks of n reads.
Usage:
capcruncher fastq split [OPTIONS] INPUT_FILES...
Options:
-m, --method [python|unix] Method to use for splitting
-o, --output_prefix TEXT Output prefix for split fastq file(s)
--compression_level INTEGER Level of compression for output files
-n, --n_reads INTEGER Number of reads per fastq file
--gzip / --no-gzip Determines if files are gzipped or not
-p, --n_cores INTEGER
-s, --suffix TEXT Suffix to add to output files (ignore
{read_number}.fastq as this is added
automatically)
--help Show this message and exit.
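A hedged example of splitting a read pair into chunks of one million reads follows. The chunk size, compression level and file names are arbitrary values picked for the sketch.

```shell
# Hypothetical read pair; each output chunk holds N_READS reads.
R1=sample_R1.fastq.gz
R2=sample_R2.fastq.gz
N_READS=1000000

if command -v capcruncher >/dev/null 2>&1; then
    capcruncher fastq split "$R1" "$R2" \
        -m unix \
        -n "$N_READS" \
        --gzip \
        --compression_level 5 \
        -p 2 \
        -o split/sample
fi
```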
genome¶
Genome-wide methods, such as digestion.
Usage:
capcruncher genome [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
digest¶
Performs in silico digestion of a genome in fasta format.
Digests the supplied genome fasta file and generates a bed file containing the locations of all restriction fragments produced by the supplied restriction enzyme.
A log file recording the number of restriction fragments for the supplied genome is also generated.
Usage:
capcruncher genome digest [OPTIONS] INPUT_FASTA
Options:
-r, --recognition_site TEXT Restriction enzyme name or recognition sequence [required]
-l, --logfile TEXT Path for digestion log file
-o, --output_file TEXT Output file path
--remove_cutsite BOOLEAN Exclude the recognition sequence from the
output
--sort Sorts the output bed file by chromosome and
start coord.
--help Show this message and exit.
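As a sketch, the following digests a genome and writes a sorted bed file of restriction fragments plus a log file. The FASTA path and output names are hypothetical, and `dpnii` is used only as an example enzyme name.

```shell
# Hypothetical genome FASTA and output paths.
GENOME=genome.fa
BED=genome.digest.bed
LOG=genome.digest.log

if command -v capcruncher >/dev/null 2>&1; then
    capcruncher genome digest "$GENOME" \
        -r dpnii \
        -o "$BED" \
        -l "$LOG" \
        --sort
fi
```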
interactions¶
Reporter counting, storing, comparison and pileups
Usage:
capcruncher interactions [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
compare¶
Compare bedgraphs and CapCruncher cooler files.
These commands extract specific viewpoints from CapCruncher HDF5 files and perform:
1. User defined groupby aggregations.
2. Comparisons between conditions.
3. Identification of differential interactions between conditions.
See subcommands for details.
Usage:
capcruncher interactions compare [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
concat¶
Usage:
capcruncher interactions compare concat [OPTIONS] INFILES...
Options:
-f, --format [auto|bedgraph|cooler]
Input file format
-o, --output TEXT Output file name
-v, --viewpoint TEXT Viewpoint to extract
-r, --resolution TEXT Resolution to extract
--region TEXT Limit to specific coordinates in the format
chrom:start-end
--normalisation [raw|n_cis|region]
Method to use for interaction normalisation
--normalisation-regions TEXT Regions to use for interaction
normalisation. The --normalisation method
MUST be 'region'
--scale_factor INTEGER Scale factor to use for bedgraph
normalisation
-p, --n_cores INTEGER Number of cores to use for extracting
bedgraphs
--help Show this message and exit.
differential¶
Perform differential testing on CapCruncher HDF5 files.
This command performs differential testing on CapCruncher HDF5 files. It requires a design matrix and a contrast to test. The design matrix should be a tab separated file with the first column containing the sample names and the remaining columns containing the conditions. The contrast should specify the name of the column in the design matrix to test. The output is a tab separated bedgraph.
Usage:
capcruncher interactions compare differential [OPTIONS] INTERACTION_FILES...
Options:
-o, --output-prefix TEXT Output file prefix
-v, --viewpoint TEXT Viewpoint to extract [required]
-d, --design-matrix TEXT Design matrix file [required]
-c, --contrast TEXT Contrast to test
-r, --regions-of-interest TEXT Regions of interest to test for differential
interactions
--viewpoint-distance INTEGER Distance from viewpoint to test for
differential interactions
--threshold-count INTEGER Minimum number of interactions to test for
differential interactions
--threshold-q FLOAT Minimum q-value to test for differential
interactions
--help Show this message and exit.
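The design matrix layout described above can be sketched as a small tab separated file, followed by a possible call. The sample names, the `condition` column, the presence of a header row, the viewpoint name and the threshold values are all assumptions made for this illustration.

```shell
# Write a hypothetical design matrix: sample names first, conditions after.
DESIGN=design_matrix.tsv
printf 'sample\tcondition\n' > "$DESIGN"
printf 'ctrl_1\tcontrol\nctrl_2\tcontrol\n' >> "$DESIGN"
printf 'dox_1\ttreated\ndox_2\ttreated\n' >> "$DESIGN"

# Hypothetical viewpoint name.
VIEWPOINT=my_viewpoint

if command -v capcruncher >/dev/null 2>&1; then
    capcruncher interactions compare differential \
        ctrl_1.hdf5 ctrl_2.hdf5 dox_1.hdf5 dox_2.hdf5 \
        -v "$VIEWPOINT" \
        -d "$DESIGN" \
        -c condition \
        --viewpoint-distance 100000 \
        --threshold-count 20 \
        --threshold-q 0.05 \
        -o differential/sample
fi
```

The value passed to `-c` matches the name of the condition column in the design matrix, as the description requires.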
summarise¶
Usage:
capcruncher interactions compare summarise [OPTIONS] INFILE
Options:
-d, --design-matrix TEXT Design matrix file, should be formatted as a
tab separated file with the first column
containing the sample names and the other
column containing the conditions.
-o, --output-prefix TEXT Output file prefix
-f, --output-format [bedgraph|tsv]
-m, --summary-methods TEXT Summary methods to use for aggregation. Can
be any method in numpy or scipy.stats
-n, --group-names TEXT Group names for aggregation
-c, --group-columns TEXT Column names/numbers (0 indexed, the first
column after the end coordinate counts as 0)
for aggregation.
--subtraction Perform subtraction between aggregated groups
--suffix TEXT Add a suffix before the file extension
--help Show this message and exit.
count¶
Determines the number of captured restriction fragment interactions genome wide.
Counts the number of interactions between each restriction fragment and all other restriction fragments within the same read fragment.
The output is a cooler formatted HDF5 file containing a single group containing the interactions between restriction fragments.
See https://cooler.readthedocs.io/en/latest/ for further details.
Usage:
capcruncher interactions count [OPTIONS] REPORTERS
Options:
-o, --output TEXT Name of output file
--remove_exclusions Prevents analysis of fragments marked as
proximity exclusions
--remove_capture Prevents analysis of capture fragment
interactions
--subsample FLOAT Subsamples reporters before analysis of
interactions
-f, --fragment-map TEXT Path to digested genome bed file
-v, --viewpoint-path TEXT Path to viewpoints file
-p, --n-cores INTEGER Number of cores to use for counting.
--assay [capture|tri|tiled]
--help Show this message and exit.
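One way the counting step might be invoked is sketched here. The reporters file, fragment map, viewpoints file and output name are hypothetical placeholders.

```shell
# Hypothetical reporters, digested genome and viewpoint files.
REPORTERS=sample.reporters.parquet
FRAGMENTS=genome.digest.bed
VIEWPOINTS=viewpoints.bed
OUT=sample.counts.hdf5

if command -v capcruncher >/dev/null 2>&1; then
    capcruncher interactions count "$REPORTERS" \
        -f "$FRAGMENTS" \
        -v "$VIEWPOINTS" \
        --assay capture \
        --remove_exclusions \
        -p 4 \
        -o "$OUT"
fi
```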
counts-to-cooler¶
Stores restriction fragment interaction combinations at the restriction fragment level.
Parses reporter restriction fragment interaction counts produced by "capcruncher interactions count" and generates a cooler formatted group in an HDF5 file.
See https://cooler.readthedocs.io/en/latest/ for further details.
Usage:
capcruncher interactions counts-to-cooler [OPTIONS] COUNTS
Options:
-f, --fragment-map TEXT Path to digested genome bed file [required]
-v, --viewpoint-path TEXT Path to viewpoints file [required]
-n, --viewpoint-name TEXT Name of viewpoint to store
-g, --genome TEXT Name of genome
--suffix TEXT Suffix to append after the capture name for the
output file
-o, --output TEXT Name of output file. (Cooler formatted hdf5 file)
--help Show this message and exit.
deduplicate¶
Identifies and removes duplicated aligned fragments.
PCR duplicates are very commonly present in Capture-C/Tri-C/Tiled-C data and must be removed for accurate analysis. Unlike fastq deduplicate, this command removes fragments with identical genomic coordinates.
Non-combined (pe) and combined (flashed) reads are treated slightly differently due to the increased confidence that the ligation junction has been captured for the flashed reads.
Usage:
capcruncher interactions deduplicate [OPTIONS] SLICES
Options:
-o, --output TEXT Output prefix for directory of deduplicated slices
--statistics TEXT Output prefix for stats file(s)
--sample-name TEXT Name of sample e.g. DOX_treated_1
--read-type [flashed|pe] Type of read
--help Show this message and exit.
fragments-to-bins¶
Converts a cooler group containing restriction fragments to constant genomic windows.
Parses a cooler group and aggregates restriction fragment interaction counts into genomic bins of a specified size. If the normalise option is selected, columns containing normalised counts are added to the pixels table of the output.
Usage:
capcruncher interactions fragments-to-bins [OPTIONS] COOLER_PATH
Options:
-b, --binsizes INTEGER Binsizes to use for windowing
--normalise Enables normalisation of interaction counts
during windowing
--overlap_fraction FLOAT Minimum overlap between genomic bins and
restriction fragments for overlap
-p, --n_cores INTEGER Number of cores used for binning
--scale-factor INTEGER Scaling factor used for normalisation
--conversion_tables TEXT Pickle file containing pre-computed fragment ->
bin conversions.
-o, --output TEXT Name of output file. (Cooler formatted hdf5
file)
--assay [capture|tri|tiled]
--help Show this message and exit.
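A minimal sketch of binning fragment-level counts into 5 kb windows with normalisation enabled is given below. The input and output names, bin size and scale factor are illustrative; whether COOLER_PATH needs a group suffix is not stated here, so a plain file name is assumed.

```shell
# Hypothetical fragment-level cooler file and bin size (in base pairs).
COOLER=sample.counts.hdf5
BINSIZE=5000
OUT=sample.binned.hdf5

if command -v capcruncher >/dev/null 2>&1; then
    capcruncher interactions fragments-to-bins "$COOLER" \
        -b "$BINSIZE" \
        --normalise \
        --scale-factor 1000000 \
        --assay capture \
        -p 4 \
        -o "$OUT"
fi
```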
merge¶
Merges capcruncher HDF5 files together.
Produces a unified cooler with both restriction fragment and genomic bins whilst reducing the storage space required by hard linking the "bins" tables to prevent duplication.
Usage:
capcruncher interactions merge [OPTIONS] COOLERS...
Options:
-o, --output TEXT Output file name
--help Show this message and exit.
pileup¶
Extracts reporters from a capture experiment and generates a bedgraph file.
Identifies reporters for a single probe (if a probe name is supplied) or all capture probes present in a capture experiment HDF5 file.
The bedgraph generated can be normalised by the number of cis interactions for inter-experiment comparisons and/or binned into even genomic windows.
Usage:
capcruncher interactions pileup [OPTIONS] URI
Options:
-n, --viewpoint_names TEXT Viewpoint to extract and convert to
bedgraph, if not provided will transform
all.
-o, --output_prefix TEXT Output prefix for bedgraphs
--normalisation [raw|n_cis|region]
Method to use interaction normalisation
--normalisation-regions TEXT Regions to use for interaction
normalisation. The --normalisation method
MUST be 'region'
--binsize INTEGER Binsize to use for converting bedgraph to
evenly sized genomic bins
--gzip Compress output using gzip
--scale-factor INTEGER Scale factor to use for bedgraph
normalisation
--sparse / --dense Produce bedgraph containing just positive
bins (sparse) or all bins (dense)
-f, --format [bedgraph|bigwig] Output file format
--help Show this message and exit.
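To illustrate, the call below extracts a single viewpoint as a cis-normalised, gzipped bedgraph. The HDF5 file name, viewpoint name, output prefix and scale factor are assumptions for this sketch.

```shell
# Hypothetical CapCruncher HDF5 file and viewpoint name.
URI=sample.hdf5
VIEWPOINT=my_viewpoint

if command -v capcruncher >/dev/null 2>&1; then
    capcruncher interactions pileup "$URI" \
        -n "$VIEWPOINT" \
        --normalisation n_cis \
        --scale-factor 1000000 \
        --sparse \
        --gzip \
        -f bedgraph \
        -o pileups/sample
fi
```

Omitting `-n` would, per the option description, convert every viewpoint in the file.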
pipeline¶
Runs the data processing pipeline
Usage:
capcruncher pipeline [OPTIONS] [PIPELINE_OPTIONS]...
Options:
-h, --help
--logo / --no-logo Show the capcruncher logo [default: logo]
--version Show the version and exit.
pipeline-config¶
Configures the data processing pipeline
Usage:
capcruncher pipeline-config [OPTIONS]
Options:
-h, --help
--version Show the version and exit.
-i, --input PATH
--generate-design
plot¶
Generates plots for the outputs produced by CapCruncher
Usage:
capcruncher plot [OPTIONS]
Options:
-r, --region TEXT Genomic coordinates of the region to plot [required]
-t, --template TEXT TOML file containing the template for the plot
[required]
-o, --output TEXT Output file path. The file extension determines the
output format.
--help Show this message and exit.
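A plotting call could look like the sketch below. The coordinates and file names are hypothetical, and the `chrom:start-end` region format is assumed to match the one documented for `--region` elsewhere in this reference.

```shell
# Hypothetical region, TOML template and output image.
REGION=chr1:1000000-1200000
TEMPLATE=plot_template.toml
OUT=plot.png

if command -v capcruncher >/dev/null 2>&1; then
    capcruncher plot \
        -r "$REGION" \
        -t "$TEMPLATE" \
        -o "$OUT"
fi
```

Because the file extension determines the output format, changing `OUT` to end in `.pdf` or `.svg` should switch formats, assuming those are supported.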
utilities¶
Contains miscellaneous functions
Usage:
capcruncher utilities [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
cis-and-trans-stats¶
Usage:
capcruncher utilities cis-and-trans-stats [OPTIONS] SLICES
Options:
-o, --output TEXT Output file name
--sample-name TEXT Name of sample e.g. DOX_treated_1
--assay [capture|tri|tiled] Assay used to generate slices
--help Show this message and exit.
dump¶
Dumps the contents of a cooler or capcruncher parquet file to a TSV file
Args:
path (str): Path to cooler or capcruncher parquet file.
viewpoint (str, optional): Viewpoint to extract. Defaults to None.
resolution (int, optional): Resolution to extract. Only used for cooler (hdf5) files. Defaults to None.
output (str, optional): Output file name. Defaults to "capcruncher_dump.tsv".
Usage:
capcruncher utilities dump [OPTIONS] PATH
Options:
-v, --viewpoint TEXT Viewpoint to extract
-r, --resolution TEXT Resolution to extract. Only used for cooler (hdf5)
files
-o, --output TEXT Output file name
--help Show this message and exit.
gtf-to-bed12¶
Converts a GTF file to a BED12 file containing only 5' UTRs, 3' UTRs, and exons.
Args:
gtf (str): Path to the input GTF file.
output (str): Path to the output BED12 file.
Returns: None
Usage:
capcruncher utilities gtf-to-bed12 [OPTIONS] GTF
Options:
-o, --output TEXT Output file name
--help Show this message and exit.
make-chicago-maps¶
Restriction map file (.rmap): a bed file containing the coordinates of the restriction fragments. By default, 4 columns: chr, start, end, fragmentID.
Bait map file (.baitmap): a bed file containing the coordinates of the baited restriction fragments and their associated annotations. By default, 5 columns: chr, start, end, fragmentID, baitAnnotation. The regions specified in this file, including their fragmentIDs, must be an exact subset of those in the .rmap file. The baitAnnotation is a text field that is used only to annotate the output and plots.
Usage:
capcruncher utilities make-chicago-maps [OPTIONS]
Options:
--fragments TEXT Path to fragments file (default:
capcruncher_output/resources/restriction_fragments/genome.digest.bed.gz)
--viewpoints TEXT Path to viewpoints file used for capcruncher
[required]
-o, --outputdir TEXT Path to output directory [required]
--help Show this message and exit.
regenerate-fastq¶
Regenerates a FASTQ file from a parquet file containing the required reads
Args:
fastq1 (str): Path to the first FASTQ file.
fastq2 (str): Path to the second FASTQ file.
parquet_file (str, optional): Path to the parquet file from which to extract the required reads. Defaults to None.
output (str, optional): Prefix for the output file. Defaults to "regenerated_".
Raises: AssertionError: If the specified parquet file does not exist.
Returns: None
Usage:
capcruncher utilities regenerate-fastq [OPTIONS]
Options:
-1, --fastq1 TEXT Path to FASTQ file 1 [required]
-2, --fastq2 TEXT Path to FASTQ file 2 [required]
-p, --parquet-file TEXT Path to parquet file from which to extract the
required reads [required]
-o, --output-prefix TEXT Output file prefix
--help Show this message and exit.
viewpoint-coordinates¶
Aligns viewpoints to a genome and returns the coordinates of the viewpoint in the genome.
Viewpoints can be supplied as a FASTA file or a TSV file with the first column containing the name of the viewpoint and the second column containing the sequence of the viewpoint.
Args:
viewpoints (os.PathLike): Path to viewpoints.
genome (os.PathLike): Path to genome fasta file.
genome_indicies (os.PathLike, optional): Path to genome bowtie2 indices. Defaults to None.
recognition_site (str, optional): Restriction site used. Defaults to "dpnii".
output (os.PathLike, optional): Output file name. Defaults to "viewpoint_coordinates.bed".
Raises:
ValueError: If viewpoints are not supplied in the correct format.
ValueError: If no bowtie2 indices are supplied.
Usage:
capcruncher utilities viewpoint-coordinates [OPTIONS]
Options:
-v, --viewpoints TEXT Path to viewpoints [required]
-g, --genome TEXT Path to genome fasta file [required]
-i, --genome-indicies TEXT Path to genome bowtie2 indices [required]
-r, --recognition-site TEXT Restriction site used
-o, --output TEXT Output file name
--help Show this message and exit.
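The TSV input format described above (viewpoint name in the first column, sequence in the second) can be sketched as follows, together with a possible call. The viewpoint names, sequences, genome path and bowtie2 index prefix are all made up for illustration.

```shell
# Write a hypothetical two-column viewpoints TSV: name, then sequence.
VIEWPOINTS=viewpoints.tsv
printf 'vp_1\tGATCTTAGGCATTCCGATAGCTAGGCTAGATC\n' > "$VIEWPOINTS"
printf 'vp_2\tGATCCGGATTACGCTAAGCTTGCATGCAGATC\n' >> "$VIEWPOINTS"

# Hypothetical genome FASTA and bowtie2 index prefix.
GENOME=genome.fa
INDEX=bowtie2_index/genome

if command -v capcruncher >/dev/null 2>&1; then
    capcruncher utilities viewpoint-coordinates \
        -v "$VIEWPOINTS" \
        -g "$GENOME" \
        -i "$INDEX" \
        -r dpnii \
        -o viewpoint_coordinates.bed
fi
```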