CLI Reference

This page provides documentation for CapCruncher command line tools.

capcruncher

An end-to-end solution for processing Capture-C, Tri-C and Tiled-C data.

Usage:

capcruncher [OPTIONS] COMMAND [ARGS]...

Options:

  --version  Show the version and exit.
  --help     Show this message and exit.

alignments

Alignment annotation, identification and deduplication.

Usage:

capcruncher alignments [OPTIONS] COMMAND [ARGS]...

Options:

  --help  Show this message and exit.

annotate

Annotates a bed file with other bed files using bedtools intersect.

Whilst bedtools intersect can annotate intervals with either interval names or counts, this command annotates intervals with both interval names and counts at the same time. As the pipeline allows for empty bed files, this command has built-in support for blank/malformed bed files and returns default N/A values for them.

Prior to interval annotation, the bed file to be intersected is validated and duplicate entries/multimapping reads are removed to ensure consistent annotations and prevent issues with reporter identification.
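
As a rough illustration of annotating slices with both a name and a count, the sketch below intersects intervals in pure Python. It is not CapCruncher's implementation (the command itself relies on bedtools intersect); the function names, the tuple layout and the example viewpoint name are assumptions made for this example, and an empty annotation list stands in for a blank bed file.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int, str]  # (chrom, start, end, name)


def overlap_fraction(a: Interval, b: Interval) -> float:
    """Fraction of interval `a` covered by interval `b` (0.0 if no overlap)."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return max(shared, 0) / (a[2] - a[1])


def annotate(slices: List[Interval], bed: List[Interval], min_fraction: float = 1e-9):
    """Return (slice_name, overlapping_name, overlap_count) for every slice.

    Slices without any overlap get the default values "." and 0, which is
    also what an empty (blank) annotation list produces for every slice.
    """
    rows = []
    for s in slices:
        hits = [b[3] for b in bed if overlap_fraction(s, b) >= min_fraction]
        rows.append((s[3], hits[0] if hits else ".", len(hits)))
    return rows


# Hypothetical slices and a single hypothetical viewpoint interval.
slices = [("chr1", 100, 200, "read1|0"), ("chr2", 50, 80, "read1|1")]
viewpoints = [("chr1", 150, 400, "viewpoint_A")]

annotated = annotate(slices, viewpoints)
blank = annotate(slices, [])  # behaves like --invalid_bed_action ignore in spirit
```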

Usage:

capcruncher alignments annotate [OPTIONS] SLICES

Options:

  -a, --actions [get|count]       Determines if the overlaps are counted or if
                                  the name should just be reported
  -b, --bed_files TEXT            Bed file(s) to intersect with slices
  -n, --names TEXT                Names to use as column names for the output
                                  tsv file.
  -f, --overlap_fractions FLOAT   The minimum overlap required for an
                                  intersection between two intervals to be
                                  reported.
  -t, --dtypes TEXT               Data type for column
  -o, --output TEXT               Path for the annotated slices to be output.
  --duplicates [remove]           Method to use for reconciling duplicate
                                  slices (i.e. multimapping). Currently only
                                  'remove' is supported.
  -p, --n_cores INTEGER           Intersections are performed in parallel; set
                                  this to the number of intersections required
  --invalid_bed_action [ignore|error]
                                  Method to deal with invalid bed files e.g.
                                  blank or incorrectly formatted. Setting this
                                  to 'ignore' will report default N/A values
                                  (either '.' or 0) for invalid files
  --blacklist TEXT                Regions to remove from the BAM file prior to
                                  annotation
  --prioritize-cis-slices         Attempts to prevent slices on the most
                                  common chromosome in a fragment (ideally cis
                                  to the viewpoint) being removed by
                                  deduplication
  --priority-chroms TEXT          A comma separated list of chromosomes to
                                  prioritize during deduplication
  --help                          Show this message and exit.

filter

Removes unwanted aligned slices and identifies reporters.

Parses a BAM file and merges this with a supplied annotation to identify unwanted slices. Filtering can be tuned for Capture-C, Tri-C and Tiled-C data to ensure optimal filtering.

Usage:

capcruncher alignments filter [OPTIONS] {capture|tri|tiled}

Options:

  -b, --bam TEXT                Bam file to process  [required]
  -a, --annotations TEXT        Annotations for the bam file that must contain
                                the required columns, see description.
                                [required]
  --custom-filtering TEXT       Custom filtering to be used. This must be
                                supplied as a path to a yaml file.
  -o, --output_prefix TEXT      Output prefix for the filtered slices
  --statistics TEXT             Output path for stats file
  --sample-name TEXT            Name of sample e.g. DOX_treated_1
  --read-type [flashed|pe]      Type of read
  --fragments / --no-fragments  Determines if read fragment aggregations are
                                produced
  --help                        Show this message and exit.

fastq

Fastq splitting, deduplication and digestion.

Usage:

capcruncher fastq [OPTIONS] COMMAND [ARGS]...

Options:

  --help  Show this message and exit.

deduplicate

Identifies PCR duplicate fragments from Fastq files.

PCR duplicates are very commonly present in Capture-C/Tri-C/Tiled-C data and must be removed for accurate analysis. These commands attempt to identify and remove duplicate reads/fragments from fastq file(s) to speed up downstream analysis.
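
The idea of sequence-based duplicate removal can be sketched as follows. This is a minimal illustration that assumes a duplicate is a read pair whose read 1 and read 2 sequences are both identical to an earlier pair; the function name and the in-memory tuples are hypothetical and do not reflect how CapCruncher stores or hashes reads.

```python
from typing import Iterable, List, Tuple

ReadPair = Tuple[str, str, str]  # (read_name, read1_sequence, read2_sequence)


def deduplicate_pairs(pairs: Iterable[ReadPair]) -> List[str]:
    """Return the names of read pairs to keep, one per unique sequence pair."""
    seen = set()
    keep = []
    for name, seq1, seq2 in pairs:
        key = (seq1, seq2)  # both mates must match for a pair to count as a duplicate
        if key not in seen:
            seen.add(key)
            keep.append(name)
    return keep


pairs = [
    ("read_a", "GATCAAAT", "TTTGATCC"),
    ("read_b", "GATCAAAT", "TTTGATCC"),  # exact duplicate of read_a
    ("read_c", "GATCAAAT", "TTTGATCA"),  # differs in read 2, so it is kept
]
unique_names = deduplicate_pairs(pairs)
```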

Usage:

capcruncher fastq deduplicate [OPTIONS]

Options:

  -1, --fastq1 TEXT         Read 1 FASTQ files  [required]
  -2, --fastq2 TEXT         Read 2 FASTQ files  [required]
  -o, --output-prefix TEXT  Output prefix for deduplicated FASTQ files
  --sample-name TEXT        Name of sample e.g. DOX_treated_1
  -s, --statistics TEXT     Statistics output file name
  --shuffle                 Shuffle reads before deduplication
  --help                    Show this message and exit.

digest

Performs in silico digestion of one or a pair of fastq files.

Usage:

capcruncher fastq digest [OPTIONS] FASTQS...

Options:

  -r, --restriction_enzyme TEXT   Restriction enzyme name or sequence to use
                                  for in silico digestion.  [required]
  -m, --mode [flashed|pe]         Digestion mode. Combined (Flashed) or non-
                                  combined (PE) read pairs.  [required]
  -o, --output_file TEXT
  --minimum_slice_length INTEGER
  --statistics TEXT               Output path for stats file
  --sample-name TEXT              Name of sample e.g. DOX_treated_1. Required
                                  for correct statistics.
  --help                          Show this message and exit.
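
In silico digestion of a read can be pictured with the sketch below. It assumes the DpnII recognition sequence GATC and a hypothetical minimum slice length of 18; the function name is illustrative, and the sketch removes the recognition site from the slices, which may differ from what the command does.

```python
from typing import List


def digest_read(sequence: str, site: str = "GATC", min_slice_length: int = 18) -> List[str]:
    """Split a read at every occurrence of `site`, dropping short slices.

    The recognition site itself is removed from the slices here; whether the
    real tool keeps or removes it is not covered by this sketch.
    """
    slices = sequence.upper().split(site.upper())
    return [s for s in slices if len(s) >= min_slice_length]


# A 20 bp slice, a 4 bp slice (too short) and a 22 bp slice, separated by GATC.
read = "ACGT" * 5 + "GATC" + "TTTT" + "GATC" + "C" * 22
slices = digest_read(read)
```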

split

Splits fastq file(s) into equal chunks of n reads.

Usage:

capcruncher fastq split [OPTIONS] INPUT_FILES...

Options:

  -m, --method [python|unix]   Method to use for splitting
  -o, --output_prefix TEXT     Output prefix for split fastq file(s)
  --compression_level INTEGER  Level of compression for output files
  -n, --n_reads INTEGER        Number of reads per fastq file
  --gzip / --no-gzip           Determines if files are gzipped or not
  -p, --n_cores INTEGER
  -s, --suffix TEXT            Suffix to add to output files (ignore
                               {read_number}.fastq as this is added
                               automatically)
  --help                       Show this message and exit.
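
Splitting into chunks of n reads amounts to grouping FASTQ lines four at a time. The sketch below shows the idea on an in-memory list of lines; the function name is hypothetical and no files are read or written.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_fastq(lines: Iterable[str], n_reads: int) -> Iterator[List[str]]:
    """Yield successive chunks of FASTQ lines, each holding at most `n_reads` records."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, n_reads * 4))  # 4 lines per FASTQ record
        if not chunk:
            return
        yield chunk


# Five toy records; with n_reads=2 the last chunk holds the single leftover read.
records = []
for i in range(5):
    records += [f"@read{i}", "ACGT", "+", "IIII"]

chunks = list(split_fastq(records, n_reads=2))
```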

genome

Genome-wide digestion methods.

Usage:

capcruncher genome [OPTIONS] COMMAND [ARGS]...

Options:

  --help  Show this message and exit.

digest

Performs in silico digestion of a genome in fasta format.

Digests the supplied genome fasta file and generates a bed file containing the locations of all restriction fragments produced by the supplied restriction enzyme.

A log file recording the number of restriction fragments for the supplied genome is also generated.
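
The bed output can be pictured with the sketch below, which cuts a toy sequence at every GATC (the DpnII recognition sequence) and reports zero-based, half-open fragment coordinates. The function name, the fourth column holding a fragment index, and the choice to leave each site attached to the following fragment are assumptions of this example, not a description of the command's output.

```python
import re
from typing import Iterator, Tuple


def digest_to_bed(chrom: str, sequence: str, site: str = "GATC") -> Iterator[Tuple[str, int, int, int]]:
    """Yield (chrom, start, end, fragment_index) for each restriction fragment.

    Fragments are cut at the start of every recognition site, so each site
    stays attached to the fragment that follows it (an assumption of this sketch).
    """
    cut_positions = [m.start() for m in re.finditer(re.escape(site), sequence.upper())]
    boundaries = [0] + [p for p in cut_positions if p > 0] + [len(sequence)]
    for index, (start, end) in enumerate(zip(boundaries[:-1], boundaries[1:])):
        if end > start:
            yield chrom, start, end, index


toy_sequence = "AAAA" + "GATC" + "CCCCC" + "GATC" + "TT"
fragments = list(digest_to_bed("chr1", toy_sequence))
bed_lines = ["\t".join(map(str, f)) for f in fragments]
```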

Usage:

capcruncher genome digest [OPTIONS] INPUT_FASTA

Options:

  -r, --recognition_site TEXT  Restriction enzyme name or recognition
                               sequence  [required]
  -l, --logfile TEXT           Path for digestion log file
  -o, --output_file TEXT       Output file path
  --remove_cutsite BOOLEAN     Exclude the recognition sequence from the
                               output
  --sort                       Sorts the output bed file by chromosome and
                               start coord.
  --help                       Show this message and exit.

interactions

Reporter counting, storing, comparison and pileups.

Usage:

capcruncher interactions [OPTIONS] COMMAND [ARGS]...

Options:

  --help  Show this message and exit.

compare

Compare bedgraphs and CapCruncher cooler files.

These commands extract specific viewpoints from CapCruncher HDF5 files and perform:

1. User defined groupby aggregations.

2. Comparisons between conditions.

3. Identification of differential interactions between conditions.

See subcommands for details.

Usage:

capcruncher interactions compare [OPTIONS] COMMAND [ARGS]...

Options:

  --help  Show this message and exit.

concat

Usage:

capcruncher interactions compare concat [OPTIONS] INFILES...

Options:

  -f, --format [auto|bedgraph|cooler]
                                  Input file format
  -o, --output TEXT               Output file name
  -v, --viewpoint TEXT            Viewpoint to extract
  -r, --resolution TEXT           Resolution to extract
  --region TEXT                   Limit to specific coordinates in the format
                                  chrom:start-end
  --normalisation [raw|n_cis|region]
                                  Method to use for interaction normalisation
  --normalisation-regions TEXT    Regions to use for interaction
                                  normalisation. The --normalisation method
                                  MUST be 'region'
  --scale_factor INTEGER          Scale factor to use for bedgraph
                                  normalisation
  -p, --n_cores INTEGER           Number of cores to use for extracting
                                  bedgraphs
  --help                          Show this message and exit.
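
The chrom:start-end format expected by --region can be parsed as in the sketch below. The helper and class names are hypothetical, and accepting thousands separators in the coordinates is an assumption of this example rather than documented behaviour of the command.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(text: str) -> Region:
    """Parse 'chrom:start-end', tolerating thousands separators in the numbers."""
    match = re.fullmatch(r"([^:]+):([\d,]+)-([\d,]+)", text.strip())
    if match is None:
        raise ValueError(f"Region not in chrom:start-end format: {text!r}")
    chrom, start, end = match.groups()
    region = Region(chrom, int(start.replace(",", "")), int(end.replace(",", "")))
    if region.start >= region.end:
        raise ValueError(f"Region start must be smaller than end: {text!r}")
    return region


region = parse_region("chr14:69,875,000-69,880,000")
```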

differential

Perform differential testing on CapCruncher HDF5 files.

This command performs differential testing on CapCruncher HDF5 files. It requires a design matrix and a contrast to test. The design matrix should be a tab separated file with the first column containing the sample names and the remaining columns containing the conditions. The contrast should specify the name of the column in the design matrix to test. The output is a tab separated bedgraph.
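
A design matrix of the shape described above might look like the one built in this sketch. The sample names, the column names `sample` and `condition`, and the helper `read_design` are all hypothetical; the example only shows how the first column and a contrast column relate to each other.

```python
import csv
import io

# Hypothetical samples and a single hypothetical condition column.
design_tsv = (
    "sample\tcondition\n"
    "WT_1\tcontrol\n"
    "WT_2\tcontrol\n"
    "DOX_treated_1\ttreated\n"
    "DOX_treated_2\ttreated\n"
)


def read_design(handle, contrast: str) -> dict:
    """Map each level of the `contrast` column to the samples that carry it."""
    reader = csv.DictReader(handle, delimiter="\t")
    sample_column = reader.fieldnames[0]  # first column holds the sample names
    if contrast not in reader.fieldnames[1:]:
        raise KeyError(f"Contrast {contrast!r} is not a column of the design matrix")
    groups = {}
    for row in reader:
        groups.setdefault(row[contrast], []).append(row[sample_column])
    return groups


groups = read_design(io.StringIO(design_tsv), contrast="condition")
```

With this layout, the value passed to --contrast would be the column name, here `condition`.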

Usage:

capcruncher interactions compare differential [OPTIONS] INTERACTION_FILES...

Options:

  -o, --output-prefix TEXT        Output file prefix
  -v, --viewpoint TEXT            Viewpoint to extract  [required]
  -d, --design-matrix TEXT        Design matrix file  [required]
  -c, --contrast TEXT             Contrast to test
  -r, --regions-of-interest TEXT  Regions of interest to test for differential
                                  interactions
  --viewpoint-distance INTEGER    Distance from viewpoint to test for
                                  differential interactions
  --threshold-count INTEGER       Minimum number of interactions to test for
                                  differential interactions
  --threshold-q FLOAT             Minimum q-value to test for differential
                                  interactions
  --help                          Show this message and exit.

summarise

Usage:

capcruncher interactions compare summarise [OPTIONS] INFILE

Options:

  -d, --design-matrix TEXT        Design matrix file, should be formatted as a
                                  tab separated file with the first column
                                  containing the sample names and the other
                                  column containing the conditions.
  -o, --output-prefix TEXT        Output file prefix
  -f, --output-format [bedgraph|tsv]
  -m, --summary-methods TEXT      Summary methods to use for aggregation. Can
                                  be any method in numpy or scipy.stats
  -n, --group-names TEXT          Group names for aggregation
  -c, --group-columns TEXT        Column names/numbers (0 indexed, the first
                                  column after the end coordinate counts as 0)
                                  for aggregation.
  --subtraction                   Perform subtraction between aggregated groups
  --suffix TEXT                   Add a suffix before the file extension
  --help                          Show this message and exit.
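
The 0-indexed column convention used by --group-columns can be illustrated with the sketch below, in which index 0 refers to the first score column after the end coordinate. The group names, the data and the `summarise` helper are hypothetical, and the mean from the standard library stands in for whichever summary method is chosen.

```python
from statistics import mean

# (chrom, start, end, score columns...). Column indices passed to the sketch
# are 0-based and count from the first column after the end coordinate.
rows = [
    ("chr1", 0, 100, 2.0, 4.0, 10.0, 12.0),
    ("chr1", 100, 200, 1.0, 1.0, 3.0, 5.0),
]
groups = {"control": [0, 1], "treated": [2, 3]}  # hypothetical group names


def summarise(rows, groups, method=mean):
    """Aggregate the score columns of each group with `method`, one value per row."""
    summary = []
    for chrom, start, end, *scores in rows:
        values = {name: method([scores[i] for i in cols]) for name, cols in groups.items()}
        summary.append((chrom, start, end, values))
    return summary


summary = summarise(rows, groups)
# Subtraction between the aggregated groups, analogous in spirit to --subtraction.
differences = [v["treated"] - v["control"] for *_, v in summary]
```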

count

Determines the number of captured restriction fragment interactions genome wide.

Counts the number of interactions between each restriction fragment and all other restriction fragments within the same read fragment.

The output is a cooler formatted HDF5 file with a single group containing the interactions between restriction fragments.

See https://cooler.readthedocs.io/en/latest/ for further details.
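
Conceptually, counting comes down to tallying every pair of restriction fragments observed together in one read fragment. The sketch below assumes reporters are available as a mapping from read name to restriction fragment IDs; that layout and the function name are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical reporters: read fragment name -> restriction fragment IDs of its slices.
reporters = {
    "read1": [10, 25, 31],
    "read2": [10, 25],
    "read3": [25, 10],
}


def count_interactions(reporters):
    """Count every unordered pair of restriction fragments seen in the same read."""
    counts = Counter()
    for fragment_ids in reporters.values():
        for a, b in combinations(sorted(set(fragment_ids)), 2):
            counts[(a, b)] += 1
    return counts


counts = count_interactions(reporters)
```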

Usage:

capcruncher interactions count [OPTIONS] REPORTERS

Options:

  -o, --output TEXT            Name of output file
  --remove_exclusions          Prevents analysis of fragments marked as
                               proximity exclusions
  --remove_capture             Prevents analysis of capture fragment
                               interactions
  --subsample FLOAT            Subsamples reporters before analysis of
                               interactions
  -f, --fragment-map TEXT      Path to digested genome bed file
  -v, --viewpoint-path TEXT    Path to viewpoints file
  -p, --n-cores INTEGER        Number of cores to use for counting.
  --assay [capture|tri|tiled]
  --help                       Show this message and exit.

counts-to-cooler

Stores restriction fragment interaction combinations at the restriction fragment level.

Parses reporter restriction fragment interaction counts produced by "capcruncher interactions count" and generates a cooler formatted group in an HDF5 file. See https://cooler.readthedocs.io/en/latest/ for further details.

Usage:

capcruncher interactions counts-to-cooler [OPTIONS] COUNTS

Options:

  -f, --fragment-map TEXT    Path to digested genome bed file  [required]
  -v, --viewpoint-path TEXT  Path to viewpoints file  [required]
  -n, --viewpoint-name TEXT  Name of viewpoint to store
  -g, --genome TEXT          Name of genome
  --suffix TEXT              Suffix to append after the capture name for the
                             output file
  -o, --output TEXT          Name of output file. (Cooler formatted hdf5 file)
  --help                     Show this message and exit.

deduplicate

Identifies and removes duplicated aligned fragments.

PCR duplicates are very commonly present in Capture-C/Tri-C/Tiled-C data and must be removed for accurate analysis. Unlike fastq deduplicate, this command removes fragments with identical genomic coordinates.

Non-combined (pe) and combined (flashed) reads are treated slightly differently due to the increased confidence that the ligation junction has been captured for the flashed reads.
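
Coordinate-based deduplication can be sketched as below: fragments whose slices map to exactly the same set of genomic coordinates are collapsed to one. The data layout and function name are hypothetical, and the sketch deliberately ignores the different handling of flashed and pe reads mentioned above.

```python
from typing import Dict, List, Tuple

Slice = Tuple[str, int, int]  # (chrom, start, end)


def deduplicate_fragments(fragments: Dict[str, List[Slice]]) -> List[str]:
    """Keep the first fragment seen for each distinct set of slice coordinates."""
    seen = set()
    keep = []
    for name, slices in fragments.items():
        key = tuple(sorted(slices))  # order of slices within a fragment is ignored
        if key not in seen:
            seen.add(key)
            keep.append(name)
    return keep


fragments = {
    "frag_a": [("chr1", 100, 200), ("chr1", 5000, 5100)],
    "frag_b": [("chr1", 5000, 5100), ("chr1", 100, 200)],  # same coordinates as frag_a
    "frag_c": [("chr1", 100, 200), ("chr2", 300, 400)],
}
kept = deduplicate_fragments(fragments)
```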

Usage:

capcruncher interactions deduplicate [OPTIONS] SLICES

Options:

  -o, --output TEXT         Output prefix for directory of deduplicated slices
  --statistics TEXT         Output prefix for stats file(s)
  --sample-name TEXT        Name of sample e.g. DOX_treated_1
  --read-type [flashed|pe]  Type of read
  --help                    Show this message and exit.

fragments-to-bins

Convert a cooler group containing restriction fragments to constant genomic windows.

Parses a cooler group and aggregates restriction fragment interaction counts into genomic bins of a specified size. If the normalise option is selected, columns containing normalised counts are added to the pixels table of the output.
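
The aggregation step can be pictured with the following sketch, which sums per-fragment counts into fixed-size bins. It works on plain tuples rather than a cooler file; the function name and the rule that a fragment contributes its full count to each bin covering at least `overlap_fraction` of it are assumptions of this example.

```python
from collections import defaultdict


def fragments_to_bins(fragment_counts, binsize, overlap_fraction=0.5):
    """Sum fragment counts into fixed-size bins.

    `fragment_counts` holds (chrom, start, end, count) tuples. A fragment
    contributes its full count to every bin that covers at least
    `overlap_fraction` of the fragment (an assumption of this sketch).
    """
    bins = defaultdict(int)
    for chrom, start, end, count in fragment_counts:
        length = end - start
        first_bin, last_bin = start // binsize, (end - 1) // binsize
        for index in range(first_bin, last_bin + 1):
            bin_start, bin_end = index * binsize, (index + 1) * binsize
            shared = min(end, bin_end) - max(start, bin_start)
            if shared / length >= overlap_fraction:
                bins[(chrom, bin_start, bin_end)] += count
    return dict(bins)


fragment_counts = [
    ("chr1", 100, 900, 5),    # entirely inside the first 1 kb bin
    ("chr1", 900, 1300, 2),   # 25% in the first bin, 75% in the second
    ("chr1", 1300, 2000, 4),  # entirely inside the second bin
]
binned = fragments_to_bins(fragment_counts, binsize=1000)
```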

Usage:

capcruncher interactions fragments-to-bins [OPTIONS] COOLER_PATH

Options:

  -b, --binsizes INTEGER       Binsizes to use for windowing
  --normalise                  Enables normalisation of interaction counts
                               during windowing
  --overlap_fraction FLOAT     Minimum overlap between genomic bins and
                               restriction fragments for overlap
  -p, --n_cores INTEGER        Number of cores used for binning
  --scale-factor INTEGER       Scaling factor used for normalisation
  --conversion_tables TEXT     Pickle file containing pre-computed fragment ->
                               bin conversions.
  -o, --output TEXT            Name of output file. (Cooler formatted hdf5
                               file)
  --assay [capture|tri|tiled]
  --help                       Show this message and exit.

merge

Merges capcruncher HDF5 files together.

Produces a unified cooler with both restriction fragment and genomic bins whilst reducing the storage space required by hard linking the "bins" tables to prevent duplication.

Usage:

capcruncher interactions merge [OPTIONS] COOLERS...

Options:

  -o, --output TEXT  Output file name
  --help             Show this message and exit.

pileup

Extracts reporters from a capture experiment and generates a bedgraph file.

Identifies reporters for a single probe (if a probe name is supplied) or all capture probes present in a capture experiment HDF5 file.

The bedgraph generated can be normalised by the number of cis interactions for inter-experiment comparisons and/or binned into even genomic windows.
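
Normalisation by the number of cis interactions can be sketched as follows. The formula used here (count divided by the cis total, multiplied by a scale factor) is an assumption made for illustration and may not match the command's exact calculation; the function name and data are hypothetical.

```python
def normalise_by_cis(bedgraph, viewpoint_chrom, scale_factor=1_000_000):
    """Scale each count by the total number of cis interactions.

    `bedgraph` holds (chrom, start, end, count) tuples; cis interactions are
    those on the same chromosome as the viewpoint.
    """
    n_cis = sum(count for chrom, _, _, count in bedgraph if chrom == viewpoint_chrom)
    if n_cis == 0:
        raise ValueError("No cis interactions to normalise by")
    return [
        (chrom, start, end, count / n_cis * scale_factor)
        for chrom, start, end, count in bedgraph
    ]


bedgraph = [("chr1", 0, 100, 30), ("chr1", 100, 200, 10), ("chr2", 0, 100, 10)]
normalised = normalise_by_cis(bedgraph, viewpoint_chrom="chr1", scale_factor=1000)
```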

Usage:

capcruncher interactions pileup [OPTIONS] URI

Options:

  -n, --viewpoint_names TEXT      Viewpoint to extract and convert to
                                  bedgraph, if not provided will transform
                                  all.
  -o, --output_prefix TEXT        Output prefix for bedgraphs
  --normalisation [raw|n_cis|region]
                                  Method to use for interaction normalisation
  --normalisation-regions TEXT    Regions to use for interaction
                                  normalisation. The --normalisation method
                                  MUST be 'region'
  --binsize INTEGER               Binsize to use for converting bedgraph to
                                  evenly sized genomic bins
  --gzip                          Compress output using gzip
  --scale-factor INTEGER          Scale factor to use for bedgraph
                                  normalisation
  --sparse / --dense              Produce bedgraph containing just positive
                                  bins (sparse) or all bins (dense)
  -f, --format [bedgraph|bigwig]  Output file format
  --help                          Show this message and exit.

pipeline

Runs the data processing pipeline.

Usage:

capcruncher pipeline [OPTIONS] [PIPELINE_OPTIONS]...

Options:

  -h, --help
  --logo / --no-logo  Show the capcruncher logo  [default: logo]
  --version           Show the version and exit.

pipeline-config

Configures the data processing pipeline.

Usage:

capcruncher pipeline-config [OPTIONS]

Options:

  -h, --help
  --version          Show the version and exit.
  -i, --input PATH
  --generate-design

plot

Generates plots for the outputs produced by CapCruncher.

Usage:

capcruncher plot [OPTIONS]

Options:

  -r, --region TEXT    Genomic coordinates of the region to plot  [required]
  -t, --template TEXT  TOML file containing the template for the plot
                       [required]
  -o, --output TEXT    Output file path. The file extension determines the
                       output format.
  --help               Show this message and exit.

utilities

Contains miscellaneous functions.

Usage:

capcruncher utilities [OPTIONS] COMMAND [ARGS]...

Options:

  --help  Show this message and exit.

cis-and-trans-stats

Usage:

capcruncher utilities cis-and-trans-stats [OPTIONS] SLICES

Options:

  -o, --output TEXT            Output file name
  --sample-name TEXT           Name of sample e.g. DOX_treated_1
  --assay [capture|tri|tiled]  Assay used to generate slices
  --help                       Show this message and exit.

dump

Dumps the contents of a cooler or capcruncher parquet file to a TSV file.

Args:
  path (str): Path to cooler or capcruncher parquet file.
  viewpoint (str, optional): Viewpoint to extract. Defaults to None.
  resolution (int, optional): Resolution to extract. Only used for cooler (hdf5) files. Defaults to None.
  output (str, optional): Output file name. Defaults to "capcruncher_dump.tsv".

Usage:

capcruncher utilities dump [OPTIONS] PATH

Options:

  -v, --viewpoint TEXT   Viewpoint to extract
  -r, --resolution TEXT  Resolution to extract. Only used for cooler (hdf5)
                         files
  -o, --output TEXT      Output file name
  --help                 Show this message and exit.

gtf-to-bed12

Converts a GTF file to a BED12 file containing only 5' UTRs, 3' UTRs, and exons.

Args:
  gtf (str): Path to the input GTF file.
  output (str): Path to the output BED12 file.

Returns:
  None

Usage:

capcruncher utilities gtf-to-bed12 [OPTIONS] GTF

Options:

  -o, --output TEXT  Output file name
  --help             Show this message and exit.

make-chicago-maps

Restriction map file (.rmap): a bed file containing coordinates of the restriction fragments. By default, 4 columns: chr, start, end, fragmentID.

Bait map file (.baitmap): a bed file containing coordinates of the baited restriction fragments, and their associated annotations. By default, 5 columns: chr, start, end, fragmentID, baitAnnotation. The regions specified in this file, including their fragmentIDs, must be an exact subset of those in the .rmap file. The baitAnnotation is a text field that is used only to annotate the output and plots.
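
The relationship between the two files can be sketched as below: every bait map line reuses the coordinates and fragmentID of an .rmap line and appends an annotation. The function name, the toy coordinates and the rule that a fragment counts as baited when it overlaps a viewpoint are assumptions of this example, not a description of how the command selects baits.

```python
def make_chicago_maps(fragments, viewpoints):
    """Build .rmap and .baitmap lines from restriction fragments and viewpoints.

    `fragments` holds (chrom, start, end, fragment_id) tuples and `viewpoints`
    holds (chrom, start, end, name) tuples. A fragment is treated as baited
    when it overlaps a viewpoint (an assumption of this sketch).
    """
    rmap = ["\t".join(map(str, f)) for f in fragments]
    baitmap = []
    for chrom, start, end, fragment_id in fragments:
        for v_chrom, v_start, v_end, name in viewpoints:
            if chrom == v_chrom and start < v_end and v_start < end:
                baitmap.append("\t".join(map(str, (chrom, start, end, fragment_id, name))))
                break
    return rmap, baitmap


fragments = [("chr1", 0, 400, 0), ("chr1", 400, 900, 1), ("chr1", 900, 1500, 2)]
viewpoints = [("chr1", 450, 570, "viewpoint_A")]  # hypothetical viewpoint
rmap, baitmap = make_chicago_maps(fragments, viewpoints)
```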

Usage:

capcruncher utilities make-chicago-maps [OPTIONS]

Options:

  --fragments TEXT      Path to fragments file (default: capcruncher_output/re
                        sources/restriction_fragments/genome.digest.bed.gz)
  --viewpoints TEXT     Path to viewpoints file used for capcruncher
                        [required]
  -o, --outputdir TEXT  Path to output directory  [required]
  --help                Show this message and exit.

regenerate-fastq

Regenerates a FASTQ file from a parquet file containing the required reads.

Args:
  fastq1 (str): Path to the first FASTQ file.
  fastq2 (str): Path to the second FASTQ file.
  parquet_file (str, optional): Path to the parquet file from which to extract the required reads. Defaults to None.
  output (str, optional): Prefix for the output file. Defaults to "regenerated_".

Raises:
  AssertionError: If the specified parquet file does not exist.

Returns:
  None

Usage:

capcruncher utilities regenerate-fastq [OPTIONS]

Options:

  -1, --fastq1 TEXT         Path to FASTQ file 1  [required]
  -2, --fastq2 TEXT         Path to FASTQ file 2  [required]
  -p, --parquet-file TEXT   Path to parquet file from which to extract the
                            required reads  [required]
  -o, --output-prefix TEXT  Output file prefix
  --help                    Show this message and exit.

viewpoint-coordinates

Aligns viewpoints to a genome and returns the coordinates of the viewpoint in the genome.

Viewpoints can be supplied as a FASTA file or a TSV file with the first column containing the name of the viewpoint and the second column containing the sequence of the viewpoint.
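
The two accepted input layouts can be illustrated with the sketch below, which reads viewpoints from either FASTA text or a two-column TSV into the same name-to-sequence mapping. The function name and the toy sequences are hypothetical, and the format detection (a leading ">" means FASTA) is an assumption of this example.

```python
def read_viewpoints(text: str) -> dict:
    """Parse viewpoints given either as FASTA or as a two-column TSV (name, sequence)."""
    viewpoints = {}
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines and lines[0].startswith(">"):  # FASTA input
        name = None
        for line in lines:
            if line.startswith(">"):
                name = line[1:].split()[0]
                viewpoints[name] = ""
            else:
                viewpoints[name] += line.upper()
    else:  # TSV input
        for line in lines:
            fields = line.split("\t")
            if len(fields) < 2:
                raise ValueError(f"Expected name and sequence columns, got: {line!r}")
            viewpoints[fields[0]] = fields[1].upper()
    return viewpoints


tsv_text = "vp1\tGATCAAAA\nvp2\tgatcTTTT\n"
fasta_text = ">vp1 example\nGATC\nAAAA\n>vp2\nGATCTTTT\n"
from_tsv = read_viewpoints(tsv_text)
from_fasta = read_viewpoints(fasta_text)
```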

Args:
  viewpoints (os.PathLike): Path to viewpoints.
  genome (os.PathLike): Path to genome fasta file.
  genome_indicies (os.PathLike, optional): Path to genome bowtie2 indices. Defaults to None.
  recognition_site (str, optional): Restriction site used. Defaults to "dpnii".
  output (os.PathLike, optional): Output file name. Defaults to "viewpoint_coordinates.bed".

Raises:
  ValueError: If viewpoints are not supplied in the correct format.
  ValueError: If no bowtie2 indices are supplied.

Usage:

capcruncher utilities viewpoint-coordinates [OPTIONS]

Options:

  -v, --viewpoints TEXT        Path to viewpoints  [required]
  -g, --genome TEXT            Path to genome fasta file  [required]
  -i, --genome-indicies TEXT   Path to genome bowtie2 indices  [required]
  -r, --recognition-site TEXT  Restriction site used
  -o, --output TEXT            Output file name
  --help                       Show this message and exit.