Clustering
STARK has built-in clustering algorithms such as Higashi, Fast-higashi, deepnanoHi-C, and schicluster. These algorithms are all based on clustering from pairs.gz files.
To run clustering, use the following command:
stark clustering --method <method_name> \
--config <config_file> \
example:
stark clustering --method higashi \
--config config.json
The configuration file should be in JSON format and include the following fields:
Attention:
if you want to running Higashi, Fast-Higashi or DeepnanoHi-C, you need to install torch and cuda first.
You find the installation instructions here: https://pytorch.org/get-started/locally/
Higashi:
{
"data_dir": "/path/to/final_dir",
"structured": true,
"input_format": "higashi_v2",
"header_included": true,
"temp_dir": "/path/to/final_dir",
"genome_reference_path": "g38.fa.chrom.sizes",
"cytoband_path": "cytoBand_hg38.txt",
"chrom_list": ["chr1", "chr2", "chr3", "chr4", "chr5",
"chr6", "chr7", "chr8", "chr9", "chr10",
"chr11", "chr12", "chr13", "chr14", "chr15",
"chr16", "chr17", "chr18", "chr19", "chr20",
"chr21", "chr22"],
"resolution": 50000,
"resolution_cell": 50000,
"resolution_fh": [50000],
"embedding_name": "test",
"minimum_distance": 50000,
"maximum_distance": -1,
"local_transfer_range": 0,
"loss_mode": "zinb",
"dimensions": 128,
"impute_list": ["chr1", "chr2", "chr3", "chr4", "chr5",
"chr6", "chr7", "chr8", "chr9", "chr10",
"chr11", "chr12", "chr13", "chr14", "chr15",
"chr16", "chr17", "chr18", "chr19", "chr20",
"chr21", "chr22"],
"neighbor_num": 5,
"cpu_num": 10,
"gpu_num": 8,
"embedding_epoch": 60,
"correct_be_impute": true
}
Fast-higashi:
{
"data_dir": "/path/to/final_dir",
"structured": true,
"input_format": "higashi_v2",
"header_included": true,
"temp_dir": "/path/to/final_dir",
"genome_reference_path": "hg38.fa.chrom.sizes",
"cytoband_path": "cytoBand_hg38.txt",
"chrom_list": ["chr1", "chr2", "chr3", "chr4", "chr5",
"chr6", "chr7", "chr8", "chr9", "chr10",
"chr11", "chr12", "chr13", "chr14", "chr15",
"chr16", "chr17", "chr18", "chr19", "chr20",
"chr21", "chr22"],
"resolution": 50000,
"resolution_cell": 50000,
"resolution_fh": [50000],
"embedding_name": "test",
"minimum_distance": 50000,
"maximum_distance": -1,
"local_transfer_range": 0,
"loss_mode": "zinb",
"dimensions": 128,
"impute_list": ["chr1", "chr2", "chr3", "chr4", "chr5",
"chr6", "chr7", "chr8", "chr9", "chr10",
"chr11", "chr12", "chr13", "chr14", "chr15",
"chr16", "chr17", "chr18", "chr19", "chr20",
"chr21", "chr22"],
"neighbor_num": 5,
"cpu_num": 10,
"gpu_num": 8,
"embedding_epoch": 60,
"correct_be_impute": true
}
DeepnanoHi-C:
{
"data_dir": "/path/to/final_dir",
"temp_dir": ""/path/to/final_dir"",
"structured": true,
"input_format": "higashi_v2",
"header_included": true,
"genome_reference_path": "hg38.fa.chrom.sizes",
"cytoband_path": "cytoBand_hg38.txt",
"chrom_list": ["chr1", "chr2", "chr3", "chr4", "chr5",
"chr6", "chr7", "chr8", "chr9", "chr10",
"chr11", "chr12", "chr13", "chr14", "chr15",
"chr16", "chr17", "chr18", "chr19", "chr20",
"chr21", "chr22"],
"resolution": 500000,
"resolution_cell": 500000,
"resolution_fh": [500000],
"embedding_name": "test",
"minimum_distance": 50000,
"maximum_distance": -1,
"local_transfer_range": 0,
"loss_mode": "zinb",
"dimensions": 128,
"impute_list": ["chr1", "chr2", "chr3", "chr4", "chr5",
"chr6", "chr7", "chr8", "chr9", "chr10",
"chr11", "chr12", "chr13", "chr14", "chr15",
"chr16", "chr17", "chr18", "chr19", "chr20",
"chr21", "chr22"],
"neighbor_num": 5,
"cpu_num": 10,
"gpu_num": 8,
"embedding_epoch": 60,
"correct_be_impute": true
}
Schicluster:
{
"data_dir": "/path/to/final_dir",
"temp_dir": "/path/to/final_dir",
"filelist": "/path/to/final_dir/filelist.txt",
"genome_reference_path": "hg38.fa.chrom.sizes",
"chrom_list": ["chr1", "chr2", "chr3", "chr4", "chr5",
"chr6", "chr7", "chr8", "chr9", "chr10",
"chr11", "chr12", "chr13", "chr14", "chr15",
"chr16", "chr17", "chr18", "chr19", "chr20",
"chr21", "chr22"],
"impute_list": ["chr1", "chr2", "chr3", "chr4", "chr5",
"chr6", "chr7", "chr8", "chr9", "chr10",
"chr11", "chr12", "chr13", "chr14", "chr15",
"chr16", "chr17", "chr18", "chr19", "chr20",
"chr21", "chr22"],
"dist": 2500,
"res": 50000,
"n_jobs": 10
}
You need to give the filelist.txt file in the target data_dir directory, which contains the pairs.gz files of all the cells you need to cluster, like:
/path/to/final_dir/[Even2Bo10][Odd2Bo69][DPM6bot1].pairs.gz
/path/to/final_dir/[Even2Bo11][Odd2Bo19][DPM6bot31].pairs.gz
/path/to/final_dir/[Even2Bo11][Odd2Bo1][DPM6bot75].pairs.gz