Preprocess the sc3DG
Typical Workflow
1. scHi-C
Test data: /tutorial/scHi-C/data
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/scHic_no \
-t scHic \
-e MboI \
-i /absolute/path/to/mm10/mm10.fa \
--thread 60
Parameter Description:
-o: Location to save the results. Note that all paths must be absolute.
-f: Directory where the sequencing data is located.
-t: Type of single-cell Hi-C.
-e: Restriction enzyme used; it must match the one used in the experiment.
-r: Resolution used when converting pairs files into cool files.
-i: Path to the genome file used for alignment. The final component (for example, hg38.fa) names the genome and is not part of the directory; it must be consistent with the -g parameter of STARK index.
-a: Aligner to use; the options are bwa, bowtie2, bismark, and minimap2. It must be consistent with the index produced by STARK index.
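Because every path must be absolute, it can help to check the paths before launching a long run. The following is a minimal sketch and not part of STARK: `require_abs` is a hypothetical helper, and the three paths are the placeholders from the example above.

```shell
# Hypothetical helper (not part of STARK): succeed only if every
# argument is an absolute path, i.e. starts with "/".
require_abs() {
    local p
    for p in "$@"; do
        case "$p" in
            /*) ;;                                    # absolute, keep going
            *)  echo "not an absolute path: $p" >&2
                return 1 ;;
        esac
    done
    return 0
}

# Placeholder locations; replace them with your own.
out=/absolute/path/to/result
data=/absolute/path/to/data/scHic_no
genome=/absolute/path/to/mm10/mm10.fa

if require_abs "$out" "$data" "$genome"; then
    echo "paths OK"
fi
```

The check only looks at the leading slash; it does not verify that the paths exist.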
2. scHi-C+
Test data: /tutorial/scHic_index/data
Note that this technique should produce four fastq files: _1 and _4 contain the reads, and _2 and _3 contain the corresponding barcodes.
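The four-file layout can be checked before running the pipeline. This is a rough sketch under an assumption that is not stated in this tutorial: file names end in `_1` to `_4` followed by `.fastq`, `.fq`, `.fastq.gz`, or `.fq.gz`. `check_four_fastq` is a hypothetical helper, not a STARK command.

```shell
# Hypothetical check: the data directory should contain at least one
# fastq file for each of the suffixes _1, _2, _3 and _4.
# Assumed naming: <sample>_<n>.fastq[.gz] or <sample>_<n>.fq[.gz].
check_four_fastq() {
    local dir=$1 n hits
    for n in 1 2 3 4; do
        hits=$(find "$dir" -maxdepth 1 -type f \
            \( -name "*_${n}.fastq" -o -name "*_${n}.fastq.gz" \
               -o -name "*_${n}.fq" -o -name "*_${n}.fq.gz" \) | wc -l)
        if [ "$hits" -lt 1 ]; then
            echo "no *_${n} fastq file found in $dir" >&2
            return 1
        fi
    done
    return 0
}

# Usage (placeholder path):
# check_four_fastq /absolute/path/to/data/scHic_index
```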
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/scHic_index \
-t scHic \
-e MboI \
-i /absolute/path/to/mm10/mm10.fa \
--thread 60 \
--exist-barcode
The only difference from scHi-C is that you must indicate that barcodes are present by adding the --exist-barcode parameter.
3. Dip-C
Test data: /tutorial/Dip-C/data
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/dipC \
-t dipC \
-e MboI \
-i /absolute/path/to/hg38/hg38.fa \
--thread 60
As with scHi-C, STARK automatically runs the subsequent steps according to the -t parameter.
4. HiRES
Test data: /tutorial/HRIES/data
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/HIRES \
-t HIRES \
-e MboI \
-i /absolute/path/to/mm10/mm10.fa \
--thread 60
As with scHi-C, STARK automatically runs the subsequent steps according to the -t parameter.
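Since the scHi-C, Dip-C, and HiRES examples differ only in the data directory, the -t value, and the genome, the commands can be generated from a small table. The sketch below only prints the commands and does not run them. `print_stark_cmds` is a hypothetical helper, and writing each result to its own subdirectory is an assumption made here for illustration.

```shell
# Print (do not run) one "stark count" command per dataset.
# Table columns: data directory name, -t value, genome build,
# taken from the scHi-C, Dip-C and HiRES examples above.
print_stark_cmds() {
    local root=$1 out=$2 genomes=$3
    local dir type build
    while read -r dir type build; do
        printf 'stark count -o %s -f %s -t %s -e MboI -i %s --thread 60\n' \
            "$out/$dir" "$root/$dir" "$type" "$genomes/$build/$build.fa"
    done <<'EOF'
scHic_no scHic mm10
dipC dipC hg38
HIRES HIRES mm10
EOF
}

# Placeholder locations; replace them with your own.
print_stark_cmds /absolute/path/to/data /absolute/path/to/result /absolute/path/to
```

Review the printed commands before executing them, for example by redirecting the output to a script file.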
5. sn-m3C
Test data: /tutorial/sn-m3C/data
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/sn_m3c \
-t sn_m3c \
-e MboI \
-r 10000 \
-i /absolute/path/to/bowtie2/hg38/hg38.fa \
--aligner bowtie2 \
--thread 60
Because sn-m3C profiles methylation and Hi-C contacts simultaneously, only bismark can be used for alignment, so there are no alternative choices for -i and -a. However, because bismark is built on bowtie2, you can specify bowtie2 for these parameters.
6. scSPRITE
Test data: /tutorial/scSPRITE/data
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/scSPRITE \
-t scSPRITE \
-e HpyCH4V \
-i /absolute/path/to/mm10/mm10.fa \
--thread 60 \
--repeat-masker /absolute/path/to/mm10_rmsk.bed \
--exist-barcode
Parameter Description:
--sprite-config: A txt file used to generate the scSPRITE barcodes.
--repeat-masker: A bed file used to mask repetitive regions of the genome.
7. sciHi-C
Test data: /tutorial/sciHi-C/data
Note that each sequencing data directory must contain not only the two sequencing fastq files but also the barcode files inner_barcode.txt and outer_barcode.txt. Their format is shown in the example.
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/sciHic \
-t sciHic \
-e DpnII \
-i /absolute/path/to/mm10/mm10.fa \
--thread 60 \
--exist-barcode
As with scHic_index, STARK automatically carries out the subsequent processing according to the -t parameter.
8. snHi-C
Test data: /tutorial/snHi-C/data
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/snHic \
-t snHic \
-e DpnII \
-i /absolute/path/to/mm10/mm10.fa \
--thread 60 \
--exist-barcode
As with scHi-C, STARK automatically runs the subsequent steps according to the -t parameter.
9. snHi-C+
Test data: /tutorial/snHi-C+/data
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/snHic_index \
-t snHic \
-e MboI \
-i /absolute/path/to/hg38/hg38.fa \
--thread 60 \
--exist-barcode
As with scHi-C, STARK automatically runs the subsequent steps according to the -t parameter.
10. scNanoHi-C
Test data: /tutorial/scNanoHi-C/data
Note that scNanoHi-C uses third-generation sequencing, and the sequencing data directory should contain the fastq file, TN5.txt, PCR.txt, and index.txt.
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/scNano/data \
-t scNano \
-e MboI \
-i /absolute/path/to/mm10/mm10.fa \
--thread 60 \
--exist-barcode
Because scNanoHi-C uses third-generation sequencing, it is aligned with minimap2 by default, so the -a parameter has no effect.
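The required contents of a scNanoHi-C data directory can be verified with a short check before starting. This is an illustrative sketch only: `check_scnano_dir` is a hypothetical helper, and the accepted fastq extensions are an assumption.

```shell
# Hypothetical check for a scNanoHi-C data directory: it should hold
# TN5.txt, PCR.txt, index.txt and at least one fastq file.
check_scnano_dir() {
    local dir=$1 name
    for name in TN5.txt PCR.txt index.txt; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing $dir/$name" >&2
            return 1
        fi
    done
    # Assumed fastq extensions: .fastq, .fq, optionally gzipped.
    if ! find "$dir" -maxdepth 1 -type f \
            \( -name '*.fastq' -o -name '*.fastq.gz' \
               -o -name '*.fq' -o -name '*.fq.gz' \) | grep -q .; then
        echo "no fastq file found in $dir" >&2
        return 1
    fi
    return 0
}

# Usage (placeholder path):
# check_scnano_dir /absolute/path/to/data/scNano/data
```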
11. scMethyl
Test data: /tutorial/scNanoHi-C/data
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/scMeth \
-t scMethyl \
-e DpnII \
-i /absolute/path/to/bowtie2/mm10/mm10 \
--aligner bowtie2 \
--thread 60 \
--exist-barcode
12. LiMAC
Test data: /tutorial/LiMAC/data
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/LiMAC \
-t LiMAC \
-e MboI \
-i /absolute/path/to/hg38/hg38.fa \
--thread 60
Parameter Description:
-o: Location to save the results. Note that all paths must be absolute.
-f: Directory where the sequencing data is located.
-t: Type of single-cell Hi-C.
-e: Restriction enzyme used; it must match the one used in the experiment.
-r: Resolution used when converting pairs files into cool files.
-i: Path to the genome file used for alignment. The final component (for example, hg38.fa) names the genome and is not part of the directory; it must be consistent with the -g parameter of STARK index.
-a: Aligner to use; the options are bwa, bowtie2, bismark, and minimap2. It must be consistent with the index produced by STARK index.
13. GAGE-seq
Test data: /tutorial/GAGE-seq/data
stark count -o /absolute/path/to/result \
-f /absolute/path/to/data/GAGE-seq \
-t GAGE-seq \
-e MboI \
-i /absolute/path/to/mm10/mm10.fa \
--thread 60 \
--exist-barcode
Parameter Description:
-o: Location to save the results. Note that all paths must be absolute.
-f: Directory where the sequencing data is located.
-t: Type of single-cell Hi-C.
-e: Restriction enzyme used; it must match the one used in the experiment.
-r: Resolution used when converting pairs files into cool files.
-i: Path to the genome file used for alignment. The final component (for example, hg38.fa) names the genome and is not part of the directory; it must be consistent with the -g parameter of STARK index.
-a: Aligner to use; the options are bwa, bowtie2, bismark, and minimap2. It must be consistent with the index produced by STARK index.
14. Droplet Hi-C
Test data: /tutorial/Droplet/data
Make sure bowtie is installed before running:
conda install bioconda::bowtie==1.3.1
After that, build the bowtie index for the 10x barcode reference:
bowtie-build /path/to/10x/barcode/reference/ref.fa /path/to/10x/bowtie/index
Then run the following command to process the droplet data:
stark count -t droplet \
--ref-10x /path/to/10x/bowtie/index \
-f /absolute/path/to/data/droplet \
-i /absolute/path/to/mm10/mm10.fa \
-e MboI \
-o /absolute/path/to/result \
--exist-barcode \
--thread 32
Parameter Description:
-o: Location to save the results. Note that all paths must be absolute.
-f: Directory where the sequencing data is located.
-t: Type of single-cell Hi-C.
--ref-10x: Path to the bowtie index for the 10x barcode reference.
-i: bwa index of the genome used for alignment.
15. Paired
Test data: /tutorial/Paired/data
Make sure bowtie is installed before running:
conda install bioconda::bowtie==1.3.1
After that, build the bowtie index for the 10x barcode reference:
bowtie-build /path/to/10x/barcode/reference/ref.fa /path/to/10x/bowtie/index
Then run the following command to process the Paired data:
stark count -t Paired \
--ref-10x /path/to/10x/bowtie/index \
-f /absolute/path/to/data/Paired \
-i /absolute/path/to/mm10/mm10.fa \
-e MboI \
-o /absolute/path/to/result \
--exist-barcode \
--thread 32
Parameter Description:
-o: Location to save the results. Note that all paths must be absolute.
-f: Directory where the sequencing data is located.
-t: Type of single-cell Hi-C.
--ref-10x: Path to the bowtie index for the 10x barcode reference.
-i: bwa index of the genome used for alignment.
Explanation of the Results
The output directory is organized as follows:
scSPRITE_test_tmp/
├── Result
│   ├── cool_folder
│   │   ├── [Even2Bo10][Odd2Bo69][DPM6bot1]_10000.cool
│   │   ├── [Even2Bo10][Odd2Bo69][DPM6bot1]_10000.KR.cool
│   │   ├── [Even2Bo11][Odd2Bo19][DPM6bot31]_10000.cool
│   │   ├── [Even2Bo11][Odd2Bo19][DPM6bot31]_10000.KR.cool
│   │   ├── [Even2Bo11][Odd2Bo1][DPM6bot75]_10000.cool
│   │   └── [Even2Bo11][Odd2Bo1][DPM6bot75]_10000.KR.cool
│   ├── mcool_folder
│   │   ├── [Even2Bo10][Odd2Bo69][DPM6bot1].mcool
│   │   ├── [Even2Bo11][Odd2Bo19][DPM6bot31].mcool
│   │   └── [Even2Bo11][Odd2Bo1][DPM6bot75].mcool
│   └── SCpair
│       ├── [Even2Bo10][Odd2Bo69][DPM6bot1].pairs.gz
│       ├── [Even2Bo11][Odd2Bo19][DPM6bot31].pairs.gz
│       └── [Even2Bo11][Odd2Bo1][DPM6bot75].pairs.gz
├── test.bam
├── test_logging.log
└── trimmed.fastp.json
scSPRITE_test_tmp is the root directory of the output, where ‘test’ is the name of the sample.
Result is the directory where the main results are saved.
cool_folder is the directory that stores all cells’ cool files, before and after KR correction.
mcool_folder is the directory that stores all cells’ mcool files.
SCpair is the directory that stores all cells’ pairs files.
test.bam is the bam file of the sequencing data.
test_logging.log records all the parameters STARK used to process the data, as well as the time spent on each step.
trimmed.fastp.json records the results of the fastp processing step.
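As a quick sanity check on a finished run, the per-cell outputs can be counted and cross-checked. The sketch below is not a STARK feature: `summarize_cells` is a hypothetical helper that assumes the layout shown above, with one .pairs.gz file per cell in SCpair and a matching .mcool file in mcool_folder.

```shell
# Hypothetical summary of a Result directory: count the cells found in
# SCpair and report any cell without a matching file in mcool_folder.
summarize_cells() {
    local result=$1 f cell total=0 missing=0
    for f in "$result"/SCpair/*.pairs.gz; do
        [ -e "$f" ] || continue              # the glob matched nothing
        cell=$(basename "$f" .pairs.gz)      # strip directory and suffix
        total=$((total + 1))
        if [ ! -f "$result/mcool_folder/$cell.mcool" ]; then
            echo "no mcool for cell: $cell" >&2
            missing=$((missing + 1))
        fi
    done
    echo "cells=$total missing_mcool=$missing"
}

# Usage (placeholder path):
# summarize_cells /absolute/path/to/result/scSPRITE_test_tmp/Result
```

Cell names that contain square brackets, as in the scSPRITE example, are handled because every variable expansion is quoted.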