Preprocess the sc3DG data

Typical Workflow

1. scHi-C

Test data: /tutorial/scHi-C/data

stark count -o /absolute/path/to/result \
 -f /absolute/path/to/data/scHic_no \
 -t scHic \
 -e  MboI \
 -i /absolute/path/to/mm10/mm10.fa \
 --thread 60

Parameter Description:

  • -o: Location to save the result, note that all paths must be absolute paths.

  • -f: Directory where the sequencing data is located.

  • -t: Type of single-cell Hi-C.

  • -e: Type of restriction enzyme used, must be consistent with the experiment.

  • -r: Resolution used to convert pairs files into cool files.

  • -i: Path to the genome file used for alignment. The final component (mm10.fa in the command above) is the genome file itself, not part of the directory; it must be consistent with the -g parameter of STARK index.

  • -a: The software used for alignment; one of bwa, bowtie2, bismark, or minimap2. It must be consistent with the index produced by STARK index.
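Because all paths must be absolute, it can help to check them before launching a run. The following is a minimal sketch, not part of STARK: the function name require_absolute and the variables OUT_DIR, DATA_DIR, and GENOME_FA are hypothetical names introduced only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STARK): succeed only if every
# argument is an absolute path, i.e. starts with "/".
require_absolute() {
    local p
    for p in "$@"; do
        case "$p" in
            /*) ;;  # absolute path, nothing to do
            *)
                echo "not an absolute path: $p" >&2
                return 1
                ;;
        esac
    done
}

# Example usage (OUT_DIR, DATA_DIR and GENOME_FA are placeholders):
# require_absolute "$OUT_DIR" "$DATA_DIR" "$GENOME_FA" && \
#     stark count -o "$OUT_DIR" -f "$DATA_DIR" -t scHic -e MboI \
#         -i "$GENOME_FA" --thread 60
```

The check only looks at the leading "/"; it does not verify that the paths exist.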

2. scHi-C+

Test data: /tutorial/scHic_index/data

Note that this technique should produce four fastq files: _1 and _4 contain the reads, and _2 and _3 contain the corresponding barcodes.

stark count -o /absolute/path/to/result \
 -f /absolute/path/to/data/scHic_index \
 -t scHic \
 -e  MboI \
 -i /absolute/path/to/mm10/mm10.fa \
 --thread 60 \
 --exist-barcode

The only difference from scHi-C is the --exist-barcode parameter, which tells STARK that the data contain barcodes.
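The four-file requirement for scHi-C+ can be checked before running. This is a rough sketch under stated assumptions: the function name check_four_fastqs is hypothetical, and it assumes the files end in _1 to _4 followed by a fastq-style extension (.fastq, .fq, .fastq.gz, or .fq.gz), which may not match your naming scheme.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STARK): verify that a directory holds
# at least one fastq file for each of the suffixes _1, _2, _3 and _4.
# Assumption: file names look like <sample>_<n>.fastq[.gz] or <sample>_<n>.fq[.gz].
check_four_fastqs() {
    local dir=$1 n f found
    for n in 1 2 3 4; do
        found=0
        for f in "$dir"/*_"$n".f*q*; do
            if [ -e "$f" ]; then
                found=1
                break
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "missing *_${n} fastq file in $dir" >&2
            return 1
        fi
    done
}

# Example usage:
# check_four_fastqs /absolute/path/to/data/scHic_index
```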

3. Dip-C

Test data: /tutorial/Dip-C/data

stark count -o /absolute/path/to/result \
 -f /absolute/path/to/data/dipC \
 -t dipC \
 -e  MboI \
 -i /absolute/path/to/hg38/hg38.fa \
 --thread 60

As with scHi-C, STARK automatically runs the subsequent steps according to the -t parameter.

4. HiRES

Test data: /tutorial/HRIES/data

stark count -o /absolute/path/to/result \
 -f /absolute/path/to/data/HIRES \
 -t HIRES \
 -e  MboI \
 -i /absolute/path/to/mm10/mm10.fa \
 --thread 60

As with scHi-C, STARK automatically runs the subsequent steps according to the -t parameter.

5. sn-m3C

Test data: /tutorial/sn-m3C/data

stark count -o /absolute/path/to/result \
    -f /absolute/path/to/data/sn_m3c \
    -t sn_m3c \
    -e MboI \
    -r 10000 \
    -i /absolute/path/to/bowtie2/hg38/hg38.fa \
    --aligner bowtie2 \
    --thread 60

Because sn-m3C profiles methylation and Hi-C simultaneously, only bismark can be used for alignment, so there are no alternative choices for -i and -a. However, since bismark is based on bowtie2, you can specify bowtie2 for these parameters.

6. scSPRITE

Test data: /tutorial/scSPRITE/data

stark count -o /absolute/path/to/result \
     -f /absolute/path/to/data/scSPRITE \
     -t scSPRITE \
     -e HpyCH4V \
     -i /absolute/path/to/mm10/mm10.fa \
     --thread 60 \
     --repeat-masker /absolute/path/to/mm10_rmsk.bed \
     --exist-barcode

Parameter Description:

  • --sprite-config: A txt file that scSPRITE uses to generate barcodes.

  • --repeat-masker: A bed file for masking repetitive regions of the genome.

7. sciHi-C

Test data: /tutorial/sciHi-C/data. Each sequencing data directory must contain not only the two sequencing fastq files but also the barcode files inner_barcode.txt and outer_barcode.txt, formatted as shown in the example.

stark count -o /absolute/path/to/result \
 -f /absolute/path/to/data/sciHic \
 -t sciHic \
 -e  DpnII \
 -i /absolute/path/to/mm10/mm10.fa \
 --thread 60 \
 --exist-barcode

As with scHic_index, STARK automatically runs the subsequent steps according to the -t parameter.
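The sciHi-C input requirements described above can be verified with a short check. This is only a sketch: the function name check_scihic_inputs is hypothetical, and the fastq extension pattern is an assumption about how the files are named.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STARK): verify that a sciHi-C data
# directory contains both barcode files and exactly two fastq files.
# Assumption: fastq files end in .fastq, .fq, .fastq.gz or .fq.gz.
check_scihic_inputs() {
    local dir=$1 name f n_fastq=0
    for name in inner_barcode.txt outer_barcode.txt; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing $name in $dir" >&2
            return 1
        fi
    done
    for f in "$dir"/*.f*q*; do
        if [ -f "$f" ]; then
            n_fastq=$((n_fastq + 1))
        fi
    done
    if [ "$n_fastq" -ne 2 ]; then
        echo "expected 2 fastq files in $dir, found $n_fastq" >&2
        return 1
    fi
}

# Example usage:
# check_scihic_inputs /absolute/path/to/data/sciHic
```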

8. snHi-C

Test data: /tutorial/snHi-C/data

stark count -o /absolute/path/to/result \
 -f /absolute/path/to/data/snHic \
 -t snHic \
 -e  DpnII \
 -i /absolute/path/to/mm10/mm10.fa \
 --thread 60 \
 --exist-barcode

As with scHi-C, STARK automatically runs the subsequent steps according to the -t parameter.

9. snHi-C+

Test data: /tutorial/snHi-C+/data

stark count -o /absolute/path/to/result \
 -f /absolute/path/to/data/snHic_index \
 -t snHic \
 -e  MboI \
 -i /absolute/path/to/hg38/hg38.fa \
 --thread 60 \
 --exist-barcode

As with scHi-C, STARK automatically runs the subsequent steps according to the -t parameter.

10. scNanoHi-C

Test data: /tutorial/scNanoHi-C/data

It is worth noting that scNanoHi-C uses third-generation sequencing, and the sequencing data directory should contain the fastq file, TN5.txt, PCR.txt, and index.txt.

stark count -o /absolute/path/to/result \
 -f /absolute/path/to/data/scNano/data \
 -t scNano \
 -e  MboI \
 -i /absolute/path/to/mm10/mm10.fa \
 --thread 60 \
 --exist-barcode

Because scNanoHi-C uses third-generation sequencing, STARK aligns with minimap2 by default, so the -a parameter has no effect.
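The scNanoHi-C data directory is expected to hold a fastq file together with TN5.txt, PCR.txt, and index.txt. A minimal pre-flight check might look like the sketch below; the function name check_scnano_inputs is hypothetical, and the fastq extension pattern is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STARK): verify that a scNanoHi-C data
# directory contains TN5.txt, PCR.txt, index.txt and at least one fastq file.
# Assumption: fastq files end in .fastq, .fq, .fastq.gz or .fq.gz.
check_scnano_inputs() {
    local dir=$1 name f has_fastq=0
    for name in TN5.txt PCR.txt index.txt; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing $name in $dir" >&2
            return 1
        fi
    done
    for f in "$dir"/*.f*q*; do
        if [ -f "$f" ]; then
            has_fastq=1
            break
        fi
    done
    if [ "$has_fastq" -eq 0 ]; then
        echo "no fastq file found in $dir" >&2
        return 1
    fi
}

# Example usage:
# check_scnano_inputs /absolute/path/to/data/scNano/data
```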

11. scMethyl

Test data: /tutorial/scNanoHi-C/data

stark count -o /absolute/path/to/result \
 -f /absolute/path/to/data/scMeth \
 -t scMethyl \
 -e  DpnII \
 -i /absolute/path/to/bowtie2/mm10/mm10 \
 --aligner bowtie2 \
 --thread 60 \
 --exist-barcode

12. LiMAC

Test data: /tutorial/LiMAC/data

stark count -o /absolute/path/to/result \
 -f /absolute/path/to/data/LiMAC \
 -t LiMAC \
 -e  MboI \
 -i /absolute/path/to/hg38/hg38.fa \
 --thread 60

Parameter Description:

  • -o: Location to save the result, note that all paths must be absolute paths.

  • -f: Directory where the sequencing data is located.

  • -t: Type of single-cell Hi-C.

  • -e: Type of restriction enzyme used, must be consistent with the experiment.

  • -r: Resolution used to convert pairs files into cool files.

  • -i: Path to the genome file used for alignment. The final component (hg38.fa in the command above) is the genome file itself, not part of the directory; it must be consistent with the -g parameter of STARK index.

  • -a: The software used for alignment; one of bwa, bowtie2, bismark, or minimap2. It must be consistent with the index produced by STARK index.

13. GAGE-seq

Test data: /tutorial/GAGE-seq/data

stark count -o /absolute/path/to/result \
 -f /absolute/path/to/data/GAGE-seq \
 -t GAGE-seq \
 -e  MboI \
 -i /absolute/path/to/mm10/mm10.fa \
 --thread 60 \
 --exist-barcode

Parameter Description:

  • -o: Location to save the result, note that all paths must be absolute paths.

  • -f: Directory where the sequencing data is located.

  • -t: Type of single-cell Hi-C.

  • -e: Type of restriction enzyme used, must be consistent with the experiment.

  • -r: Resolution used to convert pairs files into cool files.

  • -i: Path to the genome file used for alignment. The final component (mm10.fa in the command above) is the genome file itself, not part of the directory; it must be consistent with the -g parameter of STARK index.

  • -a: The software used for alignment; one of bwa, bowtie2, bismark, or minimap2. It must be consistent with the index produced by STARK index.

14. Droplet Hi-C

Test data: /tutorial/Droplet/data

Please make sure bowtie is installed before running:

conda install bioconda::bowtie==1.3.1

After that, build the bowtie index for the 10x barcode reference:

bowtie-build /path/to/10x/barcode/reference/ref.fa /path/to/10x/bowtie/index

Then you can run the following command to process the droplet data.

stark count -t droplet \
        --ref-10x /path/to/10x/bowtie/index \
        -f /absolute/path/to/data/droplet \
        -i /absolute/path/to/mm10/mm10.fa \
        -e MboI \
        -o /absolute/path/to/result \
        --exist-barcode \
        --thread 32

Parameter Description:

  • -o: Location to save the result, note that all paths must be absolute paths.

  • -f: Directory where the sequencing data is located.

  • -t: Type of single-cell Hi-C.

  • --ref-10x: Directory of the bowtie index for 10x barcode reference.

  • -i: bwa index of the genome file used for alignment.
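Before passing the index to --ref-10x, you may want to confirm that bowtie-build finished and wrote a complete index. The sketch below is not part of STARK or bowtie; the function name bowtie_index_exists is hypothetical. It relies on the fact that bowtie (version 1) writes six files per index prefix, with the extension .ebwt (or .ebwtl for large indexes).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STARK or bowtie): check that a complete
# bowtie (v1) index exists for the given prefix. bowtie-build writes
# <prefix>.{1,2,3,4}.ebwt and <prefix>.rev.{1,2}.ebwt (".ebwtl" for large indexes).
bowtie_index_exists() {
    local prefix=$1 part
    for part in 1 2 3 4 rev.1 rev.2; do
        if [ ! -f "$prefix.$part.ebwt" ] && [ ! -f "$prefix.$part.ebwtl" ]; then
            echo "missing bowtie index file: $prefix.$part.ebwt" >&2
            return 1
        fi
    done
}

# Example usage, with the prefix given to bowtie-build above:
# bowtie_index_exists /path/to/10x/bowtie/index
```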

15. Paired

Test data: /tutorial/Paired/data

Please make sure bowtie is installed before running:

conda install bioconda::bowtie==1.3.1

After that, build the bowtie index for the 10x barcode reference:

bowtie-build /path/to/10x/barcode/reference/ref.fa /path/to/10x/bowtie/index

Then you can run the following command to process the Paired data.

stark count -t Paired \
        --ref-10x /path/to/10x/bowtie/index \
        -f /absolute/path/to/data/Paired \
        -i /absolute/path/to/mm10/mm10.fa \
        -e MboI \
        -o /absolute/path/to/result \
        --exist-barcode \
        --thread 32

Parameter Description:

  • -o: Location to save the result, note that all paths must be absolute paths.

  • -f: Directory where the sequencing data is located.

  • -t: Type of single-cell Hi-C.

  • --ref-10x: Directory of the bowtie index for 10x barcode reference.

  • -i: bwa index of the genome file used for alignment.

Explanation of the Results

The output directory is organized as follows:

scSPRITE_test_tmp/
    ├── Result
    │   ├── cool_folder
    │   │   ├── [Even2Bo10][Odd2Bo69][DPM6bot1]_10000.cool
    │   │   ├── [Even2Bo10][Odd2Bo69][DPM6bot1]10000.KR.cool
    │   │   ├── [Even2Bo11][Odd2Bo19][DPM6bot31]_10000.cool
    │   │   ├── [Even2Bo11][Odd2Bo19][DPM6bot31]10000.KR.cool
    │   │   ├── [Even2Bo11][Odd2Bo1][DPM6bot75]_10000.cool
    │   │   ├── [Even2Bo11][Odd2Bo1][DPM6bot75]10000.KR.cool
    │   ├── mcool_folder
    │   │   ├── [Even2Bo10][Odd2Bo69][DPM6bot1].mcool
    │   │   ├── [Even2Bo11][Odd2Bo19][DPM6bot31].mcool
    │   │   ├── [Even2Bo11][Odd2Bo1][DPM6bot75].mcool
    │   └── SCpair
    │       ├── [Even2Bo10][Odd2Bo69][DPM6bot1].pairs.gz
    │       ├── [Even2Bo11][Odd2Bo19][DPM6bot31].pairs.gz
    │       ├── [Even2Bo11][Odd2Bo1][DPM6bot75].pairs.gz
    ├── test.bam
    ├── test_logging.log
    └── trimmed.fastp.json

scSPRITE_test_tmp is the root directory of the output, where ‘test’ is the name of the sample.

Result is the directory in which the main results are saved.

cool_folder is the directory that stores all cells’ cool files, before and after KR correction.

mcool_folder is the directory that stores all cells’ mcool files.

SCpair is the directory that stores all cells’ pairs files.

test.bam is the bam file of the sequencing data.

test_logging.log records all the parameters STARK used to process the data, as well as the time spent on each step.

trimmed.fastp.json records the result of fastp processing the data.
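As a quick sanity check on the output, you can count the contacts recorded for each cell in SCpair. The sketch below is not part of STARK; the function name count_contacts_per_cell is hypothetical. It assumes the files follow the standard pairs format, in which header lines start with "#" and every other line is one contact.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STARK): print "<cell>\t<contacts>" for
# every *.pairs.gz file in a directory. Lines starting with "#" are treated
# as pairs-format header lines and are not counted.
count_contacts_per_cell() {
    local scpair_dir=$1 f cell n
    for f in "$scpair_dir"/*.pairs.gz; do
        [ -e "$f" ] || continue  # skip the unexpanded glob if no file matches
        cell=$(basename "$f" .pairs.gz)
        n=$(gzip -dc "$f" | awk '!/^#/ { n++ } END { print n + 0 }')
        printf '%s\t%s\n' "$cell" "$n"
    done
}

# Example usage:
# count_contacts_per_cell /absolute/path/to/result/scSPRITE_test_tmp/Result/SCpair
```

Cells with very few contacts stand out immediately in this listing and can be inspected or filtered before downstream analysis.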