Preprocessing sc3DG Data
========================

Navigation
--------------------------------------
* :ref:`scHic`
* :ref:`scHi-C+`
* :ref:`Dip-C`
* :ref:`HiRES`
* :ref:`sn-m3C`
* :ref:`scSPRITE`
* :ref:`sciHi-C`
* :ref:`snHi-C`
* :ref:`scNanoHi-C`
* :ref:`scMethyl`
* :ref:`LiMAC`
* :ref:`GAGE-seq`
* :ref:`Droplet`
* :ref:`Paired`


Typical Workflow
----------------

.. _scHic:

1. scHi-C
~~~~~~~~~

Test data: /tutorial/scHi-C/data

.. code-block:: bash

    stark count -o /absolute/path/to/result \
     -f /absolute/path/to/data/scHic_no \
     -t scHic \
     -e  MboI \
     -i /absolute/path/to/mm10/mm10.fa \
     --thread 60

Parameter Description:

- **-o**: Directory in which to save the results. Note that **all paths must be absolute paths.**
- **-f**: Directory containing the sequencing data.
- **-t**: Type of single-cell Hi-C technique.
- **-e**: Restriction enzyme used; it must match the enzyme used in the experiment.
- **-r**: Resolution used when converting pairs files into cool files.
- **-i**: Path to the genome used for alignment. The final component (for example hg38.fa) names the genome and is not part of the directory; it must be consistent with the -g parameter of STARK index.
- **-a**: Aligner to use: bwa, bowtie2, bismark, or minimap2. It must be consistent with the index produced by STARK index.
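
Because every path must be absolute, it can help to normalize relative paths before passing them to ``-o``, ``-f``, or ``-i``. The following is a minimal sketch; the helper name ``abs_path`` is hypothetical and not part of STARK.

.. code-block:: bash

    # Hypothetical helper, not part of STARK: print an absolute version of a path.
    abs_path() {
        case "$1" in
            /*) printf '%s\n' "$1" ;;                 # already absolute
            *)  printf '%s/%s\n' "${PWD%/}" "$1" ;;   # prefix the current directory
        esac
    }

    # Example: stark count -o "$(abs_path result)" ...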

.. _scHi-C+:

2. scHi-C+
~~~~~~~~~~~~~~~

Test data: /tutorial/scHic_index/data

This technique produces four fastq files: _1 and _4 contain the reads, and _2 and _3 contain the corresponding barcodes.
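
The four-file layout can be checked before a run. This is only a sketch: the helper name ``has_four_fastq`` is hypothetical, and it assumes the files are named ``<sample>_<N>.fastq`` or ``<sample>_<N>.fastq.gz``, which may differ from your data.

.. code-block:: bash

    # Hypothetical helper: succeed only if DIR holds a file for each suffix _1 to _4.
    # Assumed naming: <sample>_<N>.fastq or <sample>_<N>.fastq.gz
    has_four_fastq() {
        local dir="$1" n
        for n in 1 2 3 4; do
            # ls fails when no file matches the pattern for this suffix
            ls "$dir"/*_"$n".fastq* >/dev/null 2>&1 || return 1
        done
    }

    # Example: has_four_fastq /absolute/path/to/data/scHic_index || echo "incomplete"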

.. code-block:: bash

    stark count -o /absolute/path/to/result \
     -f /absolute/path/to/data/scHic_index \
     -t scHic \
     -e  MboI \
     -i /absolute/path/to/mm10/mm10.fa \
     --thread 60 \
     --exist-barcode

The only difference from scHi-C is that the --exist-barcode parameter must be given to indicate that the data contain barcodes.
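
Since the two invocations differ by a single flag, a small wrapper can build either one. The sketch below only prints the command rather than running it; the function name ``stark_cmd`` and its ``yes``/``no`` argument are illustrative assumptions, while the flags are the ones used in the commands above.

.. code-block:: bash

    # Hypothetical wrapper: print a stark count command for scHi-C or scHi-C+.
    # Arguments: OUTPUT_DIR DATA_DIR GENOME [yes|no]  (yes appends --exist-barcode)
    stark_cmd() {
        local out="$1" data="$2" genome="$3" barcode="${4:-no}"
        # Rebuild the positional parameters as the command line
        set -- stark count -o "$out" -f "$data" -t scHic -e MboI -i "$genome" --thread 60
        if [ "$barcode" = "yes" ]; then
            set -- "$@" --exist-barcode
        fi
        printf '%s\n' "$*"
    }

    # Example: stark_cmd /abs/result /abs/data/scHic_index /abs/mm10/mm10.fa yes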

.. _Dip-C:

3. Dip-C
~~~~~~~~

Test data: /tutorial/Dip-C/data

.. code-block:: bash

    stark count -o /absolute/path/to/result \
     -f /absolute/path/to/data/dipC \
     -t dipC \
     -e  MboI \
     -i /absolute/path/to/hg38/hg38.fa \
     --thread 60

As with scHi-C, STARK automatically carries out the subsequent steps according to the -t parameter.

.. _HiRES:

4. HiRES
~~~~~~~~

Test data: /tutorial/HRIES/data

.. code-block:: bash

    stark count -o /absolute/path/to/result \
     -f /absolute/path/to/data/HIRES \
     -t HIRES \
     -e  MboI \
     -i /absolute/path/to/mm10/mm10.fa \
     --thread 60

As with scHi-C, STARK automatically carries out the subsequent steps according to the -t parameter.


.. _sn-m3C:

5. sn-m3C
~~~~~~~~~

Test data: /tutorial/sn-m3C/data

.. code-block:: bash


    stark count -o /absolute/path/to/result \
        -f /absolute/path/to/data/sn_m3c \
        -t sn_m3c \
        -e MboI \
        -r 10000 \
        -i /absolute/path/to/bowtie2/hg38/hg38.fa \
        --aligner bowtie2 \
        --thread 60


Because sn-m3C profiles methylation and Hi-C simultaneously, only bismark can be used for alignment, so there are no alternatives to choose from for -i and -a. However, since bismark is built on bowtie2, you can specify bowtie2 for these parameters.

.. _scSPRITE:

6. scSPRITE
~~~~~~~~~~~

Test data: /tutorial/scSPRITE/data

.. code-block:: bash

    stark count -o /absolute/path/to/result \
         -f /absolute/path/to/data/scSPRITE \
         -t scSPRITE \
         -e HpyCH4V \
         -i /absolute/path/to/mm10/mm10.fa \
         --thread 60 \
         --repeat-masker /absolute/path/to/mm10_rmsk.bed \
         --exist-barcode


Parameter Description:

- **--sprite-config**: A txt file used to generate the scSPRITE barcodes.
- **--repeat-masker**: A bed file used to mask repetitive regions of the genome.

.. _sciHi-C:

7. sciHi-C
~~~~~~~~~~

Test data: /tutorial/sciHi-C/data

Each sequencing data directory must contain, in addition to the two fastq files, the barcode files inner_barcode.txt and outer_barcode.txt, formatted as in the example data.
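
A quick pre-flight check for the barcode files might look like the sketch below. The helper name ``missing_files`` is hypothetical and not part of STARK; it only tests that the named files exist and does not validate their format.

.. code-block:: bash

    # Hypothetical helper: print each required file that is absent from DIR.
    # Usage: missing_files DIR FILE...
    missing_files() {
        local dir="$1" f
        shift
        for f in "$@"; do
            [ -e "$dir/$f" ] || printf '%s\n' "$f"
        done
    }

    # Example for one sciHi-C directory (prints nothing when both files exist):
    # missing_files /absolute/path/to/data/sciHic inner_barcode.txt outer_barcode.txt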

.. code-block:: bash

    stark count -o /absolute/path/to/result \
     -f /absolute/path/to/data/sciHic \
     -t sciHic \
     -e  DpnII \
     -i /absolute/path/to/mm10/mm10.fa \
     --thread 60 \
     --exist-barcode

As with scHi-C+, STARK automatically carries out the subsequent processing according to the -t parameter.

.. _snHi-C:

8. snHi-C
~~~~~~~~~

Test data: /tutorial/snHi-C/data

.. code-block:: bash

    stark count -o /absolute/path/to/result \
     -f /absolute/path/to/data/snHic \
     -t snHic \
     -e  DpnII \
     -i /absolute/path/to/mm10/mm10.fa \
     --thread 60 \
     --exist-barcode

As with scHi-C, STARK automatically carries out the subsequent steps according to the -t parameter.

9. snHi-C+
~~~~~~~~~~

Test data: /tutorial/snHi-C+/data

.. code-block:: bash

    stark count -o /absolute/path/to/result \
     -f /absolute/path/to/data/snHic_index \
     -t snHic \
     -e  MboI \
     -i /absolute/path/to/hg38/hg38.fa \
     --thread 60 \
     --exist-barcode

As with scHi-C, STARK automatically carries out the subsequent steps according to the -t parameter.

.. _scNanoHi-C:

10. scNanoHi-C
~~~~~~~~~~~~~~

Test data: /tutorial/scNanoHi-C/data

scNanoHi-C uses third-generation sequencing. The sequencing data directory must contain the fastq file together with TN5.txt, PCR.txt, and index.txt.

.. code-block:: bash

    stark count -o /absolute/path/to/result \
     -f /absolute/path/to/data/scNano/data \
     -t scNano \
     -e  MboI \
     -i /absolute/path/to/mm10/mm10.fa \
     --thread 60 \
     --exist-barcode

Because scNanoHi-C uses third-generation sequencing, STARK aligns the reads with minimap2 by default, so the -a parameter has no effect.

.. _scMethyl:

11. scMethyl
~~~~~~~~~~~~

Test data: /tutorial/scNanoHi-C/data

.. code-block:: bash

    stark count -o /absolute/path/to/result \
     -f /absolute/path/to/data/scMeth \
     -t scMethyl \
     -e  DpnII \
     -i /absolute/path/to/bowtie2/mm10/mm10 \
     --aligner bowtie2 \
     --thread 60 \
     --exist-barcode

.. _LiMAC:

12. LiMAC
~~~~~~~~~

Test data: /tutorial/LiMAC/data

.. code-block:: bash

    stark count -o /absolute/path/to/result \
     -f /absolute/path/to/data/LiMAC \
     -t LiMAC \
     -e  MboI \
     -i /absolute/path/to/hg38/hg38.fa \
     --thread 60

Parameter Description:

- **-o**: Directory in which to save the results. Note that **all paths must be absolute paths.**
- **-f**: Directory containing the sequencing data.
- **-t**: Type of single-cell Hi-C technique.
- **-e**: Restriction enzyme used; it must match the enzyme used in the experiment.
- **-r**: Resolution used when converting pairs files into cool files.
- **-i**: Path to the genome used for alignment. The final component (for example hg38.fa) names the genome and is not part of the directory; it must be consistent with the -g parameter of STARK index.
- **-a**: Aligner to use: bwa, bowtie2, bismark, or minimap2. It must be consistent with the index produced by STARK index.

.. _GAGE-seq:

13. GAGE-seq
~~~~~~~~~~~~

Test data: /tutorial/GAGE-seq/data

.. code-block:: bash

    stark count -o /absolute/path/to/result \
     -f /absolute/path/to/data/GAGE-seq \
     -t GAGE-seq \
     -e  MboI \
     -i /absolute/path/to/mm10/mm10.fa \
     --thread 60 \
     --exist-barcode

Parameter Description:

- **-o**: Directory in which to save the results. Note that **all paths must be absolute paths.**
- **-f**: Directory containing the sequencing data.
- **-t**: Type of single-cell Hi-C technique.
- **-e**: Restriction enzyme used; it must match the enzyme used in the experiment.
- **-r**: Resolution used when converting pairs files into cool files.
- **-i**: Path to the genome used for alignment. The final component (for example hg38.fa) names the genome and is not part of the directory; it must be consistent with the -g parameter of STARK index.
- **-a**: Aligner to use: bwa, bowtie2, bismark, or minimap2. It must be consistent with the index produced by STARK index.


.. _Droplet:

14. Droplet Hi-C
~~~~~~~~~~~~~~~~

Test data: /tutorial/Droplet/data

Please make sure that bowtie is installed before running:


.. code-block:: bash

    conda install bioconda::bowtie==1.3.1

After that, build the bowtie index for the 10x barcode reference:

.. code-block:: bash

    bowtie-build /path/to/10x/barcode/reference/ref.fa /path/to/10x/bowtie/index
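
Before moving on, it may be worth confirming that the index was actually written. bowtie-build names its output files ``<prefix>.1.ebwt`` and so on (``.ebwtl`` for large indexes), so a minimal check, with the hypothetical helper name ``bowtie_index_exists``, could be sketched as:

.. code-block:: bash

    # Hypothetical helper: succeed if a bowtie (version 1) index exists at PREFIX.
    # Only the first index file is tested; this does not validate its contents.
    bowtie_index_exists() {
        [ -f "$1.1.ebwt" ] || [ -f "$1.1.ebwtl" ]
    }

    # Example: bowtie_index_exists /path/to/10x/bowtie/index || echo "index not found"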

Then you can run the following command to process the droplet data.


.. code-block:: bash

    stark count -t droplet \
            --ref-10x /path/to/10x/bowtie/index \
            -f /absolute/path/to/data/droplet \
            -i /absolute/path/to/mm10/mm10.fa \
            -e MboI \
            -o /absolute/path/to/result \
            --exist-barcode \
            --thread 32

Parameter Description:

- **-o**: Directory in which to save the results. Note that **all paths must be absolute paths.**
- **-f**: Directory containing the sequencing data.
- **-t**: Type of single-cell Hi-C technique.
- **--ref-10x**: Path to the bowtie index built for the 10x barcode reference.
- **-i**: bwa index of the genome used for alignment.


.. _Paired:

15. Paired
~~~~~~~~~~

Test data: /tutorial/Paired/data

Please make sure that bowtie is installed before running:


.. code-block:: bash

    conda install bioconda::bowtie==1.3.1

After that, build the bowtie index for the 10x barcode reference:

.. code-block:: bash

    bowtie-build /path/to/10x/barcode/reference/ref.fa /path/to/10x/bowtie/index

Then you can run the following command to process the Paired data.


.. code-block:: bash

    stark count -t Paired \
            --ref-10x /path/to/10x/bowtie/index \
            -f /absolute/path/to/data/droplet \
            -i /absolute/path/to/mm10/mm10.fa \
            -e MboI \
            -o /absolute/path/to/result \
            --exist-barcode \
            --thread 32

Parameter Description:

- **-o**: Directory in which to save the results. Note that **all paths must be absolute paths.**
- **-f**: Directory containing the sequencing data.
- **-t**: Type of single-cell Hi-C technique.
- **--ref-10x**: Path to the bowtie index built for the 10x barcode reference.
- **-i**: bwa index of the genome used for alignment.


Explanation of the Results
---------------------------------

The output directory is organized as follows:

.. code-block:: bash

    scSPRITE_test_tmp/
        ├── Result
        │   ├── cool_folder
        │   │   ├── [Even2Bo10][Odd2Bo69][DPM6bot1]_10000.cool
        │   │   ├── [Even2Bo10][Odd2Bo69][DPM6bot1]10000.KR.cool
        │   │   ├── [Even2Bo11][Odd2Bo19][DPM6bot31]_10000.cool
        │   │   ├── [Even2Bo11][Odd2Bo19][DPM6bot31]10000.KR.cool
        │   │   ├── [Even2Bo11][Odd2Bo1][DPM6bot75]_10000.cool
        │   │   └── [Even2Bo11][Odd2Bo1][DPM6bot75]10000.KR.cool
        │   ├── mcool_folder
        │   │   ├── [Even2Bo10][Odd2Bo69][DPM6bot1].mcool
        │   │   ├── [Even2Bo11][Odd2Bo19][DPM6bot31].mcool
        │   │   └── [Even2Bo11][Odd2Bo1][DPM6bot75].mcool
        │   └── SCpair
        │       ├── [Even2Bo10][Odd2Bo69][DPM6bot1].pairs.gz
        │       ├── [Even2Bo11][Odd2Bo19][DPM6bot31].pairs.gz
        │       └── [Even2Bo11][Odd2Bo1][DPM6bot75].pairs.gz
        ├── test.bam
        ├── test_logging.log
        └── trimmed.fastp.json
        
**scSPRITE_test_tmp** is the root directory of the output, where ‘test’ is the name of the sample.

**Result** is the directory in which the main results are saved.

**cool_folder** is the directory that stores every cell’s cool files, before and after **KR** correction.

**mcool_folder** is the directory that stores every cell’s mcool files.

**SCpair** is the directory that stores every cell’s pairs files.

**test.bam** is the bam file of the sequencing data.

**test_logging.log** records all the parameters STARK used to process the data, as well as the time spent on each step.

**trimmed.fastp.json** records the results of the fastp processing step.
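
Given this layout, the number of cells recovered can be estimated by counting the pairs files. This is a minimal sketch that assumes one ``.pairs.gz`` file per cell under ``Result/SCpair``, as in the tree above; the helper name ``count_cells`` is hypothetical.

.. code-block:: bash

    # Hypothetical helper: count the per-cell pairs files in a STARK output directory.
    # Usage: count_cells /absolute/path/to/scSPRITE_test_tmp
    count_cells() {
        # tr strips the padding that some wc implementations add
        find "$1/Result/SCpair" -type f -name '*.pairs.gz' | wc -l | tr -d ' '
    }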