Peak calling

In this step, we will use the Bowtie alignment files to perform peak calling. Peak-calling programs identify a set of read-enriched regions in the genome that represent binding sites for your protein of interest. We will use the MACS2 (Model-based Analysis of ChIP-Seq) app in the CyVerse Discovery Environment (DE) for peak identification.


Input Data:

Input                 Description       Location
Bowtie output files   Alignment files   iplantcollaborative > example_data > chipseq_webinar > bowtie_output

Run MACS2 in the CyVerse Discovery Environment

  1. Click on the “Apps” tab in the Discovery Environment and search for “macs2”.
  2. Click on the app icon.

  3. Change the name of the analysis and the output folder as needed, or leave the defaults.

  4. In the Callpeaks input section, browse to the treatment and control files in the Data Store. Enter ‘ecoli’ as the experiment name.

    Example treatment file: iplantcollaborative > example_data > chipseq_webinar > bowtie_output > bowtieout_chipIP.sam

    Example control file: iplantcollaborative > example_data > chipseq_webinar > bowtie_output > bowtieout_input.sam

  5. Provide the mappable genome size: ‘gsize 4639675’ for the E. coli genome. Leave the remaining parameters at their defaults. In the next section, “Resource Requirements”, request resources as needed or leave the defaults. Click Launch Analysis.

  6. Click on Analyses to check the status of your job. When the analysis completes, click the three-dot menu on the right and select ‘Go to output folder’ to access your output files.
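
For readers who want to reproduce the run outside the Discovery Environment, the settings above correspond roughly to a `macs2 callpeak` command line. The sketch below only assembles and prints such a command; it does not run MACS2. The tutorial does not show the exact invocation the app uses, so the flag selection here is an assumption, as are the local file names (copies of the example files) and the helper name `build_macs2_command`.

```python
import shlex


def build_macs2_command(treatment, control, name, genome_size, fmt="SAM"):
    """Return the argument list for a `macs2 callpeak` run (illustrative helper)."""
    return [
        "macs2", "callpeak",
        "-t", treatment,         # ChIP (treatment) alignment file
        "-c", control,           # input (control) alignment file
        "-f", fmt,               # alignment file format
        "-g", str(genome_size),  # mappable genome size
        "-n", name,              # experiment name, used as the output prefix
    ]


# Assumed local copies of the example files from the tutorial.
command = build_macs2_command(
    treatment="bowtieout_chipIP.sam",
    control="bowtieout_input.sam",
    name="ecoli",
    genome_size=4639675,
)
print(shlex.join(command))
```

Building the command as a list rather than a single string keeps each argument separate, which makes it safe to pass to `subprocess.run` on a machine where MACS2 is installed.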

Should you discard duplicate reads before peak identification?

Best practice is to remove duplicates prior to peak calling. By default, MACS2 keeps a single read at each location, but it provides several options for handling duplicates (the --keep-dup parameter). Bona fide peaks have multiple overlapping reads with offsets, whereas regions supported only by PCR duplicates show reads stacked perfectly with no offsets. Duplicates can arise from experimental artifacts, but they can also contribute to true ChIP signal.

Note

The bad kind of duplicates: If the initial starting material is low, it can be overamplified before sequencing. Any bias in PCR compounds this problem and can lead to artificially enriched regions. Blacklisted (repeat) regions with ultra-high signal will also be high in duplicates. Masking these regions prior to analysis can help remove this problem.

The good kind of duplicates: You can expect some biological duplicates with ChIP-seq, since you are only sequencing a small part of the genome. This number can increase if your depth of coverage is excessive or if your protein binds to only a few sites. If a substantial proportion of the duplicates are biological, removing them can lead to an underestimation of the ChIP signal. (Credits: HBC ChIP-seq workshop for summarizing this info.)
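
To make the idea of "keeping a single read at each location" concrete, here is a minimal sketch of duplicate filtering. It is a simplified illustration, not MACS2's actual implementation: reads are assumed to be reduced to (chromosome, 5' position, strand) tuples, the function name `filter_duplicates` is hypothetical, and the read coordinates are made up.

```python
from collections import defaultdict


def filter_duplicates(reads, keep_dup=1):
    """Keep at most `keep_dup` reads per (chromosome, 5' position, strand).

    `reads` is an iterable of (chrom, five_prime_pos, strand) tuples.
    """
    seen = defaultdict(int)
    kept = []
    for read in reads:
        if seen[read] < keep_dup:
            kept.append(read)
        seen[read] += 1
    return kept


# Hypothetical reads: three stack perfectly at position 100 (duplicate-like),
# three overlap with offsets (peak-like).
reads = [
    ("chr", 100, "+"), ("chr", 100, "+"), ("chr", 100, "+"),
    ("chr", 250, "+"), ("chr", 262, "+"), ("chr", 275, "+"),
]

deduplicated = filter_duplicates(reads, keep_dup=1)
print(len(reads), "reads before filtering,", len(deduplicated), "after")
```

With `keep_dup=1`, the perfectly stacked reads collapse to one while the offset reads are all retained, which is why removing duplicates penalizes PCR artifacts more than genuine enrichment.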

Output/Results

Output                  Description                                        Example
NAME_peaks.narrowPeak   Peak locations and associated information          ecoli_peaks.narrowPeak
NAME_peaks.xls          Tabular file with information about called peaks   ecoli_peaks.xls
NAME_summits.bed        Summit location of every peak                      ecoli_summits.bed

Description of output and results

We will use the ecoli_peaks.narrowPeak output file for further analysis. For more information on MACS2 parameters and output files, see the MACS2 README on GitHub: https://github.com/taoliu/MACS.

Brief description of the narrowPeak output file format (BED6+4):

Col1: Name of the chromosome

Col2: Peak start position

Col3: Peak end position

Col4: Peak name

Col5: Peak score

Col6: Strand

Col7: Fold enrichment at the peak summit

Col8: -log10(p-value)

Col9: -log10(q-value)

Col10: Position of the peak summit relative to the peak start
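
As an illustration of how these ten columns can be read programmatically, the sketch below parses one tab-separated narrowPeak line into named fields. Following the MACS2 documentation, columns 8 and 9 are treated as -log10 values and column 10 as the summit offset from the peak start. The class and function names are hypothetical, and the example line contains made-up values, not real output from this analysis.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str
    fold_enrichment: float
    neg_log10_pvalue: float
    neg_log10_qvalue: float
    summit_offset: int  # summit position relative to `start`


def parse_narrowpeak_line(line):
    """Parse one tab-separated narrowPeak line into a NarrowPeak record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    return NarrowPeak(
        fields[0], int(fields[1]), int(fields[2]), fields[3], int(fields[4]),
        fields[5], float(fields[6]), float(fields[7]), float(fields[8]),
        int(fields[9]),
    )


# Made-up example line, for illustration only.
example = "chr\t1000\t1500\tecoli_peak_1\t250\t.\t12.5\t30.1\t25.0\t220\n"
peak = parse_narrowpeak_line(example)
summit_position = peak.start + peak.summit_offset
print(peak.name, "width:", peak.end - peak.start, "summit at:", summit_position)
```

The same function can be applied to each line of ecoli_peaks.narrowPeak, for example to filter peaks by fold enrichment before downstream analysis.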

For more information about BED format, check here

Note

The MACS2 mfold parameter specifies the range of high-confidence enrichment ratios against the background that is used to select regions for building the model. The default value of 10,30 means that the model is built from regions whose read counts are 10- to 30-fold above the background. Note that the MACS2 command-line documentation lists a default of 5 50, so check the value shown in the app. As an exercise, change the mfold range to 5,30 and check the effect on the number of resulting peaks.
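
The effect of the mfold range on which regions are eligible for model building can be sketched as a simple filter. This is a deliberately simplified illustration: the real MACS2 model-building step does more than this (it pairs enriched regions on opposite strands to estimate fragment size), the inclusive bounds are an assumption, the function name is hypothetical, and the fold-enrichment values are invented.

```python
def regions_for_model(fold_enrichments, mfold=(10, 30)):
    """Return the fold-enrichment values that fall inside the mfold range."""
    lower, upper = mfold
    return [fe for fe in fold_enrichments if lower <= fe <= upper]


# Invented fold-enrichment values for candidate regions.
candidate_folds = [3.2, 6.0, 8.5, 12.0, 18.4, 29.0, 45.0]

default_regions = regions_for_model(candidate_folds, mfold=(10, 30))
wider_regions = regions_for_model(candidate_folds, mfold=(5, 30))
print(len(default_regions), "regions with 10,30;", len(wider_regions), "with 5,30")
```

Lowering the lower bound admits more, weaker regions into the model, which is useful when too few regions qualify but can make the model less reliable.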

