Some questions/problems concerning the program pipeline #1

Open · ulah opened this issue Nov 25, 2016 · 6 comments

ulah commented Nov 25, 2016

Hi eyay,

there are some things in the posted pipeline that I do not understand or that do not work for me, and it would be great if you could tell me what I might be doing wrong.

  1. Prerequisite "GenomeFetch.py": Looking briefly at the script, it does not seem to support hg38, which you used for the mapping. What do you need the script for?

  2. Removing UMIs from reads: What is the difference between your script (remove_umi_from_fastq_v4.py) and the one from the umi_tools package (umi_tools extract)? The parallelization doesn't give me any speed advantage on the data. Additionally, your output is an uncompressed fastq; why not compress it, which umi_tools extract supports?

  3. It seems that you also use multi-mapped reads in later steps of the analysis, don't you? If so, why can you use those, and why didn't you change STAR's outFilterMultimapNmax parameter, given that everything with more than 20 hits will be flagged as unmapped?

3-4) STAR mapping and clipped-read removal: I'm missing a lot of file-name conversions here. The STAR command creates generic SAM output files, but the clipped-read removal step requires BAM files split into unique and multi-mappings with a very rigid naming convention. Although I could create files that satisfy the name requirements (I split them based on the NH tag, then compress and sort them), I now get an error I do not understand:

```
python /usr/local/smallseq/src/remove_softclipped_reads_parallelized_v2.py -i 01_Mapping -o 01_Mapping_clipped -p 6 -n 0 -t 3
Traceback (most recent call last):
File "/usr/local/smallseq/src/remove_softclipped_reads_parallelized_v2.py", line 93, in
Parallel(n_jobs=o.numCPU)(delayed(remove_clipped_reads)(sample) for sample in samplenames_with_fullpath)
File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 768, in call
self.retrieve()
File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 719, in retrieve
raise exception
joblib.my_exceptions.JoblibSamtoolsError: JoblibSamtoolsError


Multiprocessing exception:
...........................................................................
/usr/local/smallseq/src/remove_softclipped_reads_parallelized_v2.py in <module>()
88 samplenames_with_fullpath.append(multi_bam)
89
90 path_outbam = os.path.join(o.outstardir, sample)
91 if not os.path.exists(path_outbam): safe_mkdir(path_outbam)
92
---> 93 Parallel(n_jobs=o.numCPU)(delayed(remove_clipped_reads)(sample) for sample in samplenames_with_fullpath)
94
95
96
97

...........................................................................
/usr/local/lib/python2.7/dist-packages/joblib/parallel.py in __call__(self=Parallel(n_jobs=6), iterable=<generator object >)
763 if pre_dispatch == "all" or n_jobs == 1:
764 # The iterable was consumed all at once by the above for loop.
765 # No need to wait for async callbacks to trigger to
766 # consumption.
767 self._iterating = False
--> 768 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=6)>
769 # Make sure that we get a last message telling us we are done
770 elapsed_time = time.time() - self._start_time
771 self._print('Done %3i out of %3i | elapsed: %s finished',
772 (len(self._output), len(self._output),


Sub-process traceback:

SamtoolsError Fri Nov 25 10:47:38 2016
PID: 12961 Python 2.7.6: /usr/bin/python
...........................................................................
/usr/local/lib/python2.7/dist-packages/joblib/parallel.py in call(self=<joblib.parallel.BatchedCalls object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = ('01_Mapping/SRR3495785/SRR3495785_unique.bam',)
kwargs = {}
self.items = [(, ('01_Mapping/SRR3495785/SRR3495785_unique.bam',), {})]
132
133 def __len__(self):
134 return self._size
135

...........................................................................
/usr/local/smallseq/src/remove_softclipped_reads_parallelized_v2.py in remove_clipped_reads(inbam='01_Mapping/SRR3495785/SRR3495785_unique.bam')
56
57 outbam.write(read)
58 reads_after_clip_removal += 1
59 outbam.close()
60 #sort and index the final bam file
---> 61 pysam.sort(bam_out, bam_out_sorted)
bam_out = '01_Mapping_clipped/SRR3495785/SRR3495785_unique_tmp.bam'
bam_out_sorted = '01_Mapping_clipped/SRR3495785/SRR3495785_unique'
62 pysam.index(bam_out_sorted+".bam", template=inbamPysamObj)
63
64 print reads_after_clip_removal, total_reads
65 print "%s percentage of removed reads = %.2f" %(p[-1], 100*(1 -reads_after_clip_removal/total_reads))

...........................................................................
/usr/local/lib/python2.7/dist-packages/pysam/utils.py in __call__(self=<pysam.utils.PysamDispatcher object>, *args=('01_Mapping_clipped/SRR3495785/SRR3495785_unique_tmp.bam', '01_Mapping_clipped/SRR3495785/SRR3495785_unique'), **kwargs={})
70 "%s returned with error %i: "
71 "stdout=%s, stderr=%s" %
72 (self.collection,
73 retval,
74 stdout,
---> 75 stderr))
stderr = '[bam_sort] Use -T PREFIX / -o FILE to specify te... Reference sequence FASTA FILE [null]\n'
76
77 self.stderr = stderr
78
79 # call parser for stdout:

SamtoolsError: 'samtools returned with error 1: stdout=, stderr=[bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files\nUsage: samtools sort [options...] [in.bam]\nOptions:\n -l INT Set compression level, from 0 (uncompressed) to 9 (best)\n -m INT Set maximum memory per thread; suffix K/M/G recognized [768M]\n -n Sort by read name\n -o FILE Write final output to FILE rather than standard output\n -T PREFIX Write temporary files to PREFIX.nnnn.bam\n -@, --threads INT\n Set number of sorting and compression threads [1]\n --input-fmt-option OPT[=VAL]\n Specify a single input file format option in the form\n of OPTION or OPTION=VALUE\n -O, --output-fmt FORMAT[,OPT[=VAL]]...\n Specify output format (SAM, BAM, CRAM)\n --output-fmt-option OPT[=VAL]\n Specify a single output file format option in the form\n of OPTION or OPTION=VALUE\n --reference FILE\n Reference sequence FASTA FILE [null]\n'


```

Why is samtools sort called again? And why is it looking for a reference fasta when a header is supplied?

Any help is much appreciated.
Best regards

eyay (Owner) commented Nov 25, 2016

Hi ulah

Thanks for your interest in the pipeline.

  1. I call GenomeFetch in the remove_reads_with_genomic_TGG.py script with the line
    """ gf = GenomeFetch.GenomeFetch(genomedir=o.genome_dir) """
    where I provide the hg38 genome, so GenomeFetch fetches sequences from that genome given the coordinates (see the sketch after this list).

  2. Yes, my script parallelizes the process. Compressing the output is a good idea, thanks, but I haven't had time to implement that.

  3. We used the default value for outFilterMultimapNmax, which is 10, in order to be a little strict. But it is worth considering changing it to 20, as in ENCODE.
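Regarding point 1, a minimal sketch of how such a coordinate lookup could be used; the genome path and the get_seq method name are assumptions for illustration, so check GenomeFetch.py for the actual interface:

```python
import GenomeFetch

# Directory containing the hg38 per-chromosome fasta files (path is an assumption)
gf = GenomeFetch.GenomeFetch(genomedir="/path/to/hg38")

# Hypothetical lookup: fetch the three bases just downstream of a read's 3' end
# to test whether a trailing TGG is genomic (method name is an assumption).
seq = gf.get_seq("chr1", 1000001, 1000003)
print(seq.upper() == "TGG")
```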

I'm not quite sure why you get this error. Which samtools version do you use? Thanks for pointing out the rigid BAM naming; I will fix that.

best
Ilgar

ulah (Author) commented Nov 28, 2016

Hi Ilgar,

Thanks for your fast response.

To 1) Nice to know that you can provide your own genome file; I didn't gather that from the documentation of the script.
To 3) Sorry, my fault. I checked the v2.4.0 manual for the parameters (where the default for outFilterMultimapNmax is 20), but I ran the mapping with v2.5.1b, where the default is indeed 10. A little confusing... ;) Nevertheless, I actually wanted to know why and how you include multi-mappings in the later steps. Regardless of the number of alternative hits, you never know the true origin, and in my opinion the bias you introduce is always much greater than the knowledge gained, isn't it?

Concerning the error message:
$ samtools --version
samtools 1.2
Using htslib 1.2.1
Copyright (C) 2015 Genome Research Ltd.

$ pip freeze
Babel==2.3.4
CGAT==0.2.5
Cython==0.25.1
Jinja2==2.8
MarkupSafe==0.23
MonoVar==0.0.0
MySQL-python==1.2.5
PAM==0.4.2
Pillow==2.3.0
PyVCF==0.6.7
PyYAML==3.12
Pygments==2.1.3
RSeQC==2.6.1
SPARQLWrapper==1.7.6
Sphinx==1.4.8
Twisted-Core==13.2.0
Twisted-Web==13.2.0
adium-theme-ubuntu==0.3.4
alabaster==0.7.9
alignlib-lite==0.3
apt-xapian-index==0.45
argparse==1.2.1
backports-abc==0.5
backports.ssl-match-hostname==3.5.0.1
bgcore==0.5.0
biopython==1.68
brewer2mpl==1.4.1
bx-python==0.7.2
ccsm==0.9.11.3
certifi==2016.9.26
chardet==2.0.1
colorama==0.2.5
command-not-found==0.3
compizconfig-python==0.9.11.3
cutadapt==1.11
cycler==0.10.0
dblatex==0.3.4.post3
debtagshw==0.1
decorator==4.0.0
defer==1.0.6
dirspec==13.10
dnspython==1.11.1
docutils==0.12
duplicity==0.6.23
ebfilter==0.2.0
et-xmlfile==1.0.1
future==0.16.0
ggplot==0.11.5
h5py==2.5.0
html5lib==0.999
httplib2==0.8
imagesize==0.7.1
isodate==0.5.4
jdcal==1.3
joblib==0.10.3
keepalive==0.5
lockfile==0.8
lxml==3.3.3
matplotlib==1.3.1
matplotlib-venn==0.11.4
mergevcf==1.0.0
mpmath==0.19
networkx==1.9.1
nose==1.3.7
numpy==1.9.2
oauthlib==0.6.1
oneconf==0.3.7
openpyxl==2.4.0
pandas==0.17.1
patsy==0.4.1
pep8==1.7.0
pexpect==3.1
piston-mini-client==0.7.5
psycopg2==2.6.2
pyOpenSSL==0.13
pybedtools==0.7.8
pycrypto==2.6.1
pycups==1.9.66
pygobject==3.12.0
pyparsing==2.0.1
pysam==0.9.1.4
pyserial==2.6
pysmbc==1.0.14.1
python-apt===0.9.3.5ubuntu2
python-dateutil==1.5
python-debian===0.1.21-nmu2ubuntu2
pytz==2012rc0
pyxdg==0.25
rdflib==4.2.1
reportlab==3.0
requests==2.2.1
rpy2==2.8.4
ruffus==2.6.3
scikit-learn==0.18.1
scipy==0.18.1
sessioninstaller==0.0.0
singledispatch==3.4.0.3
six==1.10.0
snowballstemmer==1.2.1
software-center-aptd-plugins==0.0.0
ssh-import-id==3.21
statsmodels==0.6.1
sympy==1.0
system-service==0.1.6
tornado==4.4.2
umi-tools==0.2.3
unity-lens-photos==1.0
urllib3==1.7.1
vboxapi==1.0
virtualenv==1.11.4
web.py==0.38
weblogo==3.5.0
wheel==0.24.0
wsgiref==0.1.2
xdiagnose===3.6.3build2
xlwt==1.1.2
zope.interface==4.0.5

Best,
Urs

eyay (Owner) commented Nov 29, 2016

Hi Urs

The reason we are interested in multi-mapped reads is that some miRNAs and tRNAs are expressed from multiple loci, so their reads are expected to be multi-mapped and we didn't want to lose them. In our experience, including the multi-mapped reads didn't introduce much bias. We used a weighted approach to include multi-mapped reads, with the trick of only considering multi-mapped positions that fall into annotated regions (GENCODE, miRBase, GtRNAdb etc.). So, if a read maps to 10 different coordinates in the genome and only 3 of them are in annotated regions, then the read contributes 1/3 to the final count of each of those annotated regions (e.g. a small RNA).
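A minimal sketch of that weighting scheme as described above (not the pipeline's actual code; the annotation lookup and alignment parsing are simplified into plain functions):

```python
from collections import defaultdict

def weighted_counts(read_alignments, annotated_region_of):
    """read_alignments: dict mapping read id -> list of genomic hit coordinates.
    annotated_region_of: function mapping a coordinate to a region name, or None
    when the coordinate does not overlap an annotation (GENCODE, miRBase, GtRNAdb, ...)."""
    counts = defaultdict(float)
    for read_id, coords in read_alignments.items():
        # keep only the hits that fall into annotated regions
        regions = [annotated_region_of(c) for c in coords]
        regions = [r for r in regions if r is not None]
        if not regions:
            continue  # read has no annotated hit, so it is not counted
        # split the read evenly over its annotated hits:
        # 10 hits in total, 3 annotated -> 1/3 to each annotated region
        weight = 1.0 / len(regions)
        for region in regions:
            counts[region] += weight
    return counts
```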

I hope this was clear and let me know if you need further help

Best
Ilgar

ulah (Author) commented Nov 30, 2016

Hi Ilgar,

I see... But shouldn't it contribute 1/10 to the final count of the annotated region in your example? I think the likelihood that the read originates from one of the other regions cannot be ignored, since you may have all sorts of off-target contamination.
Anyway, thanks for the fast replies and detailed information; we're just starting out in the single-cell small-transcriptome business ;)

Unfortunately, my samtools error remains. It looks to me like the following code in "remove_softclipped_reads_parallelized_v2.py" is the critical part:
60 #sort and index the final bam file
---> 61 pysam.sort(bam_out, bam_out_sorted)
bam_out = '01_Mapping_clipped/SRR3495785/SRR3495785_unique_tmp.bam'
bam_out_sorted = '01_Mapping_clipped/SRR3495785/SRR3495785_unique'
62 pysam.index(bam_out_sorted+".bam", template=inbamPysamObj)

Could you confirm that it is safe to comment these lines out? And what should the final name of the sorted, indexed bam file then look like (currently it is SRR3495785_unique.bam)?
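For reference, the error text suggests the legacy two-argument pysam.sort(infile, out_prefix) call is hitting a newer samtools sort, which expects -o/-T style arguments. A minimal sketch of the new-style call, assuming pysam simply forwards its arguments to samtools (paths taken from the traceback above):

```python
import pysam

bam_out = '01_Mapping_clipped/SRR3495785/SRR3495785_unique_tmp.bam'
bam_out_sorted = '01_Mapping_clipped/SRR3495785/SRR3495785_unique'

# new-style samtools sort: write the sorted output via -o instead of an output prefix
pysam.sort("-o", bam_out_sorted + ".bam", bam_out)
pysam.index(bam_out_sorted + ".bam")
```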

Best,
Urs

ulah (Author) commented Jan 12, 2017

Hi Ilgar,
After a while I found some time to look deeper into the problem.
I now realize that the problem is probably somehow related to the parallelization implementation...

Here is what I did:
(1)
MyCall:
$ python [...]/remove_softclipped_reads_parallelized_v2.py -i 01_Mapping -o 01_Mapping_clipped -p 5 -n 0 -t 3
The same errors occur as before, but I hadn't noticed earlier that some files were indeed created in "01_Mapping_clipped".

(2)
To track which files might be problematic, I simply added "print inbam" below "def remove_clipped_reads(inbam):" in "remove_softclipped_reads_parallelized_v2.py".
Running it again, I saw that the code breaks while processing the 5th bam. So I removed that file from the input, tried again, and the error occurred again on the (new) 5th bam.

(3)
I can't remember why, but then I changed the number of threads (p) to 1. Now the error occurred immediately, i.e. on the first bam. Testing p=3, 4 and 8 confirmed that it breaks after one batch of 3, 4 or 8 bams has been processed.

(4)
Being no expert in Python, I tried to remove the parallelization from the script by replacing
Parallel(n_jobs=o.numCPU)(delayed(remove_clipped_reads)(sample) for sample in samplenames_with_fullpath)
with
for sample in samplenames_with_fullpath:
    remove_clipped_reads(sample)
This obviously removes the joblib-related errors, but the samtools error still remains...

It would be great if you could help me here as I'd like to use your pipeline/data for an ongoing project.

Best,
Urs

mhagemann86 commented Nov 5, 2017

Hi Ilgar & Urs,

The issue with running the remove_softclipped_reads_parallelized_v2.py script can be fixed by the following changes:

Delete the +".bam" in line 62
60 #sort and index the final bam file
61 pysam.sort(bam_out, bam_out_sorted)
62 pysam.index(bam_out_sorted +".bam" , template=inbamPysamObj)

The later scripts, however, rely on the .bam ending of the clipped files, so add the +".bam" back in line 23:

23 bam_out_sorted = ".".join(outbamTmp.split(".")[:-1]) + ".bam"

At least that seems to have done the trick for me.

Kind Regards
Michael
