How can I parallelize computations on a FASTA file, where each processor takes one sequence?


I don't know how to parallelise Python code that takes each sequence in a FASTA file and computes some statistics on it, such as GC content. Do you have any tips or libraries that would help me decrease the execution time?

I've tried to use os.fork(), but it gives me a longer execution time than the sequential code, probably because I don't know how to give each child a different sequence.

# Computing GC content
from Bio import SeqIO

with open('chr1.fa', 'r') as f:
    records = list(SeqIO.parse(f, 'fasta'))
    GC_for_sequence = []
    for i in records:
        GC = 0
        for j in i:
            if j in "GC":
                GC += 1
        GC_for_sequence.append(GC / len(i))
    print(GC_for_sequence)

The expected behaviour would be: each process takes one sequence, and they compute the statistics in parallel.


Answers

answered 1 week ago Lukasz Tracewski #1

Here's one idea with the standard multiprocessing module:

from multiprocessing import Pool
import numpy as np

no_cores_to_use = 4

GC_for_sequence = [np.random.rand(100) for x in range(10)]

with Pool(no_cores_to_use) as pool:
    result = pool.map(np.average, GC_for_sequence)

print(result)

In the code above I used the numpy module to simulate a list with some content. pool.map takes the function you want to apply to your data as its first argument and the data list as its second. You can easily define the function yourself; by default, it should take a single argument. If you want to pass more, use functools.partial.

[EDIT] Here's an example much closer to your problem:

from multiprocessing import Pool
import numpy as np

records = ['ACTGTCGCAGC' for x in range(10)]
no_cores_to_use = 4

def count(sequence):
    # count G and C separately; sequence.count('GC') would only count
    # occurrences of the two-character substring "GC"
    return sequence.count('G') + sequence.count('C')

with Pool(no_cores_to_use) as pool:
    result = pool.map(count, records)

print(sum(result))

answered 1 week ago Sam Mason #2

a few notes on your existing code to start with:

  1. I'd suggest not doing list(SeqIO.parse(…)), as that will pause execution until all sequences have been loaded into memory. You're much better off (for both memory and total execution time) leaving it as an iterator and handing elements off to workers as needed

  2. looping over each character is pretty slow; using str.count is going to be much faster

putting this together, you can do:

from Bio import SeqIO

with open('chr1.fa') as fd:
    gc_for_sequence=[]
    for seq in SeqIO.parse(fd, 'fasta'):
        gc = sum(seq.seq.count(base) for base in "GC")
        gc_for_sequence.append(gc / len(seq))

if this still isn't fast enough, then you can use the multiprocessing module like:

from Bio import SeqIO
from multiprocessing import Pool

def sequence_gc_prop(seq):
    return sum(seq.count(base) for base in "GC") / len(seq)

with open('chr1.fa') as fd, Pool() as pool:
    gc_for_sequence = pool.map(
        sequence_gc_prop,
        (seq.seq for seq in SeqIO.parse(fd, 'fasta')),
        chunksize=1000,
    )

Lukasz's comments mostly apply. Other non-obvious stuff:

  • the weird seq.seq for seq in … stuff is to make sure that we're not pickling unnecessary data
  • I'm setting chunksize to quite a large value because the function should be quick; we want to give the children a reasonable amount of work so the parent process doesn't spend all its time orchestrating things
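To see why sending only the sequence matters, here is a minimal sketch using pickle directly. The `Record` class below is a hypothetical stand-in for a full SeqRecord (which carries an id, description, annotations, and so on alongside the sequence):

```python
import pickle

# Hypothetical stand-in for a SeqRecord: metadata travels with the sequence.
class Record:
    def __init__(self, seq, id, description, annotations):
        self.seq = seq
        self.id = id
        self.description = description
        self.annotations = annotations

rec = Record("ACTG" * 1000, "chr1", "x" * 500, {"notes": "y" * 500})

# Pickling the whole record serialises all the metadata too;
# pickling rec.seq (what the generator expression sends) does not.
full_size = len(pickle.dumps(rec))
seq_size = len(pickle.dumps(rec.seq))
print(full_size, seq_size)
```

Every item handed to a Pool worker is pickled in the parent and unpickled in the child, so shaving bytes off each item pays for itself across a whole genome's worth of sequences.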
