The fastest way to handle large files in Python

Time: 2019-11-8

We have about 500 GB of images spread across various directories that we need to work with. Each image is about 4 MB, and we have a Python script that processes one image at a time (it reads the metadata and stores it in a database). Depending on its size, each directory can take 1-4 hours to process.

We have a 2.2 GHz quad-core processor and 16 GB of RAM on a GNU/Linux operating system. The current script uses only one core. What is the best way to use the other cores and the RAM to process the images faster? Will starting multiple Python processes to run the script take advantage of the other cores?

Another option is to use something like Gearman or Beanstalk to farm the work out to other machines. I’ve looked at the multiprocessing library but am not sure how to use it.

 

Solution


Will starting multiple Python processes to run the script take advantage of the other cores?

Yes, if the task is CPU-bound, it will. This is probably the easiest option. However, do not spawn one process per file or per directory; consider a tool such as parallel(1) and let it launch something like two processes per core.

Another option is to use something like Gearman or Beanstalk to farm the work out to other machines.

That might work. Also, have a look at ZeroMQ's Python bindings, which make distributed processing very easy.
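For a feel of what that looks like, here is a minimal sketch of a pyzmq PUSH/PULL pipeline; the port number, the dispatcher-host name, and reusing the process and directories names from the multiprocessing example below are all assumptions made for illustration.

# dispatcher side: hands out directory names to any connected workers
import zmq

context = zmq.Context()
sender = context.socket(zmq.PUSH)
sender.bind("tcp://*:5557")              # placeholder port
for directory in directories:
    sender.send_string(directory)

# worker side: run one per core, on this machine or any other
import zmq

context = zmq.Context()
receiver = context.socket(zmq.PULL)
receiver.connect("tcp://dispatcher-host:5557")   # placeholder host name
while True:
    process(receiver.recv_string())

Each worker pulls the next directory name as soon as it is free, so the load balances itself across however many machines you start workers on.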

I’ve looked at the multiprocessing library but am not sure how to use it.

For example, define a function process that reads the images in a single directory, connects to the database, and stores the metadata. Have it return a boolean indicating success or failure. Let directories be the list of directories to process. Then:

import multiprocessing

# one worker process per CPU core
pool = multiprocessing.Pool(multiprocessing.cpu_count())
# True only if process() succeeded for every directory
success = all(pool.imap_unordered(process, directories))
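For concreteness, here is one possible sketch of process; the SQLite file, the table layout, and recording only the file size as "metadata" are assumptions made for illustration, not the original script's behaviour.

import os
import sqlite3

def process(directory):
    # Read every file in one directory and store basic metadata in SQLite.
    # Returns True on success, False on any error, as the pool code expects.
    try:
        conn = sqlite3.connect("/tmp/images.db")          # placeholder path
        with conn:                                        # commits on success
            conn.execute("CREATE TABLE IF NOT EXISTS image_metadata "
                         "(path TEXT, size INTEGER)")
            for name in os.listdir(directory):
                path = os.path.join(directory, name)
                if os.path.isfile(path):
                    conn.execute(
                        "INSERT INTO image_metadata (path, size) VALUES (?, ?)",
                        (path, os.path.getsize(path)))
        conn.close()
        return True
    except Exception:
        return False

With several worker processes writing at once, SQLite will serialize the writers and may report the database as locked; a client/server database, or a single writer fed through a queue, is usually a better fit for this kind of parallel run.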

All the directories will then be processed in parallel. If necessary, you can also parallelize at the file level; that takes a bit more tinkering.
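One way to do that, assuming a hypothetical per-file helper process_file(path) with the same True/False contract as process, is to flatten everything into a single list of files first:

import os
import multiprocessing

# one flat list of every file in every directory
files = [os.path.join(d, name)
         for d in directories
         for name in os.listdir(d)]

pool = multiprocessing.Pool(multiprocessing.cpu_count())
# chunksize batches the many small per-file tasks to keep dispatch overhead down
success = all(pool.imap_unordered(process_file, files, chunksize=32))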

Note that this will stop at the first failure; making it fault tolerant takes a little more work.
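A small step in that direction, again only a sketch, is to collect the per-directory results instead of short-circuiting on the first False, so failures can be reported and retried:

import multiprocessing

pool = multiprocessing.Pool(multiprocessing.cpu_count())
# map() returns results in the same order as directories
results = pool.map(process, directories)
failed = [d for d, ok in zip(directories, results) if not ok]
if failed:
    print("Directories to re-run:", failed)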

This article was first published on the python black hole network; the Blog Park copy is updated in sync.