I am working with a database containing many URLs (tens of thousands). I am trying to rewrite, as a multithreaded program, a solution that simply tries to resolve each domain. On success, it compares the result with what is currently stored in the database and updates the row if they differ; on failure, it also records that.
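For reference, the per-domain logic I'm trying to parallelise looks roughly like this (only a sketch: the table and column names are made up, and $dbh is assumed to be an already-connected DBI handle):

    use strict;
    use warnings;
    use Socket qw(inet_aton inet_ntoa);

    # Sketch of the existing single-threaded check; table/column names are invented.
    sub check_domain {
        my ($dbh, $domain, $stored_ip) = @_;

        # inet_aton does the DNS lookup and returns undef on failure.
        my $packed      = inet_aton($domain);
        my $resolved_ip = $packed ? inet_ntoa($packed) : undef;

        # Update on failure, or when the result differs from the stored value.
        if (!defined $resolved_ip || $resolved_ip ne $stored_ip) {
            $dbh->do('UPDATE urls SET ip = ? WHERE domain = ?',
                     undef, $resolved_ip, $domain);
        }
        return $resolved_ip;
    }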
Naturally, this leads to an excessive number of database calls. To clear up some of my confusion about the best way to get some form of asynchronous load balancing, I have the following questions (bear in mind I'm still a Perl newbie):
- What is the best option for distributing the workload, and why?
- How do I collect the URLs to resolve before spawning the workers?
- Building a hash of domains along with their stored data seems to make the most sense to me: split it up, start the children, and have the children hand their results back to the parent (see the sketch after this list).
- How should the data returned to the parent be handled cleanly?
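To make that concrete, here is roughly the shape I have in mind, using Thread::Queue for both directions (just a sketch: the example domain, IP, and thread count are placeholders, and in reality %domains would be loaded from the database):

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use Socket qw(inet_aton inet_ntoa);

    # In reality this hash would be loaded from the database.
    my %domains          = ( 'example.com' => '93.184.216.34' );
    my $threads_to_spawn = 8;

    my $work    = Thread::Queue->new;   # parent -> children
    my $results = Thread::Queue->new;   # children -> parent

    # Children: pull domains off the queue, resolve them, report back.
    my @workers = map {
        threads->create(sub {
            while (defined(my $domain = $work->dequeue)) {
                my $packed = inet_aton($domain);
                my $ip     = $packed ? inet_ntoa($packed) : undef;
                $results->enqueue([ $domain, $ip ]);
            }
        });
    } 1 .. $threads_to_spawn;

    # Parent: hand out all the work, then one undef per worker so they exit.
    $work->enqueue($_)    for keys %domains;
    $work->enqueue(undef) for @workers;

    # Parent: collect one result per domain and decide what needs writing back.
    for (1 .. scalar keys %domains) {
        my ($domain, $ip) = @{ $results->dequeue };
        if (!defined $ip || $ip ne $domains{$domain}) {
            # batch a DB update here rather than hitting the DB per result
        }
    }
    $_->join for @workers;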
I played with a more Pythonic approach (given that I have more experience with Python), but I have not yet been able to get it working, apparently because the blocking isn't happening for some reason. Given the nature of this problem, threads are probably not the best option anyway, simply because of the (lack of) CPU time each thread would get (plus, I've been crucified more than once in the Perl channel for using threads :P, and for good reason).
Below is the more-or-less pseudocode I've been playing with for my threads (it should be read as a complement to my explanation of what I'm trying to do, rather than as anything more).
    # Create the children...
    for (my $i = 0; $i < $threads_to_spawn; $i++) {
        threads->create(\&worker);
    }
The parent then sits in a loop watching over the shared array of domains, blocking and refilling it when it becomes empty.
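That blocking is the part I can't get to work. My rough idea looks something like the sketch below, using threads::shared and cond_wait on the shared array (only a sketch: refill_domains() and its placeholder domain list are made-up stand-ins for the real database query that fetches the next batch):

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my @domains :shared;       # the shared work list the children pull from
    my $done    :shared = 0;   # set by the parent when there is nothing left

    # Stand-in for the real DB query; hands out small batches until it runs dry.
    my @pending = ('example.com', 'example.org', 'example.net');
    sub refill_domains { return splice(@pending, 0, 2); }

    sub worker {
        while (1) {
            my $domain;
            {
                lock(@domains);
                cond_wait(@domains) while !@domains && !$done;
                return if !@domains && $done;
                $domain = shift @domains;
                cond_broadcast(@domains) if !@domains;   # wake the parent to refill
            }
            # ... resolve $domain and record the result here ...
        }
    }

    my $threads_to_spawn = 8;
    threads->create(\&worker) for 1 .. $threads_to_spawn;

    # Parent: block until the array is drained, refill it, repeat.
    while (my @batch = refill_domains()) {
        lock(@domains);
        cond_wait(@domains) while @domains;   # wait for the children to empty it
        push @domains, @batch;
        cond_broadcast(@domains);             # wake the children
    }

    # Nothing left: tell the children to finish up.
    {
        lock(@domains);
        $done = 1;
        cond_broadcast(@domains);
    }
    $_->join for threads->list;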