What parallel programming model do you recommend today to take advantage of multi-core processors tomorrow?

If you are writing a new application from scratch today and want it to scale to all the cores you could add to it tomorrow, which parallel programming model / system / language / library would you choose? Why?

I am particularly interested in the answers on these axes:

  • Programmer productivity / ease of use (can mortals use it successfully?)
  • Target application domain (what problems is it (not) good at?)
  • Concurrency style (does it support tasks, pipelines, data parallelism, message passing ...?)
  • Maintainability / future-proofing (will anyone still be using it in 20 years?)
  • Performance (how does it scale on which hardware?)

I am deliberately vague about the nature of the application, in the hope of getting good general answers that are useful for a variety of applications.

+45
parallel-processing multicore
Sep 17 '08 at 4:29
22 answers

Multi-core programming may require more than one paradigm. Some current rivals:

  • MapReduce. This works well when the problem can be decomposed easily into parallel chunks.
  • Nested data parallelism. This is similar to MapReduce, but actually supports recursive decomposition of the problem, even when the recursive chunks are of irregular size. Look for NDP to be a big win in purely functional languages running on massively parallel but limited hardware (like GPUs).
  • Software transactional memory. If you need traditional threads, STM makes them bearable. You pay a 50% performance penalty in critical sections, but you can scale complex locking schemes to 100 processors without pain. It will not work for distributed systems, however.
  • Parallel object threads with message passing. This really clever model is used by Erlang. Each "object" becomes a lightweight thread, and objects communicate via asynchronous messages and pattern matching. It is basically true parallel OO. It has been used successfully in several real-world applications, and it works great for untrusted distributed systems.

Some of these paradigms give you maximum performance but only work if the problem decomposes cleanly. Others sacrifice some performance but allow a wider variety of algorithms. I suspect that some combination of the above will eventually become the standard toolkit.

+27
Sep 17 '08 at 15:10

Two solutions I really like are the join calculus (JoCaml, Polyphonic C#) and the actor model (Erlang, Scala, E, Io).

I am not particularly impressed by software transactional memory. It feels like it exists only to let threads cling to life a little longer, even though they should have died out decades ago. However, it has three main advantages:

  • People understand database transactions
  • There is already talk of hardware support for transactional memory.
  • Much as we might all wish otherwise, threads are likely to remain the dominant concurrency model for the next few decades, sadly. STM can significantly reduce the pain.
+13
Sep 17 '08 at 5:04

The mapreduce / hadoop paradigm is useful and relevant. Especially for people used to languages like Perl, the idea of mapping over an array and performing some action on each element should come fairly naturally, and mapreduce / hadoop simply takes the next step and says there is no reason why each element of the array has to be processed on the same machine.

In a sense it is also more battle-tested, since Google uses MapReduce and plenty of people have used Hadoop, and it has been shown to work well for scaling applications across multiple machines over a network. And if you can scale across multiple machines on a network, you can scale across multiple cores in a single machine.
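To make the map/reduce idea concrete on a single machine, here is a minimal sketch using C++17 parallel algorithms rather than Hadoop itself (it assumes a toolchain with parallel-algorithm support, e.g. a recent GCC linked against TBB; the sum-of-squares workload is just a stand-in):

    #include <execution>
    #include <functional>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<long long> values(100000);
        std::iota(values.begin(), values.end(), 1);      // 1, 2, 3, ...

        long long sumOfSquares = std::transform_reduce(
            std::execution::par,                  // spread the work across the available cores
            values.begin(), values.end(),
            0LL,                                  // initial value for the reduction
            std::plus<>(),                        // "reduce": combine partial results
            [](long long v) { return v * v; });   // "map": applied to each element independently

        return sumOfSquares > 0 ? 0 : 1;
    }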

+11
Sep 17 '08 at 4:35

For a .NET application I would choose ".NET Parallel Extensions (PLINQ)": it is extremely easy to use and lets me parallelize existing code in minutes.

  • It is simple to learn.
  • I use it for complex operations on large arrays of objects, so I cannot comment on other applications.
  • It supports tasks and pipelines.
  • It should be supported over the next few years, but who knows for sure?
  • The CTP version has some performance issues, but it already looks very promising.

Mono is likely to get PLINQ support , so it could be a cross-platform solution.

+10
Sep 17 '08 at 4:33

For heavy-duty computation and the like, purely functional languages such as Haskell are easily parallelizable without any effort on the programmer's part. Apart from learning Haskell, that is.

However, I do not think this is the way of the future (on its own), simply because too many programmers are too accustomed to the imperative programming paradigm.

+6
Sep 17 '08 at 4:46

Kamaelia is a Python framework for building applications with lots of communicating processes.

Kamaelia - Concurrency made useful, fun (image: http://www.kamaelia.org/cat-trans-medium.png)

In Kamaelia you build systems from simple components that talk to each other. This speeds up development, massively simplifies maintenance, and also means you build naturally concurrent software. It is intended to be accessible to any developer, including novices. It also makes it fun :)

What kinds of systems? Network servers, clients, desktop applications, pygame-based games, transcode systems and pipelines, digital TV systems, spam eradicators, teaching tools, and a fair amount more :)

See also the question Multicore and Concurrency - Languages, Libraries, and Development Methods.

+5
Sep 18 '08 at 1:20

I am betting on communicating event loops with promises, as implemented in systems like Twisted, E, AmbientTalk, and others. They retain the ability to write code with the same execution-model assumptions as non-concurrent/parallel applications, but they scale to distributed and parallel systems. (That is why I am working on Ecru.)

+4
Sep 17 '08 at 4:36

Check out Erlang. Google for it, and watch the various presentations and videos. Many of the programmers and architects I respect are quite taken with its scalability. We are using it pretty heavily where I work...

+2
Sep 17 '08 at 4:40

As already mentioned, purely functional languages are inherently parallelizable. However, imperative languages are much more intuitive for many people, and we are deeply entrenched in imperative legacy code. The fundamental issue is that pure functional languages express side effects explicitly, while in imperative languages side effects are expressed implicitly by the ordering of statements.

I believe that techniques for expressing side effects declaratively (for example, in an object-oriented framework) will allow compilers to decompose imperative statements into their functional relationships. This should then allow code to be automatically parallelized in much the same way as pure functional code.

Of course, just as it is still desirable today to write certain performance-critical code in assembler, it will still be necessary tomorrow to write explicitly parallel code where performance is critical. However, techniques such as the one I have outlined should help to take advantage of many-core architectures automatically, with minimal effort on the developer's part.

+2
Sep 17 '08 at 15:32

I am surprised that no one has suggested MPI (Message Passing Interface). Though designed for distributed memory, MPI programs with substantial and frequent global communication (solving linear and nonlinear equations with billions of unknowns) have been shown to scale to 200k processor cores.
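For a flavour of what that looks like, here is a minimal MPI sketch (assuming an MPI implementation such as Open MPI or MPICH; compile with mpicxx and launch with mpirun):

    #include <mpi.h>
    #include <cstdio>

    // Each rank computes a partial result, then all ranks combine them with a
    // collective reduction - the basic pattern behind much larger codes.
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double partial = static_cast<double>(rank + 1);   // stand-in for real local work

        double total = 0.0;
        MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("ranks=%d total=%f\n", size, total);

        MPI_Finalize();
        return 0;
    }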

+2
Mar 18 '09 at 14:53

Qt Concurrent offers a MapReduce implementation for multi-core processors that is very easy to use. And it is cross-platform.
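As a rough sketch of how that looks (based on the QtConcurrent::mappedReduced pattern from the Qt documentation; the element types and the trivial map/reduce functions here are just illustrative, and you need the Qt Concurrent module in your build):

    #include <QtConcurrent/QtConcurrent>
    #include <QCoreApplication>
    #include <QDebug>
    #include <QVector>

    // Map step: run on many elements in parallel across the cores.
    int square(const int &x) { return x * x; }

    // Reduce step: Qt calls this to fold each intermediate result into the total.
    void sumInto(int &result, const int &value) { result += value; }

    int main(int argc, char *argv[]) {
        QCoreApplication app(argc, argv);

        QVector<int> input;
        for (int i = 1; i <= 1000; ++i)
            input.append(i);

        // mappedReduced returns a QFuture holding the combined result.
        QFuture<int> future = QtConcurrent::mappedReduced(input, square, sumInto);
        qDebug() << "sum of squares:" << future.result();
        return 0;
    }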

+1
Sep 17 '08 at 4:34

If your problem domain permits it, try to think in terms of a share-nothing model. The less you share between processes and threads, the less you have to design complicated concurrency models.

+1
Sep 17 '08 at 4:59

See also the question Multicore and Concurrency - Languages, Libraries, and Development Methods.

+1
Oct 23 '08 at 11:08

This question seems to keep coming up in different wordings - perhaps there are different constituencies within StackOverflow. Flow-Based Programming (FBP) is a concept/methodology that has been around for over 30 years and has been used to handle most of the batch processing at a major Canadian bank. It has thread-based implementations in Java and C#, although earlier implementations were fiber-based (C++ and mainframe Assembler - the latter being the one used at the bank).

Most approaches to the problem of exploiting multi-core involve trying to take a conventional single-threaded program and figure out which parts can run in parallel. FBP takes a different approach: the application is designed from the start in terms of multiple "black box" components running asynchronously (think of a manufacturing assembly line). Since the interfaces between components are data streams, FBP is essentially language-independent, and therefore supports mixed-language applications and domain-specific languages. For the same reason, side effects are minimized. It can also be described as a "share nothing" model, and a form of MOM (message-oriented middleware). MapReduce appears to be a special case of FBP.

FBP differs from Erlang mainly in that Erlang operates in terms of many short-lived processes, so green threads are more appropriate there, whereas FBP uses fewer (typically a few tens to a few hundred) longer-lived threads. For a piece of a batch network that has been in daily production use for over 30 years, see part of a batch network; for the high-level design of an interactive application, see the brokerage application high-level design. FBP has been found to result in much more maintainable applications, as well as improved elapsed times - even on single-core machines!
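To illustrate the flavour in code - plain C++ threads and queues, not an actual FBP runtime, with component and stream names invented for the example - here is a tiny three-component pipeline where each "black box" only sees its input and output streams:

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>

    // The "connection" between components: a thread-safe stream of values.
    template <typename T>
    class Stream {
    public:
        void send(T value) {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(value));
            cv_.notify_one();
        }
        void close() {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
            cv_.notify_all();
        }
        std::optional<T> receive() {            // empty optional == end of stream
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return closed_ || !q_.empty(); });
            if (q_.empty()) return std::nullopt;
            T v = std::move(q_.front());
            q_.pop();
            return v;
        }
    private:
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
        bool closed_ = false;
    };

    int main() {
        Stream<int> raw, doubled;

        std::thread producer([&] {              // component 1: generate records
            for (int i = 1; i <= 10; ++i) raw.send(i);
            raw.close();
        });
        std::thread transformer([&] {           // component 2: transform records
            while (auto v = raw.receive()) doubled.send(*v * 2);
            doubled.close();
        });
        std::thread consumer([&] {              // component 3: consume records
            while (auto v = doubled.receive()) std::cout << *v << "\n";
        });

        producer.join(); transformer.join(); consumer.join();
        return 0;
    }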

+1
Jun 02 '09 at 15:55

A task queue with multiple workers (not sure about the correct terminology - message queue?)

Why?

Mainly because it is an absurdly simple concept. You have a list of stuff that needs processing, and a bunch of processes that take tasks and process them.

Also, unlike the reasons why, say, Haskell or Erlang are so concurrent/parallelizable(?), it is completely language-agnostic - you can trivially implement such a system in C, Python or any other language (even shell scripting), whereas I doubt bash will get software transactional memory or the join calculus any time soon.
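Here is a minimal sketch of that idea in plain C++ (the class name and the shutdown policy are just illustrative choices):

    #include <algorithm>
    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // A list of stuff to process, plus N workers that pull tasks off and run them.
    class TaskQueue {
    public:
        void push(std::function<void()> task) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                tasks_.push(std::move(task));
            }
            cv_.notify_one();
        }

        // Workers call this; returns false once the queue is shut down and drained.
        bool pop(std::function<void()> &task) {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
            if (tasks_.empty())
                return false;
            task = std::move(tasks_.front());
            tasks_.pop();
            return true;
        }

        void shutdown() {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                done_ = true;
            }
            cv_.notify_all();
        }

    private:
        std::queue<std::function<void()>> tasks_;
        std::mutex mutex_;
        std::condition_variable cv_;
        bool done_ = false;
    };

    int main() {
        TaskQueue queue;
        std::vector<std::thread> workers;
        unsigned n = std::max(2u, std::thread::hardware_concurrency());

        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([&queue] {
                std::function<void()> task;
                while (queue.pop(task))
                    task();                     // process one unit of work
            });

        for (int job = 0; job < 20; ++job)
            queue.push([job] { std::cout << "processed job " << job << "\n"; });

        queue.shutdown();
        for (auto &w : workers)
            w.join();
    }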

+1
Jun 2 '09 at 16:48

We have used PARLANSE, a parallel programming language with explicit partial-order specification of concurrency, over the past decade to implement a scalable program analysis and transformation system (the DMS Software Reengineering Toolkit) that does mostly symbolic rather than numeric computation. PARLANSE is a compiled, C-like language with traditional scalar data types (characters, integers, floats), dynamic data types (strings and arrays), compound data types (structures and unions), and lexically scoped functions. While most of the language is vanilla (arithmetic expressions over operands, if-then-else statements, do loops, function calls), the parallelism is not. Parallelism is expressed by defining a "precedes" relation over blocks of code (e.g., a before b, a before c, c before d), written as

(|; a (... a computation) (<< a) b ( ... b computation ... ) (<< a) c ( ....c computation ...) (>> c) d ( ... d computation...) )|; 

where the << and >> operators refer to "time ordering". The PARLANSE compiler can see these parallel computations and pre-allocate all the structures needed to run the grains a, b, c, d, and it generates custom code to start/stop each one, minimizing the overhead of starting and stopping these parallel grains.

Check out this link for a parallel iterative-deepening search for optimal solutions to the 15-puzzle, which is the 4x4 big brother of the 8-puzzle. It uses only potential parallelism via the parallel constructor (|| a b c d), which says there are no partial-order constraints among the computations a, b, c, d, but it also uses speculation and asynchronously aborts tasks that cannot find a solution. That is a lot of ideas in a fairly small amount of code.

PARLANSE runs on multi-core PCs. A large PARLANSE program (we have built many with a million lines or more) will have thousands of these partial orders, some of which call functions that contain others. So far we have had good results with 8 processors, and modest gains up to 16, and we are still tuning the system. (We believe the real problem with large numbers of cores on current PCs is memory bandwidth: 16 cores hammering a memory subsystem create a huge demand for bandwidth.)

Most other languages do not expose the parallelism, so it is hard to find, and their runtime systems pay a high price for scheduling computational grains through general-purpose primitives. We believe that is a recipe for disaster, or at least poor performance, because of Amdahl's law: if the number of machine instructions needed to schedule a grain is large compared to the work, you cannot be efficient. OTOH, if you insist on computational grains with many machine instructions so that the scheduling costs are relatively low, you cannot find computational grains that are independent, and so you have no useful parallelism to schedule. So the key idea behind PARLANSE is to minimize the cost of scheduling grains, so that grains can be small, so that they can actually be found in real code. The insight into this trade-off came from the abject failure of the pure dataflow paradigm, which did everything in parallel in tiny parallel chunks (e.g., individual add operations).

We have been working on this off and on for a decade. It is hard to get right. I do not see how people who have not been building parallel languages and using/tuning them over that kind of timeframe have any serious chance of building effective parallel systems.

+1
Aug 28 '09 at 3:53

I really like the model Clojure uses. Clojure uses a combination of immutable data structures and software transactional memory.

Immutable data structures are ones that never change. New versions of a structure can be created with modified data, but if you have a "pointer" to a data structure, it will never change out from under you. This is nice because you can access that data without worrying about concurrency problems.

Software transactional memory is discussed elsewhere in these answers, but suffice it to say that it is a mechanism by which multiple threads can act on the same data, and if they collide, one of the threads rolls back and tries again. This allows things to go much faster when the risk of a collision is present but unlikely.

There is a video from the author, Rich Hickey, that goes into a lot more detail.
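Clojure itself is out of scope for a C++ snippet, but the combination described above - immutable snapshots plus optimistic retry on conflict - can be roughed out like this (a toy illustration only, not Clojure's actual refs/STM):

    #include <memory>
    #include <mutex>
    #include <vector>

    // One shared "ref" holding an immutable vector; readers take snapshots,
    // writers build a new version and retry if someone else committed first.
    std::shared_ptr<const std::vector<int>> ref =
        std::make_shared<const std::vector<int>>();
    std::mutex ref_mutex;

    std::shared_ptr<const std::vector<int>> snapshot() {
        std::lock_guard<std::mutex> lock(ref_mutex);
        return ref;                    // a stable view that never changes underneath you
    }

    void append(int value) {
        for (;;) {
            auto old = snapshot();
            auto next = std::make_shared<std::vector<int>>(*old);   // copy, then modify the copy
            next->push_back(value);

            std::lock_guard<std::mutex> lock(ref_mutex);
            if (ref == old) {          // no one else committed in the meantime
                ref = std::move(next); // publish the new immutable version
                return;
            }
            // otherwise: conflict - roll back and try again, STM-style
        }
    }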

+1
Aug 28 '09 at 4:18

OpenCL may be a useful way forward; it provides a means of distributing certain kinds of computational loads across heterogeneous computing resources, i.e. the same code will run on a multi-core CPU as well as on commodity GPUs. ATI has recently released exactly such a toolchain. NVidia's CUDA toolchain is similar, though somewhat more limited. Nvidia also appears to have an OpenCL SDK in the works.

It should be noted that this probably will not help much where the workloads are not data-parallel in nature; for example, it will not help much with typical transaction processing. OpenCL is mainly aimed at computation that is mathematically intensive, such as scientific/engineering simulation or financial modeling.

+1
Aug 28 '09 at 4:52

If you are writing a new application today from scratch, and you want it to scale to all the cores you could add to it tomorrow, which parallel programming model / system / language / library would you choose?

Probably the most broadly applicable today are Cilk-style task queues (now available in .NET 4). They are great for problems that can be solved by divide and conquer with predictable complexity of the subtasks (such as parallel map and reduce over arrays where the complexity of the function arguments is known, as well as algorithms like quicksort), and that covers many real problems.

Also, this only applies to shared-memory architectures like today's multi-core processors. While I do not believe this basic architecture will disappear any time soon, I do believe it must be combined with distributed parallelism at some point. That will come either in the form of a cluster of multicores with message passing between the multicores, or in the form of a hierarchy of cores with predictable communication times between them. Substantially different programming models will be required to achieve maximum efficiency, and I do not think much is known about them yet.
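A rough sketch of that divide-and-conquer style, using std::async in place of Cilk or .NET 4 tasks (std::async spawns threads rather than using a work-stealing scheduler, so this only shows the shape of the decomposition; the cutoff is an arbitrary choice):

    #include <future>
    #include <numeric>
    #include <vector>

    // Fork-join parallel sum: split the range, sum one half on another thread,
    // the other half on this thread, then join the results.
    long long parallelSum(const std::vector<int> &data, std::size_t lo, std::size_t hi) {
        if (hi - lo < 100000)                       // small subproblem: just do it serially
            return std::accumulate(data.begin() + lo, data.begin() + hi, 0LL);

        std::size_t mid = lo + (hi - lo) / 2;
        auto left = std::async(std::launch::async,  // "fork"
                               parallelSum, std::cref(data), lo, mid);
        long long right = parallelSum(data, mid, hi);
        return left.get() + right;                  // "join"
    }

    int main() {
        std::vector<int> data(1000000, 1);
        return parallelSum(data, 0, data.size()) == 1000000 ? 0 : 1;
    }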

+1
May 23 '10 at

There are two parts to concurrent programming, IMO: identifying parallelism and specifying parallelism. Identifying = breaking the algorithm into parallel chunks of work; specifying = the actual coding/debugging. Identifying is independent of which framework you will use to specify the parallelism, and I do not think a framework can help much there. It takes a good understanding of your application, of the target platform, of common parallel-programming trade-offs (hardware latencies, etc.), and, most importantly, experience. Specifying, however, can be discussed, and that is what I try to answer below:

I have tried a lot of frameworks (at school and at work). Since you asked about multi-cores, which are shared memory, I will stick to three shared-memory frameworks I have used.

Pthreads (not really a framework, but definitely applicable):

Pros: - Pthreads is extremely general. To me, pthreads is the assembly language of parallel programming. You can code any paradigm in pthreads. - It is flexible, so you can make it as fast as you want; there are no inherent limitations to slow you down. You can write your own constructs and primitives and get as much speed as possible.

Cons: - You have to do all the plumbing, like work-queue management and task distribution, yourself. - The actual syntax is ugly, and your application often carries a lot of extra code, which makes it hard to write and then hard to read.
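A small pthreads sketch of a parallel sum makes the point - the argument marshalling, thread creation, joining, and combining of results are all yours to write:

    #include <cstdio>
    #include <pthread.h>
    #include <vector>

    struct WorkItem {
        const double *data;
        long begin, end;
        double partialSum;
    };

    void *sumRange(void *arg) {                    // the thread entry point
        WorkItem *item = static_cast<WorkItem *>(arg);
        item->partialSum = 0.0;
        for (long i = item->begin; i < item->end; ++i)
            item->partialSum += item->data[i];
        return nullptr;
    }

    int main() {
        const long n = 1000000, nthreads = 4;
        std::vector<double> data(n, 1.0);

        std::vector<pthread_t> threads(nthreads);
        std::vector<WorkItem> items(nthreads);
        for (long t = 0; t < nthreads; ++t) {
            items[t] = {data.data(), t * n / nthreads, (t + 1) * n / nthreads, 0.0};
            pthread_create(&threads[t], nullptr, sumRange, &items[t]);
        }

        double total = 0.0;
        for (long t = 0; t < nthreads; ++t) {
            pthread_join(threads[t], nullptr);     // wait, then fold in the partial result
            total += items[t].partialSum;
        }
        std::printf("total=%f\n", total);
        return 0;
    }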

OpenMP:

Pros: - The code looks clean; the plumbing and task division are mostly under the hood. - Semi-flexible: it gives you some interesting scheduling options.

Cons: - Meant for simple loop-style parallelism. (The latest Intel version also supports tasks, but then the tasks are the same as Cilk's.) - The underlying structures may or may not be well written for performance. The GNU implementation is just okay. Intel's ICC worked better for me, but I would still rather write some things myself for more performance.
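For comparison, the same kind of reduction in OpenMP is essentially one pragma (compile with -fopenmp or your compiler's equivalent flag):

    #include <cstdio>
    #include <omp.h>
    #include <vector>

    int main() {
        const int n = 10000000;
        std::vector<double> data(n, 0.5);

        double sum = 0.0;
        // One directive parallelizes the loop and combines the per-thread partial sums.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += data[i] * data[i];

        std::printf("threads=%d sum=%f\n", omp_get_max_threads(), sum);
        return 0;
    }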

Cilk, Intel TBB, Apple GCD:

Pros: - Probably the most robust/optimized underlying algorithms for task-level parallelism - Good serial/parallel task management - TBB also has a pipeline-parallelism framework, which I have used (it is not the best, to be frank) - Removes the need to write a lot of code for task-based systems, which can be a big plus if you are short on time

Cons: - Less control over the performance of the underlying structures. I know Intel TBB has some poorly performing underlying data structures; for example, the work queue was bad (in 2008, when I looked at it). - The code sometimes looks awful with all the keywords and boilerplate they want you to use. - It takes a lot of reading of the reference material to understand their "flexible" APIs.
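And a short TBB version of the same reduction, where the library owns the grain sizing and task scheduling (assumes the TBB/oneTBB headers and library are available):

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_reduce.h>
    #include <cstdio>
    #include <functional>
    #include <vector>

    int main() {
        std::vector<double> data(10000000, 1.0);

        double total = tbb::parallel_reduce(
            tbb::blocked_range<std::size_t>(0, data.size()),
            0.0,
            [&](const tbb::blocked_range<std::size_t> &r, double local) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    local += data[i];
                return local;                      // partial sum for this chunk
            },
            std::plus<double>());                  // combine the partial sums

        std::printf("total=%f\n", total);
        return 0;
    }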

+1
May 10 '11

I would use Java - it is portable, so future processors will not be a problem. I would also code my application in layers separating interface/logic and data (more like a 3-tier web application), with standard mutex routines as a library (less debugging of parallel code). Remember that web servers scale very well to many processors and are the least painful path to multi-core. Either that, or look at the old Connection Machine model with a virtual processor bound to the data.

0
Sep 17 '08 at 4:34

Erlang is the more mature solution and is portable and open source. I dabbled with Polyphonic C#, but I do not know how it would feel to program in it every day.

There are libraries and extensions for almost every language/OS under the sun; just google "transactional memory". It is an interesting approach from MS.

0
Oct 23 '08 at 11:18


