The future is in parallelization, but today's approaches are not very comfortable to program for. GPGPUs require simple control flow, i.e. very few branches or very skewed ones. Threads on a GPGPU are grouped, and if threads within one group take different branches, both branches have to be executed. CPUs, on the other hand, pay a relatively high cost for synchronization, and even then a CPU can run very few threads in parallel, at least compared to a GPGPU. In addition, x86 requires rather strong memory consistency between cores, and that raises the cost of building many-core x86 CPUs. AFAIR Intel removed most of the cache coherency semantics in Larrabee, because with so many cores (about 50, AFAIR) full coherency would have killed the whole project (too much power and die area dedicated to keeping caches coherent). With the new Xeon Phis Intel seems to have reintroduced full cache coherency, but Xeon Phi isn't as impressive now. Specifications are here: http://ark.intel.com/products/71992/...53-ghz-60-core
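The divergence cost can be sketched in plain software. Here's a minimal, hypothetical model of masked SIMT execution (Go is just my choice for illustration; the function name and the even/odd branch condition are made up): both sides of the branch run over the whole thread group, and a per-lane mask decides which lanes commit results, so a divergent branch costs roughly the sum of both paths.

```go
package main

import "fmt"

// simtBranch models how a GPU handles a divergent branch within one
// thread group ("warp"): instead of each lane taking its own path,
// BOTH paths are executed over all lanes, and a mask selects which
// lanes keep the result of each pass.
func simtBranch(warp []int) []int {
	mask := make([]bool, len(warp))
	for i, v := range warp {
		mask[i] = v%2 == 0 // the branch condition, evaluated per lane
	}
	out := make([]int, len(warp))
	// "then" pass: every lane steps through it, only active lanes commit
	for i := range warp {
		if mask[i] {
			out[i] = warp[i] * 2
		}
	}
	// "else" pass: the same loop runs again with the mask inverted
	for i := range warp {
		if !mask[i] {
			out[i] = warp[i] + 1
		}
	}
	return out
}

func main() {
	// All four lanes pay for both passes, even though each lane only
	// commits a result in one of them.
	fmt.Println(simtBranch([]int{1, 2, 3, 4})) // prints [2 4 4 8]
}
```

If every lane takes the same side of the branch (no divergence), real hardware skips the other pass entirely, which is why "very skewed branches" are cheap while 50/50 divergence is not.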
That's 60 cores × 1 GHz on 22 nm, while the Radeon 7970 is 32 cores (compute units) × 1 GHz on 28 nm, so not that much of a gain in flexibility. Additionally, it's quite possible that a 22 nm server Haswell will have 20 cores at 3 GHz, giving comparable peak compute while being much easier to program.
I'm keen on message passing and event-driven asynchronous architectures, but unfortunately there is no hardware acceleration for such things in contemporary CPUs. Going low-level with locks and semaphores always gives me headaches.
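Even without hardware support, the message-passing style can at least be expressed cleanly in software. A minimal sketch (again in Go, whose channels make this idiom cheap to write; the counter example itself is hypothetical): one goroutine owns the state, and everyone else talks to it through a channel instead of sharing memory under a lock.

```go
package main

import "fmt"

// A tiny event-driven counter: the goroutine owns the state, and all
// access goes through messages on a channel — no locks or semaphores.
type request struct {
	delta int
	reply chan int // the new value is sent back on this channel
}

func counter(requests <-chan request) {
	value := 0 // private to this goroutine; never shared directly
	for req := range requests { // events are processed one at a time
		value += req.delta
		req.reply <- value
	}
}

func main() {
	requests := make(chan request)
	go counter(requests)

	reply := make(chan int)
	requests <- request{delta: 5, reply: reply}
	fmt.Println(<-reply) // prints 5
	requests <- request{delta: -2, reply: reply}
	fmt.Println(<-reply) // prints 3
}
```

The serialization that a mutex would enforce implicitly is enforced here by the channel itself, which is exactly the kind of primitive that would be nice to see accelerated in hardware.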