Ok, I've been bit by Tim Bray's Wide Finder meme.
I noticed the conversation swarm as it bubbled up, but didn't pay too much attention. Mark Masterson's article It's Time to Stop Calling Circuits "Hardware" caught my attention, as I have pondered the plasticity of the boundary between hardware and software in a previous life.
So I've been digesting the conversation swarm. It's one heck of an interesting read.
Tim presents a problem case that frames a fundamental shift occurring in modern CPU/system architectures. The shift is moving us away from ever-increasing CPU speeds towards ever-increasing CPU counts. Certain classes of problems are extremely well suited to multi-core, multi-CPU architectures. Other problems gain no direct benefit, particularly if they are migrated without change. Tim uses the problem of summarizing log file data as an example of this latter case.
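To make the example concrete: summarizing log data is, in spirit, a single sequential pass over a file. Here's a minimal sketch of that baseline in Python; the `/articles/` URL pattern and file name are illustrative, not Tim's actual task definition:

```python
import re
from collections import Counter

# Hypothetical pattern for "article fetch" lines; substitute whatever
# GET pattern your own logs use.
ARTICLE = re.compile(r'GET (/articles/\S+) ')

def summarize(path):
    """Sequential baseline: one pass, one core, a dict of counts."""
    counts = Counter()
    with open(path, errors="replace") as f:
        for line in f:
            m = ARTICLE.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts

# counts = summarize("access.log")
# for url, n in counts.most_common(10):
#     print(n, url)
```

Nothing about that loop gets faster when you add cores; that's the crux of the problem Tim poses.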
Without brainpower focused on this aspect of the problem, the techniques being employed to increase aggregate compute capacity will not provide much benefit for many of the common tasks performed in IT shops.
There are three interesting aspects to Tim's conversation swarm. Two are explicit. The third is implicit.
The first aspect consists of all the solutions for the stated goal - how to leverage the latest trend in processor/system architectures for the seemingly mundane task of processing log data.
For what it's worth, here are my first thoughts on the problem of leveraging multiple CPUs for the task of processing log data. My preference leans towards using existing technology, most likely to be implemented by the people most likely to feel the pain.
Divide and conquer: (the sysadmin in me)
- Coerce the logging engine(s) to dump into multiple log files (to multiple disks or disk channels if necessary).
- Run a pile of processes to process the log files independently.
- Consolidate the data - either as post processing or incrementally via some form of IPC.
- The choice of language is immaterial, but history would probably vote for perl or shell goop.
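The steps above can be sketched in a few lines of Python rather than perl goop; assume the logging engines have already split the stream into per-file shards, let `multiprocessing` stand in for the pile of processes, and consolidate as a post-processing merge (file names and the URL pattern are illustrative):

```python
import re
from collections import Counter
from multiprocessing import Pool

ARTICLE = re.compile(r'GET (/articles/\S+) ')  # hypothetical fetch pattern

def count_one(path):
    """Worker: summarize a single log shard independently."""
    counts = Counter()
    with open(path, errors="replace") as f:
        for line in f:
            m = ARTICLE.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts

def consolidate(paths):
    """Run one worker per shard, then merge the partial counts."""
    total = Counter()
    with Pool() as pool:
        for partial in pool.map(count_one, paths):
            total.update(partial)
    return total

# total = consolidate(["access.log.0", "access.log.1", "access.log.2"])
```

The payoff depends entirely on the sysadmin-side legwork: if the shards sit on separate disks or channels, the workers really do run independently.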
Streams and Triggers: (mentioned in the conversation comments)
- Hook into the log stream(s)
- Spawn readers for the various data collection functions
- Send events from the log stream(s) to the readers, processing the data as it's received
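That fan-out can be sketched with in-process queues standing in for real IPC; the two collection functions here (status codes and URL hits) and the simplified `"METHOD path status"` event format are illustrative:

```python
import queue
import threading
from collections import Counter

def make_reader(handle):
    """Spawn a reader thread that applies `handle` to each event it receives."""
    q = queue.Queue()
    def run():
        while True:
            event = q.get()
            if event is None:   # sentinel: stream closed
                break
            handle(event)
    t = threading.Thread(target=run, daemon=True)
    t.start()
    return q, t

status_counts = Counter()
url_counts = Counter()

# One reader per data collection function.
readers = [
    make_reader(lambda line: status_counts.update([line.split()[2]])),
    make_reader(lambda line: url_counts.update([line.split()[1]])),
]

def pump(stream):
    """Hook into the log stream and send each event to every reader."""
    for line in stream:
        for q, _ in readers:
            q.put(line)
    for q, t in readers:
        q.put(None)
        t.join()

# pump(open("access.log"))
```

Real log volumes would push this toward separate processes or Erlang-style actors; Python threads all share one interpreter lock, so this sketch shows the shape of the solution, not the speedup.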
Neither of these two solutions is particularly interesting, but I imagine they are the most likely to be implemented in the wild.
My final offering is more of a meta solution.
- Formulate a red herring idea
- Pose it to a bunch of brainy people
- Watch them chew on it
- Gain new insight
The second interesting aspect of the conversation swarm is the rumination over the relationship between computer languages and the shift in CPU/system architectures.
One participant (sorry, can't recall the link) offered the suggestion that it's probably easier to improve a language like Erlang than it is to modify the mainstream languages to provide the capabilities inherent in Erlang.
I don't disagree with this point of view, but Tim's point regarding the widespread use of perl/awk/etc points to a fundamental fact in IT shops - the tool must be wickedly effective at getting the job done. Optimal performance is often optional.
So how to effectively use 64-1024 CPU machines?
First off, who says our current technologies are effectively using the existing architectures? Follow things from the hardware up the application stack - it staggers the mind.
The reality is we seldom go back and fix. We come up with clever ways to incrementally capitalize on architectural changes. We reframe existing code in ways that take advantage of changes in architectures. I'm overgeneralizing somewhat, but no matter.
At the risk of sounding like a pessimist, I think we'll end up with thousands of little SOA web services engines. Each one handling a single piece. Each one with its own HTTP stack. Each one using PHP/Perl/Ruby/etc to implement the service functions. Each one sitting on top of a tiny little MySQL database. Eeeep! I just scared myself - better drop this line of thought. I'll have nightmares for weeks.
The third interesting aspect of the conversation is how it shows some of the most important characteristics of the modern concept of networks vs. groups. It's decentralized, it's unlikely to be swayed by an alpha geek, it creates a variety of unanticipated results, it's a bit messy, and it provides fertile ground for exploring the topic at some point in the future.