So, the questions are:
- Is there an existing language that expresses parallelism at the right
  level for a multi-core/multithreaded core to take advantage of?
  Doubtful.
I tend to agree. I don't know of any, but I thought someone else might.
I would claim that ScalPL is one, but it partially depends on what one
thinks it means for a language to "exist".
I downloaded the ScalPL paper and am working my way through it. I was
intrigued by software cabling when you talked about it here some months ago.
I went through that documentation and, quite frankly, got all bogged down in
the terminology and lost the forest for the trees. I am not sure that
changing all the names (as you did for ScalPL) helps that problem, but the
inclusion of examples did help some. BTW, I am not claiming that your
documentation is bad; it may very well be my fault. I have to spend more
time on it, which I will.
Thanks for the feedback. I've spoken to several others over the last few
days who have made similar comments and would like to see more examples. I
figured there was more work to do on this paper (as suggested by the
"Prelim" in the version number), but decided to get this version out,
mostly for those who have been waiting patiently, but also to let
feedback help steer the rest of the writing. I replaced the
modules/chips/boards/sockets/cables analogy in Software Cabling with
plans/tasks/strategies/situations/roles in ScalPL ("scalpel") in an
attempt to provide more familiarity and broaden the audience, but the
jury's still out on how much that helped, and some people have even told
me that the terminology change took some of the nerd-factor charm of
Software Cabling away.
I did have some questions about software cabling, for which I didn't see the
answer in ScalPL. Specifically, it appears that what I call a transaction
does not survive an I/O.
I'm not sure what you mean by "what I call a transaction", but I can
understand some of your confusion with I/O because the paper certainly
doesn't give it the coverage it merits, and I/O is one of ScalPL's
subtler points as it is in many function-based languages. First of all,
by definition, I/O means access to things outside of the plan, so such
access must be made via the plan's roles (and rolesets). Rather than
have the plan directly access external device and file buffers through
those roles (ouch), the recommended approach is to externally create an
instance of a sanctioned "file" class. That instance holds (on
instantiated places) the non-persistent aspects of the file (e.g. file
pointer, buffers), and has some of its instantiated visitors externally
bound to the desired physical file or device buffers. The plan
performing I/O then accesses that file instance/object from within the
plan via its roles. That is a big reason why instantiated visitors
exist at all.
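To make that a little more concrete, here is a rough sketch of the
general shape in plain Python (emphatically not ScalPL syntax; the
class, method, and file names are all made up for illustration). The
computation never touches the raw file or device buffers; it sees only
a wrapper object whose binding to the physical file is established
externally:

    # Rough analogy only, not ScalPL: the wrapper holds the file's
    # non-persistent state (position, buffer), standing in for the
    # instantiated places of a sanctioned "file" class, while the
    # binding to the physical file (the externally bound visitor)
    # is established outside the computation itself.
    class FileInstance:
        def __init__(self, os_path):
            self._fh = open(os_path, "rb")  # external binding
            self._buffer = b""              # non-persistent state

        def read_chunk(self, n=4096):
            self._buffer = self._fh.read(n)
            return self._buffer

    def plan(file_role):
        # The plan sees only its role; it never reaches the raw
        # device or file buffers directly.
        return file_role.read_chunk()

    plan(FileInstance("input.dat"))  # binding supplied from outside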
And if that doesn't make sense (yet), I'll try to make it clear in
upcoming revisions, or we can discuss it outside of comp.arch.
I don't see how it could read external data if all
of the data it needs must be there before it starts. I tried to work
through a simple debit/credit transaction and couldn't see how to do it.
Can you help me out with an explanation?
It depends on what you mean by "transaction" (which is where my
comp.distributed discussion with Marty Fouts seemed to break down a few
years back). A two-phase atomic transaction has lots of nice
properties, but is defined by the property that once it begins its
second phase, in which it gives up resources (and output can be
considered such), its first phase, in which it acquires resources (e.g.
input), must be over. Tasks in ScalPL are two-phase transactions, so one
task in ScalPL can get some input and produce some output but can't do
this repeatedly. If you want repeated I/O, you build more complex
(nested or strung) transactions from multiple tasks embodied into a
strategy, and would use places within the strategy to maintain any
required persistent state between/among them. I'm not sure whether that
answers your question, but again, maybe comp.arch isn't the place to get
in much more depth (but comp.distributed or offline could be).
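For what it's worth, here is a minimal sketch of that two-phase
discipline applied to a debit/credit, in plain Python rather than
ScalPL (the task and field names are invented). Each task acquires all
of its input before releasing any output, and a strategy strings the
tasks together, with a shared structure playing the part of a place
that carries persistent state between them:

    # Each "task" is two-phase: phase one acquires all inputs, phase
    # two releases outputs; it never returns for more input afterward.
    def debit_task(account, amount):
        balance = account["balance"]          # phase 1: acquire input
        return {"balance": balance - amount}  # phase 2: release output

    def credit_task(account, amount):
        balance = account["balance"]
        return {"balance": balance + amount}

    # The "strategy": tasks strung together, with the dict acting as
    # the place that holds persistent state between/among tasks.
    state = {"balance": 100}
    state = debit_task(state, 30)
    state = credit_task(state, 10)
    print(state["balance"])  # 80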
Also, since you point it out, what is the implementation status of ScalPL
and the associated editor?
All of the diagrams in the document were screenshots from the
Linux-based L23 editor. Significant progress was made on a
Windows-based L23. Neither of these could be considered products at
this time, and they may never be, but arrangements could probably be
made. Significant design and some coding has occurred for other tools
(e.g. runtime and debugging) to be integrated with L23, but since this
work has been self-funded since at least '91, it's been slow-going.
Yes. I may be wrong, but I am hard pressed to place pure data
parallelism (as would be exploited by a vector machine) and very
fine-grained parallelism (as exemplified by, say, the kind of
speculative execution that Andy Glew likes) on the same granularity
scale. They seem to be different types of parallelism rather than
different levels.
I would concur that speculative execution is different in that it's more
of a trade-off between the likelihood of using the speculated result
and the cost of the resources spent to get it. I personally never
considered vector parallelism as especially pure, though, and I'm
content with granularity and existing categorizations. I've seen too
many cases where categories get so narrow that common elegant solutions
become elusive.
I think there is some concurrence that the programmer's job should be to
express the algorithm in a way that makes at least some amount of the
potential parallelism apparent, and that some of that apparent parallelism
may not be beneficially exploited at runtime for some platforms. That
raises a few questions, like:
(1) What's the best way to express potential parallelism in a way that
    is efficient whether or not it is exploited?
(2) How much potential parallelism should be made apparent by the
    programmer?
(3) How much potential parallelism is too much, and just complicates the
    decision and/or specification of how much to ultimately exploit?
(4) Should a human help in the determination of how much of the
    parallelism to exploit (or conversely, how much to squelch) for a
    particular platform, and if so, how?
I personally believe that if (1) is addressed satisfactorily, then (2)
and (3) can take care of themselves--i.e. some of the parallelism is
revealed, the program runs satisfactorily on platforms L, M, and N,
which exploit no more than that amount, but when it is moved to platform
P, more must be revealed. It's essentially "just in time"
parallelization (hey, "Extreme Parallel Programming"!), which can work
well if (4) is designed correctly.
This avoids the fine-grain dataflow and/or functional language issues
where there may be oodles of parallelism revealed, so much that it's hard
to tell where to start in exploiting it effectively. (Some of my own ideas
on this are apparent in my recent postings, others are not.)
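As a toy illustration of what I have in mind for (1) (ordinary Python
here, not ScalPL, and the platform choices are only stand-ins): the
independent work is declared once through a map-style interface, and
the identical source runs serially on platforms that exploit none of
the revealed parallelism, or on a thread pool on platforms that exploit
some of it:

    from concurrent.futures import ThreadPoolExecutor

    def work(x):
        return x * x  # stands in for an independent unit of work

    inputs = range(8)

    # Platforms L, M, N: exploit none of the revealed parallelism.
    serial = list(map(work, inputs))

    # Platform P: exploit some of it, chosen at deployment time.
    with ThreadPoolExecutor(max_workers=4) as pool:
        parallel = list(pool.map(work, inputs))

    assert serial == parallel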
I tend to agree with that thesis.
Then maybe I didn't push it far enough. :-) I didn't even delve into
whether (1) makes the code less valuable in some other ways, so that
people want to maintain the "un-apparent parallelization" code even
while possessing the "apparent parallelization" code. My thesis (which
is by no means proven or even believed by many) is that the code with
apparent potential parallelization can have enough advantages other than
just level of parallelism that, with the proper tools, programmers can
find it as desirable (if not more so) to maintain than code without
apparent parallelization. (That may dictate the availability of open
source tools.) If that's true, there's little reason for programmers not
to produce all new code in the apparent parallelization form, as well as
to preserve all partial improvements they've made to existing code.
I have had some potential funders claim that a new model shouldn't even
be proposed without also proposing tools to convert dusty decks, but
trying to solve both of these problems in tandem is not only largely
impractical, it may be largely unnecessary. Even without such tools,
the above properties alone suggest that code will, over time, get
parallelized, instead of the current dusty deck approach where the
output (effectively object code) becomes useless when the platform
changes and the user is back to square one (the dusty deck).
-Dave