Discussion:
Not enough parallelism in programming
David C. DiNucci
2005-09-08 18:23:40 UTC
OK, so if programmers can think in 2-D, what do they need in a language to
reasonably express those thoughts?
"Executable graphs" is the obvious answer.
So what don't you like about existing languages that do executable
graphs? One such is Mentat. It did OK for large-grain problems, but as
usual there are a host of issues making it less than practical for
fine-grain parallelism.
I can't speak for Robert, but I think Mentat has been given enough time
and investment that I can speak freely for myself now.

I first became aware of Mentat in a talk by Andrew Grimshaw (at HPDC?)
back in '95 or so. He asked the audience whether they knew of any other
language where graphs were first class, and although many languages
could be interpreted as such (LISP? functional?), I was the sole one to
raise my hand, citing Software Cabling. Later, in the course of the IPG
work at NASA, I had the opportunity to attend a tutorial of the Legion
work out at UVa (which was probably when I first met you). I was
disappointed to find that, under the covers, the entities comprising an
application communicated via messages. In fact, at least back then, it
was up to the application programmer to explicitly copy/marshal the
program data to and from communication buffers. I later heard Andrew
explain at a more application-oriented meeting that it was best not to
rely on that level of communication even for application messages in
highly-parallel apps--i.e. better to divvy up the app into large
Legion-managed entities, each of which used traditional message passing
within. I interpreted that to mean we were back to square one as far as
portability was concerned.

Legion's most interesting points, in my mind, were (are?) in its
security/distributed object model. Its computational model does (or
did) not lend itself to very scalable or efficient programs. Judging
from comments from the team (including yours above, "OK for large-grain
problems"), I'm not sure you/they disagree. Not to mention, I don't
think I've ever actually seen a formal graphical syntax for Mentat
programs, though they may technically be regarded as graphs.

I'd be happy to be brought up to date if any of this is now wrong.

-Dave

[I've added comp.distributed, though I assume you will remove it from
any reply, based on past experience.]
David C. DiNucci
2005-09-12 11:04:48 UTC
I was disappointed to find that, under the covers, the entities
comprising an application communicated via messages.
That's the implementation Andrew's group did, but it wasn't the only
way to implement the thing. For example, if you like buying big SMP
machines instead of distributed memory clusters, you could have done a
threaded implementation, which probably would have supported much more
fine grain computations. You do recall the Hydra distributed OS, yes?
That other OS that people reinvent, all the while moaning about Multics?
Asynchronous message passing is fine for high-latency communication
(e.g. between processors), because it's relatively good at hiding
latency, but is often not fine for low-latency communication (e.g.
between entities on the same processor), because it often adds needless
overhead. The situation is almost exactly reversed for (shared) memory
operations. These are hardly secrets. The result is that when people
are left with only message passing for inter-entity communication and
they're looking for performance, they understandably tend to try to
build the entities on processor boundaries, or at least to severely
limit communication between them if they might not be on processor
boundaries. That means that to use such a model effectively, the
number of entities must correspond roughly to the number of processors,
and that (in my book) means the model is not scalable.
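
To make the copy-overhead point concrete, here is a minimal sketch (my own
illustration in plain C++, not code from Mentat, Legion, or any other
system) of the difference between handing a co-located consumer a message
and handing it a reference:

// Hypothetical sketch: cost of local message passing vs. shared memory.
#include <cstring>
#include <iostream>
#include <vector>

struct Message {
    std::vector<char> buffer;   // marshalled copy of the payload
};

// "Message passing" between co-located entities: the payload is copied
// into the message and out again, even though no network is crossed.
void send_local(const std::vector<double>& payload, Message& msg) {
    msg.buffer.resize(payload.size() * sizeof(double));
    std::memcpy(msg.buffer.data(), payload.data(), msg.buffer.size());
}

void receive_local(const Message& msg, std::vector<double>& out) {
    out.resize(msg.buffer.size() / sizeof(double));
    std::memcpy(out.data(), msg.buffer.data(), msg.buffer.size());
}

// "Shared memory" between co-located entities: the consumer just gets a
// reference; no copies, but this only works within one address space.
const std::vector<double>& share_local(const std::vector<double>& payload) {
    return payload;
}

int main() {
    std::vector<double> result(1 << 20, 3.14);   // ~8 MB of produced data

    Message msg;
    std::vector<double> copy;
    send_local(result, msg);                     // copy #1
    receive_local(msg, copy);                    // copy #2

    const std::vector<double>& shared = share_local(result);  // zero copies
    std::cout << copy.size() << " " << shared.size() << "\n";
}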

As for relying on shared memory/threads instead of messages on an SMP
system, I suppose you could make the argument that that would scale as
far as SMP platforms scale, which isn't very far. It also has the opposite
problem: programmers would shy away from (or more likely be forbidden to
use) inter-entity communication if the entities did not share memory.
I'm not sure I'd call Hydra a success story, but that's beside the point.

To make a truly scalable model, you need to have dynamic granularity,
which implies being able to automatically alter the balance between
memory and messages as the number of processors (or high- and
low-latency interconnects) changes. HPF and others try for source code
scalability--to have a language processor which emits code properly
granularized for a particular platform. I have been developing one
approach (in LGDF2, F-Nets, CDS, Software Cabling, and now ScalPL) to
accomplish object code scalability without significantly messing with
existing languages or compilers.
I later heard Andrew
explain at a more application-oriented meeting that it was best not to
rely on that level of communication even for application messages in
highly-parallel apps-
That's good practical advice for his implementation. I guess he could
have lied instead...
I respected his candor, opting for accuracy and helping people to get
their work done rather than hyping. But it seems to me that either
Mentat is scalable, as you say, or users should use some other tools if
they want to scale, as he said.
Its computational model does (or
did) not lend itself to very scalable or efficient programs.
That's a huge leap when you're talking about a system that you don't
understand very well. In fact, if your program did little enough
communication, it scaled and was quite efficient. If you did too much,
this statement is correct. But it all depends on what you're looking
for in a language.
By that definition, scalability doesn't mean much, as it is completely
independent of the computational model and/or method of communication.
As long as the programmer is in control of where the communication goes,
it's always theoretically possible to find some application where the
ratio of computation to communication is high enough for the
communication overhead to be swamped by the computation, but that
doesn't make the model "scalable". What one is "looking for in a
language" for scalability is one that can express a given
parallelizable application in such a way that it can run efficiently on
either a small or large number of processors--i.e. one that introduces as
few artifacts of its own as possible (above and beyond traits of the
algorithm itself and the platform on which it runs) pulling the speedup
curve away from the theoretical maximum. Unneeded copying caused by local
messages is
one such introduced artifact, and dynamic granularity is how to avoid
introducing it.
There are ways in which few other systems could touch Mentat /
Legion's efficiency -- for example, it was extremely easy to
engineer transfers without middlemen; if you did
% cp a b
on Legion file objects, the contents of "a" were not copied to the
machine on which you typed the command.
I specifically said that Legion had its good points, but I can only
guess that you consider this one because there's some other platform
that inserts two unnecessary copies. Isn't this at least as efficient?:
scp systema:a systemb:b
Not to mention, I don't
think I've ever actually seen a formal graphical syntax for Mentat
programs, though they may technically be regarded as graphs.
I don't know what you mean by "formal" in this situation, but one of
our undergraduates whipped out a program which showed the various
graphs fired by Mentat & Legion programs. Then again, you can draw
them yourself once you understand how they work. The neatest thing
about Mentat was that you didn't have to think graphs to program in
it. You just invoked methods on classes, just like C++.
That you can express some aspect of a program as a graph is very
different from saying that the program is the graph or vice versa. From
your wording above, I can't tell whether you think that the graphs shown
by this student's program were complete enough to be considered
semantically identical to the textual version. Then again, you wrote:
So what don't you like about existing languages that do executable
graphs? One such is Mentat.
So maybe I just don't know what you meant by "doing executable graphs".
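
For concreteness, here is roughly the style I understand you to be
describing -- ordinary method invocations from which a runtime could
extract a dataflow graph. This is only my own plain-C++ caricature, not
actual Mentat/MPL syntax:

// Hypothetical sketch, NOT Mentat/MPL code: the programmer writes plain
// method calls; a graph-building runtime could treat each call as a node
// and the values flowing between calls as the arcs.
#include <cstddef>
#include <vector>

class Matrix {
public:
    explicit Matrix(std::size_t n) : n(n), data(n * n, 0.0) {}

    Matrix multiply(const Matrix& other) const {   // candidate graph node
        Matrix r(n);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k)
                for (std::size_t j = 0; j < n; ++j)
                    r.data[i*n + j] += data[i*n + k] * other.data[k*n + j];
        return r;
    }

    Matrix add(const Matrix& other) const {        // candidate graph node
        Matrix r(n);
        for (std::size_t i = 0; i < n * n; ++i)
            r.data[i] = data[i] + other.data[i];
        return r;
    }

private:
    std::size_t n;
    std::vector<double> data;
};

Matrix compute(const Matrix& a, const Matrix& b, const Matrix& c) {
    // Textually this is just C++.  Read as a graph, a.multiply(b) and
    // a.multiply(c) are independent nodes that could fire in parallel;
    // the add() node depends on both of their results.
    Matrix ab = a.multiply(b);
    Matrix ac = a.multiply(c);
    return ab.add(ac);
}

int main() {
    Matrix a(64), b(64), c(64);
    Matrix result = compute(a, b, c);
    (void)result;
}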

For the record, ScalPL strategies are indeed annotated graphs, from
beginning to end, specifying data flow, control flow, hierarchy, object
instantiation, scope, etc. ScalPL doesn't specify a graphic
representation of tasks (data transformations)--that's not its job--but
it does provide an annotation language for specifying them, which relies
heavily on other existing languages.
[I've added comp.distributed, though I assume you will remove it from
any reply, based on past experience.]
You could have mentioned the reason: It's against netiquette to post
to groups that you don't read. I don't read comp.distributed, ergo...
I'll let you explain your own reasons. I was just warning you and
others that the newsgroups line for this thread was likely to be
relatively dynamic...and still is.

-Dave
David C. DiNucci
2005-09-13 08:44:38 UTC
Post by David C. DiNucci
Asynchronous message passing is fine for high-latency communication
(e.g. between processors), because it's relatively good at hiding
latency, but is often not fine for low-latency communication (e.g.
between entities on the same processor), because it often adds needless
overhead. The situation is almost exactly reversed for (shared) memory
operations. These are hardly secrets.
Right. The mystery is why you can't look beyond an initial
implementation to look at the ideas of a new system.
I can't look beyond an initial implementation unless somebody at least
tells me how they would address apparent shortcomings in subsequent
implementations. So, tell me, what sort of changes would you make to
facilitate variable granularity, or to otherwise address the issues I've
raised? Have modules communicate via some form of distributed shared
memory? Synchronous messages? CDS? Some other approach?
Post by David C. DiNucci
That means that to use such a model effectively, the
number of entities must correspond roughly to the number of processors,
and that (in my book) means the model is not scalable.
That's a new definition of scalability that I've never heard before.
Under that definition, none of the fastest machines in the world run
scalable software.
Yes, I could argue that their software doesn't scale, but their data
does, so they just throw lots of different data at the same non-scalable
software. In fact, if their data doesn't scale, they change the
software from an explicit method to an implicit one so that the data
does scale so that they won't have to scale their software. Some
methods (Monte Carlo, parameter search, etc.) may even create scalable
data out of the blue, apply a non-scalable algorithm to it, and call the
result scalable software. The approach is undoubtedly productive, but I
think it would be misleading if anyone got the idea that the presence of
these apps and machines implied anything positive about humanity's
ability to produce scalable algorithms.
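
As a toy illustration of that distinction (my own sketch, not anyone's
production code): the solver below is a fixed serial kernel; adding
processors never makes a single solve faster, it only lets more
independent parameter values be run at once.

// Sketch of "the data scales, the software doesn't": solve() is a serial,
// non-scalable kernel; parallelism comes only from running it on many
// independent parameter values.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>

double solve(double parameter) {                 // fixed serial kernel
    double x = parameter;
    for (int i = 0; i < 1000000; ++i)
        x = std::cos(x) + parameter;
    return x;
}

int main() {
    const unsigned nproc =
        std::max(1u, std::thread::hardware_concurrency());

    std::vector<double> params(1000);            // the *data* is what scales
    for (std::size_t i = 0; i < params.size(); ++i)
        params[i] = 0.001 * static_cast<double>(i);

    std::vector<double> results(params.size());
    std::vector<std::thread> workers;
    for (unsigned p = 0; p < nproc; ++p) {
        workers.emplace_back([&, p] {
            for (std::size_t i = p; i < params.size(); i += nproc)
                results[i] = solve(params[i]);   // same kernel, new data
        });
    }
    for (std::thread& t : workers) t.join();

    std::cout << "ran " << params.size() << " independent solves\n";
}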
Post by David C. DiNucci
HPF and others try for source code
scalability--to have a language processor which emits code properly
granularized for a particular platform.
But HPF isn't scalable by your definition -- the natural way to
compile HPF is to have one process per processor.
The source code typically shows no signs of having one process per
processor; only the object code/runtime does. That was precisely the distinction
I was making. I could have used a better example than HPF, given my
data parallelism diatribe above.
The fundamental design of Mentat is scalable by your definition. The
initial implementation was not. There's a difference between design
and implementation.
Scalable communication is not just an implementation issue. To
facilitate scalable communication--i.e. variable granularity--you need
at least either hardware (MMU) support for DSM, preferably paired with
some way to perform object-granularity data transfers, or some sort of software
trick to do the same thing (a la Treadmarks), or some compile-time
knowledge about what kinds of access each independent module will be
making to communicated data. The software tricks have not fared well
performance-wise, and MMU support can't be relied on for portability.
That leaves compile-time knowledge, and that means either significant
global source analysis or user declarations. In my work, I have opted
for the latter, but in either case, it impacts the design.
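
To illustrate what I mean by user declarations -- and this is only a
generic sketch of the idea in plain C++, not ScalPL or CDS notation --
each module declares up front how it will touch a shared datum, which
leaves a runtime free to hand over a pointer when sharing is safe and to
copy (or transmit) when it is not:

// Generic sketch (not ScalPL/CDS): declared access rights let a runtime
// choose between sharing a datum directly and giving a module a copy.
#include <iostream>
#include <vector>

enum class Access { ReadOnly, Modify };

struct Datum {
    std::vector<double> values;
};

// The runtime's decision point.  In this toy version, a co-located
// read-only module shares the original; anything else gets a private
// copy (a real system would marshal and transmit across nodes instead).
Datum* bind(Datum& original, Access declared, bool same_node, Datum& scratch) {
    if (same_node && declared == Access::ReadOnly)
        return &original;                 // safe to share directly
    scratch.values = original.values;     // otherwise, hand over a copy
    return &scratch;
}

int main() {
    Datum field{std::vector<double>(1024, 1.0)};
    Datum scratch_a, scratch_b;

    // Module A declared read-only access and is on the same node: shared.
    Datum* a_view = bind(field, Access::ReadOnly, true, scratch_a);
    // Module B declared modify access: it works on its own copy here.
    Datum* b_view = bind(field, Access::Modify, true, scratch_b);

    std::cout << (a_view == &field) << " " << (b_view == &field) << "\n"; // 1 0
}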
The example I gave (cp) is an extremely common program which inserts
extra copies. I don't understand how you can reply to my example and
be guessing about what platform exhibits the bad behavior I mention.
On systems I use, cp is used for local copies and scp (or rcp) is used
when one or more is remote. So, as far as I know, my cp does not insert
extra copies.
Post by David C. DiNucci
That you can express some aspect of a program as a graph is very
different from saying that the program is the graph or vice versa.
But I never said programs were graphs! That's a distinction you
brought up.
Then I'll drop it. I was apparently confused when you responded to a
proposal for executable graphs (which to me implies programs which are
graphs) by asking what was wrong with Mentat, which "does executable
graphs".
OK, this is my last reply; I don't post to groups that I don't read,
and it's impolite to repeatedly add groups that a person you're
holding a conversation with doesn't read.
I was under the impression that you were offended by "off-topic"
threads, and now you say that if one conversant in a thread doesn't read
an on-topic group, then manners dictate that the thread be posted ONLY
in the off-topic group.

TTFN,
-Dave
David C. DiNucci
2005-09-13 22:32:37 UTC
So, the questions are
- Is there an existing language that expresses parallelism at the right
level for multi-core/multithreaded core to take advantage of?
Doubtful.
I tend to agree. I don't know of any, but I thought someone else might.
I would claim that ScalPL is one, but it partially depends on what one
thinks it means for a language to "exist".
I had downloaded the ScalPL paper and am working my way through it. I was
intrigued by software cabling when you talked about it here some months ago.
I went through that documentation and, quite frankly, got all bogged down in
the terminology and lost the forest for the trees. I am not sure that
changing all the names (as you did for ScalPL) helps that problem, but the
inclusion of examples did help some. BTW, I am not claiming that your
documentation is bad; it may very well be my fault. I have to spend more
time on it, which I will.
Thanks for the feedback. I've spoken to several others the last few
days who have similar comments and would like to see more examples. I
figured there was more work to do on this paper (as suggested by the
"Prelim" in the version number), but decided to get this version out,
mostly for those who have been waiting patiently, but also to let
feedback help steer the rest of the writing. I replaced the
modules/chips/boards/sockets/cables analogy in Software Cabling with
plans/tasks/strategies/situations/roles in ScalPL ("scalpel") in an
attempt to provide more familiarity and broaden the audience, but the
jury's still out on how much that helped, and some people have even told
me that the terminology change took some of the nerd-factor charm of
Software Cabling away.
I did have some questions about software cabling, for which I didn't see the
answer in ScalPL. Specifically, it appears that what I call a transaction
does not survive an I/O.
I'm not sure what you mean by "what I call a transaction", but I can
understand some of your confusion with I/O because the paper certainly
doesn't give it the coverage it merits, and I/O is one of ScalPL's
subtler points as it is in many function-based languages. First of all,
by definition, I/O means access to things outside of the plan, so such
access must be made via the plan's roles (and rolesets). Rather than
have the plan directly access external device and file buffers via these
roles (ouch), the recommended approach is to externally create an
instance of a sanctioned "file" class which holds (on instantiated
places) the non-persistent aspects of the file (e.g. file pointer,
buffers) and which has some of its instantiated visitors externally
bound to the desired physical file or device buffers, and then for the
plan performing I/O to access that file instance/object from within the
plan via its roles. That is a big reason for why instantiated visitors
even exist.

And if that doesn't make sense (yet), I'll try to make it clear in
upcoming revisions, or we can discuss it outside of comp.arch.
I don't see how it could read external data if all
of the data it needs must be there before it starts. I tried to work
through a simple debit/credit transaction and couldn't see how to do it.
Can you help me out with an explanation?
It depends on what you mean by "transaction" (which is where my
comp.distributed discussion with Marty Fouts seemed to break down a few
years back). A two-phase atomic transaction has lots of nice
properties, but is defined by the property that, once it begins its
second phase where it is giving up resources (and output can be
considered thus), it's first phase where it acquires resources (e.g.
input) must be over. Tasks in ScalPL are two-phase transactions, so one
task in ScalPL can get some input and produce some output but can't do
this repeatedly. If you want repeated I/O, you build more complex
(nested or strung) transactions from multiple tasks embodied into a
strategy, and would use places within the strategy to maintain any
required persistent state between/among them. I'm not sure whether that
answers your question, but again, maybe comp.arch isn't the place to get
in much more depth (but comp.distributed or offline could be).
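
To make the two-phase constraint concrete, here is a plain-C++ analogy
(my own, not ScalPL syntax) of a debit/credit split across two tasks,
with the intermediate decision carried in explicit state between them:

// Plain-C++ analogy (not ScalPL): each "task" is two-phase -- it receives
// everything it needs as arguments and hands back everything it produces;
// it never goes back for more input after producing output.  Repeated I/O
// is built by chaining tasks through explicit intermediate state.
#include <iostream>

struct Account { double balance; };

// Intermediate state carried between the two tasks (a "place", loosely).
struct TransferState { double amount; bool approved; };

// Task 1: read the source account, decide whether the debit is allowed.
TransferState check_funds(const Account& from, double amount) {
    return TransferState{amount, from.balance >= amount};
}

// Task 2: given the decision and both accounts, apply the transfer.
// All of its input arrives up front; all of its output leaves at the end.
void apply_transfer(const TransferState& st, Account& from, Account& to) {
    if (st.approved) {
        from.balance -= st.amount;
        to.balance   += st.amount;
    }
}

int main() {
    Account checking{100.0}, savings{25.0};
    TransferState st = check_funds(checking, 40.0);   // first task fires
    apply_transfer(st, checking, savings);            // second task fires
    std::cout << checking.balance << " " << savings.balance << "\n";  // 60 65
}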
Also, since you point it out, what is the implementation status of ScalPL
and the associated editor?
All of the diagrams in the document were screenshots from the
Linux-based L23 editor. Significant progress was made on a
Windows-based L23. Neither of these could be considered products at
this time, and they may never be, but arrangements could probably be
made. Significant design and some coding has occurred for other tools
(e.g. runtime and debugging) to be integrated with L23, but since this
work has been self-funded since at least '91, it's been slow-going.
Yes. I may be wrong, but I am hard pressed to put pure data parallelism (as
would be exploited by a vector machine) versus very fine grained parallelism
as exemplified by say the kind of speculative execution that Andy Glew likes
on a granularity scale. They seem more to be different types than different
levels of parallelism.
I would concur that speculative execution is different in that it's more
of a trade-off between likelihood of using the speculated result and the
cost of using the resources to get that result. I personally never
considered vector parallelism as especially pure, though, and I'm
content with granularity and existing categorizations. I've seen too
many cases where categories get so narrow that common elegant solutions
become elusive.
I think there is some concurrence that the programmer's job should be to
express the algorithm in a way that makes at least some amount of the
potential parallelism apparent, and that some of that apparent parallelism
may not be beneficially exploited at runtime for some platforms. That
raises a few questions, like: (1) What's the best way to express potential
parallelism in a way that is efficient whether or not it is exploited? (2)
How much potential parallelism should be made apparent by the programmer?
(3) How much potential parallelism is too much, and just complicates the
decision and/or specification of how much to ultimately exploit? (4)
Should a human help in the determination of how much of the parallelism to
exploit (or conversely, how much to squelch) for a particular platform,
and if so, how?
I personally believe that if 1 is addressed satisfactorily, then 2 and 3
can take care of themselves--i.e. some of the parallelism is revealed, the
program runs satisfactorily on platforms L, M, and N, which exploit no
more than that amount, but when it is moved to platform P, more must be
revealed. It's essentially "just in time" parallelization (hey, "Extreme
Parallel Programming"!), which can work well if 4 is designed correctly.
This avoids the fine-grain dataflow and/or functional language issues
where there may be oodles of parallelism revealed, so much that it's hard
to tell where to start in exploiting it effectively. (Some of my own ideas
on this are apparent in my recent postings, others are not.)
I tend to agree with that thesis.
Then maybe I didn't push it far enough. :-) I didn't even delve into
whether 1 makes the code less valuable in some other ways, so that
people want to maintain the "un-apparent parallelization" code even while
possessing the "apparent parallelization" code. My thesis (which is by no
means proven or even believed by many) is that the code with apparent
potential parallelization can have enough advantages other than just
level of parallelism that, with the proper tools, programmers can find
it as (if not more) desirable to maintain than code without apparent
parallelization. (That may dictate the availability of open source
tools.) If that's true, there's little reason for programmers not to
produce all new code in the apparent parallelization form, as well as to
preserve all partial improvements they've made to existing code.

I have had some potential funders claim that a new model shouldn't even
be proposed without also proposing tools to convert dusty decks, but
trying to solve both of these problems in tandem is not only largely
impractical, it may be largely unnecessary. Even without such tools,
the above properties alone suggest that code will, over time, get
parallelized, instead of the current dusty deck approach where the
output (effectively object code) becomes useless when the platform
changes and the user is back to square one (the dusty deck).

-Dave
