Discussion:
RFC - New Object-Oriented Method of Parallel Programming (Shorter Version)
David DiNucci
2006-05-09 06:43:25 UTC
[Since you also posted a request for comments to comp.distributed, and
since this is part of such a response and is on topic there, I'm
cross-posting this there. Hope you don't mind.]
Your quoted text above, e.g. "The key underlying idea of dataflow" made
it sound much more general. Still, I have always been somewhat leery of
the black-and-white delineation of static and dynamic dataflow.
This, I think, is why we cannot narrow down the specifics. There is a
clear, massive difference in architecture and complexity, as well as
performance, between static and dynamic dataflow systems. The method I have
proposed, while differing from dataflow in a number of respects, is intended
to be completely dynamic in dataflow terminology. To make it a practical
system, the method allows all the necessary constraints (if any) to be
defined locally, within the definition of a single atom.
I don't understand this last sentence--e.g. what sorts of constraints,
what you are calling an atom.
Ever heard of a Kahn-McQueen network?
This further shows why we have to address the issue of static/dynamic
dataflow to make it a meaningful discussion, since Kahn-McQueen was a kind
of STATIC dataflow machine that does not address the dynamic dataflow issues
at all.
I've never considered Kahn-McQueen (KM) as a machine, just as a
definition used in a theorem that states that when you have a bunch of
deterministic transforms connected by fixed sequential (i.e.
non-passing) channels, the results will be deterministic. It's sort of
a generalization of function composition. To the extent that some
dataflow models (though perhaps not yours) allow such fixed networks at
various places throughout the program, whether those models are static
or dynamic, it is a useful (if not somewhat obvious) result, and it's a
shame not to let programmers take advantage of it. If I understand,
it's not possible to build such a fixed network with stressflow.

Another issue here is that the ScalPL semantics describe the possible
outcomes of any given ScalPL program, not necessarily how to get there.
(In fact, the F-Net semantics, upon which ScalPL is based, have been
described axiomatically as well as procedurally.) So, ScalPL does not
preclude optimizations which produce the required outcome in some more
parallel way than the obvious pipelined execution. Put another way,
using matching as you've described (where partial results pass one
another) can, in some circumstances at least, be used by a ScalPL
compiler or runtime even if the user program is expressed in an
easily-understood "static dataflow" Kahn-McQueen form.
True, like any other terms, these are sometimes used to mean different
things, but I try to use the most prevalent meaning. To explain it better, I
will again use this super-trivial calculation example, where node A fires
nodes B and C, which together fire node D.
Static dataflow most often meant a situation where a single instantiation of
a node could perform only one calculation at a time. This means that if some
node B were to perform some operation on data X1, X2, and X3, the
calculations would be performed by mostly the same computing resource, thus
always producing B(X1), B(X2), B(X3) in that order.
In dynamic dataflow, many instantiations of B can execute completely
simultaneously, often using entirely separate computing resources, providing
the fullest parallelism possible, but also meaning that the results of the
calculations could arrive in any order, say B(X2), B(X1), B(X3). This is the
situation that required token tagging in dynamic dataflow machines. Without
it, the results of such “dynamic” dataflow calculations from nodes B(X) and
C(Y) would arrive at D in the wrong order, making it produce results based
on mismatched data, say D(X2,Y1), D(X1,Y3), D(X3,Y2).
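In rough C++ terms (my own illustration only, not part of the stressflow
specification or of any particular dataflow machine; Matcher, arriveFromB,
fireD and so on are made-up names), the tagging a dynamic dataflow machine
needs at D looks something like this: each token carries a tag k, and D
fires only once both halves with the same tag are present, whatever the
arrival order.

#include <map>
#include <iostream>

// Matching at node D by tag: pairs B(Xk) with C(Yk) regardless of
// arrival order, so mismatched data such as D(X2,Y1) cannot happen.
struct Matcher {
    std::map<int, double> fromB;   // tag -> B(Xk) waiting for its partner
    std::map<int, double> fromC;   // tag -> C(Yk) waiting for its partner

    void arriveFromB(int tag, double b) {
        auto it = fromC.find(tag);
        if (it != fromC.end()) { fireD(tag, b, it->second); fromC.erase(it); }
        else fromB[tag] = b;
    }
    void arriveFromC(int tag, double c) {
        auto it = fromB.find(tag);
        if (it != fromB.end()) { fireD(tag, it->second, c); fromB.erase(it); }
        else fromC[tag] = c;
    }
    void fireD(int tag, double b, double c) {
        std::cout << "D(X" << tag << ",Y" << tag << ") = " << b + c << "\n";
    }
};

int main() {
    Matcher m;
    // tokens arriving "out of order", as in B(X2), B(X1), B(X3):
    m.arriveFromB(2, 2.0); m.arriveFromC(1, 10.0);
    m.arriveFromB(1, 1.0); m.arriveFromC(3, 30.0);
    m.arriveFromB(3, 3.0); m.arriveFromC(2, 20.0);
}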
Perhaps this is one reason that I avoid these terms and all of the
historical baggage that goes along with them. For example, a single
node in ScalPL can often be instantiated multiple times with the results
still arriving in order (to use your terminology). Is that static
because of the way the results come out, or dynamic because of the
multiple instantiations? And, if the programmer wants individual tasks
or entire strategies (dataflow graphs) to be instantiated multiple times
(e.g. once per input) to get more parallelism, that's fine, too. Is
each of those strategies static or dynamic? It's even possible to
instantiate entire strategies multiple times and still ensure that the
results arrive in order.
The simple way to deal with such a problem, which I discuss in detail in my
specification, is to organize/see the target node D not as a single node but
as an array of them. This in turn means that the entire simple B, C, D
complex can now be instantiated as an array, meaning that we now have, say,
N data-context instantiations of all the nodes: N B nodes, N C nodes, and N
D nodes. All N instantiations of the B nodes and all N instantiations of the
C nodes can now be created at once and calculate simultaneously, and each
instance of D fires as soon as a pair of results arrives at it. There is no
danger of mismatched data, because the only data that can arrive at D[k] are
Xk and Yk. In the stressflow description I refer to this as using the
“this” pointer as a tag.
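In rough outline (ordinary C++ threads rather than stressflow syntax;
Context, runB, runC, runD are made-up names), the array scheme amounts to
something like this: each of the N contexts pairs only its own B and C
results, so the instance identity plays the role of the tag.

#include <vector>
#include <future>
#include <iostream>

struct Context {                         // one B,C,D instantiation
    double runB(double x) { return 2 * x; }
    double runC(double y) { return 3 * y; }
    double runD(double b, double c) { return b + c; }
};

int main() {
    const int N = 4;
    std::vector<Context> ctx(N);
    std::vector<std::future<double>> results;
    for (int k = 0; k < N; ++k) {
        // every context runs fully in parallel with the others; within a
        // context, D waits for exactly its own B and C results (Xk, Yk)
        results.push_back(std::async(std::launch::async, [&ctx, k] {
            auto fb = std::async(std::launch::async, &Context::runB, &ctx[k], double(k));
            auto fc = std::async(std::launch::async, &Context::runC, &ctx[k], double(k));
            return ctx[k].runD(fb.get(), fc.get());
        }));
    }
    for (int k = 0; k < N; ++k)
        std::cout << "D[" << k << "] = " << results[k].get() << "\n";
}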
Of course, virtually the same approach can be used with ScalPL; it's just
required less often. Whether you pipeline data through one fixed network, or
instantiate multiple networks and push different data through each with the
results assigned to different entities, or use a combination of those, you
(and the runtime, debuggers, etc.) can exploit the determinism of KM
wherever fixed networks exist. Among other things, this can have a
significant impact on the amount of tracing that needs to be done for
debugging/replay (as mentioned further in my dissertation).
When this scheme is used with dataflow systems, this solution could be seen
as replicating the same static dataflow machine N times, and not as a
“dynamic” dataflow machine. The proponents of dynamic dataflow surely must
have known that they could simply have N static machines instead of dealing
with all the tagging issues.
I'm not sure what your point is, but it does seem quite sensible to me
to provide enough information that the platform (including compilers and
runtime) can optimize the program as either dynamic (with tagging) or as
independent (partitioned) sets of static graphs.
In terms of stressflow, if the calculations are simple enough and only need
the X, Y values that are passed as parameters as inputs to their
calculations, the B and C nodes can have no operations in the stressed
sections and do everything inside the relaxed sections. For A and D it
depends on what they connect to. No stressed section means that there is no
waiting anywhere and that an optimizing compiler does not even have to
generate any access locks. This means the full theoretically possible speed
with negligible overhead.
Speaking of which: How do you coordinate shared data between the caller
and the callee? For example, if large data structures are being passed
from the caller, and potentially updated in the caller between each
call, must the callee's programmer choose between (a) performing all
reading/processing of the structure in the stressed section (so that
reading is finished by the time it returns) or (b) copying the structure
within the stressed section and then processing the copy in the relaxed
section to allow the caller to continue or (c) not copying at all and
just reading/processing the data in the relaxed section in hopes that
the caller won't modify it? If so, an optimal decision of which to do
depends not only on the callee's programmer knowing what the caller will
be doing with the data (since, for example, c is optimal if the caller
doesn't modify the data), but also on whether the callee and the caller
are on distributed processors (since the act of messaging the data from
caller to callee will automatically create a copy anyway).

In ScalPL, the programmer specifies what sorts of access each plan will
need to interface data, and the platform can optimize that access and
scheduling depending on the circumstances at the time. In other words,
the mechanism of accomplishing the sharing (e.g. call by name or value,
by passing messages or through shared memory) is not hardwired into the
program. Only the policy of what needs to be shared is.
As good as this solution is, it is not applicable to every situation. We may
need to feed the results calculated at D to, say, a single hardware port.
True, this could be done by another node E, driven by all the D nodes,
outputting the D results as soon as they appear but without leaving any
unfilled D behind. The scheme is excellent in a number of applications.
However, there are situations where it would not be a good solution.
In ScalPL, a simple solution is to just have all of the producers (D or
E) write to a single place which represents (or feeds) the hardware
port. Of course, a ScalPL strategy can only "see" the interface that
its parent wants it to see, so it cannot see, for example, whether such
an interface (also called a "visitor") place feeds directly to a device
or is virtualized to whatever extent, such as being further processed or
merged by other outer layers.

Another ScalPL solution is for an outside layer to pass in an object
(i.e. another plan) which handles all interaction with the port (rather
like cin, cout in C++).
Suppose that the intermediate results somewhere are not simple-type data but
massive data tables. This solution would allocate a new massive set of data
with every instance of A, B, C, and D. Or imagine a situation where we have
N sets of input data but only a small fraction of N possible valid outputs,
resulting in a large number of unused D’s. Or suppose that Y does not change
nearly as often as X, meaning that recalculating C(Y) for every change of X
is a total waste.
Of course, ScalPL has all of the flexibility to handle these situations,
and leaves the platform much more room to optimize them, not to mention
making it much more obvious in the program because all these situations
are diagramed graphically. This foreknowledge of data and control flow
is important from both software engineering and performance aspects.
With ScalPL, one doesn't need to devise abstract representations of how
the parts of the program interact, because the program *is* a drawing of
how the pieces interact.
This is a situation where having multiple, completely simultaneous
“dynamic-dataflow”-style execution instances within the same data and node
connection contexts can be of benefit. This is the situation for which the
“ordered relay mechanism” (which uses the pre-reservation calls) can be
used. This mechanism is completely unnecessary in any other situation. But I
have found it to be an extremely useful tool for a number of applications,
from almost straightforward emulation of electrical/electronic circuits to
implementing redundant/fault-resistant systems. It is also far superior to
any token tagging mechanism, which is far less flexible and requires
time-consuming ordering of incoming tokens by tag.
ScalPL is fully general, so all of the situations you describe here are
easily handled. If you question how well it accomplishes one or more of
them compared to stressflow, maybe you can describe those specific ones
more precisely and we can go from there.
What is incorrect? Unless you clarify, I'll guess that you mean my last
sentence.
The description of the D node behavior in stressflow was completely
incorrect. First, no node execution instance is ever generated until it can
begin executing. The only slowdowns that can happen are a result of trying
to call another atom. This is why I wrote about the inherent full
“dynamicity” of the stressflow model.
The reservations and pre-reservations happen on the lock access mechanism.
For a multi-input node, the lock simply keeps track of incoming requests,
and the one that completes the set fires the target node.
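In plain C++ terms (an illustration of the idea only, not the actual
stressflow implementation; TwoInputLock, offerX, offerY, fireD are made-up
names), such a lock for a two-input node D does nothing more than remember
pending requests and fire on the call that completes the set:

#include <mutex>
#include <optional>
#include <iostream>

struct TwoInputLock {
    std::mutex m;
    std::optional<double> x, y;          // pending inputs, at most one each

    void offerX(double v) { deliver(x, v); }
    void offerY(double v) { deliver(y, v); }

private:
    void deliver(std::optional<double>& slot, double v) {
        std::unique_lock<std::mutex> lk(m);
        slot = v;
        if (x && y) {                    // this call completed the set
            double b = *x, c = *y;
            x.reset(); y.reset();
            lk.unlock();                 // target fires outside the lock
            fireD(b, c);
        }
    }
    void fireD(double b, double c) {
        std::cout << "D(" << b << "," << c << ")\n";
    }
};

int main() {
    TwoInputLock lockForD;
    lockForD.offerY(7.0);                // first half of the set arrives
    lockForD.offerX(3.0);                // completes the set, fires D(3,7)
}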
Thanks. So it sounds as though you *do* have constructs to perform
the matching within the runtime (i.e. lock access mechanism). That's
good. Still, a principal difference from ScalPL appears to be that
stressflow defaults to no matching, and requires matching for
determinism whenever data from different sources merges, while ScalPL
defaults to determinism by virtue of Kahn-McQueen, but provides ways to
build constructs which are not. I would argue that this implies that
ScalPL provides well-defined behavior in cases where stressflow does
not. For example, the potential results that a program produces can be
directly dependent upon how "out of order" the intermediate results can
become, and this out-of-orderness is a direct part of a ScalPL
specification, whereas I don't think it is in a stressflow specification.

I know you've written much more that's probably worth responding to some
day, but I do have other work to do. Maybe we've both provided at least
a glimpse into each other's development methods. I am tempted to
construct a ScalPL plan for each of the major stressflow constructs to
illustrate the primary differences and similarities, but I rather doubt
that will come soon. It appears that, in general, there are four
principal stressflow segments to model: (1) The caller (precall), (2)
the caller (postcall), (3) the callee (stressed), and (4) the callee
(unstressed). In ScalPL, these would most naturally be 4 different
tasks, with the obvious dependences between them, but it sometimes
sounds as though 1, 3, and 2 could all be one task with only 4 being a
separate task. I will need to understand more fully the circumstances
(if any) under which the flow from 1 to 3 to 2 can block.

Thanks for your time,
-Dave
---
http://www.elepar.com/
Slawomir J.
2006-05-09 21:54:53 UTC
Post by David DiNucci
Since you also posted a request for comments to comp.distributed, and
since this is part of such a response and is on topic there, I'm
cross-posting this there. Hope you don't mind.
No problem whatsoever.
Post by David DiNucci
This, I think, is why we cannot narrow down the specifics. There is a
clear, massive difference in architecture and complexity, as well as
performance, between static and dynamic dataflow systems. The method I
have proposed, while differing from dataflow in a number of respects, is
intended to be completely dynamic in dataflow terminology. To make it a
practical system, the method allows all the necessary constraints (if any)
to be defined locally, within the definition of a single atom.
I don't understand this last sentence--e.g. what sorts of constraints,
what you are calling an atom.
The constraints within an atom correspond directly to how static or dynamic,
in dataflow terms, that atom's work can be. In great simplification:
everything that needs to be deterministic (static-dataflow-like) is placed
within the stressed section. Everything that is to be dynamic is placed
within the relaxed section. Everything that needs to be dynamic but still
kept in order uses ordered calls out of the relaxed section. (Ordered calls
out of the stressed section make no sense.)
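A loose sketch of how I think of it, using ordinary C++ threads and locks
rather than stressflow syntax (Atom, call, access_lock and the other names
are made up): the stressed section is the serialized, deterministic part,
and the only part the caller stays coupled to; the relaxed section then
runs freely, here working on a copy taken while stressed.

#include <mutex>
#include <thread>
#include <vector>
#include <iostream>

// One atom; the bookkeeping is deliberately simple and not itself safe
// for concurrent callers. It only illustrates the two sections.
struct Atom {
    std::mutex access_lock;              // serializes the stressed section
    std::vector<std::thread> relaxed;    // relaxed executions still in flight

    void call(const std::vector<double>& shared_input) {
        std::vector<double> copy;
        {   // stressed section: deterministic, one execution at a time
            std::lock_guard<std::mutex> lk(access_lock);
            copy = shared_input;         // take what is needed, then let go
        }
        // relaxed section: fully "dynamic", overlaps the caller and other calls
        relaxed.emplace_back([data = std::move(copy)] {
            double sum = 0;
            for (double v : data) sum += v;
            std::cout << "relaxed work done, sum = " << sum << "\n";
        });
    }

    ~Atom() { for (auto& t : relaxed) t.join(); }
};

int main() {
    Atom b;
    std::vector<double> input{1, 2, 3};
    b.call(input);     // caller continues as soon as the stressed part returns
    input[0] = 42;     // safe: the relaxed section works on its own copy
}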
Post by David DiNucci
I've never considered Kahn-McQueen (KM) as a machine, just as a definition
used in a theorem that states that when you have a bunch of deterministic
transforms connected by fixed sequential (i.e. non-passing) channels, the
results will be deterministic. It's sort of a generalization of function
composition. To the extent that some dataflow models (though perhaps not
yours) allow such fixed networks at various places throughout the program,
whether those models are static or dynamic, it is a useful (if not
somewhat obvious) result, and it's a shame not to let programmers take
advantage of it. If I understand, it's not possible to build such a fixed
network with stressflow.
Sure it is. To do it, the stressflow program has to be special-cased to
consist of atoms that:

1. Return no values.

2. Pass needed parameters by value

3. Have all their code in the stressed section (thus removing any “dynamicity”).
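For illustration only (again ordinary C++ rather than stressflow syntax,
with made-up names AtomB, AtomD, fire), two such special-cased atoms, with
no return value, parameters taken by value, and all code stressed, chain
into a fixed, deterministic pipeline in which each atom handles a single
firing at a time:

#include <mutex>
#include <iostream>

// Atom D: no return value, all work inside the stressed section.
struct AtomD {
    std::mutex lock;
    void fire(double v) {
        std::lock_guard<std::mutex> lk(lock);   // stressed: one firing at a time
        std::cout << "D received " << v << "\n";
    }
};

// Atom B: parameter taken by value, forwards downstream, also entirely
// within its stressed section.
struct AtomB {
    std::mutex lock;
    AtomD* next = nullptr;
    void fire(double x) {
        std::lock_guard<std::mutex> lk(lock);   // stressed: one firing at a time
        next->fire(2 * x);
    }
};

int main() {
    AtomD d;
    AtomB b;
    b.next = &d;
    b.fire(1.0);       // each firing fully completes before the next begins
    b.fire(2.0);
}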
Post by David DiNucci
Another issue here is that the ScalPL semantics describe the possible
outcomes of any given ScalPL program, not necessarily how to get there.
(In fact, the F-Net semantics, upon which ScalPL is based, have been
described axiomatically as well as procedurally.) So, ScalPL does not
preclude optimizations which produce the required outcome in some more
parallel way than the obvious pipelined execution. Put another way, using
matching as you've described (where partial results pass one another) can,
in some circumstances at least, be used by a ScalPL compiler or runtime
even if the user program is expressed in an easily-understood "static
dataflow" Kahn-McQueen form.
That can always be said: some “behind the scenes magic” can be used to
improve performance. The devil is always in the details of how the hard
reality is dealt with. Dynamic execution, in the sense of fully parallel
firing of the same node connection/arc instance, will always result in a
situation where the multiple results arrive at the output arc out of order,
and something, somehow, has to reorder them or stall the fast ones so the
slower ones can catch up. The call pre-reservation is nice because it simply
reserves a place in line for delivering the output as the first operation,
which is guaranteed to be executed in the right order. When the actual call
is ready, the pre-reserved place is used. If the pre-reserved place reaches
the head of the lock reservation queue before the data is ready, the target
stalls until the reserving task produces the data. This guarantees strict
deterministic order without limiting dynamic execution and without tagging
and collecting results in sorted queues.
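Roughly, in plain C++ (my own analogy with made-up names such as
OrderedRelay, reserve, deliver, not the stressflow code itself), the
pre-reservation amounts to taking a sequence ticket before doing the work
and then delivering strictly in ticket order, so fast producers wait for
slower, earlier ones instead of the target sorting tagged tokens:

#include <mutex>
#include <condition_variable>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>

class OrderedRelay {
    std::mutex m;
    std::condition_variable cv;
    long next_ticket = 0;   // next place in line to hand out (pre-reservation)
    long next_to_fire = 0;  // ticket whose delivery is due next
public:
    long reserve() {                           // the pre-reservation call
        std::lock_guard<std::mutex> lk(m);
        return next_ticket++;
    }
    void deliver(long ticket, double value) {  // the actual call, kept in order
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return ticket == next_to_fire; });
        std::cout << "target receives " << value << " (ticket " << ticket << ")\n";
        ++next_to_fire;
        cv.notify_all();
    }
};

int main() {
    OrderedRelay relay;
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        long t = relay.reserve();              // reserve a place in line first
        workers.emplace_back([&relay, t, i] {
            // simulate calculations that finish out of order
            std::this_thread::sleep_for(std::chrono::milliseconds(40 * (4 - i)));
            relay.deliver(t, i * 1.5);
        });
    }
    for (auto& w : workers) w.join();
}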
Post by David DiNucci
Perhaps this is one reason that I avoid these terms and all of the
historical baggage that goes along with them. For example, a single node
in ScalPL can often be instantiated multiple times with the results still
arriving in order (to use your terminology). Is that static because of
the way the results come out, or dynamic because of the multiple
instantiations? And, if the programmer wants individual tasks or entire
strategies (dataflow graphs) to be instantiated multiple times (e.g. once
per input) to get more parallelism, that's fine, too. Is each of those
strategies static or dynamic? It's even possible to instantiate entire
strategies multiple times and still ensure that the results arrive in
order.
Dynamic dataflow means unlimited firings per same arc (or connection
context) without waiting for the previous firing to complete. It is
generally assumed that a well-constructed/optimized static machine can do,
statistically speaking, above 1 and below 2 firings per same arc, meaning
that the next firing can happen when the previous one is, say, halfway
through processing.

Multiple instantiations of a whole or part of a system are therefore not
considered dynamic, because they use separate arcs (connection context
instances).
Post by David DiNucci
When this scheme is used with dataflow systems, this solution could be
seen as replicating the same static dataflow machine N times, and not as
a “dynamic” dataflow machine. The proponents of dynamic dataflow surely
must have known that they could simply have N static machines instead of
dealing with all the tagging issues.
I'm not sure what your point is, but it does seem quite sensible to me to
provide enough information that the platform (including compilers and
runtime) can optimize the program as either dynamic (with tagging) or as
independent (partitioned) sets of static graphs.
I have no problem with it, except for the fact that an efficient method is
still needed to do it at the low level. My project was more about the
low-level implementation scheme than about high-level syntax using it.
Post by David DiNucci
In terms of stressflow, if the calculations are simple enough and only
need the X, Y values that are passed as parameters as inputs to their
calculations, the B and C nodes can have no operations in the stressed
sections and do everything inside the relaxed sections. For A and D it
depends on what they connect to. No stressed section means that there is
no waiting anywhere and that an optimizing compiler does not even have to
generate any access locks. This means the full theoretically possible
speed with negligible overhead.
Speaking of which: How do you coordinate shared data between the caller
and the callee? For example, if large data structures are being passed
from the caller, and potentially updated in the caller between each call,
must the callee's programmer choose between (a) performing all
reading/processing of the structure in the stressed section (so that
reading is finished by the time it returns) or (b) copying the structure
within the stressed section and then processing the copy in the relaxed
section to allow the caller to continue or (c) not copying at all and just
reading/processing the data in the relaxed section in hopes that the
caller won't modify it? If so, an optimal decision of which to do depends
not only on the callee's programmer knowing what the caller will be doing
with the data (since, for example, c is optimal if the caller doesn't
modify the data), but also on whether the callee and the caller are on
distributed processors (since the act of messaging the data from caller to
callee will automatically create a copy anyway).
(a) or (b) – never (c), which should be detected by a compiler as an error.
Anyway, a strict, puritan dataflow-type model can always be used, with the
needed data passed as call parameters, which already handles the copying.
The other tricks are simply matters of avoiding unnecessary copying, of
decomposing and recomposing the data for parallel processing logically but
not physically, etc. The idea was to give a whole bag of tricks for the sake
of complete efficiency, not to hide the silverware because someone could
hurt himself with it.
Post by David DiNucci
In ScalPL, the programmer specifies what sorts of access each plan will
need to interface data, and the platform can optimize that access and
scheduling depending on the circumstances at the time. In other words,
the mechanism of accomplishing the sharing (e.g. call by name or value, by
passing messages or through shared memory) is not hardwired into the
program. Only the policy of what needs to be shared is.
As I said before, the goal of my project was to provide a model that
directly translates into an implementation, so that it could be used for
performance programming at both high and very low levels. The goal was to
create the C++ of parallel programming, not the Pascal or Smalltalk of
parallel programming. In spite of many pure, “safe” structural methods of
parallel programming being proposed, most commercial parallel programming is
done with low-level tools or tools based on 50-year-old languages like
Fortran. The reason behind this, in my view, is the utopian puritanism
behind most of the structural parallel programming methods. Puritanism
always carries a hidden cost; in the case of most puritan parallel
programming methods, “guaranteed” fool-proof data consistency was always
based on passing copies of entire sets of the data being worked on as
“tokens” or messages. It worked and was “safe”, but for “real” programming
there was C++ and ugly non-structural methods of parallelism.
Post by David DiNucci
Thanks. So it sounds as though you *do* have constructs to perform the
matching within the runtime (i.e. lock access mechanism). That's good.
Still, a principal difference from ScalPL appears to be that stressflow
defaults to no matching, and requires matching for determinism whenever
data from different sources merges, while ScalPL defaults to determinism
by virtue of Kahn-McQueen, but provides ways to build constructs which are
not. I would argue that this implies that ScalPL provides well-defined
behavior in cases where stressflow does not. For example, the potential
results that a program produces can be directly dependent upon how "out of
order" the intermediate results can become, and this out-of-orderness is a
direct part of a ScalPL specification, whereas I don't think it is in a
stressflow specification.
Full “dynamic” parallel execution will produce results out of order; there
are no miracles. There is no timing “determinism” by some mythical virtue
that happens by itself among tasks that run at the same time. That is a
timing perpetuum mobile. If they are free to start almost together and run
as fast as they can, they will sometimes come to the finish out of order. So
the question still is: how are they kept/forced into the right order at the
finish line (only when specifically necessary) without being slowed down at
the start? Criticizing a working implementation method on the basis that it
can all be avoided altogether by some mystical magic is rather silly.
Post by David DiNucci
I know you've written much more that's probably worth responding to some
day, but I do have other work to do. Maybe we've both provided at least a
glimpse into each other's development methods. I am tempted to construct
a ScalPL plan for each of the major stressflow constructs to illustrate
the primary differences and similarities, but I rather doubt that will
come soon. It appears that, in general, there are four principal
stressflow segments to model: (1) The caller (precall), (2) the caller
(postcall), (3) the callee (stressed), and (4) the callee (unstressed).
In ScalPL, these would most naturally be 4 different tasks, with the
obvious dependences between them, but it sometimes sounds as though 1, 3,
and 2 could all be one task with only 4 being a separate task. I will
need to understand more fully the circumstances (if any) under which the
flow from 1 to 3 to 2 can block.
This is making it far more complex than it is. The caller creates a callee
and cooperates with it (sort of) while the callee is in its stressed
section. Parting ways inside the callee, and not at the point of making the
call, allows the amount of free "dynamicity" of the callee to be controlled
smoothly.



The ultimate test will be to test-drive both methods, I think, and I am
looking forward to it. As one of the next steps I will provide complete
examples with benchmarks for different hardware. Of course, I will say
something here on the newsgroups once I am ready for that. One way to go
about it is to propose some specific parallel programming jobs to be solved
as benchmarking tests, by anybody and any method.



Thanks for your interest and your time and Very Best Regards



Slawomir J. ***@stressflow.com
