David DiNucci
2006-05-09 06:43:25 UTC
[Since you also posted a request for comments to comp.distributed, and
since this is part of such a response and is on topic there, I'm
cross-posting this there. Hope you don't mind.]
Your quoted text above, e.g. "The key underlying idea of dataflow" made
it sound much more general. Still, I have always been somewhat leery of
the black-and-white delineation of static and dynamic dataflow.
> This I think is the problem why we cannot narrow down the specifics. There
> is a clear, massive difference in architecture and complexity, but in performance
> as well, between static and dynamic dataflow systems. The method I have
> proposed, which differs from dataflow in a number of aspects, is
> intended to be completely dynamic in terms of dataflow terminology. To make
> it a practical system, the method allows defining all the necessary
> constraints (if any) locally within the definition of a single atom.
I don't understand this last sentence--e.g. what sorts of constraints,
and what you are calling an atom.
>> Ever heard of a Kahn-McQueen network?
> This further shows why we have to address the issue of static/dynamic
> dataflow to make it a meaningful discussion, since Kahn-McQueen was a kind
> of STATIC dataflow machine that does not address the dynamic dataflow issues
> at all.
I've never considered Kahn-McQueen (KM) as a machine, just as a
definition used in a theorem that states that when you have a bunch of
deterministic transforms connected by fixed sequential (i.e.
non-passing) channels, the results will be deterministic. It's sort of
a generalization of function composition. To the extent that some
dataflow models (though perhaps not yours) allow such fixed networks at
various places throughout the program, whether those models are static
or dynamic, it is a useful (if not somewhat obvious) result, and it's a
shame not to let programmers take advantage of it. If I understand,
it's not possible to build such a fixed network with stressflow.
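A minimal sketch of that theorem's setting (illustrative Python, not ScalPL or stressflow code): two deterministic transforms connected by fixed, sequential FIFO channels yield the same output no matter how the stage threads are scheduled.

```python
import queue
import threading

def stage(fn, inq, outq):
    # A deterministic transform reading one fixed input channel and
    # writing one fixed output channel (a non-passing FIFO).
    while True:
        x = inq.get()
        if x is None:        # end-of-stream marker
            outq.put(None)
            return
        outq.put(fn(x))

def run_network(data):
    # Fixed network: source -> (+1) -> (*2) -> sink.
    a, b, c = queue.Queue(), queue.Queue(), queue.Queue()
    t1 = threading.Thread(target=stage, args=(lambda x: x + 1, a, b))
    t2 = threading.Thread(target=stage, args=(lambda x: x * 2, b, c))
    t1.start()
    t2.start()
    for x in data:
        a.put(x)
    a.put(None)
    out = []
    while True:
        y = c.get()
        if y is None:
            break
        out.append(y)
    t1.join()
    t2.join()
    return out

# However the two stage threads interleave, the result is the ordered
# composition of the transforms, as in a Kahn network.
print(run_network([1, 2, 3]))  # [4, 6, 8]
```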
Another issue here is that the ScalPL semantics describe the possible
outcomes of any given ScalPL program, not necessarily how to get there.
(In fact, the F-Net semantics, upon which ScalPL is based, have been
described axiomatically as well as procedurally.) So, ScalPL does not
preclude optimizations which produce the required outcome in some more
parallel way than the obvious pipelined execution. Put another way,
using matching as you've described (where partial results pass one
another) can, in some circumstances at least, be used by a ScalPL
compiler or runtime even if the user program is expressed in an
easily-understood "static dataflow" Kahn-McQueen form.
> True, as with any other terms, they are sometimes used for different things, but I
> try to use the most prevalent one. To explain it better, I will use this
> super-trivial calculation example, again, where node A fires nodes
> B and C, which together fire node D.
> Static dataflow most often meant a situation where a single instantiation of
> a node would mostly be able to perform only one calculation at a time. This
> would mean that if some node B were to perform some operation on data X1,
> X2, and X3, the calculation would be performed by mostly the same computing
> resource, thus always producing B(X1), B(X2), B(X3).
> In dynamic dataflow, many instantiations of B can run completely
> simultaneously, often using completely separate computing resources,
> providing the full parallelism possible, but also meaning that results of
> calculations could arrive in any random order, say B(X2), B(X1), B(X3). This
> is the situation that required token tagging in dynamic dataflow machines.
> Without it, the results of such “dynamic” dataflow calculations from nodes
> B(X) and C(Y) would arrive in the wrong order at D, making it produce
> results based on mismatched data, say D(X2,Y1), D(X1,Y3), D(X3,Y2).
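The mismatch in this example, and the tag-based fix, can be sketched as follows (hypothetical code, not from any actual dataflow machine):

```python
# B's results arrive out of order; C's arrive in order.
b_tokens = [(2, "B(X2)"), (1, "B(X1)"), (3, "B(X3)")]   # (tag, value)
c_tokens = [(1, "C(Y1)"), (2, "C(Y2)"), (3, "C(Y3)")]

# Naive merge by arrival order: D gets mismatched pairs.
by_arrival = [(b[1], c[1]) for b, c in zip(b_tokens, c_tokens)]

# Tagged matching: D fires only when both tokens carrying the
# same tag have arrived.
pending = {}
fired = []
for tag, val in b_tokens + c_tokens:
    pending.setdefault(tag, []).append(val)
    if len(pending[tag]) == 2:            # set complete -> fire D
        fired.append(tuple(sorted(pending.pop(tag))))

print(by_arrival[0])  # mismatched: ('B(X2)', 'C(Y1)')
print(fired)          # correctly matched pairs
```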
Perhaps this is one reason that I avoid these terms and all of the
historical baggage that goes along with them. For example, a single
node in ScalPL can often be instantiated multiple times with the results
still arriving in order (to use your terminology). Is that static
because of the way the results come out, or dynamic because of the
multiple instantiations? And, if the programmer wants individual tasks
or entire strategies (dataflow graphs) to be instantiated multiple times
(e.g. once per input) to get more parallelism, that's fine, too. Is
each of those strategies static or dynamic? It's even possible to
instantiate entire strategies multiple times and still ensure that the
results arrive in order.
> The simple way to deal with such a problem, which I discuss in detail in my
> specification, is the solution where the target node D can be organized/seen not
> as a single node, but as an array of them. This in turn means that the entire
> simple complex B,C,D can now be instantiated as an array, meaning that we
> now have, say, N data-context instantiations of all nodes: N B nodes, N
> C nodes, and N D nodes. All N instantiations of the B nodes and all N
> instantiations of the C nodes can now be created at once and calculate
> simultaneously, and each instance of D fires as soon as a pair of results
> arrives at it. There is no danger of mismatched data because the only data
> arriving at D[k] can be Xk and Yk. In the stressflow description I refer to
> this as using the “this” pointer as the tag.
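A rough sketch of that array scheme (B and C here are invented stand-in transforms): because results land at index k by construction, no tag comparison is ever needed.

```python
from concurrent.futures import ThreadPoolExecutor

def B(x):          # stand-in transforms, invented for illustration
    return x * 10

def C(y):
    return y + 100

def D(bx, cy):
    return (bx, cy)

def run_array(xs, ys):
    # One (B, C, D) instance per index k. All B and C instances may
    # run concurrently; D[k] can only ever see B(xs[k]) and C(ys[k]),
    # so out-of-order completion cannot mismatch the data.
    with ThreadPoolExecutor() as pool:
        bs = list(pool.map(B, xs))   # map preserves input order
        cs = list(pool.map(C, ys))
    return [D(bs[k], cs[k]) for k in range(len(xs))]

print(run_array([1, 2, 3], [1, 2, 3]))
# [(10, 101), (20, 102), (30, 103)]
```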
Of course, virtually the same approach can be used with ScalPL, it's
just required less often. Whether you pipeline data through one fixed
network, or instantiate multiple networks and push different data
through each and with the results assigned to different entities, or a
combination of those, you (and runtime and debuggers, etc.) can exploit
the determinism of KM where fixed networks exist. Among other things,
this can have significant impact on the amount of tracing that needs to
be done for debugging/replay (as mentioned further in my dissertation).
> When this scheme is used with dataflow systems, this solution could be seen
> as replicating the same static dataflow machine N times, and not as a
> “dynamic” dataflow machine. The proponents of dynamic dataflow surely must
> have known that they could simply have N static machines instead of dealing
> with all the tagging issues.
I'm not sure what your point is, but it does seem quite sensible to me
to provide enough information that the platform (including compilers and
runtime) can optimize the program as either dynamic (with tagging) or as
independent (partitioned) sets of static graphs.
> In terms of stressflow, if the calculations are simple enough and only need
> the X, Y values that are passed as parameters as inputs to the calculations,
> the B and C nodes can have no operations in their stressed sections and do
> everything inside relaxed sections. For A and D it depends on what they
> connect to. No stressed section means that there is no waiting anywhere and
> that an optimizing compiler does not even have to generate any access locks.
> This means the full theoretically possible speed with negligible overhead.
Speaking of which: How do you coordinate shared data between the caller
and the callee? For example, if large data structures are being passed
from the caller, and potentially updated in the caller between each
call, must the callee's programmer choose between (a) performing all
reading/processing of the structure in the stressed section (so that
reading is finished by the time it returns) or (b) copying the structure
within the stressed section and then processing the copy in the relaxed
section to allow the caller to continue or (c) not copying at all and
just reading/processing the data in the relaxed section in hopes that
the caller won't modify it? If so, an optimal decision of which to do
depends not only on the callee's programmer knowing what the caller will
be doing with the data (since, for example, c is optimal if the caller
doesn't modify the data), but also on whether the callee and the caller
are on distributed processors (since the act of messaging the data from
caller to callee will automatically create a copy anyway).
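Option (b) might look roughly like the following sketch, where an event stands in for the stressed-section handoff (stressed/relaxed are stressflow's terms; the code shape is only a guess at the idea, not stressflow itself):

```python
import copy
import threading

shared = {"data": [1, 2, 3]}
copied = threading.Event()   # signals: stressed section finished
results = []

def callee():
    # Stressed section: runs while the caller is still held up on us.
    snapshot = copy.deepcopy(shared["data"])   # option (b): copy it
    copied.set()                               # release the caller
    # Relaxed section: proceeds concurrently. Safe even if the caller
    # now rewrites shared["data"], because we own the snapshot.
    results.append(sum(snapshot))

t = threading.Thread(target=callee)
t.start()
copied.wait()                # caller resumes once the copy is taken
shared["data"] = [0, 0, 0]   # caller updates between calls
t.join()
print(results)  # [6] -- computed from the snapshot, not the update
```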
In ScalPL, the programmer specifies what sorts of access each plan will
need to interface data, and the platform can optimize that access and
scheduling depending on the circumstances at the time. In other words,
the mechanism of accomplishing the sharing (e.g. call by name or value,
by passing messages or through shared memory) is not hardwired into the
program. Only the policy of what needs to be shared is.
> As good as this solution is, it is not applicable to every situation. We may
> need to feed the results calculated at D to, say, a single hardware port.
> True, this could be done by another node E, driven by all the D nodes,
> outputting the D results as soon as they appear but without leaving any
> unfilled D behind. The scheme is excellent in a number of applications.
> However, there are some situations where this scheme would not be a good
> solution.
In ScalPL, a simple solution is to just have all of the producers (D or
E) write to a single place which represents (or feeds) the hardware
port. Of course, a ScalPL strategy can only "see" the interface that
its parent wants it to see, so it cannot see, for example, whether such
an interface (also called a "visitor") place feeds directly to a device
or is virtualized to whatever extent, such as being further processed or
merged by other outer layers.
Another ScalPL solution is for an outside layer to pass in an object
(i.e. another plan) which handles all interaction with the port (rather
like cin, cout in C++).
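That object-passing approach is, in effect, plain dependency injection; a trivial sketch (invented names, not ScalPL syntax):

```python
class PortWriter:
    # Stand-in for a plan handed in by an outer layer. The inner
    # strategy sees only this interface, not the device behind it.
    def __init__(self):
        self.written = []

    def write(self, value):
        self.written.append(value)

def strategy(results, port):
    # The strategy funnels every result through the single port
    # object it was given, much like writing to cout.
    for r in results:
        port.write(r)

port = PortWriter()
strategy(["D0", "D1", "D2"], port)
print(port.written)  # ['D0', 'D1', 'D2']
```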
> Suppose that the intermediate results somewhere are not simple-type data, but
> massive data tables. This solution would allocate a new massive set of
> data with every instance of A, B, C, and D. Or imagine a situation where we
> have N sets of input data but only a small fraction of N possible valid
> outputs, resulting in a large number of unused D's. Or suppose that Y does not
> change nearly as often as X, meaning that recalculating C(Y) for every
> change of X is a total waste.
Of course, ScalPL has all of the flexibility to handle these situations,
and leaves the platform much more room to optimize them, not to mention
making it much more obvious in the program because all these situations
are diagramed graphically. This foreknowledge of data and control flow
is important from both software engineering and performance aspects.
With ScalPL, one doesn't need to devise abstract representations of how
the parts of the program interact, because the program *is* a drawing of
how the pieces interact.
> This is the situation where having multiple, completely simultaneous
> “dynamic-dataflow”-type execution instances within the same data and node
> connection contexts can be of benefit. This is the situation for which the
> “ordered relay mechanism” can be used (which uses the pre-reservation
> calls). This mechanism is completely unnecessary in any other situation.
> But I have discovered it to be an extremely useful tool for a number of
> applications, from almost straightforward emulation of electrical/electronic
> circuits to implementing redundant/fault-resistant systems. It is also far
> superior to any token-tagging mechanism, which is far less flexible and
> requires time-consuming ordering of incoming tokens by tag.
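As far as the description above goes, the ordered-relay effect (results re-emitted in invocation order, without per-token tags) resembles slot reservation: each instance reserves the next output slot before starting and fills it whenever it finishes. A single-threaded sketch with invented names, not the actual mechanism:

```python
class OrderedRelay:
    # Producers reserve output slots up front, in call order; completed
    # results are released strictly in reservation order.
    def __init__(self):
        self.next_slot = 0   # next reservation ticket to hand out
        self.next_emit = 0   # next slot allowed to emit
        self.done = {}       # finished but not yet emitted results
        self.out = []

    def reserve(self):
        slot = self.next_slot
        self.next_slot += 1
        return slot

    def complete(self, slot, value):
        self.done[slot] = value
        while self.next_emit in self.done:   # drain in slot order
            self.out.append(self.done.pop(self.next_emit))
            self.next_emit += 1

relay = OrderedRelay()
slots = [relay.reserve() for _ in range(3)]  # the pre-reservations
relay.complete(slots[2], "r2")               # finishes out of order
relay.complete(slots[0], "r0")
relay.complete(slots[1], "r1")
print(relay.out)  # ['r0', 'r1', 'r2'] -- reservation order restored
```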
ScalPL is fully general, so all of the situations you describe here are
easily handled. If you question how well it accomplishes one or more of
them compared to stressflow, maybe you can describe those specific ones
more precisely and we can go from there.
>> What is incorrect? Unless you clarify, I'll guess that you mean my last
>> sentence.
> The description of the D node behavior in stressflow was completely
> incorrect. First, no node execution instance is ever generated until it can
> begin executing. The only slowdowns that can happen are a result of trying
> to call another atom. This is why I wrote about the inherent full
> “dynamicity” of the stressflow model.
> The reservations and pre-reservations happen in the lock access mechanism.
> For a multi-input node, the lock simply keeps track of incoming requests,
> and the one that completes the set fires the target node.
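As described, such a lock accumulates one request per input and fires the target exactly when the set completes. A minimal sketch with invented names:

```python
import threading

class MultiInputLock:
    # Collects one request per named input; fires the target node
    # when the last missing input arrives.
    def __init__(self, inputs, target):
        self.inputs = set(inputs)
        self.target = target
        self.pending = {}
        self.mutex = threading.Lock()

    def request(self, name, value):
        with self.mutex:
            self.pending[name] = value
            fire = set(self.pending) == self.inputs  # set complete?
            if fire:
                args, self.pending = self.pending, {}
        if fire:
            self.target(args)   # fire outside the lock

fired = []
lock = MultiInputLock({"b", "c"}, fired.append)
lock.request("b", 7)   # nothing fires yet: "c" still missing
lock.request("c", 8)   # set complete -> target fires
print(fired)  # [{'b': 7, 'c': 8}]
```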
Thanks. So it sounds as though you *do* have constructs to perform
the matching within the runtime (i.e. the lock access mechanism). That's
good. Still, a principal difference from ScalPL appears to be that
stressflow defaults to no matching, and requires matching for
determinism whenever data from different sources merges, while ScalPL
defaults to determinism by virtue of Kahn-McQueen, but provides ways to
build constructs which are not. I would argue that this implies that
ScalPL provides well-defined behavior in cases where stressflow does
not. For example, the potential results that a program produces can be
directly dependent upon how "out of order" the intermediate results can
become, and this out-of-orderness is a direct part of a ScalPL
specification, whereas I don't think it is in a stressflow specification.
I know you've written much more that's probably worth responding to some
day, but I do have other work to do. Maybe we've both provided at least
a glimpse into each other's development methods. I am tempted to
construct a ScalPL plan for each of the major stressflow constructs to
illustrate the primary differences and similarities, but I rather doubt
that will come soon. It appears that, in general, there are four
principal stressflow segments to model: (1) The caller (precall), (2)
the caller (postcall), (3) the callee (stressed), and (4) the callee
(unstressed). In ScalPL, these would most naturally be 4 different
tasks, with the obvious dependences between them, but it sometimes
sounds as though 1, 3, and 2 could all be one task with only 4 being a
separate task. I will need to understand more fully the circumstances
(if any) under which the flow from 1 to 3 to 2 can block.
Thanks for your time,
-Dave
---
http://www.elepar.com/