Post by Henry
Post by David DiNucci
I'd be interested to hear your perspectives. For example, if "is
doomed" really reflects your thoughts, I presume that for some
definitions of grid (e.g. grid = cluster), you don't believe this, so
what is the "least complex" definition of grid that you feel "is doomed"?
My general view is roughly covered by my suggested acronym
(which Google now refuses to show me, even though it posted
it twice!) along the lines of
"Great Research; Implementation Disaster" -
where "disaster" may be too strong (see above) and "research"
should really be "idea" or "concept", but it's an acronym...
It's there: groups.google.com/group/comp.distributed/msg/8b032b1d036b6fd7
Post by Henry
My perspective comes from involvement with running a production site
within a large, global Grid project.
Good. Interesting wording, though: production site within a grid. I'm
not sure if you believe that necessarily implies that it's part of a
production grid, but I'll not press for a clarification.
Post by Henry
Post by David DiNucci
Thanks to some (old and new) friends, I had the opportunity to hover
about at the SC'05 conference in Seattle last year, where I ran into
some people who were heavy into grids about the same time I was ('97,
'98), and I heard some discussion about how they now believed their
sights had been set too high.
My feeling is that the difficulties of actual deployment of a platform
useful to end-users have been grossly underestimated. People tend
to think of scalability issues as "will we need to track more than 255
of those", but there is a whole separate set of difficulties involved
in trying to get software correctly deployed by sysadmins around
the world (whose native language may not be English) with often
poor documentation, on a variety of hardware and software platforms
(neither "Windows" nor "Linux" exist), with differing network
connectivity, access and firewall technology and even with differing
personal or institutional priorities to getting things working.
It doesn't help that the thing is very interconnected so that resources
all have to talk to each other, so misconfigured sites can affect
others directly (as well as the usual "black-hole" problem).
(Compare with e.g. BOINC where there is a single server, and a
client is either successfully connected or not).
Note that the original Arpanet project (before my time) expressly
tried to avoid this by centrally issuing identical interface nodes
(TIP?) to each site.
I guess you're talking about the IMPs (and TIPs, Terminal IMPs),
starting as Honeywell DDP-516s and then to H-316s. (Before my time,
too, but that's what Google says.)
It's quite true that such homogeneity of platform is not really an
option for grids, but having each customer site upgrading/deploying
independently would seem to be the other extreme on the spectrum.
Certainly traditional platform (e.g. system software and/or hardware)
vendors already have to deal with deploying their wares to diverse
environments. Doesn't it make more sense to funnel grid infrastructure
through them, piggybacking onto an existing supply and service chain?
Dealing with N vendors must be substantially easier than dealing with
N*(large number) of customer sites. Certainly, inserting middlemen has
its own costs and problems, but may be the lesser of evils.
Post by Henry
I'm not sure of the extent to which the Architecture makes this
problem worse - I think deployment is just naturally Very Hard.
A second problem is that the Grid is implemented as a stack of
layers of middleware, which is probably the Right way: if a
particular layer or service has scalability or functionality problems
then it can be replaced by something else. (It doesn't help that
as actual users start using the thing, they find that they actually
want to use it in a totally different way to what was expected.)
Except "Something Else" will be closer to developer code (and
developer documentation!) and will take a while to get deployed
consistently _correctly_ - I would budget 1 or 2 _years_ for this,
and that's a loooong time in Grid projects.
This sounds like the first problem, only complicated by the fact that
user sites are lured into the idea that they can customize, only to find
out that their customizations must be phased in with everyone else's.
Post by Henry
So we end up with a lot of subtly different sites trying to install
stuff we don't really understand!
The problem with this: if user jobs are to succeed > 90% of
the time, how reliable must every layer[*] and every step[**] in
the process be?
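To put rough numbers on that question: if a job must pass through n steps, each succeeding independently with probability p, end-to-end success is p^n, so each step must be p = target^(1/n) reliable. A back-of-the-envelope sketch (my own illustration; independence is assumed, which real grid failures rarely honor):

```python
# Per-step reliability required for a target end-to-end job
# success rate, assuming n independent steps: target ** (1/n).

def per_step_reliability(target: float, steps: int) -> float:
    """Probability each of `steps` stages must individually have
    for the whole pipeline to succeed with probability `target`."""
    return target ** (1.0 / steps)

# e.g. 90% end-to-end success over pipelines of varying depth
# (discovery, matchmaking, transfer, queueing, execution,
# data staging, ... as in the footnoted examples):
for n in (6, 10, 20):
    print(f"{n} steps -> each must be "
          f"{per_step_reliability(0.9, n):.2%} reliable")
```

With ten steps, every single one must succeed about 98.95% of the time just to hit 90% overall, which is a demanding target for independently deployed, subtly different sites.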
I'm not sure what a "less complex" Grid would look like - if
you take out the many sites around the globe aspect (or
"spanning multiple administrative domains" in fancy-speak)
you get back to batch-queueing jobs among a local group
of systems - i.e. the 1960s!
I guess that answers my original question. It also seems to imply that
you believe one can get something that qualifies as a grid by extending
a batch queueing system to span administrative domains. I guess it's
true that the term is often used that way these days. It is certainly a
stretch from how we originally thought of it, but I think reality
quickly set in for those doing the work. It's probably no fun working
on a grid project for years unless you can have something, at some
point, that you can call a grid. In that respect, I am willing to
concede that initial definitions were too ambitious.
(It reminds me of a visit I made back to NASA Ames a few years after I
left. One of the grid managers explained that they were phasing out the
NASA Information Power Grid to some new NASA Grid, in part because the
IPG was focussed primarily on computational services. But, in fact, the
IPG was never proposed to focus on those, they just happened to be the
first things that got implemented.)
Post by Henry
[IMO the single key concept behind "Grid Computing" is
the Virtual Organisation - this notion encapsulates both
single sign-on across real organisations/resources and also
lots of collaborative possibilities.
The cynic would still point out that it's just Unix' groups
writ large!]
I'm not such a cynic, and I've had friends (even non-grid immersed ones)
who gravitated toward the V.O. term when they saw it. To me, the
concept seems a bit vague, probably due to ambiguity of the term
Organization, but I won't begrudge its usage.
Post by Henry
The problem or challenge is to get all the interconnectedness
to work, reliably and usefully.
Hth
Henry
* User submits job: needs information system to find resources,
matchmaking to pick one, job transfer to resource e.g. GRAM,
maybe local queueing and transfer to execution node,
job progress monitoring...
** A user executable starts running on a node. The first thing it
may want to do is locate and download input data: a whole
separate Grid transaction.
Although these may describe how the layers and steps work in a
Globus/OGSA environment, it is far from the only way, and I would argue
(and have been arguing since before GGF and OGSA existed) that it's not
the best way. These examples demonstrate some reasons why.
-Dave