comments

Tom Anderson (tom@emigrant)
Fri, 7 Aug 1998 10:43:23 -0700 (PDT)

My plan is to get comments from you, then send it out a bit
more widely to get comments from Ed, Corbato, Wetherall, McCanne, etc.,
and then to recruit participants (see below), and then to approach
the industrial partners.

tom
------
The xBone: Combining High Performance Communication and
Computation for Wide Area Distributed Systems and Networking Research

1. Introduction

We propose to develop an international infrastructure for novel research in
wide area distributed systems and applications. Today, wide area systems
research is limited by the lack of a general-purpose testbed for conducting
long-term experiments under real traffic load; without such a testbed, the
field is confined to paper design studies and small-scale
demonstrations, without the reality check of real users. The MBone (the
multicast backbone) has shown the value of a wide area testbed for
attracting users to new applications (such as Internet-based
teleconferencing), as well as the new problems that appear when systems
are deployed on the wide scale. A huge amount of research has been
motivated and enabled by the MBone, leading directly to new products in
industry. A "paper MBone" would simply not have been effective. The
MBone, however, was painstakingly put together piece by piece; it is hard
to imagine how the research community could do something of the same
scale more than once or twice, despite the many applications (new
telecollaboration tools, new approaches to web caching, new name services,
and new security architectures, to name a few) that could benefit from
widespread deployment. By analogy with the MBone, we call the proposed
testbed the xBone, for the "anything-backbone".

Our proposal is for a university, industrial, and government partnership to
place a rack of roughly twenty PC's at 50-100 geographically separate sites,
with dedicated, reliable gigabit bandwidth between sites and to large
numbers of early adopters. A single PC per site might be enough to support
a single experiment such as the MBone, but we would like to use the
infrastructure to support many simultaneous experiments, each running
continuously to attract real users. Proposals to use the framework would be
chosen by a program committee based on the potential for impact, the
benefit of demonstrating a system under long-term use, and the ability to
attract users. We explicitly do not envision this infrastructure being used as
a poor-man's parallel computer; geographically distributed computation is
required to optimize latency, bandwidth, and availability to dispersed end
clients, but geographic distribution is a hindrance to effective parallel
computing.

We are seeking contributions from emerging networking bandwidth
providers (such as Abilene, SuperNet, and HSCC), networking switch
providers (e.g., to light a lambda provided by Abilene), PC and workstation
vendors (such as Intel and Sun), government (e.g., to fund glue hardware
and software), and university operations (for connectivity to the local
gigaPOP, as well as hardware installation and maintenance). On the
software side, our plan is for the system to be completely remotely operated
and self-managed, with secure remote loading of software, remote
debugging and rebooting, and self-configuration. Most of the needed
software exists today in various forms, although it has not been put together
for this purpose. For example, DARPA's ongoing Active Networks
program is developing software to share geographically distributed
resources among multiple applications; we expect to use their work in the
long term as it becomes robust, although in the short term, to get started,
we plan to provide each experiment with dedicated resources (e.g., a PC at every
site).
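
To make the remote-operation idea concrete, the following is a minimal
sketch (Python, purely illustrative) of how a control PC might accept a new
experiment image pushed over the Internet: verify a signature computed with
a per-site key before staging the image for installation on one of the
experimental PC's. The key handling, file layout, and function names here
are assumptions for illustration, not a design commitment.

  import hashlib, hmac, os, tempfile

  # Per-site shared secret, distributed out of band; a placeholder value.
  SITE_KEY = b"per-site shared secret"

  def verify_image(image, signature):
      # Accept the image only if it carries a valid HMAC-SHA256 signature.
      expected = hmac.new(SITE_KEY, image, hashlib.sha256).hexdigest()
      return hmac.compare_digest(expected, signature)

  def stage_image(image, signature, target_host):
      # Write a verified experiment image to a staging area for one PC;
      # the control PC would then reboot target_host into the new image.
      if not verify_image(image, signature):
          raise ValueError("signature mismatch; refusing to install")
      staging_dir = tempfile.mkdtemp(prefix="xbone-" + target_host + "-")
      path = os.path.join(staging_dir, "experiment.img")
      with open(path, "wb") as f:
          f.write(image)
      return path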

We should point out that the infrastructure we are proposing is emerging as
the standard platform of choice for web services. Heavily used web
services such as Altavista and Netscape have long had several
geographically distributed replicas to improve latency, bandwidth, and
availability to end-users; Amazon.com now has a site on each coast.
However, the tools used to operate these distributed web sites are extremely
primitive. If the computer science research community is to figure out
solutions for what web services will need in the future, we will need an
experimental testbed for validating those solutions.

2. Applications

Our goal is to enable novel wide-area distributed systems and networking
research. There is a huge pent-up demand for a deployment mechanism for
new research ideas; we enumerate a few here. The scope of this list
suggests that even twenty PC's per site would be quickly utilized.

a. Telecollaboration. The MBone is a huge success story for widespread
deployment, but it is suffering from its own popularity. Its protocols are
based on manual coordination, and although there are several proposals for
self-managing multicast networks (e.g., for building the multicast tree, for
address allocation, for reliable multicast transmission), it is unclear how to
deploy those new ideas given the widespread use of the MBone today.
Either we must restrict ourselves to backward-compatible changes (a
serious limitation in a fast-moving research field), conduct a "flag day"
where everyone agrees to upgrade at the same time, or, with the xBone,
provide a framework for incrementally deploying an alternate "MBone2"
structure that operates in parallel with the original MBone while it is
phased out. Similarly, various proposals for telecollaboration tools
require computation inside of the network to be effective, for example, to
mix audio streams on the fly.
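
As a toy illustration of such in-network computation, the sketch below
(Python, illustrative only) mixes several 16-bit PCM audio frames into a
single outgoing frame, the kind of operation an xBone node could perform so
that each receiver gets one combined stream; packet framing, timing
recovery, and codecs are all omitted.

  def mix_frames(frames):
      # Sum same-length lists of 16-bit PCM samples, clamping to the valid range.
      mixed = []
      for samples in zip(*frames):
          total = sum(samples)
          mixed.append(max(-32768, min(32767, total)))
      return mixed

  # Two speakers' four-sample frames combined into one outgoing frame.
  print(mix_frames([[1000, -2000, 30000, 5], [500, 2500, 10000, -5]]))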

b. Real time. Providing end-to-end real time performance is an active
research area, with a proposed Internet standard in RSVP and several
competing efforts at other universities. Without widespread deployment
with real users, however, it is unclear whether RSVP or its competitors
would work at a large scale, for example, to support Internet telephony in
the face of resource limits and router failures. Similarly, Internet switch
manufacturers are moving towards providing quality of service by
providing prioritized classes of service; however, there has been no
widespread prototyping effort to demonstrate that priority classes will be
sufficient to provide reasonable performance for real-time traffic.
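
To illustrate the priority-classes approach, here is a small sketch (Python,
illustrative) of a strict-priority packet scheduler that always transmits
queued real-time packets ahead of best-effort ones; whether so simple a
discipline suffices under real load is precisely the question a deployed
testbed could answer.

  from collections import deque

  class PriorityScheduler:
      def __init__(self, num_classes=2):
          # Class 0 is the highest priority (e.g., telephony traffic).
          self.queues = [deque() for _ in range(num_classes)]

      def enqueue(self, packet, priority_class):
          self.queues[priority_class].append(packet)

      def dequeue(self):
          # Return the next packet to transmit, highest class first, or None.
          for q in self.queues:
              if q:
                  return q.popleft()
          return None

  sched = PriorityScheduler()
  sched.enqueue("best-effort web fetch", 1)
  sched.enqueue("telephony frame", 0)
  print(sched.dequeue())   # -> "telephony frame"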

c. Worldwide web caching. There are at least four major research efforts
proposing novel approaches to managing a distributed collection of web
caches (at UCLA, Washington, Wisconsin, and Texas); this is in addition to
the existing hierarchical Harvest/Squid caches. In fact, the research efforts
have been motivated by the surprising fact that the Squid caches do not
work -- with hit rates under 50%, going through the Squid cache hierarchy
hurts end client response time. Without being able to deploy and measure a
distributed web cache, the research community would not have been able to
determine the real problems that needed addressing; similarly, if we are
unable to deploy and measure the proposed solutions, we will be unable to
determine the next set of problems in operating a cooperating set of web
caches. The xBone provides the opportunity for a bake-off between
competing approaches; we could deploy two competing web caching
services and allow users to vote with their feet.
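
The hit-rate argument above can be made concrete with a back-of-the-envelope
calculation; the latency numbers in the sketch below are invented for
illustration, not measurements of Squid.

  def expected_latency(hit_rate, t_hit, t_overhead, t_origin):
      # Mean response time when every request traverses the cache hierarchy.
      return t_overhead + hit_rate * t_hit + (1 - hit_rate) * t_origin

  t_origin = 400.0     # ms to fetch directly from the origin server (assumed)
  t_hit = 40.0         # ms to serve from a nearby cache (assumed)
  t_overhead = 150.0   # ms added by walking the hierarchy on every request (assumed)

  for h in (0.3, 0.5, 0.7):
      print(h, expected_latency(h, t_hit, t_overhead, t_origin), "vs direct", t_origin)

  # With these numbers the hierarchy only wins once the hit rate exceeds
  # t_overhead / (t_origin - t_hit), about 0.42, which is why low hit rates
  # can leave users worse off than bypassing the caches entirely.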

d. IETF standards. There is an existing mechanism, via the IETF, for new
Internet standards to be proposed and adopted for use. The xBone would
be complementary to that process, enabling proposed standards to be
implemented, deployed, and tested under real use before and during
adoption for the real Internet. For example, IPv6 and mobile IP standards
have been tested on a small scale, but with the xBone, clients could begin to
count on these services being continuously available.

e. Internet measurement. A number of research efforts have begun to focus
on measuring characteristics of the Internet, both to understand its behavior
and to use as input into large scale simulations of the Internet. Measurement
efforts are ongoing at LBL, Pittsburgh, Cornell, ISI, Michigan, and
Washington, among other places. Because there is no direct way to ask
routers to report on their buffer occupancy, link latencies/bandwidths,
utilization, or drop rate, measurements must be taken from multiple sites to
be effective at capturing the state of the Internet. LBL and Pittsburgh have
developed a new suite of software, for example, to be deployed at
participating sites for use in a widespread Internet measurement study.
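
As an illustration of the kind of probe each xBone site could run, the
sketch below (Python, with placeholder peer names) estimates loss and
latency toward other sites by timing unprivileged TCP connection setup
rather than using ICMP; real measurement suites would of course be far more
careful.

  import socket, time

  PEERS = ["peer1.example.edu", "peer2.example.edu"]   # placeholder xBone sites

  def probe(host, port=80, attempts=10, timeout=2.0):
      # Estimate loss and latency by timing TCP connection setup to the peer.
      rtts, lost = [], 0
      for _ in range(attempts):
          start = time.time()
          try:
              with socket.create_connection((host, port), timeout=timeout):
                  rtts.append((time.time() - start) * 1000.0)
          except OSError:
              lost += 1
      return lost / attempts, rtts

  for peer in PEERS:
      loss, rtts = probe(peer)
      median = sorted(rtts)[len(rtts) // 2] if rtts else None
      print(peer, "loss:", loss, "median connect ms:", median)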

f. Internet operation. The ongoing Internet measurement efforts have begun
to illustrate that the Internet has substantial operational problems, including
high drop rates (5-6%), persistent congestion, poor route selection, and
route oscillations, just to name a few examples. Researchers at Berkeley,
Washington, Harvard, and elsewhere have proposed new approaches to
routing and congestion control to address these problems, but without a
deployment strategy that would allow them to be tested against substantial
numbers of users, there would be no way of validating these approaches to
the degree that would be necessary to think about using them in the real
Internet.

g. Distillation and compression. As more of the Web becomes graphics-
based, and as end-host displays become more heterogeneous (from PDA's
to reality engines), there is an increasing need for application-specific
compression to take place inside of the network to optimize around
bottleneck links. For example, it makes little sense to ship a full screen
picture across the Internet to a PDA; it also makes little sense to ask users to
manually select among small and large versions of the same image. Various
proposals exist to address this problem, but they are unlikely to have much
of an impact without a framework for providing reliable service to real
users. In the long run, one would hope compression and distillation would
be supported by both servers and clients, but before there is widespread
adoption, there is need for translators embedded in the network to handle
legacy systems.
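
A hedged sketch of the distillation idea follows: a proxy-side function that
shrinks and recompresses an image before forwarding it over a bottleneck
link to a small-screen client. It uses the Pillow imaging library purely
for illustration; the size and quality parameters are arbitrary.

  from io import BytesIO
  from PIL import Image

  def distill_image(image_bytes, max_dim=160, quality=40):
      # Downsample and recompress an image for a bandwidth-limited client.
      img = Image.open(BytesIO(image_bytes))
      img.thumbnail((max_dim, max_dim))   # shrink in place, preserving aspect ratio
      out = BytesIO()
      img.convert("RGB").save(out, format="JPEG", quality=quality)
      return out.getvalue()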

h. Wide area distributed systems. A number of projects, such as Globe,
Globus, Legion, WebOS, ProActive, and DARPA's Quorum, have recently
been started to provide a software framework to support applications that
can make effective use of remote computational and storage resources.
These systems face a huge research agenda; to illustrate just one example,
we have only a very limited understanding of how to provide cache and
replica consistency across the wide area. To focus this work on the real
problems of next-generation distributed applications, we need a strategy for
how applications can be developed for these frameworks and then be tested
in real use.

i. Naming. Similarly, a number of proposals have recently been developed
for replacing the Internet's Domain Naming System (DNS). Although DNS
is effective at mapping individual machine names to IP addresses, as
services become replicated, there is an increasing need to carefully control
the mapping from names to instances of a service (e.g., binding clients on
the East Coast to the Amazon.com replica in Delaware vs. binding West
Coast clients to the one in Seattle). Point solutions to this problem have
started to be deployed, but without a generic framework the solutions are
likely to be ad hoc (for example, Cisco's Distributed Director selects the
closest replica based on hop count, ignoring link latency, bandwidth, server
load, etc.).
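
To illustrate what a more informed binding policy might look like, the
sketch below scores each replica on measured latency and reported load
rather than hop count alone; the replica table and weights are invented
examples, not a proposed mechanism.

  REPLICAS = {
      # replica name:      (measured RTT in ms from this client, reported load 0..1)
      "east.example.com":  (80.0, 0.10),
      "west.example.com":  (20.0, 0.90),
  }

  def choose_replica(replicas, latency_weight=1.0, load_weight=100.0):
      # Pick the replica with the lowest combined latency/load score.
      def score(item):
          rtt_ms, load = item[1]
          return latency_weight * rtt_ms + load_weight * load
      return min(replicas.items(), key=score)[0]

  # A lightly loaded distant replica can beat a nearby but overloaded one.
  print(choose_replica(REPLICAS))   # -> "east.example.com" with these numbers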

j. Wide area security. There is an obvious need for a national infrastructure
for the secure, authenticated, accountable, and revocable access to remote
resources; several proposals have been made for how to provide the needed
framework, including wide area versions of Kerberos, MIT's SDSI and
Washington's CRISIS system. Nevertheless, it is unlikely that such a
system will be deployed in the near future, because it would rely on
physically secure computers spread around the country. The security issues
for the xBone, the MBone, the Active Networks backbone, etc., are similar,
and are unlikely to be solved without the ability to deploy a reliable,
continuously available framework for authentication and key distribution.

k. Distributed databases. An active area of research in the database
community is how to integrate geographically distributed data sets, for
example, to integrate various Web databases or NASA's EOSDIS into a
usable system capable of supporting queries that span multiple sites. The
Mariposa project at Berkeley, for example, has proposed an architecture for
dynamically moving data and computation around to minimize network
bandwidth and local computation cost.

l. Active networks. Finally, DARPA's Active Networks program can be
seen as providing an architecture for applications that can benefit from
computing in the network, but basic design questions remain open: what
virtual machine do these
applications use? How are resources allocated among applications that
share a physical machine? Ideally, we could use the Active Network
framework for operating the xBone; however, it is still in the process of
being standardized. Currently, there are four separate proposals for the
Active Network architecture, and the quickest way to resolve which is the
most appropriate would be to provide support for each of them across a
geographically distributed set of computers, and let application developers
vote with their feet.

3. Local Hardware Installation

As a research community, over the past few years we have gained substantial
experience with assembling and operating machine-room-area clusters.
Since our graduate students have wanted to avoid having to run to the
machine room every time something goes wrong with our local clusters, we
have also gained considerable experience with remote cluster operation.
Our strategy is to simplify operations by (i) throwing hardware at the
problem (e.g., using two extra PCs to serve as fail-safe monitors on the
cluster operations) and (ii) having a standard baseline hardware and
software configuration, avoiding the management problems of trying to turn
random collections of PC's running random collections of software into a
usable system.

At each site, we envision:

2 control PC's to serve as fail-safe reboot engines, monitoring the operation
of the other PC's. The control PC's would control reboot serial lines
(X.10) for all of the machines in the cluster (including each other); new
experiments (including potentially new OS kernels) are downloaded over
the Internet to the control PC and then installed on the relevant PC in the
rack. The control PC's would also be responsible for passively monitoring
the cluster's Internet connection for illegal use by the experimental PC's.
As a fail-safe, the two control PC's would not be used to run experimental
software. The control PC's would also have GPS's to provide time
synchronization for the rest of the cluster. (A sketch of the monitoring
loop the control PC's could run appears after this list.)

20 PC's to serve as experimental apparatus. Each PC would be configured
with a reasonable amount of memory and disk, a machine-room area
network connection, and a wide-area network connection.

A high-speed machine room area network, such as Myrinet or fast switched
Ethernet, connecting all of the PC's in the cluster. This network would be
dedicated to the cluster and isolated from any other machines at the site.

A high-speed Internet connection to the local gigaPOP (the local connection
point to Abilene, HSCC, SuperNet, etc.). This connection would be
something like Gigabit Ethernet, which can be passively monitored by the
control PC's.
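
The following is a minimal sketch of the fail-safe monitoring loop mentioned
above: the control PC periodically checks that each experimental PC answers
on the machine-room network and trips its reboot line after repeated
failures. The hostnames are placeholders and power_cycle() is a hypothetical
hook for the X.10 reboot hardware.

  import socket, time

  EXPERIMENT_PCS = ["exp%02d" % i for i in range(1, 21)]   # placeholder hostnames
  CHECK_PORT = 22                                          # any always-on service
  MAX_MISSES = 3

  def reachable(host, port=CHECK_PORT, timeout=5.0):
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  def power_cycle(host):
      # Placeholder for driving the X.10 reboot line for this machine.
      print("would toggle reboot line for", host)

  misses = dict((h, 0) for h in EXPERIMENT_PCS)
  while True:
      for host in EXPERIMENT_PCS:
          misses[host] = 0 if reachable(host) else misses[host] + 1
          if misses[host] >= MAX_MISSES:
              power_cycle(host)
              misses[host] = 0
      time.sleep(60)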

A local operator would be needed at each site to install the system, to
replace any failed hardware components, and to reboot the control PC's in
the event that both crash. Otherwise, the local operator would have no
software responsibilities for the cluster.

All told, the list price of the system described above would be roughly
$150K (?) per site; the machine-room footprint is about 5 square feet (2
racks).

4. List of potential participants:

West Coast
----------
Thomas Anderson (UW) tom@cs.washington.edu
David Wetherall (UW)
Deborah Estrin (USC) estrin@cs.usc.edu
Lixia Zhang (UCLA)
Darrell Long (UCSC)
Steve McCanne (Berkeley) mccanne@cs.berkeley.edu
David Culler (Berkeley)
Sally Floyd (LBL)
Joe Pasquale (UCSD)
Mendel Rosenblum (Stanford)
Mary Baker (Stanford)
Bob Braden (ISI-West)
SDSC
Oregon?

Mid-Country
-----------
John Hartman (Arizona) jhh@cs.arizona.edu
John Carter (Utah) retrac@cs.utah.edu
Mike Dahlin (Texas-Austin) dahlin@cs.utexas.edu
Pei Cao (Wisconsin) cao@cs.wisc.edu
Garth Gibson (CMU) garth.gibson@cs.cmu.edu
Jon Turner (Washington U)
Raj Jain (Ohio State)
Farnam Jahanian (Michigan)
Mathis (Pittsburgh Supercomputing)
Dirk Grunwald (Colorado)
Gary Minden (Kansas)
Roy Campbell, Illinois
Minnesota?

East Coast
----------
Jonathan Smith (Penn) jms@cs.upenn.edu
Jeff Chase (Duke) chase@cs.duke.edu
John Guttag (MIT) guttag@lcs.mit.edu
Hari Balakrishnan (MIT)
Larry Peterson (Princeton) llp@cs.princeton.edu
B. R. Badrinath (Rutgers) badri@cs.rutgers.edu
Jim Kurose (UMass)
Margo Seltzer (Harvard)
ISI-East
Andrew Grimshaw (Virginia)
S. Keshav (Cornell)
Ian Foster (Argonne)
Ellen Zegura (Georgia Tech), ewz@cc.gatech.edu
David Kotz (Dartmouth)
Yechiam Yemini (Columbia)

International
------------
Roger Needham (Cambridge)
Gerry Neufeld (U British Columbia)
Ken Sevcik (Toronto) kcg@cs.toronto.edu
Andy Tannenbaum
Marc Shapiro (INRIA)

Industry
--------
Intel
Fred Baker (Cisco)
Peter Newman (Nokia)
Jim Gray (Microsoft BARC)
Microsoft Redmond
Chuck Thacker (Microsoft Cambridge)
Mike Schroeder, DEC SRC
Jeff Mogul, DEC NSL
Scott Shenker, Xerox PARC
Greg Papadopolous (SUN)
Mike Schwartz (@Home)
K.K. Ramakrishnan (AT&T)
Eric Brewer (Inktomi)
Srini Seshan (IBM?)
Bellcore
TIS