Merit Network

Discussion Communities: Merit Network Email List Archives

North American Network Operators Group


Re: Time to revise RFC 1771

  • From: Dave Israel
  • Date: Tue Jun 26 17:16:49 2001

On 6/26/2001 at 13:47:37 -0700, Clayton Fiske said:
> On Tue, Jun 26, 2001 at 04:27:49PM -0400, Dave Israel wrote:
> > 
> > This ignores three basic facts:
> > 
> > 1) Networks tend to be homogeneous in platform.
> > 2) Platforms tend to accept their own implementation quirks.
> > 3) Networks peer at borders.
> > 
> > Therefore, under the "drop the session" rule, my bad announcement
> > gets to all my borders fine, and all my external peers who are not
> > running forgiving/compatible implementations drop their connections
> > to me and all my traffic to/from them hits the floor.
> 
> In this case, vendor C's implementation was neither forgiving nor
> compatible. It still dropped the peer(s) in question. It just had
> the much more harmful quirk that it forwarded the bad route on to
> its peers before doing so. In this case, a homogeneous network would
> not only lose its border sessions, it would lose all internal ones
> through which the route was advertised.

I'm certainly not defending (or attacking) either vendor's
implementation; in the current environment, I believe following
the RFC is the correct course.  My concern is with future
implementations of BGP and how (I feel) they should handle
problems like this.  As we add more and more features to BGP,
how we handle what appears to be a bad route (or a bad NLRI)
is going to become more important.
 
> > One CRC error does not make PPP drop.  Why make one route cause
> > a catastrophic loss of connectivity?  Report the bad route,
> > drop it, and move on; let layer 8 resolve it.
> 
> Because, arguably, we don't know that it's just one route. We just
> know that one route set off the alarm. Do you feel safe assuming that
> whatever bug caused one corrupted route left all the other routes
> alone?

No, but I feel secure that, if it corrupted a large enough number of
routes, the effect will not be worse than dropping the session.
Somebody mentioned what happens if there are 100,000 bad routes and 1
good one.  You keep the good one and drop the 100,000 bad ones.
Dropping routes is even easier than using them.  Besides, which tends
to be harder on a router: dropping bad routes, or tearing down and
restarting a TCP session?
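
As a rough illustration of the two policies being compared here, the
sketch below applies each one to the "100,000 bad routes and 1 good one"
case.  It is a toy model, not BGP: the Route class, its "malformed" flag,
and both function names are hypothetical stand-ins for whatever validation
a real implementation performs on an UPDATE.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Route:
    prefix: str
    malformed: bool = False  # hypothetical stand-in for real UPDATE validation


def drop_session(routes: List[Route]) -> Tuple[List[Route], bool]:
    """Strict policy: one malformed route tears down the whole session."""
    if any(r.malformed for r in routes):
        return [], False  # session reset, every route from this peer is lost
    return list(routes), True


def drop_route(routes: List[Route]) -> Tuple[List[Route], bool, List[Route]]:
    """Permissive policy: discard and report bad routes, keep the session."""
    good = [r for r in routes if not r.malformed]
    bad = [r for r in routes if r.malformed]
    return good, True, bad  # 'bad' is what would be reported to an operator


# One good route among 100,000 bad ones, as in the example above.
routes = [Route(f"bad-{i}", malformed=True) for i in range(100_000)]
routes.append(Route("192.0.2.0/24"))

strict_kept, strict_up = drop_session(routes)
kept, up, reported = drop_route(routes)
```

Under these assumptions the strict policy keeps nothing and loses the
session, while the permissive policy keeps the single good route and
hands the rest to whoever reads the report.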
 
> Plus, a CRC error can occur between two valid, compliant, bug-free
> implementations. A bad route, by definition, can't. We're not talking
> about external faults here, but broken implementations. When one side
> of a protocol session simply breaks the rules, I don't think it's
> reasonable to say that the other side needs to be "fixed" to accept
> that breakage. Fix the broken side.

A "bad route" can happen whenever one implementation differs from
another.  Both can be valid according to some definition of the
standard.  Determining who is wrong, and fixing it, takes time.  If
you're dropping a few of my routes during that time, that's
unavoidable.  If every customer of mine cannot reach every customer of
yours while we fight over whose implementation is wrong and who needs
to change what, then who wins?  And how is this fight more legitimate
than the one you have with your telco provider over how they built
your circuit and where your errors are coming from?

> This has got everyone's attention because of the unique way in
> which the breakage occurred. If all implementations were changed
> to drop the single bad route and keep the sessions intact, the damage
> would not have been what it was. If all implementations followed the
> current specs and dropped the session with the router which first
> originated the bad route, the damage would not have been what it was.
> To say that one way causes massive damage and the other doesn't is
> inaccurate. The damage was caused by the implementation in question
> doing something resembling one but with harmful behavior thrown in.

I think the issue has gone beyond what happened, and into what will
happen.  It's a simple design philosophy question:  Do you build
protocols that are robust and resilient under stress, or do you
build protocols that refuse to interoperate until everything
completely agrees?  Ideally, I can see the beauty of the second,
but realistically, I think you need to be permissive.  
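
The three behaviours contrasted in this thread (drop the route, drop the
session, or forward the bad route and then drop the session) can be
compared with a small simulation.  This is a sketch under simplifying
assumptions: routers sit in a single chain, the bad route enters at
router 0 from an originator, and the behaviour names are hypothetical
labels, not vendor terminology.

```python
from typing import List


def sessions_lost(behaviors: List[str]) -> List[int]:
    """Return the indices of the sessions torn down by one bad route.

    behaviors[i] is how router i reacts to a bad route from upstream.
    Session i is the one between router i and its upstream neighbor
    (the originator, for i == 0).
    """
    lost = []
    for i, behavior in enumerate(behaviors):
        if behavior == "permissive":
            break  # route discarded and reported, session stays up
        if behavior == "strict":
            lost.append(i)  # session dropped, route never forwarded
            break
        if behavior == "forward_then_drop":
            lost.append(i)  # route already sent downstream before the reset
            continue
        raise ValueError(f"unknown behavior: {behavior}")
    return lost


permissive_chain = sessions_lost(["permissive"] * 3)
strict_chain = sessions_lost(["strict"] * 3)
buggy_chain = sessions_lost(["forward_then_drop"] * 3)
```

In this model a permissive chain loses no sessions, a strict chain loses
only the first one, and a homogeneous forward-then-drop chain loses every
session the route crosses, which matches the point quoted above that
either consistent behaviour would have contained the damage.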


-- 
Dave Israel
Senior Manager, IP Backbone
Intermedia Business Internet




