Internet Border Gateway Protocol (BGP)

Victor Leung
15 min readJun 4, 2017

--

This article is about the Internet Border Gateway Protocol (BGP). It is a standardised exterior gateway protocol to exchange routing and reachability information among different autonomous agents (ASes) or Internet Service Providers (ISPs) on the Internet. The importance of the protocol, its capabilities, its problems and solutions would be explained and illustrated in details below.

1.The Border Gateway Protocol and its functions

In January 1989 at the 12th Internet Engineering Task Force (IETF) meeting, Len Bosack, Kirk Lougheed and Yakov Rekhter devised the BGP with the design goal to develop a protocol capable of providing policy control, loop detection and the scalability required to support hundreds of thousands of networks through address aggregation techniques.

BGP is an inter-Autonomous System routing protocol that is used to connect the ISPs. For example, Hutchison exchange network layer reachability information (NLRI) with China Mobile. Since there is no centralised control of the internet, they have to exchange NLRI in order to glue these autonomous networks together. Hutchison controls its own equipment and it’s own part of the internet, while China Mobile controls its own part of the internet with the different intra-autonomous system routing protocol. They have to inter-operate somehow and exchange information about the IP address associated with their customers.

The primary function of a BGP speaking system is evolved to handle this engineering and research problem, which is to exchange information between autonomous networks without any centralised control. When a packet is sent to the service provider, it has to look up the table to figure out where to send that. And that packet may end up on the other side of China on a completely different network. It needs to figure out how to get here from source to destination. BGP is the basic global foundation of the TCP/IP internets.

Another important function of BGP is to handle a commercial issue. For example, China mobile would not prefer Hutchison to send too much traffic to it, as this would result in costing China mobile more. When doing the shortest path algorithm, the goal is to connect everybody with the shortest path. However, different protocols are used within these autonomous networks in different domain and region, which has to work even when the service providers do not have a common agreement about what is the best route. The best route from Hutchison perspective may look different from China mobile due to different contracts with the company, thus different ideas on what is the best route.

In the internet, Hutchison would not want traffic to crossing its network, unless someone paying it for that, either from the source of the traffic, or the destination, or the combination of both. The protocol is about limiting the connectivity instead of being generous about it and giving to everyone. It could restrict the connectivity that is paying the service provider.

2. The operations of BGP

The current version of BGP is version 4, which was published as RFC 4271 in 2006. The operation uses an algorithm that is neither a pure distance vector nor a pure link state algorithm. Instead, BGP is an example of a path-vector algorithm, which is a class of distance vector using the Bellman-Ford routing algorithm. It uses path information, that is stored within the AS_PATH attribute, to avoid traditional distance vector problems. A path vector defines a route as a pairing between a destination and the attributes of the path to that destination. The explicit paths to reach a destination are advertised to a neighbour BGP router instead of advertising a cost and next hop in distance vector schemes. A network can define preferred paths instead of shortest paths that might have congestion or security problems.

The routing tables are traversed in order to reach the destination network system. This is the primary mechanism which provides loop avoidance. As a network is advertised from one AS to another, the AS path is updated to include the AS it has just passed through. BGP also supports address aggregation, allowing a large reduction in Core Internet Routing Tables.

If one internet path goes down, one of the operations of BGP is to offer network stability that guarantees routers can quickly adapt to send packets through another reconnection by making routing decisions based on paths, rules or network policies configured by a network administrator. Each BGP router maintains a standard routing table used to direct packets in transit. This table is used in conjunction with the routing information base (RIB), which is a data table stored on a server on the BGP router. The RIB contains route information both from directly connected external peers, as well as internal peers and continually updates the routing table as changes occur.

BGP sends updated router table information only when something changes with only the affected information. BGP has no automatic discovery mechanism, which means connections between peers have to be set up manually, with peer addresses programmed in at both ends.

It uses an incremental update strategy to conserve bandwidth and processing power. After initial exchanges of complete routing information, a pair of BGP routers exchanges only the changes to that information. This requires reliable transport between a pair of BGP routes by using TCP.

BGP is based on TCP/IP and uses client-server topology to communicate routing information, with the client-server initiating a BGP session by sending a request to the server.

3. Examples to illustrate how ASes can learn about the reachability of the Internet

For example, we have five ASes directly connected with peer relationship identified by a unique 32-bit integer Autonomous System Number (ASN), which is distributed by the Regional Internet Registries as shown in the diagram below:

BGP learns multiple paths via internal and external BGP speakers. It picks the best path and installs it in the RIB. Internal BGP is used to forward paths inside the AS while external BGP is used to exchange paths between AS.

When a customer inside the AS104 network wants to transmit data to the AS100 network, AS104 may need to transfer the packet to AS101 or AS103 first in order to reach AS100. BGP helps the router inside AS104 to make a decision which path to choose. They periodically exchange information about the ASes that they have a direct connection to. AS100 would send a message to both AS101 and AS102, that it has 1 hop length to AS101 and 1 hop length to AS102. Once AS101 received this message, it could figure out it has 0 hop length to itself, 1 hop length to AS100, 1 hop length to AS104, and more importantly, with new information that it has 2 hop lengths to reach AS102. The routing table of AS101 can now know how to transfer data to AS102. AS101 would then send all the information to AS104, which AS104 knows it is 0 hops away from itself, 1 hop length to AS101, 1 hop length to AS103 as well as 2 hop length to AS100 through AS101. Therefore AS104 can know the shortest path to reach AS100, because of 2 hops via AS101, instead of 3 hops via AS103 and AS102. The BGP allows advertise of reachability information on the internet and connect all the ASes. It allows management of trust and untrust between different service provider. It is used to uniquely identify networks with a common routing policy described in RFC 42771 and widely used for internet backbone.

BGP makes best-path decisions based on current reachability, hop counts and other path characteristics. In situations where multiple paths are available, it can be used to communicate an organization’s own preferences in terms of what path traffic should follow in and out of its networks. It has a mechanism for defining arbitrary tags, called communities, which can be used to control route advertisement behaviour by mutual agreement among peers.

4. The packet formats of BGP and highlight the functions of some of the fields

BGP messages are sent over TCP connections. A message is processed only after it is entirely received. The maximum message size is 4096 octets, while the smallest message that may be sent consists of a header without a data portion of 19 octets. Some of the functions of the field are highlighted below:

4.1 Message Header Format

Marker — the 16-octet field is included for compatibility and it must be set to all ones.

Length — 2-octet unsigned integer indicates the total length of the message, including the header in octets. It allows one to locate the Marker field of the next message in the TCP stream. The value of the field must always be greater than 19 and smaller than 4096. The padding of extra data after the message is not allowed, so the field must have the smallest value required.

Type — 1-octet unsigned integer indicates the type code of the message. The types codes are 1 — Open, 2 — Update, 3 — Notification, 4 — Keepalive.

4.2 Open Message Format

After a TCP connection is established, the first message sent by each side is an Open message. If the Open message is acceptable, a Keepalive message confirming the Open is sent back.

Version — 1-octet unsigned integer indicates the protocol version number of the message.

My Autonomous System — 2-octet unsigned integer indicates the AS number of the sender.

Hold Time — 2-octet unsigned integer indicates the number of seconds the sender proposes for the value of the Hold Timer. Upon receipt of an Open message, a BGP speaker must calculate the value of the hold timer by using the smaller of its configured hold time and the hold time received in the Open message. The hold time must be either 0 or at least 3 seconds. An implementation may reject connections on the basis of the hold time. The calculated value indicates the maximum number of seconds that may elapse between the receipt of successive keepalive and/or update messages from the sender.

BGP Identifier — 4-octet unsigned integer indicates the BGP Identifier of the sender. A given BGP speaker sets the value of its BGP Identifier to an IP address that is unsigned to that BGP speaker. The value of it is determined upon startup and is the same for every local interface and BGP peer.

Opt Param Len — 1-octet unsigned integer indicates the total length of the Optional Parameters field in octets. If the value of this field is zero, no Optional Parameters are present.

Optional Parameters (variable) — contains a list of optional parameters, in which each parameter is encoded as triplet below:

Parameter Type — 1 octet field that unambiguously identifies individual parameters.

Parameter Length — 1 octet field that contains the length of the Parameter Value field in octets.

Parameter Value (variable) — interpreted according to the value of the Parameter Type field.

The minimum length of the Open message including the message header is 29 octets.

4.3 Update Message Format

This is used to transfer routing information between BGP peers. The information can be used to construct a graph that describes the relationships of the various AS. Routing information loops and some other anomalies may be detected and removed from inter-AS routing.

An Update message is used to advertise feasible routes that share common path attributes to a peer, or to withdraw multiple unfeasible routes from service. An Update message may simultaneously advertise a feasible route and withdraw multiple unfeasible routes from services.

Withdrawn Routes Length (2 octets) — indicates the total length of Withdrawn Routes fields, a value of 0 indicates that no routes are being withdrawn from service

Withdrawn Routes (variable) — contains a list of IP address prefixes for the routes that are being withdrawn from service.

Length (1 octet) — indicates in bits of the IP address prefix. 0 indicates a prefix that matches all IP addresses.

Prefix (variable) — contains an IP address prefix, followed by the minimum number of trailing bits needed to make the end of the field fall on an octet boundary.

Total Path Attributes Length (2 octets) — indicates the total length of the Path Attributes fields in octets. A value of zero indicates that neither the NLRI nor the Path Attribute field is present.

Path Attributes (variable), a triple: <attribute type, attribute length, attribute value>, an attribute type is a 2 field that consists of:

Attr. Flags

Bit 0 — optional bit, 1 is optional and 0 is well-known.

Bit 1 — Transitive bit with 0 is transitive and 0 is non-transitive. Well-known attributes must set to 1.

Bit 2 — Partial bit defines as 1 if the information contained in the optional transitive attributes is partial or as 0 for complete. For well-known attributes, this must be set to 0.

Bit 3 — Extended Length bit set to 0 for a length of 1 octet and set to 1 for a length of 2 octets.

Bit 4–7 are unused. Must be zero when sent and ignored when received.

Attr. Type Code

Type Code 1 Origin
A well-known mandatory attribute that defines the origin of the path information, the value of 0 means IGP — NLRI is interior to the originating AS; a value of 1 means EGP — NLRI learned via the EGP protocol; a value of 2 means INCOMPLETE — NLRI learned by some other means

Type Code 2 AS_PATH

A well-known mandatory attribute that is composed of a sequence of AS path segments, represented by a triple <path segment type, path segment length, path segment value>. Value 1 stands for an unordered set of ASes and values 2 stands for an ordered set.

Type Code 3 NEXT_HOP

It defines the unicast IP address of the router that should be used as the next hop to the destinations listed in the network layer reachability information field of the update message.

Type Code 4 MULTI_EXIT_DISC

An optional non-transitive attribute that is a four-octet unsigned integer. The value may be used by a BGP speaker’s decision process to discriminate among multiple entries points to a neighbouring ASes.

Type Code 5 LOCAL_PREF

A 4-octet unsigned integer that is used by BGP speaker to inform its other internal peers of the advertising speaker’s degree of performance for an advertised route.

Type Code 6 ATOMIC_AGGREGATE

A discretionary attribute of length 0.

Type Code 7 AGGREGATOR

An optional transitive attribute of length 6 which contains the last AS number that formed the aggregate route encoded as 2 octets, followed by the IP address of the BGP speaker that formed the aggregate route encoded as 4 octets.

Network Layer Reachability Information (variable) — contains a list of IP address prefixes. The length, in octets, is not encoded explicitly but can be calculated as:

updated message length — 23 — total path attributes length — withdrawn routes length

where
- updated message length is the value encoded in the fixed-size BGP header,

- total path attributes length and withdrawn routes length are the variable part of the update message

- 23 is a combined length of the fixed-size BGP header, the total path attribute length field and the withdrawn routes length field.

Reachability information is encoded as one or more 2-tuples of the form:

Length (1 octet) — indicates the length in bits of the IP address prefix. 0 indicates a prefix that matches all IP address (with a prefix, itself, of zero octets).

Prefix (variable) — contains an IP address prefix, followed by enough trailing bits to make the end of the field fall on an octet boundary.

The minimum length of the Update message is 23 octets — 19 octets for the fixed header + 2 octets for the length of the withdrawn routes + 2 octets for the total path attribute length.

An update message can advertise one set of path attributes, but multiple destinations provided that the destination shared these attributes. All path attributes contained in a given Update message apply to all destinations carried in the NLRI field of the update message.

An update message can list multiple routes that are to be withdrawn from service. Each route is identified by its destination, which identifies the route in the context of the BGP speaker — BGP speaker connection to which it has been previously advertised.

An update message might advertise only routes that are to be withdrawn from service, in which case the message will not include path attributes or NLRI. On the contrary, it may advertise only a feasible route without withdrawn routes.

4.4 Keepalive Message Format

BGP does not use any TCP-based, keep-alive mechanism to determine if peers are reachable. Instead, Keepalive messages are exchanged between peers often enough not to cause the Hold Timer to expire. A reasonable maximum time between Keepalive messages would be one-third of the Hold Time interval. Keepalive messages must not be sent more frequently than once per second. An implementation may adjust the rate at which it sends Keepalive messages as a function of the Hold Time interval. If the negotiated Holdtime interval is zero, then periodic Keepalive messages must not be sent. A Keepalive message consists of only the message header and has a length of 19 octets.

4.5 Notification Message Format

A notification message is sent when an error condition is detected. The BGP connection is closed immediately after it is sent. In addition to the fixed-size BGP header, the notification message contains the following fields:

Error Code (1-octet) — indicates the type as follows:

1 — Message header error

2 — Open message error

3 — Update message error

4 — Hold timer expired

5 — Finite state machine error

6 — Cease

Error subcode (1-octet) integer provides more information about the nature of the reported error.

Message Header Error subcodes:

1 — Connection not synchronized

2 — Bad message length

3 — Bad message type

Open message error subcodes:

1 — Unsupported version number

2 — Bad peer AS

3 — Bad BGP identifier

4 — Unsupported optional parameter

5 — (Deprecated)

6 — Unacceptable hold time

Update message error subcodes:

1 — Malformed Attribute List

2 — Unrecognized Well-known Attribute

3 — Missing Well-known Attribute

4 — Attribute Flags Error

5 — Attribute Length Error

6 — Invalid ORIGIN Attribute

7 — Deprecated

8 — Invalid NEXT_HOP Attribute

9 — Optional Attribute Error

10 — Invalid Network Field

11 — Malformed AS_PATH

Data (variable) — is used to diagnose the reason for the notification. The contents depend on the error. The length can be determined by the formula:

Message length = 21 + data length

The minimum length therefore 21 octets.

5. Some of the instability problems of BGP and the proposed solution

Instability is defined as the rapid change of network reachability and topology information. Several problems like software bugs, TCP attacks or congestion can cause loss of service, waste of network resources and service degradation for Quality of Service (QoS) demanding application.

One of the classic problems of BGP is called the black-hole phenomenon. A manual bad configuration can cause a BGP router to wrongly announce through the AS to which it belongs, and causes many BGP routers to update their routing table entries for these networks. As a result, a huge amount of traffic would be forwarded to this AS. The unexpected traffic causes a huge amount of packet loss and finally a congestion collapse.

One of the symptoms of instability is the disappearance of an existing route. It is called flapping if the route is reappearing soon. This is caused if a router sends a routing update and then withdraws it after a short period of time, causing its peers to propagate and withdraw the updates further in the network. Performance of routers may suffer significantly as a result. It could also lead to a transient loss of connectivity.

Another possible instability happens when there is internal congestion in some links within an AS and this causes the TCP connection between two BGP routers of that AS to time out.

A robust implementation of BGP should have the characteristics that instability of a subset of routes should not affect the route advertisements or forwarding associated with the set of stable routes. Instability should not be caused by peers with high levels of instability or with different CPU speed or load that result in faster or slower processing of routes. These unstable peers should have a bounded impact on the convergence time for generally stable routes.

One of the proposed solutions is the route flap damping. It can prevent a heavy processing load on routers, which in turn delay updates on other routes. A route flapping is exponentially decayed. Damping can also mitigate denial of service attacks.

6. Security concerns in BGP and enhancement

BGP is vulnerable to a variety of attack because of a lack of integrity and authentication of messages. Communication between BGP peers is subject to active or passive wiretapping. The software, configuration information and routing databases of a router may be modified via unauthorized access to a router, thus transform routers into hostile insiders. An enhancement would be improved physical and procedural security for network management facilities and routers.

Another important vulnerability comes for the underlying transport protocol, which is TCP. BGP is vulnerable to the same security attacks that are present in TCP, such as SYN flooding attack sends many TCP SYN message toward a server and forces it to establish many connections and exhausts its resources like memory and bandwidth.

Attackers can also disrupt these TCP connections and pretend to be a legitimate peer router. Since the mechanism defined in the RFC did not provide peer-entity authentication, these connections may be subject to some forms of replay attacks that will not be detected at the TCP layer. Such attacks might result in delivery of spoofed BGP messages.

As an attacker can generate false flapping to cause a victim’s prefix to be damped. To enhance it, it is recommended merely change parameters to more conservative values, there should be no increase in risk. It should slightly mitigate the false flapping attack.

Besides, a different key should be used for communication with each protected peer. Otherwise, if the same key is used for multiple peers, there is an increased risk of compromise of one router that adversely affects other routers.

In order to enhance it, the keys used for MAC computation should be changed periodically, at most 90 days, to minimize the impact of a key compromise or successful cryptanalytic attack. Furthermore, the key should be chosen to be difficult for an attacker to guess.

--

--

Victor Leung

I write about business, technology and personal development