Voice over Internet Protocol (VoIP) is a methodology and group of technologies for the delivery of voice communications and multimedia sessions over Internet Protocol (IP) networks, such as the Internet.
The voice over IP (VoIP) protocol suite is generally divided into two categories: control plane protocols and data plane protocols. The control plane portion of the VoIP protocol suite is the traffic required to connect and maintain the actual user traffic. It is also responsible for maintaining overall network operation (router-to-router communications). The data plane (voice) portion of the VoIP protocol suite is the actual traffic that needs to get from one end to the other.
Within the VoIP suite of protocols, voice packets are commonly referred to as the data plane. Likewise, signaling packets are commonly referred to as the control plane. This document examines the VoIP protocol suite in two parts: data plane protocols and control plane protocols.
As its name implies, VoIP uses IP as its basic transport method. VoIP uses the Transmission Control Protocol (TCP), the User Datagram Protocol (UDP), and the Stream Control Transmission Protocol (SCTP) over IP. The following diagram shows the protocol stack for a VoIP network.
It is important to note that VoIP works with any protocol stack that supports IP. End users of VoIP can add enterprise VoIP systems to their existing infrastructure relatively quickly and easily.
Both the Real-time Transport Protocol (RTP) and Compressed RTP (cRTP) can be used with any of the control plane protocols defined in this document. Since the voice traffic within a VoIP network can often take a different path than the signaling traffic, it makes sense that they are independent protocols.
RTP is the protocol that supports user voice. Each RTP packet contains a small sample of the voice conversation.
The size of the packet and the size of the voice sample inside the packet will depend on the CODEC used.
The following diagram shows the RTP stack.
RTP information is encapsulated in a UDP packet. If an RTP packet is lost or dropped by the network, it will not be retransmitted (as is standard for UDP). This is because a user would not want a long pause or delay in the conversation while the network or the phone requests lost packets.
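To make the encapsulation concrete, the following is a minimal sketch of the 12-byte fixed RTP header defined in RFC 3550, built with Python's standard `struct` module. The function names and the sample values (sequence number, timestamp, SSRC) are illustrative assumptions, not part of any particular phone's implementation. Payload type 0 is the static assignment for G.711 mu-law.

```python
import struct

RTP_VERSION = 2

def pack_rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Build the 12-byte fixed RTP header (no CSRC list, no extension)."""
    byte0 = RTP_VERSION << 6                      # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

def unpack_rtp_header(data):
    """Parse the fixed header fields from the start of an RTP packet."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }

# Hypothetical packet: header plus 160 bytes of placeholder voice payload.
header = pack_rtp_header(payload_type=0, seq=1, timestamp=160, ssrc=0x12345678)
packet = header + bytes(160)
print(unpack_rtp_header(packet))
```

A sender would hand `packet` to a UDP socket and move on. The sequence number lets the receiver detect a gap, but nothing in the header asks for a resend.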
A variant of RTP is compressed RTP (cRTP), which eliminates much of the overall packet header. By eliminating this overhead, a more efficient packet is placed onto the network.
Compressed RTP is used on point-to-point wide area network (WAN) links. Point-to-point, in this case, does not imply a PPP Layer 2 framing format. The link layer may be any standard WAN link layer protocol (frame relay, HDLC, PPP, or Cisco HDLC).
[Figure: a standard RTP packet, with a 40-byte IP+UDP+RTP header, compared with a compressed RTP packet]
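The saving can be estimated with a few lines of arithmetic. This sketch compares the share of each packet spent on headers, using the 40-byte IP+UDP+RTP header quoted above. The 2-byte compressed header is an assumption based on the best case described in RFC 2508 (4 bytes when UDP checksums are kept), and the 20-byte payload is a hypothetical value.

```python
RTP_HEADER_BYTES = 40    # IP + UDP + RTP header, as quoted above
CRTP_HEADER_BYTES = 2    # assumption: cRTP best case (4 bytes with UDP checksum)

def header_overhead(payload_bytes, header_bytes):
    """Return the fraction of each packet that is header rather than voice."""
    return header_bytes / (header_bytes + payload_bytes)

payload = 20  # hypothetical voice payload size in bytes
print(f"RTP : {header_overhead(payload, RTP_HEADER_BYTES):.0%} header")
print(f"cRTP: {header_overhead(payload, CRTP_HEADER_BYTES):.0%} header")
```

With a small payload, the uncompressed header is larger than the voice sample it carries, which is why compression pays off most on slow WAN links.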
RTP Control Protocol (RTCP) is a data plane protocol that is not always used. This protocol allows the endpoints to communicate directly regarding the quality of the call. RTCP gives the endpoints the ability to adjust the call in real time to increase its quality.
RTP Control Protocol Extended Reports (RTCP XR) is a newer extension of the RTCP concept. It defines a set of metrics that can be inexpensively added to call managers, call gateways and IP phones for call quality analysis. RTCP XR messages are exchanged periodically between IP phones and gateways.
There is a wide range of voice CODECs (coder/decoder or compression/decompression) available for VoIP phone implementation. The most common voice CODECs include G.711, G.723.1, G.726, G.728, and G.729. A brief description of each CODEC follows:
G.711 – Converts voice into a 64 kbps voice stream. This is the same CODEC used in traditional TDM T1 voice. It is considered the highest quality.
G.723.1 – Two different types of G.723.1 compression exist. One type uses an Algebraic Code Excited Linear Prediction (ACELP) compression algorithm and has a bit rate of 5.3 kbps. The other type uses a Multi-Pulse Maximum Likelihood Quantization (MP-MLQ) algorithm and provides better quality sound. This type has a bit rate of 6.3 kbps.
G.726 – Allows for several different bit rates, including 40, 32, 24, and 16 kbps. It works well with packet to private branch exchange (PBX) interconnections. It is most commonly deployed at 32 kbps.
G.728 – Provides good voice quality and is specifically designed for low latency applications. It compresses voice into a 16 kbps stream.
G.729 – One of the better voice quality CODECs. It converts voice into an 8 kbps stream. There are two versions of this CODEC, G.729 and G.729a. G.729a uses a simplified algorithm, so the end phones need less processing power for a similar level of quality.
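The bit rates listed above translate directly into the size of the voice sample carried in each packet. The sketch below assumes a 20 ms packetization interval, a common default but not a value taken from this document. G.723.1 is left out because it works on 30 ms frames and does not divide evenly at 20 ms.

```python
# Bit rates in kbps, taken from the CODEC descriptions above.
CODEC_KBPS = {
    "G.711": 64,
    "G.726": 32,   # most commonly deployed rate
    "G.728": 16,
    "G.729": 8,
}

def payload_bytes(bit_rate_kbps, interval_ms=20):
    """Voice payload per packet: kbit/s times ms gives bits; divide by 8 for bytes."""
    return bit_rate_kbps * interval_ms / 8

for name, kbps in CODEC_KBPS.items():
    print(f"{name}: {payload_bytes(kbps):.0f} bytes per 20 ms packet")
```

Doubling the interval doubles the payload and halves the packet rate, at the cost of extra delay before each packet can be sent.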
The following subsection describes several factors that can affect the quality of a VoIP call in an operational environment.
The choice of CODEC is the first factor in determining the quality of a call. Generally, the higher the bit rate used for the CODEC, the better the voice quality. Higher bit rate CODECs, however, consume more bandwidth and therefore allow fewer total calls on the network.
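The trade-off between bit rate and call capacity can be sketched numerically. The example assumes a 20 ms packetization interval, counts one direction only, ignores link-layer framing, and uses a hypothetical 1,536,000 bit/s link. The 40-byte header is the IP+UDP+RTP figure quoted earlier.

```python
HEADER_BYTES = 40  # IP + UDP + RTP header size quoted earlier

def call_bandwidth_bps(codec_kbps, interval_ms=20):
    """One-way IP bandwidth of a single call, excluding link-layer framing."""
    packets_per_second = 1000 / interval_ms
    payload = codec_kbps * interval_ms / 8        # voice bytes per packet
    return (payload + HEADER_BYTES) * 8 * packets_per_second

def max_calls(link_bps, codec_kbps, interval_ms=20):
    """Whole number of simultaneous calls that fit on the link."""
    return int(link_bps // call_bandwidth_bps(codec_kbps, interval_ms))

link = 1_536_000  # hypothetical link capacity in bit/s
for name, kbps in (("G.711", 64), ("G.729", 8)):
    print(name, call_bandwidth_bps(kbps), "bit/s per call,",
          max_calls(link, kbps), "calls")
```

Note that header overhead narrows the gap: the lower-rate CODEC uses one eighth of the voice bit rate but still needs well over one eighth of the bandwidth per call.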
The biggest factor in call quality is the design, implementation, and use of the network that the voice calls are riding on. There are several ways a network can affect a VoIP call, including packet jitter, packet loss, and delay.
Packet jitter is caused by changes in the inter-arrival gap between packets at the endpoint.
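One common way to quantify this variation is the running interarrival jitter estimate from RFC 3550, which smooths the change in packet spacing with a gain of 1/16. The sketch below is a simplified illustration: it takes send and receive times in milliseconds as plain lists (the variable names and sample values are assumptions), whereas a real receiver works in RTP timestamp units.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate in the style of RFC 3550.

    For each consecutive pair of packets, D is the difference between the
    spacing at the receiver and the spacing at the sender.
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter

sent = [0, 20, 40, 60]             # packets sent every 20 ms
steady = [5, 25, 45, 65]           # constant network delay: no jitter
uneven = [5, 25, 50, 65]           # third packet arrives 5 ms late
print(interarrival_jitter(sent, steady))
print(interarrival_jitter(sent, uneven))
```

A constant delay, however large, produces zero jitter. Only variation in spacing counts.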
Packet loss is the actual loss of voice packets through a network. It is often caused by congestion at one or more points along the path of the voice call or by a poor quality link.
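A receiver can estimate loss from gaps in the RTP sequence numbers it sees. The following is a minimal sketch under two stated assumptions: the sample does not span a 16-bit sequence number wraparound, and packets lost before the first or after the last received packet go unnoticed. The function name and sample data are hypothetical.

```python
def loss_stats(received_seqs):
    """Return (packets lost, loss fraction) for a sample of sequence numbers.

    Assumes no sequence number wraparound within the sample.
    """
    unique = set(received_seqs)                 # ignore duplicates
    expected = max(unique) - min(unique) + 1    # packets the sender emitted in this span
    lost = expected - len(unique)
    return lost, lost / expected

lost, fraction = loss_stats([1, 2, 4, 5, 8])
print(f"lost={lost} fraction={fraction:.1%}")
```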
Delay, sometimes referred to as envelope delay, is the time it takes for the voice to travel from the handset of one phone to the earpiece of the other phone. Envelope delay is the sum of the delay caused by the chosen CODEC, the jitter buffer in the phone, and the time it takes for the packets to get through the network.
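Because the total is a simple sum, a delay budget is easy to check. In the sketch below, the three component values are hypothetical, and the 150 ms target is the one-way guideline from ITU-T G.114 rather than a figure from this document.

```python
G114_ONE_WAY_TARGET_MS = 150  # ITU-T G.114 guideline for one-way delay

def one_way_delay_ms(codec_ms, jitter_buffer_ms, network_ms):
    """Sum the three delay components described above."""
    return codec_ms + jitter_buffer_ms + network_ms

# Hypothetical budget for a single call.
total = one_way_delay_ms(codec_ms=25, jitter_buffer_ms=40, network_ms=60)
status = "within target" if total <= G114_ONE_WAY_TARGET_MS else "over target"
print(f"{total} ms one way, {status}")
```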
Echo is a common problem for VoIP networks. It is important to note that, unlike packet jitter, packet loss, and delay, echo is not caused by the IP network. Echo is an analog impairment.
There are two types of echo on analog voice networks: hybrid echo and acoustic echo.
Hybrid echo is generated by impedance mismatches at various analog or digital points on the network.
Acoustic echo is generated at the phone. It occurs when the voice leaving the speaker is picked up by the microphone.
The control plane is used for the various signaling protocols, allowing users of VoIP to connect their phone calls. There are several different types of VoIP signaling available today, including H.323, SIP, SCCP, and SIGTRAN.
H.323 was the first widely adopted and deployed VoIP protocol suite. The H.323 standard was developed by the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) for transmitting audio and video over the Internet.
The overall protocol stack for H.323 (see below) is made up of several parts. Each part is responsible for specific tasks, such as call setup and phone registration.
H.245 is the media control portion of the H.323 protocol suite that establishes a logical channel for each call (endpoint to endpoint). During H.245 negotiation, each endpoint exchanges its capabilities and preferences. The choice of CODEC for the call is part of this exchange.
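The outcome of this exchange can be illustrated with a simplified sketch. It does not model real H.245 messages, which are ASN.1 encoded and involve more steps. It only shows one plausible selection rule, assumed here for illustration: pick the first CODEC in the local preference order that the remote endpoint also supports. The function name and the capability lists are hypothetical.

```python
def choose_codec(local_prefs, remote_caps):
    """Return the first locally preferred CODEC that the remote side supports.

    Returns None when the two endpoints share no CODEC.
    """
    remote = set(remote_caps)
    for codec in local_prefs:
        if codec in remote:
            return codec
    return None

# Hypothetical capability sets for two endpoints.
phone_a = ["G.729", "G.711"]            # in order of preference
phone_b = ["G.711", "G.723.1"]
print(choose_codec(phone_a, phone_b))
```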
H.225 represents the basic signaling messages that are also used when dealing with ISDN or GR-303.
For H.225, these messages include setup, alerting, connect, call proceeding, release complete, and facility messages that are based on the Q.931 signaling scheme, as defined below.
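As a reference aid, the messages named above can be written as an enumeration keyed by their Q.931 message type codes. The class name and the sample call sequence are illustrative assumptions. The sequence shows one typical ordering for a successful call, not the only valid one.

```python
from enum import IntEnum

class Q931MessageType(IntEnum):
    """Q.931 message type codes for the H.225 messages named above."""
    ALERTING = 0x01
    CALL_PROCEEDING = 0x02
    SETUP = 0x05
    CONNECT = 0x07
    RELEASE_COMPLETE = 0x5A
    FACILITY = 0x62

# One typical ordering for a call that is answered and later torn down.
typical_call = [
    Q931MessageType.SETUP,             # caller requests the call
    Q931MessageType.CALL_PROCEEDING,   # callee acknowledges the request
    Q931MessageType.ALERTING,          # called phone is ringing
    Q931MessageType.CONNECT,           # call is answered
    Q931MessageType.RELEASE_COMPLETE,  # call is torn down
]
for message in typical_call:
    print(f"0x{message.value:02X} {message.name}")
```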
RAS (registration, admission, and status) protocol deals with element (phone) management. The RAS logical channel is established between the IP phones and the gatekeeper that manages those phones.
Session Initiation Protocol (SIP) is designed to manage and establish multimedia sessions, such as video conferencing, voice calls and data sharing.
There are several key features of SIP that make it so attractive:
1. URL addressing scheme – Allows for number portability that is independent of physical location. Addressing can be a phone number, an IP address, or an e-mail address. The messages are very similar to those used by the Internet (HTTP).
2. Multimedia – SIP can have multiple media sessions during one call, which means that users can share a game, instant message (IM), and talk at the same time.
3. It is a “light” protocol and is easily scalable.
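The HTTP-like text format mentioned in the list above can be seen in a minimal INVITE request. This sketch assembles one following the general message layout of RFC 3261. The function name, addresses, host, Call-ID, branch, and tag values are all hypothetical, and a real INVITE would normally carry a session description in its body rather than an empty one.

```python
def build_invite(caller, callee, host, call_id, branch, tag):
    """Assemble a bare SIP INVITE with an empty body.

    Header lines are separated by CRLF and the header section ends
    with a blank line, as in HTTP.
    """
    lines = [
        f"INVITE sip:{callee} SIP/2.0",
        f"Via: SIP/2.0/UDP {host};branch={branch}",
        "Max-Forwards: 70",
        f"To: <sip:{callee}>",
        f"From: <sip:{caller}>;tag={tag}",
        f"Call-ID: {call_id}",
        "CSeq: 1 INVITE",
        f"Contact: <sip:{caller}>",
        "Content-Length: 0",
    ]
    return "\r\n".join(lines) + "\r\n\r\n"

# All values below are placeholders for illustration.
message = build_invite(
    caller="alice@example.com",
    callee="bob@example.com",
    host="pc1.example.com",
    call_id="a84b4c76e66710@pc1.example.com",
    branch="z9hG4bK776asdhds",
    tag="1928301774",
)
print(message)
```

The request line carries the URL-style address of the called party, which is what makes the addressing independent of physical location.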
The two components that make up a SIP system are user agents and network servers.
User agents represent the phone (user agent client) and the server (user agent server).
The user agent client (UAC) initiates media calls. The user agent server (UAS) responds to those requests for setup on behalf of the UAC. The UAS is also responsible for finding the destination UAC or intermediate UAS.
Network servers include redirect, proxy, and registrar servers. Redirect servers do not process calls; they only respond with the address of the next server. The registrar server registers new clients in the database and updates other databases.
The Sigtran signaling protocol carries SS7 over SCTP. Since SCTP carries traditional SS7 traffic, it must meet the same guidelines defined for SS7.
These guidelines include:
1. It must be compatible with UDP.
2. It must support acknowledged and error-free transfer of data.
3. It must support the segmentation of SS7 messages.
4. It must allow for network-level fault tolerance.
The Sigtran protocol stack is as follows:
SS7 Application Layer