Anatomy of a Short Message


October 2, 2017 5 min
Aivis Olsteins


It is a well-known fact that a Short Message (SMS) can hold somewhere around 140 to 160 characters. Some devices or phones insist on 160, some on fewer. Twitter, whose origins are closely tied to short messages, has a limit of 140 characters. Speakers of languages not based on the Latin script will say the number is smaller still. Why the ambiguity, and why exactly these numbers? And are these numbers correct at all?

First, let's start with the history. SMS dates back to the time when the primary function of a phone was voice calls. There were no keyboards and no touch screens; the only input method was the numeric keypad, with several letters assigned to each key and the required letter selected by pressing the same key repeatedly. Text entry this way is slow, and the number of letters entered is correspondingly small. SMS was initially a part of GSM networks only, which were deployed in European countries, so it was deemed sufficient to support a limited set of Latin letters, digits, and some special characters. By analogy with ASCII, which covers most needs of Latin-based writing systems in a 7-bit encoding space, the SMS encoding scheme also used 7 bits per character and covered mostly the same characters as ASCII, with some exceptions. For example, most of the control characters in the ASCII range 0x00 ... 0x1F were replaced by characters from European languages that fall outside ASCII (Greek, Nordic, Spanish, etc.).

Without going into technical details, the resulting protocol allowed 160 characters of text to be transmitted in 7-bit encoding. Thus, the total space allocated for user data is 160 x 7 = 1120 bits, or 140 octets. That limit still stands for SMS today, and all further developments and variations build on it.
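To make the arithmetic concrete, here is a minimal Python sketch of how 7-bit values can be packed back to back into octets, with the least significant bits first. The function name pack_septets is illustrative only, and the input is assumed to be a list of 7-bit code values that has already been mapped from text.

```python
def pack_septets(septets):
    """Pack 7-bit values into octets, least significant bits first."""
    bits = 0    # bit accumulator
    nbits = 0   # number of valid bits currently in the accumulator
    out = bytearray()
    for s in septets:
        if not 0 <= s <= 0x7F:
            raise ValueError("not a 7-bit value")
        bits |= s << nbits      # append the septet above the bits already held
        nbits += 7
        while nbits >= 8:       # flush every complete octet
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:                   # leftover bits, zero-padded at the top
        out.append(bits & 0xFF)
    return bytes(out)


# 160 septets occupy exactly 1120 bits, that is 140 octets
print(len(pack_septets([0] * 160)))  # prints 140
```

The letters of "hello" have the same codes in the 7-bit SMS alphabet as in ASCII, so pack_septets([0x68, 0x65, 0x6C, 0x6C, 0x6F]) gives the five octets e8 32 9b fd 06 instead of the five full bytes an 8-bit encoding would use.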

As SMS grew in popularity, two problems became clear: 1) many languages could not use it because their characters were not supported, and 2) the 160-character limit was too small.

1. Language support

The solution to the first problem seems easy at first glance, but it comes at a cost: let's take the same 1120-bit space and represent each character with 8 bits. That allows a much wider range of Latin-alphabet characters to be represented, including those found in Nordic, Spanish and other languages. The available size is reduced accordingly, to exactly 1120 / 8 = 140 characters. That is still far from covering all languages: Russian, Chinese, Arabic, etc. are still not covered. By using the same method and encoding the text with UCS-2 (a 16-bit encoding), it becomes possible to cover most of the widely used languages of the world. This is the most widely used encoding for messages containing non-ASCII and/or non-Latin characters. The cost: message size is reduced further, to only 1120 / 16 = 70 characters.
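The trade-off between character coverage and message size can be sketched in a few lines of Python. This is a simplified illustration, not a complete implementation: the names capacity and pick_encoding are hypothetical, the 7-bit check covers only the basic GSM 03.38 character set (no extension or shift tables), and any other text is assumed to fall back directly to UCS-2.

```python
PAYLOAD_BITS = 160 * 7  # 1120 bits of user data per message

# Basic GSM 03.38 character set (extension table and escape code left out)
GSM_BASIC = frozenset(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)


def capacity(bits_per_char):
    """Number of characters that fit into one message."""
    return PAYLOAD_BITS // bits_per_char


def pick_encoding(text):
    """Return (name, bits per character) for the given text.

    Assumption: text is either fully covered by the basic 7-bit set
    or is sent as UCS-2.
    """
    if all(ch in GSM_BASIC for ch in text):
        return ("GSM 7-bit", 7)
    return ("UCS-2", 16)


for sample in ("Hola señor", "más", "Привет"):
    name, bits = pick_encoding(sample)
    print(sample, name, capacity(bits))
```

Under these assumptions "Hola señor" stays in the 7-bit alphabet with room for 160 characters, while "más" and "Привет" both drop to 70 characters per message because of a single unsupported letter.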

2. Longer messages

Now it becomes evident that there must be a way to send longer messages. Even 160 characters was not much, and for some languages 70 is simply insufficient. What if we seamlessly split longer messages behind the scenes, transmit the parts separately, and concatenate them on the receiving side? This approach requires no modification of the transmission infrastructure, which changes far less often than user handsets and costs much more to upgrade or replace. The implementation seems obvious, except that the message itself has no field or indicator that can signal that it is part of a multipart message and should be reassembled on receipt. The solution was to "eat" a small piece from the beginning of the message body and use it as a special header describing what kind of message it is. It is called the User Data Header (UDH), and apart from telling the receiving side that this is part X of a multipart message, it has other functions which we will not touch on here. This approach shortens each part of a concatenated message by at least 48 bits, so the resulting lengths per part are as follows:

For a 7-bit encoded message: 160 characters for a complete message, 153 for a part of a multipart message (7 x 7 = 49 bits less: the 48-bit header plus 1 bit of padding)

For an 8-bit encoded message: 140 characters for a complete message, 134 for a part of a multipart message (6 x 8 = 48 bits less)

For a 16-bit encoded message: 70 characters for a complete message, 67 for a part of a multipart message (3 x 16 = 48 bits less)
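The per-part figures above follow directly from the header size. The Python sketch below builds the common 6-octet concatenation header (the variant with an 8-bit reference number) and derives the remaining capacity for each encoding. The function names concat_udh and chars_per_part are illustrative, and the sketch assumes the header carries nothing besides the concatenation element.

```python
PAYLOAD_BITS = 1120  # user data space of a single message


def concat_udh(ref, total, seq):
    """Build a 6-octet UDH for a concatenated message (8-bit reference)."""
    if not (0 <= ref <= 255 and 1 <= seq <= total <= 255):
        raise ValueError("values out of range")
    # 0x05: length of the header that follows
    # 0x00: information element "concatenated short message"
    # 0x03: length of the element data (reference, total parts, part number)
    return bytes([0x05, 0x00, 0x03, ref, total, seq])


def chars_per_part(bits_per_char, udh_len=6):
    """Characters left in one part once the header is taken out.

    Floor division drops the bits that cannot hold a whole character,
    which is where the extra padding bit of the 7-bit case goes.
    """
    return (PAYLOAD_BITS - udh_len * 8) // bits_per_char


print(concat_udh(0x2A, 3, 1).hex())        # prints 0500032a0301
for bits in (7, 8, 16):
    print(bits, chars_per_part(bits))      # 153, 134 and 67
```

All parts of one message share the same reference number, which is how the receiving side knows which parts belong together before ordering them by part number.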

 


 

Technically, this solves the problem of sending messages of arbitrary length in almost any language of the world. The infrastructure does not change and can remain transparent to the content. In most cases it is, as we can see from the fact that most mobile operators charge per message part, regardless of how many characters are actually sent.
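Since billing usually follows the number of parts, it is useful to estimate that number up front. The following Python sketch uses the single-message and per-part limits listed earlier. The name count_parts is hypothetical, and the length is assumed to be given in encoded units (septets, octets, or 16-bit units), so characters that need more than one unit must be counted accordingly by the caller.

```python
# bits per character -> (limit for a single message, limit per part)
LIMITS = {7: (160, 153), 8: (140, 134), 16: (70, 67)}


def count_parts(num_chars, bits_per_char):
    """Number of message parts needed for a text of the given length."""
    single, per_part = LIMITS[bits_per_char]
    if num_chars <= single:
        return 1                        # fits without a header
    return -(-num_chars // per_part)    # ceiling division


print(count_parts(160, 7))   # prints 1
print(count_parts(161, 7))   # prints 2
print(count_parts(71, 16))   # prints 2
```

Note the jump at the boundary: one character over the single-message limit doubles the number of billed parts, because both parts now have to carry the header.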

Below are some examples of how a few letters are encoded:

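Letter | Description                | UTF-16 | UTF-8            | GSM 03.38 (7-bit)
-------+----------------------------+--------+------------------+----------------------------------------------
ñ      | Spanish small n with tilde | U+00F1 | 0xC3 0xB1 (c3b1) | 0x7D
á      | Small a acute              | U+00E1 | 0xC3 0xA1 (c3a1) | Not present, available via shift table + 0x61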
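The Unicode columns of these examples can be reproduced with Python's built-in codecs. The helper name describe is illustrative. The GSM 03.38 column is not covered here, since the Python standard library has no codec for that alphabet.

```python
def describe(ch):
    """Return code point, UTF-16 (big-endian) and UTF-8 bytes as hex."""
    return (
        f"U+{ord(ch):04X}",
        ch.encode("utf-16-be").hex(),
        ch.encode("utf-8").hex(),
    )


for letter in ("ñ", "á"):
    print(letter, *describe(letter))
# ñ U+00F1 00f1 c3b1
# á U+00E1 00e1 c3a1
```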

 

Reference:

GSM 03.38 page on Wikipedia

 

 

 

