Calculating word counts for translation can be a difficult topic to understand. Because word counts are important for translation pricing and for predicting translation project timelines, we want our clients to be better informed about the process. Today we will look deeper behind the computer code and specifications that help translation companies provide accurate word counts for the documents they receive for translation.

In part 1 of our post on translation word counts, we discussed a standard for translation word counts known as GMX-V (Global Information Metrics Exchange for Volume). Today we will explore the specifics of how GMX-V defines the word counts in documents submitted for translation. Come with us as we dive into the code behind what defines word counts in any language.

How does a localization company know what the word counts are? What makes word counts verifiable? How do you count words in a logographic language (also known as a character-based language)? To answer these questions, we can start with the GMX-V specification. As stated in the specification, “Word and character counts are governed by Unicode TR 29 Version 4.1.0 – Text Boundaries, Section 4 Word Boundaries, which in turn relies on the ”. Unicode is the standard that defines how text characters are represented on a computer. When you press a key on your keyboard, Unicode works behind the scenes to identify which character is stored and rendered on the screen. This is the starting point for calculating word counts for translation.

Defining word boundaries in translation

Let’s dig into the specification. The first section focuses on the high-level definition of Unicode TR 29 word boundaries. Word boundaries are basically a set of language-specific rules that clearly define where a word starts and ends. The diagram below provides a pretty good visual example of word boundaries.

The picture illustrates word breaks, and most of us can logically reason that what lies between these breaks counts as a word, but to a computer these boundaries are much harder to distinguish. Through the “specification”, GMX-V can take advantage of these default word boundaries and also allows for custom boundaries if flexibility is needed.
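To make the idea of word boundaries more concrete, here is a minimal Python sketch. It is not an implementation of TR 29; it assumes a deliberately simplified rule (a word is a run of word characters, optionally joined by an internal apostrophe or period), and the names `WORD_PATTERN` and `approximate_words` are hypothetical.

```python
import re

# ASSUMPTION: a simplified stand-in for the TR 29 default word boundaries,
# not the real rule set. A "word" here is a run of word characters (letters,
# digits, underscore) that may contain an internal apostrophe or period,
# so "can’t" and "32.3" each stay in one piece.
WORD_PATTERN = re.compile(r"\w+(?:['’.]\w+)*")


def approximate_words(text):
    """Return the list of approximate words found in text."""
    return WORD_PATTERN.findall(text)


sample = "The quick (“brown”) fox can’t jump 32.3 feet, right?"
words = approximate_words(sample)
print(words)
print(len(words))  # 9
```

Punctuation and spaces fall outside the matches, which is roughly the behavior the word breaks describe. A production tool would rely on a full TR 29 implementation rather than a regular expression like this one.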

Calculating word counts for translation of logographic languages

What are logographic languages? Logographic languages are written languages that use a letter, symbol, or sign to represent an entire word. Some examples of logographic languages are Chinese, Japanese, Korean and Thai. When dealing with languages like these, it can sometimes be difficult to calculate word counts. To illustrate the complications with logographic languages, let’s look at a simple example in Chinese.

If we translate “the house is red” into Chinese we get 房子是红色的。There are four words in the English sentence “the house is red.” How many words are in the Chinese? Well, there are 7 characters, counting the closing full stop (。). And guess what? There aren’t any spaces between any of these characters! This is where word counting starts to get tricky.
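A short Python sketch shows why the usual trick of splitting on spaces breaks down here. The function name `whitespace_word_count` is hypothetical and is used only for illustration.

```python
def whitespace_word_count(text):
    """Count words by splitting on whitespace, the naive approach."""
    return len(text.split())


english = "the house is red"
chinese = "房子是红色的。"

print(whitespace_word_count(english))  # 4
print(whitespace_word_count(chinese))  # 1, because there are no spaces to split on
print(len(chinese))                    # 7 characters, including the full stop
```

The naive count treats the whole Chinese sentence as a single word, so a different method is needed for text written without spaces.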

So here is how TR 29 and GMX-V handle it. The character count is divided by a predefined number for the language, a figure acknowledged as best practice within the localization industry. Here are the numbers for the provided languages as listed in the GMX-V specification:

  • Chinese: 2.8
  • Japanese: 3.0
  • Korean: 3.3
  • Thai: 6.0

Back to our example. We have a sentence in Chinese with seven characters. If we divide that by the predefined number for Chinese (2.8), we get 2.5 words. Although it is hard to see how this method is accurate with such a small example, keep in mind that with a much larger document, and more words to sample, these counts become much more accurate. Unfortunately, there are no industry standard numbers for Lao, Khmer and Myanmar. For these languages, GMX-V requires only character counts (or a custom-defined count).
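The calculation above can be sketched in a few lines of Python. The factors are the ones listed in this post; the language codes used as dictionary keys, the names `CHARS_PER_WORD` and `estimate_word_count`, and the choice to count every non-space character (including the full stop, to match the seven characters in our example) are assumptions made for illustration, not something taken from the GMX-V specification.

```python
# Factors as listed in this post. The language-code keys are an assumption.
CHARS_PER_WORD = {
    "zh": 2.8,  # Chinese
    "ja": 3.0,  # Japanese
    "ko": 3.3,  # Korean
    "th": 6.0,  # Thai
}


def estimate_word_count(text, language):
    """Estimate a word count by dividing the character count by a factor."""
    factor = CHARS_PER_WORD.get(language)
    if factor is None:
        # No factor is listed (for example Lao, Khmer, Myanmar), so only a
        # character count or a custom-defined count can be reported.
        raise ValueError(f"no word count factor for language {language!r}")
    # ASSUMPTION: every non-space character counts, punctuation included.
    char_count = sum(1 for ch in text if not ch.isspace())
    return char_count / factor


print(round(estimate_word_count("房子是红色的。", "zh"), 1))  # 2.5
```

Passing a language with no listed factor raises an error in this sketch, which mirrors the point that only character counts are available for those languages.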

We definitely didn’t discuss every detail and nuance in the GMX-V and TR 29 specifications, but we hope we’ve touched on enough for you to have a better understanding of what you are looking at when you encounter GMX-V content. This should give you some insight into how verifiable translation metrics are actually calculated behind the scenes. If you would like more information, make sure to check out our previous post on GMX-V or reach out to us at