How to Calculate Word Counts for Translation

Calculating word counts for translation can be a difficult topic to understand. Because word counts are important for translation pricing and predicting translation project timelines, we want our clients to be more informed on the process. Today we will look deeper behind the computer code and specifications that help translation companies provide accurate word counts in the documents they receive for translation.

Today we will explore the specifics of how GMX-V defines word counts in documents submitted for translation. Come with us as we dive into the code behind what defines word counts in any language.

How does a localization company know what the word counts are? What makes word counts verifiable? How do you count words in a logographic language (also known as a character-based language)? To answer these questions we can start with the GMX-V specification. As stated in the specification, “Word and character counts are governed by Unicode TR 29 Version 4.1.0 – Text Boundaries, Section 4 Word Boundaries, which in turn relies on the ”. Unicode is the standard that defines how a computer represents text. When you press a key on your keyboard, Unicode works behind the scenes to identify the character that is rendered on the screen. These character definitions are the starting point for calculating word counts for translation.

Defining word boundaries in translation

Let’s dig into the specification. If we take a look at the first section, the focus is on the high-level definition of Unicode TR 29 word boundaries. Word boundaries are basically a set of language specific rules that clearly define where a word starts and ends. The diagram below provides a pretty good visual example of word boundaries.

The picture illustrates word breaks. Most of us can logically reason that whatever lies between these breaks is a word, but for a computer this boundary is much harder to distinguish. Through the specification, GMX-V can take advantage of these default word boundaries while also allowing custom boundaries where flexibility is needed.
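To make the idea of a word boundary a little more concrete, here is a minimal Python sketch. It is only a rough approximation built on a regular expression, not an implementation of the TR 29 rules. The function name `rough_words`, the pattern and the sample sentence are our own, chosen purely for illustration.

```python
import re

# Rough approximation only: a run of letters/digits, optionally joined by an
# apostrophe or full stop that sits between two such runs ("can’t", "32.3").
# The real TR 29 rules cover many more cases than this pattern does.
WORD_PATTERN = re.compile(r"\w+(?:['’.]\w+)*")


def rough_words(text: str) -> list[str]:
    """Return the word-like tokens found in text."""
    return WORD_PATTERN.findall(text)


sample = "The quick (“brown”) fox can’t jump 32.3 feet, right?"
words = rough_words(sample)
print(words)       # the tokens that lie between the breaks
print(len(words))  # how many of them there are
```

Spaces, brackets, quotation marks and the trailing comma and question mark all fall outside the tokens, which is the kind of distinction a word boundary rule has to make explicit for a computer.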

Calculating word counts for translation of logographic languages

What are logographic languages? Logographic languages are written languages that use a letter, symbol, or sign to represent an entire word. Some examples of logographic languages are Chinese, Japanese, Korean and Thai. When dealing with languages like this it can sometimes be difficult to calculate word counts. To illustrate the complications with logographic languages, let’s look at a simple example in Chinese.

If we translate “the house is red” into Chinese we get 房子是红色的。 There are four words in the English sentence “the house is red.” How many words are in the Chinese? Well, there are 7 characters, counting the closing full stop. And guess what? There aren’t any spaces between any of these characters! This is where word counting starts to get tricky.

So here is how TR 29 and GMX-V handle it. The character count is divided by a predefined number for the language, a factor acknowledged as best practice within the localization industry. Here are the numbers for the provided languages as listed in the GMX-V specification:

  • Chinese: 2.8
  • Japanese: 3.0
  • Korean: 3.3
  • Thai: 6.0

Back to our example: a Chinese sentence with seven characters. If we divide that by the predefined number for Chinese (2.8) we get 2.5 words. Although it is hard to see how this method is accurate with such a small example, keep in mind that with a much larger document, and more words to sample, these counts become much more accurate. Unfortunately, there are no industry standard numbers for Lao, Khmer and Myanmar. For these languages, GMX-V requires only character counts (or a custom defined count).
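The arithmetic above can be sketched in a few lines of Python. This is an illustration under our own assumptions rather than a reference implementation: the function name `estimated_word_count` is made up, the factors are the four listed above, and the character count simply includes every character in the string (full stop included) so that it matches the figure of seven used in this post.

```python
from typing import Optional

# Characters-per-word factors as listed above.
CHARS_PER_WORD = {"Chinese": 2.8, "Japanese": 3.0, "Korean": 3.3, "Thai": 6.0}


def estimated_word_count(char_count: int, language: str) -> Optional[float]:
    """Divide a character count by the language factor.

    Returns None when no factor is listed (for example Lao, Khmer, Myanmar),
    in which case only the character count itself is available.
    """
    factor = CHARS_PER_WORD.get(language)
    if factor is None:
        return None
    return char_count / factor


sentence = "房子是红色的。"
chars = len(sentence)  # every character, including the full stop (assumption)
print(chars, estimated_word_count(chars, "Chinese"))
print(estimated_word_count(1000, "Lao"))  # no factor listed for Lao
```

For a larger document the same division applies: 2,800 Chinese characters would be reported as roughly 1,000 words.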

We definitely didn’t discuss every detail and nuance in the GMX-V and TR 29 specifications, but we hope we’ve touched on enough for you to have a better understanding of what you are looking at when you encounter GMX-V content. This should give you some insight into how verifiable translation metrics are actually calculated behind the scenes.

A standard for translation word counts: GMX-V

Knowing the detailed metrics behind each document you submit for translation is an important step in the translation process. The word count of a document affects project pricing and project timelines, and helps supply chain teams determine the linguistic resources required. As you can see, accurate word counts are essential to the translation process. This is why YBD has adopted GMX-V into its software platform.

If word counts are so important, how do translation vendors figure out the metrics of the documents they receive for translation? And how can you be sure, as a requestor, that the word counts are accurate and verifiable? These questions are exactly why YBD has implemented the Global Information Management Metrics Exchange for Volume (GMX-V) as a part of its software platform. Today we will break down what GMX-V is, how it generates reports for documents submitted for translation and what metrics can be collected from the elements within the standard.

What is GMX-V?

GMX-V stands for Global Information Management Metrics Exchange for Volume. In basic terms, it is a localization industry standard that defines the word counts within an XML document in a non-proprietary and verifiable way. More than that, it allows you to see a very detailed breakdown of word count categories, whether they represent text, numbers, format tags or punctuation. This allows for a very detailed representation of the content in a localization project. It is this detailed representation (think of it as a summary of the metrics that make up the document) that contains the word count information linguists and vendors can use to better understand pricing, project timelines and resource needs. Since we are discussing verifiable accuracy, it is the non-proprietary nature of the metrics that GMX-V presents that is of most interest.

Agreeing upon word counts of documents for translation

As anyone involved in translation would know, establishing and agreeing upon word counts can be a very tricky if not contentious process. All of the major industry tools provide their own take on these metrics and the results rarely match one-to-one. Even tools from the same provider differ in their word counting over versions of their core product! This can lead to disputes over the content make-up, becoming particularly problematic when project costs are factored in. Project metrics are the cornerstone of accurate financial estimation, and a transparent approach to support this accuracy is where GMX-V is attempting to find its place among other industry standard tools.

What is the basic structure of GMX-V?

GMX-V is based on XML. XML, or eXtensible Markup Language, is a common format used to store structured information in a file that is easy for both a human and a computer to read. Many file formats of documents submitted for translation are built on XML. Take XHTML for example, which is used to write web pages. XHTML is an XML-based format (we wish HTML followed the XML standard a little closer ourselves). GMX-V is likewise an XML-based format and, as such, has clearly defined specifications and purpose. Here is the purpose of GMX-V as defined in the specification:

  1. To provide an unambiguous specification for counting words and characters for translation related tasks.
  2. To provide a rich set of qualifiers to help accurately define the actual translation workload for translation related tasks.
  3. To provide an XML notation for exchanging Global Information Management metrics for any Global Information Management task whether it entails translation activity or not.

Ultimately, what makes GMX-V so necessary is that during the localization process, if a proper GMX-V analysis is being generated, anyone familiar with the GMX-V specification should be able to understand the metrics that are being counted in their project, from any system. Without such a universal standard in place, it becomes hard to understand every localization tool’s metrics format.

What does GMX-V actually look like?

So what does a GMX-V look like? Once you see the metrics in place, it is actually pretty simple. Here is an example for a single file or resource. If it were for an entire project, there would be a parent resource XML element and above that, a project XML element to allow for more than one file or resource in a project. But to keep things simple, we have used the results from a single file.

An example of GMX-V for a single file/resource.

There are a couple of things to note in this example. Past the first element we reach the “stage” element, which allows the word counts to be maintained for more than one phase in the localization process. For example, the GMX-V specification allows for a user-defined value as well as an initial translation phase state and a final translation phase state. This could be very useful if the word count requirements for translation and translation review are different, or if the content changed between project steps.

If we move into the second count-group we can see it has a name: “verifiable”. There are two important groups in the GMX-V spec for count-group: verifiable and non-verifiable. Verifiable metrics are pre-defined metrics that can be pulled from an XML Localization Interchange File Format (XLIFF) document. Non-verifiable metrics, on the other hand, are metrics that require manual counting to ensure accuracy. Take for example words in an image. Are these words verifiable or non-verifiable? Since words in an image are difficult for a computer to count, they need to be verified by a person. Therefore, they are in the non-verifiable group.

Inside the count-group is the most exciting section of the XML, called the “count type”, which is the bread and butter of GMX-V and maintains all the different counts for all the different count types (word counts, character counts, etc.). In the example above there are three types of counts. These counts are the metrics displayed by GMX-V, pulled from the XLIFF data of documents for translation. Because the metrics have been obtained using a standard and verifiable system (GMX-V), anyone can check them independently.
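The nesting described in this section (a resource containing a stage, which contains a named count-group, which contains individual counts) can be sketched with Python's standard XML library. Please treat this as an illustration only: the element names follow the description in this post, while the attribute names (`phase`, `type`, `value`), the count type labels, the file name and the numbers are hypothetical placeholders that we made up, not values copied from the GMX-V specification.

```python
import xml.etree.ElementTree as ET

# Element names follow the description in this post. Attribute names, count
# type labels and all numbers are hypothetical placeholders, not spec values.
resource = ET.Element("resource", {"name": "example.xlf"})
stage = ET.SubElement(resource, "stage", {"phase": "initial"})
group = ET.SubElement(stage, "count-group", {"name": "verifiable"})
sample_counts = [("WordCount", 120), ("CharacterCount", 640), ("PunctuationCount", 18)]
for count_type, value in sample_counts:
    ET.SubElement(group, "count", {"type": count_type, "value": str(value)})

xml_text = ET.tostring(resource, encoding="unicode")
print(xml_text)


def read_counts(xml_string: str, group_name: str) -> dict[str, int]:
    """Collect the counts stored under the count-group with the given name."""
    root = ET.fromstring(xml_string)
    counts: dict[str, int] = {}
    for grp in root.iter("count-group"):
        if grp.get("name") == group_name:
            for count in grp.findall("count"):
                counts[count.get("type")] = int(count.get("value"))
    return counts


print(read_counts(xml_text, "verifiable"))
```

Because the counts live in plain, named XML elements, any tool that understands the structure can read them back the same way, which is the point of a non-proprietary format.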

A call for a universal standard in localization metrics

Since YBD uses the GMX-V standard, you can be sure that the word counts generated for your submitted documents are accurate, whatever the source and target languages. How does the GMX-V standard work in a real-world scenario? In part two of this blog, we will break down a real-life example of a segment and its respective word count, derived using GMX-V. In that post you will also learn how this process works for logographic languages such as Chinese and Japanese. We hope this post gives you peace of mind that the word counts and other metrics from your translation projects are accurate, verifiable and justified by a universal standard in the localization industry.