
Dealing with encodings in Go

steloflute 2016. 1. 20. 23:30

http://dominik.honnef.co/posts/2012/04/dealing_with_encodings_in_go/



One poorly documented aspect of Go is how it handles string encodings, or rather, how strings are actually represented internally. This article tries to shed some light on the basics, to get you started.

It’s all UTF-8, right?

A source of confusion is the fact that Go source code has to be encoded in UTF-8, meaning that string literals, variable/function names, etc. must consist solely of valid UTF-8. This does not, however, mean that strings in Go may only contain valid UTF-8 data. For example, the following string literal is fine to use, even though it does not describe a valid UTF-8 string: "Hello, \x90\xA2\x8A\x45".
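To make this concrete, here is a minimal check (assuming fmt and unicode/utf8 are imported; utf8.ValidString is introduced further below): the literal compiles and runs fine, and only the validity check reveals the bad bytes.

s := "Hello, \x90\xA2\x8A\x45"   // legal string literal, but not valid UTF-8
fmt.Println(utf8.ValidString(s)) // => false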

In fact, the string datatype is effectively nothing more than an immutable byte array. This also means that using the index operator will return a specific byte, not a rune:

fmt.Printf("0x%x", "世界"[1]) // => 0xb8
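And because strings are immutable, assigning through the index operator does not even compile; to modify string data, convert it to a []byte first, which creates a mutable copy:

b := []byte("Hello")
b[0] = 'J'             // "Hello"[0] = 'J' would be a compile error
fmt.Println(string(b)) // => Jello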

Furthermore, the len() function will also return a string’s length in bytes, not runes:

fmt.Println(len("Hello, 世界")) // => 13

But isn’t Go aware of UTF-8?

Even though strings are mere byte arrays, Go does have proper support for UTF-8. For example, the range operator iterates over a string by yielding runes instead of bytes:

s := "Hello, 世界"
for _, r := range s {
  fmt.Printf("%c ", r) // => H e l l o ,   世 界
}
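Note that the index yielded by range is the byte offset at which each rune starts, not a sequential rune index:

for i, r := range "世界" {
  fmt.Printf("%d:%c ", i, r) // => 0:世 3:界
}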

And by importing the unicode/utf8 package, it is also possible to validate strings or, more importantly, get their rune count (which corresponds to the string’s length as we’d naturally interpret it):

s := "Hello, 世界"
fmt.Println(utf8.RuneCountInString(s)) // => 9
fmt.Println(utf8.ValidString(s))       // => true
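Alternatively, a string can be converted to a []rune, which copies the data but allows rune-based indexing; the utf8 package can also decode individual runes together with their byte width:

rs := []rune(s)
fmt.Println(len(rs))      // => 9
fmt.Printf("%c\n", rs[7]) // => 世

r, size := utf8.DecodeRuneInString("世界")
fmt.Printf("%c %d\n", r, size) // => 世 3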

And how do I deal with other encodings?

Even though Go has good support for UTF-8 (and minimal support for UTF-16), it has no built-in support for any other encoding. If you have to deal with other encodings (e.g. when processing user input), you have to use third-party packages, such as go-charset, which makes it quite easy to convert between different string encodings.

If you want to work with the input data any further, it is best to convert it to UTF-8 first and then use Go’s built-in features:

package main

import (
  "fmt"
  "io/ioutil"
  "strings"
  "unicode/utf8"

  "code.google.com/p/go-charset/charset"
  _ "code.google.com/p/go-charset/data" // include the conversion maps in the binary
)

func main() {
  s := "Hello, \x90\xA2\x8A\x45" // CP932-encoded version of "Hello, 世界"

  // Wrap the input in a reader that converts from CP932 to UTF-8.
  r, err := charset.NewReader("CP932", strings.NewReader(s))
  if err != nil {
    panic(err)
  }
  data, err := ioutil.ReadAll(r)
  if err != nil {
    panic(err)
  }
  s2 := string(data)
  fmt.Println(s2)                         // => Hello, 世界
  fmt.Println(len(s2))                    // => 13
  fmt.Println(utf8.RuneCountInString(s2)) // => 9
  fmt.Println(utf8.ValidString(s2))       // => true
}
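As an aside: the code.google.com import paths above predate the shutdown of Google Code hosting. The golang.org/x/text module (not part of the original article; an alternative suggested here) can perform the same conversion. A minimal sketch using its Shift JIS decoder, which covers the CP932 bytes used in this example:

package main

import (
  "fmt"
  "io/ioutil"
  "strings"
  "unicode/utf8"

  "golang.org/x/text/encoding/japanese"
  "golang.org/x/text/transform"
)

func main() {
  s := "Hello, \x90\xA2\x8A\x45" // the same CP932/Shift JIS bytes as above

  // transform.NewReader decodes from Shift JIS to UTF-8 while reading.
  r := transform.NewReader(strings.NewReader(s), japanese.ShiftJIS.NewDecoder())
  data, err := ioutil.ReadAll(r)
  if err != nil {
    panic(err)
  }
  s2 := string(data)
  fmt.Println(s2)                         // => Hello, 世界
  fmt.Println(utf8.RuneCountInString(s2)) // => 9
}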

Conclusion

Even though this article is rather short, I hope that it was able to convey the following main points:

  1. Strings in Go are mere byte arrays
  2. There is built-in functionality for dealing with UTF-8
  3. Any other encoding is best converted to UTF-8 first

For further reading, I recommend a discussion on the Go mailing list, which also sheds light on why strings behave the way they do (hint: it’s mostly about performance, and freedom).