
Dealing with encodings in Go

steloflute 2016. 1. 20. 23:30

http://dominik.honnef.co/posts/2012/04/dealing_with_encodings_in_go/



One poorly documented aspect of Go is how it handles string encodings, or rather, how strings are actually represented internally. This article tries to shed some light on the basics, to get you started.

It’s all UTF-8, right?

A source of confusion is the fact that Go source code has to be encoded in UTF-8, meaning that string literals, variable/function names, etc. must consist solely of valid UTF-8. This does not, however, mean that strings in Go may only contain valid UTF-8 data. For example, the following string literal is fine to use, even though it does not describe a valid UTF-8 string: "Hello, \x90\xA2\x8A\x45".
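To make this concrete, here is a minimal check (assuming fmt and unicode/utf8 are imported; utf8.ValidString is introduced further below): the literal compiles and runs fine, and only the validity check reveals the bad bytes.

s := "Hello, \x90\xA2\x8A\x45"   // legal string literal, but not valid UTF-8
fmt.Println(utf8.ValidString(s)) // => false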

In fact, the string datatype is effectively nothing more than an immutable byte array. This also means that using the index operator will return a specific byte, not a rune:

fmt.Printf("0x%x", "世界"[1]) // => 0xb8
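And because strings are immutable, assigning through the index operator does not even compile; to modify string data, convert it to a []byte first, which creates a mutable copy:

b := []byte("Hello")
b[0] = 'J'             // "Hello"[0] = 'J' would be a compile error
fmt.Println(string(b)) // => Jello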

Furthermore, the len() function will also return a string’s length in bytes, not runes:

fmt.Println(len("Hello, 世界")) // => 13

But isn’t Go aware of UTF-8?

Even though strings are mere byte arrays, Go does have proper support for UTF-8. For example, the range operator iterates over a string by yielding runes instead of bytes:

s := "Hello, 世界"
for _, r := range s {
  fmt.Printf("%c ", r) // => H e l l o ,   世 界
}
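Note that the index yielded by range is the byte offset at which each rune starts, not a sequential rune index:

for i, r := range "世界" {
  fmt.Printf("%d:%c ", i, r) // => 0:世 3:界
}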

And by importing the unicode/utf8 package, it is also possible to validate strings or, more importantly, get their rune count (which corresponds to the string’s length as we’d naturally interpret it):

s := "Hello, 世界"
fmt.Println(utf8.RuneCountInString(s)) // => 9
fmt.Println(utf8.ValidString(s))       // => true
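Alternatively, a string can be converted to a []rune, which copies the data but allows rune-based indexing; the utf8 package can also decode individual runes together with their byte width:

rs := []rune(s)
fmt.Println(len(rs))      // => 9
fmt.Printf("%c\n", rs[7]) // => 世

r, size := utf8.DecodeRuneInString("世界")
fmt.Printf("%c %d\n", r, size) // => 世 3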

And how do I deal with other encodings?

Even though Go has good support for UTF-8 (and minimal support for UTF-16), it has no built-in support for any other encoding. If you have to deal with other encodings (e.g. when processing user input), you have to use third-party packages, such as go-charset, which makes it quite easy to convert between different string encodings.

If you want to work with the input data any further, it is best to convert it to UTF-8 first and then use Go’s built-in features:

package main

import (
  "fmt"
  "io/ioutil"
  "strings"
  "unicode/utf8"

  "code.google.com/p/go-charset/charset"
  _ "code.google.com/p/go-charset/data" // include the conversion maps in the binary
)

func main() {
  s := "Hello, \x90\xA2\x8A\x45" // CP932-encoded version of "Hello, 世界"

  // Wrap the input in a reader that converts from CP932 to UTF-8.
  r, err := charset.NewReader("CP932", strings.NewReader(s))
  if err != nil {
    panic(err)
  }
  data, err := ioutil.ReadAll(r)
  if err != nil {
    panic(err)
  }
  s2 := string(data)
  fmt.Println(s2)                         // => Hello, 世界
  fmt.Println(len(s2))                    // => 13
  fmt.Println(utf8.RuneCountInString(s2)) // => 9
  fmt.Println(utf8.ValidString(s2))       // => true
}
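As an aside: the code.google.com import paths above predate the shutdown of Google Code hosting. The golang.org/x/text module (not part of the original article; an alternative suggested here) can perform the same conversion. A minimal sketch using its Shift JIS decoder, which covers the CP932 bytes used in this example:

package main

import (
  "fmt"
  "io/ioutil"
  "strings"
  "unicode/utf8"

  "golang.org/x/text/encoding/japanese"
  "golang.org/x/text/transform"
)

func main() {
  s := "Hello, \x90\xA2\x8A\x45" // the same CP932/Shift JIS bytes as above

  // transform.NewReader decodes from Shift JIS to UTF-8 while reading.
  r := transform.NewReader(strings.NewReader(s), japanese.ShiftJIS.NewDecoder())
  data, err := ioutil.ReadAll(r)
  if err != nil {
    panic(err)
  }
  s2 := string(data)
  fmt.Println(s2)                         // => Hello, 世界
  fmt.Println(utf8.RuneCountInString(s2)) // => 9
}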

Conclusion

Even though this article is rather short, I hope that it was able to convey the following main points:

  1. Strings in Go are mere byte arrays
  2. There is built-in functionality for dealing with UTF-8
  3. Any other encoding is best converted to UTF-8 first

For further reading, I recommend a discussion on the Go mailing list, which also sheds light on why strings behave the way they do (hint: it’s mostly about performance, and freedom).