What Python developers need to know before migrating to Go

高洛峰
2016-10-20 09:33:08

This is a (long) post documenting our experience migrating a large body of Python/Cython code to Go. If you want the whole story, background and all, read on. If you are only interested in what Python developers need to know before jumping in, skip ahead to "Things we should have known".

Background


Our biggest technical achievement at Repustate is Arabic sentiment analysis. Arabic is a tough nut to crack because its morphology is so complex. Tokenization (breaking a sentence into individual words) is harder in Arabic than in, say, English, because Arabic words may contain spaces within them (for example, at the position of an aleph). It is no secret that Repustate uses support vector machines (SVMs) to find the most likely meaning of a sentence and then derives the sentiment from it. We use 22 models (22 SVMs) in total, and every word in the document is analyzed. In other words, a 500-word document involves more than 10,000 SVM comparisons (22 × 500 = 11,000).

Python


Repustate is implemented almost entirely in Python, because we use Django for both the API and the website. To keep the codebase uniform, it made sense to implement the whole Arabic sentiment engine in Python as well. For prototyping and implementation, Python is very good: it is highly expressive and has a powerful ecosystem of third-party libraries. If you are just serving web pages, it is perfect. However, when you need low-level computation and a large number of comparisons on hash tables (dictionaries in Python), things slow down. We could only process 2 to 3 Arabic documents per second, which is far too slow. Compare this with our English sentiment engine, which processes 500 documents per second.

Bottleneck


So we fired up the Python profiler to find out which parts were slow. Remember the 22 SVMs we run on every word? They were all executed serially, with nothing in parallel. Our first idea was to turn this into a map/reduce-like operation. Long story short: map/reduce is not a good fit in Python, and Python is not pleasant to use when you need concurrency. At PyCon 2013, Guido talked about Tulip, his new project aimed at this problem, but it would be a while before it was released. And if a better option already exists, why wait?

Switch to Go, or go home


My friends at Mozilla told me that most of the code for the logging infrastructure in Mozilla's services had been switched to Go, partly because of the power of goroutines (Go's lightweight threads). Go was designed by a group of people at Google who treated parallelism as a first-class concept rather than an afterthought, as it is in Python's various solutions. So we started porting our Python code to Go.

Although the Go code is not yet in production, the results are already very encouraging. We hit 1,000 documents per second, used less memory, and didn't have to deal with the annoying multiprocessing/gevent/"why did Ctrl+C kill my process" code that comes with Python.
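
The map/reduce idea described above (score one word against many models at once, then combine the results) could be sketched roughly as follows. This is only an illustration: the Model type, the ScoreWord function, and the toy scoring functions are hypothetical stand-ins, not Repustate's actual SVM code.

```go
package main

import (
	"fmt"
	"sync"
)

// Model is a hypothetical stand-in for one trained classifier.
// The real SVM models are not shown in this post.
type Model func(word string) float64

// ScoreWord runs every model on the word concurrently (the "map" step)
// and then sums the scores (the "reduce" step).
func ScoreWord(models []Model, word string) float64 {
	scores := make([]float64, len(models))
	var wg sync.WaitGroup
	for i, m := range models {
		wg.Add(1)
		go func(i int, m Model) {
			defer wg.Done()
			scores[i] = m(word) // each goroutine writes only its own slot
		}(i, m)
	}
	wg.Wait() // block until all goroutines have finished

	total := 0.0
	for _, s := range scores {
		total += s
	}
	return total
}

func main() {
	// 22 toy models, matching the model count mentioned in the post.
	models := make([]Model, 22)
	for i := range models {
		weight := float64(i + 1)
		models[i] = func(word string) float64 {
			return weight * float64(len(word))
		}
	}
	fmt.Println(ScoreWord(models, "go"))
}
```

Writing each result into its own slice slot avoids the need for a mutex. A channel would work just as well for collecting results.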

Why we fell in love with Go


Anyone who knows a little about how programming languages work (interpreted versus compiled, dynamic versus static typing) will say, "Well, obviously Go is faster." Yes, we could have rewritten the whole thing in Java and seen similar performance, but that is not why Go wins. Code written in Go just tends to come out correct. I can't explain why, but once the code compiles (and it compiles very quickly), you feel that it will work: not merely run without errors, but be logically correct. I know this sounds odd, but it's true. Go is much like Python in its lack of verbosity, and it treats functions as first-class objects, so functional programming comes easily. Goroutines and channels make your life so much easier. You also get the performance benefits of static typing and finer control over memory allocation, without losing expressiveness.

Things we should have known


Despite all the praise, Go requires a different mindset than Python. Here are some notes from the migration, in the order they came to mind while converting Python to Go:

There is no built-in set type (you have to use a map and test for membership)

Since there is no set type, you have to implement intersection, union and similar operations yourself
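
A minimal sketch of the map-based workaround, assuming string elements. The Set type and its methods are hypothetical names chosen for this example, not part of the standard library.

```go
package main

import "fmt"

// Set is a hypothetical helper type: a set of strings backed by a map.
// The struct{} values take no space; only the keys matter.
type Set map[string]struct{}

// NewSet builds a set from the given items.
func NewSet(items ...string) Set {
	s := make(Set, len(items))
	for _, it := range items {
		s[it] = struct{}{}
	}
	return s
}

// Has reports whether item is in the set (the "check the existence" step).
func (s Set) Has(item string) bool {
	_, ok := s[item]
	return ok
}

// Union returns a new set with the elements of both sets.
func (s Set) Union(other Set) Set {
	out := make(Set, len(s)+len(other))
	for k := range s {
		out[k] = struct{}{}
	}
	for k := range other {
		out[k] = struct{}{}
	}
	return out
}

// Intersect returns a new set with the elements common to both sets.
func (s Set) Intersect(other Set) Set {
	out := make(Set)
	for k := range s {
		if other.Has(k) {
			out[k] = struct{}{}
		}
	}
	return out
}

func main() {
	a := NewSet("x", "y")
	b := NewSet("y", "z")
	fmt.Println(len(a.Union(b)), len(a.Intersect(b)), a.Intersect(b).Has("y"))
}
```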

There are no tuples; you have to define your own struct or use a slice (similar to an array)

There is no method like __getattr__(); you have to check for existence yourself and cannot supply a default value. In Python, for example, you can write: value = dict.get("a_key", "default_value")
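
The existence check mentioned above is usually written with Go's two-value map lookup. The getOr helper below is a hypothetical name for a small function that mimics dict.get for a map of strings.

```go
package main

import "fmt"

// getOr is a hypothetical helper that mimics Python's dict.get(key, default).
func getOr(m map[string]string, key, def string) string {
	// The second return value reports whether the key was present.
	if v, ok := m[key]; ok {
		return v
	}
	return def
}

func main() {
	m := map[string]string{"a_key": "stored"}
	fmt.Println(getOr(m, "a_key", "default_value"))
	fmt.Println(getOr(m, "missing", "default_value"))
}
```

Without the ok check, a lookup of a missing key silently returns the zero value of the value type (an empty string here).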

You need to check for errors (or at least ignore them explicitly)

You cannot have unused variables or packages, so you end up commenting out code from time to time

You keep switching between []byte and string. The regexp package works with []byte (which is mutable). That makes sense, but converting back and forth is still tedious
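
For what it is worth, the standard regexp package offers string variants of most methods next to the []byte ones, which can save some of the conversions. A small sketch (the numbersIn helper and the sample text are made up for this example):

```go
package main

import (
	"fmt"
	"regexp"
)

var digits = regexp.MustCompile(`[0-9]+`)

// numbersIn returns every run of digits in s, staying in string form.
func numbersIn(s string) []string {
	return digits.FindAllString(s, -1)
}

func main() {
	text := "22 models, 500 words"

	// String in, strings out: no conversion needed.
	fmt.Println(numbersIn(text))

	// The []byte variants exist too; conversion is explicit in both directions.
	raw := []byte(text)
	first := digits.Find(raw) // returns []byte
	fmt.Println(string(first))
}
```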

Python is more forgiving. You can slice a string with out-of-range indexes without getting an error, and you can use negative indexes. Not so in Go.
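
If you miss Python's forgiving slices, one option is a small helper that clamps the indexes before slicing. The pySlice function below is a hypothetical sketch. It works on bytes, so for non-ASCII text such as Arabic you would first convert the string to []rune.

```go
package main

import "fmt"

// clamp limits v to the range [lo, hi].
func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// pySlice is a hypothetical helper that mimics Python's s[start:end] on bytes:
// negative indexes count from the end and out-of-range values are clamped.
func pySlice(s string, start, end int) string {
	n := len(s)
	if start < 0 {
		start += n
	}
	if end < 0 {
		end += n
	}
	start = clamp(start, 0, n)
	end = clamp(end, 0, n)
	if start >= end {
		return ""
	}
	return s[start:end]
}

func main() {
	fmt.Println(pySlice("hello", 1, 100)) // out-of-range end is clamped
	fmt.Println(pySlice("hello", -3, -1)) // negative indexes count from the end
}
```

A plain "hello"[1:100] in Go panics at run time (or fails to compile when the indexes are constants).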

You cannot have mixed-type data structures. It may not be good practice, but in Python I sometimes have a dictionary whose values are a mix of strings and lists. Not in Go: you have to clean up your data structures or define custom structs
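
One possible workaround, not necessarily what we ended up doing, is a map with interface{} values and a type switch to recover the concrete type. The describe function and the sample keys are hypothetical. A custom struct with typed fields is usually the cleaner choice.

```go
package main

import "fmt"

// describe inspects a value whose concrete type is only known at run time.
func describe(v interface{}) string {
	switch t := v.(type) {
	case string:
		return "string " + t
	case []string:
		return fmt.Sprintf("list of %d strings", len(t))
	default:
		return fmt.Sprintf("unsupported type %T", v)
	}
}

func main() {
	// Values of mixed types, as in the Python dictionary described above.
	doc := map[string]interface{}{
		"title": "hello",
		"tags":  []string{"go", "python"},
	}
	fmt.Println(describe(doc["title"]))
	fmt.Println(describe(doc["tags"]))
}
```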

There is no way to unpack a tuple or list into separate variables (e.g., x, y, z = [1, 2, 3])
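
Go cannot unpack a slice the way Python unpacks a list, but it does allow assigning several values at once and returning several values from a function, which covers part of the same ground. A small sketch (minMax is a made-up example function):

```go
package main

import "fmt"

// minMax returns two values at once. It assumes nums is non-empty.
func minMax(nums []int) (int, int) {
	lo, hi := nums[0], nums[0]
	for _, n := range nums[1:] {
		if n < lo {
			lo = n
		}
		if n > hi {
			hi = n
		}
	}
	return lo, hi
}

func main() {
	x, y, z := 1, 2, 3 // several values at once, but not from a list literal
	x, y = y, x        // swapping works the same way
	fmt.Println(x, y, z)

	nums := []int{4, 1, 7}
	a, b := nums[0], nums[1] // slice elements must be indexed one by one
	lo, hi := minMax(nums)   // multiple return values unpack directly
	fmt.Println(a, b, lo, hi)
}
```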

CamelCase naming convention (functions and structs whose names do not start with a capital letter are not exported to other packages). I prefer Python's lowercase_with_underscores convention.

You have to check explicitly whether an error is nil, unlike Python, where many types can be used like Booleans (0, the empty string, and None all evaluate as false)
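
The explicit checks look roughly like this. The parsePort function is a hypothetical example. It shows a comparison against the empty string (no implicit truthiness), the usual err != nil check, and an error that is ignored on purpose with the blank identifier.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// parsePort is a hypothetical example function that can fail in two ways.
func parsePort(s string) (int, error) {
	if s == "" { // an empty string is not "false"; compare explicitly
		return 0, errors.New("empty port")
	}
	n, err := strconv.Atoi(s)
	if err != nil { // errors are ordinary values and must be compared to nil
		return 0, fmt.Errorf("invalid port %q: %w", s, err)
	}
	return n, nil
}

func main() {
	port, err := parsePort("8080")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(port)

	// Ignoring an error has to be spelled out with the blank identifier.
	fallback, _ := parsePort("not-a-number")
	fmt.Println(fallback)
}
```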

Documentation for some modules (such as crypto/md5) is lacking, but the go-nuts channel on IRC is excellent and very supportive.

Converting a number to a string (int64 -> string) is different from converting []byte to a string (just string([]byte)); you need to call strconv
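
A short illustration of the two conversions side by side. The label function and the sample values are made up for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// label joins a []byte name and an int64 count into one string.
func label(name []byte, count int64) string {
	// []byte converts with a plain conversion; int64 needs strconv.
	return string(name) + "=" + strconv.FormatInt(count, 10)
}

func main() {
	name := []byte("docs")
	var count int64 = 1000

	fmt.Println(label(name, count))

	// fmt.Sprintf produces the same text in one call.
	s := fmt.Sprintf("%s=%d", name, count)
	fmt.Println(s)
}
```

Note that string(count) on an integer does not produce its decimal digits. It interprets the number as a Unicode code point, which is why strconv is needed.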

Reading Go code definitely feels like reading a programming language, whereas Python can be written almost like pseudocode. Go uses more non-alphanumeric characters, such as || and && instead of or and and.

For writing files there are both File.Write([]byte) and File.WriteString(string), which runs against the Python developer's credo that there should be one obvious way to do it.

String interpolation is awkward; you often have to use fmt.Sprintf

There are no constructors; the usual idiom is to write a NewType() function that returns the struct you want
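
The NewType() idiom might look like this. The Document type and its fields are hypothetical, and the whitespace split is only a placeholder for real tokenization (which, as noted earlier, is harder for Arabic).

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a hypothetical type used only for this example.
type Document struct {
	Text   string
	Tokens []string
}

// NewDocument plays the role of a constructor: it fills in derived fields
// so callers always get a fully initialized value.
func NewDocument(text string) *Document {
	return &Document{
		Text:   text,
		Tokens: strings.Fields(text), // placeholder tokenizer: split on whitespace
	}
}

func main() {
	d := NewDocument("go is fun")
	fmt.Println(len(d.Tokens), d.Tokens[0])
}
```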

else (or else if) has to be formatted just so: it must be on the same line as the closing brace of the if. Odd.

Different assignment operators are used inside and outside functions, = and := (Translator's note: the author is mistaken here. := is a short variable declaration that declares and initializes a variable with an inferred type, and it is only allowed inside functions; = assigns to a variable that already exists. At package level, variables are declared with var.)
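
A small sketch showing var, := and = together, along with the else placement from the previous item. The threshold variable and the classify function are made up for this example.

```go
package main

import "fmt"

// At package level only var (or const) declarations are allowed; := is not.
var threshold = 10

func classify(n int) string {
	label := "small" // := declares and initializes a new variable
	if n > threshold {
		label = "large" // = assigns to the existing variable
	} else if n == threshold { // else must follow the closing brace on the same line
		label = "edge"
	}
	return label
}

func main() {
	fmt.Println(classify(5), classify(10), classify(11))
}
```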

If I just want a list of keys (dict.keys()), values (dict.values()), or tuples (dict.items()), there is no equivalent function in Go; you have to iterate yourself
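
Iterating by hand is short enough. The sketch below collects the keys and values of a map[string]int; the helper names are hypothetical. The keys are sorted because Go's map iteration order is not defined.

```go
package main

import (
	"fmt"
	"sort"
)

// keys returns the map's keys in sorted order.
func keys(m map[string]int) []string {
	out := make([]string, 0, len(m))
	for k := range m {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

// values returns the map's values, ordered by their sorted keys.
func values(m map[string]int) []int {
	out := make([]int, 0, len(m))
	for _, k := range keys(m) {
		out = append(out, m[k])
	}
	return out
}

func main() {
	counts := map[string]int{"b": 2, "a": 1}
	fmt.Println(keys(counts), values(counts))
}
```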

