search
HomeBackend DevelopmentGolangHow golang handles big data
How golang handles big dataDec 27, 2019 am 11:09 AM
golang

How golang handles big data

Golang has proven to be very suitable for concurrent programming. Goroutine is more readable, elegant and efficient than asynchronous programming. This article proposes a Pipeline execution model suitable for implementation by Golang, which is suitable for batch processing of large amounts of data (ETL).

imagined such application scenarios: (Recommended Learning: Go )

## Load user reviews from Database A (Cassandra) (huge quantity, for example, for example 1 billion); associate user information from database B (MySQL) based on the user ID of each comment; call the NLP service (natural language processing) to process each comment; write the processing results to database C (ElasticSearch).

Due to various problems encountered in the application, these requirements are summarized:

Requirement 1: Data should be processed in batches, for example, 100 items per batch are specified. When a problem occurs (such as any database failure), it will be interrupted, and checkpoint will be used to resume from the interruption the next time the program starts.
Requirement 2: Set a reasonable number of concurrencies for each process, so that the database and NLP services have a reasonable load (without affecting other businesses, occupy as many resources as possible to improve ETL performance). For example, steps (1)-(4) set the concurrency numbers to 1, 4, 8, and 2 respectively.

This is a typical Pipeline execution model. Think of each batch of data (for example, 100 items) as a product on the assembly line. The 4 steps correspond to the 4 processing procedures on the assembly line. After each process is completed, the semi-finished product is handed over to the next process. The number of products that can be processed simultaneously in each process varies.

You may first think of enabling 1 4 8 2 goroutines and using channels to transfer data. I have done this before, and the conclusion is that doing this will make programmers crazy: the process concurrency control code is very complicated, especially when you have to deal with exceptions, execution times exceeding expectations, controllable interruptions, etc., you have to add a bunch of channels, Until you don't even remember the use.

Reusable Pipeline module

In order to complete the ETL work more efficiently, I abstracted Pipeline into modules. I'll paste the code first and then analyze the meaning. The module can be used directly, and the main interfaces used are: NewPipeline, Async, and Wait.

Using this Pipeline component, our ETL program will be simple, efficient, and reliable, freeing programmers from cumbersome concurrent process control:

package main
 
import "log"
 
func main() {
    //恢复上次执行的checkpoint,如果是第一次执行就获取一个初始值。
    checkpoint := loadCheckpoint()
    
    //工序(1)在pipeline外执行,最后一个工序是保存checkpoint
    pipeline := NewPipeline(4, 8, 2, 1) 
    for {
        //(1)
        //加载100条数据,并修改变量checkpoint
        //data是数组,每个元素是一条评论,之后的联表、NLP都直接修改data里的每条记录。
        data, err := extractReviewsFromA(&checkpoint, 100) 
        if err != nil {
            log.Print(err)
            break
        }
        
        //这里有个Golang著名的坑。
        //“checkpoint”是循环体外的变量,它在内存中只有一个实例并在循环中不断被修改,所以不能在异步中使用它。
        //这里创建一个副本curCheckpoint,储存本次循环的checkpoint。
        curCheckpoint := checkpoint
        
        ok := pipeline.Async(func() error {
            //(2)
            return joinUserFromB(data)
        }, func() error {
            //(3)
            return nlp(data)
        }, func() error {
            //(4)
            return loadDataToC(data)
        }, func() error {
            //(5)保存checkpoint
            log.Print("done:", curCheckpoint)
            return saveCheckpoint(curCheckpoint)
        })
        if !ok { break }
        
        if len(data) < 100 { break } //处理完毕
    }
    err := pipeline.Wait()
    if err != nil { log.Print(err) }
}

The above is the detailed content of How golang handles big data. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
go语言有没有缩进go语言有没有缩进Dec 01, 2022 pm 06:54 PM

go语言有缩进。在go语言中,缩进直接使用gofmt工具格式化即可(gofmt使用tab进行缩进);gofmt工具会以标准样式的缩进和垂直对齐方式对源代码进行格式化,甚至必要情况下注释也会重新格式化。

go语言为什么叫gogo语言为什么叫goNov 28, 2022 pm 06:19 PM

go语言叫go的原因:想表达这门语言的运行速度、开发速度、学习速度(develop)都像gopher一样快。gopher是一种生活在加拿大的小动物,go的吉祥物就是这个小动物,它的中文名叫做囊地鼠,它们最大的特点就是挖洞速度特别快,当然可能不止是挖洞啦。

聊聊Golang中的几种常用基本数据类型聊聊Golang中的几种常用基本数据类型Jun 30, 2022 am 11:34 AM

本篇文章带大家了解一下golang 的几种常用的基本数据类型,如整型,浮点型,字符,字符串,布尔型等,并介绍了一些常用的类型转换操作。

一文详解Go中的并发【20 张动图演示】一文详解Go中的并发【20 张动图演示】Sep 08, 2022 am 10:48 AM

Go语言中各种并发模式看起来是怎样的?下面本篇文章就通过20 张动图为你演示 Go 并发,希望对大家有所帮助!

tidb是go语言么tidb是go语言么Dec 02, 2022 pm 06:24 PM

是,TiDB采用go语言编写。TiDB是一个分布式NewSQL数据库;它支持水平弹性扩展、ACID事务、标准SQL、MySQL语法和MySQL协议,具有数据强一致的高可用特性。TiDB架构中的PD储存了集群的元信息,如key在哪个TiKV节点;PD还负责集群的负载均衡以及数据分片等。PD通过内嵌etcd来支持数据分布和容错;PD采用go语言编写。

聊聊Golang自带的HttpClient超时机制聊聊Golang自带的HttpClient超时机制Nov 18, 2022 pm 08:25 PM

​在写 Go 的过程中经常对比这两种语言的特性,踩了不少坑,也发现了不少有意思的地方,下面本篇就来聊聊 Go 自带的 HttpClient 的超时机制,希望对大家有所帮助。

go语言是否需要编译go语言是否需要编译Dec 01, 2022 pm 07:06 PM

go语言需要编译。Go语言是编译型的静态语言,是一门需要编译才能运行的编程语言,也就说Go语言程序在运行之前需要通过编译器生成二进制机器码(二进制的可执行文件),随后二进制文件才能在目标机器上运行。

golang map怎么删除元素golang map怎么删除元素Dec 08, 2022 pm 06:26 PM

删除map元素的两种方法:1、使用delete()函数从map中删除指定键值对,语法“delete(map, 键名)”;2、重新创建一个新的map对象,可以清空map中的所有元素,语法“var mapname map[keytype]valuetype”。

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

Repo: How To Revive Teammates
1 months agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island Adventure: How To Get Giant Seeds
1 months agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Dreamweaver Mac version

Dreamweaver Mac version

Visual web development tools

MantisBT

MantisBT

Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SAP NetWeaver Server Adapter for Eclipse

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)