Apache Beam is an open source distributed data processing framework that provides a unified programming model that can run on different batch and stream processing engines. Recently, a very useful feature was added to Apache Beam's Go SDK - selecting the first N rows from a PCollection. This feature is very helpful for scenarios where large data sets need to be sampled or quickly previewed. In this article, we'll cover how to use this feature in Apache Beam's Go SDK and show some practical example code. let's start!
Question content
I have a pcollection from which I need to select the n largest rows. I'm trying to create a dataflow pipeline using go and am stuck.
package main import ( "context" "flag" "fmt" "github.com/apache/beam/sdks/v2/go/pkg/beam" "github.com/apache/beam/sdks/v2/go/pkg/beam/log" "github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx" ) type user struct { name string age int } func printrow(ctx context.context, list user) { fmt.println(list) } func main() { flag.parse() beam.init() ctx := context.background() p := beam.newpipeline() s := p.root() var userlist = []user{ {"bob", 5}, {"adam", 8}, {"john", 3}, {"ben", 1}, {"jose", 1}, {"bryan", 1}, {"kim", 1}, {"tim", 1}, } initial := beam.createlist(s, userlist) pc2 := beam.pardo(s, func(row user, emit func(user)) { emit(row) }, initial) beam.pardo0(s, printrow, pc2) if err := beamx.run(ctx, p); err != nil { log.exitf(ctx, "failed to execute job: %v", err) } }
From the above code, I need to select the first 5 rows based on user.age I found the link at the top of the package which has the same functionality but it says it returns a single element pcollection. what is the difference?
package main import ( "context" "flag" "fmt" "github.com/apache/beam/sdks/v2/go/pkg/beam" "github.com/apache/beam/sdks/v2/go/pkg/beam/log" "github.com/apache/beam/sdks/v2/go/pkg/beam/transforms/top" "github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx" ) func init() { beam.RegisterFunction(less) } type User struct { Name string Age int } func printRow(ctx context.Context, list User) { fmt.Println(list) } func less(a, b User) bool { return a.Age < b.Age } func main() { flag.Parse() beam.Init() ctx := context.Background() p := beam.NewPipeline() s := p.Root() var userList = []User{ {"Bob", 5}, {"Adam", 8}, {"John", 3}, {"Ben", 1}, {"Jose", 1}, {"Bryan", 1}, {"Kim", 1}, {"Tim", 1}, } initial := beam.CreateList(s, userList) best := top.Largest(s, initial, 5, less) pc2 := beam.ParDo(s, func(row User, emit func(User)) { emit(row) }, best) beam.ParDo0(s, printRow, pc2) if err := beamx.Run(ctx, p); err != nil { log.Exitf(ctx, "Failed to execute job: %v", err) } }
I added the function to select the first 5 rows like above, but I got the error []main.user is not allocate to main.user
I need the pcollection in the same format as before because I need to process it further. I suspect this is because the top.largest function returns a single element pcollection. Any ideas on how to convert the format?
Solution
The best pcollection is []user
So give it a try...
pc2 := beam.ParDo(s, func(rows []User, emit func(User)) { for _, row := range rows { emit(row) } }, best)
The above is the detailed content of Apache Beam select top N rows from PCollection in Go. For more information, please follow other related articles on the PHP Chinese website!

This article explains Go's package import mechanisms: named imports (e.g., import "fmt") and blank imports (e.g., import _ "fmt"). Named imports make package contents accessible, while blank imports only execute t

This article explains Beego's NewFlash() function for inter-page data transfer in web applications. It focuses on using NewFlash() to display temporary messages (success, error, warning) between controllers, leveraging the session mechanism. Limita

This article details efficient conversion of MySQL query results into Go struct slices. It emphasizes using database/sql's Scan method for optimal performance, avoiding manual parsing. Best practices for struct field mapping using db tags and robus

This article demonstrates creating mocks and stubs in Go for unit testing. It emphasizes using interfaces, provides examples of mock implementations, and discusses best practices like keeping mocks focused and using assertion libraries. The articl

This article explores Go's custom type constraints for generics. It details how interfaces define minimum type requirements for generic functions, improving type safety and code reusability. The article also discusses limitations and best practices

This article details efficient file writing in Go, comparing os.WriteFile (suitable for small files) with os.OpenFile and buffered writes (optimal for large files). It emphasizes robust error handling, using defer, and checking for specific errors.

The article discusses writing unit tests in Go, covering best practices, mocking techniques, and tools for efficient test management.

This article explores using tracing tools to analyze Go application execution flow. It discusses manual and automatic instrumentation techniques, comparing tools like Jaeger, Zipkin, and OpenTelemetry, and highlighting effective data visualization


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

PhpStorm Mac version
The latest (2018.2.1) professional PHP integrated development tool

Atom editor mac version download
The most popular open source editor

ZendStudio 13.5.1 Mac
Powerful PHP integrated development environment

SAP NetWeaver Server Adapter for Eclipse
Integrate Eclipse with SAP NetWeaver application server.

EditPlus Chinese cracked version
Small size, syntax highlighting, does not support code prompt function
