浅谈errgroup的使用以及源码分析-526互联

本文讲解的是golang.org/x/sync这个包中的errgroup

1、errgroup 的基础介绍

学习过 Go 的朋友都知道 Go 实现并发编程是比较容易的事情，只需要使用go关键字就可以开启一个 goroutine。那对于并发场景中，如何实现goroutine的协调控制呢？常见的一种方式是使用sync.WaitGroup 来进行协调控制。

使用过sync.WaitGroup 的朋友知道，sync.WaitGroup 虽然可以实现协调控制，但是不能传递错误，那该如何解决呢？聪明的你可能马上想到使用 chan 或者是 context来传递错误，确实是可以的。那接下来，我们一起看看官方是怎么实现上面的需求的呢？

1.1 errgroup的安装

安装命令：

go get golang.org/x/sync

//下面的案例是基于v0.1.0 演示的
go get golang.org/x/sync@v0.1.0

1.2 errgroup的基础例子

这里我们需要请求3个url来获取数据，假设请求url2时报错，url3耗时比较久，需要等一秒。

package main

import (
	"errors"
	"fmt"
	"golang.org/x/sync/errgroup"
	"strings"
	"time"
)

func main()  {
	queryUrls := map[string]string{
		"url1": "http://localhost/url1",
		"url2": "http://localhost/url2",
		"url3": "http://localhost/url3",
	}

	var eg errgroup.Group
	var results []string

	for _, url := range queryUrls {
		url := url
		eg.Go(func() error {
			result, err := query(url)
			if err != nil {
				return err
			}
			results = append(results, fmt.Sprintf("url:%s -- ret: %v", url, result))
			return nil
		})
	}
	
  // group 的wait方法，等待上面的 eg.Go 的协程执行完成，并且可以接受错误
	err := eg.Wait()
	if err != nil {
		fmt.Println("eg.Wait error:", err)
		return
	}

	for k, v := range results {
		fmt.Printf("%v ---> %v\n", k, v)
	}
}

func query(url string) (ret string, err error) {
	// 假设这里是发送请求，获取数据
	if strings.Contains(url, "url2") {
		// 假设请求 url2 时出现错误
		fmt.Printf("请求 %s 中....\n", url)
		return "", errors.New("请求超时")
	} else if strings.Contains(url, "url3") {
		// 假设 请求 url3 需要1秒
		time.Sleep(time.Second*1)
	}
	fmt.Printf("请求 %s 中....\n", url)
	return "success", nil
}

执行结果：

请求 http://localhost/url2 中....
请求 http://localhost/url1 中....
请求 http://localhost/url3 中....
eg.Wait error: 请求超时

果然，当其中一个goroutine出现错误时，会把goroutine中的错误传递出来。

我们自己运行一下上面的代码就会发现这样一个问题，请求 url2 出错了，但是依旧在请求 url3 。因为我们需要聚合 url1、url2、url3 的结果，所以当其中一个出现问题时，我们是可以做一个优化的，就是当其中一个出现错误时，取消还在执行的任务，直接返回结果，不用等待任务执行结果。

那应该如何做呢？

这里假设 url1 执行1秒，url2 执行报错，url3执行3秒。所以当url2报错后，就不用等url3执行结束就可以返回了。

package main

import (
	"context"
	"errors"
	"fmt"
	"golang.org/x/sync/errgroup"
	"strings"
	"time"
)

func main()  {
	queryUrls := map[string]string{
		"url1": "http://localhost/url1",
		"url2": "http://localhost/url2",
		"url3": "http://localhost/url3",
	}

	var results []string
	ctx, cancel := context.WithCancel(context.Background())
	eg, errCtx := errgroup.WithContext(ctx)

	for _, url := range queryUrls {
		url := url
		eg.Go(func() error {
			result, err := query(errCtx, url)
			if err != nil {
        //其实这里不用手动取消，看完源码就知道为啥了
				cancel()
				return err
			}
			results = append(results, fmt.Sprintf("url:%s -- ret: %v", url, result))
			return nil
		})
	}

	err := eg.Wait()
	if err != nil {
		fmt.Println("eg.Wait error:", err)
		return
	}

	for k, v := range results {
		fmt.Printf("%v ---> %v\n", k, v)
	}
}


func query(errCtx context.Context, url string) (ret string, err error) {
	fmt.Printf("请求 %s 开始....\n", url)
	// 假设这里是发送请求，获取数据
	if strings.Contains(url, "url2") {
		// 假设请求 url2 时出现错误
		time.Sleep(time.Second*2)
		return "", errors.New("请求出错")


	} else if strings.Contains(url, "url3") {
		// 假设 请求 url3 需要1秒
		select {
		case <- errCtx.Done():
			ret, err = "", errors.New("请求3被取消")
			return
		case <- time.After(time.Second*3):
			fmt.Printf("请求 %s 结束....\n", url)
			return "success3", nil
		}
	} else {
		select {
		case <- errCtx.Done():
			ret, err = "", errors.New("请求1被取消")
			return
		case <- time.After(time.Second):
			fmt.Printf("请求 %s 结束....\n", url)
			return "success1", nil
		}
	}

}

执行结果：

请求 http://localhost/url2 开始....
请求 http://localhost/url3 开始....
请求 http://localhost/url1 开始....
请求 http://localhost/url1 结束....
eg.Wait error: 请求出错

2、errgroup源码分析

看了上面的例子，我们对errgroup有了一定了解，接下来，我们一起看看errgroup做了那些封装。

2.1 errgroup.Group

errgroup.Group源码如下：

// A Group is a collection of goroutines working on subtasks that are part of
// the same overall task.
//
// A zero Group is valid, has no limit on the number of active goroutines,
// and does not cancel on error.
type Group struct {
  // context 的 cancel 方法
	cancel func()

	wg sync.WaitGroup
	
  //传递信号的通道，这里主要是用于控制并发创建 goroutine 的数量
  //通过 SetLimit 设置过后，同时创建的goroutine 最大数量为n
	sem chan token
	
  // 保证只接受一次错误
	errOnce sync.Once
  // 最先返回的错误
	err     error
}

看结构体中的内容，发现比原生的sync.WaitGroup多了下面的内容：

cancel func()
sem chan token
errOnce sync.Once
err error

2.2 WithContext 方法

// WithContext returns a new Group and an associated Context derived from ctx.
//
// The derived Context is canceled the first time a function passed to Go
// returns a non-nil error or the first time Wait returns, whichever occurs
// first.
func WithContext(ctx context.Context) (*Group, context.Context) {
	ctx, cancel := context.WithCancel(ctx)
	return &Group{cancel: cancel}, ctx
}

方法逻辑还是比较简单的，主要做了两件事：

使用context的WithCancel()方法创建一个可取消的Context
将context.WithCancel(ctx)创建的 cancel赋值给 Group中的cancel

2.3 Go

1.2 最后一个例子说，不用手动去执行 cancel 的原因就在这里。

g.cancel() //这里就是为啥不用手动执行 cancel的原因

// Go calls the given function in a new goroutine.
// It blocks until the new goroutine can be added without the number of
// active goroutines in the group exceeding the configured limit.
//
// The first call to return a non-nil error cancels the group's context, if the
// group was created by calling WithContext. The error will be returned by Wait.
func (g *Group) Go(f func() error) {
	if g.sem != nil {
    //往 sem 通道中发送空结构体，控制并发创建 goroutine 的数量
		g.sem <- token{}
	}

	g.wg.Add(1)
	go func() {
    // done()函数的逻辑就是当 f 执行完后，从 sem 取一条数据，并且 g.wg.Done()
		defer g.done()

		if err := f(); err != nil {
			g.errOnce.Do(func() { // 这里就是确保 g.err 只被赋值一次
				g.err = err
				if g.cancel != nil {
					g.cancel() //这里就是为啥不用手动执行 cancel的原因
				}
			})
		}
	}()
}

2.4 TryGo

看注释，知道此函数的逻辑是：当正在执行的goroutine数量小于通过SetLimit()设置的数量时，可以启动成功，返回 true，否则启动失败，返回false。

// TryGo calls the given function in a new goroutine only if the number of
// active goroutines in the group is currently below the configured limit.
//
// The return value reports whether the goroutine was started.
func (g *Group) TryGo(f func() error) bool {
	if g.sem != nil {
		select {
		case g.sem <- token{}: // 当g.sem的缓冲区满了过后，就会执行default，也代表着未启动成功
			// Note: this allows barging iff channels in general allow barging.
		default:
			return false
		}
	}
  
  //----主要看上面的逻辑，下面的逻辑和Go中的一样-------

	g.wg.Add(1)
	go func() {
		defer g.done()

		if err := f(); err != nil {
			g.errOnce.Do(func() {
				g.err = err
				if g.cancel != nil {
					g.cancel()
				}
			})
		}
	}()
	return true
}

2.5 Wait

代码逻辑很简单，这里主要注意这里：

//我看这里的时候，有点疑惑，为啥这里会去调用 cancel()方法呢？
//这里是为了代码的健壮性，用 context.WithCancel() 创建得到的 cancel，在代码执行完毕之前取消是一个好习惯
g.cancel()

// Wait blocks until all function calls from the Go method have returned, then
// returns the first non-nil error (if any) from them.
func (g *Group) Wait() error {
  g.wg.Wait() //通过 g.wg.Wait() 阻塞等待所有的 goroutine 执行完
	if g.cancel != nil {
    //我看这里的时候，有点疑惑，为啥这里会去调用 cancel()方法呢？
    //这里是为了代码的健壮性，用 context.WithCancel() 创建得到的 cancel，在代码执行完毕之前取消是一个好习惯
 		g.cancel()
	}
	return g.err
}

2.6 SetLimit

看代码的注释，我们知道：SetLimit的逻辑主要是限制同时执行的 goroutines 的数量为n，当n小于0时，没有限制。如果有运行的 goroutine，调用此方法会报错。

// SetLimit limits the number of active goroutines in this group to at most n.
// A negative value indicates no limit.
//
// Any subsequent call to the Go method will block until it can add an active
// goroutine without exceeding the configured limit.
//
// The limit must not be modified while any goroutines in the group are active.
func (g *Group) SetLimit(n int) {
	if n < 0 {
		g.sem = nil
		return
	}
	if len(g.sem) != 0 {
		panic(fmt.Errorf("errgroup: modify limit while %v goroutines in the group are still active", len(g.sem)))
	}
	g.sem = make(chan token, n)
}

3、errgroup 容易忽视的坑

这个坑是看别人的记录看到的，对errgroup不太熟悉时，是不小心确实容易掉进去，所以摘抄了过来，如果侵权，请联系删除，谢谢！

原文链接：并发编程包之 errgroup

需求:

开启多个Goroutine去缓存中设置数据，同时开启一个Goroutine去异步写日志，很快我的代码就写出来了：

package main

import (
	"context"
	"errors"
	"fmt"
	"golang.org/x/sync/errgroup"
	"time"
)

func main()  {
	g, ctx := errgroup.WithContext(context.Background())

	// 单独开一个协程去做其他的事情，不参与waitGroup
	go WriteChangeLog(ctx)

	for i:=0 ; i< 3; i++{
		g.Go(func() error {
			return errors.New("访问redis失败\n")
		})
	}
	if err := g.Wait();err != nil{
		fmt.Printf("appear error and err is %s",err.Error())
	}
	time.Sleep(1 * time.Second)
}

func WriteChangeLog(ctx context.Context) error {
	select {
	case <- ctx.Done():
		return nil
	case <- time.After(time.Millisecond * 50):
		fmt.Println("write changelog")
	}
	return nil
}

结果：

appear error and err is 访问redis失败

代码看着没有问题，但是日志一直没有写入。这是为什么呢？

其实原因就是因为这个ctx是errgroup.WithContext方法返回的一个带取消的ctx，我们把这个ctx当作父context传入WriteChangeLog方法中了，如果errGroup取消了，也会导致上下文的context都取消了，所以WriteChangelog方法就一直执行不到。

这个点是我们在日常开发中想不到的，所以需要注意一下～。

解决方法：

解决方法就是在 go WriteChangeLog(context.Background()) 传入新的ctx

参考资料：

八. Go并发编程--errGroup

并发编程包之 errgroup

上面这个案例中讲了一个容易忽视的坑，大家可以看看

源码

errgroup

源码errgroup

go-errgroup源码errgroup案例