Recursive goroutines: what is the easiest way to tell Go to stop reading from the channel?

I want to know an idiomatic way to solve this problem (which is currently causing a deadlock). The recursion happens an unknown number of times, so I cannot simply close the channel.

http://play.golang.org/p/avLf_sQJj_

I got it working by passing a pointer to a number and incrementing it, and I have looked into using sync.WaitGroup. But I didn't feel (and I may be mistaken) that I came up with an elegant solution. The Go examples I've seen tend to be simple, clever, and concise.

This is the last exercise in the Tour of Go, https://tour.golang.org/#73

Does anyone know how a Go programmer would handle this? Any help would be greatly appreciated. I'm trying to learn it the right way from the start.

2 answers

Instead of involving sync.WaitGroup, you can extend the result you send for a parsed URL to include the number of new URLs found. In your main loop you then keep reading results for as long as there is something left to collect.

In your case, the number of URLs found will equal the number of goroutines spawned, but it doesn't have to. I would personally spawn a more or less fixed number of fetching goroutines, so that you don't open too many HTTP requests (or at least so that you have control over it); a rough sketch of that idea follows the example below. Your main loop would not change, since it doesn't care how the fetching is carried out. The important fact here is that you need to send either a result or an error for each URL - I have modified the code here so that it does not spawn new goroutines when the depth is already 1.

A side effect of this approach is that you can easily print progress from your main loop.

Here is an example on the playground:

http://play.golang.org/p/BRlUc6bojf

package main

import (
    "fmt"
)

type Fetcher interface {
    // Fetch returns the body of URL and
    // a slice of URLs found on that page.
    Fetch(url string) (body string, urls []string, err error)
}

type Res struct {
    url   string
    body  string
    found int // Number of new urls found
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, ch chan Res, errs chan error, visited map[string]bool) {
    body, urls, err := fetcher.Fetch(url)
    visited[url] = true
    if err != nil {
        errs <- err
        return
    }

    newUrls := 0
    if depth > 1 {
        for _, u := range urls {
            if !visited[u] {
                newUrls++
                go Crawl(u, depth-1, fetcher, ch, errs, visited)
            }
        }
    }

    // Send the result along with number of urls to be fetched
    ch <- Res{url, body, newUrls}

    return
}

func main() {
    ch := make(chan Res)
    errs := make(chan error)
    visited := map[string]bool{}
    go Crawl("http://golang.org/", 4, fetcher, ch, errs, visited)
    tocollect := 1
    for n := 0; n < tocollect; n++ {
        select {
        case s := <-ch:
            fmt.Printf("found: %s %q\n", s.url, s.body)
            tocollect += s.found
        case e := <-errs:
            fmt.Println(e)
        }
    }
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
    body string
    urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
    if res, ok := f[url]; ok {
        return res.body, res.urls, nil
    }
    return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
    "http://golang.org/": &fakeResult{
        "The Go Programming Language",
        []string{
            "http://golang.org/pkg/",
            "http://golang.org/cmd/",
        },
    },
    "http://golang.org/pkg/": &fakeResult{
        "Packages",
        []string{
            "http://golang.org/",
            "http://golang.org/cmd/",
            "http://golang.org/pkg/fmt/",
            "http://golang.org/pkg/os/",
        },
    },
    "http://golang.org/pkg/fmt/": &fakeResult{
        "Package fmt",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
    "http://golang.org/pkg/os/": &fakeResult{
        "Package os",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
}
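The example above still spawns one goroutine per URL. To illustrate the fixed-pool idea mentioned earlier, here is a rough, untested sketch that could be dropped into the same file. It reuses the Fetcher interface and the fake fetcher, but fetchJob, fetchResult, worker and crawlWithPool are names made up for this sketch, not part of the answer above:

// Sketch only: a fixed pool of workers fed through a jobs channel.
// The main loop owns the visited map and the pending counter,
// so no locking is needed.
type fetchJob struct {
    url   string
    depth int
}

type fetchResult struct {
    url, body string
    urls      []string
    depth     int
    err       error
}

func worker(f Fetcher, jobs <-chan fetchJob, results chan<- fetchResult) {
    for j := range jobs {
        body, urls, err := f.Fetch(j.url)
        results <- fetchResult{j.url, body, urls, j.depth, err}
    }
}

func crawlWithPool(start string, depth, workers int, f Fetcher) {
    // Buffered so the loop below never blocks while enqueueing new URLs
    // in this small example; a real crawler would need a smarter queue.
    jobs := make(chan fetchJob, 100)
    results := make(chan fetchResult)
    for i := 0; i < workers; i++ {
        go worker(f, jobs, results)
    }

    visited := map[string]bool{start: true} // only touched by this goroutine
    pending := 1
    jobs <- fetchJob{start, depth}

    for pending > 0 {
        r := <-results
        pending--
        if r.err != nil {
            fmt.Println(r.err)
            continue
        }
        fmt.Printf("found: %s %q\n", r.url, r.body)
        if r.depth <= 1 {
            continue
        }
        for _, u := range r.urls {
            if !visited[u] {
                visited[u] = true
                pending++
                jobs <- fetchJob{u, r.depth - 1}
            }
        }
    }
    close(jobs) // all work is accounted for; let the workers exit
}

Calling crawlWithPool("http://golang.org/", 4, 3, fetcher) from main would crawl the same fake pages with at most three fetches in flight, and the loop terminates for the same reason as before: every URL that gets enqueued produces exactly one result.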

And yes, follow @jimt's advice and make access to the visited map thread-safe.
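A minimal sketch of what that could look like, assuming you add "sync" to the imports (the safeVisited type and the Visit method are invented names, not part of the answer above):

// safeVisited wraps the visited map so that checking and marking a URL
// happen under a single lock.
type safeVisited struct {
    mu   sync.Mutex
    seen map[string]bool
}

// Visit marks url as visited and reports whether it was new.
func (v *safeVisited) Visit(url string) bool {
    v.mu.Lock()
    defer v.mu.Unlock()
    if v.seen[url] {
        return false
    }
    v.seen[url] = true
    return true
}

Crawl would then call Visit and skip the URL when it returns false, instead of reading and writing the bare map from several goroutines. The second answer below implements essentially the same idea with its UrlCache type.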


Here is my interpretation of the exercise. There are many possible ones, but this is mine. I use sync.WaitGroup and a custom mutex-protected map to store the visited URLs, mostly because Go's standard map type is not thread-safe. I also combine the data and error channels into a single structure that has a method for reading those channels, mainly to separate concerns and (possibly) keep things a little cleaner.

Example on the playground:

package main

import (
    "fmt"
    "sync"
)

type Fetcher interface {
    // Fetch returns the body of URL and
    // a slice of URLs found on that page.
    Fetch(url string) (body string, urls []string, err error)
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(wg *sync.WaitGroup, url string, depth int, fetcher Fetcher, cache *UrlCache, results *Results) {
    defer wg.Done()

    if depth <= 0 || !cache.AtomicSet(url) {
        return
    }

    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        results.Error <- err
        return
    }

    results.Data <- [2]string{url, body}

    for _, url := range urls {
        wg.Add(1)
        go Crawl(wg, url, depth-1, fetcher, cache, results)
    }
}

func main() {
    var wg sync.WaitGroup

    cache := NewUrlCache()

    results := NewResults()
    defer results.Close()

    wg.Add(1)
    go Crawl(&wg, "http://golang.org/", 4, fetcher, cache, results)
    go results.Read()
    wg.Wait()
}

// Results defines channels which yield results for a single crawled URL.
type Results struct {
    Data  chan [2]string // url + body.
    Error chan error     // Possible fetcher error.
}

func NewResults() *Results {
    return &Results{
        Data:  make(chan [2]string, 1),
        Error: make(chan error, 1),
    }
}

func (r *Results) Close() error {
    close(r.Data)
    close(r.Error)
    return nil
}

// Read reads crawled results or errors, for as long as the channels are open.
func (r *Results) Read() {
    for {
        select {
        case data := <-r.Data:
            fmt.Println(">", data)
        case err := <-r.Error:
            fmt.Println("e", err)
        }
    }
}

// UrlCache defines a cache of URL we've already visited.
type UrlCache struct {
    sync.Mutex
    data map[string]struct{} // Empty struct occupies 0 bytes, whereas bool takes 1 bytes.
}

func NewUrlCache() *UrlCache {
    return &UrlCache{data: make(map[string]struct{})}
}

// AtomicSet sets the given url in the cache and returns false if it already existed.
//
// All within the same locked context. Modifying a map without synchronisation is not safe
// when done from multiple goroutines. Doing an Exists() check and Set() separately will
// create a race condition, so we must combine both in a single operation.
func (c *UrlCache) AtomicSet(url string) bool {
    c.Lock()
    _, ok := c.data[url]
    c.data[url] = struct{}{}
    c.Unlock()

    return !ok
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
    body string
    urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
    if res, ok := f[url]; ok {
        return res.body, res.urls, nil
    }
    return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
    "http://golang.org/": &fakeResult{
        "The Go Programming Language",
        []string{
            "http://golang.org/pkg/",
            "http://golang.org/cmd/",
        },
    },
    "http://golang.org/pkg/": &fakeResult{
        "Packages",
        []string{
            "http://golang.org/",
            "http://golang.org/cmd/",
            "http://golang.org/pkg/fmt/",
            "http://golang.org/pkg/os/",
        },
    },
    "http://golang.org/pkg/fmt/": &fakeResult{
        "Package fmt",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
    "http://golang.org/pkg/os/": &fakeResult{
        "Package os",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
}

This has not been tested extensively, so there may be optimizations and fixes that can be applied, but it should at least give you some ideas.
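For instance, one small thing you might notice: wg.Wait() can return while the last result is still sitting in the buffered channels, so main may exit before Read has printed it. A possible, untested variation, assuming the Results type above, is to give Read a done channel and have it drain both channels until they are closed:

// Read drains both channels until they are closed, then signals done.
func (r *Results) Read(done chan<- struct{}) {
    data, errs := r.Data, r.Error
    for data != nil || errs != nil {
        select {
        case d, ok := <-data:
            if !ok {
                data = nil // channel closed and drained; stop selecting on it
                continue
            }
            fmt.Println(">", d)
        case e, ok := <-errs:
            if !ok {
                errs = nil
                continue
            }
            fmt.Println("e", e)
        }
    }
    close(done)
}

// In main, instead of `defer results.Close()`:
//
//     done := make(chan struct{})
//     go results.Read(done)
//     wg.Wait()
//     results.Close() // closing the channels ends Read's loop
//     <-done          // wait until every buffered result has been printed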

