
Small Example: How to use the scrape library in Go

Introduction

When I needed to scrape a website in the past, it always became a bit complicated to do something efficient. In Python, for example, if you want to be really efficient (meaning you have multiple pages to scrape, not just one), you have to deal with the threading mechanism. Now, threads in Python are pretty neat, but let's admit it, there is no comparison with what's going on in Go with goroutines. Goroutines are simpler, they are more efficient, and it's actually a lot easier to share data between them.

This small example doesn't use any complex mechanisms in Go. My problem was pretty simple: I wanted to list all the art galleries in Paris. I found this website, which is pretty great, but can you see the problem here? The galleries are split across several pages. I could do that by hand but hey... why would I do that?

The scrape library

Scrape is a library written by yhat. The API is quite simple but still really powerful. Of course it doesn't have as many features as, let's say, BeautifulSoup. You can find the scrape library on GitHub. The README.md contains a small example of how to use it, and you can also find the complete documentation on GoDoc. Now, the example given in the README.md is pretty minimalistic, so in this article I'll attempt to show how to create a more complete program.
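
To give an idea of what the library feels like before we dive in, here is a rough sketch of that kind of minimal usage (not the README example verbatim, just the same shape: fetch a page, parse it, print its title):

package main

import (
    "fmt"
    "net/http"

    "github.com/yhat/scrape"
    "golang.org/x/net/html"
    "golang.org/x/net/html/atom"
)

func main() {
    // Fetch a page and parse it into an HTML tree.
    resp, err := http.Get("https://news.ycombinator.com/")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    root, err := html.Parse(resp.Body)
    if err != nil {
        panic(err)
    }
    // Grab the <title> node and print its text.
    if title, ok := scrape.Find(root, scrape.ByTag(atom.Title)); ok {
        fmt.Println(scrape.Text(title))
    }
}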

Scraping the front page to gather the links

If you have a look at the website I linked earlier, you can see that the galleries are split across several pages. I could, of course, hard-code the 20 links from that page into the program. But we're not going to do that.

package main

import (
    "fmt"
    "net/http"

    "github.com/yhat/scrape"
    "golang.org/x/net/html"
    "golang.org/x/net/html/atom"
)

const (
    urlRoot = "http://www.galerie-art-paris.com/"
)

func gatherNodes(n *html.Node) bool {
    if n.DataAtom == atom.A && n.Parent != nil {
        return scrape.Attr(n.Parent, "class") == "menu"
    }
    return false
}

func main() {
    resp, err := http.Get(urlRoot)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    root, err := html.Parse(resp.Body)
    if err != nil {
        panic(err)
    }

    // For now just count the links; we'll iterate over them in a moment.
    as := scrape.FindAll(root, gatherNodes)
    fmt.Println(len(as), "links found")
}

The gatherNodes function is called a matcher. A matcher is a function that takes a pointer to an HTML node and returns true if the node satisfies it. Here, the matcher is satisfied if the element is an anchor (atom.A, which corresponds to the <a> tag in HTML), has a parent, and the parent's class is "menu". Otherwise it returns false and the node is ignored. Now, scrape.FindAll(root, gatherNodes) will walk the HTML tree (root, which corresponds to the parsed resp.Body) and return a list of all the nodes that satisfy the matcher, in other words, the links I want to process.
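
A matcher can test anything about a node. As a purely illustrative example (it isn't used anywhere in this program), here is a hypothetical matcher that keeps only the images sitting directly inside an element whose id is "content":

// Hypothetical matcher: keep <img> nodes whose direct parent has id="content".
func imagesInContent(n *html.Node) bool {
    if n.DataAtom == atom.Img && n.Parent != nil {
        return scrape.Attr(n.Parent, "id") == "content"
    }
    return false
}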

Parsing the other links asynchronously

If you see where this is going, you can already tell what I'm going to do next. Let's define a new function that will be executed as a goroutine and takes a URL as a parameter.

func scrapGalleries(url string) {
    resp, err := http.Get(url)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    root, err := html.Parse(resp.Body)
    if err != nil {
        panic(err)
    }
    matcher := func(n *html.Node) bool {
        return n.DataAtom == atom.Span && scrape.Attr(n, "class") == "galerie-art-titre"
    }
    for _, g := range scrape.FindAll(root, matcher) {
        fmt.Println(scrape.Text(g))
    }
}

As you can see, the matcher is defined inline because it's a really simple one. That function just scrapes a page and displays the results it found, in this case the name of every gallery present on the page (each name lives in a span (atom.Span) with the "galerie-art-titre" class). Let's add a few lines to the main function:

func main() {
        // ...
    as := scrape.FindAll(root, gatherNodes)
    for _, link := range as {
        go scrapGalleries(urlRoot + scrape.Attr(link, "href"))
    }
}

The list of HTML nodes is... well, just nodes. So if you want the actual URL, you have to read the href attribute and append it to urlRoot (in this case the links are not absolute, but that depends on the website you're scraping). Now if you execute this program as it is right now, nothing will happen because the program will exit immediately. It won't wait for the goroutines to finish, because that's not the default behaviour of goroutines. So let's add a sync.WaitGroup and see what the full program looks like:

package main

import (
    "fmt"
    "net/http"
    "sync"

    "github.com/yhat/scrape"
    "golang.org/x/net/html"
    "golang.org/x/net/html/atom"
)

const (
    urlRoot = "http://www.galerie-art-paris.com/"
)

var wg sync.WaitGroup

func gatherNodes(n *html.Node) bool {
    if n.DataAtom == atom.A && n.Parent != nil {
        return scrape.Attr(n.Parent, "class") == "menu"
    }
    return false
}

func scrapGalleries(url string) {
    defer wg.Done()
    resp, err := http.Get(url)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    root, err := html.Parse(resp.Body)
    if err != nil {
        panic(err)
    }
    matcher := func(n *html.Node) bool {
        return n.DataAtom == atom.Span && scrape.Attr(n, "class") == "galerie-art-titre"
    }
    for _, g := range scrape.FindAll(root, matcher) {
        fmt.Println(scrape.Text(g))
    }
}

func main() {
    resp, err := http.Get(urlRoot)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    root, err := html.Parse(resp.Body)
    if err != nil {
        panic(err)
    }

    as := scrape.FindAll(root, gatherNodes)
    for _, link := range as {
        wg.Add(1)
        go scrapGalleries(urlRoot + scrape.Attr(link, "href"))
    }
    wg.Wait()
}

Conclusion

Go is really (and I mean it, really) efficient when it comes to scraping. This program scrapes 21 pages in about 140 ms. Of course it depends on your bandwidth and your CPU, but still, isn't this amazing?

Some Go Snippets I find useful


Reading a whole line on stdin

var stdioScanner = bufio.NewScanner(os.Stdin)

func simpleReadLine() (string, error) {
    if !stdioScanner.Scan() {
        return "", stdioScanner.Err()
    }
    return stdioScanner.Text(), nil
}

I thought reading a whole line from stdin would be a lot simpler than this, because fmt.Scanln's name is quite... self-explanatory, I guess. But, according to the fmt package's GoDoc:

Scanln is similar to Scan, but stops scanning at a newline and after the final item there must be a newline or EOF.

So let's have a look at the fmt.Scan GoDoc:

Scan scans text read from standard input, storing successive space-separated values into successive arguments. Newlines count as space. It returns the number of items successfully scanned. If that is less than the number of arguments, err will report why.

So fmt.Scanln would be more useful for something like an IRC client. For example, this weechat line, /server add stuff irc.stuff.com -ssl, would be easily parsed by fmt.Scanln.
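
As a quick illustration (a minimal sketch of my own, not something from weechat), fmt.Scanln splits that kind of line on spaces quite naturally:

package main

import "fmt"

func main() {
    var cmd, action, name, addr, opt string
    // Typing "/server add stuff irc.stuff.com -ssl" and pressing enter
    // fills the five variables with the space-separated fields.
    if _, err := fmt.Scanln(&cmd, &action, &name, &addr, &opt); err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(cmd, action, name, addr, opt)
}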

Note about the changes made to this function

I had a really interesting conversation with Axel Wagner about why it was a terrible idea to define the reader inside the function. The thing is that the bufio package, as its name suggests, uses buffers. The scanner may read more than just one line. If the scanner were declared inside the function body, it would go out of scope at the end of the function and whatever it had buffered would be lost. Axel wrote a small example on the Go Playground that shows this particular behaviour. You can find the whole conversation here. He also explained to me why he thinks naked returns are a bad idea, and his arguments are actually quite convincing.
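
To make the problem concrete, here is the general shape of the broken version (this is my own sketch, not Axel's Playground example):

// Broken: a fresh Scanner is created on every call. bufio may read several
// lines from os.Stdin into its internal buffer, and everything except the
// returned line is lost when the scanner goes out of scope.
func badReadLine() (string, error) {
    s := bufio.NewScanner(os.Stdin)
    if !s.Scan() {
        return "", s.Err()
    }
    return s.Text(), nil
}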


Function to unmarshal JSON from a URL into a struct

func fetchURL(url string, out interface{}) (err error) {
    resp, err := http.Get(url)
    if err != nil {
        return
    }
    defer resp.Body.Close()
    err = json.NewDecoder(resp.Body).Decode(out)
    return
}

Your out argument obviously has to be a pointer, because otherwise there would be no point in doing this. Let's have a look at a small example! Note: I already have a struct named GithubUserAPIType, which was generated using json-to-go, an amazing tool created by mholt.
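
To make the example below self-contained, imagine the generated struct looks something like this (heavily trimmed; the real json-to-go output maps every field of the GitHub user payload):

// Trimmed-down illustration of the generated type; the field names match the
// GitHub users API, but the real struct contains many more fields.
type GithubUserAPIType struct {
    Login       string `json:"login"`
    ID          int    `json:"id"`
    Name        string `json:"name"`
    PublicRepos int    `json:"public_repos"`
}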

func main() {
    var err error
    var me GithubUserAPIType

    err = fetchURL("https://api.github.com/users/Depado", &me)
    if err != nil {
        log.Fatal(err)
    }
}

Of course you need to make sure that the JSON returned by your URL actually matches your struct.


Load a yaml configuration file into a struct

One thing I commonly do at some point in all my projects is to provide an easy way to configure the program without touching the code itself. Instead of configuring the program with command line arguments, you can use a yaml file. Why yaml? I don't know. I guess I kind of like the way yaml is structured and find it easy to write configuration files in this language.

package configuration

import (
    "io/ioutil"

    "gopkg.in/yaml.v2"
)

// Load reads the yaml file f and unmarshals it into i, which must be a pointer.
func Load(f string, i interface{}) error {
    conf, err := ioutil.ReadFile(f)
    if err != nil {
        return err
    }
    return yaml.Unmarshal(conf, i)
}

For example, let's say the above code is stored in a configuration package. Here is an example of how to use it:

# conf.yml
host: irc.freenode.net
port: 6697
name: b0t

// main.go
package main

import (
    "log"
    "place/where/you/stored/configuration"
)

type Configuration struct {
    Host string
    Port int
    Name string
}

var Conf = new(Configuration)

func main() {
    var err error
    err = configuration.Load("conf.yml", Conf)
    if err != nil {
        log.Fatal(err)
    }
    log.Println(Conf.Host, Conf.Port, Conf.Name)
}

Calculate the Md5Sum for a file

// filechunk is the size of the blocks read from the file (in bytes).
const filechunk = 8192

// GenerateMd5Sum returns the hex-encoded md5sum of a file.
// (Uses crypto/md5, encoding/hex, io, math and os.)
func GenerateMd5Sum(filename string) (string, error) {
    file, err := os.Open(filename)
    if err != nil {
        return "", err
    }
    defer file.Close()

    info, err := file.Stat()
    if err != nil {
        return "", err
    }
    filesize := info.Size()
    blocks := uint64(math.Ceil(float64(filesize) / float64(filechunk)))
    hash := md5.New()

    for i := uint64(0); i < blocks; i++ {
        blocksize := int(math.Min(filechunk, float64(filesize-int64(i*filechunk))))
        buf := make([]byte, blocksize)
        if _, err := io.ReadFull(file, buf); err != nil {
            return "", err
        }
        hash.Write(buf)
    }
    return hex.EncodeToString(hash.Sum(nil)), nil
}
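
Calling it is as simple as it looks (the file name is just a placeholder):

sum, err := GenerateMd5Sum("some_file.iso")
if err != nil {
    log.Fatal(err)
}
fmt.Println(sum)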

Create custom loggers to write to different files

logfile, err := os.OpenFile("custom.log", os.O_RDWR|os.O_CREATE|os.O_APPEND, 0666)
if err != nil {
    log.Fatal(err)
}
defer logfile.Close()

custom := log.New(logfile, "", log.Ldate|log.Ltime)
custom.Println("Hello World !")

Play a distant audio stream using Go and Gstreamer

Requirements

First of all you need to have the GStreamer library installed on your computer, since we're going to use a Go binding for it. You can then install the Go bindings for GStreamer with a simple go get:

go get github.com/ziutek/gst

Play the stream!

package main

import (
    "fmt"

    "github.com/ziutek/gst"
)

func main() {
    player := gst.ElementFactoryMake("playbin", "player")
    player.SetProperty("uri", "UrlToYourStreamHere.mp3")
    // Setting the state to gst.STATE_PLAYING starts playing the stream
    player.SetState(gst.STATE_PLAYING)
    fmt.Scanln()
    fmt.Println("Exiting")
}

Gst handles the streaming itself; buffering and things like that are automatic!

Pause?

If you want to pause the stream, just use player.SetState(gst.STATE_PAUSED)
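
For instance, a tiny sketch (reusing only the calls already shown above) that toggles between pause and play each time the user presses enter could look like this:

playing := true
for {
    fmt.Scanln() // wait for the user to press enter
    if playing {
        player.SetState(gst.STATE_PAUSED)
    } else {
        player.SetState(gst.STATE_PLAYING)
    }
    playing = !playing
}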

That's it! That was easy, wasn't it? :)