Program objective
The program accesses several web pages at the same time, extracts the title of each one, and prints the titles to the terminal. It does this with Go's concurrency features, which let it fetch multiple pages simultaneously and save time.
Explanation of the Code
Packages used
import (
    "fmt"
    "net/http"
    "sync"

    "github.com/PuerkitoBio/goquery"
)
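Note that goquery is not part of the standard library; it is a third-party package that can be added to the module with go get github.com/PuerkitoBio/goquery. The other three imports (fmt, net/http, sync) are standard.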
fetchTitle function
This function is responsible for:
- Accessing a web page (url)
- Extracting the page title
- Sending the result to a channel
func fetchTitle(url string, wg *sync.WaitGroup, results chan<- string) {
    defer wg.Done() // Marks this goroutine as done in the WaitGroup
Function parameters:
- url string: The address of the web page (URL) we are going to access to obtain the title.
- wg *sync.WaitGroup: Pointer to a WaitGroup, which we use to synchronize the completion of all the tasks (goroutines) running at the same time. The * indicates that we pass the address of the WaitGroup, not a copy of it.
- results chan<- string: A send-only channel through which the function passes its results (titles or error messages) back to the main function.
The defer wg.Done() line tells the program to mark this task (goroutine) as completed when the fetchTitle function finishes. This is important so that main knows when all tasks have been completed.
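As an aside, the chan<- direction in the signature is enforced by the compiler: fetchTitle can send on results but not receive from it. A minimal standalone sketch (with hypothetical names) showing that restriction:

package main

import "fmt"

// produce can only send on out; receiving from it would be a compile error.
func produce(out chan<- string) {
    out <- "hello"
}

func main() {
    ch := make(chan string, 1) // bidirectional in main
    produce(ch)                // implicitly narrowed to chan<- string
    fmt.Println(<-ch)
}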
HTTP Request
resp, err := http.Get(url)
if err != nil {
    results <- fmt.Sprintf("Error accessing %s: %v", url, err)
    return
}
defer resp.Body.Close()
- http.Get(url): Makes an HTTP GET request to the URL; that is, we access the page and ask the server for its content. The call returns a response, which we name resp.
- err != nil: Checks whether anything went wrong while accessing the page (for example, the page does not exist or the server is not responding). If there is an error, we send a message to the results channel and end the function with return.
- defer resp.Body.Close(): Ensures that the response body is closed once we are done with it, releasing the network connection and associated resources.
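One caveat: http.Get uses http.DefaultClient, which has no timeout, so a hung server could block its goroutine forever. A minimal variation (assuming a "time" import is added) that bounds each fetch might look like this:

client := &http.Client{Timeout: 10 * time.Second} // give up after 10 seconds
resp, err := client.Get(url)
if err != nil {
    results <- fmt.Sprintf("Error accessing %s: %v", url, err)
    return
}
defer resp.Body.Close()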
Status Check
if resp.StatusCode != 200 {
    results <- fmt.Sprintf("Error accessing %s: status %s", url, resp.Status)
    return
}
- resp.StatusCode != 200: Checks whether the server responded with 200 OK (success). If not, the page did not load correctly, so we send an error message to the results channel and terminate the function. Note that resp.Status already contains both the code and its text (e.g. "404 Not Found"), so it is enough on its own in the message.
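A small idiomatic touch: net/http defines named constants for status codes, so the same check can be written without the magic number:

if resp.StatusCode != http.StatusOK { // identical to != 200, but self-documenting
    results <- fmt.Sprintf("Error accessing %s: status %s", url, resp.Status)
    return
}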
Loading the Document and Extracting the Title
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
    results <- fmt.Sprintf("Error loading document from %s: %v", url, err)
    return
}

title := doc.Find("title").Text()
results <- fmt.Sprintf("Title of %s: %s", url, title)
}
- goquery.NewDocumentFromReader(resp.Body): Loads the HTML content of the page (provided by resp.Body) into goquery, which lets us navigate and query specific parts of the HTML.
- doc.Find("title").Text(): Looks for the <title> tag in the page's HTML and returns the text inside it (i.e. the title).
- results <- fmt.Sprintf("Title of %s: %s", url, title): Sends the extracted title to the results channel, where it will be read later.
main function
The main function sets up and controls the program.
func main() {
    urls := []string{
        "http://olos.novagne.com.br/Olos/login.aspx?logout=true",
        "http://sistema.novagne.com.br/novagne/",
    }
- urls := []string{...}: We define the list of URLs we want to process. Each URL will be handed to a goroutine that extracts the page title.
WaitGroup and Channel Configuration
var wg sync.WaitGroup
results := make(chan string, len(urls)) // Channel to store the results
- var wg sync.WaitGroup: Creates a WaitGroup, which tracks the number of running goroutines and ensures they all finish before the program moves on.
- results := make(chan string, len(urls)): Creates a buffered results channel with capacity equal to the number of URLs. This channel will carry the messages with titles or errors.
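The buffer size is not cosmetic: as written, main calls wg.Wait() before it reads anything from results, so with an unbuffered channel every send in fetchTitle would block waiting for a receiver, wg.Done() would never be reached, and wg.Wait() would deadlock. Buffering len(urls) slots guarantees each goroutine can deliver its single message and exit:

results := make(chan string, len(urls)) // every goroutine's one send succeeds immediately
// results := make(chan string)         // would deadlock: main only receives after wg.Wait()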
Starting the Goroutines
for _, url := range urls {
    wg.Add(1)
    go fetchTitle(url, &wg, results)
}
- for _, url := range urls: Loops over each URL in the list.
- wg.Add(1): For each URL, increments the WaitGroup counter to indicate that a new task (goroutine) is about to start.
- go fetchTitle(url, &wg, results): Launches fetchTitle as a goroutine for each URL, so the fetches run concurrently (see the note after this list on why url is passed as an argument).
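Passing url as an argument gives each goroutine its own copy of the value. Before Go 1.22 changed loop-variable semantics, capturing the variable in a closure instead was a classic bug, because all goroutines shared the single url variable:

// Pre-Go 1.22, this variant could print the same (last) URL from
// every goroutine, since they all capture one shared url variable:
for _, url := range urls {
    wg.Add(1)
    go func() {
        defer wg.Done()
        fmt.Println(url) // captured, not copied (before Go 1.22)
    }()
}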
Waiting and Displaying Results
wg.Wait()
close(results)

for result := range results {
    fmt.Println(result)
}
}
- wg.Wait(): Blocks until every goroutine has called wg.Done(), i.e. until all the fetches have finished.
- close(results): Closes the channel, signalling that no more values will be sent; this is what lets the range loop terminate.
- for result := range results: Drains the channel and prints each message (a title or an error) to the terminal.
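A common variation: to start printing results as they arrive instead of waiting for every fetch to finish, close the channel from a helper goroutine and range over it immediately:

// Stream results as they come in; the helper goroutine closes
// the channel once all fetches are done, ending the loop.
go func() {
    wg.Wait()
    close(results)
}()
for result := range results {
    fmt.Println(result)
}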
REPO: https://github.com/ionnss/Scrapper-GoRoutine
ions,
another earth day