I've updated the Python fetcher for BoardGameGeek data
This script fetches item data from the BoardGameGeek (BGG) API and stores it in a CSV file.
I updated the previous script. The API responds in XML and has no endpoint for fetching all items at once, so the previous script looped through a given range of IDs and made one call per item. That is not optimal: it takes a long time for a large range of IDs (the highest item ID currently available on BGG is around 400k), and the results may not be reliable. With some modifications, this script adds multiple item IDs to the id parameter of a single request URL, so a single response returns multiple items. The most a single response returned was about 800 items; BGG may change this later, and you can tweak batch_size to adjust if needed.
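To make the batching idea concrete, here is a minimal sketch of how a range of IDs can be split into comma-separated groups, one per request. The helper name `build_batch_urls` is hypothetical and not part of the script; the endpoint and the `stats=1` parameter are the ones the script itself uses.

```python
def build_batch_urls(start_id, max_id, batch_size,
                     base_url="https://boardgamegeek.com/xmlapi2/thing"):
    """Return one request URL per batch of consecutive IDs (hypothetical helper)."""
    urls = []
    for batch_start in range(start_id, max_id + 1, batch_size):
        # The last batch may be shorter than batch_size
        batch_end = min(batch_start + batch_size - 1, max_id)
        ids = ",".join(str(i) for i in range(batch_start, batch_end + 1))
        urls.append(f"{base_url}?id={ids}&stats=1")
    return urls


# Ten IDs in batches of four need three requests instead of ten
urls = build_batch_urls(1, 10, 4)
for url in urls:
    print(url)
```

No request is sent here; the sketch only shows how the URLs are formed, so the batch size can be checked before running the real fetcher.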
Additionally, this script will fetch all of the items, and not only the data related to board games.
The information fetched and stored for each item is the following:
name, game_id, type, rating, weight, year_published, min_players, max_players, min_play_time, max_play_time, min_age, owned_by, categories, mechanics, designers, artists and publishers.
The updates in this script are as follows. We start by importing the libraries the script needs:
```python
# Import libraries
from bs4 import BeautifulSoup
from csv import DictWriter
import pandas as pd
import requests
import time
```
The following function appends the collected data to the CSV file. It is called after each batch has been processed, so if a later request fails, everything fetched up to that point is already stored.
```python
# CSV file saving function
def save_to_csv(games):
    csv_header = [
        'name', 'game_id', 'type', 'rating', 'weight', 'year_published',
        'min_players', 'max_players', 'min_play_time', 'max_play_time',
        'min_age', 'owned_by', 'categories', 'mechanics', 'designers',
        'artists', 'publishers'
    ]
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    with open('bgg.csv', 'a', encoding='UTF8', newline='') as f:
        dictwriter_object = DictWriter(f, fieldnames=csv_header)
        # Write the header only if the file is still empty
        if f.tell() == 0:
            dictwriter_object.writeheader()
        dictwriter_object.writerows(games)
```
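The append-and-write-the-header-once pattern used above can be tried on its own. The following is a minimal sketch with a hypothetical helper `append_rows`, a temporary file and made-up rows; it is not part of the script.

```python
import os
import tempfile
from csv import DictReader, DictWriter


def append_rows(path, fieldnames, rows):
    """Append rows to a CSV file, writing the header only when the file is empty."""
    with open(path, "a", encoding="UTF8", newline="") as f:
        writer = DictWriter(f, fieldnames=fieldnames)
        if f.tell() == 0:
            writer.writeheader()
        writer.writerows(rows)


# Two separate appends, as would happen for two batches (rows are made up)
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
fields = ["name", "game_id"]
append_rows(path, fields, [{"name": "Example Game A", "game_id": 1}])
append_rows(path, fields, [{"name": "Example Game B", "game_id": 2}])

# Read the file back: one header line followed by two data rows
with open(path, encoding="UTF8", newline="") as f:
    rows = list(DictReader(f))
print(rows)
```

Because the file is opened in append mode, running the fetcher a second time adds rows to the existing bgg.csv rather than replacing it, so delete the file first if you want a fresh export.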
Next, we define the headers for the requests. The pause between requests is set through SLEEP_BETWEEN_REQUEST (I have seen reports that the rate limit is 2 requests per second, but that may be outdated, since I had no trouble with the pause set to 0). This section also sets the starting ID (start_id_range), the maximum ID (max_id_range), and batch_size, which is how many items each response should return. The base URL is defined here as well; the IDs are added to it in the next part of the script.
```python
# Define request url headers
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:85.0) Gecko/20100101 Firefox/85.0",
    "Accept-Language": "en-GB, en-US, q=0.9, en"
}

# Define sleep timer value between requests
SLEEP_BETWEEN_REQUEST = 0

# Define max id range
start_id_range = 0
max_id_range = 403000
batch_size = 800
base_url = "https://boardgamegeek.com/xmlapi2/thing?id="
```
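As a rough sanity check on these settings, the sketch below works out how many requests the range produces and how much waiting a non-zero pause would add. The 0.5 second pause is an assumption derived from the reported (and possibly outdated) limit of 2 requests per second; the other values are the ones defined above.

```python
import math

start_id_range = 0
max_id_range = 403000
batch_size = 800

# Number of requests needed to cover the range in steps of batch_size
num_batches = math.ceil((max_id_range - start_id_range) / batch_size)

# Assumed pause: 1 second / 2 requests, per the reported rate limit
pause = 0.5
total_pause_seconds = num_batches * pause

print(num_batches, total_pause_seconds)
```

With these values the range is covered in 504 requests, so even a half-second pause adds only about 252 seconds in total, which makes a conservative pause cheap if BGG starts rejecting requests.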
The following part is the main logic of the script. For each batch, it builds a comma-separated string of IDs within the defined range, with no more IDs than batch_size, and appends that string to the id parameter of the URL. Each response then returns data for up to batch_size items. The script processes every item in the response, appends it to the games list, and finally appends the list to the CSV file.
```python
# Main loop that iterates from the starting ID to the maximum ID in steps of the batch size
for batch_start in range(start_id_range, max_id_range, batch_size):
    # Make sure the batch does not exceed the maximum ID
    batch_end = min(batch_start + batch_size - 1, max_id_range)
    # Join the IDs of this batch and append them to the URL
    # (base_url already ends with "?id=", so it must not be repeated here)
    ids = ",".join(map(str, range(batch_start, batch_end + 1)))
    url = f"{base_url}{ids}&stats=1"

    # If the request itself fails, print the error and continue with the next batch
    try:
        response = requests.get(url, headers=headers)
    except Exception as err:
        print(err)
        continue

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, features="html.parser")
        items = soup.find_all("item")
        games = []
        for item in items:
            if item:
                try:
                    # Find values in the XML, falling back to 0 when a tag is missing
                    name = item.find("name")['value'] if item.find("name") is not None else 0
                    year_published = item.find("yearpublished")['value'] if item.find("yearpublished") is not None else 0
                    min_players = item.find("minplayers")['value'] if item.find("minplayers") is not None else 0
                    max_players = item.find("maxplayers")['value'] if item.find("maxplayers") is not None else 0
                    min_play_time = item.find("minplaytime")['value'] if item.find("minplaytime") is not None else 0
                    max_play_time = item.find("maxplaytime")['value'] if item.find("maxplaytime") is not None else 0
                    min_age = item.find("minage")['value'] if item.find("minage") is not None else 0
                    rating = item.find("average")['value'] if item.find("average") is not None else 0
                    weight = item.find("averageweight")['value'] if item.find("averageweight") is not None else 0
                    owned = item.find("owned")['value'] if item.find("owned") is not None else 0

                    link_type = {'categories': [], 'mechanics': [], 'designers': [], 'artists': [], 'publishers': []}
                    links = item.find_all("link")

                    # Append value(s) for each link type
                    for link in links:
                        if link['type'] == "boardgamecategory":
                            link_type['categories'].append(link['value'])
                        elif link['type'] == "boardgamemechanic":
                            link_type['mechanics'].append(link['value'])
                        elif link['type'] == "boardgamedesigner":
                            link_type['designers'].append(link['value'])
                        elif link['type'] == "boardgameartist":
                            link_type['artists'].append(link['value'])
                        elif link['type'] == "boardgamepublisher":
                            link_type['publishers'].append(link['value'])

                    # Append "0" if there is no value for a link type
                    for key, ltype in link_type.items():
                        if not ltype:
                            ltype.append("0")

                    game = {
                        "name": name,
                        "game_id": item['id'],
                        "type": item['type'],
                        "rating": rating,
                        "weight": weight,
                        "year_published": year_published,
                        "min_players": min_players,
                        "max_players": max_players,
                        "min_play_time": min_play_time,
                        "max_play_time": max_play_time,
                        "min_age": min_age,
                        "owned_by": owned,
                        "categories": ', '.join(link_type['categories']),
                        "mechanics": ', '.join(link_type['mechanics']),
                        "designers": ', '.join(link_type['designers']),
                        "artists": ', '.join(link_type['artists']),
                        "publishers": ', '.join(link_type['publishers']),
                    }

                    # Append current item to games list
                    games.append(game)
                except TypeError:
                    print(">>> NoneType error. Continued on the next item.")
                    continue

        # Store this batch before moving on to the next one
        save_to_csv(games)
        print(f">>> Request successful for batch {batch_start}-{batch_end}")
    else:
        print(f">>> FAILED batch {batch_start}-{batch_end}")

    # Pause between requests
    time.sleep(SLEEP_BETWEEN_REQUEST)
```
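To see the extraction step in isolation, here is a minimal sketch that parses a small hand-written sample. The sample XML is an assumption modelled on the tags and attributes the script reads (value attributes, link elements with a type); it is not a real BGG response. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so it runs without extra installs, and the helpers `value_of` and `links_of` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, assumed to mirror the response shape the script relies on
SAMPLE_XML = """
<items>
  <item type="boardgame" id="1">
    <name value="Example Game"/>
    <yearpublished value="2001"/>
    <link type="boardgamecategory" value="Economic"/>
    <link type="boardgamecategory" value="Negotiation"/>
    <link type="boardgamemechanic" value="Auction"/>
  </item>
  <item type="boardgameexpansion" id="2">
    <name value="Example Expansion"/>
  </item>
</items>
"""


def value_of(item, tag, default=0):
    """Return the 'value' attribute of the first matching child, or a default."""
    node = item.find(tag)
    return node.get("value") if node is not None else default


def links_of(item, link_type):
    """Join the values of all <link> children of one type; '0' when there are none."""
    values = [link.get("value") for link in item.findall("link")
              if link.get("type") == link_type]
    return ", ".join(values) if values else "0"


root = ET.fromstring(SAMPLE_XML.strip())
records = [
    {
        "name": value_of(item, "name"),
        "game_id": item.get("id"),
        "type": item.get("type"),
        "year_published": value_of(item, "yearpublished"),
        "categories": links_of(item, "boardgamecategory"),
        "mechanics": links_of(item, "boardgamemechanic"),
    }
    for item in root.findall("item")
]
print(records)
```

The second sample item has no year or links, so it shows the same fallbacks the script applies: 0 for a missing tag and "0" for an empty link type.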
Below you can preview the first few rows of the CSV file as a pandas DataFrame.
```python
# Preview the CSV as pandas DataFrame
df = pd.read_csv('./bgg.csv')
print(df.head(5))
```