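Disclaimer: For personal study and research only. Any other use is strictly prohibited.
Introduction
This script was developed for academic work in the humanities, specifically for research on discourse analysis of online platforms. It supports a comprehensive study of Bilibili danmaku (bullet comments) and comments. The focus is the large body of content touching on subcultures and social issues (according to the materials consulted), which calls for in-depth investigation, analysis, supplementation, and summary.
Because the material is extensive, the results are presented at the following link:
Research on Comments and Danmaku from a Subculture Perspective: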
https://nbviewer.org/github/Excalibra/scripts/blob/main/d-ipynb/Subculture Perspective Review and Bullet Screen Research.ipynb
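The original plan was to publish only after finishing the research on the "subculture" and "social issues" sections. Given the needs of researchers and students in this field, however, it is being shared now.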
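Features and Principles
Script features:
Collects the video title, author, publish date, view count, favorites, shares, total danmaku count, comment count, video description, category, video link, and cover image link.
Fetches 100 danmaku, each with a sentiment score, part-of-speech analysis, timestamp, and user ID.
Retrieves 20 top comments, each with like count, sentiment score, thread replies, member ID, name, and comment timestamp.
Enhanced features:
Danmaku: username, birthday, registration date, follower count, and following count (requires a cookie).
Comments: shows the commenter's IP location (through the web interface).
Exports the data to an Excel file that includes the median sentiment score, word-frequency statistics, a word cloud, and a bar chart.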
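To illustrate where the danmaku timestamps and user IDs come from, here is a minimal sketch that parses a danmaku XML document. It assumes the commonly documented layout in which each `<d>` element carries a comma-separated `p` attribute (offset in seconds, mode, font size, color, Unix send time, pool, sender hash, row ID). The sample XML and the `parse_danmaku` helper are made up for illustration and are not part of the script.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical sample; real data would come from the danmaku XML endpoint.
SAMPLE_XML = """<i>
  <d p="12.5,1,25,16777215,1687766400,0,abcd1234,1001">first comment</d>
  <d p="48.0,1,25,16777215,1687770000,0,9f8e7d6c,1002">second comment</d>
</i>"""


def parse_danmaku(xml_text, limit=100):
    """Return up to `limit` danmaku as dicts (the field layout is an assumption)."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter("d"):
        fields = node.get("p", "").split(",")
        if len(fields) < 8:
            continue  # skip entries that do not match the assumed layout
        sent_at = datetime.fromtimestamp(int(fields[4]), tz=timezone.utc)
        rows.append({
            "offset_seconds": float(fields[0]),
            "sent_at": sent_at.strftime("%Y-%m-%d %H:%M:%S"),
            "sender_hash": fields[6],
            "text": node.text or "",
        })
        if len(rows) >= limit:
            break
    return rows


danmaku = parse_danmaku(SAMPLE_XML)
```

The `limit` parameter mirrors the 100-item cap mentioned in the feature list. The sender hash is kept as an opaque string here.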
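How it works:
JSON data is fetched through the API and processed into an Excel file. SnowNLP, ThuNLP's THULAC, and Jieba handle word segmentation, stop-word filtering, part-of-speech analysis, and word-frequency statistics. Matplotlib generates the charts.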
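The segmentation, stop-word filtering, and frequency-counting steps can be sketched with the standard library alone. This is not the script's own code: `word_frequencies`, the stop-word set, and the whitespace tokenizer are hypothetical stand-ins, and for Chinese text a segmenter such as `jieba.lcut` would be passed as `tokenize` instead. The sentiment scores are made-up values used only to show the median step.

```python
from collections import Counter
from statistics import median

STOP_WORDS = {"the", "a", "is", "of"}  # hypothetical stop-word list


def word_frequencies(texts, tokenize=str.split, stop_words=STOP_WORDS, top_n=10):
    """Tokenize each text, drop stop words, and count the remaining tokens."""
    counter = Counter()
    for text in texts:
        tokens = (tok.strip().lower() for tok in tokenize(text))
        counter.update(tok for tok in tokens if tok and tok not in stop_words)
    return counter.most_common(top_n)


comments = ["the video is great", "great editing", "a great video"]
top_words = word_frequencies(comments)

# Made-up sentiment scores in [0, 1], as a sentiment model might return them.
scores = [0.91, 0.35, 0.78]
median_score = median(scores)
```

Passing the tokenizer in as a parameter keeps the counting logic independent of whichever segmentation library is installed.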
Quick Start
(Windows users can use pip and python. Mac users use pip3 and python3 by default.)
Script source code: GitHub repository.
Required libraries:
Install the required libraries:
pip3 install --no-cache-dir -r https://ghproxy.com/https://github.com/Excalibra/scripts/blob/main/d-txt/requirements.txt
Then run the script (online):
python3 -c "$(curl -fsSL https://ghproxy.com/https://github.com/Excalibra/scripts/blob/main/d-python/get_bv_baseinfo.py)"
import json
import time
import requests
import os
from datetime import datetime
import re
from bs4 import BeautifulSoup
from openpyxl import Workbook
from openpyxl.styles import Alignment, Font
from snownlp import SnowNLP
import statistics
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import platform
import thulac
import matplotlib.font_manager as fm
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

'''
# Reference Links

## General
Regex: https://regex101.com/
Zhihu - Two ways to obtain Bilibili video bullet comments using Python: https://zhuanlan.zhihu.com/p/609154366
Juejin - Parsing Bilibili video bullet comments: https://juejin.cn/post/7137928570080329741
CSDN - Bilibili historical bullet comment crawler: https://blog.csdn.net/sinat_18665801/article/details/104519838
CSDN - How to write a Bilibili bullet comment crawler: https://blog.csdn.net/bigbigsman/article/details/78639053?utm_source=app
Bilibili - Bilibili bullet comment notes: https://www.bilibili.com/read/cv5187469/
Bilibili third-party API: https://www.bookstack.cn/read/BilibiliAPIDocs/README.md

## Reverse Lookup by UID
https://github.com/esterTion/BiliBili_crc2mid
https://github.com/cwuom/GetDanmuSender/blob/main/main.py
https://github.com/Aruelius/crc32-crack

## User Basic Information
https://api.bilibili.com/x/space/acc/info?mid=298220126
https://github.com/ria-klee/bilibili-uid
https://github.com/SocialSisterYi/bilibili-API-collect/blob/master/docs/user/space.md

## Comments
https://www.bilibili.com/read/cv10120255/
https://github.com/SocialSisterYi/bilibili-API-collect/blob/master/docs/comment/readme.md

## JSON
https://json-schema.apifox.cn
https://bbs.huaweicloud.com/blogs/279515
https://www.cnblogs.com/mashukui/p/16972826.html

## Cookie
https://developer.mozilla.org/zh-CN/docs/Web/HTTP/Cookies

## Unpacking
https://www.cnblogs.com/will-wu/p/13251545.html
https://www.w3schools.com/python/python_tuples.asp
'''


class BilibiliAPI:

    # Fetch the basic video information and return it as parsed JSON
    @staticmethod
    def get_bv_json(video_url):
        video_id = re.findall(r'BV\w+', video_url)[0]
        api_url = f'https://api.bilibili.com/x/web-interface/view?bvid={video_id}'
        bv_json = requests.get(api_url).json()
        return bv_json

    # Build the danmaku XML URL from the 'cid' field in the JSON
    # (returns the URL itself, not the downloaded XML)
    @staticmethod
    def get_danmu_xml(bv_json):
        cid = bv_json['data']["cid"]
        api_url = f'https://comment.bilibili.com/{cid}.xml'
        danmu_xml = api_url
        return danmu_xml

    # Fetch the comments JSON using the 'aid' field in the JSON
    @staticmethod
    def get_comment_json(bv_json):
        aid = bv_json['data']["aid"]
        api_url = f'https://api.bilibili.com/x/v2/reply/main?next=1&type=1&oid={aid}'
        comment_json = requests.get(api_url).json()
        return comment_json

    # Enhanced variant: fetch the comments JSON through a logged-in browser session
    @staticmethod
    def get_comment_json_to_webui(bv_json):
        aid = bv_json['data']["aid"]
        api_url = f'https://api.bilibili.com/x/v2/reply/main?next=1&type=1&oid={aid}'

        # Determine the current operating system type
        if platform.system() == "Windows":
            # Windows platform
            driver = webdriver.Chrome()
        else:
            # Other platforms
            # Note: this positional form targets older Selenium releases;
            # Selenium 4 expects webdriver.Chrome(service=Service(path)).
            driver = webdriver.Chrome(ChromeDriverManager().install())

        # Leave time for the user to log in
        print("Provide 45 seconds for Bilibili login")
        time.sleep(45)

        # Open the link
        driver.get(api_url)

        # Leave time to inspect the result
        print("Provide 15 seconds to check the effects")
        time.sleep(15)

        # Find the <pre> element that holds the raw JSON
        pre_element = driver.find_element(By.TAG_NAME, 'pre')

        # Get the text content of the element
        text_content = pre_element.text

        # Close WebDriver
        driver.quit()

        return text_content

    # Look up a user's card and return its basic fields, ready for the XLSX write
    @staticmethod
    def get_user_card(mid, cookies):
        api_url = f'https://account.bilibili.com/api/member/getCardByMid?mid={mid}'
        try:
            response = requests.get(api_url, cookies=cookies)
            user_card_json = response.json()
        except json.JSONDecodeError:
            return {"error": "Failed to parse JSON. Ensure a good network environment. Too many API calls might trigger restrictions; try again later."}

        if 'message' in user_card_json:
            message = user_card_json['message']
            if 'request blocked' in message or 'frequent requests' in message:
                return {"warning": "Ensure a good network environment. Too many API calls might trigger restrictions; try again later."}

        return user_card_json


class CRC32Checker:
    '''
    # CRC32 cracking
    # Source: https://github.com/Aruelius/crc32-crack
    # Author: Aruelius
    # Note: This section has been slightly adjusted and encapsulated as a class for easier use.
    '''
    CRCPOLYNOMIAL = 0xEDB88320
    crctable = [0 for x in range(256)]

    def __init__(self):
        self.create_table()

    def create_table(self):
        # Create a CRC table for quick CRC value computation
        for i in range(256):
            crcreg = i
            for _ in range(8):
                if (crcreg & 1) != 0:
                    crcreg = self.CRCPOLYNOMIAL ^ (crcreg >> 1)
                else:
                    crcreg = crcreg >> 1
            self.crctable[i] = crcreg

    def crc32(self, string):
        # Compute the CRC32 value for the given string (without the final XOR)
        crcstart = 0xFFFFFFFF
        for i in range(len(str(string))):
            index = (crcstart ^ ord(str(string)[i])) & 255
            crcstart = (crcstart >> 8) ^ self.crctable[index]
        return crcstart

    def crc32_last_index(self, string):
        # Compute the CRC table index used for the last character of the string
        crcstart = 0xFFFFFFFF
        for i in range(len(str(string))):
            index = (crcstart ^ ord(str(string)[i])) & 255
            crcstart = (crcstart >> 8) ^ self.crctable[index]
        return index

    def get_crc_index(self, t):
        # Find the table index whose entry has the given highest byte
        for i in range(256):
            if self.crctable[i] >> 24 == t:
                return i
        return -1

    def deep_check(self, i, index):
        # Check whether the three trailing characters implied by the indexes are digits
        string = ""
        tc = 0x00
        hashcode = self.crc32(i)
        tc = hashcode & 0xff ^ index[2]
        if not (48 <= tc <= 57):  # must be an ASCII digit '0'..'9'
            return [0]
        string += str(tc - 48)
        hashcode = self.crctable[index[2]] ^ (hashcode >> 8)
        tc = hashcode & 0xff ^ index[1]
        if not (48 <= tc <= 57):
            return [0]
        string += str(tc - 48)
        hashcode = self.crctable[index[1]] ^ (hashcode >> 8)
        tc = hashcode & 0xff ^ index[0]
        if not (48 <= tc <= 57):
            return [0]
        string += str(tc - 48)
        hashcode = self.crctable[index[0]] ^ (hashcode >> 8)
        return [1, string]

    def main(self, string):
        # Recover the numeric ID whose CRC32 matches the given hex string
        index = [0 for x in range(4)]
        i = 0
        ht = int(f"0x{string}", 16) ^ 0xffffffff
        for i in range(3, -1, -1):
            index[3-i] = self.get_crc_index(ht >> (i*8))
            snum = self.crctable[index[3-i]]
            ht ^= snum >> ((3-i)*8)
        for i in range(100000000):
            lastindex = self.crc32_last_index(i)
            if lastindex == index[3]:
                deepCheckData = self.deep_check(i, index)
                if deepCheckData[0]:
                    break
        else:
            # No candidate matched within the search range
            return -1
        return f"{i}{deepCheckData[1]}"


class Tools:

    # Get save path and format
    @staticmethod
    def get_save():
        return os.path.join(os.path.join(os.path.expanduser("~"), "Desktop"),
                            "Bilibili_Video_Analysis_{}.xlsx".format(datetime.now().strftime('%Y-%m-%d')))

    # Format timestamp
    @staticmethod
    def format_timestamp(timestamp):
        dt_object = datetime.fromtimestamp(timestamp)
        formatted_time = dt_object.strftime("%Y-%m-%d %H:%M:%S")
        return formatted_time

    # Calculate sentiment score
    @staticmethod
    def calculate_sentiment_score(text):
        s = SnowNLP(text)
        sentiment_score = s.sentiments
        return sentiment_score

    # Generate a word cloud
    @staticmethod
    def get_word_cloud(sheet_name: str, workbook: Workbook):
        sheet = workbook[sheet_name]

        # Read frequency data
        words = []
        frequencies = []
        for row in sheet.iter_rows(min_row=2, values_only=True):
            words.append(row[0])
            frequencies.append(row[1])

        system = platform.system()
        if system == 'Darwin':  # macOS
            font_path = '/System/Library/Fonts/STHeiti Light.ttc'
        elif system == 'Windows':
            font_path = 'C:/Windows/Fonts/simhei.ttf'
        else:  # Other OS
            font_path = 'simhei.ttf'

        wordcloud = WordCloud(background_color='white', max_words=100, font_path=font_path)
        word_frequency = dict(zip(words, frequencies))
        wordcloud.generate_from_frequencies(word_frequency)

        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.show()

    # Generate horizontal statistics chart
    @staticmethod
    def get_word_chart(sheet_name: str, workbook):
        sheet = workbook[sheet_name]

        words = []
        frequencies = []
        for row in sheet.iter_rows(min_row=2, values_only=True):
            words.append(row[0])
            frequencies.append(row[1])

        system = platform.system()
        if system == 'Darwin':
            font_path = '/System/Library/Fonts/STHeiti Light.ttc'
        elif system == 'Windows':
            font_path = 'C:/Windows/Fonts/simhei.ttf'
        else:
            font_path = 'simhei.ttf'

        custom_font = fm.FontProperties(fname=font_path)
        fig, ax = plt.subplots()
        ax.barh(words, frequencies)
        ax.set_xlabel("Frequency", fontproperties=custom_font)
        ax.set_ylabel("Words", fontproperties=custom_font)
        plt.yticks(fontproperties=custom_font)
        plt.show()

    # Extract the user fields from a card response, falling back to "N/A"
    @staticmethod
    def get_user_info_by_card(user_card_json):
        info = {
            'name': "N/A",
            'birthday': "N/A",
            'regtime': "N/A",
            'fans': "N/A",
            'friend': "N/A"
        }
        try:
            info['name'] = user_card_json['card']['name']
            info['birthday'] = user_card_json['card']['birthday']
            info['regtime'] = Tools.format_timestamp(int(user_card_json['card']['regtime']))
            info['fans'] = user_card_json['card']['fans']
            info['friend'] = user_card_json['card']['friend']
        except KeyError:
            pass
        return tuple(info.values())


class BilibiliExcel:

    # Write video basic information
    # (video_url is passed in explicitly rather than read from a module-level global)
    @staticmethod
    def write_base_info(workbook, bv_json, video_url):
        sheet = workbook.create_sheet(title="Video Info")
        headers = ["Video Title", "Author", "Publish Time", "Views", "Favorites", "Shares",
                   "Total Bullet Comments", "Comments Count", "Video Description", "Category",
                   "Video Link", "Thumbnail Link"]
        sheet.append(headers)
        data = [bv_json["data"]["title"],
                bv_json["data"]["owner"]["name"],
                Tools.format_timestamp(bv_json["data"]["pubdate"]),
                bv_json["data"]["stat"]["view"],
                bv_json["data"]["stat"]["favorite"],
                bv_json["data"]["stat"]["share"],
                bv_json["data"]["stat"]["danmaku"],
                bv_json["data"]["stat"]["reply"],
                bv_json["data"]["desc"],
                bv_json["data"]["tname"],
                video_url,
                bv_json["data"]["pic"]]
        sheet.append(data)

    @staticmethod
    def save_workbook(workbook):
        workbook.save(Tools.get_save())


class PrintInfo:

    # Print basic information
    @staticmethod
    def base_message():
        if 'Windows' == platform.system():
            os.system('cls')
        else:
            os.system('clear')
        text = '''
        ************************************
        Bilibili Video Analysis v2023.6.26
        Author: Github.com/hoochanlon
        Project URL: https://github.com/hoochanlon/scripts
        Features:
        1. Analyze and visualize Bilibili video data.
        Disclaimer: For research and learning purposes only.
        ************************************
        '''
        print(text.center(50, ' '))


if __name__ == '__main__':
    PrintInfo.base_message()

    while True:
        video_url = input("Paste the Bilibili video link: ")
        if re.match(r'.*BV\w+', video_url):
            break
        else:
            print("Invalid link format. Please re-enter.")

    bv_json = BilibiliAPI.get_bv_json(video_url)
    workbook = Workbook()
    workbook.remove(workbook.active)
    BilibiliExcel.write_base_info(workbook, bv_json, video_url)
    BilibiliExcel.save_workbook(workbook)
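For orientation, the idea behind `CRC32Checker` can be shown in a much simpler and much slower form. The sketch below assumes that the sender hash attached to a danmaku is the standard CRC-32 of the user ID written as a decimal string and rendered as lowercase hex, which is what the XOR with `0xffffffff` in `main` suggests. `uid_hash` and `find_mid` are hypothetical helpers, and the plain brute-force loop stands in for the table-based shortcut used in the class, so it is only practical for small ID ranges.

```python
import zlib


def uid_hash(mid):
    """CRC-32 of the decimal user ID, as lowercase hex (assumed format)."""
    return format(zlib.crc32(str(mid).encode("ascii")), "x")


def find_mid(target_hash, limit=1_000_000):
    """Return the first ID below `limit` whose hash matches, else None."""
    target = int(target_hash, 16)
    for mid in range(limit):
        if zlib.crc32(str(mid).encode("ascii")) == target:
            return mid
    return None


sample_hash = uid_hash(12345)
recovered = find_mid(sample_hash, limit=20_000)
```

Because CRC-32 is not collision-free, a match is a candidate ID rather than proof of identity, and any result should be treated accordingly.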
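Usage notes:
- To simplify cookie entry, you can use the key=value; format, for example "a=a;", to skip the steps you do not need.
- Viewing IP locations requires logging in to your Bilibili account in the browser that WebDriver opens.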
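A cookie string in the key=value; form can be turned into the dictionary that `requests.get(..., cookies=...)` accepts. The helper below is a hypothetical sketch, not the script's own input routine, and the cookie names in the example are placeholders.

```python
def parse_cookie_string(raw):
    """Split 'k1=v1; k2=v2;' into a dict, ignoring empty or malformed parts."""
    cookies = {}
    for part in raw.split(";"):
        # Split on the first '=' only, so values may themselves contain '='.
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


placeholder = parse_cookie_string("a=a;")
example = parse_cookie_string("session=abc123; theme=dark=1; ;broken")
```

Parts without an `=` are dropped silently here. A stricter version could raise an error instead, so that a mistyped cookie is noticed early.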