
Fun fun - using Python to analyze the social network in "The Romance of the Three Kingdoms"

little bottle
2019-04-04 17:28:29

I have long been interested in natural language processing and social network analysis. The former helps us extract insights from text, while the latter lets us understand the web-like relationships that commonly link people and things, and those connections in turn deepen our understanding. What kind of magic happens when the two are combined? As a fan of the Three Kingdoms, I had an idea: can I use text processing to extract the social network of the characters in "The Romance of the Three Kingdoms" and then analyze it? Python has many good tools that can help me pursue this curiosity, so let's get started.


Preparation

Get the text of "The Romance of the Three Kingdoms".

from harvesttext.resources import get_sanguo

chapters = get_sanguo()                 # list of texts, one element per chapter
print(chapters[0][:106])

第一回 宴桃园豪杰三结义 斩黄巾英雄首立功
滚滚长江东逝水,浪花淘尽英雄。是非成败转头空。青山依旧在,几度夕阳红。白发渔樵江渚上,惯看秋月春风。一壶浊酒喜相逢。古今多少事,都付笑谈中

"The Romance of the Three Kingdoms" is not an easy text to process. Its language is close to classical Chinese, and we have to deal with the many aliases of its characters, such as the courtesy names and art names used by the ancients. For example, how would a computer know that "Xuande" refers to "Liu Bei"? We need to give it some knowledge. We humans learn through study that "Xuande" is Liu Bei's courtesy name, and a computer can link the two in a similar way. We tell the computer that "Liu Bei" is an entity (the standard name of an object) and that "Xuande" is a reference to, or mention of, "Liu Bei". We do so by providing the computer with a knowledge base.

from harvesttext.resources import get_sanguo_entity_dict

entity_mention_dict, entity_type_dict = get_sanguo_entity_dict()
print("Mentions of 刘备 (Liu Bei):", entity_mention_dict["刘备"])

Mentions of 刘备 (Liu Bei): ['刘备', '刘玄德', '玄德', '使君']

Besides people and their references, the knowledge base can also cover other kinds of entities, such as the factions of the Three Kingdoms. For example, "Shu" can also be called "Shuhan" (Shu Han). The knowledge base therefore also records each entity's type so that the different kinds can be told apart.

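print("Type of 刘备 (Liu Bei):", entity_type_dict["刘备"])      # 人名 = person name
print("Type of 蜀 (Shu):", entity_type_dict["蜀"])              # 势力 = faction
print("Mentions of 蜀 (Shu):", entity_mention_dict["蜀"])

Type of 刘备 (Liu Bei): 人名
Type of 蜀 (Shu): 势力
Mentions of 蜀 (Shu): ['蜀', '蜀汉']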

With this knowledge, we can in theory write a program that links each entity's various aliases. Doing that from scratch would still take a lot of work, though. HarvestText[1] is a text processing library that wraps these steps and makes the task easy.

from harvesttext import HarvestText

ht = HarvestText()
ht.add_entities(entity_mention_dict, entity_type_dict)      # load the model
print(ht.seg("誓毕,拜玄德为兄,关羽次之,张飞为弟。",standard_name=True))

['誓毕', ',', '拜', '刘备', '为兄', ',', '关羽', '次之', ',', '张飞', '为弟', '。']
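For intuition about what unifying references involves, here is a naive from-scratch sketch. It is only an illustration under simple assumptions (plain longest-match replacement, no disambiguation), not how HarvestText is implemented. The `normalize` helper and the `mention_to_entity` name are hypothetical; the toy dictionaries copy the entries printed above.

```python
# Toy knowledge base in the same shape as the dictionaries shown above.
entity_mention_dict = {
    "刘备": ["刘备", "刘玄德", "玄德", "使君"],
    "蜀": ["蜀", "蜀汉"],
}

# Invert it: mention -> standard entity name (hypothetical helper structure).
mention_to_entity = {
    mention: entity
    for entity, mentions in entity_mention_dict.items()
    for mention in mentions
}


def normalize(text):
    """Replace every known mention with its standard name, longest match first."""
    mentions = sorted(mention_to_entity, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for mention in mentions:
            if text.startswith(mention, i):
                out.append(mention_to_entity[mention])
                i += len(mention)
                break
        else:
            # No mention starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(normalize("拜玄德为兄"))
print(normalize("刘玄德乃蜀汉之主"))
```

Trying the longest mention first matters: otherwise "刘玄德" could be rewritten through its substring "玄德" and end up as "刘刘备".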

Social network establishment

With the references unified into standard entity names, we can start exploring the social network of the Three Kingdoms. We build it from co-occurrence in neighboring sentences: whenever a pair of entities appears within the same two consecutive sentences, we add an edge between them. The whole network-building process is shown in the figure below:

[Figure: the network-building process]
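The co-occurrence rule can be sketched with the standard library alone. This is a minimal illustration, assuming entity names have already been unified and that a simple substring test is enough to detect them; `cooccurrence_edges` is a hypothetical helper and the three sentences are made-up toy data, not text from the novel.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences, entities):
    """Count entity pairs that appear within the same two adjacent sentences."""
    edges = Counter()
    for first, second in zip(sentences, sentences[1:]):
        window = first + second
        present = sorted(e for e in entities if e in window)
        for u, v in combinations(present, 2):
            edges[(u, v)] += 1          # key is a sorted (u, v) tuple
    return edges


# Made-up toy sentences, already normalized to standard names.
sentences = ["刘备见关羽。", "张飞大喜。", "曹操引兵至。"]
edges = cooccurrence_edges(sentences, ["刘备", "关羽", "张飞", "曹操"])
for (u, v), weight in edges.items():
    print(u, v, weight)
```

Because the windows overlap, a pair that keeps appearing in nearby sentences accumulates a larger weight, which is what makes the edge weights meaningful later on.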

HarvestText provides a function that does all of this directly. Let's practice first on the short text of Chapter 1:

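import re

# Preparation
doc = re.sub("(?<!曹)操", "曹操", chapters[0])   # the text sometimes abbreviates 曹操 (Cao Cao) to 操; leave 操 alone when 曹 already precedes it
ch1_sentences = ht.cut_sentences(doc)     # split into sentences
doc_ch01 = [ch1_sentences[i]+ch1_sentences[i+1] for i in range(len(ch1_sentences)-1)]  # all pairs of consecutive sentences
ht.set_linking_strategy("freq")

# Build the network
G = ht.build_entity_graph(doc_ch01, used_types=["人名"])              # network over all person entities (人名), i.e. the social network

# Pick the main characters and plot them
important_nodes = [node for node in G.nodes if G.degree[node]>=5]
G_sub = G.subgraph(important_nodes).copy()
draw_graph(G_sub,alpha=0.5,node_scale=30,figsize=(6,4))   # draw_graph is the author's plotting helper, not defined in this article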

[Figure: social network of the main characters in Chapter 1]

What exactly are the relationships between them? We can use text summarization to get the gist of this chapter:

from harvesttext.resources import get_baidu_stopwords

stopwords = get_baidu_stopwords()    # filter stopwords to improve quality
for i,doc in enumerate(ht.get_summary(doc_ch01, topK=3, stopwords=stopwords)):
    print(i,doc)

玄德见皇甫嵩、朱儁,具道卢植之意。嵩曰:“张梁、张宝势穷力乏,必投广宗去依张角。
时张角贼众十五万,植兵五万,相拒于广宗,未见胜负。植谓玄德曰:“我今围贼在此,贼弟张梁、张宝在颍川,与皇甫嵩、朱儁对垒。
次日,于桃园中,备下乌牛白马祭礼等项,三人焚香再拜而说誓曰:“念刘备、关羽、张飞,虽然异姓,既结为兄弟,则同心协力,

The main content of this chapter appears to be the story of Liu Bei, Guan Yu, and Zhang Fei becoming sworn brothers in the Peach Garden and fighting the Yellow Turban rebels together.

Drawing the full Three Kingdoms network

With this small-scale practice as a foundation, we can apply the same method to every chapter, merge the results, and draw a big picture spanning the whole Three Kingdoms era.

G_chapters = []

The entire social network has as many as 1,290 people and tens of thousands of edges! Drawing all of it is practically impossible, so let's pick out the key figures and draw that subset.

important_nodes = [node for node in G_global.nodes if G_global.degree[node]>=30]
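The listings above show only their first lines, so how `G_global` and `G_main` are built is not spelled out here. One plausible reconstruction is sketched below. It is an assumption that reuses the weight-summing loop from the animation code later in this article, with two toy graphs standing in for the per-chapter networks and a degree threshold of 2 instead of 30. The node names "A" to "D" are placeholders.

```python
import networkx as nx

# Two toy per-chapter graphs standing in for the real G_chapters.
g1 = nx.Graph()
g1.add_edge("A", "B", weight=2)
g1.add_edge("A", "C", weight=1)
g2 = nx.Graph()
g2.add_edge("A", "B", weight=3)
g2.add_edge("B", "D", weight=1)
G_chapters = [g1, g2]

# Merge: sum the weights of edges that occur in several chapters.
G_global = nx.Graph()
for G0 in G_chapters:
    for (u, v) in G0.edges:
        if G_global.has_edge(u, v):
            G_global[u][v]["weight"] += G0[u][v]["weight"]
        else:
            G_global.add_edge(u, v, weight=G0[u][v]["weight"])

# Keep only well-connected nodes (the article uses a threshold of 30).
important_nodes = [node for node in G_global.nodes if G_global.degree[node] >= 2]
G_main = G_global.subgraph(important_nodes).copy()   # assumed definition of G_main

print(G_global.edges(data="weight"))
print(list(G_main.nodes))
```

Copying the subgraph gives an independent graph, so later changes to `G_main` would not affect `G_global`.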

Use pyecharts for visualization

from pyecharts import Graph

Interactive charts cannot be displayed on the blog, so here is a screenshot showing Liu Bei's adjacent nodes:

[Figure: screenshot of Liu Bei's adjacent nodes in the network]

The whole network is intricate, and behind it lie the countless campaigns and intrigues of the Three Kingdoms story. Still, with the computing power of a machine we can tease out some key threads, such as:

Character ranking: importance

This question can be answered with a ranking algorithm for networks, and PageRank is a typical one. It was originally designed for search engines to rank results using the links between websites, but it applies equally well to the connections between people. Let's get the 20 most important characters:

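import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
page_ranks = pd.Series(nx.pagerank(G_global)).sort_values()   # ascending order, so the top 20 sit at the tail
page_ranks.tail(20).plot(kind="barh")
plt.show()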

[Figure: bar chart of the top 20 characters by PageRank]

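The results do differ somewhat from the ranking above. Lords such as Liu Bei, Cao Cao, Sun Quan, and Yuan Shao all rank near the top. Another interesting finding is that Sima Yi and his sons Sima Zhao and Sima Shi also make the list, while the later descendants of the Cao family are nowhere to be seen, which shows how the Sima clan's power overshadowed the court. The ambitions of the Sima clan, it seems, have been laid bare by big data!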
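For readers curious how PageRank arrives at such scores, the following is a minimal power-iteration sketch in plain Python. It is a simplified illustration, not networkx's implementation: it assumes an undirected weighted graph in which every node has at least one neighbor, and the four-character graph with its weights is made-up toy data.

```python
def pagerank(adj, damping=0.85, iters=100, tol=1e-10):
    """Power-iteration PageRank on an undirected weighted graph.

    adj maps node -> {neighbor: weight}. Every node is assumed to have
    at least one neighbor, so dangling nodes are not handled.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    strength = {v: sum(adj[v].values()) for v in nodes}
    for _ in range(iters):
        new = {}
        for v in nodes:
            # Each neighbor u passes on its rank in proportion to edge weight.
            incoming = sum(rank[u] * w / strength[u] for u, w in adj[v].items())
            new[v] = (1 - damping) / n + damping * incoming
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Made-up toy co-occurrence weights.
adj = {
    "刘备": {"关羽": 3, "张飞": 3, "诸葛亮": 2},
    "关羽": {"刘备": 3, "张飞": 1},
    "张飞": {"刘备": 3, "关羽": 1},
    "诸葛亮": {"刘备": 2},
}
ranks = pagerank(adj)
for name, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(score, 3))
```

A character scores highly not just by having many neighbors but by being strongly connected to neighbors who are themselves important, which is why the score can differ from a plain degree count.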

Community detection

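Relationships between characters vary in closeness, so characters tend to form cliques. Community detection algorithms from social network analysis can uncover these cliques. Let's use the algorithm provided by the community library[2] to reveal them.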

import community                                    # python-louvain
from collections import defaultdict
partition = community.best_partition(G_main)         # partition into communities with the Louvain algorithm
comm_dict = defaultdict(list)
for person in partition:
    comm_dict[partition[person]].append(person)
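The Louvain algorithm searches for a partition with high modularity, a score that compares the number of edges inside each community with what a random graph with the same degrees would give. As a rough illustration of that quantity, here is a hand-rolled version for an unweighted graph. It is a sketch, not python-louvain's code; the `modularity` helper and the toy graph of two triangles joined by one bridge are assumptions made for this example.

```python
from collections import defaultdict


def modularity(edges, partition):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    inside = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        degree_sum[partition[u]] += 1
        degree_sum[partition[v]] += 1
        if partition[u] == partition[v]:
            inside[partition[u]] += 1
    return sum(inside[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by a single bridge edge (c, x).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"),
         ("c", "x")]
by_triangle = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1, "z": 1}
all_together = {node: 0 for node in by_triangle}

q_split = modularity(edges, by_triangle)
q_single = modularity(edges, all_together)
print(q_split, q_single)
```

Splitting the graph along the bridge scores higher than lumping everything together, which is the kind of preference that lets the algorithm recover factions without being told who belongs where.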

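In the three communities below, we mainly see the senior ministers and generals of the three kingdoms of Wei, Shu, and Wu. (There are only a few small "problems". Interestingly, the computer does not know which faction each character belongs to; it relies on the algorithm alone.)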

draw_community(2)
community 2: 张辽 曹仁 夏侯惇 徐晃 曹洪 夏侯渊 张郃 许褚 乐进 李典 于禁 荀彧 刘晔 郭嘉 满宠 程昱 荀攸 吕虔 典韦 文聘 董昭 毛玠
draw_community(4)
community 4: 曹操 诸葛亮 刘备 关羽 赵云 张飞 马超 黄忠 许昌 孟达[魏] 孙乾
曹安民 刘璋 关平 庞德 法正 伊籍 张鲁 刘封 庞统 孟获 严颜 马良 简雍 蔡瑁 
陶谦 孔融 刘琮[刘表子] 刘望之 夏侯楙 周仓 陈登
draw_community(3)
community 3: 孙权 孙策 周瑜 陆逊 吕蒙 丁奉 周泰 程普 韩当 徐盛 张昭[吴] 马相 黄盖[吴] 潘璋 甘宁 鲁肃 凌统 太史慈 诸葛瑾 韩吴郡 蒋钦 黄祖 阚泽 朱桓 陈武 吕范
draw_community(0)
community 0: 袁绍 吕布 刘表 袁术 董卓 李傕 贾诩 审配 孙坚 郭汜 陈宫 马腾 
袁尚 韩遂 公孙瓒 高顺 许攸[袁绍] 臧霸 沮授 郭图 颜良 杨奉 张绣 袁谭 董承 
文丑 何进 张邈[魏] 袁熙

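There are some other communities as well. For example, in community 0 we see the early period of the Three Kingdoms, when lords such as Sun Jian, Yuan Shao, and Dong Zhuo vied for supremacy in a lively free-for-all.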

draw_community(1)
community 1: 司马懿 魏延 姜维 张翼 马岱 廖化 吴懿 司马昭 关兴 吴班 王平 
邓芝 邓艾 张苞[蜀] 马忠[吴] 费祎 谯周 马谡 曹真 曹丕 李恢 黄权 钟会 蒋琬
司马师 刘巴[蜀] 张嶷 杨洪 许靖 费诗 李严 郭淮 曹休 樊建 秦宓 夏侯霸 杨仪
 高翔 张南[魏] 华歆 曹爽 郤正 许允[魏] 王朗[司徒] 董厥 杜琼 霍峻 胡济 贾充
  彭羕 吴兰 诸葛诞 雷铜 孙綝 卓膺 费观 杜义 阎晏 盛勃 刘敏 刘琰 杜祺 上官雝 
  丁咸 爨习 樊岐 曹芳 周群

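This community holds the main figures of the late Three Kingdoms period. The story behind this network is that three members of the Sima clan across two generations defeated the heroes of Shu Han led by Jiang Wei, swept away the Cao family's power within Cao Wei, and finally reached the summit of power.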

Dynamic network

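Studying how a social network changes over time is an interesting task. "The Romance of the Three Kingdoms" is narrated roughly in chronological order and spans a very long period. What happens to the social network as we follow the storyline?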

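Here I use a window of 10 chapters and record the social network within the current window every 5 chapters, which amounts to taking a snapshot. Stringing these snapshots together gives an animation of the changing social network. The snapshots are still produced with networkx, and the animation can be made with moviepy.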

Every era brings forth its own talents. Let's see which group of people takes center stage at each phase of the story.

import matplotlib.pyplot as plt
import moviepy.editor as mpy
import networkx as nx
from moviepy.video.io.bindings import mplfig_to_npimage
width, step = 10,5                                    # window of 10 chapters, one snapshot every 5 chapters
range0 = range(0,len(G_chapters)-width+1,step)
numFrame, fps = len(range0), 1
duration = numFrame/fps
pos_global = nx.spring_layout(G_main)                 # one fixed layout so nodes keep their place across frames

def make_frame_mpl(t):
   i = step*int(t*fps)                                # index of the first chapter in this window
   # Merge the per-chapter graphs in the window, summing edge weights
   G_part = nx.Graph()
   for G0 in G_chapters[i:i+width]:
       for (u,v) in G0.edges:
           if G_part.has_edge(u,v):
               G_part[u][v]["weight"] += G0[u][v]["weight"]
           else:
               G_part.add_edge(u,v,weight=G0[u][v]["weight"])
   # Keep the main characters that lie in the largest connected component
   largest_comp = max(nx.connected_components(G_part), key=len)
   used_nodes = set(largest_comp) & set(G_main.nodes)
   G = G_part.subgraph(used_nodes)
   fig = plt.figure(figsize=(12,8),dpi=100)
   nx.draw_networkx_nodes(G,pos_global,node_size=[G.degree[x]*10 for x in G.nodes])
#     nx.draw_networkx_edges(G,pos_global)
   nx.draw_networkx_labels(G,pos_global)
   plt.xlim([-1,1])
   plt.ylim([-1,1])
   plt.axis("off")
   plt.title(f"Social network of chapters {i+1} to {i+width}")   # the slice [i:i+width] covers chapters i+1 through i+width
   frame = mplfig_to_npimage(fig)
   plt.close(fig)                                     # close the figure so figures do not pile up in memory
   return frame
animation = mpy.VideoClip(make_frame_mpl, duration=duration)

animation.write_gif("./images/三国社交网络变化.gif", fps=fps)

For the sake of appearance, the edges of the network are omitted in the animation.
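The windowing itself can be checked in isolation. The small sketch below lists which chapters each frame covers for a window of 10 and a step of 5; `window_ranges` is a hypothetical helper, and the chapter count of 120 is an assumption based on the standard 120-chapter edition of the novel.

```python
def window_ranges(n_chapters, width=10, step=5):
    """Return the (first, last) 1-based chapter numbers covered by each frame."""
    return [(start + 1, start + width)
            for start in range(0, n_chapters - width + 1, step)]


frames = window_ranges(120)   # assumes the standard 120-chapter edition
print(len(frames))
print(frames[0], frames[1], frames[-1])
```

Because the step is half the window width, each frame shares five chapters with the previous one, which keeps the animation from jumping abruptly between snapshots.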

[Figure: animation of the social network changing over the course of the story]

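As time goes by, those who once stood at the center of the historical stage gradually take their leave, which is hard not to sigh over. As the opening of "The Romance of the Three Kingdoms" puts it:

So many affairs of past and present, all dissolved in talk and laughter. (古今多少事,都付笑谈中。)

And so let this bit of idle talk, put together by a junior with Python, end here too...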


Statement:
This article is reproduced from: Python爬虫与数据挖掘. If there is any infringement, please contact admin@php.cn to have it removed.