Drawing stunning Sankey diagrams in Python: have you learned how?
Introduction to Sankey diagrams
We often need to visualize how data flows between entities. Take, for example, how residents move from one country to another. Here's a demonstration of how many residents moved from England to Northern Ireland, Scotland and Wales.
It is clear from this Sankey visualization that more residents moved from England to Wales than to Scotland or Northern Ireland.
What is a Sankey diagram?
Sankey diagrams usually depict the flow of data from one entity (or node) to another.
The entities between which data flows are called nodes. The node where a flow originates is the source node (for example, England on the left), and the node where the flow ends is the target node (for example, Wales on the right). Source and target nodes are usually represented as labeled rectangles.
The flow itself is represented by straight or curved paths, called links. The width of a link is directly proportional to the volume of the flow. In the example above, the link from England to Wales is wider than the links from England to Scotland or Northern Ireland, indicating that more residents moved to Wales than to the other countries.
Sankey diagrams can be used to represent the flow of energy, money, costs, and anything that has a flow concept.
Minard's classic chart of Napoleon's invasion of Russia is probably the most famous example of a Sankey diagram. It shows very effectively how the French army progressed (or rather, shrank) on its way to Russia and back.
In this article, we use Python's plotly library to draw a Sankey diagram.
How to draw a Sankey diagram?
This article uses the 2021 Olympic Games data set to draw a Sankey diagram. For each country, the dataset contains the total number of medals and the individual totals for gold, silver, and bronze medals. We plot a Sankey diagram to show how many gold, silver and bronze medals a country has won.
import pandas as pd

# Load the medal table and inspect its columns
df_medals = pd.read_excel("data/Medals.xlsx")
print(df_medals.info())

# Give the columns more descriptive names
df_medals.rename(columns={'Team/NOC': 'Country',
                          'Total': 'Total Medals',
                          'Gold': 'Gold Medals',
                          'Silver': 'Silver Medals',
                          'Bronze': 'Bronze Medals'},
                 inplace=True)

# Drop the (almost) empty and unused columns
df_medals.drop(columns=['Unnamed: 7', 'Unnamed: 8', 'Rank by Total'], inplace=True)
df_medals
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93 entries, 0 to 92
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Rank           93 non-null     int64
 1   Team/NOC       93 non-null     object
 2   Gold           93 non-null     int64
 3   Silver         93 non-null     int64
 4   Bronze         93 non-null     int64
 5   Total          93 non-null     int64
 6   Rank by Total  93 non-null     int64
 7   Unnamed: 7     0 non-null      float64
 8   Unnamed: 8     1 non-null      float64
dtypes: float64(2), int64(6), object(1)
memory usage: 6.7+ KB
None
Sankey diagram drawing basics
Use plotly's go.Sankey, which takes two main arguments: node and link.
Note: All nodes, both source and target, should have unique identifiers.
In the case of the Olympic medals data set in this article:
Source is the country. Consider the first 3 countries (United States, China, and Japan) as source nodes. Label these source nodes with the following (unique) identifiers, labels, and colors:
- 0: United States: Green
- 1: China: Blue
- 2: Japan: Orange
Target is gold, silver or bronze. Label these target nodes with the following (unique) identifiers, labels, and colors:
- 3: Gold Medal: Gold
- 4: Silver Medal: Silver
- 5: Bronze Medal: Brown
A link (between a source node and a target node) represents the number of medals of each type. Each source has 3 links, one ending at each target: Gold, Silver and Bronze. So there are 9 links in total. The width of each link should be the number of gold, silver or bronze medals. Define these links with the following sources, targets and values:
- 0 (United States) to 3, 4, 5: 39, 41, 33
- 1 (China) to 3, 4, 5: 38, 32, 18
- 2 (Japan) to 3, 4, 5: 27, 14, 17
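The numbering scheme above does not have to be written out by hand. As a minimal sketch (the names node_id and medals are illustrative, not part of plotly), the identifiers and the link lists can be derived from the labels and the medal counts listed above:

```python
# A label's position in this list is its unique node identifier.
labels = ["United States", "China", "Japan", "Gold", "Silver", "Bronze"]
node_id = {label: idx for idx, label in enumerate(labels)}

# Medal counts from the list above, as (gold, silver, bronze) per country.
medals = {
    "United States": (39, 41, 33),
    "China": (38, 32, 18),
    "Japan": (27, 14, 17),
}

source, target, value = [], [], []
for country, counts in medals.items():
    for medal, count in zip(("Gold", "Silver", "Bronze"), counts):
        source.append(node_id[country])  # where the link starts
        target.append(node_id[medal])    # where the link ends
        value.append(count)              # link width

print(source)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(target)  # [3, 4, 5, 3, 4, 5, 3, 4, 5]
print(value)   # [39, 41, 33, 38, 32, 18, 27, 14, 17]
```

Deriving the identifiers from one label list keeps the node and link lists in sync when a country is added or removed.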
You need to instantiate 2 Python dict objects to represent
- nodes (source and target): labels and colors as separate lists, and
- links: source node, target node, value (width) and color of the link as separate lists
and pass them to plotly's go.Sankey.
Each index of a list (label, source, target, value and color) corresponds to one node or link.
import plotly.graph_objects as go

NODES = dict(
    # node ids 0 to 5, in list order
    label = ["United States of America", "People's Republic of China", "Japan",
             "Gold", "Silver", "Bronze"],
    color = ["seagreen", "dodgerblue", "orange", "gold", "silver", "brown"],
)
LINKS = dict(
    source = [0, 0, 0, 1, 1, 1, 2, 2, 2],            # start (source) node of each link
    target = [3, 4, 5, 3, 4, 5, 3, 4, 5],            # destination (target) node of each link
    value = [39, 41, 33, 38, 32, 18, 27, 14, 17],    # width (quantity) of each link
    # link colors
    # target nodes: 3-Gold, 4-Silver, 5-Bronze
    color = [
        "lightgreen", "lightgreen", "lightgreen",        # source node: 0 - United States of America
        "lightskyblue", "lightskyblue", "lightskyblue",  # source node: 1 - People's Republic of China
        "bisque", "bisque", "bisque",                    # source node: 2 - Japan
    ],
)

data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.show()
This is a very basic Sankey diagram. But have you noticed that the chart is too wide and the silver medals appear before the gold medals?
Here’s how to adjust the position and width of nodes.
Adjust node positions and chart width
Add x and y positions to the nodes to specify each node's position explicitly. Values should be between 0 and 1.
NODES = dict(
    # node ids 0 to 5, in list order
    label = ["United States of America", "People's Republic of China", "Japan",
             "Gold", "Silver", "Bronze"],
    color = ["seagreen", "dodgerblue", "orange", "gold", "silver", "brown"],
    x = [0, 0, 0, 0.5, 0.5, 0.5],   # horizontal position of each node
    y = [0, 0.5, 1, 0.1, 0.5, 1],   # vertical position of each node
)

# LINKS is unchanged from the previous example
data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country & Medals", font_size=16)
fig.show()
So we got a compact Sankey diagram:
Let's take a look at how the various parameters passed in the code map to the nodes and links in the graph.
How the code maps to the Sankey diagram
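The mapping can also be read off in text form. The sketch below (describe_links is a hypothetical helper, not a plotly function) pairs each position in the source, target and value lists with the node labels it refers to, using the same lists as the code above:

```python
label = ["United States of America", "People's Republic of China", "Japan",
         "Gold", "Silver", "Bronze"]
source = [0, 0, 0, 1, 1, 1, 2, 2, 2]
target = [3, 4, 5, 3, 4, 5, 3, 4, 5]
value = [39, 41, 33, 38, 32, 18, 27, 14, 17]

def describe_links(label, source, target, value):
    """Return one line per link: 'source label -> target label: value'."""
    return [f"{label[s]} -> {label[t]}: {v}"
            for s, t, v in zip(source, target, value)]

lines = describe_links(label, source, target, value)
for line in lines:
    print(line)
# First line printed: United States of America -> Gold: 39
```

Position i in source, target and value always describes the same link, while the numbers stored in source and target are positions in the node label list.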
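Add meaningful hover labels
We all know that plotly charts are interactive: we can hover over nodes and links to get more information.
Sankey diagram with default hover labels
Hovering over the chart displays detailed information. The hover label shows default text: node, node name, number of incoming flows, number of outgoing flows, and total value.
For example:
- The node United States has a total of 113 medals (= 39 gold + 41 silver + 33 bronze)
- The node Gold has a total of 104 medals (= 39 from the United States, 38 from China, 27 from Japan)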
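These totals can be checked directly from the link lists. A small sketch, assuming the same source, target and value lists as in the earlier code (the dictionary names are illustrative):

```python
from collections import defaultdict

label = ["United States of America", "People's Republic of China", "Japan",
         "Gold", "Silver", "Bronze"]
source = [0, 0, 0, 1, 1, 1, 2, 2, 2]
target = [3, 4, 5, 3, 4, 5, 3, 4, 5]
value = [39, 41, 33, 38, 32, 18, 27, 14, 17]

outgoing = defaultdict(int)  # total value leaving each node
incoming = defaultdict(int)  # total value arriving at each node
for s, t, v in zip(source, target, value):
    outgoing[s] += v
    incoming[t] += v

print(label[0], outgoing[0])  # United States of America 113
print(label[3], incoming[3])  # Gold 104
```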
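If these labels feel too verbose, we can improve them. Use the hovertemplate parameter to change the format of the hover labels:
- For nodes, since the hover labels provide no new information, remove the hover label by passing a blank hovertemplate (the code uses hovertemplate=" ")
- For links, make the label concise, in the format country-medal (for example, USA-Gold)
- For both nodes and links, display the values with the suffix "Medals", for example 113 Medals instead of 113. This can be done with the update_traces function and suitable valueformat and valuesuffix arguments.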
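To get a feel for what valueformat='3d' and a "Medals" suffix produce, here is a rough stand-in in plain Python. plotly formats these values with d3-format, whose mini-language is modeled on Python's format specification, so Python's built-in format() is used here only as an approximation; hover_value is a hypothetical helper, not a plotly function.

```python
def hover_value(value, valueformat="3d", valuesuffix=" Medals"):
    """Approximate the hover text: formatted value followed by the suffix."""
    return format(value, valueformat) + valuesuffix

print(hover_value(113))                        # 113 Medals
print(hover_value(113, valuesuffix="Medals"))  # 113Medals
```

A leading space in the suffix keeps the number and the word apart.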
NODES = dict(
    # node ids 0 to 5, in list order
    label = ["United States of America", "People's Republic of China", "Japan",
             "Gold", "Silver", "Bronze"],
    color = ["seagreen", "dodgerblue", "orange", "gold", "silver", "brown"],
    x = [0, 0, 0, 0.5, 0.5, 0.5],
    y = [0, 0.5, 1, 0.1, 0.5, 1],
    hovertemplate = " ",   # blank template for the node hover labels
)

# One "country-medal" label per link, in the same order as the link lists
LINK_LABELS = []
for country in ["USA", "China", "Japan"]:
    for medal in ["Gold", "Silver", "Bronze"]:
        LINK_LABELS.append(f"{country}-{medal}")

LINKS = dict(
    source = [0, 0, 0, 1, 1, 1, 2, 2, 2],            # start (source) node of each link
    target = [3, 4, 5, 3, 4, 5, 3, 4, 5],            # destination (target) node of each link
    value = [39, 41, 33, 38, 32, 18, 27, 14, 17],    # width (quantity) of each link
    # link colors
    # target nodes: 3-Gold, 4-Silver, 5-Bronze
    color = [
        "lightgreen", "lightgreen", "lightgreen",        # source node: 0 - United States
        "lightskyblue", "lightskyblue", "lightskyblue",  # source node: 1 - China
        "bisque", "bisque", "bisque",                    # source node: 2 - Japan
    ],
    label = LINK_LABELS,
    hovertemplate = "%{label}",
)

data = go.Sankey(node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_layout(title="Olympics - 2021: Country & Medals",
                  font_size=16, width=1200, height=500)
fig.update_traces(valueformat='3d', valuesuffix=' Medals',
                  selector=dict(type='sankey'))
fig.update_layout(hoverlabel=dict(bgcolor="lightgray",
                                  font_size=16,
                                  font_family="Rockwell"))
fig.show("png")  # static image; use fig.show() for the interactive chart
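Sankey diagram with improved hover labels
Generalizing to multiple nodes and levels
Nodes are called sources and targets relative to a link. A node that is the target of one link can be the source of another link.
The code can be generalized to handle all the countries in the dataset.
The chart can also be extended by another level to visualize the total number of medals across countries.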
NUM_COUNTRIES = 5
X_POS, Y_POS = 0.5, 1/(NUM_COUNTRIES-1)
NODE_COLORS = ["seagreen", "dodgerblue", "orange", "palevioletred", "darkcyan"]
LINK_COLORS = ["lightgreen", "lightskyblue", "bisque", "pink", "lightcyan"]

source = []
node_x_pos, node_y_pos = [], []
node_labels, node_colors = [], NODE_COLORS[0:NUM_COUNTRIES]
link_labels, link_colors, link_values = [], [], []

# First set of links and nodes: country -> medal type
for i in range(NUM_COUNTRIES):
    source.extend([i]*3)
    node_x_pos.append(0.01)
    node_y_pos.append(round(i*Y_POS+0.01, 2))
    country = df_medals['Country'][i]
    node_labels.append(country)
    for medal in ["Gold", "Silver", "Bronze"]:
        link_labels.append(f"{country}-{medal}")
        link_values.append(df_medals[f"{medal} Medals"][i])
    link_colors.extend([LINK_COLORS[i]]*3)

# The medal nodes are numbered right after the country nodes
source_last = max(source)+1
target = [source_last, source_last+1, source_last+2] * NUM_COUNTRIES
target_last = max(target)+1

node_labels.extend(["Gold", "Silver", "Bronze"])
node_colors.extend(["gold", "silver", "brown"])
node_x_pos.extend([X_POS, X_POS, X_POS])
node_y_pos.extend([0.01, 0.5, 1])

# Last set of links and nodes: medal type -> total
source.extend([source_last, source_last+1, source_last+2])
target.extend([target_last]*3)
node_labels.extend(["Total Medals"])
node_colors.extend(["grey"])
node_x_pos.extend([X_POS+0.25])
node_y_pos.extend([0.5])

for medal in ["Gold", "Silver", "Bronze"]:
    link_labels.append(f"{medal}")
    # Sum over the first NUM_COUNTRIES rows only
    link_values.append(df_medals[f"{medal} Medals"][:NUM_COUNTRIES].sum())
link_colors.extend(["gold", "silver", "brown"])

print("node_labels", node_labels)
print("node_x_pos", node_x_pos); print("node_y_pos", node_y_pos)
node_labels ['United States of America', "People's Republic of China", 'Japan', 'Great Britain', 'ROC', 'Gold', 'Silver', 'Bronze', 'Total Medals']
node_x_pos [0.01, 0.01, 0.01, 0.01, 0.01, 0.5, 0.5, 0.5, 0.75]
node_y_pos [0.01, 0.26, 0.51, 0.76, 1.01, 0.01, 0.5, 1, 0.5]
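In the printed node_y_pos, the last country sits at 1.01, slightly outside the 0 to 1 range mentioned earlier. If that matters for your layout, one option is to space the positions evenly between two bounds that stay inside the range. This is only a sketch: spread and its lo/hi defaults are assumptions, not part of plotly or of the code above.

```python
def spread(n, lo=0.01, hi=0.99):
    """Return n evenly spaced positions between lo and hi (inclusive)."""
    if n == 1:
        return [round((lo + hi) / 2, 3)]
    step = (hi - lo) / (n - 1)
    return [round(lo + i * step, 3) for i in range(n)]

positions = spread(5)
print(positions)
```

The result could replace the values computed with round(i*Y_POS+0.01, 2) in the loop above.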
# Build and display the figure
NODES = dict(
    pad = 20,
    thickness = 20,
    line = dict(color = "lightslategrey", width = 0.5),
    hovertemplate = " ",
    label = node_labels,
    color = node_colors,
    x = node_x_pos,
    y = node_y_pos,
)
LINKS = dict(
    source = source,
    target = target,
    value = link_values,
    label = link_labels,
    color = link_colors,
    hovertemplate = "%{label}",
)

data = go.Sankey(arrangement='snap', node = NODES, link = LINKS)
fig = go.Figure(data)
fig.update_traces(valueformat='3d', valuesuffix=' Medals',
                  selector=dict(type='sankey'))
fig.update_layout(title="Olympics - 2021: Country & Medals",
                  font_size=16, width=1200, height=500)
fig.update_layout(hoverlabel=dict(bgcolor="grey",
                                  font_size=14,
                                  font_family="Rockwell"))
fig.show("png")
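The two-level construction can also be wrapped in a function that works on plain Python data. The following is a minimal sketch under stated assumptions: build_sankey_lists and its parameters are hypothetical names, and the input uses the three countries whose medal counts appear earlier in this article rather than the Excel file.

```python
def build_sankey_lists(medals, medal_names=("Gold", "Silver", "Bronze"),
                       total_label="Total Medals"):
    """Build labels and link lists for a country -> medal -> total diagram.

    medals maps each country to a tuple of counts in medal_names order.
    """
    countries = list(medals)
    labels = countries + list(medal_names) + [total_label]
    first_medal = len(countries)                 # id of the first medal node
    total_id = first_medal + len(medal_names)    # id of the final total node

    source, target, value = [], [], []
    # Level 1: one link from each country to each medal type
    for c_id, country in enumerate(countries):
        for offset, count in enumerate(medals[country]):
            source.append(c_id)
            target.append(first_medal + offset)
            value.append(count)
    # Level 2: each medal node becomes the source of a link to the total
    for offset in range(len(medal_names)):
        source.append(first_medal + offset)
        target.append(total_id)
        value.append(sum(counts[offset] for counts in medals.values()))
    return labels, source, target, value

labels, source, target, value = build_sankey_lists({
    "United States of America": (39, 41, 33),
    "People's Republic of China": (38, 32, 18),
    "Japan": (27, 14, 17),
})
print(source)      # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 4, 5]
print(target)      # [3, 4, 5, 3, 4, 5, 3, 4, 5, 6, 6, 6]
print(value[-3:])  # [104, 87, 68]
```

The returned lists can be passed on as go.Sankey(node=dict(label=labels), link=dict(source=source, target=target, value=value)); colors and positions would be added in the same way as in the code above.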