Data analysis is one of the most popular skills today. It can help us extract valuable information from massive data to support decision-making and innovation. However, to conduct effective data analysis, we not only need to master relevant theories and methods, but also need to be proficient in using some tools and platforms. The Linux system is one of the operating systems commonly used by data analysts. It provides many powerful and flexible commands that can help us deal with various data problems. This article will introduce you to 9 commonly used commands for data analysis under Linux systems, as well as their functions and usage. Whether you are a Linux newbie or a veteran, these commands will make your data analysis work more efficient and convenient.

1. head and tail
First, let’s start with file handling. What is in the file, and what format is it in? You can use the cat command to display a file in the terminal, but it is obviously not suitable for files with long content.
Enter head and tail, which display the first or last lines of a file, respectively. If you do not specify a line count, 10 lines are shown by default.
$ tail -n 3 jan2017articles.csv
02 Jan 2017,Article,Scott Nesbitt,3 tips for effectively using wikis for documentation,1,/article/17/1/tips-using-wiki-documentation,"Documentation, Wiki",710
02 Jan 2017,Article,Jen Wike Huger,The Opensource.com preview for January,0,/article/17/1/editorial-preview-january,,358
02 Jan 2017,Poll,Jason Baker,What is your open source New Year's resolution?,1,/poll/17/1/what-your-open-source-new-years-resolution,,186
In the last three lines, I can see the date, author name, title, and some other information. However, without column headers, I don't know what each column means. Let's check the column headers:
$ head -n 1 jan2017articles.csv
Post date,Content type,Author,Title,Comment count,Path,Tags,Word count
Now everything is clear: for each article we can see the publication date, content type, author, title, comment count, relative URL, tags, and word count.
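As a quick self-contained sketch (on a throwaway file, not the article list above), head and tail can also be combined to pull a slice out of the middle of a file:

```shell
# Build a small throwaway file: a header plus four data lines.
printf 'header\nline1\nline2\nline3\nline4\n' > f.txt

head -n 2 f.txt               # first two lines: header, line1
tail -n 2 f.txt               # last two lines: line3, line4

# Combined, they extract a middle slice: lines 2 through 3.
head -n 3 f.txt | tail -n 2   # line1, line2
```

The trick is that head trims everything after the slice and tail then trims everything before it.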
2. wc
But what if you need to analyze hundreds or even thousands of articles? Here you need the wc command, short for "word count". wc can count the bytes, characters, words, or lines of a file. In this example, we want to know the number of lines in the file.
$ wc -l jan2017articles.csv
93 jan2017articles.csv
This file has 93 lines in total. Since the first line contains the column headers, we can infer that it is a list of 92 articles.
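Since wc -l counts the header row too, a common trick is to skip it with tail -n +2 before counting. A minimal sketch on a throwaway file (the filename and data are made up for illustration):

```shell
# Three lines total: one header plus two data rows.
printf 'Author,Title\nScott,Wikis\nJason,Poll\n' > sample.csv

wc -l < sample.csv             # 3 (includes the header)

# tail -n +2 outputs everything starting from line 2,
# so the count covers the data rows only.
tail -n +2 sample.csv | wc -l  # 2
```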
3. grep
Now a new question: how many of these articles are related to security topics? To answer it, we assume that the relevant articles mention the word "security" in the title, tags, or elsewhere. Here the grep tool can search files for a given string or other search patterns. It is an extremely powerful tool, because with regular expressions we can build extremely precise matching patterns. But here, we only need to find a simple string.
$ grep -i "security" jan2017articles.csv
30 Jan 2017,Article,Tiberius Hefflin,4 ways to improve your security online right now,3,/article/17/1/4-ways-improve-your-online-security,Security and encryption,1242
28 Jan 2017,Article,Subhashish Panigrahi,How communities in India support privacy and software freedom,0,/article/17/1/how-communities-india-support-privacy-software-freedom,Security and encryption,453
27 Jan 2017,Article,Alan Smithee,Data Privacy Day 2017: Solutions for everyday privacy,5,/article/17/1/every-day-privacy,"Big data, Security and encryption",1424
04 Jan 2017,Article,Daniel J Walsh,50 ways to avoid getting hacked in 2017,14,/article/17/1/yearbook-50-ways-avoid-getting-hacked,"Yearbook, 2016 Open Source Yearbook, Security and encryption, Containers, Docker, Linux",2143
The format we use is grep, then the -i flag (which tells grep to ignore case), then the pattern we want to search for, and finally the location of the target file. We found 4 security-related articles. If all we need is the count, we can use a pipe, which combines grep with the wc command to report how many lines mention security.
$ grep -i "security" jan2017articles.csv | wc -l
4
In this way, wc takes the output of the grep command and uses it as its input. It's obvious that this combination, plus a bit of shell scripting, instantly turns the terminal into a powerful data analysis tool.
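As an aside, grep can also report the count itself via its -c flag, with no wc needed. A small sketch on made-up data:

```shell
printf 'Security tips\nCooking notes\nMore security news\n' > items.txt

# Pipe matching lines into wc -l to count them...
grep -i "security" items.txt | wc -l   # 2

# ...or let grep count matching lines directly with -c.
grep -ic "security" items.txt          # 2
```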
4. tr
In most analysis scenarios we face CSV files - but how do we convert them to other formats for other uses, such as feeding the data into an HTML table? The tr command can help you here: it translates one set of characters into another. As before, you can also use a pipe to connect its input and output.
Next, let’s try another multi-part example: creating a TSV (tab-separated values) file that contains only the articles published on January 20.
$ grep "20 Jan 2017" jan2017articles.csv | tr ',' '\t' > jan20only.tsv
First, we use grep to query by date. We pipe this result to the tr command and use the latter to replace all commas with tabs (written as '\t'). But where does the result go? Here we use the > character to send the output to a new file rather than to the screen. In this way, we can be sure that the jan20only.tsv file contains the expected data.
$ cat jan20only.tsv
20 Jan 2017	Article	Kushal Das	5 ways to expand your project's contributor base	2	/article/17/1/expand-project-contributor-base	Getting started	690
20 Jan 2017	Article	D Ruth Bavousett	How to write web apps in R with Shiny	2	/article/17/1/writing-new-web-apps-shiny	Web development	218
20 Jan 2017	Article	Jason Baker	"Top 5: Shell scripting the Cinnamon Linux desktop environment and more"	0	/article/17/1/top-5-january-20	Top 5	214
20 Jan 2017	Article	Tracy Miranda	How is your community promoting diversity?	1	/article/17/1/take-action-diversity-tech	Diversity and inclusion	1007
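The same pattern can be tried end-to-end on throwaway data (the filenames here are made up). One caveat worth knowing: tr does a blind character swap, so it would also convert commas inside quoted CSV fields such as "Documentation, Wiki".

```shell
printf '20 Jan 2017,Article,Kushal Das\n21 Jan 2017,Article,Other Author\n' > demo.csv

# Filter to one date, then translate every comma to a tab.
grep "20 Jan 2017" demo.csv | tr ',' '\t' > demo.tsv

cat demo.tsv
```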
5. sort
What if we want to find the row with the largest value in a particular column? Suppose we need to know which article in the January 20 list we produced earlier is the longest. We can use the sort command to sort on the word-count column. In this case we don't need an intermediate file; we can keep using pipes. However, splitting a long command chain into shorter parts often simplifies the whole operation.
$ sort -nr -t$'\t' -k8 jan20only.tsv | head -n 1
20 Jan 2017	Article	Tracy Miranda	How is your community promoting diversity?	1	/article/17/1/take-action-diversity-tech	Diversity and inclusion	1007
That is a long command, so let's break it apart. First, we use sort to order by word count. The -nr option tells sort to sort numerically and to reverse the results (largest to smallest). The -t$'\t' that follows tells sort that the delimiter is a tab; the $'...' form asks the shell to process the string, turning \t into a real tab character. The -k8 part tells sort to use the eighth column, which is the word-count column in this example.
Finally, the output is piped to head, which displays the title of the article with the highest word count in the file.
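Here is the same idea as a self-contained sketch on a tiny made-up TSV (two columns: a title and a word count). Note that $'\t' is bash/zsh syntax; in a plain POSIX sh you would pass a literal tab instead.

```shell
printf 'alpha\t690\nbeta\t1007\ngamma\t214\n' > words.tsv

# -t$'\t' sets the tab delimiter, -k2 sorts on the second column,
# and -nr sorts numerically in descending order.
sort -nr -t$'\t' -k2 words.tsv | head -n 1   # the beta row (1007)
```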
6. sed
You may also need to select specific lines in a file. Here you can use sed. If you want to merge multiple files that each contain a header row, and keep only one set of headers in the combined file, you need to remove the extra ones; if you want to extract only a specific range of lines, sed works there too. In addition, sed is very good at bulk find-and-replace tasks.
Below, building on the earlier article list, we create a new file without the header row, ready to be merged with other files (for example, if a file like this is generated every month and the months now need to be combined).
$ sed '1 d' jan2017articles.csv > jan17no_headers.csv
The '1 d' option tells sed to delete the first line.
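Both uses mentioned above, dropping a header and extracting a line range, can be sketched on a throwaway file:

```shell
printf 'header\nrow1\nrow2\nrow3\n' > t.txt

# '1d' deletes line 1; everything else passes through unchanged.
sed '1d' t.txt        # row1, row2, row3

# -n suppresses default output; '2,3p' prints only lines 2 through 3.
sed -n '2,3p' t.txt   # row1, row2
```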
7. cut
Now that we know how to delete rows, how do we delete columns? Or rather, how do we select just one column? Below we create a new list of authors from the list generated earlier.
$ cut -d',' -f3 jan17no_headers.csv > authors.txt
Here, cut with -d',' sets the comma as the delimiter, -f3 selects the third field, and the result is sent to a new file named authors.txt.
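A self-contained sketch on made-up rows. Be aware that cut splits on every delimiter it sees, so a quoted CSV field like "Documentation, Wiki" would be split too; for such rows a CSV-aware tool is safer.

```shell
printf '02 Jan,Article,Scott Nesbitt\n02 Jan,Poll,Jason Baker\n' > c.csv

# -d',' sets the comma delimiter; -f3 keeps only the third field.
cut -d',' -f3 c.csv
# Scott Nesbitt
# Jason Baker
```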
8. uniq
The author list is done, but how do we find out how many distinct authors it contains, and how many articles each author wrote? This is where uniq comes in. Below we sort the file, collapse it to unique values while counting each author's articles, and write the result to a new file.
$ sort authors.txt | uniq -c > authors-sorted.txt
Now we can see the number of articles for each author. Let's check the last three lines to make sure the result looks right.
$ tail -n3 authors-sorted.txt
1 Tracy Miranda
1 Veer Muchandi
3 VM (Vicky) Brasseur
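The sort-then-count pattern in miniature, on made-up names:

```shell
printf 'Alice\nBob\nAlice\nAlice\n' > a.txt

# uniq only collapses *adjacent* duplicates, which is why the input
# must be sorted first; -c prefixes each line with its count.
# A final numeric reverse sort puts the most prolific name on top.
sort a.txt | uniq -c | sort -nr
```

The exact column padding of uniq -c varies between implementations, but the top line here is the count 3 followed by Alice.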
9. awk
Finally, let's look at one last tool, awk. awk is an excellent tool for substitutions, and its capabilities go far beyond that. Below we return to the January 20 TSV file and use awk to create a new list showing each article's author along with the number of words that author wrote.
$ awk -F '\t' '{print $3 "  " $NF}' jan20only.tsv
Kushal Das  690
D Ruth Bavousett  218
Jason Baker  214
Tracy Miranda  1007
The -F '\t' flag tells awk that it is processing tab-separated data. Inside the curly braces, we give awk the code to execute: $3 tells it to print the third field, while $NF prints the last field (NF is short for "number of fields"), with two spaces between the two results for a clear separation.
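A self-contained version of the same command on a two-row made-up TSV:

```shell
printf 'date\tArticle\tKushal Das\t690\ndate\tArticle\tTracy Miranda\t1007\n' > w.tsv

# -F '\t' splits fields on tabs; $3 is the author column and
# $NF the last field (NF = number of fields), here the word count.
awk -F '\t' '{print $3 "  " $NF}' w.tsv
# Kushal Das  690
# Tracy Miranda  1007
```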
Although the example here is small and may not seem to require these tools, once the scope expands to a file with 93,000 lines, it becomes very hard to handle with a spreadsheet program.
Using these simple tools and small scripts, you can avoid reaching for database tools and easily complete a great deal of data-statistics work. Whether you are a professional or an amateur, their usefulness should not be ignored.
Through this article, you have learned 9 commands commonly used for data analysis under Linux, along with their functions and usage. These commands cover file inspection, searching, character translation, sorting, line and column selection, output redirection, and pipes, and can help you perform all kinds of data processing and analysis on a Linux system. Of course, these are only a few of the many commands Linux provides. If you want to learn more about Linux and data analysis, you will need to keep exploring and practicing. I hope this article is helpful to your study and work. You are also welcome to share other practical Linux commands that you use or discover.
The above is the detailed content of Linux data analysis essentials: 9 practical commands. For more information, please follow other related articles on the PHP Chinese website!
