Home >Backend Development >Python Tutorial >Python Data Analysis Practical Overview Data Analysis

Python Data Analysis Practical Overview Data Analysis

coldplay.xixi
coldplay.xixiforward
2021-01-06 09:48:392107browse

Python TutorialThe column introduces the overview data.

Python Data Analysis Practical Overview Data Analysis

Recommended (free): Python Tutorial

Article Directory

  • 1. Introduction to data analysis
    • 1. Fundamentals of the big data era
    • 2. Career prospects of data analysts
    • 3. The road to becoming a data analyst
  • 2. Python installation and environment configuration
    • 1.Python version
    • 2. Install Python on different systems
    • 3. Environment variable configuration
    • 4. Install pip
    • 5. Integrated development environment selection
  • 3. Introduction and installation of Anaconda
    • 1.What is Anaconda
    • 2. Download and install Anaconda
    • 3.conda Introduction to tools and package management
  • 4. Jupyter Notebook
    • ##1.Basic introduction to Jupyter Notebook
    • 2.Jupyter Notebook Usage
    • 3. Using Python in Jupyter
    • 4. Data interaction case
      • Load csv data, process the data, and save it to the MongoDB database
      • Use Jupyter to process store data
##1. Introductory data analysis

1. Fundamentals of the big data era

The development status of the big data industry:

Now data has shown

explosive
growth, and there may be 100% of data every minute :

13,000 iPhone apps downloaded
  • 98,000 new Weibo posts posted on Twitter
  • 168 million emails sent
  • Taobao Double Eleven 10,680 New orders
  • 12306 issued 1840 tickets
  • In the era of big data, three major changes have occurred:

From random samples to full data
  • From accuracy to confounding
  • From causation to correlation
  • Give a typical example:
Men will buy some diapers when they go to the supermarket. Beer, the results of big data analysis have prompted supermarkets to put some beer near the diaper shelves, thereby increasing sales. There is no causal relationship between buying diapers and buying beer, but there is a certain correlation.


The domestic big data application status is as follows (from CSDN):


Python Data Analysis Practical Overview Data AnalysisIt can be seen that the application of big data has reached a certain scale, but there is still a lot of room for development. .

Talent needs mainly include:

Data Analyst
  • Statistical Analysis
    • Predictive Analysis
    • Process Optimization
    Big Data Engineer
  • Platform Development
    • Application Development
    • Technical Support
    Data Architect
  • Business Understanding
    • Application Deployment
    • Architecture Design
  • The reason why we need to learn data analysis is Because data is becoming more common and cheaper, analytics can provide scarce services that come with additional value.

2. Data analyst career prospects

Problems that data analysts need to solve:

Estimated demand, allocation Production Capacity

In the era of big data, the ability to interpret data is even more needed.
    Q: The oven's production capacity is limited, which types of bread should be produced?
  • A: List the most popular breads and give priority to the production of

    star products
    .
    The key is to find the star product, which requires counting the total turnover of bread, and then calculating the relative proportion of each type of bread to the total turnover, and giving priority to the production of product combinations that can account for 70% of the turnover. This will use the statistical distribution table and histogram. This analysis method is also called the ABC analysis method, as follows:


    Python Data Analysis Practical Overview Data AnalysisEvaluate the effectiveness of the marketing plan

    Statistics is not just about analyzing data. The key is to infer how to influence customer behavior from the analysis results, and formulate a specific
  • business plan
  • , and act accordingly.

    Q: If you want to sell bread online, which kind of advertising is more effective?
    A: Write two types of copywriting and advertise them for a period of time to see how effective they are. To compare advertising effectiveness, the best way is to use statistical randomized controlled experiments
    , where two types of advertising appear randomly. After a period of time, observe which advertising effect is better, and then use it on a large scale. Advertising that is more effective.

    Product Quality Control

    It is very important to discover the relationship between the results and the causes of the results.
  • Q: How can you tell from the bread whether the baker has cut corners?
  • A: Randomly check a few loaves and use a scale to see if the weight difference is too large.

    You need to know the average weight of the bread first, and then sample the bread to see if the weight of the bread shows a bell-shaped curve with a normal distribution? If it deviates from the curve, it may indicate a problem with the quality of the bread. as follows:


A good data analyst is a good product planner and a leader in the industry;
In IT companies, excellent data analysts are very promising Become a senior member of the company.

The workflow of a data analyst is as follows:
Python Data Analysis Practical Overview Data Analysis

The three major tasks of a data analyst:

  • Analyzing history
  • Predict the future
  • Optimization selection

8 skills required by data analysts:

  • Statistics
    • Statistical testing, P Values, Distributions, Estimation
  • Basic Tools
    • Python
    • SQL
  • Multivariable Calculus Sum Linear algebra
  • Data sorting
  • Data visualization
  • Software engineering
  • Machine learning
  • The thinking of a data scientist
    • Data-driven
    • Problem solving

Three major abilities required by data analysts:

  • Statistical foundation and application of analytical tools
  • Computer Coding Ability
  • Knowledge of specific application areas or industries

Typical Data Analyst Growth History:
Python Data Analysis Practical Overview Data Analysis

3. The road to becoming a data analyst

Self-cultivation to become a data analyst:

  • Sensitive
  • Exploration
  • Detailed
  • Pragmatic

The skills that data analysts need to possess are as follows:

  • Familiar with Excel data processing
  • Data sensitive Strong degree
  • Familiar with company business and industry knowledge
  • Master data analysis methods
    • Basic analysis methods
      • Contrastive analysis method
      • Group analysis method
      • Cross analysis method
      • Structural analysis method
      • Funnel plot analysis method
      • Comprehensive evaluation analysis method
      • Factor analysis method
      • Matrix correlation analysis
    • ##Advanced analysis method
        Correlation analysis method
      • Regression analysis method
      • Cluster analysis method
      • Discriminant analysis method
      • Principal component analysis method
      • Factor analysis method
      • Correspondence analysis method
      • Time series
Practice in data analysis in different industries Job content and responsibilities of personnel:

    Engage in data analysis
    • Learn to make daily reports
    • Daily sales and inventory tables
    • Products Sales forecast
    • Inventory calculation and early warning
    • Traffic analysis related tables
    • Review
  • Data analysis and mining staff
    • Provide data support for product optimization
    • Verify product improvement effects
    • Provide emails and reports for senior management
  • Internet analysis
    • KPI indicator monitoring
    • Various periodic reports
    • Analysis reports for a certain business issue
    • Offline modeling and analysis for the business
The very important subject foundation for data analysis is mathematics, but it doesn’t matter if you are not good at mathematics. You can use

Python to help you learn: Python is not only a programming language , and is the basis of data mining machine learning and other technologies, which facilitates the establishment of automated workflows;
It is not difficult to get started with Python, and its mathematical requirements are not too high. The important thing is to know how to express an algorithmic logic in language;
Python has many encapsulated tool libraries and commands. What needs to be done is to use mathematical methods to solve a problem and build it.

If you want to quickly get started with Python data analysis, you must make good use of Python related toolkits:

(1) The biggest feature of Python is that it has a huge and active
Scientific Computing community , the trend of using python for scientific computing is becoming more and more obvious. (2) Because Python has continuously improved libraries, it has become a major alternative for data processing tasks. Combined with its strong strength in general programming, you can just use Python as a language to build data-based applications. Central applications, including:

    Commonly used data analysis libraries
    • Numpy
    • Scipy
    • Pandas
    • matplotlib
  • Commonly used advanced data analysis library
    • nltk
    • igraph
    • scikit-learn
(3) As a scientific computing platform, Python can easily integrate C, C and Fortran codes.

Preparation for data analysis:

    Understanding the data
  • Data cleaning and preliminary analysis
  • Drawing and visualization
  • Data Aggregation and grouping processing
  • Data mining
Commonly used algorithms for data analysis and data mining:

    Linear regression
  • Time series analysis
  • Classification algorithm
  • Clustering algorithm
  • Dimensionality reduction algorithm
The method of learning and engaging in data analysis is:

    Think frequently
  • Do more
  • Summary
## 2. Python installation and environment configuration

1.Python version

Python is divided into two major versions: 3.X and 2.X.
Version 3.0 of Python is often called Python 3000, or Py3k for short. This is a major upgrade compared to earlier versions of Python.
In order not to bring too much burden, Python 3.X was not designed with downward compatibility in mind. Many programs designed for earlier Python versions cannot run normally on Python 3.X.
Most third-party libraries are working hard to be compatible with Python 3.X versions.

2. Install Python on different systems

(1)Unix & Linux system

  • Visit http://www.python.org /download/
  • Select the source code compressed package suitable for Unix/Linux
  • Download and decompress the compressed package
  • If you need to customize some options, modify Modules/Setup
  • Execute./configurescript
  • make
  • ##make install
(2) Window system

    Visit http://www.python.org/download/
  • Select the Window platform installation package in the download list
  • Since the download from the official website is very slow Slow, so I have downloaded and organized the installation packages for each version of Python. You can directly click to join the QQ group
    963624318 Just download it from the group folder Python Data Analysis Practical Overview Data AnalysisPython related installation package.
  • After downloading, double-click the download package to enter the Python installation wizard. The installation is very simple. Just use the default settings and click
  • Next until the installation is completed.
(3) Mac system

comes with python 2.7, you can execute
brew install python to install the new version.

3. Environment variable configuration

Windows systems need to configure environment variables.

If you did not choose to add environment variables when installing Python, you need to add them manually. You need to add the path to install Python

XXX\PythonXXX and XXX\PythonXXX\Scripts To the environment variables, there are two ways:

    Command line addition
  • Execute
    path=%path%;XXX\PythonXXX and path=% respectively in CMD path%;XXX\PythonXXX\Scripts is enough.
  • Add
  • in the system settings. Right-click the computer → Properties → Advanced system settings → System properties → Environment variables → Double-click path → Add
    XXX\PythonXXX and XXX\PythonXXX\ ScriptsInstallation path is as follows:
    Python Data Analysis Practical Overview Data Analysis
Finally click Confirm to exit.

4. Install pip

pip is a package installation and management tool in Python. You can choose to install pip when installing Python. In Python 2 >=2.7. 9 or Python 3 >=3.4.

If pip is not installed, you can install it through the command:

    Linux or Mac

  • pip install -U pip
  • Windows( cmd input)

  • python -m pip install -U pip

5. Integrated development environment selection

There are many Python Editors, including PyCharm, etc., choose PyCharm here:

PyCharm is a Python IDE created by JetBrains, supporting Mac OS, Windows, and Linux systems.
Includes
debugging, syntax highlighting, Project management, code jump, smart prompts, auto-complete, unit testing, version control and other functions.

You can choose the appropriate version to download and install at https://www.jetbrains.com/pycharm/download/.

3. Introduction and installation of Anaconda

1.What is Anaconda

Anaconda is a tool that can be used

The Python distribution of Scientific Computing supports Linux, Mac, and Windows systems, and has built-in commonly used scientific computing libraries. It solves two major pain points of official Python:
(1) Provides package management function, solving the scenario where third-party package installation on Windows platform often fails;
(2) Provides environment management function, The function is similar to virtualenv, which solves the problem of coexistence and switching of multiple versions of Python.

2. Download and install Anaconda

Download the installation package directly from the official website https://www.anaconda.com/products/inpidual and select download

Python3 The installation package of .8Personal Edition is enough, but the download speed from the official website is slow, so I have downloaded and sorted out the Anaconda installation package corresponding to Python 3.8. You can directly click to add the QQ group 963624318 Just download it from the group folder Python Data Analysis Practical Overview Data AnalysisPython related installation package.

Install directly after downloading. Please note that during the click process, a prompt to add environment variables will appear. You need to check it, as follows:


安装Anaconda 选择环境变量

Finally click Next. After the installation is complete, click the Win key (under Windows system) to see the recently added or application list A as shown below:
启动栏 最近添加
启动栏 A

At this time, you can click Anaconda Navigator, as shown below:
Anaconda Navigator

You can see that the environment is Python 3.8.3, and the basic environment created by Anaconda is named base. It is also the default environment, and you can also see the libraries installed by default.

Open the Anaconda command line tool Anaconda Powershell Prompt, enter python -V, and Python 3.8.3 will also be printed.

You can also create a new conda environment through commands, such as conda create --name py27 python=2.7After execution, a conda environment with a Python version of 2.7 named py27 will be created.

Activate the environment and execute the command conda activate py27, and deactivate the command conda deactivate.

You can execute conda list on the command line to view the installed libraries, as follows:

# packages in environment at E:\Anaconda3:
#
# Name                    Version                   Build  Channel
_ipyw_jlab_nb_ext_conf    0.1.0                    py38_0
alabaster                 0.7.12                     py_0
anaconda                  2020.07                  py38_0
anaconda-client           1.7.2                    py38_0
anaconda-navigator        1.9.12                   py38_0
...
zlib                      1.2.11               h62dcd97_4
zope                      1.0                      py38_1
zope.event                4.4                      py38_0
zope.interface            4.7.1            py38he774522_0
zstd                      1.4.5                ha9fde0e_0

3. Introduction to the conda tool and package management

conda is a tool for package management and environment management under Anaconda. Its function is similar to the combination of pip and virtualenv. Conda’s environment management is basically the same as virtualenv. It's a similar operation.
After successful installation, conda will be added to the environment variables by default, so you can run the conda command directly in the command line window.

Common conda commands and their meanings are as follows:

Command meaning conda command
conda –h View help
Create an environment named python36 based on python3.6 version conda create - -name python36 python=3.6
Activate this environment activate python36 (Windows), source activate python36 (linux/mac)
View python version python -V
Exit the current environment deactivate python36
Delete the environment conda remove -n py27 --all
View all installed environments conda info -e

Common conda package management commands are as follows:

Package management command meaning Package management command
Install matplotlib conda install matplotlib
View installed packages conda list
Package update conda update matplotlib
Remove package conda remove matplotlib

In conda, anything is a package, everything is a package, conda itself can be regarded as a package, the python environment can be regarded as a package, anaconda can also be regarded as It is a package, so in addition to ordinary third-party packages supporting updates, these 3 packages also support the following commands:

Operation Command
Update conda itself conda update conda
Update anaconda application conda update anaconda
Update python, assuming the current python environment is 3.8.1 and the latest version is 3.8.2, then it will be upgraded to 3.8.2 conda update python

四、Jupyter Notebook

1.Jupyter Notebook基本介绍

Jupyter Notebook(此前被称为IPython notebook)是一个交互式笔记本,支持运行40多种编程语言。

在开始使用notebook之前,需要先安装该库:
(1)在命令行中执行pip install jupyter来安装;
(2)安装Anaconda后自带Jupyter Notebook。

在命令行中执行jupyter notebook,就会在当前目录下启动Jupyter服务并使用默认浏览器打开页面,还可以复制链接到其他浏览器中打开,如下:
jupyter 界面

可以看到,notebook界面由以下部分组成:
(1)notebook名称;
(2)主工具栏,提供了保存、导出、重载notebook,以及重启内核等选项;
(3)notebook主要区域,包含了notebook的内容编辑区。

2.Jupyter Notebook的使用

在Jupyter页面下方的主要区域,由被称为单元格的部分组成。每个notebook由多个单元格构成,而每个单元格又可以有不同的用途。
上图中看到的是一个代码单元格(code cell),以[ ]开头,在这种类型的单元格中,可以输入任意代码并执行。
例如,输入1 + 2并按下Shift + Enter,单元格中的代码就会被计算,光标也会被移动到一个新的单元格中。

如果想新建一个notebook,只需要点击New,选择希望启动的notebook类型即可。

简单使用示意如下:
python da jupyter simple

可以看到,notebook可以修改之前的单元格,对其重新计算,这样就可以更新整个文档了。如果你不想重新运行整个脚本,只想用不同的参数测试某个程式的话,这个特性显得尤其强大。
不过,也可以重新计算整个notebook,只要点击Cell -> Run all即可。

再测试标题和其他代码如下:
python da jupyter for head

可以看到,在顶部添加了一个notebook的标题,还可以执行for循环等语句。

3.Jupyter中使用Python

Jupyter测试Python变量和数据类型如下:
python da jupyter variable data type

测试Python函数如下:
python da jupyter function

测试Python模块如下:
python da jupyter module package

可以看到,在执行出错时,也会抛出异常。

测试数据读写如下:
python da jupyter data io

数据读写很重要,因为进行数据分析时必须先读取数据,进行数据处理后也要进行保存

4.数据交互案例

加载csv数据,处理数据,保存到MongoDB数据库

有csv文件Python Data Analysis Practical Overview Data Analysis.csv和Python Data Analysis Practical Overview Data Analysis.csv,分别是商品数据和用户评分数据,如下:
Python Data Analysis Practical Overview Data Analysis
Python Data Analysis Practical Overview Data Analysis

如需获取数据、代码等相关文件进行测试学习,可以直接点击加QQ群 Python Data Analysis Practical Overview Data Analysis963624318 在群文件夹Python数据分析实战中下载即可。

现在需要通过Python将其读取出来,并将指定的字段保存到MongoDB中,需要在Anaconda中执行命令conda install pymongo安装pymongo。

Python代码如下:

import pymongoclass Product:
    def __init__(self,productId:int ,name, imageUrl, categories, tags):
        self.productId = productId
        self.name = name
        self.imageUrl = imageUrl
        self.categories = categories
        self.tags = tags    def __str__(self) -> str:
        return self.productId +'^' + self.name +'^' + self.imageUrl +'^' + self.categories +'^' + self.tagsclass Rating:
    def __init__(self, userId:int, productId:int, score:float, timestamp:int):
        self.userId = userId
        self.productId = productId
        self.score = score
        self.timestamp = timestamp    def __str__(self) -> str:
        return self.userId +'^' + self.productId +'^' + self.score +'^' + self.timestampif __name__ == '__main__':
    myclient = pymongo.MongoClient("mongodb://127.0.0.1:27017/")
    mydb = myclient["goods-users"]
    # val attr = item.split("\\^")
    # // 转换成Product
    # Product(attr(0).toInt, attr(1).trim, attr(4).trim, attr(5).trim, attr(6).trim)

    Python Data Analysis Practical Overview Data Analysis = mydb['Python Data Analysis Practical Overview Data Analysis']
    with open('Python Data Analysis Practical Overview Data Analysis.csv', 'r',encoding='UTF-8') as f:
        item = f.readline()
        while item:
            attr = item.split('^')
            product = Product(int(attr[0]), attr[1].strip(), attr[4].strip(), attr[5].strip(), attr[6].strip())
            Python Data Analysis Practical Overview Data Analysis.insert_one(product.__dict__)
            # print(product)
            # print(json.dumps(obj=product.__dict__,ensure_ascii=False))
            item = f.readline()

    # val attr = item.split(",")
    # Rating(attr(0).toInt, attr(1).toInt, attr(2).toDouble, attr(3).toInt)
    Python Data Analysis Practical Overview Data Analysis = mydb['Python Data Analysis Practical Overview Data Analysis']
    with open('Python Data Analysis Practical Overview Data Analysis.csv', 'r',encoding='UTF-8') as f:
        item = f.readline()
        while item:
            attr = item.split(',')
            rating = Rating(int(attr[0]), int(attr[1].strip()), float(attr[2].strip()), int(attr[3].strip()))
            Python Data Analysis Practical Overview Data Analysis.insert_one(rating.__dict__)
            # print(rating)
            item = f.readline()

在启动MongoDB服务后,运行Python代码,运行完成后,再通过Robo 3T查看数据库如下:
robo 3T

显然,保存数据成功。

使用Jupyter处理商铺数据

待处理的数据是商铺数据,如下:
shop data

包括名称、评论数、价格、地址、评分列表等,其中评论数、价格和评分均不规则、需要进行数据清洗。

如需获取数据、代码等相关文件进行测试学习,可以直接点击加QQ群 Python Data Analysis Practical Overview Data Analysis963624318 在群文件夹Python数据分析实战中下载即可。

Jupyter中处理如下:
python da jupyter shop data

可以看到,最后得到了经过清洗后的规则数据。

完整Python代码如下:

# 数据读取f = open('商铺数据.csv', 'r', encoding='utf8')for i in f.readlines()[1:15]:
    print(i.split(','))# 创建comment、price、commentlist清洗函数def fcomment(s):
    '''comment清洗函数:用空格分段,选取结果list的第一个为点评数,并且转化为整型'''
    if '条' in s:
        return int(s.split(' ')[0])
    else:
        return '缺失数据'def fprice(s):
    '''price清洗函数:用¥分段,选取结果list的最后一个为人均价格,并且转化为浮点型'''
    if '¥' in s:
        return float(s.split('¥')[-1])
    else:
        return '缺失数据'def fcommentl(s):
    '''commentlist清洗函数:用空格分段,分别清洗出质量、环境及服务数据,并转化为浮点型'''
    if ' ' in s:
        quality = float(s.split('                                ')[0][2:])
        environment = float(s.split('                                ')[1][2:])
        service = float(s.split('                                ')[2][2:-1])
        return [quality, environment, service]
    else:
        return '缺失数据'# 数据处理清洗datalist = []  # 创建空列表f.seek(0)n = 0  # 创建计数变量for i in f.readlines():
    data = i.split(',')
    # print(data)
    classify = data[0]  # 提取分类
    name = data[1]  # 提取店铺名称
    comment_count = fcomment(data[2])  # 提取评论数量
    star = data[3]  # 提取星级
    price = fprice(data[4])  # 提取人均
    address = data[5]  # 提取地址
    quality = fcommentl(data[6])[0]  # 提取质量评分
    env = fcommentl(data[6])[1]  # 提取环境评分
    service = fcommentl(data[6])[2]  # 提取服务评分
    if '缺失数据' not in [comment_count, price, quality]:  # 用于判断是否有数据缺失
        n += 1
        data_re = [['classify', classify],
                   ['name', name],
                   ['comment_count', comment_count],
                   ['star', star],
                   ['price', price],
                   ['address', address],
                   ['quality', quality],
                   ['environment', env],
                   ['service', service]]
        datalist.append(dict(data_re))  # 字典生成,并存入列表datalist
        print('成功加载%i条数据' % n)
    else:
        continueprint(datalist)print('总共加载%i条数据' % n)f.close()

更多编程相关知识,请访问:编程教学!!

The above is the detailed content of Python Data Analysis Practical Overview Data Analysis. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:csdn.net. If there is any infringement, please contact admin@php.cn delete