Understanding Your Data: The Essentials of Exploratory Data Analysis

By WBOY (original post), 2024-08-10 06:56:02

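Introduction

For data scientists and data analysts, exploratory data analysis (EDA) is a critical first step. Freshly collected data is raw and unprocessed, and nobody yet understands its structure or content. This is where EDA comes in: analyzing and visualizing the data to understand its key characteristics, discover patterns, and identify relationships between variables.

Understanding data means understanding its expected quality and characteristics: what you already know about the data, the needs it will serve, its content, and how it was created. Let's take a closer look at EDA and how it turns data into information, that is, data that has been processed, organized, interpreted, and structured.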

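Exploratory Data Analysis

As noted above, EDA means analyzing and visualizing data to understand its key characteristics, discover patterns, and identify relationships between variables. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test hypotheses, and check assumptions. It is an important first step in data analysis and the foundation for understanding and interpreting complex datasets.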

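Types of EDA
These are the different methods and approaches used during exploratory data analysis. The three main types of EDA are:

Univariate analysis: This is the simplest form of analysis and explores each variable in the dataset on its own. It looks at the range of values and their central tendency, describing the pattern of each variable individually. For example, examining the ages of a company's employees.

Bivariate analysis: Here two variables are observed together. The aim is to determine whether there is a statistical association between them and, if so, how strong it is. Bivariate analysis lets researchers examine the relationship between two variables. Before using it, you should understand why it matters: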

  • Helps identify trends and patterns.
  • Helps identify possible cause-and-effect relationships (an association alone does not prove causation).
  • Helps researchers make predictions.
  • Helps inform decision-making.

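Techniques used in bivariate analysis include scatter plots, correlation, regression, chi-square tests, t-tests, and analysis of variance (ANOVA), which can be used to determine how two variables are related.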
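Two of these techniques, correlation and regression, can be sketched in a few lines. This is only an illustration, assuming NumPy is available; the variable names and the five paired observations are invented for the example.

```python
import numpy as np

# Hypothetical paired observations of two variables (made-up values).
hours_online = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
hours_sleep = np.array([8.5, 8.0, 7.4, 7.1, 6.5])

# Correlation: strength and direction of the linear association (Pearson's r).
r = np.corrcoef(hours_online, hours_sleep)[0, 1]

# Regression: least-squares line y = slope * x + intercept.
slope, intercept = np.polyfit(hours_online, hours_sleep, deg=1)

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A value of r close to -1 or 1 indicates a strong linear association, while the slope estimates how much the second variable changes per unit of the first.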

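Multivariate analysis: This is the statistical study of experiments in which multiple measurements are taken on each experimental unit, and in which the relationships among those measurements, and their structure, are important for understanding the experiment. For example, studying how many hours a day a person spends on Instagram together with other measurements taken on the same person.

Techniques include dependence techniques and interdependence techniques.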

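Essentials of EDA

a. Data collection: The first step in working with data is to have the data you need. Depending on the topic you are studying, collect data from various sources using methods such as web scraping or downloading datasets from platforms like Kaggle.

b. Understand your data: Before cleaning, you first have to understand the data you collected. Find out how many rows and columns you will be working with, what information each column holds, the characteristics of the data, the data types, and so on.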
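A first look at the structure might go as follows. This is a minimal sketch assuming pandas is installed; the small DataFrame and its column names (`age`, `department`, `salary`) are made up and stand in for whatever dataset you collected.

```python
import pandas as pd

# Hypothetical toy dataset standing in for your collected data.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "department": ["HR", "IT", "IT", "Sales", "HR"],
    "salary": [42000.0, 58000.0, 73000.0, 69000.0, 51000.0],
})

n_rows, n_cols = df.shape   # number of rows and columns
column_types = df.dtypes    # data type of each column

print(n_rows, n_cols)
print(column_types)
print(df.head())            # first rows, to eyeball the content
df.info()                   # column names, non-null counts, and dtypes
```

With a real dataset you would typically load the data first, for example with `pd.read_csv`, and then run the same inspection calls.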

c. Data cleaning: This step involves identifying and resolving errors, inconsistencies, duplicates, or incomplete entries in the data. Its main goal is to improve the quality and usefulness of the data, leading to more reliable and precise findings. Cleaning data involves several steps:

      i) Handle missing values: Impute them using the mean, mode, or median of the column, fill them with a constant, forward-fill, backward-fill, or interpolate, or drop them using the dropna() function.

      ii) Detect outliers: You can detect outliers using the interquartile range, visualization, the Z-score, or a One-Class SVM.

      iii) Handle duplicates: Drop duplicate records.

      iv) Fix structural errors: Address issues with the layout and format of your data, such as date formats or misaligned fields.

      v) Remove unnecessary values: Your dataset might contain irrelevant or redundant information that is unnecessary for your analysis. You can identify and remove any records or fields that won't contribute to the insights you are trying to derive.
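The first three cleaning steps can be sketched with pandas as follows. The data is invented for illustration, and imputing with the median and using the 1.5 × IQR rule are just one reasonable choice among the options listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing a missing value, an extreme value,
# and one duplicated row.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 38, 38, 120],
    "city": ["A", "B", "A", "C", "D", "D", "A"],
})

# i) Handle missing values: impute with the column median
#    (df.dropna() would drop the affected rows instead).
df["age"] = df["age"].fillna(df["age"].median())

# ii) Detect outliers with the interquartile range (IQR) rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["age"] < lower) | (df["age"] > upper)]

# iii) Handle duplicates: keep the first copy of each repeated row.
df = df.drop_duplicates().reset_index(drop=True)

print(outliers)
print(df)
```

Whether a flagged outlier should be removed, capped, or kept depends on the domain; the sketch only identifies it.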

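d. Summary statistics: This step uses the describe() method in pandas, or the equivalent NumPy functions, to get a quick overview of the central tendency and spread of the dataset's numeric features, including the mean, median, standard deviation, minimum, and maximum. The mode is not part of the describe() output for numeric columns and can be computed separately with mode(). For categorical features, we can use charts alongside summary statistics such as frequency counts.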
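As a rough illustration of this step, assuming pandas and a made-up dataset with one numeric and one categorical column:

```python
import pandas as pd

# Hypothetical dataset with a numeric and a categorical feature.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "department": ["HR", "IT", "IT", "Sales", "HR"],
})

# count, mean, std, min, quartiles (50% is the median), and max
numeric_summary = df["age"].describe()

# The mode is computed separately; it can return several values.
age_mode = df["age"].mode()

# Frequency counts summarize a categorical feature.
dept_counts = df["department"].value_counts()

print(numeric_summary)
print(age_mode)
print(dept_counts)
```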

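e. Data visualization: This is the practice of designing and creating easy-to-communicate, easy-to-understand graphical or visual representations of large amounts of complex quantitative and qualitative data. Use line charts, bar charts, scatter plots, and box plots, with tools such as Matplotlib, seaborn, or Tableau, to identify trends and patterns in the dataset.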

f. Data relationships: Identify how the variables in your data relate to one another by performing correlation analysis.

  • Analyze relationships between numeric variables and between categorical variables. Use techniques like correlation matrices and heatmaps to visualize them.
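A small sketch of this step, assuming pandas and an invented dataset: a correlation matrix covers the numeric columns, and a contingency table (a swapped-in technique not named above) gives a comparable view for two categorical columns.

```python
import pandas as pd

# Hypothetical dataset; all column names and values are made up.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "passed": ["no", "no", "yes", "yes", "yes"],
    "group": ["A", "B", "A", "B", "A"],
})

# Pearson correlation matrix over the numeric columns only.
corr_matrix = df.select_dtypes(include="number").corr()

# Contingency table of counts for two categorical columns.
contingency = pd.crosstab(df["group"], df["passed"])

print(corr_matrix)
print(contingency)
```

If seaborn is installed, `seaborn.heatmap(corr_matrix, annot=True)` is one way to draw the matrix as a heatmap.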

g. Test hypotheses: Conduct tests like t-tests, chi-square tests, and ANOVA to determine statistical significance.
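As one example of such a test, the sketch below runs a two-sample t-test. It assumes SciPy is installed; the two groups of measurements are invented, and the 0.05 threshold is a conventional choice rather than a rule.

```python
from scipy import stats

# Hypothetical measurements for two independent groups.
group_a = [5.1, 4.9, 5.6, 5.8, 5.2, 5.0]
group_b = [6.3, 6.1, 5.9, 6.8, 6.4, 6.0]

# Welch's t-test: compares the two means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05  # significance level, chosen here for illustration
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("significant" if p_value < alpha else "not significant")
```

In the same module, `stats.chi2_contingency` and `stats.f_oneway` cover chi-square tests on contingency tables and one-way ANOVA.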

h. Communicate your findings and insights: This is the final step in carrying out EDA. It includes summarizing your evaluation, highlighting key discoveries, and presenting your results clearly.

  • Clearly state the objectives and scope of your analysis.
  • Use visualizations to display your findings.
  • Highlight critical insights, patterns, or anomalies you discovered in your EDA.
  • Discuss any limitations or caveats related to your analysis.

The next step after conducting Exploratory Data Analysis (EDA) in a data science project is feature engineering. This process involves transforming your features into a format that can be effectively understood and utilized by your model. Feature engineering builds on the insights gained from EDA to enhance the data, ensuring that it is in the best possible form for model training and performance. Let’s explore feature engineering in simple terms.

Feature Engineering.

This is the process of selecting, manipulating, and transforming raw data into features that can be used in model creation. This process involves 4 main steps:

  1. Feature Creation: Create new features from the existing features, using your domain knowledge or patterns observed in the data. This step helps to improve model performance.

  2. Feature Transformation: This involves transforming your features into a more suitable representation for your model, so that the model can learn effectively from the data. There are 4 types of transformation:

     i) Normalization: Rescale values to a common range without changing the shape of the distribution, for example mapping data to a bounded range such as [0, 1] with Min-Max Normalization. Z-score Normalization instead rescales to mean 0 and standard deviation 1, and its output is not bounded.

     ii) Scaling: Rescale your features to a similar scale so that the model considers all features equally, using methods like Min-Max Scaling, Standardization, and MaxAbs Scaling.

     iii) Encoding: Apply encoding to your categorical features to transform them into numerical features, using methods like label encoding, one-hot encoding, ordinal encoding, or any other encoding that fits the structure of your categorical columns.

     iv) Transformation: Transform the features using mathematical operations that change their distribution, for example a logarithmic or square-root transform.

  3. Feature Extraction: Extract new features from the existing attributes. It is concerned with reducing the number of features in the model, for example by using Principal Component Analysis (PCA).

  4. Feature Selection: Identify and select the most relevant features for further analysis. Use a filter method (evaluate features based on statistical metrics and select the most relevant ones), a wrapper method (use machine learning models to evaluate feature subsets and select the best combination based on model performance), or an embedded method (perform feature selection as part of model training, e.g. regularization techniques).
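Several of these steps can be sketched with pandas and NumPy alone. The dataset, the column names, the choice of `price` as the target, and the decision to keep the top three features are all assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw features; `price` plays the role of the target.
df = pd.DataFrame({
    "length": [2.0, 3.0, 4.0, 5.0],
    "width": [1.0, 2.0, 2.0, 3.0],
    "income": [1_000.0, 10_000.0, 100_000.0, 1_000_000.0],
    "color": ["red", "blue", "red", "green"],
    "price": [10.0, 25.0, 33.0, 61.0],
})

# 1. Feature creation: derive a new feature from existing ones.
df["area"] = df["length"] * df["width"]

# 2-i. Min-max normalization: map values onto [0, 1].
length_range = df["length"].max() - df["length"].min()
df["length_minmax"] = (df["length"] - df["length"].min()) / length_range

# 2-ii. Standardization (z-score): mean 0, standard deviation 1.
df["width_z"] = (df["width"] - df["width"].mean()) / df["width"].std(ddof=0)

# 2-iii. One-hot encoding of a categorical column.
df = pd.get_dummies(df, columns=["color"], dtype=int)

# 2-iv. Log transform to compress a heavily skewed feature.
df["income_log10"] = np.log10(df["income"])

# 4. Filter-method selection: rank features by absolute correlation
#    with the target and keep the three strongest.
correlations = df.drop(columns="price").corrwith(df["price"]).abs()
selected = correlations.sort_values(ascending=False).head(3).index.tolist()

print(df.head())
print(selected)
```

Feature extraction is left out of the sketch; scikit-learn's `PCA` class is a common choice for it. In a real project the scaling parameters should be computed on the training split only and then reused on the test split.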

Tools Used for Performing EDA

Let's look at the tools we can use to perform our analysis efficiently.

Python libraries

     i) Pandas: Provides extensive functions for data manipulation and analysis.

     ii) Matplotlib: Used for creating static, interactive, and animated visualizations.

     iii) Seaborn: Built on top of Matplotlib, providing a high-level interface for drawing attractive and informative statistical graphics.

     iv) Plotly: Used for making interactive plots and offers more sophisticated visualization capabilities.

R Packages

     i) ggplot2: Used for making complex plots from data in a dataframe.

     ii) dplyr: Helps in solving the most common data manipulation challenges.

     iii) tidyr: Used to tidy your dataset, storing it in a consistent form that matches the semantics of the dataset with the way it is stored.

Conclusion
Exploratory Data Analysis (EDA) forms the foundation of data science, offering insights and guiding informed decision-making. EDA empowers data scientists to uncover hidden truths and steer projects toward success. Always perform a thorough EDA before modeling, since effective model performance depends on it.
