Python toolkit for formatting and cleaning data

大家讲道理
2016-11-08 10:23:14

The world is messy, and so is real-world data. A recent survey found that data scientists spend 60% of their time organizing data. Unfortunately, 57% of them say it is the most troublesome part of their job.

Organizing data is time-consuming, but many tools have been developed to make this crucial step a little more bearable. The Python community offers many libraries for the job, from formatting DataFrames to anonymizing datasets.

Tell us which libraries you find useful - we're always working on optimizing the libraries that go into Mode Python Notebooks.

Dora

Dora is designed for exploratory analysis, and in particular for automating its most painful parts: feature selection and extraction, visualization, and, you guessed it, data cleaning. Its data-cleaning functions can:

Read data tables containing missing and unstandardized data

Impute missing values

Standardize variables
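
The three steps above can be sketched in plain pandas. This is not Dora's own API, only an illustration of the idea; the sample data, the column names, and the choice of mean imputation are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical sample data; the empty field in the second row is a missing value.
raw = io.StringIO("height,weight\n1.60,55\n1.75,\n1.90,85\n")

# Step 1: read the table, which turns the empty field into NaN.
df = pd.read_csv(raw)

# Step 2: impute each missing value with its column mean (one common choice).
df = df.fillna(df.mean())

# Step 3: standardize every column to zero mean and unit variance.
standardized = (df - df.mean()) / df.std(ddof=0)

print(standardized)
```

Mean imputation is only one option; the median is often a safer default when a column contains outliers.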

Developer: Nathan Epstein
More information: https://github.com/NathanEpstein/Dora

datacleaner

As the name suggests, datacleaner cleans your data - but only if your data is a pandas DataFrame instance. Developer Randy Olson said: "Datacleaner is not magic. It cannot magically parse your unstructured data."

It can drop rows that contain missing values, fill missing values with the mode or median of their column, and encode non-numeric variables as numeric ones. The library is very new, but since the DataFrame is the basic data structure for data analysis in Python, it is worth a try.
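
To make the described behaviour concrete, here is a minimal sketch in plain pandas. It does not call datacleaner (the project's README exposes this kind of cleaning through an `autoclean` function); the DataFrame contents, the column names, and the rule "median for numeric columns, mode for everything else" are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["Paris", "Lima", None, "Paris"],
})

for column in df.columns:
    if pd.api.types.is_numeric_dtype(df[column]):
        # Numeric column: fill gaps with the column median.
        df[column] = df[column].fillna(df[column].median())
    else:
        # Non-numeric column: fill gaps with the most frequent value ...
        df[column] = df[column].fillna(df[column].mode().iloc[0])
        # ... then encode each distinct value as an integer code.
        df[column] = df[column].astype("category").cat.codes

print(df)
```

To drop incomplete rows instead of filling them, `df.dropna()` would take the place of the two `fillna` calls.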

Developer: Randy Olson
More information: https://github.com/rhiever/datacleaner

PrettyPandas

DataFrames are powerful, but they don't produce tables you can show directly to your boss. PrettyPandas uses the pandas Style API to turn a DataFrame into a presentation-ready table: you can generate summaries, set styles, and adjust the formatting of numbers, columns, and rows. Bonus: robust, highly readable documentation.

Developer: Henry Hammond
More information: https://github.com/HHammond/PrettyPandas

tabulate

tabulate lets you generate small, attractive tables with a single function call. It makes tables more readable by aligning columns on the decimal point, formatting numbers, adding headers, and more.

One especially handy feature is output in different formats, such as HTML and PHP Markdown Extra, so that other tools or languages can pick up the data you have tabulated.

Developer: Sergey Astanin
More information: https://pypi.python.org/pypi/tabulate

scrubadub

Data scientists in the health and financial fields often need to anonymize data sets. scrubadub removes personally identifiable information (PII) from text, for example:

Names

Email addresses

URLs

Phone numbers

Username and password combinations

Skype usernames

Social Security numbers

The documentation does a good job of showing how to customize scrubadub's behavior, such as defining new kinds of PII or retaining specific PII.
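
The underlying idea, replacing each detected piece of PII with a placeholder, can be sketched with the standard library alone. This is not scrubadub's implementation or API; the two regular expressions, the `scrub` helper, and the `{{LABEL}}` placeholder style are simplified assumptions that cover far fewer cases than a real detector.

```python
import re

# Deliberately simple, hypothetical patterns; real detectors handle many more forms.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scrub(text: str) -> str:
    """Replace every match of each pattern with a {{LABEL}} placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub("{{" + label + "}}", text)
    return text


cleaned = scrub("Mail jane.doe@example.com, SSN 123-45-6789.")
print(cleaned)
```

Defining a new kind of PII in this sketch amounts to adding one more entry to `PATTERNS`.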

Developer: Datascope Analytics
More information: http://scrubadub.readthedocs.io/en/stable/index.html

Arrow

Let's be honest: dealing with dates and times in Python is a pain. The local time zone is not recognized automatically, and it takes several awkward lines of code to convert between time zones and timestamps.

Arrow aims to fill this gap so that you can handle dates and times with less code and fewer imports. Unlike Python's standard datetime tools, Arrow is timezone-aware and uses UTC by default. You can convert between time zones or parse a time string in a single line of code.
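
A minimal sketch of those one-liners, assuming Arrow is installed. The timestamp and the target time zone are arbitrary sample values.

```python
import arrow

# Parse an ISO 8601 string; the result is timezone-aware.
utc_time = arrow.get("2016-11-08T10:23:14+00:00")

# Convert to another time zone in one call.
tokyo_time = utc_time.to("Asia/Tokyo")

# Format with Arrow's own tokens (ZZ is the UTC offset with a colon).
label = tokyo_time.format("YYYY-MM-DD HH:mm:ss ZZ")

print(label)
```

For the current moment, `arrow.utcnow()` returns an equivalent timezone-aware object in UTC.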

Developer: Chris Smith
More information: http://arrow.readthedocs.io/en/latest/

Beautifier

Beautifier's mission is simple: clean up URLs and email addresses and make them look prettier. You can split an email address into its domain and username, and a URL into its domain and parameters (such as UTM parameters or tags).
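
The kind of parsing described here can be sketched with Python's standard library. This does not use Beautifier's API; the email address, the URL, and the variable names are made up for the example.

```python
from urllib.parse import parse_qs, urlsplit

# Split a hypothetical email address into username and domain.
email = "jane.doe@example.com"
username, _, email_domain = email.rpartition("@")

# Split a hypothetical URL into its domain and query parameters.
url = "https://example.com/pricing?utm_source=newsletter&utm_medium=email&page=2"
parts = urlsplit(url)
params = parse_qs(parts.query)

# Keep only the UTM tracking parameters (first value of each).
utm_params = {key: values[0] for key, values in params.items() if key.startswith("utm_")}

# Rebuild the URL without any query string or fragment.
clean_url = parts._replace(query="", fragment="").geturl()

print(username, email_domain, parts.netloc, utm_params, clean_url)
```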

Developer: Sachin Philip Mathew
More information: https://github.com/sachinvettithanam/beautifier

ftfy

ftfy (fixes text for you) takes in bad Unicode and outputs good Unicode. Simply put, it cleans up the junk characters left behind by encoding mix-ups: “quotesâ€\x9d becomes "quotes", and Ã¼ becomes ü.

Developer: Luminoso
More information: https://github.com/LuminosoInsight/python-ftfy

