[Machine Learning] Data Preprocessing: Converting Categorical Data to Numeric Values
In Python data analysis, data preprocessing comes first.
Sometimes the data you have to handle is non-numeric. This article covers how to convert such data to numeric values.
I know of three methods so far:
1. Use LabelEncoder for a quick conversion.
2. Map categories to numeric values with a mapping dictionary. This method has a limited scope of application, however.
3. Convert with the get_dummies method.
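The article does not say why the dictionary-mapping method has a limited scope. One plausible reading, offered here as an assumption, is that the dictionary must be written by hand and must list every category in advance. The minimal sketch below shows what happens otherwise; the value `'XXL'` and the variable names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hand-written ordinal mapping, as used for the 'size' column
size_mapping = {'M': 1, 'L': 2, 'XL': 3}

# 'XXL' is a hypothetical category that the dictionary does not cover
sizes = pd.Series(['M', 'XL', 'XXL'])
mapped = sizes.map(size_mapping)
print(mapped)  # the unknown category becomes NaN instead of raising an error

# List the values that were silently left unmapped
unmapped = sizes[mapped.isna()]
print(unmapped.tolist())

# The mapping is easy to invert for the values that were covered
inverse_size_mapping = {v: k for k, v in size_mapping.items()}
restored = mapped.dropna().astype(int).map(inverse_size_mapping)
print(restored.tolist())
```

Checking `mapped.isna()` after a `map` call is a cheap way to catch categories that the dictionary missed.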
```python
from io import StringIO

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

csv_data = '''A,B,C,D
1,2,3,4
5,6,,8
0,11,12,'''

df = pd.read_csv(StringIO(csv_data))
print(df)
# Count the missing values in each column
print(df.isnull().sum())
print(df.values)

# Drop rows that contain missing values.
# dropna() returns a new DataFrame, so df itself is unchanged afterwards.
print(df.dropna())
print('after', df)

# The Imputer class (and its axis argument) has been removed from scikit-learn.
# SimpleImputer replaces it and always imputes column by column.
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr.fit(df.values)  # fit computes the mean of each column
imputed_data = imr.transform(df.values)  # transform fills in the missing values
print(imputed_data)

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
print(df)

# 'size' is ordinal, so a hand-written mapping preserves its order
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)
print(df)

# Iterate over a Series
for idx, label in enumerate(df['classlabel']):
    print(idx, label)

# 1. Encode quickly with LabelEncoder. This is not suitable for 'color',
#    because the resulting integers would look as if they had an order.
class_le = LabelEncoder()
color_le = LabelEncoder()
df['classlabel'] = class_le.fit_transform(df['classlabel'].values)
# df['color'] = color_le.fit_transform(df['color'].values)
print(df)

# 2. Convert class labels to integers with a mapping dictionary.
#    Step 1 has already encoded the column, so the mapping is the identity here;
#    on the raw labels it would be {'class1': 0, 'class2': 1}.
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_mapping)
print('2,', df)

# 3. For features where method 1 does not apply (such as 'color'),
#    create one new dummy feature per category.
#    dtype=int keeps the dummy columns numeric (recent pandas defaults to bool).
pf = pd.get_dummies(df[['color']], dtype=int)
df = pd.concat([df, pf], axis=1)
df.drop(['color'], axis=1, inplace=True)
print(df)
```
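As a complement to the listing, here is a hedged sketch of two related scikit-learn calls: `LabelEncoder.inverse_transform`, which recovers the original class labels, and `OneHotEncoder`, which can stand in for `get_dummies`. The small DataFrame repeats the article's sample values; the names `ohe`, `onehot_df` and the `color_` column prefix are choices made for this illustration, not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'classlabel': ['class1', 'class2', 'class1'],
})

# Encode the class labels, then recover the original strings
class_le = LabelEncoder()
encoded = class_le.fit_transform(df['classlabel'])
decoded = class_le.inverse_transform(encoded)
print(encoded, decoded)

# One-hot encode the nominal 'color' feature.
# The default output is a sparse matrix, so convert it to a dense array.
ohe = OneHotEncoder()
onehot = ohe.fit_transform(df[['color']]).toarray()

# Build column names from the categories the encoder learned (sorted alphabetically)
columns = ['color_' + c for c in ohe.categories_[0]]
onehot_df = pd.DataFrame(onehot, columns=columns, index=df.index)
print(onehot_df)
```

Compared with `get_dummies`, a fitted `OneHotEncoder` remembers its categories, so the same object can later transform new data into the same column layout.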