Translation: 2.2 Data Preprocessing with pandas and PyTorch (Dive into Neural Networks)


I am 英俊铃铛, a blogger on 靠谱客. This post is a translation of Section 2.2, on data preprocessing with pandas and PyTorch. I am sharing it here in the hope that it serves as a useful reference.

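So far we have introduced a variety of techniques for manipulating data that are already stored in tensors. To apply deep learning to real-world problems, we often begin by preprocessing raw data rather than starting from nicely prepared data in tensor format. Among the popular data analysis tools in Python, the pandas package is commonly used. Like many other extension packages in the vast Python ecosystem, pandas can work together with tensors. We will therefore briefly walk through the steps for preprocessing raw data with pandas and converting them into tensor format. More data preprocessing techniques are covered in later chapters.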

2.2.1 Reading the Dataset

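As an example, we begin by creating an artificial dataset stored in a csv (comma-separated values) file, …/data/house_tiny.csv. Data stored in other formats can be processed in similar ways.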

Below we write the dataset row by row into a csv file.

import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # Column names
    f.write('NA,Pave,127500\n')  # Each row represents a data example
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

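To load the raw dataset from the csv file we created, we import the pandas package and call the read_csv function. The dataset has four rows and three columns, where each row describes the number of rooms ("NumRooms"), the alley type ("Alley"), and the price ("Price") of a house.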

# If pandas is not installed, just uncomment the following line:
# !pip install pandas
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
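
As a quick check on what read_csv produced, the following minimal sketch inspects the shape, the column dtypes, and the number of missing entries per column. It rebuilds the same csv contents in memory with io.StringIO (an assumption made here so the snippet does not depend on the file path), and relies on read_csv treating the string "NA" as a missing value by default.

```python
import io

import pandas as pd

# Same contents as the house_tiny.csv file written above, kept in memory
# so that this sketch does not depend on the file path.
csv_text = (
    "NumRooms,Alley,Price\n"
    "NA,Pave,127500\n"
    "2,NA,106000\n"
    "4,NA,178100\n"
    "NA,NA,140000\n"
)
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)           # (rows, columns)
print(data.dtypes)          # NumRooms is float because NaN is a float value
print(data.isnull().sum())  # number of missing entries in each column
```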

2.2.2 Handling Missing Data

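Note that "NaN" entries are missing values. Typical methods for handling missing data include imputation and deletion: imputation replaces missing values with substituted ones, while deletion ignores missing values. Here we consider imputation.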

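Using integer-location based indexing (iloc), we split data into inputs (the first two columns) and outputs (the last column). For numerical values in inputs that are missing, we replace the "NaN" entries with the mean value of the same column.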

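# Split columns by integer position: the first two are inputs, the last is the output
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
print(inputs)
print(outputs)
# numeric_only=True skips the string column "Alley", which has no mean
inputs = inputs.fillna(inputs.mean(numeric_only=True))
print(inputs)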

   NumRooms Alley
0       NaN  Pave
1       2.0   NaN
2       4.0   NaN
3       NaN   NaN
0    127500
1    106000
2    178100
3    140000
Name: Price, dtype: int64
   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN
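
The text above only applies imputation. For contrast, here is a minimal sketch that places deletion and imputation side by side. The small frame is a hypothetical stand-in for inputs before imputation, and dropping rows by the "NumRooms" column is just one possible deletion policy, not one prescribed by the original text.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `inputs` frame before imputation.
inputs = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
})

# Deletion: drop every row whose NumRooms value is missing.
deleted = inputs.dropna(subset=["NumRooms"])

# Imputation: fill missing NumRooms values with the column mean.
imputed = inputs.fillna(inputs.mean(numeric_only=True))

print(deleted)
print(imputed)
```

Deletion keeps only the rows that were observed, so the dataset shrinks. Imputation keeps every row, at the cost of inserting values that were never measured.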

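For categorical or discrete values in inputs, we treat "NaN" as a category. Since the "Alley" column takes only two categorical values, "Pave" and "NaN", pandas can automatically convert this column into two columns, "Alley_Pave" and "Alley_nan". A row whose alley type is "Pave" sets the values of "Alley_Pave" and "Alley_nan" to 1 and 0. A row with a missing alley type sets them to 0 and 1.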

# dtype=int keeps the indicator columns as 0/1 (pandas >= 2.0 defaults to bool)
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)

   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1
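
To see what dummy_na=True contributes, the following sketch encodes a hypothetical one-column frame with and without it. Passing dtype=int is a choice made here so that the indicators print as 0 and 1 regardless of the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical one-column frame mirroring the "Alley" column.
alley = pd.DataFrame({"Alley": ["Pave", np.nan, np.nan, np.nan]})

# Without dummy_na, rows with a missing value are encoded as all zeros.
without_na = pd.get_dummies(alley, dtype=int)

# With dummy_na=True, missing values get their own indicator column.
with_na = pd.get_dummies(alley, dummy_na=True, dtype=int)

print(list(without_na.columns))
print(list(with_na.columns))
print(with_na)
```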

2.2.3 Conversion to the Tensor Format

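Now that all entries in inputs and outputs are numerical, they can be converted to tensor format. Once the data are in this format, they can be further manipulated with the tensor functionality introduced in Section 2.1.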

import torch

X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
X, y

(tensor([[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
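
The tensors above inherit float64 and int64 from the underlying arrays. If a model expects PyTorch's default float32, one possible variant is sketched below. The frame and series are hypothetical stand-ins for the fully numeric inputs and outputs, and casting the prices to float is an assumption about how they will be used later, not something the original text requires.

```python
import pandas as pd
import torch

# Hypothetical stand-ins for the fully numeric `inputs` and `outputs`.
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "Alley_Pave": [1, 0, 0, 0],
    "Alley_nan": [0, 1, 1, 1],
})
outputs = pd.Series([127500, 106000, 178100, 140000], name="Price")

# to_numpy(dtype=float) yields a single float64 array even when the
# columns have mixed dtypes; dtype=torch.float32 then casts the result.
X = torch.tensor(inputs.to_numpy(dtype=float), dtype=torch.float32)
y = torch.tensor(outputs.to_numpy(dtype=float), dtype=torch.float32)

print(X.dtype, X.shape)
print(y.dtype, y.shape)
```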

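2.2.4 Summary

Like many other extension packages in the vast Python ecosystem, pandas can work together with tensors.

Imputation and deletion can be used to handle missing data.

2.2.5 Exercises

Create a raw dataset with more rows and columns.

Delete the column with the most missing values.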

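print(data)
# Largest per-column count of missing values
m = data.isnull().sum(axis=0).max()
print(m)
# Keep only columns with at least len(data) + 1 - m non-missing entries,
# which drops every column that has m missing values
data_dropmaxnan = data.dropna(axis=1, thresh=len(data) + 1 - m)
print(data_dropmaxnan)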

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
3
   NumRooms   Price
0       NaN  127500
1       2.0  106000
2       4.0  178100
3       NaN  140000

Convert the preprocessed dataset to the tensor format.

1. What is the best way to read PyTorch's source code? Please give me some tips.

Here are some official API documents that may be helpful.

https://pytorch.org/tutorials/beginner/ptcheat.html
https://pytorch.org/docs/stable/index.html#

2. How do I loop over a DataFrame's columns? I'm trying to use a loop to calculate data.isnull().sum().

There are many pandas tutorials online, so a quick search will turn up plenty. Here is the official guide.
https://pandas.pydata.org/docs/user_guide/index.html#user-guide
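
For the second question, a minimal sketch of one way to loop over the columns is shown here. The frame is a hypothetical copy of the house data used earlier, and the dictionary name missing_counts is introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the small house dataset.
data = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

# Iterating over a DataFrame yields its column labels.
missing_counts = {}
for column in data:
    missing_counts[column] = int(data[column].isnull().sum())

print(missing_counts)
```

The loop gives the same counts as the vectorized data.isnull().sum(), which is usually the shorter choice.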

References

https://d2l.ai/chapter_preliminaries/pandas.html


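The text and images in this post were provided by users or collected from the web for study and reference. Copyright belongs to the original authors.

Category: 李沐动手学深度学习
Published: 2023-10-18 18:56:28
Permalink: https://www.kaopuke.com/article/k-p-k_13_u_23_o_26_f3_14_zk5.html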
