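Overview
Using the Python library pandas, these notes load a dataset, take a quick look at its structure and missing values, and then split it into an input set and an output set. They finish with some simple data analysis: correlations and outliers.
These are personal study notes shared as-is; apologies if they are not comprehensive.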
1. Get Data
import pandas as pd
# import dataset
df = pd.read_csv('a.csv')
# shape of dataset
df.shape
# first 5 rows of dataset
df.head(5)
# check for missing values
df.isna().sum()
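As a minimal sketch of the steps above, the following reads a small in-memory CSV instead of `a.csv`, so it runs without any file on disk. The column names `a`, `b`, and `bool` and all values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for 'a.csv'; one value in column 'b' is missing
csv_text = """a,b,bool
1,2.0,True
2,,False
3,6.0,True
"""

df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)            # (number of rows, number of columns)
print(df_demo.head(5))          # first rows (here, all three)
missing = df_demo.isna().sum()  # count of missing values per column
print(missing)
```

Any column with a non-zero count in `missing` needs attention (dropping or filling) before modelling.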
2. Split Dataset
# value counts of the 'bool' column
df['bool'].value_counts()
# convert boolean values to binary (True -> 1, anything else -> 0)
df['bool'] = (df['bool'] == True).astype(int)
# split the dataset into input (all other columns) and output (bool)
x = df.drop(columns='bool')  # drop the target column, keep the features
y = df['bool']  # bracket access: df.bool can resolve to a DataFrame method instead of the column
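A minimal sketch of the split, assuming a hypothetical frame with two feature columns and a boolean target named `bool` as in the notes. Bracket indexing is used for the target so the column name cannot clash with a DataFrame attribute.

```python
import pandas as pd

# Hypothetical data; the column names and values are placeholders for illustration
df_demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [10.0, 20.0, 30.0, 40.0],
    "bool": [True, False, True, True],
})

# Convert the boolean target to 0/1
df_demo["bool"] = df_demo["bool"].astype(int)

# Features: every column except the target; target: the 'bool' column
x_demo = df_demo.drop(columns="bool")
y_demo = df_demo["bool"]

print(list(x_demo.columns))
print(y_demo.tolist())
```

`astype(int)` assumes the column holds only `True`/`False`; a column with missing values would need them handled first.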
3. Data Exploration
This section covers Boxplot, Outlier, Scatter, and Barplot.
# feature type and size of dataset
df.info()
# Summary statistics for selected columns (count, mean, std, min, quartiles, max)
df[['a','b','c']].describe()
# Boxplot
df[['a','b','c']].boxplot()
# Outliers
# https://github.com/aprilypchen/depy2016/blob/master/DePy_Talk.ipynb
# This function finds the outliers that lie more than 1.5 x IQR outside the quartiles.
# It returns where in the dataset the outliers are and what values they have.
import numpy as np
def find_outliers_tukey(x):
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    outlier_indices = list(x.index[(x < floor) | (x > ceiling)])
    outlier_values = list(x.loc[outlier_indices])
    return outlier_indices, outlier_values
# outlier_indices holds the index labels of the outliers; outlier_values holds the outlier values
outlier_indices, outlier_values = find_outliers_tukey(df['a'])
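To make the 1.5 x IQR rule concrete, here is a small worked sketch on made-up numbers. It uses `Series.quantile` rather than `np.percentile` (both interpolate linearly by default), and the helper name `tukey_fences` is hypothetical, introduced only for this illustration.

```python
import pandas as pd

def tukey_fences(s, k=1.5):
    """Return the (lower, upper) Tukey fences of a numeric Series (hypothetical helper)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values: six readings near 11-12 and one obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 100, 11])

low, high = tukey_fences(s)
mask = (s < low) | (s > high)         # True where a value falls outside the fences
outlier_indices = s.index[mask].tolist()
outlier_values = s[mask].tolist()

print(low, high)                      # 8.75 14.75
print(outlier_indices, outlier_values)  # [5] [100]
```

Here Q1 = 11 and Q3 = 12.5, so IQR = 1.5 and the fences are 11 - 2.25 = 8.75 and 12.5 + 2.25 = 14.75; only the value 100 falls outside them.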
# Correlation heatmap
# https://medium.com/@szabo.bibor/how-to-create-a-seaborn-correlation-heatmap-in-python-834c0686b88e
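# The heatmap below shows the correlation of each feature with every other feature.
# For categorical variables it doesn't make much sense, though.
import seaborn as sns
# numeric_only=True (pandas 1.5+) skips non-numeric columns, which otherwise raise an error in pandas 2.x
sns.heatmap(df.corr(numeric_only=True));
# annotated version with a fixed -1..1 colour scale
sns.set(font_scale=2)
heatmap = sns.heatmap(df.corr(numeric_only=True), vmin=-1, vmax=1, cbar=False, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 12}, pad=12);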
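The heatmap is only a picture of the correlation matrix, so it can help to inspect the matrix itself first. A minimal sketch on made-up data follows; the column names are placeholders, and selecting numeric columns explicitly is one possible way (an assumption here, not the notes' method) to keep categorical columns out of the calculation.

```python
import pandas as pd

# Made-up data for illustration
df_demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],               # rises exactly with a
    "c": [5, 4, 3, 2, 1],                # falls exactly as a rises
    "label": ["x", "y", "x", "y", "x"],  # categorical, excluded below
})

# Pearson correlation over the numeric columns only
corr = df_demo.select_dtypes(include="number").corr()
print(corr.round(2))
```

In this toy frame `a` and `b` correlate at +1 and `a` and `c` at -1, which is what the extremes of the `vmin=-1, vmax=1` colour scale represent.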
# Scatter plot of column 'a' against column 'b'
df.plot(kind='scatter', x='a', y='b', color='blue')
# Barplot comparing data in a column
df['a'].value_counts().plot(kind='bar', title= 'Barplot')
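The bar plot is drawn from the frequency table that `value_counts()` returns, so printing that table is a quick way to check what the bars will show. A small sketch with made-up category values:

```python
import pandas as pd

# Made-up categorical values for illustration
s = pd.Series(["x", "y", "x", "z", "x", "y"])

counts = s.value_counts()   # sorted by frequency, most common first
print(counts)
```

Each index entry of `counts` becomes one bar, and its count becomes the bar height.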
That's all.