Bank Institution Term Deposit Predictive Model

Courtesy of the 10 Academy training program, I have been introduced to many data science concepts by working on different projects, each of them challenging in its own way.

Bank Institution Term Deposit Predictive Model is a project I found interesting. Its main objective is to build a model that predicts which customers would or would not subscribe to a bank term deposit, and this article shares my step-by-step approach to building the model.

Contents

  • The Data
  • Exploratory Data Analysis
  • Data Preprocessing
  • Machine Learning Model
  • Comparing Results
  • Prediction
  • Conclusion
  • Further Study

The Data

The dataset (bank-additional-full.csv) used in this project contains bank customers' data. The dataset, together with its description, can be found here. The first step when performing data analysis is to import the necessary libraries and the dataset to get you going.

# importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# importing the dataset
dataset = pd.read_csv('bank-additional-full.csv', sep=';')
dataset.name = 'dataset'
dataset.head()

Exploratory Data Analysis (EDA)

EDA is an essential part of machine learning model development because it helps us understand our data and extract useful insights that aid feature engineering. The EDA performed in this project includes, but is not limited to, the following:

  • Shape and size of dataset

# function to check the shape of a dataset
def data_shape(data):
    print(data.name, 'shape:', data.shape)

# function to check the size of a dataset
def data_size(data):
    print(data.name, 'size:', data.size)

# Getting the shape of the dataset
data_shape(dataset)

# Getting the size of the dataset
data_size(dataset)

dataset shape: (41188, 21)
dataset size: 864948

.shape returns the number of rows and columns of the dataset.

.size returns the number of elements in the dataset, i.e. the number of rows times the number of columns.

  • Information and Statistical summary

# function to check the information of a dataset
def data_info(data):
    print(data.name, 'information:')
    print('---------------------------------------------')
    print(data.info())
    print('---------------------------------------------')

# Getting the information of the dataset
data_info(dataset)

.info() is used to get a concise summary of the dataset.

# Getting the statistical summary
dataset.describe().T

.describe() is used to view basic statistical details, such as the percentiles, mean, and standard deviation of the numerical columns in the dataset.

  • Unique and missing values

# function to get all unique values in the categorical variables
def unique_val(data):
    cols = data.columns
    for i in cols:
        if data[i].dtype == 'O':
            print('Unique values in', i, 'are', data[i].unique())
            print('----------------------------------------------')

# Getting the unique values in the categorical columns
unique_val(dataset)

.unique() returns the unique values in a categorical column of the dataset.

# function to check for missing values
def missing_val(data):
    print('Sum of missing values in', data.name)
    print('------------------------------')
    print(data.isnull().sum())
    print('------------------------------')

# Getting the missing values in the dataset
missing_val(dataset)

.isnull().sum() returns the number of missing values in each column of the dataset. Luckily for us, our dataset has no missing values.

  • Categorical and numerical variables

# Categorical variables
cat_data = dataset.select_dtypes(exclude='number')
cat_data.head()

# Numerical variables
num_data = dataset.select_dtypes(include='number')
num_data.head()
[Image: Categorical variables]

[Image: Numerical variables]

.select_dtypes(exclude='number') returns all the columns that do not have a numerical data type.

.select_dtypes(include='number') returns all the columns that have a numerical data type.

  • Univariate and Bivariate Analysis

I made use of Tableau (a data visualization tool) for the univariate and bivariate analysis, and the Tableau story can be found here.

  • Correlation

# using a heatmap to visualize correlation between the columns
# fig_size() and fig_att() are custom plotting helpers defined outside this snippet
fig_size(20, 10)
ax = sns.heatmap(dataset.corr(), annot=True, fmt='.1g',
                 vmin=-1, vmax=1, center=0)

# setting the plot parameters
fig_att(ax, "Heatmap correlation between Data Features",
        "Features", "Features", 35, 25, "bold")
plt.show()

Correlation shows the relationship between the variables in the dataset.

  • Outliers

A Seaborn boxplot is one way of checking a dataset for outliers.

# Using boxplots to identify outliers
for col in num_data:
    ax = sns.boxplot(num_data[col])
    save(f"{col}")  # save() is a custom helper defined outside this snippet
    plt.show()

The code above visualizes the numerical columns in the dataset, and the outliers detected were treated using the interquartile range (IQR) method. The full code can be found in this GitHub repository.
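The IQR treatment itself is not shown above, so here is a minimal sketch of one common capping approach, assuming the dataset and num_data objects from the earlier steps; the exact implementation used in the project lives in the repository, and the name dataset_new is only assumed to match the later preprocessing snippets.

# A sketch of IQR-based outlier treatment: values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are capped at the whisker bounds.
def treat_outliers_iqr(data, cols):
    data = data.copy()
    for col in cols:
        q1 = data[col].quantile(0.25)
        q3 = data[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        data[col] = data[col].clip(lower=lower, upper=upper)
    return data

# assumed name: the later snippets refer to a cleaned frame called dataset_new
dataset_new = treat_outliers_iqr(dataset, num_data.columns)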

In the course of the EDA, I found out that our target variable 'y' — has the client subscribed to a term deposit? (binary: 'yes', 'no') — is highly imbalanced, and that can affect our prediction model. This will be taken care of shortly, and this article does justice to some techniques for dealing with class imbalance.
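A quick way to see the imbalance is to look at the distribution of the target column directly; this short check is my own addition rather than part of the original notebook.

# Checking the class distribution of the target variable
print(dataset['y'].value_counts())
print(dataset['y'].value_counts(normalize=True) * 100)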

Data Preprocessing

When building a machine learning model, it is important to preprocess the data in order to end up with an efficient model.

# create a list containing the categorical columns
cat_cols = ['job', 'marital', 'education', 'default', 'housing',
            'loan', 'contact', 'month', 'day_of_week', 'poutcome']

# create a list containing the numerical columns
num_cols = ['duration', 'campaign', 'emp.var.rate', 'pdays', 'age',
            'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'previous']

The following preprocessing was done at this stage:

  • Encoding Categorical columns

Machine learning algorithms only read numerical values, which is why we need to convert our categorical values to numerical ones. I made use of the pandas get_dummies method and type casting to one-hot encode the columns.

# function to encode categorical columns
def encode(data):
    cat_var_enc = pd.get_dummies(data[cat_cols], drop_first=False)
    return cat_var_enc

# defining the output variable for classification
dataset_new['subscribed'] = (dataset_new.y == 'yes').astype('int')
  • Rescaling Numerical columns

Another data preprocessing step is to rescale our numerical columns; this helps to normalize our data within a particular range. The scikit-learn StandardScaler() was used here.

# import library for rescaling
from sklearn.preprocessing import StandardScaler

# function to rescale numerical columns
def rescale(data):
    # creating an instance of the scaler object
    scaler = StandardScaler()
    data[num_cols] = scaler.fit_transform(data[num_cols])
    return data
  • Specifying Dependent and Independent Variables

To proceed with building our prediction model, we have to specify our dependent and independent variables.

Independent variables — the inputs to the process being analyzed.

Dependent variable — the output of the process.

X = data.drop(columns=['subscribed', 'duration'])
y = data['subscribed']

The column 'duration' was dropped because it highly affects the output target (e.g., if duration=0 then y='no').

  • Splitting the Dataset

It is good practice to split the dataset into a train set and a test set when building a machine learning model, because it helps us evaluate the performance of the model on unseen data.

# import library for splitting the dataset
from sklearn.model_selection import train_test_split

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
  • Dimensionality Reduction

When we have a large number of variables, it is advisable to consider reducing them by keeping only the most important ones. There are various techniques for doing this, such as PCA, t-SNE, and autoencoders; for this project, we will be considering PCA.

# import PCA
from sklearn.decomposition import PCA

# create an instance of pca
pca = PCA(n_components=20)

# fit pca to our data
pca.fit(X_train)
pca_train = pca.transform(X_train)
X_train_reduced = pd.DataFrame(pca_train)
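The choice of 20 components is not justified in the text; one quick sanity check (my own sketch, assuming the pca object fitted above) is to look at how much of the variance those components retain.

# Cumulative share of variance captured by the 20 retained components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print('Variance explained by 20 components: %.2f%%' % (cum_var[-1] * 100))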
  • Class Imbalance

As stated earlier, we have a highly imbalanced target class, and this can affect our predictions if not treated.


In this project, I made use of SMOTE (Synthetic Minority Oversampling Technique) to deal with the class imbalance.

# importing the necessary function
from imblearn.over_sampling import SMOTE

# creating an instance
sm = SMOTE(random_state=27)

# applying it to the training set
X_train_smote, y_train_smote = sm.fit_sample(X_train_reduced, y_train)

Note: It is advisable to apply SMOTE only to the training data, not to the test data.

Machine Learning Model

Whew! We finally made it to building the model; data preprocessing can be quite a handful when trying to build a machine learning model. Let's not waste any time and dive right in.

The machine learning algorithms considered in this project include:

  • Logistic Regression
  • XGBoost
  • Multi Layer Perceptron

and the cross-validation methods used (essential, especially in our case of an imbalanced class) include:

  • K-Fold: splits a given dataset into K sections/folds, where each fold is used as the test set at some point.

  • Stratified K-Fold: a variation of K-Fold that returns stratified folds; the folds are made by preserving the percentage of samples of each class.

# import machine learning model libraries
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier

# import libraries for cross validation
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate

metrics = ['accuracy', 'roc_auc', 'f1', 'precision', 'recall']

# function to build machine learning models
def model(model, cv_method, metrics, X_train, X_test, y_train):
    if (model == 'LR'):
        # creating an instance of the regression
        model_inst = LogisticRegression()
        print('Logistic Regression\n----------------------')
    elif (model == 'XGB'):
        # creating an instance of the classifier
        model_inst = XGBClassifier()
        print('XGBoost\n----------------------')
    elif (model == 'MLP'):
        # creating an instance of the classifier
        model_inst = MLPClassifier()
        print('Multi Layer Perceptron\n----------------------')

    # cross validation
    if (cv_method == 'KFold'):
        print('Cross validation: KFold\n--------------------------')
        cv = KFold(n_splits=10, random_state=100)
    elif (cv_method == 'StratifiedKFold'):
        print('Cross validation: StratifiedKFold\n-----------------')
        cv = StratifiedKFold(n_splits=10, random_state=100)
    else:
        print('Cross validation method not found!')

    try:
        cv_scores = cross_validate(model_inst, X_train, y_train,
                                   cv=cv, scoring=metrics)
    except Exception:
        # fall back to the remaining metrics if roc_auc scoring fails
        metrics = ['accuracy', 'f1', 'precision', 'recall']
        cv_scores = cross_validate(model_inst, X_train, y_train,
                                   cv=cv, scoring=metrics)

    # displaying evaluation metric scores
    for metric in cv_scores.keys():
        mean_score = cv_scores[metric].mean() * 100
        print(metric + ':', '%.2f%%' % mean_score)
    print('')

    return model_inst

Evaluation Metrics

  • Accuracy: the proportion of data points predicted correctly. This can be a misleading metric for an imbalanced dataset, so it is advisable to consider other evaluation metrics as well.

  • AUC (Area under the ROC Curve): provides an aggregate measure of performance across all possible classification thresholds.

  • Precision: the ratio of correctly predicted positive examples to the total number of examples predicted as positive.

  • Recall: the percentage of actual positive examples that the algorithm classifies correctly.

  • F1 score: the harmonic mean of precision and recall. (Each of these metrics can be computed directly with scikit-learn, as sketched below.)
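As referenced above, each of these metrics has a ready-made implementation in scikit-learn. The snippet below is only an illustration with made-up labels, not output from the project, to show how the scores are obtained.

from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_score, recall_score, f1_score)

# toy ground-truth labels and predictions, purely for illustration
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_hat  = [0, 1, 0, 0, 1, 1, 0, 1]

print('accuracy :', accuracy_score(y_true, y_hat))
print('roc_auc  :', roc_auc_score(y_true, y_hat))
print('precision:', precision_score(y_true, y_hat))
print('recall   :', recall_score(y_true, y_hat))
print('f1       :', f1_score(y_true, y_hat))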

[Table: K-Fold cross-validation evaluation metrics]

[Table: Stratified K-Fold evaluation metrics]

Comparing Results

  • K-Fold vs Stratified K-Fold

As can be seen from the tables above, Stratified K-Fold presented much better results than K-Fold cross-validation. K-Fold cross-validation also failed to provide the AUC score for the Logistic Regression and XGBoost models. Therefore, the Stratified K-Fold results are used for further comparison.

  • Machine Learning Models

From the results obtained, XGBoost proves to be a better prediction model than Logistic Regression and MLP because it has the highest scores in 4 of the 5 evaluation metrics.

Prediction

XGBoost, being the best-performing model, is used for the prediction.

# fitting the model to the training data (xgb: the XGBClassifier instance built above)
model_xgb = xgb.fit(X_train_smote, y_train_smote)

# making predictions on the PCA-transformed test set
y_pred = model_xgb.predict(X_test_pca)
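X_test_pca is not constructed anywhere in the snippets above; it has to be the test set projected with the same PCA that was fitted on the training data. The sketch below is my own addition, not taken from the original article, and shows that step together with a quick hold-out evaluation.

from sklearn.metrics import classification_report

# projecting the test set with the PCA fitted on the training data
X_test_pca = pd.DataFrame(pca.transform(X_test))

# predicting on the hold-out set and summarising the results
y_pred = model_xgb.predict(X_test_pca)
print(classification_report(y_test, y_pred))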

Conclusion

The main objective of this project was to build a model that predicts the customers who would subscribe to a bank term deposit, and we were able to achieve that by considering three different models and using the best one for the prediction. We also went through rigorous steps of preparing our data for the models and choosing various evaluation metrics to measure their performance.

From the results obtained, we observe that XGBoost was the best model, with the highest scores in 4 of the 5 evaluation metrics.

Further Study

In this project, I used only three machine learning algorithms. However, algorithms such as SVM, Random Forest, and Decision Trees can also be explored.

The detailed code for this project can be found in this GitHub repository.

I know this was a very long ride, but thank you for sticking with me to the end. I also appreciate 10 Academy once again, and my fellow learners, for the wonderful opportunity to take part in this project.

Translated from: https://towardsdatascience.com/bank-institution-term-deposit-predictive-model-83afe1d2b08c
