
Scikit Learn - Anomaly Detection

Overview

Here, we will learn what anomaly detection is in Sklearn and how it is used in the identification of data points.


Anomaly detection is a technique used to identify data points in a dataset that do not fit well with the rest of the data. It has many applications in business, such as fraud detection, intrusion detection, system health monitoring, surveillance, and predictive maintenance. Anomalies, which are also called outliers, can be divided into the following three categories −


  • Point anomalies − It occurs when an individual data instance is considered anomalous w.r.t. the rest of the data.


  • Contextual anomalies − This kind of anomaly is context-specific. It occurs if a data instance is anomalous in a specific context.


  • Collective anomalies − It occurs when a collection of related data instances is anomalous w.r.t. the entire dataset rather than as individual values.


Methods

Two methods, namely outlier detection and novelty detection, can be used for anomaly detection. It is necessary to understand the distinction between them.


Outlier detection

The training data contains outliers, which are defined as observations that are far from the rest of the data. For that reason, outlier detection estimators always try to fit the region containing the most concentrated training data while ignoring the deviant observations. This is also known as unsupervised anomaly detection.


Novelty detection

It is concerned with detecting an unobserved pattern in new observations that was not included in the training data. Here, the training data is not polluted by outliers. This is also known as semi-supervised anomaly detection.

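For a quick illustration of this semi-supervised setting, here is a minimal, hedged sketch; it assumes synthetic training data and uses the LocalOutlierFactor estimator (covered later in this section) with novelty = True, so the model is fit on clean data and then judges new observations only −


import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.normal(size = (100, 2))   # training data, assumed free of outliers

# novelty = True switches LOF from outlier detection to novelty detection
novelty_clf = LocalOutlierFactor(n_neighbors = 20, novelty = True).fit(X_train)

# predict() labels new observations: 1 for inliers, -1 for novelties
print(novelty_clf.predict([[0.1, -0.2], [6.0, 6.0]]))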

There is a set of ML tools, provided by scikit-learn, which can be used for both outlier detection and novelty detection. These tools first learn from the data in an unsupervised way by using the fit() method, as follows −



estimator.fit(X_train)

Now, the new observations would be sorted as inliers (labeled 1) or outliers (labeled -1) by using the predict() method as follows −



estimator.predict(X_test)

The estimator will first compute the raw scoring function, and then the predict method will apply a threshold to that raw scoring function. We can access this raw scoring function with the help of the score_samples method and can control the threshold with the contamination parameter.


We can also use the decision_function method, which reports outliers as negative values and inliers as non-negative values.



estimator.decision_function(X_test)
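
Putting these pieces together, the following is a minimal end-to-end sketch. It assumes synthetic data and uses IsolationForest (one of the estimators covered below); any outlier-detection estimator in this section exposes the same API −


import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = rng.normal(size = (200, 2))         # mostly "regular" points
X_test = np.array([[0.0, 0.0], [8.0, 8.0]])   # one inlier, one obvious outlier

estimator = IsolationForest(contamination = 0.1, random_state = 42)
estimator.fit(X_train)                        # unsupervised learning from the data

print(estimator.predict(X_test))              # 1 for inliers, -1 for outliers
print(estimator.score_samples(X_test))        # raw scores; threshold controlled by contamination
print(estimator.decision_function(X_test))    # equals score_samples(X_test) - offset_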

Sklearn algorithms for Outlier Detection

Let us begin by understanding what an elliptic envelope is.


Fitting an elliptic envelope

This algorithm assumes that regular data comes from a known distribution, such as a Gaussian distribution. For outlier detection, Scikit-learn provides an object named covariance.EllipticEnvelope.


This object fits a robust covariance estimate to the data, and thus, fits an ellipse to the central data points. It ignores the points outside the central mode.


Parameters

The following table consists of the parameters used by the sklearn.covariance.EllipticEnvelope method −


1. store_precision − Boolean, optional, default = True

   It specifies whether the estimated precision is stored.

2. assume_centered − Boolean, optional, default = False

   If we set it to False, it will compute the robust location and covariance directly with the help of the FastMCD algorithm. On the other hand, if set to True, it will compute the support of the robust location and covariance.

3. support_fraction − float in (0., 1.), optional, default = None

   This parameter tells the method what proportion of points is to be included in the support of the raw MCD estimates.

4. contamination − float in (0., 1.), optional, default = 0.1

   It provides the proportion of the outliers in the data set.

5. random_state − int, RandomState instance or None, optional, default = None

   This parameter represents the seed of the pseudo-random number generator, which is used while shuffling the data. The following are the options −

   • int − In this case, random_state is the seed used by the random number generator.

   • RandomState instance − In this case, random_state is the random number generator.

   • None − In this case, the random number generator is the RandomState instance used by np.random.


Attributes

The following table consists of the attributes provided by the sklearn.covariance.EllipticEnvelope method −


1. support_ − array-like, shape (n_samples,)

   It represents the mask of the observations used to compute robust estimates of location and shape.

2. location_ − array-like, shape (n_features,)

   It returns the estimated robust location.

3. covariance_ − array-like, shape (n_features, n_features)

   It returns the estimated robust covariance matrix.

4. precision_ − array-like, shape (n_features, n_features)

   It returns the estimated pseudo-inverse matrix.

5. offset_ − float

   It is used to define the decision function from the raw scores: decision_function = score_samples - offset_.


Implementation Example



import numpy as np
from sklearn.covariance import EllipticEnvelope
true_cov = np.array([[.5, .6], [.6, .4]])
X = np.random.RandomState(0).multivariate_normal(mean = [0, 0], cov = true_cov, size = 500)
cov = EllipticEnvelope(random_state = 0).fit(X)
# Now we can use the predict method. It will return 1 for an inlier and -1 for an outlier.
cov.predict([[0, 0], [2, 2]])

Output



array([ 1, -1])
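
Continuing with the fitted cov object above, we can also inspect the attributes from the tables above and see how the raw scores relate to the decision function. This is a hedged follow-up sketch; the exact numbers depend on the randomly generated data −


# Estimated robust location and covariance matrix
print(cov.location_)
print(cov.covariance_)

# decision_function is the raw score shifted by offset_:
# decision_function(X) == score_samples(X) - offset_
X_new = [[0, 0], [2, 2]]
print(cov.score_samples(X_new))
print(cov.decision_function(X_new))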

Isolation Forest

In the case of high-dimensional datasets, one efficient way to perform outlier detection is to use random forests. scikit-learn provides the ensemble.IsolationForest method, which isolates the observations by randomly selecting a feature. Afterwards, it randomly selects a split value between the maximum and minimum values of the selected feature.


Here, the number of splits needed to isolate a sample is equivalent to the path length from the root node to the terminating node.


Parameters

The following table consists of the parameters used by the sklearn.ensemble.IsolationForest method −


1. n_estimators − int, optional, default = 100

   It represents the number of base estimators in the ensemble.

2. max_samples − int or float, optional, default = “auto”

   It represents the number of samples to be drawn from X to train each base estimator. If we choose int as its value, it will draw max_samples samples. If we choose float as its value, it will draw max_samples * X.shape[0] samples. And, if we choose auto as its value, it will draw max_samples = min(256, n_samples).

3. contamination − auto or float, optional, default = auto

   It provides the proportion of the outliers in the data set. If we leave it at its default, i.e. auto, it will determine the threshold as in the original paper. If set to a float, contamination must lie in the range [0, 0.5].

4. random_state − int, RandomState instance or None, optional, default = None

   This parameter represents the seed of the pseudo-random number generator, which is used while shuffling the data. The following are the options −

   • int − In this case, random_state is the seed used by the random number generator.

   • RandomState instance − In this case, random_state is the random number generator.

   • None − In this case, the random number generator is the RandomState instance used by np.random.

5. max_features − int or float, optional (default = 1.0)

   It represents the number of features to be drawn from X to train each base estimator. If we choose int as its value, it will draw max_features features. If we choose float as its value, it will draw max_features * X.shape[1] features.

6. bootstrap − Boolean, optional (default = False)

   Its default option is False, which means the sampling would be performed without replacement. On the other hand, if set to True, individual trees are fit on a random subset of the training data, sampled with replacement.

7. n_jobs − int or None, optional (default = None)

   It represents the number of jobs to be run in parallel for both the fit() and predict() methods.

8. verbose − int, optional (default = 0)

   This parameter controls the verbosity of the tree-building process.

9. warm_start − Bool, optional (default = False)

   If warm_start = True, we can reuse the solution of the previous call to fit and add more estimators to the ensemble. But if it is set to False, we need to fit a whole new forest.


Attributes

The following table consists of the attributes provided by the sklearn.ensemble.IsolationForest method −


1. estimators_ − list of DecisionTreeClassifier

   It provides the collection of all fitted sub-estimators.

2. max_samples_ − integer

   It provides the actual number of samples used.

3. offset_ − float

   It is used to define the decision function from the raw scores: decision_function = score_samples - offset_.


Implementation Example


The Python script below will use the sklearn.ensemble.IsolationForest method to fit 10 trees on the given data −



from sklearn.ensemble import IsolationForest
import numpy as np
X = np.array([[-1, -2], [-3, -3], [-3, -4], [0, 0], [-50, 60]])
OUTDclf = IsolationForest(n_estimators = 10)
OUTDclf.fit(X)

Output



IsolationForest(
behaviour = 'old', bootstrap = False, contamination='legacy',
max_features = 1.0, max_samples = 'auto', n_estimators = 10, n_jobs=None,
random_state = None, verbose = 0
)
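
As a hedged follow-up on the forest fitted above, we can now label the training points and inspect their raw anomaly scores; with such a small sample the exact scores vary from run to run, but the point [-50, 60] should stand out −


# predict() returns 1 for inliers and -1 for outliers
print(OUTDclf.predict(X))

# score_samples() returns the raw scores (lower means more abnormal)
print(OUTDclf.score_samples(X))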

Local Outlier Factor

The Local Outlier Factor (LOF) algorithm is another efficient algorithm for performing outlier detection on high-dimensional data. scikit-learn provides the neighbors.LocalOutlierFactor method, which computes a score, called the local outlier factor, reflecting the degree of abnormality of the observations. The main logic of this algorithm is to detect samples that have a substantially lower density than their neighbors. That is why it measures the local density deviation of given data points w.r.t. their neighbors.


Parameters

The following table consists of the parameters used by the sklearn.neighbors.LocalOutlierFactor method −


1. n_neighbors − int, optional, default = 20

   It represents the number of neighbors used by default for the kneighbors query. All samples would be used if n_neighbors is larger than the number of samples provided.

2. algorithm − optional

   Which algorithm is to be used for computing the nearest neighbors.

   • If you choose ball_tree, it will use the BallTree algorithm.

   • If you choose kd_tree, it will use the KDTree algorithm.

   • If you choose brute, it will use a brute-force search algorithm.

   • If you choose auto, it will decide the most appropriate algorithm on the basis of the values passed to the fit() method.

3. leaf_size − int, optional, default = 30

   The value of this parameter can affect the speed of construction and query. It also affects the memory required to store the tree. This parameter is passed to the BallTree or KDTree algorithms.

4. contamination − auto or float, optional, default = auto

   It provides the proportion of the outliers in the data set. If we leave it at its default, i.e. auto, it will determine the threshold as in the original paper. If set to a float, contamination must lie in the range [0, 0.5].

5. metric − string or callable, default = 'minkowski'

   It represents the metric used for distance computation.

6. p − int, optional (default = 2)

   It is the parameter for the Minkowski metric. p = 1 is equivalent to using manhattan_distance, i.e. L1, whereas p = 2 is equivalent to using euclidean_distance, i.e. L2.

7. novelty − Boolean, (default = False)

   By default, the LOF algorithm is used for outlier detection, but it can be used for novelty detection if we set novelty = True.

8. n_jobs − int or None, optional (default = None)

   It represents the number of jobs to be run in parallel for both the fit() and predict() methods.


Attributes

The following table consists of the attributes provided by the sklearn.neighbors.LocalOutlierFactor method −


1. negative_outlier_factor_ − numpy array, shape (n_samples,)

   It provides the opposite LOF of the training samples.

2. n_neighbors_ − integer

   It provides the actual number of neighbors used for neighbors queries.

3. offset_ − float

   It is used to define the binary labels from the raw scores.


Implementation Example


The Python script given below uses the sklearn.neighbors.NearestNeighbors method to construct a nearest-neighbors model from the array corresponding to our data set −



from sklearn.neighbors import NearestNeighbors
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
LOFneigh = NearestNeighbors(n_neighbors = 1, algorithm = "ball_tree", p = 1)
LOFneigh.fit(samples)

Output



NearestNeighbors(
algorithm = 'ball_tree', leaf_size = 30, metric='minkowski',
metric_params = None, n_jobs = None, n_neighbors = 1, p = 1, radius = 1.0
)

Example

Now, we can ask this constructed model for the closest point to [0.5, 1., 1.5] by using the following Python script −



print(LOFneigh.kneighbors([[.5, 1., 1.5]]))

Output



(array([[1.7]]), array([[1]], dtype = int64))
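
For comparison, here is a minimal sketch that uses neighbors.LocalOutlierFactor itself on the same samples; fit_predict() labels inliers 1 and outliers -1, and negative_outlier_factor_ holds the opposite LOF of the training samples, as described in the attributes table above −


from sklearn.neighbors import LocalOutlierFactor

samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]   # same data as above
LOFclf = LocalOutlierFactor(n_neighbors = 2)
print(LOFclf.fit_predict(samples))           # 1 for inliers, -1 for outliers
print(LOFclf.negative_outlier_factor_)       # opposite LOF of the training samples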

One-Class SVM

The One-Class SVM, introduced by Schölkopf et al., is an unsupervised method for outlier detection. It is also very efficient with high-dimensional data, and it estimates the support of a high-dimensional distribution. It is implemented in the Support Vector Machines module, in the sklearn.svm.OneClassSVM object. For defining a frontier, it requires a kernel (RBF is the most commonly used) and a scalar parameter.


For better understanding, let's fit our data with the svm.OneClassSVM object −


Example


from sklearn.svm import OneClassSVM
X = [[0], [0.89], [0.90], [0.91], [1]]
OSVMclf = OneClassSVM(gamma = 'scale').fit(X)

Now, we can get the score_samples for input data as follows −



OSVMclf.score_samples(X)

Output


array([1.12218594, 1.58645126, 1.58673086, 1.58645127, 1.55713767])
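
As a hedged follow-up on the fitted OSVMclf above, predict() labels each sample as an inlier (1) or an outlier (-1), and decision_function() gives the signed distance to the separating frontier −


print(OSVMclf.predict(X))
print(OSVMclf.decision_function(X))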

Translated from: https://www.tutorialspoint.com/scikit_learn/scikit_learn_anomaly_detection.htm
