Overview
https://arxiv.org/abs/1907.06831
Yang F, Du M, Hu X. Evaluating Explanation Without Ground Truth in Interpretable Machine Learning. arXiv preprint arXiv:1907.06831, 2019.
0 Abstract
Interpretable machine learning (IML) has become increasingly important in many real-world applications, such as autonomous driving and medical diagnosis, where explanations help people better understand how machine learning systems work and further strengthen their trust in those systems. However, due to the diversity of scenarios and the subjectivity of explanation, we rarely have ground truth for benchmarking the quality of the generated explanations in IML. Evaluating explanation quality matters not only for assessing system boundaries, but also for realizing genuine benefits to human users in practical settings. To benchmark evaluation in IML, this paper rigorously defines the problem of evaluating explanations and systematically reviews the existing state-of-the-art work. Specifically, it summarizes three general properties of explanation (generalizability, fidelity, and persuasibility), gives formal definitions, and then reviews representative evaluation methods for each property under different tasks. In addition, a unified evaluation framework is designed according to the hierarchical needs of developers and end users, which can conveniently be applied to different scenarios in practice. Finally, open problems are discussed and several limitations of current evaluation techniques are pointed out for future exploration.
1 Introduction
Complex machine learning models (e.g., deep learning models) achieve high accuracy but are usually hard to interpret, which limits their adoption in high-stakes scenarios. The concept of interpretable machine learning (IML) was therefore proposed, aiming to help humans better understand the decision process of machine learning systems.
Core idea of IML techniques:
Research trends in IML techniques:
Although IML techniques have been discussed comprehensively in terms of both methods and applications, insights from the evaluation perspective remain quite limited, which seriously hinders IML from becoming a rigorous science. To accurately reflect the boundaries of IML systems and to measure the benefits IML brings to human users, effective evaluation is critical and indispensable. Unlike conventional evaluation that relies solely on model performance, IML evaluation also needs to consider the quality of the generated explanations, which makes it hard to handle and to benchmark.
Challenges of IML evaluation: experiments must balance subjective and objective concerns.
- On one hand, different users may prefer different explanations in different scenarios, so evaluating IML objectively against a common set of ground truths is impractical.
- On the other hand, users' subjective satisfaction alone cannot serve as the evaluation of IML either; studies have shown that subjective satisfaction is often related to users' response time rather than to the predictive performance of the system.
- Moreover, relying entirely on subjective user evaluation also raises ethical issues, because manipulating explanations to better cater to human users is unethical. Over-pursuing human satisfaction may produce explanations that persuade users instead of actually explaining the system.
Contributions of this paper:
- Overview of IML methods, categorized along two dimensions (interpretation scope and interpretation manner).
- Summarize three general properties (generalizability, fidelity, and persuasibility), define them formally, and rigorously define the evaluation problem in the IML setting.
- Systematically review existing work on IML evaluation according to the three properties, focusing on different explanation techniques in different applications.
- Review some special properties of IML evaluation in other particular scenarios.
- Based on these properties, design a unified evaluation framework that considers the hierarchical needs of both model developers and end users.
- Finally, point out several problems with current evaluation techniques and discuss some of their limitations for future exploration.
2 Explanation and Evaluation
2.1 Explanation Overview
A two-dimensional categorization
- Scope dimension
    - global: indicates the overall working mechanism of the model by interpreting its structures or parameters
    - local: reflects the particular model behavior for an individual instance by analyzing a specific decision
- Manner dimension
    - intrinsic: achieved by self-interpretable models
    - post-hoc (also written as posthoc): requires a separate interpretation model or technique to provide explanations for the target model
2.2 General Properties of Explanation
- generalizability
- fidelity
- persuasibility
2.3 Explanation Evaluation Problem
Definition of the IML evaluation problem:
- model evaluation: same as for conventional machine learning systems, using metrics such as accuracy, precision, recall, and F1-score.
- explanation evaluation: differs from model evaluation in both goal and method. Since explanations usually involve multiple perspectives and there is no common ground truth across scenarios, conventional model evaluation techniques cannot be applied directly.
Definition of explanation evaluation: The explanation evaluation problem within the IML context is to assess the quality of the explanations generated by systems, where a high-quality explanation corresponds to large values of generalizability, fidelity, and persuasibility under the relevant measurements. In general, a good explanation ought to be well generalized, highly faithful, and easily understood.
3 Evaluation Review
3.1 Evaluation on Generalizability
- intrinsic-global explanations: the model itself is interpretable, so the explanation can be evaluated by applying it to a held-out test set and checking how well it generalizes, which is similar to model evaluation (a minimal sketch follows this list)
    - metrics: accuracy, F1-score, AUC
    - examples: generalized linear models (with informative coefficients), decision trees (with structured branches), K-nearest neighbors (with significant instances), and rule-based systems (with discriminative patterns)
- posthoc-global explanations: similar to intrinsic-global explanations, except that the evaluated model is not the target model itself but a surrogate model
    - metrics: accuracy, F1-score, AUC
    - examples: knowledge distillation, mimic learning
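Below is a minimal sketch of what such a generalizability check might look like in practice, assuming scikit-learn, a decision tree as the self-interpretable model, and an arbitrary tabular dataset; all of these choices are illustrative, not prescribed by the paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Illustrative data: any tabular classification dataset works the same way.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Intrinsic-global explanation: the fitted tree structure itself is the explanation.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Generalizability is measured like ordinary model evaluation on unseen data.
y_pred = tree.predict(X_test)
y_prob = tree.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))
print("AUC:     ", roc_auc_score(y_test, y_prob))
```

For posthoc-global explanations, the same metrics would be computed on the surrogate model rather than on the target model.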
3.2 Evaluation on Fidelity
- intrinsic explanations: have full fidelity, since the explanation itself is exactly the true working mechanism of the model
- posthoc-global explanations: a surrogate model does not have full fidelity, because it is built on top of a different model; fidelity is reflected by measuring the performance gap between the target model and the surrogate model
    - examples: teacher-student models
- posthoc-local explanations: fidelity is typically measured via ablation or perturbation analysis; the core idea is to check the prediction variation after adversarial changes are made to the input according to the generated explanation (a minimal sketch follows this list)
    - philosophy: if the explanation is faithful to the target system, then modifying the input instance in accordance with that explanation should bring about significant differences in the model prediction; the larger the change in prediction, the more faithful the explanation
    - examples: masking the attributed regions of images in image classification tasks
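A rough sketch of the ablation idea follows. It assumes a model exposing class probabilities and an attribution vector produced by some local explainer; `model_predict`, `attribution`, the baseline value, and `k` are all hypothetical placeholders rather than anything specified in the paper.

```python
import numpy as np

def fidelity_drop(model_predict, x, attribution, k=5, baseline=0.0):
    """Ablation-style fidelity check: mask the k features that the explanation
    ranks as most important and measure how much the probability of the
    originally predicted class drops. A larger drop suggests a more faithful
    explanation. `model_predict` maps a batch of inputs to class probabilities."""
    probs = model_predict(x[None, :])[0]
    target = int(np.argmax(probs))              # originally predicted class
    top_k = np.argsort(-np.abs(attribution))[:k]

    x_masked = x.copy()
    x_masked[top_k] = baseline                  # replace attributed features with a baseline value
    probs_masked = model_predict(x_masked[None, :])[0]

    return probs[target] - probs_masked[target]
```

Averaging this drop over a set of test instances gives one fidelity score; the baseline used for masking and the number of masked features are design choices the evaluation has to fix.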
3.3 Evaluation on Persuasibility
- uncontentious tasks (e.g., object detection): for tasks where annotations typically stay consistent across different groups of users, persuasibility can be evaluated by comparison against human annotations
    - example 1: bounding boxes, where Intersection over Union (IoU), also known as the Jaccard index, quantifies persuasibility (see the sketch after this list)
    - example 2: semantic segmentation, where pixel-level differences serve as the metric
    - example 3: analogous to human annotation, NLP tasks use rationales, i.e., subsets of the input features that annotators consider important for the prediction
- complicated tasks: annotation-based methods no longer apply, because the relevant annotations would be inconsistent across user groups; in this case, the common approach is to employ users for human studies
    - metrics for human evaluation: mental model, human-machine task performance, user satisfaction, user trust, response time, decision accuracy
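For the bounding-box case in example 1, the IoU (Jaccard index) computation itself is simple; the sketch below assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) tuples, one derived from the explanation and one from the human annotation.

```python
def iou(box_a, box_b):
    """Intersection over Union (Jaccard index) between two axis-aligned boxes,
    each given as (x_min, y_min, x_max, y_max). Used to compare the region
    highlighted by an explanation against a human-annotated bounding box."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# e.g. iou((10, 10, 50, 50), (20, 20, 60, 60)) compares the explanation-derived
# box with the human annotation; higher values indicate higher persuasibility.
```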
3.4 Evaluation on Other Properties
The reasons to introduce these properties separately:
- they are not representative or general enough for explanation evaluation across IML systems, and are only considered under specific architectures or applications
- they relate to both the prediction model and the generated explanation, and require novel, specially designed evaluation
Other properties:
- robustness: how similar the explanations are for similar examples (a minimal sketch follows this list)
    - metric: sensitivity
- capability: the extent to which the corresponding explanations can be generated at all
    - applies to explanations produced by search-based methodologies, rather than those obtained by gradient-based or perturbation-based methodologies
    - example: recommender systems, using explainability precision and explainability recall as metrics
- certainty: whether the explanations reflect the uncertainty of the target IML system
    - e.g., the discrepancy in the IML system's prediction confidence between one category and the others
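For the robustness property, one common way to estimate sensitivity is as the worst-case explanation change over small input perturbations; in the sketch below, `explain_fn` is a hypothetical local attribution function, and the perturbation radius and sample count are illustrative assumptions.

```python
import numpy as np

def max_sensitivity(explain_fn, x, radius=0.05, n_samples=20, seed=None):
    """Robustness check: how much does the explanation change for inputs that
    are close to x? Smaller values indicate more robust explanations.
    `explain_fn` maps an input vector to an attribution vector of equal length."""
    rng = np.random.default_rng(seed)
    base = explain_fn(x)
    worst = 0.0
    for _ in range(n_samples):
        x_pert = x + rng.uniform(-radius, radius, size=x.shape)
        worst = max(worst, np.linalg.norm(explain_fn(x_pert) - base))
    return worst
```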
4 Discussion and Exploration
4.1 Unified Framework for Evaluation
- Generalizability -> does the explanation reflect true knowledge for the particular task? -> the precondition for human users to make accurate decisions with the generated explanations
- Fidelity -> explanation relevance -> whether to trust the IML system or not
- Persuasibility -> directly bridges the gap between human users and machine explanations -> tangible impacts
Each tier should have a consistent pipeline with a fixed set of data, users, and metrics.
The overall evaluation result can then be derived in an ensemble manner, e.g., as a weighted sum of the per-tier scores (a minimal sketch follows).
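A minimal sketch of such an ensemble, assuming each tier has already been scored and normalized to [0, 1]; the weights themselves are a deployment-specific choice rather than something the paper fixes.

```python
def overall_score(tier_scores, weights=None):
    """Combine per-tier evaluation results (e.g. generalizability, fidelity,
    and persuasibility scores, each normalized to [0, 1]) into a single number
    via a weighted sum. Equal weights are used by default."""
    names = list(tier_scores)
    if weights is None:
        weights = {n: 1.0 / len(names) for n in names}
    return sum(weights[n] * tier_scores[n] for n in names)

# e.g. overall_score({"generalizability": 0.82, "fidelity": 0.67, "persuasibility": 0.74})
```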