BeyondCorp
超越组织边界方案:“零信任”
Design to Deployment at Google
谷歌的设计和落地方案
Authors: Barclay Osborn, Justin McWilliams, Betsy Beyer, and Max Saltonstall
About the authors:
Barclay Osborn is a Site Reliability Engineering Manager at Google in Los Angeles. He previously worked at a variety of software, hardware, and security startups in San Diego. He holds a BA in computer science from the University of California, San Diego. barclay@google.com
BarclayOsborn是洛杉矶Google公司网站可靠性工程经理。他曾在圣地亚哥的多个软硬件和安全初创公司工作。他拥有加州大学圣地亚哥分校计算机科学学士学位。邮箱barclay@google.com。
Justin McWilliams is a Google Engineering Manager based in NYC. Since joining Google in 2006, he has held positions in IT Support and IT Ops Focused Software Engineering. He holds a BA from the University of Michigan, Ann Arbor. jjm@google.com
Justin McWilliams是常驻纽约的Google工程经理。自2006年加入谷歌以来,他一直担任IT支持和专注于IT运营的软件工程职位。他拥有密歇根大学安阿伯分校的学士学位。邮箱jjm@google.com。
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously provided documentation for Google Data Center and Hardware Operations teams. Before moving to New York, Betsy was a lecturer in technical writing at Stanford University. She holds degrees from Stanford and Tulane. bbeyer@google.com
Betsy Beyer是纽约Google网站可靠性工程团队的技术作家。她曾为Google数据中心和硬件运营团队提供文档支持。来纽约之前,Betsy是斯坦福大学的技术写作讲师。她拥有斯坦福大学和杜兰大学的学位。邮箱bbeyer@google.com。
Max Saltonstall is a Program Manager for Google Corporate Engineering in New York. Since joining Google in 2011 he has worked on advertising products, internal change management, and IT externalization. He has a degree in computer science and psychology from Yale. maxsaltonstall@google.com
Max Saltonstall是纽约Google工程公司的编程经理。自2011年加入Google以来,他一直致力于广告产品、内部变革管理和IT外部化。他在耶鲁大学获得计算机科学和心理学学位。邮箱maxsaltonstall@google.com
The main text follows.
Format: English original, followed by the Chinese translation.
The goal of Google’s BeyondCorp initiative is to improve our security with regard to how employees and devices access internal applications. Unlike the conventional perimeter security model, BeyondCorp doesn’t gate access to services and tools based on a user’s physical location or the originating network; instead, access policies are based on information about a device, its state, and its associated user. BeyondCorp considers both internal networks and external networks to be completely untrusted, and gates access to applications by dynamically asserting and enforcing levels, or “tiers,” of access.
谷歌BeyondCorp计划的目标是提高员工和设备访问内部应用程序的安全性。与传统的内外网安全模型不同,BeyondCorp不基于用户的物理位置、网络位置来放开服务、工具的访问权限;而是基于设备信息、状态信息、关联用户的信息。BeyondCorp认为内、外网络都完全不可信,应通过动态判断策略和强制执行访问分级。来进行应用程序的访问控制。
We present an overview of how Google transitioned from traditional security infrastructure to the BeyondCorp model, along with the challenges we faced and the lessons we learned in the process. For an architectural discussion of BeyondCorp, see [1].
本文,我们概述了谷歌如何从传统的安全基础设施控制模式 过渡到BeyondCorp模式,以及我们在这一过程中面临的挑战和经验教训。有关BeyondCorp的架构讨论,请参见[1]。
Overview
概述
As illustrated by Figure 1, the fundamental components of the BeyondCorp system include the Trust Inferer, Device Inventory Service, Access Control Engine, Access Policy, Gateways, and Resources. The following list defines each term as it is used by BeyondCorp:
如图1所示,BeyondCorp系统的基本组件包括:信任评估器,设备清单服务、访问控制引擎、访问策略、网关和资源。以下列表定义了BeyondCorp使用的每个术语:
◆◆Access requirements are organized into Trust Tiers representing levels of increasing sensitivity.
访问需求是随着敏感程度增加,被抽象组织成不同信任层级。
◆◆Resources are an enumeration of all the applications, services, and infrastructure that are subject to access control. Resources might include anything from online knowledge bases, to financial databases, to link-layer connectivity, to lab networks. Each resource is associated with a minimum trust tier required for access.
资源是受访问控制影响的所有应用程序、服务、基础结构的枚举。资源可能包括在线知识库、财务数据库、链路层连接、实验网络。 每个资源都与访问所需的最低信任级别相关联。
◆◆The Trust Inferer is a system that continuously analyzes and annotates device state. The system sets the maximum trust tier accessible by the device and assigns the VLAN to be used by the device on the corporate network. These data are recorded in the Device Inventory Service. Reevaluations are triggered either by state changes or by a failure to receive updates from a device.
信任评估器是一个持续分析、标注设备状态的系统。系统设置设备可访问的最大信任层级,并在公司网络上分配设备要使用的VLAN。这些数据记录在设备资产服务中。 状态变化或更新失败会触发重新评估。
◆◆The Access Policy is a programmatic representation of the Resources, Trust Tiers, and other predicates that must be satisfied for successful authorization.
访问策略是资源、信任层级、必须满足其他成功条件 的程序化描述。
◆◆The Access Control Engine is a centralized policy enforcement service referenced by each gateway that provides a binary authorization decision based on the access policy, output of the Trust Inferer, the resources requested, and real-time credentials.
访问控制引擎是全部网关引用的集中化策略的实施服务,它基于访问策略、信任评估器、资源需求、实时凭证提供二进制授权决策。
◆◆At the heart of this system, the Device Inventory Service continuously collects, processes, and publishes changes about the state of known devices.
该系统的核心,设备清单服务不断收集,处理和发布有关已知设备状态的更改。
◆◆Resources are accessed via Gateways, such as SSH servers, Web proxies, or 802.1x-enabled networks. Gateways perform authorization actions, such as enforcing a minimum trust tier or assigning a VLAN.
资源是可通过网关访问资源(如SSH服务、Web代理、802.1x网络),网关提供授权(如分配最低信任层、分配VLAN)的访问对象。
Figure 1: Architecture of the BeyondCorp Infrastructure Components
图1:BeyondCorp基础架构组件的体系结构
Components of BeyondCorp
BeyondCorp的组件
Using the components described below, BeyondCorp integrated various preexisting systems with new systems and components to enable flexible and granular trust decisions.
使用以下组件,BeyondCorp将各种已存在的系统与新的系统和组件集成在一起,实现灵活而精细的信任决策。
Devices and Hosts
设备和主机
An inventory is the primary prerequisite to any inventory-based access control. Depending on your environment and security policy, you may need to make a concerted effort to distinguish between devices and hosts. A device is a collection of physical or virtual components that act as a computer, whereas a host is a snapshot of the state of a device at a given point in time. For example, a device might be a laptop or a mobile phone, while a host would be the specifics of the operating system and software running on that device. The Device Inventory Service contains information on devices, their associated hosts, and trust decisions for both. In the sections below, the generic term “device” can refer to either a physical device or a host, depending on the configuration of the access policy. After a basic inventory has been established, the remainder of the components discussed below can be deployed as desired in order to provide improved security, coverage, granularity, latency, and flexibility.
设备库存信息是任何访问控制的前提。根据环境和安全策略,需要不断尽力区分设备device和主机host。计算机设备(device)是物理或虚拟组件的集合,而主机(host)是给定时间点设备状态的快照。例如,设备可能是笔记本电脑或移动电话,而主机则是该设备上运行的操作系统和软件的详细信息。设备清单服务记录了设备、主机以及两者的信任决策的信息。在以下各节中,通用术语“设备”可以指物理设备或主机,具体取决于访问策略的配置。在这个对应关系库建立后,可以根据需要部署其余组件,来满足安全性、覆盖范围、颗粒度、延迟和灵活性问题。
Tiered Access
分层访问
Trust levels are organized into tiers and assigned to each device by the Trust Inferer. Each resource is associated with a minimum trust tier required for access. In order to access a given resource, a device’s trust tier assignment must be equal to or greater than the resource’s minimum trust tier requirement. To provide a simplified example, consider the use cases of various employees of a catering company: a delivery crew may only require a low tier of access to retrieve the address of a wedding, so they don’t need to access more sensitive services like billing systems.
信任级别按层级划分,并由信任判断器分配给每个设备。每个资源都与最小信任层相关联。为了访问目标资源,访问设备分配的信任层必须等于或大于资源的最低信任层要求。以餐饮公司工种为例:送货员只需要一个较低信任层来获取婚礼的地址,因此他们不需要访问更(高级的)敏感计费系统服务。
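The tier comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's implementation: the tier names, the `Resource` class, and the `can_access` function are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class TrustTier(IntEnum):
    """Hypothetical tier names; a higher value means more trust."""
    UNTRUSTED = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Resource:
    name: str
    min_tier: TrustTier  # minimum trust tier required for access


def can_access(device_tier: TrustTier, resource: Resource) -> bool:
    # Access is granted only when the device's assigned tier is equal to
    # or greater than the resource's minimum tier requirement.
    return device_tier >= resource.min_tier


# The catering example: a delivery crew's device needs only a low tier.
wedding_address = Resource("wedding-address-lookup", TrustTier.LOW)
billing = Resource("billing-system", TrustTier.HIGH)
delivery_device_tier = TrustTier.LOW
```

With these definitions, `can_access(delivery_device_tier, wedding_address)` is true while `can_access(delivery_device_tier, billing)` is false, which mirrors the delivery crew case.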
Assigning the lowest tier of access required to complete a request has several advantages: it decreases the maintenance cost associated with highly secured devices (which primarily entails the costs associated with support and productivity) and also improves the usability of the device. As a device is allowed to access more sensitive data, we require more frequent tests of user presence on the device, so the more we trust a given device, the shorter-lived its credentials. Therefore, limiting a device’s trust tier to the minimum access requirement it needs means that its user is minimally interrupted. We may require installation of the latest operating system update within a few business days to retain a high trust tier, whereas devices on lower trust tiers may have slightly more relaxed timelines.
分配设备所需的最低访问层有几个优点:降低了高安全级别设备的维护成本,提高了设备的可用性。由于允许设备访问更敏感的数据,我们将更频繁地测试设备上的用户是否存在,因此我们设备信任层级越高,其凭据的有效期就越短,因此,(由于频繁的检测授权)信任层限制为其所需的最低访问要求,使得其用户的中断最少。为保留高信任级别,最新的操作系统更新需要再几天内完成安装,而低信任级别的设备将有稍微宽松的时间表。
To provide another example, a laptop that’s centrally managed by the company but that hasn’t been connected to a network for some period of time may be out of date. If the operating system is missing some noncritical patches, trust can be downgraded to an intermediate tier, allowing access to some business applications but denying access to others. If a device is missing a critical security patch, or its antivirus software reports an infection, it may only be allowed to contact remediation services. On the furthest end of the spectrum, a known lost or stolen device can be denied access to all corporate resources.
再举个例子,一台由公司集中管理的电脑由于一段时间未接入网络导致(接入策略)已经过时了。如果操作系统缺少一些非关键补丁,信任级别将降级到中间层,允许访问某些业务应用,但拒绝其他访问。若设备缺少关键的安全补丁,或其防病毒软件发现异常,则可能只允许其访问补救服务。在最差的情况下,已知丢失或被盗的设备将被拒绝访问所有公司资源。
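The downgrade scenarios in this example can be expressed as an ordered set of checks. The sketch below is a simplification under assumptions: the `Tier` names, the `DeviceState` fields, and the `infer_tier` function are hypothetical, and a real Trust Inferer would reference many more fields.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    NONE = 0          # no access to corporate resources
    REMEDIATION = 1   # may only contact remediation services
    INTERMEDIATE = 2  # some business applications
    FULL = 3


@dataclass
class DeviceState:
    reported_lost_or_stolen: bool = False
    infection_reported: bool = False
    missing_critical_patch: bool = False
    missing_noncritical_patches: bool = False


def infer_tier(state: DeviceState) -> Tier:
    # Checks run from most to least severe, so the worst finding wins.
    if state.reported_lost_or_stolen:
        return Tier.NONE
    if state.infection_reported or state.missing_critical_patch:
        return Tier.REMEDIATION
    if state.missing_noncritical_patches:
        return Tier.INTERMEDIATE
    return Tier.FULL
```

Ordering the checks by severity keeps the result deterministic when a device matches several conditions at once, for example a stolen laptop that is also missing patches.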
In addition to providing tier assignments, the Trust Inferer also supports network segmentation efforts by annotating which VLANs a device may access. Network segmentation allows us to restrict access to special networks—lab and test environments, for example—based on the device state. When a device becomes untrustworthy, we can assign it to a quarantine network that provides limited resource access until the device is rehabilitated.
除了提供层分配之外,信任判断器还通过标注设备可以访问的VLAN来支持网络分段工作。基于设备状态的网络分段允许我们限制对特殊实验环境网络、测试环境网络的访问。当设备变得不可信时,我们可以将其分配给隔离网络,该网络提供有限的资源访问,直到设备恢复。
Device Inventory Service
设备清单服务
The Device Inventory Service (shown in Figure 2) is a continuously updated pipeline that imports data from a broad range of sources. Systems management sources might include Active Directory, Puppet, and Simian. Other on-device agents, configuration management systems, and corporate asset management systems should also feed into this pipeline. Out-of-band data sources include vulnerability scanners, certificate authorities, and network infrastructure elements such as ARP tables. Each data source sends either full or incremental updates about devices.
设备清单服务(如图2所示)是一个持续更新、从各种来源导入数据的管道服务。获取数据来源包括AD、Puppet(运维工具)、Simian(工具)、设备上的agent、配置管理系统、企业资产管理系统;带外数据源包括漏洞扫描信息、证书颁发机构、网络基础数据(如ARP表)。每个数据源都发送设备的完整数据或增量更新。
Since implementing the initial phases of the Device Inventory Service, we’ve ingested billions of deltas from over 15 data sources, at a typical rate of about three million per day, totaling over 80 terabytes. Retaining historical data is essential in allowing us to understand the end-to-end lifecycle of a given device, track and analyze fleet-wide trends, and perform security audits and forensic investigations.
设备资产服务初始化运用后,我们已经从超过15个数据源中摄取了数十亿个增量数据,平均每天大约300万个,总计超过80万亿字节。保留历史数据对于让我们了解设备的端到端生命周期至关重要,跟踪和分析整体趋势,执行安全审计和调查取证。
Figure 2: Device Inventory Service
图2:设备清单服务
Types of Data
数据类型
Data come in two main flavors: observed and prescribed.
数据主要来自两个方面:监控的数据和指定的数据。
Observed data are programmatically generated and include items such as the following:
监控数据通过程序生成,包括:
◆◆The last time a security scan was performed on the device, in addition to the results of the scan
上次对设备执行安全扫描的时间、扫描结果
◆◆The last-synced policies and timestamp from Active Directory
上次从AD同步的策略和时间戳
◆◆OS version and patch level
操作系统版本和补丁级别
◆◆Any installed software
已经安装的全部软件
Prescribed data are manually maintained by IT Operations and include the following:
指定数据由IT运营部门手动维护,包括:
◆◆The assigned owner of the device
设备的所有者
◆◆Users and groups allowed to access the device
用户和组允许访问设备
◆◆DNS and DHCP assignments
DNS和DHCP分配
◆◆Explicit access to particular VLANs
明确到特定VLAN的访问策略
Explicit assignments are required in cases of insufficient data or when a client platform isn’t customizable (as is the case for printers, for example). In contrast to the change rate that characterizes observed data, prescribed data are typically static. We analyze data from numerous disparate sources to identify cases where data conflict, as opposed to blindly trusting a single or small number of systems as truth.
在数据不足或客户端平台不可自定义的情况下(例如打印机),需要显式分配。与不断变化的观测数据不同,指定数据通常是静态的。我们分析来自多个不同来源的数据,以识别数据相互冲突的情况,而不是盲目地把某一个或少数几个系统当作事实来源。
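One way to picture the cross-source comparison described above is a function that groups reported values by field and flags the fields on which sources disagree. This is a minimal sketch; the source names, field names, and the `find_conflicts` helper are assumptions made for illustration.

```python
from collections import defaultdict


def find_conflicts(reports):
    """Return {field: {value: [sources]}} for fields whose sources disagree.

    `reports` maps a source name to the fields it reported for one device.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for source, fields in reports.items():
        for field, value in fields.items():
            seen[field][value].append(source)
    # A field conflicts when more than one distinct value was reported.
    return {
        field: {value: sorted(sources) for value, sources in values.items()}
        for field, values in seen.items()
        if len(values) > 1
    }


reports = {
    "asset_management": {"owner": "alice", "os_version": "14.2"},
    "config_agent": {"os_version": "14.2"},
    "directory": {"owner": "bob"},
}
conflicts = find_conflicts(reports)
```

Here the two sources that report `os_version` agree, so only `owner` is flagged. Surfacing the disagreement, rather than picking one source as authoritative, is the point of the comparison.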
Data Processing
数据处理
TRANSFORMATION INTO A COMMON DATA FORMAT
转换为通用数据格式
Several phases of processing are required to keep the Device Inventory Service up to date. First, all data must be transformed into a common data format. Some data sources, such as in-house or open source solutions, can be tooled to publish changes to the inventory system on commit. Other sources, particularly those that are third party, cannot be extended to publish changes and therefore require periodic polling to obtain updates.
要使设备存储服务保持最新,需要几个处理阶段。首先,必须将所有数据转换为通用数据格式。一些如内部或开源解决方案,可将变更数据提交到库存系统。其他来源系统,特别是第三方系统,无法扩展以提交更新数据,则需要通过定期轮询来获取更新。
CORRELATION
相关性
Once the incoming data are in a common format, all data must be correlated. During this phase, the data from distinct sources must be reconciled into unique device-specific records. When we determine that two existing records describe the same device, they are combined into a single record. While data correlation may appear straightforward, in practice it becomes quite complicated because many data sources don’t share overlapping identifiers.
一旦传入的数据采用通用格式,所有数据就需要互相关联。在此阶段中,必须将来自不同来源的数据调整为一个设备的记录,当我们确定两个现有记录描述同一设备时,它们被组合成一个记录。数据关联看起来很简单,因为许多数据源不共享重叠的键值(key),实际关联相当复杂。
For example, it may be that the asset management system stores an asset ID and a device serial number, but disk encryption escrow stores a hard drive serial number, the certificate authority stores a certificate fingerprint, and an ARP database stores a MAC address. It may not be clear that deltas from these individual systems describe the same device until an inventory reporting agent reports several or all of these identifiers together, at which point the disjoint records can be combined into a single record.
例如,资产管理系统可以存储资产ID和设备序列号,但磁盘加密托管存储硬盘序列号,证书颁发机构存储证书指纹,ARP数据库存储MAC地址。在库存上报agent上报这些标识符之前,不清楚这些单独系统中的增量是否描述同一设备,此时不相交的记录可以合并为一个记录。
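The merging behavior in this example can be modeled with a union-find structure over shared identifiers. The sketch below swaps in that generic technique for illustration; the source does not describe the actual correlation algorithm, and the `correlate` function and sample identifiers are hypothetical.

```python
def correlate(records):
    """Group records that share any (identifier type, value) pair.

    `records` is a list of dicts mapping identifier type to value.
    Returns a list of sorted index groups, one per inferred device.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # (identifier type, value) -> index of first record
    for index, record in enumerate(records):
        for key in record.items():
            if key in first_seen:
                # Shared identifier: merge the two groups.
                parent[find(index)] = find(first_seen[key])
            else:
                first_seen[key] = index

    groups = {}
    for index in range(len(records)):
        groups.setdefault(find(index), []).append(index)
    return sorted(groups.values())


records = [
    {"asset_id": "A-100", "device_serial": "SN-1"},  # asset management
    {"disk_serial": "HD-9"},                         # disk encryption escrow
    {"cert_fingerprint": "ab:cd"},                   # certificate authority
    {"mac": "00:11:22:33:44:55"},                    # ARP database
]
before = correlate(records)

# An inventory agent reports several identifiers together, linking them.
agent_report = {
    "device_serial": "SN-1",
    "disk_serial": "HD-9",
    "cert_fingerprint": "ab:cd",
    "mac": "00:11:22:33:44:55",
}
after = correlate(records + [agent_report])
```

Before the agent report arrives, the four records stay disjoint; once it is added, all five collapse into one group. Exact-match merging like this does not handle swapped parts or mistyped identifiers, which is where the harder heuristics come in.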
The question of what, exactly, constitutes a device becomes even more muddled when you factor in the entire lifecycle, during which hard drives, NICs, cases, and motherboards may be replaced or even swapped among devices. Even more complications arise if data are manually entered incorrectly.
当你考虑到在整个生命周期中,哪些硬盘、网卡、机箱和主板可能被替换,或在设备之间交换时,‘究竟是什么构成了一个设备的问题’更难判断。若手动输入的数据错误,则会出现更复杂的情况。
TRUST EVALUATION
信任评估
Once the incoming records are merged into an aggregate form, the Trust Inferer is notified to trigger reevaluation. This analysis references a variety of fields and aggregates the results in order to assign a trust tier. The Trust Inferer currently references dozens of fields, both platform-specific and platform-agnostic, across various data sources; millions of additional fields are available for analysis as the system continues to evolve. For example, to qualify for a high level of trust, we might require that a device meets all (or more) of the following requirements:
一旦传入的记录合并到聚合表单中,就会通知信任评估器触发重新评估。分析参考各种字段并聚合结果以分配信任层。信任评估器目前在不同的数据源中引用数十个字段,包括平台特定的自动和一些平台不可知的字段;随着系统发展,还有数百万个字段可供分析。例如,要获得高级别的信任,我们可能需要设备满足以下所有(或更多)要求:
◆◆Be encrypted
加密
◆◆Successfully execute all management and configuration agents
成功执行所有管理和配置代理
◆◆Install the most recent OS security patches
安装最新的操作系统安全补丁
◆◆Have a consistent state of data from all input sources
所有输入源的数据状态一致
This precomputation reduces the amount of data that must be pushed to the gateways, as well as the amount of computation that must be expended at access request time. This step also allows us to be confident that all of our enforcement gateways are using a consistent data set. We can make trust changes even for inactive devices at this stage. For example, in the past, we denied access for any devices that may have been subject to Stagefright [2] before such devices could even make an access request. Precomputation also provides us with an experiment framework in which we can write pre-commit tests to validate changes and canary small-percentage changes to the policy or Trust Inferer without impacting the company as a whole.
预计算减少了必须推送到网关的数据量,以及在访问请求时必须计算量。这一步,我们所有的强制策略网关使用一致的数据集。这个阶段,我们可以更改非活动设备的信任级别。在过去的情况下,在设备发起访问需求之前,我们拒绝访问任何处于Stagefright [2]影响的设备。预计算还为我们提供了一个实验框架,可以编写预策略测试,来验证策略或信任评估器的变更、小范围变更测试,从而使之不影响整个公司。
Of course, precomputation also has its downsides and can’t be relied on completely. For example, the access policy may require real-time two-factor authentication, or accesses originating from known-malicious netblocks may be restricted. Somewhat surprisingly, latency between a policy or device state change and the ability of gateways to enforce this change hasn’t proven problematic. Our update latency is typically less than a second. The fact that not all information is available to precompute is a more substantial concern.
预计算也有其缺点,不能完全依赖。例如,访问策略可能需要实时的双因素身份验证,来自已知恶意网段的访问将受到限制。超出预料的是,经过试验证明,策略或设备状态变更和网关执行更改之间的延迟没有问题。我们的更新延迟通常小于一秒。注意,不是所有的信息都可用于预计算,这点更为重要。
EXCEPTIONS
例外
The Trust Inferer has final say on what trust tier to apply to a given device. Trust evaluation considers preexisting exceptions in the Device Inventory Service that allow for overrides to the general access policy. Exceptions are primarily a mechanism aimed at reducing the deployment latency of policy changes or new policy primitives. In these cases, the most expedient course of action may be to immediately block a particular device that’s vulnerable to a zero-day exploit before the security scanners have been updated to look for it, or to permit untrusted devices to connect to a lab network. Internet of Things devices may be handled by exceptions and placed in their own trust tier, as installing and maintaining certificates on these devices could be infeasible.
信任评估器对设备信任层级分配有最终决定权。信任评估中,设备清单中预设例外策略的优先级高于于一般访问策略。例外策略主要是为了减少策略更改和新策略生效延迟。这种情况下,扫描器报告更新策略前,立即阻止易受零日利用的特定设备;允许不受信任的设备连接到实验室网络,如物联网设备也可以通过例外处理并放置在自己的信任层中(物联网设备上不能安装和维护证书)。
Deployment
部署
Initial Rollout
初始推出
The first phase of the BeyondCorp rollout integrated a subset of gateways with an interim meta-inventory service. This service comprised a small handful of data sources containing predominantly prescribed data. We initially implemented an access policy that mirrored Google’s existing IP-based perimeter security model, and applied this new policy to untrusted devices, leaving access enforcement unchanged for devices coming from privileged networks. This strategy allowed us to safely deploy various components of the system before it was fully complete and polished and without disturbing users.
BeyondCorp推出的第一阶段集成了一个有临时元清单(meta-inventory)服务的网关。此服务包含少量数据源,主要是指定数据(prescribed)。Google最初实现了一个访问策略映射了Google现有的基于IP的外网安全模型,并将此策略应用于不受信任的设备,使来自特权网络的设备的访问控制保持不变。 这个策略让我们在系统完善之前安全地部署各个组件,而不会干扰用户。
In parallel with this initial rollout, we designed, developed, and continue to iterate a higher-scale, lower-latency meta-inventory solution. This Device Inventory Service aggregates data from over 15 sources, ingesting between 30 and 100 changes per second, depending on how many devices are actively generating data. It is replete with trust eligibility annotation and authorization enforcement for all corporate devices. As the meta-inventory solution progressed and we obtained more information about each device, we were able to gradually replace IP-based policies with trust tier assignments. After we verified the workflows of lower-tiered devices, we continued to apply fine-grained restrictions to higher trust tiers, proceeding to our ultimate goal of retroactively increasing trust tier requirements for devices and corporate resources over time.
在最初推出的同时,我们设计开发并继续迭代更高规模、低延迟的元清单解决方案。此设备资源清单服务聚合了大于15个源的数据,根据设备生成数据情况,每秒接收30到100个数据变更。它记录了对所有公司设备的信任资格注释和授权执行。随着元清单解决方案的推进,我们获得了每个设备的更多信息,从而能够逐渐用信任层分配来取代基于IP的策略。在验证了低信任层设备的工作流程之后,我们继续对较高的信任层进行细粒度限制,以实现我们的最终目标:随着时间的推移,对公司的设备和资源追溯性地增信任层级。
Given the aforementioned complexity of correlating data from disparate sources, we decided to use an X.509 certificate as a persistent device identifier. This certificate provides us with two core functionalities:
由于上述信任关联数据来源广泛、类型复杂,我们使用X.509证书作为持久设备标识符。此证书为我们提供了两个核心功能:
◆◆If the certificate changes, the device is considered a different device, even if all other identifiers remain the same.
如果证书发生变化,则该设备被视为不同的设备。即便其他标识符保持不变。
◆◆If the certificate is installed on a different device, the correlation logic notices both the certificate collision and the mismatch in auxiliary identifiers, and degrades the trust tiers in response.
如果证书安装在不同的设备上,则相关逻辑发现辅助标识符中的证书冲突(不匹配),并相应降低信任层。
Thus, the certificate does not remove the necessity of correlation logic; nor is it sufficient to gain access in and of itself. However, it does provide a cryptographic GUID which enforcement gateways use both to encrypt traffic and to consistently and uniquely refer to the device.
因此,证书并没有消除相关逻辑的必要性;它本身也不获得访问权限。但是,它提供了一个加密GUID(用于加密通信量的标识设备的强制网关方法)。
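The two certificate behaviors listed above can be sketched as a small classification function. The `evaluate_certificate` function, its return labels, and the shape of the `known` mapping are assumptions made for illustration; the source does not describe the correlation logic at this level of detail.

```python
def evaluate_certificate(known, cert_fingerprint, observed_ids):
    """Classify a report against previously recorded certificate bindings.

    `known` maps a certificate fingerprint to the auxiliary identifiers
    recorded for the device that certificate was issued to.
    """
    if cert_fingerprint not in known:
        # A different certificate means a different device, even if all
        # other identifiers match an existing record.
        return "new_device"
    expected = known[cert_fingerprint]
    mismatched = {
        name
        for name, value in observed_ids.items()
        if name in expected and expected[name] != value
    }
    if mismatched:
        # Same certificate, different auxiliary identifiers: the
        # certificate may have been installed on another device.
        return "degrade_trust"
    return "same_device"


known = {"fp-1": {"device_serial": "SN-1", "mac": "00:11:22:33:44:55"}}
```

Note that a `same_device` result only identifies the device. As the text says, the certificate alone is not sufficient to gain access; the trust tier still has to meet the resource's requirement.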
Mobile
手机
Because Google seeks to make mobile a first-class platform, mobile must be able to accomplish the same tasks as other platforms and therefore requires the same levels of access. It turns out that deploying a tiered access model tends to be easier when it comes to mobile as compared to other platforms: mobile is typically characterized by a lack of legacy protocols and access methods, as almost all communications are exclusively HTTP-based. Android devices use cryptographically secured communications allowing identification of the device in the device inventory. Note that native applications are subject to the same authorization enforcement as resources accessed by a Web browser; this is because API endpoints also live behind proxies that are integrated with the Access Control Engine.
因为谷歌希望让手机成为一流的平台,手机必须能够完成与其他平台相同的任务,因此需要相同的访问级别。事实证明,与其他平台相比,部署分层访问模型在手机上更容易实现:移动的典型特征是缺乏传统的协议和访问方法,因为几乎所有通信都是基于HTTP的。Android设备使用加密的安全通信,允许在设备清单中识别设备。手机的应用程序与Web浏览器访问的资源受相同的授权强制;因为API也位于与访问控制引擎集成之后。
Legacy and Third-Party Platforms
传统平台和第三方平台
We determined that legacy and third-party platforms need a broader set of access methods than we require for mobile devices. We support the tunneling of arbitrary TCP and UDP traffic via SSH tunnels and on-client SSL/TLS proxies. However, gateways only allow tunneled traffic that conforms with the policies laid out in the Access Control Engine. RADIUS [3] is one special case: it is also integrated with the device inventory, but it receives VLAN assignments rather than trust-tier eligibility semantics from the Trust Inferer. At network connection time, RADIUS dynamically sets the VLAN by referencing Trust Inferer assignments using the certificate presented for 802.1x as the device identifier.
传统平台和第三方平台需要比移动设备更广泛的访问方法集。我们支持通过任意TCP和UDP通信进行ssh/ssl/tls隧道传输。网关只允许符合访问控制引擎中设定的策略联通。RADIUS[3]是一种特殊情况:它也与设备资源清单集成,但它从信任评估器接收分配的VLAN信息而不是信任层信息。在网络连接时,Radius使用802.1x提供的证书作为设备标识符,通过引用信任评估器动态设置VLAN。
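The RADIUS special case amounts to a lookup keyed by the certificate presented for 802.1x. The sketch below shows only that lookup, not a RADIUS server; the `select_vlan` function, the VLAN numbers, and the fallback to a quarantine VLAN for unknown certificates are assumptions.

```python
QUARANTINE_VLAN = 999  # hypothetical VLAN ID for quarantined devices


def select_vlan(inferer_assignments, cert_fingerprint):
    """Return the VLAN for the device presenting this 802.1x certificate.

    `inferer_assignments` maps certificate fingerprint -> VLAN ID, as
    precomputed by the Trust Inferer.
    """
    # Assumption: a device without an assignment falls back to quarantine.
    return inferer_assignments.get(cert_fingerprint, QUARANTINE_VLAN)


assignments = {"fp-1": 120, "fp-2": 340}
```

Because the assignment is precomputed, the decision at connection time is a constant-time lookup rather than a fresh trust evaluation.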
Avoiding User Disruptions
避免用户中断
One of our biggest challenges in deploying BeyondCorp was figuring out how to accomplish such a massive undertaking without disrupting users. In order to craft a strategy, we needed to identify existing workflows. From the existing workflows, we identified:
在部署BeyondCorp时,我们最大的挑战之一是如何在不干扰用户的情况下完成如此庞大的任务。为了制定战略,我们需要确定现有的工作流程。从现有的工作流程中,我们确定:
◆◆Which workflows we could make compliant with an unprivileged network
哪些工作流程可以与非特权网络兼容
◆◆Which workflows either permitted more access than desirable or allowed users to circumvent restrictions that were already in place
哪些工作流允许比期望拥有更多的访问,或者允许用户绕过已经存在的限制
To make these determinations, we followed a two-pronged approach. We developed a simulation pipeline that examined IP-level metadata, classified the traffic into services, and applied our proposed network security policy in our simulated environment. In addition, we translated the security policy into each platform’s local firewall configuration language. While on the corporate network, this measurement allowed us to log traffic metadata destined for Google corporate services that would cease to function on an unprivileged network. We found some surprising results, such as services that had supposedly been decommissioned but were still running with no clear purpose.
为了作这些决定,我们采取了双管齐下的办法。我们开发了一个模拟管道来检查IP级别的元数据,将流量分类为服务,并将我们提出的网络安全策略应用到我们的模拟环境中。此外,我们还将安全策略转换为每个平台的本地防火墙配置。在公司网络上,这种测量方法允许我们记录目的地为Google公司服务的流量元数据,这些服务将不能在没有特权的网络上访问。我们发现了一些令人惊讶的结果:许多本应回收的服务策略仍在运行。
After collecting this data, we worked with service owners to migrate their services to a BeyondCorp-enabled gateway. While some services were straightforward to migrate, others were more difficult and required policy exceptions. However, we made sure that all service owners were held accountable for exceptions by associating a programmatically enforced owner and expiration with each exception. As more services are updated and more users work for extended periods of time without exercising any exceptions, the users’ devices can be assigned to an unprivileged VLAN. With this approach, users of noncompliant applications are not overly inconvenienced; the pressure is on the service providers and application developers to configure their services correctly.
在收集这些数据之后,我们与服务所有者一起将其服务迁移到一个支持BeyondCorp的网关下。有些服务很容易迁移,但有的服务需要例外策略,迁移则比较困难。我们以程序记录方式强制执行的异常与策略所有者关联,确保所有服务所有者都对异常负责。随着越来越多的服务被更新,越来越多的用户长时间工作而没有任何异常,用户的设备可以被分配到一个没有特权的VLAN。使用这种方法,不兼容应用程序的用户不会太不方便;服务提供商和应用程序开发人员面临着必须正确配置其服务的压力。
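A programmatically enforced owner and expiration could look like the following sketch. The `PolicyException` class, its fields, and `active_exceptions` are hypothetical names chosen for illustration, and the dates are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PolicyException:
    service: str
    owner: str     # accountable service owner
    expires: date  # last day on which the exception is honored

    def __post_init__(self):
        # An exception cannot be created without an accountable owner.
        if not self.owner:
            raise ValueError("every exception needs an accountable owner")


def active_exceptions(exceptions, today):
    # Expired exceptions drop out automatically instead of waiting for
    # someone to review and remove them.
    return [e for e in exceptions if e.expires >= today]


exceptions = [
    PolicyException("legacy-wiki", "team-docs", date(2024, 3, 31)),
    PolicyException("lab-console", "team-lab", date(2024, 12, 31)),
]
still_active = active_exceptions(exceptions, date(2024, 6, 1))
```

Making expiry the default outcome puts the burden on the service owner to renew or migrate, which matches the accountability goal described above.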
The exceptions model has resulted in an increased level of complexity in the BeyondCorp ecosystem, and over time, the answer to “why was my access denied?” has become less obvious. Given the inventory data and real-time request data, we need to be able to ascertain why a specific request failed or succeeded at a specific point in time. The first layer of our approach in answering this question has been to craft communications to end users (warning of potential problems, and how to proceed with self-remediation or contact support) and to train IT Operations staff. We also developed a service that can analyze the Trust Inferer’s decision tree and chronological history of events affecting a device’s trust tier assignment in order to propose steps for remediation. Some problems can be resolved by users themselves, without engaging support staff with elevated privileges. Users who have preserved another chain of trust are often able to self-remediate. For example, if a user believes his or her laptop has been improperly evaluated but still has a phone at a sufficient trust tier, we can forward the diagnosis request to the phone for evaluation.
例外模型导致BeyondCorp生态系统的复杂程度增加,随着时间的推移,“为什么我的访问被拒绝?”这个问题的答案变得越来越不直观。基于库存数据和实时请求数据,我们需要能够确定特定请求在特定时间点失败或成功的原因。解决这个问题的第一层方法是精心准备面向终端用户的沟通内容(提示潜在问题,以及如何自行补救或联系支持),并培训IT运营人员。我们还开发了一个服务来分析信任评估器的决策树,以及影响设备信任层分配的事件的时间顺序记录,以便提出补救步骤。有些问题用户可以自己解决,不需要具有更高权限的支持人员介入。保留了另一条信任链的用户通常能够自我补救。例如,如果用户认为自己的笔记本电脑被错误评估,但仍有一部手机处于足够的信任层,我们可以将诊断请求转发给手机进行评估。
Challenges and Lessons Learned
挑战和经验教训
Data Quality and Correlation
数据质量和相关性
Poor data quality in asset management can cause devices to unintentionally lose access to corporate resources. Typos, transposed identifiers, and missing information are all common occurrences. Such mistakes may happen when procurement teams receive asset shipments and add the assets to our systems, or may be due to errors in a manufacturer’s workflow. Data quality problems also originate quite frequently during device repairs, when physical parts or components of a device are replaced or moved between devices. Such issues can corrupt device records in ways that are difficult to fix without manually inspecting the device. For example, a single device record might actually contain data for two unique devices, but automatically fixing and splitting the data may require physically reconciling the asset tags and motherboard serial numbers.
资产管理中坏数据可能导致设备失去对公司资源的访问权。输入错误、转置的标识符和丢失的信息都会引发异常。错误可能发生在采购团队收到资产并将资产录入到系统时或者由于制造商在工作流程中出现差错时。在设备维修期间,设备的物理部件或组件在设备间变化时,数据错误问题也会产生。如果不手动检查设备这些问题会很难修复,从而导致设备记录出问题。例如,单个设备记录可能实际包含两个唯一标识设备的数据,但自动修复数据、拆分数据需要物理资产标签和主板序列号。
The most effective solutions in this arena have been to find local workflow improvements and automated input validation that can catch or mitigate human error at input time. Double-entry accounting helps, but doesn’t catch all cases. However, the need for highly accurate inventory data in order to make correct trust evaluations forces a renewed focus on inventory data quality. Our data are the most accurate they’ve ever been, and this accuracy has had secondary security benefits. For example, the percentage of our fleet that is updated with the latest security patches has increased.
针对这个问题,最有效的解决方案是:找到改进和自动输入本地工作流程,这样来捕获或减少人为输入错误,复式记账法对此也有帮助,但不能覆盖所有情况。然而,为了进行正确的信任评估,对高精度库存数据的需求迫使人们重新关注库存数据的质量。我们的数据是有史以来最精确的,这种精确性也产生了安全效益。例如,我们团队中最新补丁更新率已经增加。
Sparse Data Sets
稀疏数据集
As mentioned previously, upstream data sources don’t necessarily share overlapping device identifiers. To enumerate a few potential scenarios: new devices might have asset tags but no hostnames; the hard drive serial might be associated with different motherboard serials at different stages in the device lifecycle; or MAC addresses might collide. A reasonably small set of heuristics can correlate the majority of deltas from a subset of data sources. However, in order to drive accuracy closer to 100%, you need an extremely complex set of heuristics to account for a seemingly endless number of edge cases. A tiny fraction of devices with mismatched data can potentially lock hundreds or even thousands of employees out of applications they need to be productive. In order to mitigate such scenarios, we monitor and verify that a set of synthetic records in our production pipeline, crafted to verify trust evaluation paths, result in the expected trust tier results.
如前所述,上游数据源不需要共享重叠的设备标识符。例如:新设备可能有资产标签,但没有主机名;硬盘驱动器序列号可能在设备生命周期的不同阶段与不同的主板序列号关联;或者MAC地址可能冲突。一个相当小的启发式集合可以关联来自数据源子集的大多数特征。然而,为了使准确率接近100%,需要一套极其复杂的启发式算法来解释大量的边缘案例。少量数据不匹配的设备可能会导致大量员工不能访问高频使用的应用程序。为了缓解这种情况,我们监控并检查生产管道中的一组合成记录,精心设计以验证信任评估路径,从而产生预期的信任层结果。
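The synthetic-record check described above behaves like a regression test that runs against the live pipeline. The sketch below illustrates the idea only; `check_synthetic_records`, the toy evaluation rules, and the record fields are hypothetical stand-ins for the production trust evaluation.

```python
def check_synthetic_records(evaluate, synthetic_records):
    """Return (name, expected, actual) for every synthetic record whose
    evaluated tier differs from the tier it was crafted to produce."""
    failures = []
    for name, (record, expected_tier) in synthetic_records.items():
        actual = evaluate(record)
        if actual != expected_tier:
            failures.append((name, expected_tier, actual))
    return failures


def toy_evaluate(record):
    # Stand-in for the production trust evaluation (hypothetical rules).
    if not record["encrypted"]:
        return "low"
    return "high" if record["patched"] else "medium"


# Each synthetic record exercises one evaluation path.
synthetic = {
    "fully-compliant": ({"encrypted": True, "patched": True}, "high"),
    "unpatched": ({"encrypted": True, "patched": False}, "medium"),
    "unencrypted": ({"encrypted": False, "patched": True}, "low"),
}
failures = check_synthetic_records(toy_evaluate, synthetic)
```

An empty `failures` list means every crafted path still yields its expected tier. A non-empty list after a policy or heuristic change is a signal to stop before real devices are affected.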
Pipeline Latency
管道延迟
Since the Device Inventory Service ingests data from several disparate data sources, each source requires a unique implementation. Sources that were developed in-house or are based on open source tools are generally straightforward to extend in order to asynchronously publish deltas to our existing pipeline. Other sources must be periodically polled, which requires striking a balance between frequency of polling and the resulting server load. Even though delivery to gateways typically takes less than a second, when polling is required, changes might take several minutes to register. In addition, pipeline processing can add latency of its own. Therefore, data propagation needs to be streamlined.
由于设备资源清册服务从几个不同的数据源接收数据,因此每个源都需要一个唯一的实现方式。自研项目或开源工具通常很容易扩展,以便异步地将数据发布到我们现有的管道中。其他数据源必须定期对其进行轮询,这需要在轮询频率和服务器负载之间取得平衡。尽管传送到网关通常需要不到一秒钟的时间,轮询时,数据的变更需要几分钟才能注册生效。此外,管道处理也有延迟。因此,需要简化数据传递。
Communication
沟通
Fundamental changes to the security infrastructure can potentially adversely affect the productivity of the entire company’s workforce. It’s important to communicate the impact, symptoms, and available remediation options to users, but it can be difficult to find the balance between over-communication and under-communication. Under-communication results in surprised and confused users, inefficient remediation, and untenable operational load on the IT support staff. Over-communication is also problematic: change-resistant users tend to overestimate the impact of changes and attempt to seek unnecessary exemptions. Overly frequent communication can also inure users to potentially impactful changes. Finally, as Google’s corporate infrastructure is evolving in many unrelated ways, it’s easy for users to conflate access issues with other ongoing efforts, which also slows remediation efforts and increases the operational load on support staff.
安全基础设施的根本性改变可能会对整个公司员工的生产力产生潜在的负面影响。向用户传达影响、现象和有效的补救选项很重要,但在过度沟通和沟通不足间很难找到平衡点。沟通不足导致用户感到不满和困惑、补救措施效率低下、IT支持人员身心疲惫。过度沟通也存在问题:抵制变革的用户往往高估变革的影响,并试图寻求不必要的权限。过于频繁的交流也会让用户习惯于潜在的影响性变化。最后,由于Google公司基础设施正在以许多不相关的方式发展,用户很容易将延迟访问问题与其他正在进行的工作联系起来,这也会影响补救工作,增加支持人员的操作负荷。
Disaster Recovery
灾难恢复
Since the composition of the BeyondCorp infrastructure is nontrivial, and a catastrophic failure could prevent even support staff from accessing the tools and systems needed for recovery, we built various fail-safes into the system. In addition to monitoring for potential or manifested unexpected changes in the assignment of trust tiers, we’ve leveraged some of our existing disaster recovery practices to help ensure that BeyondCorp will still function in the event of a catastrophic emergency. Our disaster recovery protocol relies on a minimal set of dependencies and allows an extremely small subset of privileged maintainers to replay an audit log of inventory changes in order to restore a previously known good state of device inventory and trust evaluations. We also have the ability in an emergency to push fine-grained changes to the access policy that allow maintainers to bootstrap a recovery process.
由于BeyondCorp基础设施的组成是非常重要,故障甚至会阻止支持人员访问恢复所需的工具和系统,因此我们在系统中构建了各种故障保障措施。除了监控信任层分配中的潜在异常或显性异常变化之外,我们还利用了一些现有的灾难恢复做法,以帮助确保BeyondCorp在发生灾难性故障的紧急情况下仍能正常工作。我们的灾难恢复协议依赖于最小的依赖集,允许极少数特权容器重放清单更改的审核消息,以便恢复设备清单状态、恢复信任评估状态到未出故障时。我们还可以在紧急情况下将策略变更为允许维护人员启动恢复系统。
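Replaying an audit log to reach a known good state can be sketched as folding ordered changes into an empty inventory. The log entry layout, the `replay` function, and the use of `None` to mark a deleted field are assumptions made for this illustration.

```python
def replay(audit_log, up_to_seq):
    """Rebuild inventory state by applying logged changes in order.

    Each entry is (sequence number, device id, field, new value); a value
    of None records that the field was deleted.
    """
    state = {}
    for seq, device_id, field, value in sorted(audit_log, key=lambda e: e[0]):
        if seq > up_to_seq:
            break  # stop at the last change known to be good
        device = state.setdefault(device_id, {})
        if value is None:
            device.pop(field, None)
        else:
            device[field] = value
    return state


audit_log = [
    (1, "laptop-1", "tier", "high"),
    (2, "laptop-1", "vlan", 120),
    (3, "phone-7", "tier", "medium"),
    (4, "laptop-1", "tier", "none"),  # suspected bad change
]
restored = replay(audit_log, up_to_seq=3)
```

Replaying up to sequence 3 restores `laptop-1` to its `high` tier and leaves out the suspect change. Because the function needs only the log itself, it fits the goal of a recovery path with minimal dependencies.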
Next Steps
下一步
As with any large-scale effort, some of the challenges we faced in deploying BeyondCorp were anticipated while others were not. An increasing number of teams at Google are finding new and interesting ways to integrate with our systems, providing us with more detailed and layered protections against malicious actors. We believe that BeyondCorp has substantially improved the security posture of Google without sacrificing usability, and has provided a flexible infrastructure that will allow us to apply authorization decisions based on policy unencumbered by technological restrictions. While BeyondCorp has been quite successful with Google systems and at Google scale, its principles and processes are also within the reach of other organizations to deploy and improve upon.
与任何大规模的努力一样,我们在部署BeyondCorp时所面临的一些挑战是有想到的,也有未想到的。越来越多的Google团队正在寻找新的、有趣的方法来与我们的系统集成,为我们提供更详细、分层更细话的保护,以抵御恶意行为。我们相信,BeyondCorp在不牺牲可用性的前提下大大改善了Google的安全状况,并提供了一个灵活的基础设施,使我们能够根据不受技术限制的策略应用授权决策。BeyondCorp在Google系统和Google规模上已经取得了相当大成功的同时,它的原则和流程也可在其他组织的范围内进行部署和改进。
Resources
参考文献
[1] Architectural discussion of BeyondCorp: http://research.google.com/pubs/pub43231.html.
[2] Stagefright: https://en.wikipedia.org/wiki/Stagefright_(bug).
[3] RADIUS: https://en.wikipedia.org/wiki/RADIUS.