TensorFlow 2 Debugging Methods

This article introduces debugging techniques for TensorFlow 2: inspecting tensor values, device placement, graph structure, step-by-step debugging, the high-level tf.keras API, numerical issues (NaN / Infinity), and the TensorFlow Debugger (tfdbg). I hope it serves as a useful reference.

Contents

  • 1. Debugging Tensor Values
  • 2. Debugging Device Placement
  • 3. Debugging Graph Structure
    • a. tf.function graphs
    • b. Runtime graphs
  • 4. Step-by-Step Debugging
  • 5. Debugging High-Level APIs (tf.keras)
  • 6. Numerical Issues (NaN / Infinity)
  • 7. TensorFlow Debugger (tfdbg)

1. Debugging Tensor Values

Printing the value of a Tensor

import tensorflow as tf
import numpy as np

def log1p(x):
    y = 1.0 * x
    print(y)
    return tf.math.log(y)

y = log1p(tf.constant([1., 2., 3.]))
y = log1p(tf.constant([2., 3., 4.]) * np.pi)

Output

tf.Tensor([1. 2. 3.], shape=(3,), dtype=float32)
tf.Tensor([ 6.2831855  9.424778  12.566371 ], shape=(3,), dtype=float32)

Explanation

  • The log1p function is not decorated with @tf.function, so it executes eagerly.
  • print() can output a tensor's values
    • It behaves like printing a numpy.ndarray
    • It may trigger a device-to-host copy

Printing aggregate values of a Tensor

def log1p(x):
    y = 1.0 * x
    print(tf.reduce_mean(y), tf.reduce_max(y), tf.reduce_min(y))
    return tf.math.log(y)

y = log1p(tf.constant([1., 2., 3.]))
y = log1p(tf.constant([2., 3., 4.]) * np.pi)

Output

tf.Tensor(2.0, shape=(), dtype=float32) tf.Tensor(3.0, shape=(), dtype=float32) tf.Tensor(1.0, shape=(), dtype=float32)
tf.Tensor(9.424778, shape=(), dtype=float32) tf.Tensor(12.566371, shape=(), dtype=float32) tf.Tensor(6.2831855, shape=(), dtype=float32)
  • Built-in TF functions can be used to print transformed (aggregated) tensor values

Changing the print format

np.set_printoptions(precision=3)

def log1p(x):
    y = 1.0 * x
    print(y)
    return tf.math.log(y)

y = log1p(tf.constant([1., 2., 3.]))
y = log1p(tf.constant([2., 3., 4.]) * np.pi)

Output

tf.Tensor([1. 2. 3.], shape=(3,), dtype=float32)
tf.Tensor([ 6.283  9.425 12.566], shape=(3,), dtype=float32)
  • EagerTensor.__str__() and __repr__() hook into NumPy's string formatting
  • Therefore numpy.set_printoptions() can be used to control the print format

Printing tensors inside a graph

@tf.function
def collatz(n):
    counter = tf.constant(0)
    while n > 1:
        print(n)
        if n % 2 == 0:
            n //= 2
        else:
            n = n * 3 + 1
        counter += 1
    return counter

print(collatz(tf.constant(42)))

Output

Tensor("placeholder:0", shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
  • The Placeholder is part of the graphlet that implements the TF while loop

Replacing print(n) with tf.print(n), the result becomes

42
21
64
32
16
8
4
2
tf.Tensor(8, shape=(), dtype=int32)
  • tf.print() prints the actual runtime values of the tensor n

Ragged tensors (RaggedTensor)

ragged = tf.RaggedTensor.from_row_splits(
    values=[3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
    row_splits=[0, 4, 4, 7, 8, 8]
)

@tf.function
def ragged_times_length_plus_one(x):
    row_lengths = tf.reduce_sum(x.row_lengths())
    y = x * tf.cast(row_lengths, tf.float32)
    tf.print(y)
    return y + 1.0

ragged_times_length_plus_one(ragged)

Output

tf.RaggedTensor(values=Tensor("Mul_1:0", shape=(8,), dtype=float32), row_splits=Tensor("x_1:0", shape=(6,), dtype=int64))
  • Ragged tensors do not print properly inside a graph (only their symbolic components are shown); see the sketch below for a workaround
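As a workaround, here is a minimal sketch (my own addition, not from the original material) that prints the dense component tensors of the RaggedTensor instead, reusing the ragged value defined above; plain dense tensors print fine under tf.print() inside a graph:

@tf.function
def ragged_debug(x):
    y = x * tf.cast(tf.reduce_sum(x.row_lengths()), tf.float32)
    # The RaggedTensor itself only shows its symbolic components inside a graph,
    # but its underlying dense tensors print normally.
    tf.print("values:", y.values, "row_splits:", y.row_splits)
    return y + 1.0

ragged_debug(ragged)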

Sparse tensors

sparse = tf.sparse.SparseTensor(
    indices=[[0, 0], [1, 2]],
    values=[1.1, 2.2],
    dense_shape=[3, 4]
)

@tf.function
def sparse_times_non_zero_count(x):
    count = tf.cast(tf.math.count_nonzero(x.values), tf.float32)
    y = x * count
    tf.print(y)
    return y

sparse_times_non_zero_count(sparse)

Output

'SparseTensor(indices=[[0 0] [1 2]], values=[2.2 4.4], shape=[3 4])'
  • Sparse tensors can be printed

Accessing tensor values inside a graph programmatically

random_normal = tf.random_normal_initializer()
w = tf.Variable(random_normal([2, 3]))
b = tf.Variable(random_normal([3]))

@tf.function
def my_dense_layer(x):
    y = tf.matmul(x, w)
    y_with_bias = y + b
    return tf.nn.relu(y_with_bias), y, y_with_bias

x = random_normal([4, 2])
print(my_dense_layer(x))

Output

(<tf.Tensor: id=460, shape=(4, 3), dtype=float32, numpy=
array([[0.   , 0.026, 0.   ],
       [0.   , 0.024, 0.   ],
       [0.   , 0.029, 0.   ],
       [0.   , 0.022, 0.   ]], dtype=float32)>,
 <tf.Tensor: id=461, shape=(4, 3), dtype=float32, numpy=
array([[-0.   ,  0.001,  0.001],
       [ 0.003, -0.001, -0.006],
       [-0.001,  0.003,  0.006],
       [ 0.002, -0.004, -0.008]], dtype=float32)>,
 <tf.Tensor: id=462, shape=(4, 3), dtype=float32, numpy=
array([[-0.092,  0.026, -0.011],
       [-0.088,  0.024, -0.019],
       [-0.093,  0.029, -0.007],
       [-0.09 ,  0.022, -0.021]], dtype=float32)>)
  • For intermediate tensor values outside control flow, you can add them to the function's return values to obtain their runtime values

Accessing tensor values inside a graph programmatically: while loops

@tf.function
def collatz(n):
    counter = tf.constant(0)
    n_history = tf.TensorArray(n.dtype, size=0, dynamic_size=True)
    while n > 1:
        if n % 2 == 0:
            n //= 2
        else:
            n = n * 3 + 1
        n_history = n_history.write(counter, n)
        counter += 1
    return counter, n_history.stack()

print(collatz(tf.constant(42)))

Output

(<tf.Tensor: id=556, shape=(), dtype=int32, numpy=8>,
 <tf.Tensor: id=557, shape=(8,), dtype=int32, numpy=array([21, 64, 32, 16,  8,  4,  2,  1])>)
  • This can be implemented with tf.TensorArray

2. Debugging Device Placement

Device placement of ops

import tensorflow as tf
import numpy as np

# Must be called at the very start of the program
tf.debugging.set_log_device_placement(True)

def log1p(x):
    y = 1.0 + x
    tf.print(y)
    return tf.math.log(y)

log1p(tf.constant([1.0, 2.0, 3.0]) * np.pi)

Output

Executing op Mul in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op AddV2 in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StringFormat in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op PrintV2 in device /job:localhost/replica:0/task:0/device:CPU:0
[4.14159298 7.28318548 10.424778]
Executing op Log in device /job:localhost/replica:0/task:0/device:CPU:0
  • The placement of each op is logged whenever the op is placed on a device
  • Repeated eager executions of the same op on the same device are not logged again

Device placement of a tf.function

import tensorflow as tf
import numpy as np

# Must be called at the very start of the program
tf.debugging.set_log_device_placement(True)

@tf.function
def log1p(x):
    y = 1.0 + x
    tf.print(y)
    return tf.math.log(y)

log1p(tf.constant([1.0, 2.0, 3.0]) * np.pi)

Output in a Jupyter notebook

Executing op __inference_log1p_19 in device /job:localhost/replica:0/task:0/device:CPU:0
[4.14159298 7.28318548 10.424778]

Output when run from the command line

x: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0
add: (AddV2): /job:localhost/replica:0/task:0/device:CPU:0
StringFormat: (StringFormat): /job:localhost/replica:0/task:0/device:CPU:0
PrintV2: (PrintV2): /job:localhost/replica:0/task:0/device:CPU:0
Log: (Log): /job:localhost/replica:0/task:0/device:CPU:0
Identity: (Identity): /job:localhost/replica:0/task:0/device:CPU:0
identity_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:CPU:0
add/x: (Const): /job:localhost/replica:0/task:0/device:CPU:0
[4.14159298 7.28318548 10.424778]
  • set_log_device_placement() does not show the placement of ops inside the graph in Jupyter
  • This is because Jupyter only displays stdout, while that output goes to the info log
  • set_log_device_placement() only logs device placement for:
    • Eager op execution
    • Graph construction
  • For the latter, there is no guarantee that every op will actually run: Grappler optimizations may prune ops before execution
  • set_log_device_placement() does not work well on TPUs

3. Debugging Graph Structure

a. tf.function graphs

Getting the graph of a tf.function

random_normal = tf.random_normal_initializer()
w = tf.Variable(random_normal([2, 3]))
b = tf.Variable(random_normal([3]))

@tf.function
def my_dense_layer(x):
    y = tf.matmul(x, w)
    y_with_bias = y + b
    return tf.nn.relu(y_with_bias), y, y_with_bias

x = random_normal([4, 2])
print(my_dense_layer(x))

graph = my_dense_layer.get_concrete_function(x).graph
graph.as_graph_def()

Output

node {
  name: "x"
  op: "Placeholder"
  attr {
    key: "_user_specified_name"
    value {
      s: "x"
    }
  }
  attr {
    key: "dtype"
    value {
      type: DT_FLOAT
    }
  }
  attr {
    key: "shape"
    value {
      shape {
        dim {
          size: 4
        }
        dim {
          size: 2
        }
      }
    }
  }
}
node {
  name: "MatMul/ReadVariableOp/resource"
  op: "Placeholder"
  device: "/job:localhost/replica:0/task:0/device:CPU:0"
  attr {
    key: "dtype"
    value {
      type: DT_RESOURCE
    }
  }
...
  • Call get_concrete_function() on the tf.function at or after its first call (trace)
  • A concrete function is the result of compiling the Python function into a graph for a specific set of input arguments

The TensorBoard graph visualizer

  • Vertical direction of data flow: bottom-up
  • Grouping by name scope: yes
  • Can handle a FunctionDefLibrary in the GraphDef (e.g., V2 control flow): yes (using break-out boxes)

One way to get a tf.function graph into TensorBoard is sketched below.
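Outside Google-internal tooling, a common way to feed a tf.function graph to the TensorBoard visualizer is the tf.summary tracing API. A minimal sketch (the logdir and trace name are my own choices), reusing my_dense_layer and x from the earlier example:

writer = tf.summary.create_file_writer("tb_graph_logdir")

# Record graph information while the tf.function executes ...
tf.summary.trace_on(graph=True, profiler=False)
my_dense_layer(x)
# ... then export it so the Graphs tab in TensorBoard can display it.
with writer.as_default():
    tf.summary.trace_export(name="my_dense_layer", step=0)

Then run tensorboard --logdir tb_graph_logdir and open the Graphs tab.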

Getting and plotting the function graph: Colab (Google3 only)

$ blaze run -c opt --config=python3 --config=cuda learning/brain/python/client/colab:colab_notebook_with_tfgraph_py3
random_normal = tf.random_normal_initializer()
w = tf.Variable(random_normal([2, 3]))
b = tf.Variable(random_normal([3]))

@tf.function
def my_dense_layer(x):
    y = tf.matmul(x, w)
    y_with_bias = y + b
    return tf.nn.relu(y_with_bias), y, y_with_bias

x = random_normal([4, 2])
print(my_dense_layer(x))

from google3.learning.brain.python.client import colab
graph = my_dense_layer.get_concrete_function(x).graph
colab.tfgraph.display(graph)

Getting and plotting the function graph: control flow in TF2

@tf.function
def collatz(n):
    counter = tf.constant(0)
    while n > 1:
        if n % 2 == 0:
            n //= 2
        else:
            n = n * 3 + 1
        counter += 1
    return counter

print(collatz(tf.constant(42)))

collatz_graph = collatz.get_concrete_function(tf.constant(42)).graph
colab.tfgraph.display(collatz_graph)
  • Control flow V2 is converted into graphlets
  • The TensorBoard graph visualizer shows graphlets using break-out boxes
  • Netron cannot handle such nested graph structures either

Distribution strategies

gpus = tf.config.list_physical_devices("GPU")
if len(gpus) == 1:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],  # Which physical device to use
        [tf.config.LogicalDeviceConfiguration(512) for _ in range(4)]  # Resultant logical devices
    )
tf.config.list_logical_devices()

dist_strat = tf.distribute.MirroredStrategy()
with dist_strat.scope():
    w = tf.Variable(tf.ones([4, 10]))

def f():
    with tf.GradientTape() as tape:
        loss = tf.math.square(w)
    grads = tape.gradient(loss, w)
    return grads

dist_f = lambda: dist_strat.experimental_run_v2(f)
dist_f = tf.function(dist_f, autograph=True)
g = dist_f.get_concrete_function().graph
g.as_graph_def()

Output

...
node {
  name: "Square"
  op: "Square"
  input: "Square/ReadVariableOp"
  device: "/job:localhost/replica:0/task:0/device:GPU:0"
  attr {
    key: "T"
    value {
      type: DT_FLOAT
    }
  }
}
...
  • MirroredStrategy and some other strategies perform in-graph replication
  • This replication affects the concrete function's graph

How tf.print() works

Question: the result of the tf.print() op is never consumed, so how does it get executed?

Answer: the op is added as a control dependency of the returned results

Does tf.print() still work in a function that has no return value?

v1 = tf.Variable(40.0)

@tf.function
def increment_variable():
    tf.print(v1)
    tf.compat.v1.assign_add(v1, 1.0)

increment_variable()

Output

40

b. Runtime graphs

tf.print() may affect the optimization of the runtime graph

@tf.function
def harmonic_mean(x):
    x_reciprocals = tf.math.reciprocal(x)
    reciprocal_sum = tf.math.reduce_sum(x_reciprocals)
    tf.math.reduce_min(x_reciprocals)  # ==> change to: tf.print(tf.math.reduce_min(x_reciprocals))
    n = tf.cast(tf.size(x), tf.float32)
    return n / reciprocal_sum

harmonic_mean(tf.constant([10.0, 20.0, 30.0]))
  • Adding tf.print() forces the Min op, which would otherwise be pruned and never executed, to run

Dumping the Grappler output: the graph that actually runs

$ TF_DUMP_GRAPH_PREFIX="/tmp/tf_graph_dump" bazel run my/build/target -- --vmodule=meta_optimizer=4
  • Grappler is TF's built-in default graph optimizer
  • The file of interest is usually the last one: Grappler's final output
  • One goal of tfdbg2 is to make this workflow simpler (for both function graphs and Grappler-output graphs)

4. Step-by-Step Debugging

tf.config.experimental_run_functions_eagerly()

  • Overrides graph compilation and runs all ops eagerly, including backprop
  • You can then set breakpoints and step through in your IDE, as in the sketch below
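A minimal sketch of the typical usage (the toy function is my own illustration):

# Make every tf.function run its ops eagerly so Python breakpoints and
# pdb work inside the function body.
# (In newer TF versions this is tf.config.run_functions_eagerly.)
tf.config.experimental_run_functions_eagerly(True)

@tf.function
def scaled_log(x):
    y = 2.0 * x  # a breakpoint set on this line is now actually hit
    return tf.math.log(y)

print(scaled_log(tf.constant([1.0, 2.0, 3.0])))

# Restore normal graph compilation once you are done debugging.
tf.config.experimental_run_functions_eagerly(False)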

This API does not work inside tf.data.Dataset.map()

  • Because Dataset.map() always compiles a graph before execution
  • Regardless of whether @tf.function is used
  • This means:
    • Single-stepping inside the map function is not possible
    • You must use tf.print() instead of print() to output tensor values (see the sketch below)
    • Workaround: use tfdbg2
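A sketch of the tf.print() workaround inside Dataset.map() (the dataset and the map function are my own example):

ds = tf.data.Dataset.from_tensor_slices([1.0, 4.0, 9.0])

def debug_map_fn(x):
    # print(x) would only show the symbolic Tensor once, at trace time;
    # tf.print(x) emits the runtime value for every element.
    tf.print("map input:", x)
    return tf.sqrt(x)

for y in ds.map(debug_map_fn):
    print(y.numpy())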

5. Debugging High-Level APIs (tf.keras)

Accessing tf.keras internals

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(5, input_shape=[4], activation='relu'))
model.add(tf.keras.layers.Dropout(rate=0.5))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

debug_model = tf.keras.Model(
    inputs=model.inputs,
    outputs=[model.layers[0].output, model.layers[1].output] + model.outputs)

xs = tf.random_normal_initializer()([8, 4])
print(debug_model(xs, training=True))

Output

[<tf.Tensor: id=103, shape=(8, 5), dtype=float32, numpy=
array([[0.03208053, 0.        , 0.        , 0.09101269, 0.0405516 ],
       [0.06668283, 0.        , 0.05414589, 0.        , 0.06441024],
       [0.        , 0.02470349, 0.0345275 , 0.        , 0.        ],
       [0.02822505, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.03051471, 0.        ],
       [0.01117405, 0.        , 0.0744615 , 0.07232606, 0.09003952],
       [0.        , 0.03395397, 0.04608804, 0.        , 0.        ],
       [0.        , 0.02972447, 0.00674627, 0.        , 0.        ]],
      dtype=float32)>,
 <tf.Tensor: id=116, shape=(8, 5), dtype=float32, numpy=
array([[0.06416105, 0.        , 0.        , 0.        , 0.08110321],
       [0.13336566, 0.        , 0.        , 0.        , 0.12882048],
       [0.        , 0.04940698, 0.069055  , 0.        , 0.        ],
       [0.0564501 , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.06102942, 0.        ],
       [0.0223481 , 0.        , 0.14892301, 0.14465213, 0.18007904],
       [0.        , 0.        , 0.09217609, 0.        , 0.        ],
       [0.        , 0.05944894, 0.        , 0.        , 0.        ]],
      dtype=float32)>,
 <tf.Tensor: id=121, shape=(8, 1), dtype=float32, numpy=
array([[0.51327056],
       [0.52288353],
       [0.49928164],
       [0.5032595 ],
       [0.5143335 ],
       [0.54077065],
       [0.49030966],
       [0.50787127]], dtype=float32)>]
  • To access a model's internal layers, build a new model whose outputs include those layers
  • What if you want to see the gradients inside the layers?
    • tfdbg can help (a GradientTape-based sketch also follows below)
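Short of tfdbg, one way to look at gradients flowing through those internal layers is tf.GradientTape; this is my own sketch (with a toy loss), reusing debug_model and xs from above:

with tf.GradientTape() as tape:
    dense_out, dropout_out, prediction = debug_model(xs, training=True)
    loss = tf.reduce_mean(tf.square(prediction))  # toy loss, just to have a target

# Gradients of the loss with respect to the intermediate layer outputs.
d_dense, d_dropout = tape.gradient(loss, [dense_out, dropout_out])
print(d_dense, d_dropout)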

Debugging a Keras model with the TensorBoard callback

from tensorflow.keras import backend as K

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(5, input_shape=[4], activation='relu'))
model.add(tf.keras.layers.Dropout(rate=0.5))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

xs = tf.random_normal_initializer()([8, 4])
ys = tf.zeros([8])
model.fit(xs, ys, epochs=2,
          callbacks=[tf.keras.callbacks.TensorBoard("tb_logdir")])
  • The tf.keras.callbacks.TensorBoard callback writes training logs to the logdir, including the loss, weights, and other information
  • Edges in the graph are labeled with tensor shapes, but only with the shapes known at model-construction time

6. Numerical Issues (NaN / Infinity)

Common causes of numerical issues

  • Missing value clipping (see the sketch after this list)
    • Division by zero, log of zero
  • Problems in specific ops
  • Exploding gradients
  • Bad training examples
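For the clipping case, a small sketch (my own illustration) of guarding a log and a division:

vals = tf.constant([0.0, 1e-9, 2.0])
denom = tf.constant([0.0, 3.0, 4.0])

# Clip before the log so that log(0) can never produce -inf.
safe_log = tf.math.log(tf.clip_by_value(vals, 1e-6, tf.float32.max))

# divide_no_nan yields 0 where the denominator is 0, instead of inf/NaN.
safe_div = tf.math.divide_no_nan(vals, denom)

print(safe_log, safe_div)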

Debugging numerical issues with tfdbg2

tf.debugging.enable_check_numerics()

@tf.function
def bad_func(n):
    total = tf.constant(0.0)
    x = tf.constant(10.0)
    i = tf.constant(0, dtype=tf.int32)
    while tf.math.less(i, n):
        total += tf.math.log(x)
        x -= 1.0
        i += 1
    return total

# With n <= 10 there is no error; n = 12 drives x down to 0 and below,
# so the loop takes the log of a non-positive value.
n = tf.constant(12, dtype=tf.int32)
print(bad_func(n))

Output

InvalidArgumentError: !!! Detected Infinity or NaN in output 0 of graph op "Log" (# of outputs: 1) !!!
  dtype: <dtype: 'float32'>
  shape: ()

  Input tensor: Tensor("Placeholder:0", shape=(), dtype=float32)
  Graph name: "while_body_13"

  Stack trace of op's creation ("->": inferred user code):
    + ... (Omitted 21 frames)
    + ...3.6/site-packages/IPython/core/interactiveshell.py (L2848) run_cell
    -> |   raw_cell, store_history, silent, shell_futures)
    + ...3.6/site-packages/IPython/core/interactiveshell.py (L2874) _run_cell
    -> |   return runner(coro)
    + ...hon3.6/site-packages/IPython/core/async_helpers.py (L68) _pseudo_sync_runner
    -> |   coro.send(None)
    + ...3.6/site-packages/IPython/core/interactiveshell.py (L3051) run_cell_async
    -> |   interactivity=interactivity, compiler=compiler, result=result)
    + ...3.6/site-packages/IPython/core/interactiveshell.py (L3242) run_ast_nodes
    -> |   if (await self.run_code(code, result, async_=asy)):
    + ...3.6/site-packages/IPython/core/interactiveshell.py (L3319) run_code
    -> |   exec(code_obj, self.user_global_ns, self.user_ns)
    + <ipython-input-3-acc5c4cbe210> (L16) <module>
    -> |   print(bad_func(n))
    + ...kages/tensorflow_core/python/eager/def_function.py (L568) __call__
    |   result = self._call(*args, **kwds)
    + ...kages/tensorflow_core/python/eager/def_function.py (L615) _call
    |   self._initialize(args, kwds, add_initializers_to=initializers)
    + ...kages/tensorflow_core/python/eager/def_function.py (L497) _initialize
    |   *args, **kwds))
    + ...-packages/tensorflow_core/python/eager/function.py (L2389) _get_concrete_function_internal_garbage_collected
    |   graph_function, _, _ = self._maybe_define_function(args, kwargs)
    + ...-packages/tensorflow_core/python/eager/function.py (L2703) _maybe_define_function
    |   graph_function = self._create_graph_function(args, kwargs)
    + ...-packages/tensorflow_core/python/eager/function.py (L2593) _create_graph_function
    |   capture_by_value=self._capture_by_value),
    + ...ges/tensorflow_core/python/framework/func_graph.py (L978) func_graph_from_py_func
    |   func_outputs = python_func(*func_args, **func_kwargs)
    + ...kages/tensorflow_core/python/eager/def_function.py (L439) wrapped_fn
    |   return weak_wrapped_fn().__wrapped__(*args, **kwds)
    + ...ges/tensorflow_core/python/framework/func_graph.py (L964) wrapper
    |   user_requested=True,
    + <ipython-input-3-acc5c4cbe210> (L8) bad_func
    -> |   while tf.math.less(i, n):
    + ...ow_core/python/autograph/operators/control_flow.py (L746) while_stmt
    |   basic_symbol_names, composite_symbol_names, opts)
    + ...ow_core/python/autograph/operators/control_flow.py (L794) _tf_while_stmt
    |   aug_init_vars, **opts)
    + ...ges/tensorflow_core/python/ops/control_flow_ops.py (L2675) while_loop
    |   back_prop=back_prop)
    + ...te-packages/tensorflow_core/python/ops/while_v2.py (L194) while_loop
    |   add_control_dependencies=add_control_dependencies)
    + ...ges/tensorflow_core/python/framework/func_graph.py (L978) func_graph_from_py_func
    |   func_outputs = python_func(*func_args, **func_kwargs)
    + ...te-packages/tensorflow_core/python/ops/while_v2.py (L172) wrapped_body
    |   outputs = body(*_pack_sequence_as(orig_loop_vars, args))
    + ...ow_core/python/autograph/operators/control_flow.py (L781) aug_body
    |   loop_vars = body(*aug_loop_vars[loop_vars_slice])
    + <ipython-input-3-acc5c4cbe210> (L9) bad_func
    -> |   total += tf.math.log(x)
    + ...ackages/tensorflow_core/python/ops/gen_math_ops.py (L5248) log
    |   "Log", x=x, name=name)
    + ...tensorflow_core/python/framework/op_def_library.py (L742) _apply_op_helper
    |   attrs=attr_protos, op_def=op_def)
    + ...ges/tensorflow_core/python/framework/func_graph.py (L595) _create_op_internal
    |   compute_device)
    + ...e-packages/tensorflow_core/python/framework/ops.py (L3322) _create_op_internal
    |   op_def=op_def)
    + ...e-packages/tensorflow_core/python/framework/ops.py (L1756) __init__
    |   self._traceback = tf_stack.extract_stack()
 : Tensor had Inf values
	 [[{{node while/body/_1/Log/CheckNumerics}}]] [Op:__inference_bad_func_58]

Function call stack:
bad_func
  • enable_check_numerics() is the successor to add_check_numerics_ops() from TF1
  • Checks both eagerly executed ops and ops inside graphs
    • Works for forward and backward passes
    • Works across API layers
    • Also works in TF1
    • Works on CPU, GPU, and TPU
  • Relative overhead
    • 1.29x wall time on CPU and 1.76x on GPU, so the overhead is not high
    • Note: 1.0x means no overhead
    • Benchmark model: tensorflow_models.official.transformers.v2, task type = training, batch size = 64
    • TPU benchmarks will be added later, in collaboration with TensorTracer

7. TensorFlow Debugger (tfdbg)

TensorFlow Debugger (tfdbg) V1

  • The predecessor of tfdbg v2, launched in early 2017
  • Provides visibility into the tf.Session() runtime
    • Hooks in by wrapping tf.Session()
    • Convenient APIs are also available for Keras, Estimator, and slim
  • Supports distributed training
  • User interface: an interactive, clickable CLI
    • Intermediate tensor values and their summary statistics
      • Conditional breakpoints, e.g., has_inf_or_nan
    • Runtime graph structure (after Grappler and partitioning)
    • Op attributes, including the originating stack trace
    • Source-code viewing

Why is tfdbg v2 needed?

  • TF's new execution paradigm
    • No tf.Session()
    • Eager execution + tf.function
  • Can print() and tf.print() cover debuggability?
    • Helpful in some cases, but not the complete answer
    • Generality matters: many kinds of hardware
    • Low performance overhead matters
      • Tiered invasiveness of debugging
    • Frontend UX matters

Workflow: the user's TF program → tf.debugging.experimental.enable_dump_debug_info(logdir) → the Debugger V2 dashboard (under construction); a minimal usage sketch follows.
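A minimal sketch of the dumping step (the logdir path and tensor_debug_mode choice are my own; see the documentation linked below for the options available in your TF version):

# Dump debug information (op executions, tensor health, originating stack
# traces) to a directory that the Debugger V2 dashboard in TensorBoard reads.
tf.debugging.experimental.enable_dump_debug_info(
    "/tmp/tfdbg2_logdir",
    tensor_debug_mode="FULL_HEALTH",
    circular_buffer_size=-1)

# Run the TF program as usual, then inspect with:
#   tensorboard --logdir /tmp/tfdbg2_logdir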

Key documentation:

tf.debugging.experimental.enable_dump_debug_info

TensorFlow Debugger (TFDBG)

Debugger Dashboard usage guide
