I'm 昏睡飞鸟, a blogger at 靠谱客. This article describes one approach to resolving the "ResourceExhaustedError" that can come up while learning TensorFlow 2.0. I'm sharing it here in the hope it serves as a useful reference.

Recently, while learning to train models with TensorFlow 2.0, I ran into the following error:

ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-15-0d9dc5695c3a> in <module>
----> 1 history=model.fit(train_x,train_y,epochs=5,batch_size=64,validation_data=(test_x,test_y))

D:\Anaconda\lib\site-packages\tensorflow\python\keras\engine\training.py in _method_wrapper(self, *args, **kwargs)
     64   def _method_wrapper(self, *args, **kwargs):
     65     if not self._in_multi_worker_mode():  # pylint: disable=protected-access
---> 66       return method(self, *args, **kwargs)
     67 
     68     # Running inside `run_distribute_coordinator` already.

D:\Anaconda\lib\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
    846                 batch_size=batch_size):
    847               callbacks.on_train_batch_begin(step)
--> 848               tmp_logs = train_function(iterator)
    849               # Catch OutOfRangeError for Datasets of unknown size.
    850               # This blocks until the batch has finished executing.

D:\Anaconda\lib\site-packages\tensorflow\python\eager\def_function.py in __call__(self, *args, **kwds)
    578         xla_context.Exit()
    579     else:
--> 580       result = self._call(*args, **kwds)
    581 
    582     if tracing_count == self._get_tracing_count():

D:\Anaconda\lib\site-packages\tensorflow\python\eager\def_function.py in _call(self, *args, **kwds)
    609       # In this case we have created variables on the first call, so we run the
    610       # defunned version which is guaranteed to never create variables.
--> 611       return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
    612     elif self._stateful_fn is not None:
    613       # Release the lock early so that multiple threads can perform the call

D:\Anaconda\lib\site-packages\tensorflow\python\eager\function.py in __call__(self, *args, **kwargs)
   2418     with self._lock:
   2419       graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 2420     return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
   2421 
   2422   @property

D:\Anaconda\lib\site-packages\tensorflow\python\eager\function.py in _filtered_call(self, args, kwargs)
   1663          if isinstance(t, (ops.Tensor,
   1664                            resource_variable_ops.BaseResourceVariable))),
-> 1665         self.captured_inputs)
   1666 
   1667   def _call_flat(self, args, captured_inputs, cancellation_manager=None):

D:\Anaconda\lib\site-packages\tensorflow\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1744       # No tape is watching; skip to running the function.
   1745       return self._build_call_outputs(self._inference_function.call(
-> 1746           ctx, args, cancellation_manager=cancellation_manager))
   1747     forward_backward = self._select_forward_and_backward_functions(
   1748         args,

D:\Anaconda\lib\site-packages\tensorflow\python\eager\function.py in call(self, ctx, args, cancellation_manager)
    596               inputs=args,
    597               attrs=attrs,
--> 598               ctx=ctx)
    599         else:
    600           outputs = execute.execute_with_cancellation(

D:\Anaconda\lib\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

ResourceExhaustedError:  OOM when allocating tensor with shape[64,64,252,252] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node sequential/conv2d_1/Conv2D (defined at <ipython-input-15-0d9dc5695c3a>:1) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_function_3331]

Function call stack:
train_function
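The error message itself already tells us roughly how much memory the failed allocation needed: the shape [64,64,252,252] is one convolution layer's output at batch_size=64. A quick back-of-the-envelope estimate (assuming "float" here means 4-byte float32):

```python
# The tensor the GPU allocator failed on: shape [64, 64, 252, 252], float32.
batch, channels, height, width = 64, 64, 252, 252
bytes_per_element = 4  # float32

total_bytes = batch * channels * height * width * bytes_per_element
print(total_bytes)                         # 1,040,449,536 bytes
print(f"{total_bytes / 1024**3:.2f} GiB")  # ~0.97 GiB
```

So a single activation tensor of this one conv layer needs almost 1 GiB, before counting the other layers' activations, the gradients, and the optimizer state. On a small GPU that adds up fast, which is why shrinking batch_size helps.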

This error tells us that a resource has been exhausted. A first step is to take a model that can otherwise train normally and probe the machine's upper limit. For example:

history=model.fit(train_x,train_y,epochs=5,batch_size=64,validation_data=(test_x,test_y))

With batch_size=64, the machine raises this "ResourceExhaustedError"; with batch_size=16, it trains the model normally. By repeatedly adjusting the number, I eventually found that the largest batch size my computer can handle is batch_size=21; anything beyond that and it errors out. Through this tuning process I also confirmed that the resource overflowing was GPU memory. While experimenting like this, you can watch your system's resource monitor to see exactly which resource is running out.
I'm recording this here as a note, and to share it.
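The trial-and-error above can be automated with a binary search instead of manual guessing. Below is a minimal sketch (my own addition, not from the original post): `try_fit` is a hypothetical callable that runs a short training at the given batch size and raises an exception on OOM (with TensorFlow you would catch `tf.errors.ResourceExhaustedError` around `model.fit(...)`):

```python
def max_batch_size(try_fit, low=1, high=64):
    """Binary-search the largest batch size for which try_fit(bs) succeeds.

    try_fit: callable taking a batch size; raises an exception on OOM,
             e.g. a wrapper around model.fit(..., batch_size=bs, epochs=1).
    Returns the largest working batch size in [low, high], or None if
    even `low` fails.
    """
    best = None
    while low <= high:
        mid = (low + high) // 2
        try:
            try_fit(mid)       # with TF: except tf.errors.ResourceExhaustedError
        except Exception:
            high = mid - 1     # mid is too big; search below it
        else:
            best = mid         # mid works; try something bigger
            low = mid + 1
    return best
```

On a machine like mine that fails above batch_size=21, this finds 21 in about six trial fits instead of many manual runs.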

Finally

