UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

Published 2023-10-09 22:36:35 Author: emanlee

 

/home/software/anaconda3/envs/mydlenv/lib/python3.8/site-packages/tensorflow/python/client/session.py:1751: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
  warnings.warn('An interactive session is already active. This can '

 

 

---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
Cell In[1], line 59
     54 model.add(Dense(num_classes, activation='softmax'))
     56 model.compile(loss=keras.losses.categorical_crossentropy,
     57               optimizer=keras.optimizers.Adadelta(),
     58               metrics=['accuracy'])
---> 59 model.fit(x_train, y_train,
     60           batch_size=batch_size,
     61           epochs=epochs,
     62           verbose=1,
     63           validation_data=(x_test, y_test))
     64 score = model.evaluate(x_test, y_test, verbose=0)
     65 print('Test loss:', score[0])

File /home/software/anaconda3/envs/mydlenv/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:66, in enable_multi_worker.<locals>._method_wrapper(self, *args, **kwargs)
     64 def _method_wrapper(self, *args, **kwargs):
     65   if not self._in_multi_worker_mode():  # pylint: disable=protected-access
---> 66     return method(self, *args, **kwargs)
     68   # Running inside `run_distribute_coordinator` already.
     69   if dc_context.get_current_worker_context():

File /home/software/anaconda3/envs/mydlenv/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:848, in Model.fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
    841 with traceme.TraceMe(
    842     'TraceContext',
    843     graph_type='train',
    844     epoch_num=epoch,
    845     step_num=step,
    846     batch_size=batch_size):
    847   callbacks.on_train_batch_begin(step)
--> 848   tmp_logs = train_function(iterator)
    849   # Catch OutOfRangeError for Datasets of unknown size.
    850   # This blocks until the batch has finished executing.
    851   # TODO(b/150292341): Allow multiple async steps here.
    852   if not data_handler.inferred_steps:

File /home/software/anaconda3/envs/mydlenv/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:580, in Function.__call__(self, *args, **kwds)
    578     xla_context.Exit()
    579 else:
--> 580   result = self._call(*args, **kwds)
    582 if tracing_count == self._get_tracing_count():
    583   self._call_counter.called_without_tracing()

File /home/software/anaconda3/envs/mydlenv/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:644, in Function._call(self, *args, **kwds)
    640     pass  # Fall through to cond-based initialization.
    641   else:
    642     # Lifting succeeded, so variables are initialized and we can run the
    643     # stateless function.
--> 644     return self._stateless_fn(*args, **kwds)
    645 else:
    646   canon_args, canon_kwds = \
    647       self._stateful_fn._function_spec.canonicalize_function_inputs(  # pylint: disable=protected-access
    648           *args, **kwds)

File /home/software/anaconda3/envs/mydlenv/lib/python3.8/site-packages/tensorflow/python/eager/function.py:2420, in Function.__call__(self, *args, **kwargs)
   2418 with self._lock:
   2419   graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 2420 return graph_function._filtered_call(args, kwargs)

File /home/software/anaconda3/envs/mydlenv/lib/python3.8/site-packages/tensorflow/python/eager/function.py:1661, in ConcreteFunction._filtered_call(self, args, kwargs)
   1647 def _filtered_call(self, args, kwargs):
   1648   """Executes the function, filtering arguments from the Python function.
   1649 
   1650   Objects aside from Tensors, CompositeTensors, and Variables are ignored.
   (...)
   1659     `args` and `kwargs`.
   1660   """
-> 1661   return self._call_flat(
   1662       (t for t in nest.flatten((args, kwargs), expand_composites=True)
   1663        if isinstance(t, (ops.Tensor,
   1664                          resource_variable_ops.BaseResourceVariable))),
   1665       self.captured_inputs)

File /home/software/anaconda3/envs/mydlenv/lib/python3.8/site-packages/tensorflow/python/eager/function.py:1745, in ConcreteFunction._call_flat(self, args, captured_inputs, cancellation_manager)
   1740 possible_gradient_type = (
   1741     pywrap_tfe.TFE_Py_TapeSetPossibleGradientTypes(args))
   1742 if (possible_gradient_type == _POSSIBLE_GRADIENT_TYPES_NONE
   1743     and executing_eagerly):
   1744   # No tape is watching; skip to running the function.
-> 1745   return self._build_call_outputs(self._inference_function.call(
   1746       ctx, args, cancellation_manager=cancellation_manager))
   1747 forward_backward = self._select_forward_and_backward_functions(
   1748     args,
   1749     possible_gradient_type,
   1750     executing_eagerly)
   1751 forward_function, args_with_tangents = forward_backward.forward()

File /home/software/anaconda3/envs/mydlenv/lib/python3.8/site-packages/tensorflow/python/eager/function.py:593, in _EagerDefinedFunction.call(self, ctx, args, cancellation_manager)
    591 with _InterpolateFunctionError(self):
    592   if cancellation_manager is None:
--> 593     outputs = execute.execute(
    594         str(self.signature.name),
    595         num_outputs=self._num_outputs,
    596         inputs=args,
    597         attrs=attrs,
    598         ctx=ctx)
    599   else:
    600     outputs = execute.execute_with_cancellation(
    601         str(self.signature.name),
    602         num_outputs=self._num_outputs,
   (...)
    605         ctx=ctx,
    606         cancellation_manager=cancellation_manager)

File /home/software/anaconda3/envs/mydlenv/lib/python3.8/site-packages/tensorflow/python/eager/execute.py:59, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     57 try:
     58   ctx.ensure_initialized()
---> 59   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     60                                       inputs, attrs, num_outputs)
     61 except core._NotOkStatusException as e:
     62   if name is not None:

UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node sequential/conv2d/Conv2D (defined at tmp/ipykernel_30155/4033766983.py:59) ]] [Op:__inference_train_function_848]

Function call stack:
train_function


 ===========================================================

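At first I suspected a misconfigured CUDA/cuDNN setup (the versions have to match). After repeated attempts, the error was still there.
In the end, the likely cause turned out to be insufficient GPU memory. The following code needs to be added at the top of the program: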

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)


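This tells TensorFlow to allocate GPU memory on demand.
The main cause is that my images are fairly large and consume a lot of GPU resources, while my graphics card (an RTX 2060) has only 6 GB of memory, which triggers this error. The error message is very misleading and keeps you fixated on CUDA and cuDNN version problems. I am posting this so that others do not fall into the same trap.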
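
On TensorFlow 2.x, the same on-demand allocation can also be requested through the tf.config API rather than a compat.v1 session. The following is a minimal sketch, not the original author's code: the helper name enable_memory_growth is hypothetical, it assumes TensorFlow 2.1 or later (where tf.config.list_physical_devices is available), and it takes the imported tensorflow module as its argument. It has to run before any GPU work, because TensorFlow raises a RuntimeError once the devices have been initialized.

```python
def enable_memory_growth(tf):
    """Ask TensorFlow 2.x to grow GPU memory on demand for every visible GPU.

    `tf` is the imported tensorflow module (hypothetical helper, for illustration).
    Returns the number of GPUs that were found.
    """
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
        except RuntimeError as err:
            # Raised when the GPU has already been initialized.
            print(f"Could not enable memory growth on {gpu}: {err}")
    return len(gpus)


# Typical call site, at the very top of the program:
#   import tensorflow as tf
#   enable_memory_growth(tf)
```

A return value of 0 means TensorFlow sees no GPU at all, which would point to a driver or installation problem rather than a memory problem.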

Reference:

    https://github.com/tensorflow/tensorflow/issues/24828
————————————————
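Copyright notice: This is an original article by CSDN blogger 「史丹利复合田」, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/tsyccnh/article/details/102938368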

===========================================================

2023-01-30 23:52:19.393940: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2023-01-30 23:52:19.404512: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

===========================================================

Solution:

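# TensorFlow 1.x / standalone Keras API; under TensorFlow 2.x the equivalents live in tf.compat.v1.
import tensorflow as tf
import keras
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
sess = tf.Session(config=config)
set_session(sess)
keras.backend.clear_session()  # clear the session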


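Cause
The session opened at the start to limit GPU memory was never cleaned up; calling Keras's clear_session() to clear it is enough to fix the problem.

In short: don't reach for a version downgrade the moment you hit a problem.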

https://blog.csdn.net/qq_38835585/article/details/108321628

===========================================================

 

I've seen this error message for three different reasons, with different solutions:

1. You have cache issues

I regularly work around this error by shutting down my Python process, removing the ~/.nv directory (on Linux, rm -rf ~/.nv), and restarting the Python process. I don't know exactly why this works. It's probably at least partly related to the second option:

2. You're out of memory

The error can also show up if you run out of graphics card RAM. With an NVIDIA GPU you can check graphics card memory usage with nvidia-smi. This will give you a readout of how much GPU RAM you have in use (something like 6025MiB / 6086MiB if you're almost at the limit) as well as a list of which processes are using GPU RAM.
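
The same check can be scripted. The following is a rough sketch, assuming nvidia-smi is on the PATH and reports plain integers for both fields; the function names and the 0.95 threshold are illustrative choices, not part of any library.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' pairs in MiB, one line per GPU."""
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Run nvidia-smi and return a list of (used_mib, total_mib) per GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


def nearly_full(readings, threshold=0.95):
    """Return the indices of GPUs whose used/total ratio is at or above threshold."""
    return [i for i, (used, total) in enumerate(readings) if used / total >= threshold]
```

Calling nearly_full(query_gpu_memory()) before starting training gives an early hint that another process is already holding most of the card's memory.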

If you've run out of RAM, you'll need to restart the process (which should free up the RAM) and then take a less memory-intensive approach. A few options are:

  • reducing your batch size
  • using a simpler model
  • using less data
  • limiting TensorFlow's GPU memory fraction: for example, the following will make sure TensorFlow uses <= 90% of your GPU RAM:
import keras
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # 0.6 sometimes works better for folks
keras.backend.tensorflow_backend.set_session(tf.Session(config=config))

This can slow down your model evaluation if not used together with the items above, presumably since the large data set will have to be swapped in and out to fit into the small amount of memory you've allocated.
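
The snippet above uses the TensorFlow 1.x session API. In TensorFlow 2.x, the tf.config API expresses the cap as an absolute number of megabytes instead of a fraction. The sketch below shows one way to do that, assuming TensorFlow 2.1 or later; the helper names fraction_to_mb and limit_gpu_memory are hypothetical, and the total memory figure has to be supplied by you (for example, from nvidia-smi).

```python
def fraction_to_mb(total_mb, fraction):
    """Convert a fraction of total GPU memory into a whole number of megabytes."""
    return int(total_mb * fraction)


def limit_gpu_memory(tf, memory_limit_mb):
    """Cap TensorFlow's memory on every visible GPU at memory_limit_mb.

    `tf` is the imported tensorflow module (hypothetical helper, for illustration).
    Returns the number of GPUs that were found.
    """
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        try:
            tf.config.experimental.set_virtual_device_configuration(
                gpu,
                [tf.config.experimental.VirtualDeviceConfiguration(
                    memory_limit=memory_limit_mb)],
            )
        except RuntimeError as err:
            # Raised when the GPU has already been initialized.
            print(f"Could not set a memory limit on {gpu}: {err}")
    return len(gpus)


# Typical call site, before any GPU work, for a card reporting 6086 MiB:
#   import tensorflow as tf
#   limit_gpu_memory(tf, fraction_to_mb(6086, 0.9))
```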

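A second option is to have TensorFlow start out using only a minimum amount of memory and then allocate more as needed, by setting the TF_FORCE_GPU_ALLOW_GROWTH environment variable:

import os

# Set this before TensorFlow initializes the GPU (safest: before importing tensorflow).
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'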

3. You have incompatible versions of CUDA, TensorFlow, NVIDIA drivers, etc.

If you've never had similar models working, you're not running out of VRAM and your cache is clean, I'd go back and set up CUDA + TensorFlow using the best available installation guide - I have had the most success with following the instructions at https://www.tensorflow.org/install/gpu rather than those on the NVIDIA / CUDA site. Lambda Stack is also a good way to go.

 
  • 14
    I'm upvoting this answer since for me, I was out of memory only.   Feb 5, 2020 at 16:48
  • 2
    In my case, it was incompatible versions. The instructions at tensorflow.org/install/gpu are accurate if you pay close attention to operators like = or >=. Originally I assumed "equal or newer", but with TensorFlow 2.2 (which seemingly needs to be treated like 2.1), you need exactly CUDA 10.1 and a cuDNN >= 7.6 that is compatible with CUDA 10.1 (currently, that's only 7.6.5, and there are two different builds of it, one for CUDA 10.2 and one for CUDA 10.1).
    – Heath
     Jun 28, 2020 at 23:03
  • 1
    It was memory for me as well. Thanks for the in depth explanation.   Jul 1, 2020 at 11:02
  • 1
    In my case it was out of memory, and your code with 0.6 worked for me [per_process_gpu_memory_fraction = 0.6]. Thanks
    – Nitesh
     Dec 16, 2020 at 19:44 
  • 1
    I was out of memory the whole time. A background process was hogging up all of my GPU memory. Cross checked the process ids with htop and nvidia-smi   Dec 17, 2020 at 11:28

===========================================================

 

You can also downgrade the TensorFlow version   Dec 20, 2019 at 5:52
  • 3
    I got the same error. The reason is a mismatch between your CUDA/cuDNN version and your TensorFlow version. There are two ways to solve this: either downgrade your TensorFlow version (pip install --upgrade tensorflow-gpu==1.8.0), or follow the steps at tensorflow.org/install/gpu. Tip: choose your Ubuntu version and follow the steps. :-)   Dec 20, 2019 at 6:00
  •  
    For me, it was a mismatch between CUDA and cuDNN. Replacing cuDNN libraries with a matching version solved the issue.   Jan 22, 2020 at 4:54

===========================================================