Chapter 9. Known Issues

The following problems still exist in this release and are in the process of being resolved.

Known Issues

Interaction with pthreads

Single-threaded applications that use dlopen() to load NVIDIA's libGL library, and then use dlopen() to load any other library that is linked against libpthread, will crash in libGL. This does not happen with NVIDIA's new ELF TLS OpenGL libraries (see Chapter 5, Listing of Installed Components for a description of the ELF TLS OpenGL libraries). Possible workarounds for this problem are:

  1. Load the library that is linked with libpthread before loading libGL.

  2. Link the application with libpthread.

Cache Aliasing

Cache aliasing occurs when multiple mappings to a physical page of memory have conflicting caching states, such as cached and uncached. Due to these conflicting states, data in that physical page may become corrupted when the processor's cache is flushed. If that page is being used for DMA by a driver such as NVIDIA's graphics driver, this can lead to hardware stability problems and system lockups.

NVIDIA has encountered bugs with some Linux kernel versions that lead to cache aliasing. Although some systems will run perfectly fine when cache aliasing occurs, other systems will experience severe stability problems, including random lockups. Users experiencing stability problems due to cache aliasing will benefit from updating to a kernel that does not cause cache aliasing to occur.

64-Bit BARs (Base Address Registers)

NVIDIA GPUs advertise a 64-bit BAR capability (a Base Address Register stores the location of a PCI I/O region, such as registers or a frame buffer). This means that the GPU's PCI I/O regions (registers and frame buffer) can be placed above the 32-bit address space (the first 4 gigabytes of memory).

The decision of where the BAR is placed is made by the system BIOS at boot time. If the BIOS supports 64-bit BARs, then the NVIDIA PCI I/O regions may be placed above the 32-bit address space. If the BIOS does not support this feature, then the NVIDIA PCI I/O regions will be placed within the 32-bit address space, as they have always been.

Unfortunately, some Linux kernels (such as 2.6.11.x) do not understand or support 64-bit BARs. If the BIOS does place any NVIDIA PCI I/O regions above the 32-bit address space, such kernels will reject the BAR and the NVIDIA driver will not work.

The only known workaround is to upgrade to a newer kernel.

Valgrind

The NVIDIA OpenGL implementation makes use of self-modifying code. To force Valgrind to retranslate this code after a modification, you must run Valgrind with the command line option:

--smc-check=all

Without this option, Valgrind may execute incorrect code, causing incorrect behavior and reports of the form:

==30313== Invalid write of size 4
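As a minimal sketch, the option can be wrapped in a small shell function so that it is not forgotten. The function name run_gl_under_valgrind and the application path ./my_gl_app are hypothetical placeholders, not part of the driver or of Valgrind.

```shell
# Hypothetical helper: run an OpenGL program under Valgrind with
# self-modifying-code checks enabled for all code.
run_gl_under_valgrind() {
    valgrind --smc-check=all "$@"
}

# Example (./my_gl_app is a placeholder for your application):
#   run_gl_under_valgrind ./my_gl_app
```

Any further Valgrind options can be added inside the function, before "$@".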

HDMI screen blanks unless audio is played

The ALSA audio driver in some Linux kernels contains a bug affecting some systems with integrated graphics that causes the display to go blank on some HDMI TVs whenever audio is not being played. This bug occurs when the ALSA audio driver configures the HDMI hardware to send an HDMI audio info frame that contains an invalid checksum. Some TVs blank the video when they receive such invalid audio packets.

To ensure proper display, please make sure your Linux kernel contains commit 1f348522844bb1f6e7b10d50b9e8aa89a2511b09. This fix is in Linux 2.6.39-rc3 and later, and may be back-ported to some older kernels.

Driver fails to initialize when MSI interrupts are enabled

The Linux NVIDIA driver uses Message Signaled Interrupts (MSI) by default. This provides compatibility and scalability benefits, mainly due to the avoidance of IRQ sharing.

Some systems have problems supporting MSI, while working fine with virtual wire interrupts. These problems manifest as an inability to start X with the NVIDIA driver, or as CUDA initialization failures. The NVIDIA driver will then report an error indicating that the NVIDIA kernel module does not appear to be receiving interrupts generated by the GPU.

Problems have also been seen with suspend/resume while MSI is enabled. All known problems have been fixed, but if you observe problems with suspend/resume that you did not see with previous drivers, disabling MSI may help you.

NVIDIA is working on a long-term solution to improve the driver's out of the box compatibility with system configurations that do not fully support MSI.

MSI interrupts can be disabled via the NVIDIA kernel module parameter "NVreg_EnableMSI=0". This can be set on the command line when loading the module, or more appropriately via your distribution's kernel module configuration files (such as those under /etc/modprobe.d/).
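The configuration-file approach could look like the following sketch. The helper name write_nvidia_option and the file name nvidia-msi.conf are assumptions chosen for illustration; only the module name (nvidia) and the parameter come from the text above.

```shell
# Hypothetical helper: write a modprobe.d fragment that sets one NVIDIA
# kernel module parameter.
#   $1  parameter assignment, e.g. NVreg_EnableMSI=0
#   $2  output file (the default name below is an assumption; check
#       which file names your distribution reads under /etc/modprobe.d/)
write_nvidia_option() {
    param="$1"
    conf="${2:-/etc/modprobe.d/nvidia-msi.conf}"
    printf 'options nvidia %s\n' "$param" > "$conf"
}

# Example (run as root; applies the next time the module is loaded):
#   write_nvidia_option NVreg_EnableMSI=0
```

The resulting file contains the single line "options nvidia NVreg_EnableMSI=0". The helper overwrites the target file, so use a dedicated file for this setting.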

Console restore behavior

The Linux NVIDIA driver uses the nvidia-modeset module for console restore whenever it can. Currently, the improved console restore mechanism is used on systems that boot with the UEFI Graphics Output Protocol driver, and on systems that use supported VESA linear graphical modes. Note that VGA text, color index, planar, banked, and some linear modes cannot be supported, and will use the older console restore method instead.

When the new console restore mechanism is in use and the nvidia-modeset module is initialized (e.g. because an X server is running on a different VT, nvidia-persistenced is running, or the nvidia_drm module is loaded with the modeset=1 parameter), then nvidia-modeset will respond to hot plug events by displaying the console on as many displays as it can. Note that to save power, it may not display the console on all connected displays.

Vulkan and device enumeration

Starting with the X.Org X server version 1.20.7, it is possible to enumerate all the NVIDIA devices in the system if the application is able to open a connection to the X server. However, such applications will only be able to create an Xlib or XCB swapchain on the device driving the X screen. Such a device can be identified by using the vkGetPhysicalDeviceSurfaceSupportKHR() API.

Prior to the X.Org X server version 1.20.7, it is not possible to enumerate multiple devices if one of them will be used to present to an X11 swapchain. It is still possible to enumerate multiple devices even if one of them is driving an X screen, if the devices will be used for Vulkan offscreen rendering or presenting to a display swapchain. For that, make sure that the application cannot open a display connection to an X server by, for example, unsetting the DISPLAY environment variable.
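One way to hide the X server from a single application, without changing the rest of the session, is sketched below. The function name run_without_display and the application path are hypothetical.

```shell
# Hypothetical helper: run a command with DISPLAY removed from its
# environment, so it cannot find the X server through that variable.
# The subshell leaves the caller's own DISPLAY untouched.
run_without_display() {
    (
        unset DISPLAY
        exec "$@"
    )
}

# Example (./my_vulkan_app is a placeholder for your application):
#   run_without_display ./my_vulkan_app
```

This only removes the environment variable; an application that is given a display name explicitly can still connect to the X server.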

Restricting access to GPU performance counters

NVIDIA Developer Tools allow developers to debug, profile, and develop software for NVIDIA GPUs. GPU performance counters are integral to these tools. By default, access to the GPU performance counters is restricted to root, and other users with the CAP_SYS_ADMIN capability, for security reasons. If developers require access to the NVIDIA Developer Tools, a system administrator can accept the security risk and allow access to users without the CAP_SYS_ADMIN capability.

Wider access to GPU performance counters can be granted by setting the kernel module parameter "NVreg_RestrictProfilingToAdminUsers=0" in the nvidia.ko kernel module. This can be set on the command line when loading the module, or more appropriately via your distribution's kernel module configuration files (such as those under /etc/modprobe.d/).
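For a one-off test, the command-line form could be sketched as follows. The function name is hypothetical; the module name and parameter are taken from the paragraph above.

```shell
# Hypothetical helper: load the NVIDIA kernel module with access to GPU
# performance counters opened to users without CAP_SYS_ADMIN.
# Must be run as root. modprobe does nothing if the module is already
# loaded, so unload it first (no X server or CUDA application running).
load_nvidia_unrestricted_profiling() {
    modprobe nvidia NVreg_RestrictProfilingToAdminUsers=0
}

# Example:
#   load_nvidia_unrestricted_profiling
```

A setting made this way lasts only until the module is unloaded; a file under /etc/modprobe.d/ is the persistent alternative.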

Driver fails to initialize with some versions of RHEL 8

Some versions of Red Hat Enterprise Linux 8 kernels have a bug that causes driver initialization to fail with an error such as:

    NVRM: Xid (PCI:0000:09:00): 79, pid=2172, GPU has fallen off the bus.
    NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
    NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x26:0x65:1239)
    NVRM: GPU 0000:09:00.0: rm_init_adapter failed, device minor number 0

See the Red Hat knowledge base article https://access.redhat.com/solutions/5825061 to find the specific affected and fixed kernel versions.

Notebooks

If you are using a notebook, see the "Known Notebook Issues" in Chapter 16, Configuring a Notebook.

Texture seams in Quake 3 engine

Many games based on the Quake 3 engine set their textures to use the GL_CLAMP clamping mode when they should be using GL_CLAMP_TO_EDGE. This was an oversight made by the developers because some legacy NVIDIA GPUs treat the two modes as equivalent. The result is seams at the edges of textures in these games. To mitigate this, older versions of the NVIDIA display driver remap GL_CLAMP to GL_CLAMP_TO_EDGE internally to emulate the behavior of the older GPUs, but this workaround has been disabled by default. To re-enable it, uncheck the "Use Conformant Texture Clamping" checkbox in nvidia-settings before starting any affected applications.

FSAA

When FSAA is enabled (the __GL_FSAA_MODE environment variable is set to a value that enables FSAA and a multisample visual is chosen), the rendering may be corrupted when resizing the window.

libGL DSO finalizer and pthreads

When a multithreaded OpenGL application exits, it is possible for libGL's DSO finalizer (also known as the destructor, or "_fini") to be called while other threads are still executing OpenGL code. The finalizer frees resources allocated by libGL, which can cause problems for threads that are still using those resources. Setting the environment variable "__GL_NO_DSO_FINALIZER" to "1" works around this problem by forcing libGL's finalizer to leave its resources in place. These resources will still be reclaimed by the operating system when the process exits. Note that the finalizer is also executed as part of dlclose(3), so if an application calls dlopen(3) and dlclose(3) on libGL repeatedly, "__GL_NO_DSO_FINALIZER" will cause libGL to leak resources until the process exits. Using this option can improve stability in some multithreaded applications, including Java3D applications.
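The variable can be limited to one application rather than set for the whole session, as in this sketch. The function name and the application path are hypothetical.

```shell
# Hypothetical helper: run a command with libGL's DSO finalizer disabled
# for that process only; the caller's environment is not modified.
run_without_gl_finalizer() {
    env __GL_NO_DSO_FINALIZER=1 "$@"
}

# Example (./my_gl_app is a placeholder for your application):
#   run_without_gl_finalizer ./my_gl_app
```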

Thread cancellation

Canceling a thread (see pthread_cancel(3)) while it is executing in the OpenGL driver causes undefined behavior. For applications that wish to use thread cancellation, it is recommended that threads disable cancellation using pthread_setcancelstate(3) while executing OpenGL or GLX commands.

This section describes problems that will not be fixed. Usually, the source of the problem is beyond the control of NVIDIA. The problems are listed below:

Problems that Will Not Be Fixed

NV-CONTROL versions 1.8 and 1.9

Version 1.8 of the NV-CONTROL X Extension introduced target types for setting and querying attributes as well as receiving event notification on targets. Targets are objects like X Screens, GPUs and Quadro Sync devices. Previously, all attributes were described relative to an X Screen. These new bits of information (target type and target id) were packed in a non-compatible way in the protocol stream such that addressing X Screen 1 or higher would generate an X protocol error when mixing NV-CONTROL client and server versions.

This packing problem has been fixed in the NV-CONTROL 1.10 protocol, making it possible for the older (1.7 and prior) clients to communicate with NV-CONTROL 1.10 servers. Furthermore, the NV-CONTROL 1.10 client library has been updated to accommodate the target protocol packing bug when communicating with a 1.8 or 1.9 NV-CONTROL server. This means that the NV-CONTROL 1.10 client library should be able to communicate with any version of the NV-CONTROL server.

NVIDIA recommends that NV-CONTROL client applications relink with version 1.10 or later of the NV-CONTROL client library (libXNVCtrl.a, in the nvidia-settings-470.82.00.tar.bz2 tarball). The version of the client library can be determined by checking the NV_CONTROL_MAJOR and NV_CONTROL_MINOR definitions in the accompanying nv_control.h.

The only web-released NVIDIA Linux driver that is affected by this problem (i.e., the only driver to use either version 1.8 or 1.9 of the NV-CONTROL X extension) is 1.0-8756.

CPU throttling reducing memory bandwidth on IGP systems

For some models of CPU, the CPU throttling technology may affect not only CPU core frequency, but also memory frequency/bandwidth. On systems using integrated graphics, any reduction in memory bandwidth will affect the GPU as well as the CPU. This can negatively affect applications that use significant memory bandwidth, such as video decoding using VDPAU, or certain OpenGL operations. This may cause such applications to run with lower performance than desired.

To work around this problem, NVIDIA recommends configuring your CPU throttling implementation to avoid reducing memory bandwidth. This may be as simple as setting a certain minimum frequency for the CPU.

Depending on your operating system and distribution, this can be done by writing to a configuration file in the /sys or /proc filesystems, or to another system configuration file. Please read, or search the Internet for, documentation regarding CPU throttling on your operating system.
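On Linux systems that expose the cpufreq interface in sysfs, setting a minimum frequency might look like the sketch below. The function name is hypothetical, and the frequency in the example is only a placeholder: valid values depend on your CPU, and whether a given minimum prevents the memory bandwidth reduction is hardware-specific.

```shell
# Hypothetical helper: raise the minimum CPU frequency on every core
# through the Linux cpufreq sysfs interface.
#   $1  minimum frequency in kHz (e.g. 1600000 for 1.6 GHz)
#   $2  sysfs CPU directory (defaults to the standard location)
set_min_cpu_freq() {
    khz="$1"
    cpu_root="${2:-/sys/devices/system/cpu}"
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_min_freq; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        printf '%s\n' "$khz" > "$f"
    done
}

# Example (run as root; 1600000 kHz is a placeholder value):
#   set_min_cpu_freq 1600000
```

The setting does not survive a reboot, and a power management daemon may overwrite it, so a distribution-specific configuration file is usually the better long-term choice.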

VDPAU initialization failures on supported GPUs

If VDPAU returns the VDP_STATUS_NO_IMPLEMENTATION error on a GPU that was labeled or specified as supporting PureVideo or PureVideo HD, one possible reason is a hardware defect. After ruling out any other software problems, NVIDIA recommends returning the GPU to the manufacturer for a replacement.

Some applications, such as Quake 3, crash after querying the OpenGL extension string

Some applications have bugs that are triggered when the extension string is longer than a certain size. As more features are added to the driver, the length of this string increases and can trigger these sorts of bugs.

You can limit the extensions listed in the OpenGL extension string to the ones that appeared in a particular version of the driver by setting the __GL_ExtensionStringVersion environment variable to a particular version number. For example,

__GL_ExtensionStringVersion=17700 quake3

will run Quake 3 with the extension string that appeared in the 177.* driver series. Limiting the size of the extension string can work around this sort of application bug.

Some X servers have trouble with multiple GPUs

Some versions of the X.Org X server starting with 1.5.0 have a bug that causes X to fail with an error similar to the following when there is more than one GPU in the computer:

(!!) More than one possible primary device found
(II) Primary Device is:
(EE) No devices detected.

Fatal server error:
no screens found

This bug was fixed in the X.Org X Server 1.7 release.

You can work around this problem by specifying the bus ID of the device you wish to use. For more details, please search the xorg.conf manual page for "BusID". You can configure the X server with an X screen on each NVIDIA GPU by running:

nvidia-xconfig --enable-all-gpus

Please see Bugzilla bug #18321 for more details on this X server problem. In addition, please see “How do I interpret X server version numbers?” when determining whether your X server is new enough to contain this fix.
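When writing a "BusID" entry by hand, keep in mind that lspci prints PCI addresses in hexadecimal, while the xorg.conf "BusID" option expects decimal numbers. The conversion can be sketched as below; the function name lspci_to_busid is hypothetical, and the sketch assumes the GPU is in PCI domain 0 (any domain prefix is simply dropped).

```shell
# Hypothetical helper: convert a PCI address as printed by lspci
# (hexadecimal "bus:device.function", optionally prefixed by a
# four-digit "domain:" field) into the decimal PCI:bus:device:function
# form used by the xorg.conf BusID option. Assumes PCI domain 0.
lspci_to_busid() {
    addr="${1#????:}"        # drop a leading four-digit domain, if any
    bus="${addr%%:*}"
    rest="${addr#*:}"
    dev="${rest%%.*}"
    fn="${rest#*.}"
    printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
}

# Example:
#   lspci_to_busid 0a:00.0
```

The printed string can be used as the value of the "BusID" option in the "Device" section of the X configuration file.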

gnome-shell doesn't update until a window is moved

Versions of libcogl prior to 1.10.x have a bug which causes glBlitFramebuffer() calls used to update the window to be clipped by a 0x0 scissor (see GNOME bug #690451 for more details). To work around this bug, the scissor test can be disabled by setting the __GL_ConformantBlitFramebufferScissor environment variable to 0. Note that this version of the NVIDIA driver comes with an application profile that automatically disables this test if libcogl is detected in the process.

Some X servers ignore the RandR transform filter during a modeset request

The RandR layer of the X server attempts to ignore redundant RRSetCrtcConfig requests. If the only property changed by an RRSetCrtcConfig request is the transform filter, some X servers will ignore the request as redundant. This can be worked around by also changing other properties, such as the mode, transformation matrix, etc.