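Many people use GPUs heavily, whether for gaming or for deep learning.
Yet few of them seem to know the hardware in any detail.
They assume "my server is well cooled, so it will be fine" or "it sits in an IDC, so there won't be a problem."
But every GPU has a proper operating temperature, with limits built in.
The exact values differ from GPU to GPU, of course.
So how can we check what those temperatures are?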
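Checking is simple.
I think I already have about three posts that involve the nvidia-smi command.
Seeing it that often can make nvidia-smi look like nothing special, and there are monitoring tools that look nicer,
but my impression is that they are all little more than derivatives of nvidia-smi.
Even just running nvidia-smi -h shows a lot of options.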
(base) root@ubuntu:~/gpu-burn# nvidia-smi -h
NVIDIA System Management Interface -- v440.59
NVSMI provides monitoring information for Tesla and select Quadro devices.
The data is presented in either a plain text or an XML format, via stdout or a file.
NVSMI also provides several management operations for changing the device state.
Note that the functionality of NVSMI is exposed through the NVML C-based
library. See the NVIDIA developer website for more information about NVML.
Python wrappers to NVML are also available. The output of NVSMI is
not guaranteed to be backwards compatible; NVML and the bindings are backwards
compatible.
http://developer.nvidia.com/nvidia-management-library-nvml/
http://pypi.python.org/pypi/nvidia-ml-py/
Supported products:
- Full Support
- All Tesla products, starting with the Kepler architecture
- All Quadro products, starting with the Kepler architecture
- All GRID products, starting with the Kepler architecture
- GeForce Titan products, starting with the Kepler architecture
- Limited Support
- All Geforce products, starting with the Kepler architecture
nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...
-h, --help Print usage information and exit.
LIST OPTIONS:
-L, --list-gpus Display a list of GPUs connected to the system.
-B, --list-blacklist-gpus Display a list of blacklisted GPUs in the system.
SUMMARY OPTIONS:
<no arguments> Show a summary of GPUs connected to the system.
[plus any of]
-i, --id= Target a specific GPU.
-f, --filename= Log to a specified file, rather than to stdout.
-l, --loop= Probe until Ctrl+C at specified second interval.
QUERY OPTIONS:
-q, --query Display GPU or Unit info.
[plus any of]
-u, --unit Show unit, rather than GPU, attributes.
-i, --id= Target a specific GPU or Unit.
-f, --filename= Log to a specified file, rather than to stdout.
-x, --xml-format Produce XML output.
--dtd When showing xml output, embed DTD.
-d, --display= Display only selected information: MEMORY,
UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
PAGE_RETIREMENT, ACCOUNTING, ENCODER_STATS, FBC_STATS
Flags can be combined with comma e.g. ECC,POWER.
Sampling data with max/min/avg is also returned
for POWER, UTILIZATION and CLOCK display types.
Doesn't work with -u or -x flags.
-l, --loop= Probe until Ctrl+C at specified second interval.
-lms, --loop-ms= Probe until Ctrl+C at specified millisecond interval.
There are more options, but I will omit the rest.
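The help text above notes that display flags can be combined with commas and that -i and -l can be added to a query. As a small illustration, here is a Python sketch that assembles such a command line. The helper name build_query_command and its parameters are hypothetical and only for this example; it uses just the -q, -d, -i and -l options listed in the help output.

```python
def build_query_command(display=None, gpu_id=None, loop_seconds=None):
    """Build an argument list for `nvidia-smi -q` (hypothetical helper)."""
    cmd = ["nvidia-smi", "-q"]
    if display:
        # Display types are joined with commas, e.g. ECC,POWER
        cmd += ["-d", ",".join(name.upper() for name in display)]
    if gpu_id is not None:
        # -i targets a specific GPU
        cmd += ["-i", str(gpu_id)]
    if loop_seconds is not None:
        # -l repeats the probe every N seconds until Ctrl+C
        cmd += ["-l", str(loop_seconds)]
    return cmd


print(" ".join(build_query_command(["temperature", "power"], gpu_id=0)))
```

On a machine with the driver installed, the resulting list could be passed to subprocess.run. With -l the command keeps running until it is interrupted, so that form is better suited to an interactive terminal.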
One of the options in that list that we use constantly when working on GPU servers is -q.
Display GPU or Unit info.
Nearly all of the information is contained in that query output.
(base) root@ubuntu:~/gpu-burn# nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Tue Aug 18 04:51:33 2020
Driver Version : 440.59
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:02:00.0
Product Name : GeForce RTX 2080 Ti
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-c58a9b13-0d1d-ce25-5e9d-79b93bdf0d7c
Minor Number : 0
VBIOS Version : 90.02.30.40.7E
MultiGPU Board : No
Board ID : 0x200
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x1E0710DE
Bus Id : 00000000:02:00.0
Sub System Id : 0x134E196E
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 30 %
Performance State : P8
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 11019 MiB
Used : 0 MiB
Free : 11019 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Temperature
GPU Current Temp : 31 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 22.32 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 100.00 W
Max Power Limit : 280.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 7000 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
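That is the information for a single GPU.
It includes a lot of detail such as power draw and clock information, but what we want to check is the proper operating temperature.
The Temperature section in the middle is the part to look at.
Checking one GPU this way is easy, but it gets inconvenient when there are many GPUs or when you want to look at other information, so let's use another option.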
-d, --display= Display only selected information: MEMORY,
UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
PAGE_RETIREMENT, ACCOUNTING, ENCODER_STATS, FBC_STATS
Flags can be combined with comma e.g. ECC,POWER.
Sampling data with max/min/avg is also returned
for POWER, UTILIZATION and CLOCK display types.
Doesn't work with -u or -x flags.
(base) root@ubuntu:~/gpu-burn# nvidia-smi -q -d temperature
==============NVSMI LOG==============
Timestamp : Tue Aug 18 05:04:47 2020
Driver Version : 440.59
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:02:00.0
Temperature
GPU Current Temp : 31 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Now it is much easier to read.
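When several GPUs are attached, reading this report by eye still gets tedious. The sketch below is one possible way to turn the text into a per-GPU dictionary in Python. The function name parse_temperatures and the SAMPLE string are illustrative assumptions, not part of nvidia-smi. The layout it expects is the one in the output above from driver 440.59, and the help text itself warns that NVSMI output is not guaranteed to be backwards compatible, so treat this as a rough sketch rather than a robust parser.

```python
import re

# A GPU block starts with a line such as "GPU 00000000:02:00.0"
GPU_HEADER = re.compile(
    r"^GPU ([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])$"
)
# Fields look like "GPU Current Temp : 31 C"
FIELD = re.compile(r"^(.+?)\s+:\s+(.*)$")

# Sample text copied from the report above
SAMPLE = """\
==============NVSMI LOG==============
Timestamp : Tue Aug 18 05:04:47 2020
Driver Version : 440.59
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:02:00.0
    Temperature
        GPU Current Temp : 31 C
        GPU Shutdown Temp : 94 C
        GPU Slowdown Temp : 91 C
        GPU Max Operating Temp : 89 C
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
"""


def parse_temperatures(report):
    """Return {bus_id: {field_name: degrees_or_None}} for every GPU block."""
    gpus = {}
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        header = GPU_HEADER.match(line)
        if header:
            current = gpus.setdefault(header.group(1), {})
            continue
        field = FIELD.match(line)
        if current is None or not field:
            continue  # skip lines before the first GPU and section titles
        key, value = field.groups()
        if key.endswith("Temp"):
            degrees = re.fullmatch(r"(\d+) C", value)
            # "N/A" (or anything unexpected) is stored as None
            current[key] = int(degrees.group(1)) if degrees else None
    return gpus


for bus_id, temps in parse_temperatures(SAMPLE).items():
    print(bus_id, temps)
```

On a real server the report text could come from subprocess.run(["nvidia-smi", "-q", "-d", "TEMPERATURE"], capture_output=True, text=True).stdout instead of SAMPLE.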
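GPU Current Temp is the GPU's current temperature.
GPU Max Operating Temp is the highest temperature at which the GPU can keep running while still delivering its full performance.
Beyond that comes Slowdown Temp, where the GPU's performance degrades, and if the temperature keeps rising to Shutdown Temp,
GPU jobs can die, and a system hang or crash, a GPU drop, or anything similar would not be surprising.
Keep these values in mind.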
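The ordering of the three thresholds can be written down as a small check. This is a minimal sketch: the function name thermal_state and the returned labels are made up for illustration, and the choice to treat a reading exactly equal to a threshold as already being in that state is my own assumption, not something nvidia-smi documents. The numbers in the example calls are the RTX 2080 Ti values from the output above.

```python
def thermal_state(current, max_operating, slowdown, shutdown):
    """Classify a temperature reading (degrees C) against the reported limits."""
    if current >= shutdown:
        return "shutdown"             # jobs may die, hang or GPU drop possible
    if current >= slowdown:
        return "slowdown"             # performance is being reduced
    if current > max_operating:
        return "above_max_operating"  # past the full-performance range
    return "ok"


# Thresholds reported for the GeForce RTX 2080 Ti in this post
print(thermal_state(31, max_operating=89, slowdown=91, shutdown=94))
print(thermal_state(92, max_operating=89, slowdown=91, shutdown=94))
```

Since the limits differ from GPU to GPU, the thresholds should be read from each card's own report rather than hard-coded.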