
What Is the Proper Operating Temperature for an NVIDIA GPU?

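Many people use GPUs for gaming or for deep learning.

However, few of them seem to know exactly what temperature a GPU should run at.

They keep using it on the assumption that "my server is well cooled, so it should be fine" or "it sits in an IDC, so there is no problem."

But every GPU has a proper temperature range, bounded by limits built into the card.

Of course, the exact values differ from GPU to GPU.

So how can we check those temperatures?

It is simple.

I think I already have about three posts that involve the nvidia-smi command.

nvidia-smi may look like nothing special next to the nicer-looking monitoring tools out there, but I think those tools are really nothing more than products derived from nvidia-smi.

Just running nvidia-smi -h shows how many options it has.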

(base) root@ubuntu:~/gpu-burn# nvidia-smi -h
NVIDIA System Management Interface -- v440.59

NVSMI provides monitoring information for Tesla and select Quadro devices.
The data is presented in either a plain text or an XML format, via stdout or a file.
NVSMI also provides several management operations for changing the device state.

Note that the functionality of NVSMI is exposed through the NVML C-based
library. See the NVIDIA developer website for more information about NVML.
Python wrappers to NVML are also available.  The output of NVSMI is
not guaranteed to be backwards compatible; NVML and the bindings are backwards
compatible.

http://developer.nvidia.com/nvidia-management-library-nvml/
http://pypi.python.org/pypi/nvidia-ml-py/
Supported products:
- Full Support
    - All Tesla products, starting with the Kepler architecture
    - All Quadro products, starting with the Kepler architecture
    - All GRID products, starting with the Kepler architecture
    - GeForce Titan products, starting with the Kepler architecture
- Limited Support
    - All Geforce products, starting with the Kepler architecture
nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...

    -h,   --help                Print usage information and exit.

  LIST OPTIONS:

    -L,   --list-gpus           Display a list of GPUs connected to the system.

    -B,   --list-blacklist-gpus Display a list of blacklisted GPUs in the system.

  SUMMARY OPTIONS:

    <no arguments>              Show a summary of GPUs connected to the system.

    [plus any of]

    -i,   --id=                 Target a specific GPU.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -l,   --loop=               Probe until Ctrl+C at specified second interval.

  QUERY OPTIONS:

    -q,   --query               Display GPU or Unit info.

    [plus any of]

    -u,   --unit                Show unit, rather than GPU, attributes.
    -i,   --id=                 Target a specific GPU or Unit.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -x,   --xml-format          Produce XML output.
          --dtd                 When showing xml output, embed DTD.
    -d,   --display=            Display only selected information: MEMORY,
                                    UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
                                    COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
                                    PAGE_RETIREMENT, ACCOUNTING, ENCODER_STATS, FBC_STATS
                                Flags can be combined with comma e.g. ECC,POWER.
                                Sampling data with max/min/avg is also returned
                                for POWER, UTILIZATION and CLOCK display types.
                                Doesn't work with -u or -x flags.
    -l,   --loop=               Probe until Ctrl+C at specified second interval.

    -lms, --loop-ms=            Probe until Ctrl+C at specified millisecond interval.

 

There is more, but I will omit the rest.

Among those options, -q is one we use often when working on GPU servers.

Display GPU or Unit info.

Almost all of the information is in that query output.

(base) root@ubuntu:~/gpu-burn# nvidia-smi -q

==============NVSMI LOG==============

Timestamp                           : Tue Aug 18 04:51:33 2020
Driver Version                      : 440.59
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:02:00.0
    Product Name                    : GeForce RTX 2080 Ti
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-c58a9b13-0d1d-ce25-5e9d-79b93bdf0d7c
    Minor Number                    : 0
    VBIOS Version                   : 90.02.30.40.7E
    MultiGPU Board                  : No
    Board ID                        : 0x200
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.02.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization Mode         : None
        Host VGPU Mode              : N/A
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1E0710DE
        Bus Id                      : 00000000:02:00.0
        Sub System Id               : 0x134E196E
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 30 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 11019 MiB
        Used                        : 0 MiB
        Free                        : 11019 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
        Aggregate
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending Page Blacklist      : N/A
    Temperature
        GPU Current Temp            : 31 C
        GPU Shutdown Temp           : 94 C
        GPU Slowdown Temp           : 91 C
        GPU Max Operating Temp      : 89 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 22.32 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 280.00 W
    Clocks
        Graphics                    : 300 MHz
        SM                          : 300 MHz
        Memory                      : 405 MHz
        Video                       : 540 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 2100 MHz
        SM                          : 2100 MHz
        Memory                      : 7000 MHz
        Video                       : 1950 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

 

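This is the information for a single GPU.

It contains a lot of information, such as power draw and clocks, but what we want to check is the proper temperature.

The Temperature section in the middle is the part to look at.

With one GPU this is easy to check, but with many GPUs, or when checking some other piece of information, it becomes inconvenient, so let's use another option.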

-d,   --display=            Display only selected information: MEMORY,
                                    UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
                                    COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
                                    PAGE_RETIREMENT, ACCOUNTING, ENCODER_STATS, FBC_STATS
                                Flags can be combined with comma e.g. ECC,POWER.
                                Sampling data with max/min/avg is also returned
                                for POWER, UTILIZATION and CLOCK display types.
                                Doesn't work with -u or -x flags.

 

(base) root@ubuntu:~/gpu-burn# nvidia-smi -q -d temperature

==============NVSMI LOG==============

Timestamp                           : Tue Aug 18 05:04:47 2020
Driver Version                      : 440.59
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:02:00.0
    Temperature
        GPU Current Temp            : 31 C
        GPU Shutdown Temp           : 94 C
        GPU Slowdown Temp           : 91 C
        GPU Max Operating Temp      : 89 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A

 

Now it is easier to read.
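If you want to read these values from a script rather than by eye, the text output above can be parsed. The following is only a rough sketch: the function names parse_temperatures and read_from_nvidia_smi are made up for this example, the SAMPLE text is copied from the output shown above, and the field labels are assumed to match what this driver version (440.59) prints. The post itself notes that nvidia-smi output is not guaranteed to be backwards compatible, so treat the parsing as fragile.

```python
import re
import subprocess

# Sample text copied from the output shown in this post, so the parser can be
# tried on a machine without a GPU.
SAMPLE = """\
GPU 00000000:02:00.0
    Temperature
        GPU Current Temp            : 31 C
        GPU Shutdown Temp           : 94 C
        GPU Slowdown Temp           : 91 C
        GPU Max Operating Temp      : 89 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
"""


def parse_temperatures(text):
    """Return {gpu_bus_id: {field_name: celsius_as_int_or_None}}."""
    gpus = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        # An unindented "GPU <bus id>" line starts a new GPU section.
        if line.startswith("GPU "):
            current = gpus.setdefault(stripped[4:], {})
            continue
        # Assumption: every temperature field label ends in "Temp".
        match = re.match(r"(.+?Temp)\s*:\s*(.+)$", stripped)
        if current is not None and match:
            name, value = match.groups()
            number = re.match(r"(\d+)\s*C$", value)
            # Values such as "N/A" are stored as None.
            current[name] = int(number.group(1)) if number else None
    return gpus


def read_from_nvidia_smi():
    """Run the same command as in the post and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", "-q", "-d", "TEMPERATURE"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


if __name__ == "__main__":
    print(parse_temperatures(SAMPLE))
```

On a machine with the NVIDIA driver installed, read_from_nvidia_smi() should return the same kind of dictionary for the real GPUs; parse_temperatures(SAMPLE) works anywhere.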

 

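GPU Current Temp is the current temperature of the GPU.

GPU Max Operating Temp is the proper temperature limit, the highest temperature the GPU can sustain while still delivering its full performance.

Above that, at GPU Slowdown Temp, the GPU's performance is reduced, and if the temperature climbs all the way to GPU Shutdown Temp, GPU jobs fail, and a system hang or crash, a GPU drop, or anything similar would not be surprising.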
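To make the relationship between the three thresholds concrete, here is a small sketch that maps a reading to a status. The function name classify_temperature and the status labels are invented for this example, and treating the boundaries as "at or above" for slowdown and shutdown is my own assumption, not something nvidia-smi documents. The numbers used below are the ones from the RTX 2080 Ti output above (89 / 91 / 94 C).

```python
def classify_temperature(current, max_operating, slowdown, shutdown):
    """Map a GPU temperature in Celsius to a coarse status label.

    The labels and the exact boundary handling are assumptions made
    for illustration only.
    """
    if current >= shutdown:
        return "shutdown"             # jobs may fail, GPU may drop
    if current >= slowdown:
        return "slowdown"             # clocks are reduced
    if current > max_operating:
        return "above max operating"  # past the proper range
    return "ok"


if __name__ == "__main__":
    # Thresholds taken from the nvidia-smi output shown in this post.
    for temp in (31, 90, 92, 94):
        print(temp, classify_temperature(temp, 89, 91, 94))
```

A check like this could run periodically and warn as soon as the status stops being "ok", well before the GPU reaches its shutdown temperature.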

 

Keep this in mind.
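For servers with many GPUs, nvidia-smi also has a --query-gpu option with CSV output, which this post does not cover but which is easier to handle in a script than the -q text. The sketch below assumes the fields index, name and temperature.gpu; the function names and the two-GPU SAMPLE_CSV text are made up for illustration and are not real measurements.

```python
import csv
import io
import subprocess

# One line per GPU, e.g. "0, GeForce RTX 2080 Ti, 31".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]

# Hypothetical two-GPU output, used only so the parser can run without a GPU.
SAMPLE_CSV = "0, GeForce RTX 2080 Ti, 31\n1, GeForce RTX 2080 Ti, 45\n"


def parse_gpu_csv(text):
    """Turn the CSV text into a list of dicts, one per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        index, name, temp = record
        rows.append({
            "index": int(index),
            "name": name,
            # Unsupported readings are not numeric; store them as None.
            "temp_c": int(temp) if temp.isdigit() else None,
        })
    return rows


def query_gpus():
    """Run nvidia-smi and return the parsed per-GPU temperatures."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in parse_gpu_csv(SAMPLE_CSV):
        print(gpu["index"], gpu["name"], gpu["temp_c"])
```

With one row per GPU, finding the hottest card or comparing every card against its thresholds becomes a simple loop.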
