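Dalian Artificial Intelligence Computing Platform (Huawei Ascend AI platform): multi-CPU run mode for a single task in high-performance computing (HPC)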

Published: 2023-07-04 13:19:23  Author: Death_Knight

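Supercomputing is fairly remote from everyday life, even for a student majoring in computer science. Our lab happened to get an account on Huawei's supercomputing platform, so I explored it a little. I have to admit that it is not something an ordinary user can figure out easily.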

 

Basic concepts:

A job can launch multiple replicas, each started by mpirun -N. Each replica is called a task, and each task can in turn request multiple CPU cores and multiple GPU compute resources.
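To make the job/task/CPU hierarchy concrete, here is a minimal sketch of the arithmetic, assuming every task in a job receives the same CPU request. The `JobRequest` class and its field names are made up for illustration and are not part of any scheduler API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical model of a job: several tasks, each with its own CPU request."""
    tasks: int          # number of task replicas in the job
    cpus_per_task: int  # CPU cores requested for each task

    def total_cpus(self) -> int:
        # Assumption: each task gets its own allocation, so the requests add up.
        return self.tasks * self.cpus_per_task


if __name__ == "__main__":
    print(JobRequest(tasks=8, cpus_per_task=120).total_cpus())  # 960
    print(JobRequest(tasks=8, cpus_per_task=1).total_cpus())    # 8
```

Under that assumption, a job with 8 tasks at 120 CPUs each asks for 960 cores in total, while the same job at 1 CPU per task asks for only 8.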

 

Compute code:

import mpi4py.MPI as MPI
import sys
import socket
import numpy as np

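def func1(queue, num):
    import time
    # time.sleep(num)
    # time.sleep(1)
    # CPU-bound dummy workload: accumulate random vectors, then reduce to a scalar.
    x = np.random.rand(100)
    for _ in range(1000000):
        x += np.random.rand(100)
    num += np.sum(x)

    # Hand the result back to the parent process.
    queue.put(num)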


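def run_queue():
    from multiprocessing import Process, Queue

    ps = 120  # number of worker processes started inside this task

    queue = Queue(maxsize=200)  # shared by all workers to return their results

    process = [Process(target=func1, args=(queue, num)) for num in range(ps)]
    for p in process:
        p.start()
    # Drain the queue before joining: a child that still has queued data
    # buffered may not exit, so calling join() first can deadlock.
    results = [queue.get() for _ in process]
    for p in process:
        p.join()
    return results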

 
comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()
node_name = MPI.Get_processor_name()
# node_name = socket.gethostname()
 
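# point-to-point communication around a ring of ranks
data_send = [comm_rank]*1

# Send to the next rank. A blocking send of such a small message usually
# returns at once; for large messages comm.sendrecv is the safer choice.
comm.send(data_send, dest=(comm_rank+1)%comm_size)

res = run_queue()  # run the multi-process workload inside this task

# Receive from the previous rank.
data_recv = comm.recv(source=(comm_rank-1)%comm_size)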

# print("my rank is %d, and I received:" % comm_rank, data_recv, file=sys.stdout, flush=True)
# print(data_recv)

# Each rank writes its own result file, named after its rank.
with open("/home/share/xxxxxxxxxx/home/xxxxxxxx/xxxxxxx/results/{}.txt".format(comm_rank), "w") as f:
    f.write("my rank is %d/%d, and node_name: %s I received: " % (comm_rank, comm_size, node_name) + str(data_recv) + " " + str(res) + "\n")
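The send and receive calls above form a ring: each rank sends to `(rank + 1) % size` and receives from `(rank - 1) % size`. The following sketch reproduces only that index arithmetic in plain Python, without MPI, so it can be checked on any machine. The helper names `ring_neighbors` and `simulate_ring` are hypothetical and exist only for this illustration.

```python
from __future__ import annotations


def ring_neighbors(rank: int, size: int) -> tuple[int, int]:
    """Return (dest, source) for `rank` in a ring of `size` ranks."""
    dest = (rank + 1) % size    # next rank, wrapping from size-1 back to 0
    source = (rank - 1) % size  # previous rank, wrapping from 0 back to size-1
    return dest, source


def simulate_ring(size: int) -> list[list[int] | None]:
    """Simulate what each rank receives when every rank sends [rank] to its dest."""
    received: list[list[int] | None] = [None] * size
    for rank in range(size):
        dest, _ = ring_neighbors(rank, size)
        received[dest] = [rank]  # mirrors data_send = [comm_rank]
    return received


if __name__ == "__main__":
    print(ring_neighbors(0, 8))  # (1, 7)
    print(simulate_ring(8))      # rank 0 holds [7], rank 1 holds [0], and so on
```

With 8 ranks, rank 0 receives `[7]` and every other rank receives the list holding the rank just before it, which is what should appear as the received value in each result file.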

 

 

Launch commands on the supercomputer (-R requests resources for each task):

One job with 8 tasks, each task requesting 120 CPUs:

/opt/batch/cli/bin/dsub  -n task_test -A xxxxxxxxxxxx --priority 9999 --job_retry 10 --job_type hmpi -R "cpu=120;mem=128" -N 8  -eo error.txt -oo output.txt /home/share/xxxxxxxxxx/home/xxxxxxx/xxxxxxx/run_python.sh

Run time: 6 min 43 s
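`run_queue` always starts 120 worker processes, which matches the `cpu=120` request. If a task is given fewer cores, those processes have to share them. Below is a minimal sketch of sizing the worker count from what the process may actually use. It assumes the scheduler enforces the request through CPU affinity, which this post does not confirm, and `available_cpus` and `worker_count` are hypothetical helper names.

```python
import os


def available_cpus() -> int:
    """Best-effort count of the CPU cores this process may run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: the set of cores the current process is allowed to use.
        return len(os.sched_getaffinity(0))
    # Fallback for other platforms: all cores visible to the OS.
    return os.cpu_count() or 1


def worker_count(requested: int) -> int:
    """Cap the requested number of workers at the available cores (at least 1)."""
    return max(1, min(requested, available_cpus()))


if __name__ == "__main__":
    print("cores available to this process:", available_cpus())
    print("workers to start for a request of 120:", worker_count(120))
```

Note that an affinity mask does not reflect a cgroup CPU quota, so on systems that limit CPU time that way the number reported here can be higher than what the task really gets.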

 

 

One job with 8 tasks, each task requesting 1 CPU:

/opt/batch/cli/bin/dsub  -n task_test -A xxxxxxxxxxxx --priority 9999 --job_retry 10 --job_type hmpi -R "cpu=1;mem=128" -N 8  -eo error.txt -oo output.txt /home/share/xxxxxxxxxx/home/xxxxxxx/xxxxxxx/run_python.sh