Performance comparison C++, Numpy and Cupy (CUDA) for differentiation

One of the interesting question, what can performance we can have from using Python in compare with C/C++, and what is the benefit to use CUDA for the same calculations. To answer this question, I do calculate derivatives for different dataset sizes and for different data types. How to calculate derivative, I do explain in this video about derivatives and in this short article

Performance was measured with float32 and fload64 data types with array sizes limited by memory size.

Computer parameters

CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

Memory: 8GB

GPU: NVIDIA GeForce GTX 1070, 8GB

For GPU code was transferred from NumPy to CUPA library. Performance from the data set was calculated as described in the video about linear regression and in this article

C++ code: This code was performed with float and double arrays

Also, it is nesesary to note, that before main cycle of calculations, some false calculations are performed, to reduce an impact instability with cache and quick memory realocation.


#include 
#include 
#include 

double calc( float *Z, float *Y, double dx, int s)
	{
	 auto fr = std::chrono::high_resolution_clock::now();
	 for( int k = 1; k < s+1; k++ )
		{
		 Z[k] = (Y[k+1]-Y[k])/dx;
		}
	 auto to = std::chrono::high_resolution_clock::now();
	 int res = std::chrono::duration_cast(to-fr).count();
	 return res;	 
	}

int main()
{
 long smax = 400000000;
 long sz = smax + 2;

 double xmin = -10;
 double dx = 0.01;
 
 float *X = new float[sz];
 float *Y = new float[sz];
 float *Z = new float[sz];

 for( int i = 0; i < sz; i++ )
	{
	 X[i] = xmin + dx*i;
	 Y[i] = sin(X[i]);
	}
 
 double xy = 0;
 double xx = 0;
 for( int i = 1000; i > 199; i-- )
	{
	 long s = pow(10,i/100.0);
	 if( s > smax ) continue;

	 double res = calc( Z, Y, dx, s);
	 xy += res * s;
	 xx += s * s;
	 printf("%ld - %f\n", s, res);
	}
 printf("k = %.3e", xy/xx);
}

Similar code was written for Python. Which was run with np.float32 and np.float64 for import cupy as np and import numpy as np. To obtain more compatible values, for CUDA, cycle = 20 was used to increase overall time into 20 times.


import cupy as np #import numpy as np
import datetime

def calc(s, cycle):
    fr = datetime.datetime.now()
    for kk in range(cycle):
        Z[1:s] = (Y[2:s+1]-Y[0:s-1])/(2*dx)
    to = datetime.datetime.now()
    return (to-fr).total_seconds()*1000

s_max = 400_000_000
cycle = 20
dt = np.float32 #  np.float64

sz = s_max + 2
xmin = -10.0
dx = 0.01

X = np.linspace(xmin, dx*(sz-1), sz, dtype=dt)
Y = np.sin(X, dtype=dt)
Z = np.zeros_like(X)

x = []
y = []

for ii in range(3):
    calc(s_max, cycle)


for i in range(1000, 199, -1):
    s = int(10**(i/100))
    if s > s_max:
        continue
    t = calc(s, cycle)
    print(s, t)
    x.append(s)
    y.append(t)

xx = np.array(x)
yy = np.array(y)
print("k = ", (np.sum(xx*yy)/np.sum(xx**2))/cycle)

Due to memory size, all calculations were performed with maximal size s_max = 400_000_000 elements for float32 and s_max = 200_000_000 for float64.

**Compare NumPy, CuPy and C++.**
Compare NumPy, CuPy and C++ performance for differentiation if a linear numverical function.

Original image: 603 x 459

The final results of performance given in the table below. Less is faster.

	float32	float64
C++	3.04x10^-6	3.77x10^-6
NumPy	1.69x10^-6	4.28x10^-6
CuPy	1.45x10^-7	2.06x10^-7

The same in real mathematical performance (calculated derivatives per second)

	float32	float64
C++	3.3x10⁸	2.7x10⁸
NumPy	5.9x10⁸	2.3x10⁸
CuPy	6.9x10⁹	4.9x10⁹

Or, transferred to relative values, with NumPy performance taken as a measure. 32/64 acceleration shows the acceleration of changing from 64 to 32 bytes float.

	float32	float64	32/64 acceleration
C++	0.55	1.14	1.24
NumPy	1	1	2.53
CuPy	11.7	20.7	1.42

Conclusion

If you a happy about memory usage, then CuPy is the best for Float 64 and NumPy is very good at float32. C++ is reasonable for using with float64

It is interesting to note, that on float32, the overall performance of C++ is very low in compare with Python.

One core usage

During this experiment, only one core of CPU was used according to CPU load = 100% dedicated to this script on the TOP monitor.

Published: 2022-07-26 11:45:27
Updated: 2022-07-26 13:43:09