NumPy/SciPy-compatible Array Library for GPU-accelerated Computing with Python
CuPy is an open-source array library for GPU-accelerated computing with Python. CuPy utilizes CUDA Toolkit libraries including cuBLAS, cuRAND, cuSOLVER, cuSPARSE, cuFFT, cuDNN and NCCL to make full use of the GPU architecture.
                        
The figure shows CuPy speedup over NumPy. Most operations perform well on a GPU using CuPy out of the box, and CuPy speeds up some operations by more than 100x. For details, read the original benchmark article, Single-GPU CuPy Speedups, on the RAPIDS AI Medium blog.
                    
CuPy's interface is highly compatible with NumPy and SciPy; in most cases it can be used as a drop-in replacement. All you need to do is replace numpy and scipy with cupy and cupyx.scipy in your Python code.
The Basics of CuPy tutorial covers the first steps with CuPy.
                        
CuPy supports various methods, indexing, data types, broadcasting and more. The comparison table lists NumPy / SciPy APIs and their corresponding CuPy implementations.
                    
>>> import cupy as cp
>>> x = cp.arange(6).reshape(2, 3).astype('f')
>>> x
array([[ 0.,  1.,  2.],
       [ 3.,  4.,  5.]], dtype=float32)
>>> x.sum(axis=1)
array([  3.,  12.], dtype=float32)
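As a rough illustration of the drop-in claim, the sketch below runs the same function against either library. The helper names get_xp and softmax are made up for this example and are not part of CuPy; the fallback to NumPy when no GPU is usable is an assumption added so the snippet runs anywhere.

```python
import numpy as np


def get_xp():
    """Return cupy if it imports and can run on a device, otherwise numpy.

    Hypothetical helper for this sketch; not a CuPy API.
    """
    try:
        import cupy as cp
        cp.arange(1).sum()  # force a real device operation
        return cp
    except Exception:
        return np


def softmax(a, xp):
    """Row-wise softmax written once against the shared NumPy-style API."""
    shifted = a - a.max(axis=1, keepdims=True)  # subtract row max for stability
    e = xp.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


xp = get_xp()
data = xp.arange(12, dtype=xp.float32).reshape(4, 3)
result = softmax(data, xp)

# Copy back to host memory only when the result lives on the GPU.
host = result if xp is np else xp.asnumpy(result)
print(type(host).__name__, host.shape)
```

CuPy also provides cupy.get_array_module, which returns cupy or numpy depending on the arrays passed to it; that can replace the explicit xp argument when CuPy is known to be installed.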
The easiest way to install CuPy is to use pip. CuPy provides wheels (precompiled binary packages) for Linux and Windows. Read the Installation Guide for more details.
CuPy can also be installed from Conda-Forge or from source code.
                    
# For CUDA 11.2 ~ 11.x
pip install cupy-cuda11x
# For CUDA 12.x
pip install cupy-cuda12x
# For CUDA 13.x
pip install cupy-cuda13x
# For AMD ROCm 4.3
pip install cupy-rocm-4-3
# For AMD ROCm 5.0
pip install cupy-rocm-5-0
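After installing one of the packages above, a quick check along the following lines can confirm that CuPy imports and can see a GPU. This is a minimal sketch; the function name cupy_status is hypothetical, and the broad exception handling is an assumption meant to cover a missing driver or a wheel that does not match the installed CUDA or ROCm version.

```python
import importlib.util


def cupy_status():
    """Describe whether CuPy is importable and how many GPUs it can see."""
    if importlib.util.find_spec("cupy") is None:
        return "CuPy is not installed"
    try:
        import cupy as cp
        count = cp.cuda.runtime.getDeviceCount()
    except Exception as exc:  # e.g. missing driver or mismatched toolkit
        return f"CuPy is installed but not usable: {exc}"
    return f"CuPy {cp.__version__} sees {count} GPU device(s)"


print(cupy_status())
```

For a fuller report of the detected toolkit and library versions, cupy.show_config() prints the runtime configuration.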
If you want your code to run faster, you can write a custom CUDA kernel with only a small snippet of C++. CuPy automatically wraps and compiles it into a CUDA binary. Compiled binaries are cached and reused in subsequent runs. See the User-Defined Kernels tutorial for details.
You can also use raw CUDA kernels via Raw modules.
                    
>>> x = cp.arange(6, dtype='f').reshape(2, 3)
>>> y = cp.arange(3, dtype='f')
>>> kernel = cp.ElementwiseKernel(
...     'float32 x, float32 y', 'float32 z',
...     '''
...     if (x - 2 > y) {
...       z = x * y;
...     } else {
...       z = x + y;
...     }
...     ''', 'my_kernel')
>>> kernel(x, y)
array([[  0.,   2.,   4.],
       [  0.,   4.,  10.]], dtype=float32)
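For the raw-kernel route, a minimal sketch using cupy.RawModule might look like the following. The kernel name scale_add, the wrapper run_scale_add, and the block size of 256 are choices made for this example only; the wrapper returns None when no usable GPU is found so that the snippet can be run safely on any machine.

```python
import numpy as np

# CUDA C++ source. The kernel `scale_add` is a made-up example: z = a * x + y.
SOURCE = r'''
extern "C" __global__
void scale_add(const float* x, const float* y, float* z, float a, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {  // guard threads past the end of the arrays
        z[i] = a * x[i] + y[i];
    }
}
'''


def run_scale_add(n=1000, a=2.0):
    """Return a * x + y as a NumPy array, or None if no GPU is usable."""
    try:
        import cupy as cp
        cp.cuda.runtime.getDeviceCount()
    except Exception:
        return None

    module = cp.RawModule(code=SOURCE)         # compiled on first use
    kernel = module.get_function('scale_add')  # look the kernel up by name

    x = cp.arange(n, dtype=cp.float32)
    y = cp.ones(n, dtype=cp.float32)
    z = cp.empty_like(x)

    threads = 256
    blocks = (n + threads - 1) // threads      # ceiling division
    # Scalars are passed as NumPy types so they match `float` and `int`.
    kernel((blocks,), (threads,), (x, y, z, np.float32(a), np.int32(n)))
    return cp.asnumpy(z)


out = run_scale_add(8, 2.0)
print("no usable GPU" if out is None else out)
```

Unlike ElementwiseKernel, a raw kernel leaves indexing, bounds checks, and the grid and block sizes to the caller, which is why the launch takes explicit (blocks,) and (threads,) tuples.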