Revisiting the postcard pathtracer with CUDA and Optix

May 18, 2020

Last week, I revisited Andrew Kensler's business card raytracer to make it run as fast as possible[1]. It was a lot of fun and I learned a lot in the process. I enjoyed it so much that I decided to revisit Andrew's follow-up, the postcard-sized pathtracer[2].

I spent a lot of time trying to make sense of that code back in 2018, and I knew it was going to be a bigger challenge. Because it is a pathtracer and not a raytracer, it is more computationally intensive: the quality of the image is directly correlated with how many rays are cast per pixel.

This twist allowed me to change the rules of my self-imposed game. Instead of attempting to run as fast as possible, I decided on a time budget (one minute) and resolved to generate the best image possible. The name of the game was to maximize the samples per pixel (spp) at a resolution of 960×540.

The hardware used was a DB4[3] cube containing a Ryzen 5 2600 (6 cores, 12 threads at 3.5GHz) with a Pascal-based Nvidia GTX 1050 Ti GPU.


I followed the same path as last time: starting with compiler flags, then moving to SIMD, multi-threading, and GPU rendering via CUDA. In the end, despite my skepticism, I gave the OptiX denoiser a try and it blew my mind.

Cleaning up the code


At first sight, the postcard pathtracer code (pixar.cpp) is a tad intimidating.

#include <stdlib.h> // card > pixar.ppm
#include <stdio.h>
#include <math.h>
#define R return
#define O operator
typedef float F;typedef int I;
struct V{F x,y,z;V(F v=0){x=y=z=v;}V(F a,F b,F c=0){x=a;y=b;z=c;}
V O+(V r){R V(x+r.x,y+r.y,z+r.z);}V O*(V r){R V(x*r.x,y*r.y,z*r.z);}
F O%(V r){R x*r.x+y*r.y+z*r.z;}V O!(){R*this*(1/sqrtf(*this%*this));}};
F L(F l,F r){R l<r?l:r;}F U(){R(F)rand()/RAND_MAX;}
F B(V p,V l,V h){l=p+l*-1;h=h+p*-1;
R-L(L(L(l.x,h.x),L(l.y,h.y)),L(l.z,h.z));}
F S(V p,I&m){F d=1e9;V f=p;f.z=0;
char l[]="5O5_5W9W5_9_COC_AOEOA_E_IOQ_I_QOUOY_Y_]OWW[WaOa_aWeWa_e_cWiO";
for(I i=0;i<60;i+=4){V b=V(l[i]-79,l[i+1]-79)*.5,
e=V(l[i+2]-79,l[i+3]-79)*.5+b*-1,
o=f+(b+e*L(-L((b+f*-1)%e/(e%e),0),1))*-1;d=L(d,o%o);}d=sqrtf(d);
V a[]={V(-11,6),V(11,6)};
for(I i=2;i--;){V o=f+a[i]*-1;
d=L(d,o.x>0?fabsf(sqrtf(o%o)-2):(o.y+=o.y>0?-2:2,sqrtf(o%o)));}
d=powf(powf(d,8)+powf(p.z,8),.125)-.5;m=1;
F r=L(-L(B(p,V(-30,-.5,-30),V(30,18,30)),B(p,V(-25,17,-25),V(25,20,25))),
B(V(fmodf(fabsf(p.x),8),p.y,p.z),V(1.5,18.5,-25),V(6.5,20,25)));
if(r<d)d=r,m=2;F s=19.9-p.y;if(s<d)d=s,m=3;R d;}
I M(V o,V d,V&h,V&n){I m,s=0;F t=0,c;
for(;t<100;t+=c)if((c=S(h=o+d*t,m))<.01||t>99)
R n=!V(S(h+V(.01,0),s)-c,S(h+V(0,.01),s)-c,S(h+V(0,0,.01),s)-c),m;R 0;}
V T(V o,V d){V h,n,r,t=1,l(!V(.6,.6,1));
for(I b=3;b--;){I m=M(o,d,h,n);if(!m)break;
if(m==1){d=d+n*(n%d*-2);o=h+d*.1;t=t*.2;}
if(m==2){F i=n%l,p=6.283185*U(),c=U(),s=sqrtf(1-c),g=n.z<0?-1:1,
u=-1/(g+n.z),v=n.x*n.y*u;d=V(v,g+n.y*n.y*u,-n.y)*(cosf(p)*s)+
V(1+g*n.x*n.x*u,g*v,-g*n.x)*(sinf(p)*s)+n*sqrtf(c);o=h+d*.1;t=t*.2;
if(i>0&&M(h+n*.1,l,h,n)==3)r=r+t*V(500,400,100)*i;}
if(m==3){r=r+t*V(50,80,100);break;}}R r;}
I main(){I w=960,h=540,s=16;V e(-22,5,25),g=!(V(-3,4,0)+e*-1),
l=!V(g.z,0,-g.x)*(1./w),
u(g.y*l.z-g.z*l.y,g.z*l.x-g.x*l.z,g.x*l.y-g.y*l.x);
printf("P6 %d %d 255 ",w,h);
for(I y=h;y--;)for(I x=w;x--;){V c;
for(I p=s;p--;)c=c+T(e,!(g+l*(x-w/2+U())+u*(y-h/2+U())));
c=c*(1./s)+14./241;V o=c+1;
c=V(c.x/o.x,c.y/o.y,c.z/o.z)*255;
printf("%c%c%c",(I)c.x,(I)c.y,(I)c.z);}}// Andrew Kensler

Adding colors to outline the different parts helps to understand that it is in fact very simple: seven sections built around a pixel sampler that ray-marches into a database.


Usual tricks, Vector class, Utils, Database, Ray marching, Sampling, main.

With a bit of renaming and clang-tidy, a.cpp gives a much clearer picture.
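To give an idea of what the cleanup yields, here is a sketch of the V struct with macros expanded and members renamed (the names are my own guesses, not necessarily those used in a.cpp):

```cpp
#include <cmath>

// The postcard's V struct, deobfuscated: '%' was the dot product
// and '!' was normalization in the original single-letter code.
struct Vec3 {
  float x, y, z;
  Vec3(float v = 0) { x = y = z = v; }
  Vec3(float a, float b, float c = 0) { x = a; y = b; z = c; }
  Vec3 operator+(Vec3 r) const { return Vec3(x + r.x, y + r.y, z + r.z); }
  Vec3 operator*(Vec3 r) const { return Vec3(x * r.x, y * r.y, z * r.z); }
  float dot(Vec3 r) const { return x * r.x + y * r.y + z * r.z; }
  Vec3 normalized() const { return *this * (1 / sqrtf(dot(*this))); }
};
```

Note that operator* is component-wise; scalar multiplication falls out of the implicit float-to-Vec3 constructor, exactly as in the original.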

Establishing the baseline


To make testing easier, I modified the code to take the number of samples per pixel as a parameter. The one-minute budget was completely blown when establishing the baseline (with all optimizations disabled). Even with spp lowered to one, it took 5m27s to generate a very noisy image.

$ clang -O0 -lm -o a a.cpp 
$ time ./a 1 > /dev/null

real    5m27.311s
user    5m27.094s
sys     0m0.078s

One core, no SIMD, -O0, and 1spp = 5m27s runtime.

Compiler optimizations


The first step was to enable compiler optimizations (some people would even argue that there is nothing below -O3 and that it should have been the baseline). Performance improved 30x, resulting in 6spp.

$ clang -lm -O3 -o a a.cpp 
$ time ./a 6 > /dev/null

real    0m58.100s
user    0m58.266s
sys     0m0.031s

With -O3 the pathtracer can be pushed to 6spp and renders in 58s.

-ffast-math


There is a compilation flag, -ffast-math, which allows the compiler to relax IEEE 754 compliance in favor of performance. It is automatically enabled when using -Ofast and shows another 2.6x performance improvement allowing 16spp.

$ clang -lm -Ofast -o a a.cpp 
$ time ./a 16 > /dev/null 

real    0m56.304s
user    0m56.266s
sys     0m0.031s

With -ffast-math we cross into double-digit spp, reaching 16.
Trivia: Opening the executable with BinaryNinja showed that clang was able to detect two calls to cosf(X) and sinf(X) separated by several lines of code and optimize them into a single sincosf call (which I had no idea existed). Surprisingly, this only happened with -Ofast and not with -O3.
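The pattern clang recognized looks roughly like this (my reconstruction of the shape of the code, not the pathtracer's exact lines):

```cpp
#include <cmath>

// cosf and sinf called on the same angle, separated by other work.
// With -Ofast, clang may fuse the pair into one sincosf call that
// computes both values at once.
void rotate2d(float angle, float x, float y, float* ox, float* oy) {
  float c = cosf(angle);
  // ... unrelated computation can sit between the two calls ...
  float s = sinf(angle);
  *ox = c * x - s * y;
  *oy = s * x + c * y;
}
```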

Going SIMD


Going SIMD is something I also did when I revisited the business card raytracer. Or at least I thought I did. I was mistaken: after opening the binary and seeing XMM registers in use, I had concluded that SIMD instructions were being used.

Just because the compiler outputs instructions that touch SIMD registers does not mean the work is vectorized. In the asm you can see that pretty much all mul/add instructions have an ss or sd suffix, meaning they operate on a single data element. What you want are mulps/mulpd instructions.

While Clang is able to generate SIMD instructions via auto-vectorization, I found the feature capricious to trigger.

struct Vec{

 float p[3];

 // NOT SIMD!!
 Vec operator+(Vec o){
   Vec v;
   for(int i=0 ; i<3 ; i++){
     v.p[i] = p[i] + o.p[i];
   }
   return v;
 }
};

struct Vec{

 float p[4];

 // NOT SIMD!!
 Vec operator+(Vec o){
   Vec v;
   for(int i=0 ; i<3 ; i++){
     v.p[i] = p[i] + o.p[i];
   }
   return v;
 }
};

struct Vec{

 float p[4];

 // SIMD!!
 Vec operator+(Vec o){
   Vec v;
   for(int i=0 ; i<4 ; i++){
     v.p[i] = p[i] + o.p[i];
   }
   return v;
 }
};

In a professional environment I would probably not have felt confident relying on the compiler's goodwill. Using intrinsics via "nmmintrin.h" (__m128, _mm_set_ps, _mm_add_ps, and _mm_div_ps[5]) would have been safer. For research purposes, however, it was fine.
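For comparison, an intrinsics-based vector would look something like this (a sketch assuming the SSE headers, not code from this project):

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_set_ps, _mm_add_ps, _mm_div_ps

// Four-wide vector written explicitly with SSE intrinsics, which
// guarantees packed (addps/divps) instructions regardless of the
// auto-vectorizer's mood.
struct Vec4 {
  __m128 p;
  // _mm_set_ps takes lanes highest-first, so pass d,c,b,a.
  Vec4(float a, float b, float c, float d) { p = _mm_set_ps(d, c, b, a); }
  Vec4(__m128 v) : p(v) {}
  Vec4 operator+(Vec4 o) const { return Vec4(_mm_add_ps(p, o.p)); }
  Vec4 operator/(Vec4 o) const { return Vec4(_mm_div_ps(p, o.p)); }
  float x() const { return _mm_cvtss_f32(p); }  // extract lane 0
};
```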

Modifying the Vector struct to operate on four items (b.cpp) secured the packed SIMD instructions mulps and addps.

$ clang -lm -Ofast -o b b.cpp 
$ time ./b 17 > /dev/null 

real    0m59.722s
user    0m59.719s
sys     0m0.000s

The speed increase was marginal (17spp) and the visual difference imperceptible.

Using all cores


At this point, it was time to use the 11 other hardware threads of the Ryzen 5 2600. The task was considerably facilitated by the code from the previous exercise. A few pthread_create and pthread_join calls resulted in c.cpp, which spawns 540 threads, each rendering a line of 960 pixels.
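The threading structure can be sketched as follows (a simplified stand-in, not c.cpp itself: render_line fills a constant instead of pathtracing, and the names are mine):

```cpp
#include <pthread.h>
#include <vector>

const int W = 960, H = 540;
static float framebuffer[H][W];

// One thread per scanline; rows never overlap, so no locking is needed.
static void* render_line(void* arg) {
  long y = (long)arg;
  for (int x = 0; x < W; x++)
    framebuffer[y][x] = 0.5f;  // placeholder for the real per-pixel sampling
  return nullptr;
}

int render_all() {
  std::vector<pthread_t> threads(H);
  for (long y = 0; y < H; y++)
    pthread_create(&threads[y], nullptr, render_line, (void*)y);
  for (int y = 0; y < H; y++)
    pthread_join(threads[y], nullptr);
  return 0;
}
```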

$ clang -lm -lpthread -Ofast -o c c.cpp 
$ time ./c 128 > /dev/null 

real    0m55.203s
user    10m52.188s
sys     0m0.063s

As expected, the performance gain was linear with the number of cores. Twelve of them allowed sampling to be pushed to 128spp. However, the visual result had too much personality for my liking.
It is fast but also ugly. What is going on here?
There is a visible repeating horizontal pattern. Could it be explained by the random generator, implemented by truncating libc's rand()[6] and keeping only the linear congruential generator (LCG)?

Nope. The LCG is fine. The problem was that each thread's pseudo-random number generator was initialized with the same seed value, which means the same "random" sequence was used across each line. Changing the seed value to use the thread id fixed the glitch (d.cpp).
12 cores, -Ofast -> 128spp. Still noisy.
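A minimal illustration of the fix (a sketch using glibc's LCG constants, not d.cpp verbatim): each thread owns a generator seeded with its thread id, so no two scanlines share a sequence.

```cpp
// glibc-style linear congruential generator. Seeding with the thread
// id gives every scanline its own sequence, killing the repeating
// horizontal pattern.
struct Lcg {
  unsigned state;
  Lcg(unsigned seed) : state(seed) {}
  float next() {
    state = state * 1103515245u + 12345u;             // glibc LCG step
    return (float)(state & 0x7fffffff) / 2147483647.0f;  // map to [0,1]
  }
};
```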
Trivia: The business card raytracer optimization from last week prompted feedback from readers advising that 500+ threads were messing with the scheduler and likely degraded performance. I tried to lower the thread count (12, 24, and even 36) but could not confirm this claim. An OS may have trouble dealing with 10,000 threads, but it looks like Linux was well able to handle 500 of them.

GPGPU with CUDA


All optimizations used for the business card raytracer were applied to e.cu.

  • Maximize occupancy -> 0.2 :/.
  • Maximize branch efficiency -> 90% :).
  • Avoid float64.
  • Use intrinsics.
  • Use -use_fast_math flag.
C:\Users\fab>nvcc -O3 -o e -use_fast_math e.cu
C:\Users\fab>e.exe 2048 > NUL
Time: 59s

The speed gain (2048spp) was monstrous. Unfortunately, so was the generated image.
Probably a bug in the kernel.

That bug took an afternoon to track down. Ultimately the culprit was determined to be the -use_fast_math flag, which made the __powf intrinsic too inaccurate for ray-marching. Replacing it with manual multiplication (f.cu) solved the problem but also reduced spp to 600. There was also another issue.
Quite artistic but also quite not right.
The pseudo-random generator had been converted in a hurry, using the same uninitialized seed for a whole Streaming Multiprocessor. Using cuRAND and having a seed per block/thread solved the issue (g.cu).
GPGPU rendering, 600spp. And still noisy.
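The __powf replacement boils down to computing powf(d, 8) by repeated squaring, which uses only exact multiplies. A sketch of the idea (plain C++ here; in f.cu the function would be marked __device__):

```cpp
// powf(d, 8) as three squarings: d^2, d^4, d^8. Plain multiplications
// avoid the precision loss of CUDA's __powf intrinsic under
// -use_fast_math, which was enough to break the ray-marching.
inline float pow8(float d) {
  float d2 = d * d;
  float d4 = d2 * d2;
  return d4 * d4;
}
```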

Denoising


Starting with OptiX 5.0, NVidia provides an A.I.-based denoiser with a pre-trained neural network. Even though NVidia claimed the model was "trained using tens of thousands of images rendered from one thousand 3D scenes", I had reservations about the result. I didn't even check the API and instead used Declan Russell's standalone executable[7].

The denoiser ran in 300ms and the result made me eat my words. It is so good it is mind-blowing. A denoised 600spp image which takes one minute to render is equivalent to a 40,960spp image which would take an hour to render.

Below are presented the denoised versions (d) from 1spp to 2048spp. There is a minimum under which denoising does not work well (1spp and 2spp :P). Conversely, there is a threshold after which returns diminish. The sweet spot for this method seems to be around 512spp.

References



  • Original Source