Background
Given a description of the robot, an initial configuration, a set of goal configurations, and a set of obstacles, the robot motion planning problem is to find a path that starts from the initial configuration and reaches a goal configuration while avoiding collisions with the obstacles. Motion planning is computationally challenging due to the exploratory nature of the problem, and the complexity increases further if the motion of the robot is constrained by physical and mechanical parameters. Incremental sampling-based algorithms, such as the Rapidly-exploring Random Tree (RRT) algorithm, are used to find a path if one exists. These algorithms are probabilistically complete in the sense that the probability that they return a solution, if one exists, converges to one as the number of samples approaches infinity.
Challenge
The RRT algorithm is only probabilistically complete, and hence it may need to sample a large number of nodes before finding a solution. Initially, however, there are very few leaf nodes to drive the exploration, so the workload is considerably low and most of the resources sit idle. As the tree expands, the available parallelism increases and resource utilisation improves. Beyond a threshold, however, there are so many open nodes that we cannot explore all of them in parallel. Hence, due to the nature of the algorithm, the workload varies throughout the lifecycle of the program.
Another challenge is keeping track of the nodes visited by other threads running in parallel. Ideally, multiple threads should not explore the same nodes, so there should be some sort of directory that keeps track of the nodes visited by all the threads.
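One possible realization of such a directory (illustrative only; the names and the single-lock design are assumptions, not a description of our implementation) is a shared hash set of node UIDs guarded by a lock, which a thread consults before committing to explore a node:

    #include <mutex>
    #include <unordered_set>

    // Hypothetical shared directory of visited node UIDs. A single lock keeps
    // concurrent inserts from racing; a real implementation might prefer a
    // finer-grained or lock-free structure.
    class VisitedDirectory {
    public:
        // Returns true if the node had not been claimed by any thread yet.
        bool tryClaim(int uid) {
            std::lock_guard<std::mutex> guard(lock_);
            return visited_.insert(uid).second;
        }
    private:
        std::mutex lock_;
        std::unordered_set<int> visited_;
    };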
The number of nodes required depends largely on the epsilon (step size) value we choose for the exploration. In vanilla RRT this value is kept constant, but that causes the workload imbalance mentioned above. Dynamically updating this value based on resource utilisation might yield better speedups.
Approach
The underlying data structure, NaiveTree, and the RRT algorithm are implemented entirely in C++. The NaiveTree class is written from scratch along with all the required supporting data structures. RRT is a probabilistic algorithm that samples random points to add to the tree, which can lead to significantly different execution times and path lengths across runs with the same setup. To minimize the effect of random number generation, we set the random number generator seed to 0 before starting each run.
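As a rough illustration of this serial setup, the sketch below uses a heavily simplified stand-in for NaiveTree and hypothetical parameters (step size, map bounds, iteration budget, obstacle check); the actual class and RRT loop in our code carry more state:

    #include <algorithm>
    #include <cmath>
    #include <random>
    #include <vector>

    struct Node { double x, y; int parent; };        // parent index; -1 for the root

    struct NaiveTree {                               // simplified stand-in for the real class
        std::vector<Node> nodes;
        int insert(double x, double y, int parent) {
            nodes.push_back({x, y, parent});
            return (int)nodes.size() - 1;            // index doubles as the node UID
        }
    };

    bool inCollision(double, double) { return false; }  // placeholder: obstacle-free map

    int nearest(const NaiveTree& t, double x, double y) {
        int best = 0; double bestD = 1e18;
        for (int i = 0; i < (int)t.nodes.size(); ++i) {
            double dx = t.nodes[i].x - x, dy = t.nodes[i].y - y;
            double d = dx * dx + dy * dy;
            if (d < bestD) { bestD = d; best = i; }
        }
        return best;
    }

    // One serial RRT iteration: sample, find nearest, steer by eps, check, insert.
    void rrtStep(NaiveTree& t, std::mt19937& rng, double eps) {
        std::uniform_real_distribution<double> coord(-5.0, 5.0);   // assumed map bounds
        double sx = coord(rng), sy = coord(rng);
        int n = nearest(t, sx, sy);
        double dx = sx - t.nodes[n].x, dy = sy - t.nodes[n].y;
        double len = std::sqrt(dx * dx + dy * dy);
        if (len < 1e-9) return;
        double step = std::min(eps, len);
        double nx = t.nodes[n].x + step * dx / len, ny = t.nodes[n].y + step * dy / len;
        if (!inCollision(nx, ny)) t.insert(nx, ny, n);
    }

    int main() {
        std::mt19937 rng(0);                         // fixed seed, as described above
        NaiveTree tree;
        tree.insert(-4.5, -4.5, -1);                 // start configuration used for Map 1 (see below)
        for (int i = 0; i < 1000; ++i) rrtStep(tree, rng, 0.5);
        // ...stop once a node lands within eps of the goal and backtrack via parent indices
    }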
We parallelize rapidly-exploring random tree path planning for robots on both the CPU and the GPU, and perform a detailed analysis of the performance of both systems. We have implemented a serial 2D version of the algorithm on two maps with obstacles, parallelized the algorithm on up to 8 cores using OpenMP, and implemented the same algorithm on the GPU using CUDA.
OpenMP
Our first approach was to parallelize only the function that computes the nearest node in the tree. This sped up the execution, but a lot of computation still had to be done by a single thread: the random sampling, determining the node to be inserted, checking whether that node is in collision, and finally appending the node to the tree data structure.
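A sketch of what this first scheme looks like (hypothetical names; only the distance scan runs in parallel, everything else stays on one thread): each thread scans a slice of the nodes, keeps a thread-local best candidate, and the candidates are merged at the end.

    #include <omp.h>
    #include <vector>

    struct Node { double x, y; };

    // Parallel nearest-node search: thread-local minima merged in a critical section.
    int nearestParallel(const std::vector<Node>& nodes, double qx, double qy) {
        int best = -1;
        double bestD = 1e18;
        #pragma omp parallel
        {
            int localBest = -1;
            double localD = 1e18;
            #pragma omp for nowait
            for (int i = 0; i < (int)nodes.size(); ++i) {
                double dx = nodes[i].x - qx, dy = nodes[i].y - qy;
                double d = dx * dx + dy * dy;
                if (d < localD) { localD = d; localBest = i; }
            }
            #pragma omp critical
            if (localD < bestD) { bestD = localD; best = localBest; }
        }
        return best;
    }

    int main() {
        std::vector<Node> tree = {{0.0, 0.0}, {1.0, 1.0}, {2.0, 0.5}};
        return nearestParallel(tree, 1.8, 0.4) == 2 ? 0 : 1;   // expect index 2
    }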
The second approach is to sample multiple random nodes in parallel, one per thread, with all the computation related to a node done by its assigned thread. The section where a node is added to the tree data structure is serialized to avoid multiple nodes with the same UID. This approach changes the algorithm slightly: because multiple nodes are sampled in parallel, the nearest node returned by a thread may not be the true nearest node; in the worst case it is the nth nearest node when running with n threads. However, this does not change the outcome of the algorithm. Since RRT is probabilistically complete, it will still find a path, but it will need to sample more nodes in order to do so.
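A compact sketch of this second scheme, again with simplified structures and assumed parameters (thread count, step size, bounds, iteration budget, and per-thread RNG seeding are illustrative rather than our exact settings): every thread samples, searches, steers, and collision-checks privately, and only the append to the tree is inside a critical section so UIDs stay unique.

    #include <algorithm>
    #include <cmath>
    #include <omp.h>
    #include <random>
    #include <vector>

    struct Node { double x, y; int parent; };

    bool inCollision(double, double) { return false; }      // placeholder: obstacle-free map

    int main() {
        const int maxNodes = 100000;
        std::vector<Node> tree(maxNodes);                    // pre-allocated storage
        int count = 0;                                       // number of valid nodes
        tree[count++] = {-4.5, -4.5, -1};                    // root node
        const double eps = 0.5;                              // assumed step size

        #pragma omp parallel num_threads(8)
        {
            std::mt19937 rng(omp_get_thread_num());          // per-thread RNG stream (illustrative seeding)
            std::uniform_real_distribution<double> coord(-5.0, 5.0);

            for (int iter = 0; iter < 1000; ++iter) {
                double sx = coord(rng), sy = coord(rng);

                int snapshot;                                // tree size seen by this thread
                #pragma omp critical(tree_append)
                snapshot = count;

                // Private nearest-node search over the snapshot; other threads may
                // append meanwhile, which is the "nth nearest" effect described above.
                int n = 0; double bestD = 1e18;
                for (int i = 0; i < snapshot; ++i) {
                    double dx = tree[i].x - sx, dy = tree[i].y - sy;
                    double d = dx * dx + dy * dy;
                    if (d < bestD) { bestD = d; n = i; }
                }

                double dx = sx - tree[n].x, dy = sy - tree[n].y;
                double len = std::sqrt(dx * dx + dy * dy);
                if (len < 1e-9) continue;
                double step = std::min(eps, len);
                double nx = tree[n].x + step * dx / len, ny = tree[n].y + step * dy / len;
                if (inCollision(nx, ny)) continue;

                // Only the append is serialized, so every node gets a unique index/UID.
                #pragma omp critical(tree_append)
                if (count < maxNodes) tree[count++] = {nx, ny, n};
            }
        }
    }

Serializing only the append keeps the critical section tiny, which is why this scheme scales better than parallelizing the nearest-node search alone.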
Map 1 - We ran RRT with extreme points. The starting point is (-4.5, -4.5) while the target is (4.5, -4.5). These points force the algorithm to find a zigzag path through the obstacles.
6 Random starting and target points
Key Observations
We observe sublinear speedup as we increase the number of cores; the average speedup on 8 cores across our experiments is approximately 6x. This is due to the serialisation involved when we modify the tree data structure. We also observe that as the number of cores increases, the number of nodes sampled increases. This happens because the selected nearest node can be up to the nth nearest node when n threads run in parallel, which makes it necessary to sample more nodes to reach the target. Hence, even for the same problem, the total work done to find a path increases as the number of threads increases.
The second prominent observation is that the parallel version performs better as the problem size increases. For small problem sizes, the number of nodes sampled is small and the ratio of incorrectly computed nearest nodes to total nodes is high, resulting in extra work and lower speedup. As the problem size increases, this ratio drops considerably, bringing the speedup much closer to linear. There are some outliers in this trend because RRT depends on random sampling, and in some cases it may find a valid path in less time simply by getting lucky.
CUDA
Random starting points on Map 2
Key Observations
From the above results we can see that as we introduce more obstacles (Map 2 has more obstacles than Map 1), the GPU implementation achieves a significant speedup over 8-core OpenMP. On Map 1, cases 1 and 3 show a slowdown because more time is spent transferring memory to and from the GPU than computing whether a point collides with the obstacles. On Map 2, which has more obstacles in the 2D space, we see a significant speedup compared to the 8-core OpenMP solution. This supports our interpretation that the speedup comes from parallelizing collision checking. In the future we would like to parallelize the selection of the next point from a given point using CUDA instead of OpenMP, since the GPU can launch many more threads.
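For reference, the sketch below shows the kind of kernel and host round trip involved when collision checking is offloaded to the GPU; the circular obstacle type, the names, and the per-query transfer pattern are assumptions for illustration rather than our exact implementation, but they show why transfer time can dominate when the obstacle count is small:

    #include <cuda_runtime.h>
    #include <cstdio>

    struct Circle { float x, y, r; };   // assumed obstacle representation

    // One thread per obstacle: set the flag if the query point lies inside it.
    __global__ void collisionKernel(const Circle* obs, int n,
                                    float px, float py, int* hit) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float dx = px - obs[i].x, dy = py - obs[i].y;
        if (dx * dx + dy * dy <= obs[i].r * obs[i].r) atomicExch(hit, 1);
    }

    // Host-side check: one kernel launch plus a device-to-host copy per query point.
    bool inCollisionGPU(const Circle* d_obs, int n, float px, float py, int* d_hit) {
        cudaMemset(d_hit, 0, sizeof(int));
        int threads = 128, blocks = (n + threads - 1) / threads;
        collisionKernel<<<blocks, threads>>>(d_obs, n, px, py, d_hit);
        int hit = 0;
        cudaMemcpy(&hit, d_hit, sizeof(int), cudaMemcpyDeviceToHost);
        return hit != 0;
    }

    int main() {
        Circle h_obs[] = {{0.f, 0.f, 1.f}, {2.f, 2.f, 0.5f}};
        Circle* d_obs; int* d_hit;
        cudaMalloc((void**)&d_obs, sizeof(h_obs));
        cudaMalloc((void**)&d_hit, sizeof(int));
        cudaMemcpy(d_obs, h_obs, sizeof(h_obs), cudaMemcpyHostToDevice);
        printf("collision at (0.5, 0.5): %d\n", inCollisionGPU(d_obs, 2, 0.5f, 0.5f, d_hit));
        cudaFree(d_obs);
        cudaFree(d_hit);
        return 0;
    }

With only a handful of obstacles the kernel does almost no work, so the launch and the device-to-host copy dominate (the Map 1 slowdown); with many obstacles the per-thread arithmetic amortizes those fixed costs (the Map 2 speedup).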
References
http://joonlecture.blogspot.com/2011/02/improving-optimality-of-rrt-rrt.html
https://en.wikipedia.org/wiki/Rapidly-exploring_random_tree
16-662 Robot Autonomy course material