# Final Report

## Summary

We implemented label propagation, a graph-based semi-supervised learning algorithm, in parallel using OpenMP and then a hybrid of MPI and OpenMP on the latedays cluster. We also experimented with different techniques to improve the speedup such as reordering the order in which the nodes in the graph are visited and using low-precision math.

## Background

*Semi-supervised learning* (SSL) involves learning functions from a dataset S which has two partitions: the labeled portion of the set L and the unlabeled portion of the set U. Typically, |U| » |L| since it can be quite difficult to label or annotate all available data. If a supervised learning approach was used, then U would be ignored in order to learn the function. If an unsupervised learning approach was used, then clustering could be done on L and U together (this is S) ignoring the labels and then the labels from L could be used to label the clusters. This hybrid version of learning achieves better results in practice than unsupervised learning while having lower costs than supervised learning since labeling data is expensive. In many use cases, the objective of learning the function is just to label the datapoints in U; the learning algorithm a priori knows what points it will be predicting on. This sort of learning settings is known as *transductive learning*.

**Figure 1:** *Example of graph with labeled and unlabeled data.*

In graph-based SSL, the data is assumed to lie on a low-dimensional manifold so that a graph representation reasonably approximates the relationship between the data. Additionally, the data is assumed to have a local smoothness property: data close together is more likely to have the same label.

SSL algorithms, such as label propagation, which was proposed by Zhu *et al.*, are essentially vertex programs where updates at a vertex are a function of the values in the neighborhood. These vertex programs generally perform basic arithmetic operations such as multiplication, addition, and division. Each of these vertex programs can be run in parallel in a multiple instruction, multiple data (MIMD) manner. However, there are different consistency models as the program can return different values for a node depending on what other nodes have vertex programs running at the same time.

The entire implementation consists of three distinct parts. The first part involves reading in the graph and labels files and generating an internal graph representation. The second part involves reordering the nodes, but this part does not exist in the naive implementation. The third part involves running the actual label propagation algorithm on the graph.

## Approach

We used OpenMP on the Xeon Phis and MPI on the CPUs on the latedays cluster. Our OpenMP implementation targeted a shared memory system. The hybrid OpenMP-MPI implementation targeted a multi-node system. Almost all the code was written in C++. The only code note written in C++ was the code used to convert the Freebase1 and Freebase2 graphs to the appropriate formats. We used the PySpark API in Python to have Apache Spark efficiently parallelize this conversion on large graphs.

We parallelized the label propagation computation by running the vertex program on multiple vertices at the same time. Each thread was mapped to a vertex program running on a single vertex in both the OpenMP and hybrid OpenMP-MPI implementations. The only difference in between the two implementations in this regard are that the threads used in the OpenMP implementation were all on one node, while they were split across many nodes in the hybrid OpenMP-MPI implementation.

In order to further improve the speedup from the naive OpenMP implementation, we decided to use the reordering technique proposed by Bilmes and Subramanya. Although their work uses the measure propagation algorithm that they proposed earlier, we hypothesized that the reordering could improve performance for label propagation as well. The fundamental issue that the reordering addresses is poor cache performance due to different threads working on vertices with very little overlap in terms of their neighbors. As a result, the working set would increase by a lot as the number of threads went up and might not fit in the cache. In order to address this issue, they proposed a reordering in which vertices that are near each other in the ordering have a high number of common neighbors. This helps reduce the size of the working set and improve performance. Techniques such as this reordering that reorganize the graph structure to increase locality are widely used when optimizing performance on large graphs.

An important thing to look at when using this reordering algorithm and OpenMP is the scheduling policy. Blocked scheduling is not ideal since vertex programs on vertices that are not close together will be run together. This essentially has vertex programs on vertices that have little overlap in terms of the number of neighbors running at the same time and yields sub-optimal results. Interleaved scheduling, on the other hand, yields the best results since vertex programs on vertices that are next to each other and have a high overlap in terms of the number of neighbors are running at the same time.

In addition, we experimented with different consistency models to look at their effect on performance. In the first model, we replicated the current label distributions and used that copy to update a vertex’s label distribution. In the second model, we did not replicate the current label distributions. As a result, some vertices could see updated values for some of their neighbors. This consistency model is much looser.

We also realized that graph compression could be helpful in increasing performance. Instead of using traditional graph compression techniques, we decided to use low-precision math, which has been widely used in deep learning. Low-precision math allows for faster arithmetic operations and also reduces the size of the graph representation as label distributions and edge weights can be represented with less bytes of memory. Instead of using floats for the label distributions and edge weights, we used unsigned chars. The new values essentially represent the numerator in a fraction where the denominator was 100. Thus, we represented probabilities to a hundredth of a decimal place. For sparse graphs, such as Freebase1 and Freebase2, which have average degrees of 27 and 8 respectively, and a low number of labels this approach could increase the amount of data that can fit in the cache and also increase the speed of arithmetic operations.

In the hybrid approach, we parallelize in a similar manner to the single node case. The difference is that instead of communicating solely via shared memory, threads on different nodes communicate relevant information via message passing. In order to minimize the overhead of sending messages and take advantage of the fact that the dataset that we are dealing with is sparse, the entire label distributions for every vertex handled by a given compute node does not need to be communicated to all the other compute nodes; only label distributions for vertices which are neighbors of the vertices for a compute node are relevant. The essence behind our communication approach between nodes is that we identify the vertices which have neighbors in other compute nodes (and we hope to minimize the number of such vertices using a reordering heuristic), and collect the label distributions for these nodes after each iteration and then communicate these to the other compute nodes. We chose to use the AllGather MPI function because we wanted contiguous segments of memory to be communicated; sending a single message of contiguous memory that has relatively large size is preferable to many messages of small size because message sending and receiving has considerable overhead.

Furthermore, in the hybrid approach we used for multiple compute nodes, we implemented a variation of the reordering algorithm suggested by Bilmes and Subramanya for distributed computing settings. As discussed above, although their algorithm was designed for their specific use case of measure propagation on their specific dataset, label propagation also has the property of having updates at a vertex which are functions of values in the neighborhood of the vertex. The distributed reordering algorithm is almost identical to the original reordering algorithm. The primary difference is that there is a random picking near the boundaries to minimize the overlap between the neighbors of the paritions. This allows us to still make use of the original reordering heuristic to minimize cache misses within each compute node by maximizing neighbor overlap.

## Results

As we were primarily interested in making label propagation run faster and scale better, we measured performance by looking at speedup and wall-clock time for the actual running of the algorithm.

The two graphs that we used were Freebase1, which has 32972 vertices, 957076 edges, and 23 labels, and Freebase2, which has 301638 vertices, 2310002 edges, and 192 labels. It should be noted that both of these graphs are relatively sparse.

We ran label propagation for 10 iterations, and the baseline used for speedup was the original sequential code with the strict consistency model.

**Figure 2:** *Speedup graphs for Freebase1*

**Figure 2:** *Wall-Clock time graphs for Freebase1*

Figure 2 shows that the code with reordering and the strict consistency model achieves the best speedup when run with 64 threads across all combinations of reordering, consistency models, and number of threads. However, across different numbers of threads the code with the reordering and the full consistency model does not perform that well. It achieves similar speedup with a smaller number of processors and then performs incredibly well with 64 threads before performance drops off by quite a bit. One reason for this may be that the computation-to-communication ratio is low, so adding more threads after a certain point does not seem to help much. However, the fact that performance actually goes down is quite baffling. This may just be an anomaly since we do not see that same phenomenon occur with the code with reordering and the looser consistency model or the original code with the stricter consistency model.

Figure 2 also shows that the code with reordering and looser consistency seems to perform better than the code with the looser consistency and without the reordering. We also see that the looser consistency model actually performs worse than the stricter one. Because neither are really using any synchronization, the looser consistency model does not really present any advantages in terms of that. However, it does not have the copying that needs to take place for the stricter consistency model to work. Although the stricter consistency model has this added work, it turns out that this actually reduces the number of cache misses. The average number of cache misses, which was calculated by measuring the number of cache misses for 1, 2, 4, 8, 16, 32, 64, 128, and 256 threads and taking the average, was 288339.33 for the stricter consistency model with the original code and 303859.56 for the looser consistency model with the original code.

Figure 2 also shows that the reordering code does not seem to improve performance by a great deal. As a result, the added overhead of reordering the vertices does not make sense for Freebase1, which is not huge, is relatively sparse, and converges pretty quickly. The benefits of reordering are more evident in Figure 3 because Freebase2 is a much larger graph. However, one thing to note is that even here reordering is not worth it. This is because both of these graphs are relatively sparse, so there are not that many neighbors that are even being loaded in at a given time. This is evident from Figure 4, which shows that the cache misses for the original code do not increase with the number of threads. This means that the cache performance scaling poorly with the number of threads is not an issue for these graphs.

**Figure 3:** *Speedup graphs for Freebase2*

**Figure 4:** *Cache performance graphs*

Low-precision math, on the other hand, would be expected to perform well on these graphs because they are sparse and do not have that many labels. As a result, there is not much of a need to represent probabilities to more than a hundredth of a decimal place. Figure 5 shows that there is a definite speedup from using low-precision math on Freebase1. Figure 6 shows that there is an even greater speedup when using low-precision math on large graphs that are more sparse like Freebase2. This is likely due to the fact that a smaller and smaller percentage of the graph fits in the cache as the graphs get bigger. The smaller graphs like Freebase1 do not really suffer from poor cache performance, but this becomes a bigger deal for bigger graphs such as Freebase2. Improving cache performance by effectively increasing the number of data that can fit in it thus helps larger graphs because they are more likely to have been bounded by cache performance.

**Figure 5:** *Speedup graphs with low-precision math on Freebase1*

**Figure 6:** *Speedup graphs with low-precision math on Freebase2*

The reason that low-precision math performs so well compared to the reordering is that it improves the speed of arithmetic computations and essentially allows more data to fit in the cache instead of just reducing the size of the working set. Furthermore, there is no added overhead to using low-precision math.

It should be noted that all of the graphs shown only use the time taken for the actual label propagation. The parts that were not parallelized were the initialization of the graph structures and the reordering. In order to further improve the viability of reordering, the reordering algorithm could be parallelized as well.

In addition, all of the graphs make it clear that performance seems to flatten out after around 100 threads. This may be partly due to contention or cache invalidation due to false sharing as different threads try to write to vertices’ label distributions that might be on the same cache line. There is also relatively little computation, and the vast majority of the work is communication with reads and writes to memory. This is characteristic of most graph algorithms that have a low computation-to-communication ratio. Moving data between the Xeon Phi and the host machine is also something that cannot be parallelized and prevents us from reaching ideal performance. The amount of time taken to move data between the Xeon Phi and host machine was measured to be around 0.25s for Freebase1. This is a relatively large amount of time given that the code can run in between 1.2s and 22s depending on the number of threads. As the number of threads increase, the fixed time taken for the data transfer begins to take up a higher percentage of the performance time measured.

Because we were interested in seeing how using the hybrid approach of OpenMP and MPI worked on large graphs inspired by the large dataset used by Bilmes and Subramanya, we only tried the approach with Freebase2. A large graph is required to realize gains from message passing on compute nodes compared to shared memory on a single node; if we have a small graph there is no requirement for distributed memory since all of the computation can be done on the same node.

**Figure 7:** *Performance time graphs with hybrid approach on Freebase 2*

**Figure 8:** *Speedup graphs with hybrid approach on Freebase 2*

As can be seen from figure 7, the performance time increases as the number of compute nodes is increased without using a reordering heuristic. This is to be expected because communication via message passing is a bottleneck. With no reordering, even with a sparse graph, it is likely that the label distributions for a vast majority of the vertices need to be communicated. The reason for this being that the vertices are essentially ordered in a random order when originally parsed so even with a small number of edges, the probability that a vertex’s neighbor is handled by some thread on it’s own compute node is roughly the same that a vertex’s neighbor is handled by some thread on different compute node.

We implemented a variation of the reordering heuristic utilized by Bilmes and Subramanya in an attempt to circumvent this communication issue. In the case of Freebase2, we see that reordering does not help at all; we still see the increased communication causing performance dropoff from figure 8. We are not quite sure why this is the case; it is possible that the heuristic used in the aforementioned paper requires further modification to suit the dataset that we are using. It is also possible that the message passing could have been implemented more optimally irrespective of the reordering to eliminate some unnecessary communication. As mentioned in the Approach section, we chose to use the MPI AllGather function to send a contiguous segment of memory to all the nodes to reduce the overhead of message passing. It is possible we could have balanced this issue of additional smaller messages which the drawback of AllGather; for any given vertex whose label distribution is included as part of the AllGather call, several compute nodes (an increasing function of the total compute nodes used) may not require the label distribution for this vertex, thus wasting communication.

## References

- Amar Subramanya and Jeff Bilmes. Semi-supervised learning with measure propagation.JMLR: Journal of Machine Learning Research, 12:3311–3370, November 2011.
- Jeff Bilmes and Amar Subramanya. Parallel graph-based semi-supervisedlearning. In Ron Bekkerman, Mikhail Bilenko, and John Langford, editors, Scaling Up Machine Learning, chapter 15, pages 307—330. Cambridge University Press, January 2012.
- William Cohen. Machine learning with large datasets 10-405 in spring 2018,October 2018.
- Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with labelpropagation. Technical report, 2002.