I am a first-year PhD student at Princeton University. I was a Blue Scholar at IBM Research, India in the High Performance Computing (HPC) group headed by Yogish Sabharwal. At IBM, I was mentored by the incomparable Venkatesan Chakaravarthy, and I worked closely with him on parallel algorithms for graph and tensor computations. Previously, I was a Master's student at the Indian Institute of Science under the supervision of Prof. Sathish Vadhiyar.
Before IISc, I worked at Magma Design Automation (Synopsys) and obtained an undergraduate degree in Computer Science at BITS Pilani Goa Campus. A long time ago, I studied computer science in high school with Suma Nambisan.
On Optimizing Distributed Tucker Decomposition for Sparse Tensors
(alphabetical) V. T. Chakaravarthy, J. W. Choi, D. J. Joseph, P. Murali, Y. Sabharwal, S. Shivmaran, and D. Sreedhar. International Conference on Supercomputing (ICS), 2018
Single-source shortest path (SSSP) computation is a classic algorithmic problem that is known to be hard on distributed-memory systems. As part of a Graph500 benchmarking effort, my colleagues at IBM studied techniques to accelerate the Delta-stepping algorithm on R-MAT graphs (a popular generator for scale-free graphs). They achieved a processing rate as high as three trillion edges per second on the Graph500 R-MAT graph and won a Best Paper Award at IPDPS 2014. After I joined the group in 2015, I had the chance to work with the code and extend the methods. The execution time of the Delta-stepping algorithm and its optimized variants depends on the structure of the graph and the properties of its degree distribution. We extended the techniques so that the algorithm adapts to the characteristics of the input graph: we developed methods to predict the value of the delta parameter and a better strategy for switching from Delta-stepping to Bellman-Ford, allowing the implementation to run efficiently on a wide spectrum of R-MAT and real-world graphs. In the largest configuration tested during my stint, we demonstrated scalable shortest-path computations on a set of 14 R-MAT graphs having over a trillion edges, using 65,536 cores of the Mira system at Argonne National Laboratory.
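For intuition, here is a minimal sequential sketch of Delta-stepping (not the IBM implementation; the distributed version partitions the vertices and buckets across processes). Vertices are grouped into buckets of width delta, light edges (weight at most delta) are relaxed repeatedly within a bucket, and heavy edges are relaxed once after the bucket settles:

```python
import math
from collections import defaultdict

def delta_stepping(graph, source, delta):
    """Sequential sketch of Delta-stepping SSSP.
    graph: dict mapping node -> list of (neighbor, weight) pairs."""
    dist = defaultdict(lambda: math.inf)
    buckets = defaultdict(set)  # bucket index -> set of tentative nodes

    def relax(v, d):
        if d < dist[v]:
            if dist[v] < math.inf:
                buckets[int(dist[v] // delta)].discard(v)
            dist[v] = d
            buckets[int(d // delta)].add(v)

    relax(source, 0)
    while buckets:
        i = min(buckets)
        settled = set()
        # Repeatedly relax light edges (weight <= delta) within bucket i;
        # relaxations may re-insert nodes into the same bucket.
        while buckets.get(i):
            frontier = buckets.pop(i)
            settled |= frontier
            for u in frontier:
                for v, w in graph[u]:
                    if w <= delta:
                        relax(v, dist[u] + w)
        # Relax heavy edges (weight > delta) once, after bucket i settles.
        for u in settled:
            for v, w in graph[u]:
                if w > delta:
                    relax(v, dist[u] + w)
        buckets.pop(i, None)
    return dict(dist)
```

A small delta makes the algorithm behave like Dijkstra (many buckets, little parallelism); a large delta approaches Bellman-Ford (few buckets, much redundant work) — which is exactly why predicting delta from graph characteristics matters.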
Production parallel systems are space-shared, and resource allocation on such systems is usually performed by a batch-queue scheduler. Jobs submitted to the batch queue experience a variable delay before the requested resources are granted. Predicting this delay can help users plan experiment time-frames and choose sites with shorter turnaround times, and can also help meta-schedulers make scheduling decisions. We developed an integrated adaptive framework, Qespera, for predicting queue waiting times on parallel systems. Our algorithm is based on spatial clustering of historical job submissions and executions, and it dynamically chooses the appropriate prediction method depending on the prevalent job and system characteristics. The figure below shows a heatmap of actual and predicted waiting times for 40,000 jobs on the Intrepid system at Argonne National Laboratory (IBM BlueGene/P, 163,840 cores). For this workload, our method incurs an average prediction error of only 15 minutes.
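As a rough illustration of similarity-based prediction (a simplification of Qespera, which adaptively selects among several methods), one can predict a job's wait time as a distance-weighted average over the most similar historical submissions; the feature names below are illustrative, not Qespera's actual feature set:

```python
import math

def knn_wait_prediction(history, job, k=5):
    """Sketch: predict queue waiting time as the distance-weighted mean of
    the k most similar historical submissions.
    history: list of (features, observed_wait) pairs, where features is a
    tuple such as (requested_procs, requested_walltime, queue_length).
    job: feature tuple for the newly submitted job."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Pick the k nearest historical jobs in feature space.
    neighbors = sorted(history, key=lambda h: dist(h[0], job))[:k]
    # Weight each neighbor's observed wait by inverse distance.
    weights = [1.0 / (dist(f, job) + 1e-9) for f, _ in neighbors]
    total = sum(w * wait for w, (_, wait) in zip(weights, neighbors))
    return total / sum(weights)
```

An exact match in the history dominates the weighted average, which captures the intuition that jobs resembling past submissions under similar system conditions tend to wait similar amounts of time.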
The problem of counting occurrences of small query graphs in a large data graph, known as subgraph counting, is fundamental to several domains such as genomics and social network analysis. Color coding is a well-known technique for approximate subgraph counting. Parallel implementations of color coding had previously been studied only for query graphs that are trees. We developed the first efficient distributed implementation of color coding that goes beyond tree queries: our algorithm works for all queries of treewidth two (the figure below shows the queries used in our study). We designed a degree-based counting algorithm that works around the high-degree nodes in the graph, delivering up to a 28x improvement in execution time over the baseline algorithm. These results were published in IPDPS 2016 and received a Distinguished Paper Award at IBM Research, India.
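To illustrate the color-coding idea in its simplest setting, here is a sketch that approximately counts simple paths on k vertices (a tree-shaped query; the treewidth-2 queries in our paper require a more involved dynamic program). Each trial colors the vertices uniformly at random with k colors, counts "colorful" paths (all colors distinct) exactly by dynamic programming, and rescales by the probability k!/k^k that a fixed path is colorful:

```python
import random
from math import factorial

def approx_path_count(graph, k, trials=400, seed=0):
    """Color-coding sketch for counting simple paths on k vertices.
    graph: dict node -> set of neighbors (undirected)."""
    rng = random.Random(seed)
    nodes = list(graph)
    total = 0.0
    for _ in range(trials):
        color = {v: rng.randrange(k) for v in nodes}
        # dp[v][S] = number of colorful paths ending at v with color set S.
        dp = {v: {frozenset([color[v]]): 1} for v in nodes}
        for _ in range(k - 1):
            new = {v: {} for v in nodes}
            for v in nodes:
                for u in graph[v]:
                    for S, cnt in dp[u].items():
                        if color[v] not in S:  # keep colors distinct
                            T = S | {color[v]}
                            new[v][T] = new[v].get(T, 0) + cnt
            dp = new
        # Every entry now has |S| = k; each undirected path is counted
        # once from each endpoint, hence the division by 2.
        colorful = sum(cnt for v in nodes for cnt in dp[v].values())
        total += colorful / 2
    # A fixed simple path is colorful with probability k!/k^k.
    return (total / trials) * (k ** k / factorial(k))
```

Because colorful paths necessarily have distinct vertices, the dynamic program never has to remember which vertices were used, only which colors — the key trick that makes color coding efficient, and the property our treewidth-2 algorithm generalizes.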
Subgraph Counting: Color Coding Beyond Trees
(alphabetical) V. T. Chakaravarthy, M. Kapralov, P. Murali, F. Petrini, X. Que, Y. Sabharwal, and B. Schieber. International Parallel and Distributed Processing Symposium (IPDPS), 2016
Scheduling HPC Workloads in Day-Ahead Electricity Markets
With the promise of exascale computing, high-performance grid systems are expected to incur electricity bills that grow super-linearly over time. To achieve cost effectiveness in these systems, it is essential for scheduling algorithms to exploit the variations in electricity price, both across space and over time, that are prevalent in dynamic electricity markets. We developed a metascheduling algorithm that optimizes the placement of jobs in a compute grid consuming electricity from the day-ahead wholesale market. We used trace-based simulation with real and synthetic workload traces and real electricity price data sets to demonstrate our approach on two currently operational grids, XSEDE and NorduGrid. Experiments show that our approach simultaneously optimizes the total electricity cost and the average response time of the grid, without being unfair to users of the local batch systems.
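To make the price-aware placement idea concrete, here is a deliberately simplified greedy sketch (our actual algorithm also balances response time and fairness to local users; the site and job field names below are illustrative). Each job is placed at the feasible site with the lowest estimated day-ahead energy cost:

```python
def place_jobs(jobs, free_procs, prices):
    """Greedy sketch of cost-aware metascheduling.
    jobs: list of dicts with 'procs' and 'hours'.
    free_procs: dict site -> available processor count (mutated).
    prices: dict site -> day-ahead price in $/processor-hour.
    Returns a list of (job_index, site) placements."""
    placements = []
    for i, job in enumerate(jobs):
        feasible = [s for s in free_procs if free_procs[s] >= job['procs']]
        if not feasible:
            continue  # in practice, the job would wait in a local queue
        # Estimated cost = price * processors * hours at each site.
        best = min(feasible,
                   key=lambda s: prices[s] * job['procs'] * job['hours'])
        free_procs[best] -= job['procs']
        placements.append((i, best))
    return placements
```

A purely cost-greedy policy like this would concentrate load at cheap sites and inflate response times there, which is precisely the tension the full algorithm resolves by optimizing cost and response time together.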
Performance Predictions using a Single Execution of an HPC Application
We investigated whether a single execution of a parallel application contains enough information to yield insights about its scalability. Prior work on modeling and predicting application performance had developed statistical models trained on multiple executions. Our method identifies significant phases in the application and matches them with reference kernels stored in a database. Using the performance of the reference kernels and static code analysis, we derive performance estimates for the application. The hallmark of this project is its ability to accurately predict the execution time of scientific applications, such as the Gyrokinetic Toroidal Code (GTC), on a large number of processors using a single small-scale execution. The figure below shows the actual and predicted execution times for GTC corresponding to three experiments on an Intel cluster. In the first experiment, "16->128" means that the execution time is reported for 128 processors, and the prediction for the 128-processor run was made using a single execution on 16 processors. See the paper for evaluations on Sweep3D and SMG2000.
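The phase-matching step can be sketched as a nearest-neighbor lookup: each application phase is summarized by a signature vector (for instance, normalized performance-counter values; the counters and kernel names below are illustrative, not the ones used in the paper), and the closest reference kernel in the database supplies a measured scaling curve for that phase:

```python
def match_phase(phase_sig, kernel_db):
    """Sketch: match an application phase to the reference kernel whose
    signature is most similar under cosine similarity.
    phase_sig: list of signature values for the observed phase.
    kernel_db: dict kernel_name -> signature list of the same length."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)

    # Return the reference kernel with the highest similarity score.
    return max(kernel_db, key=lambda name: cosine(phase_sig, kernel_db[name]))
```

Once each phase is matched, its time at a larger processor count is read off the matched kernel's scaling curve and the per-phase estimates are combined into a whole-application prediction.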