(12) Monitor & Debug Large Linux Clusters & User Applications with Paramon & Paratune
Time: Monday, June 17, 2013
Speakers:   David Chen, IBM
Abstract:   Dynamically monitoring a large HPC Linux cluster for hardware failures and application behavior is the first step in maintaining peak cluster performance. This is a crucial task for a cluster fully loaded with production workloads in an industrial environment. Paramon, a software tool developed by Beijing Paratera Technology, simplifies the task through its dynamic GUI and richly integrated performance instruments such as hardware failure monitors, CPU performance and data traffic meters, and user application performance tracers. Working with its sister software Paratune, it also supports rich cluster performance measurement features that can zoom down to the level of individual user application processes. Their integrated GUIs for performance measurement instruments, depth of technical coverage, and interactive user experience distinguish them from similar tools in the HPC marketplace.
Paramon and Paratune are widely deployed on large HPC systems throughout China, including Tianhe-1A, the most powerful system on the Top500 list of Nov. 2010. Recently, we used them to debug customer clusters in several critical situations (crit-sits) and found them to be effective tools for guiding system admins to quickly isolate the root causes of both hardware and software problems. Paramon for a Linux cluster is like a stethoscope for a medical doctor.

