Nagoya University exhibited at SC19 in Denver, CO, November 18 - 21, 2019

The Information Technology Center, Nagoya University, exhibited at SC19, the International Conference for High Performance Computing, Networking, Storage and Analysis, in Denver, Colorado. Short talks were presented at the Nagoya University booth.

- Exhibition: November 18 - 21, 2019.
- Conference: November 17 - 22, 2019.
- Nagoya University Booth: #1787

Booth Talks

18th (Mon) November 2019 (In Opening Gala)

19:15-19:30 Osni Marques (Lawrence Berkeley National Laboratory, USA)
- "Massively Parallel Eigensolvers based on Unconstrained Energy Functionals Methods"
  Abstract: This talk summarizes work on a preconditioned conjugate gradient based iterative eigensolver using an unconstrained energy functional minimization scheme. This scheme avoids an explicit reorthogonalization of the trial eigenvectors and becomes an attractive alternative for the solution of very large problems. The scheme has been implemented in the first-principles materials and chemistry CP2K code, and we have studied systems with a number of atoms ranging from 2,247 to 12,288. We have examined the convergence and scaling of the eigensolver on a large Cray XC40, using up to 38% of the full machine.
19:30-19:45 Tetsuya Hoshino (The University of Tokyo, Japan)
- "Optimizations of H-matrix Library HACApK for Many-core Processors"
  Abstract: Hierarchical matrices (H-matrices) are an approximation technique for dense matrices, such as the coefficient matrix of the boundary element method (BEM). In this talk, we present HACApK, which is an open-source H-matrix library originally developed for CPU-based clusters, optimized for many-core processors such as GPUs and Intel Xeon Phi.
19:45-20:00 Franz Franchetti (Carnegie Mellon University, USA)
- "FFTX and SpectralPack"
  Abstract: We present a status update on FFTX and SpectralPACK and the ongoing collaboration between Carnegie Mellon University and Nagoya University. FFTX and SpectralPACK are developed as part of the DOE ExaScale effort by LBL, Carnegie Mellon University, and SpiralGen, Inc. We aim to translate the LAPACK/BLAS approach from the numerical linear algebra world to the spectral algorithm domain. FFTX extends and updates FFTW for the exascale era and beyond while providing backwards compatibility. SpectralPACK captures higher-level spectral algorithms and their variants, including convolutions, Poisson solvers, correlations, and numerical differentiation approaches that translate to FFT calls. The SPIRAL code generation and autotuning system, now available as open source under a BSD/Apache license, underpins the effort to provide performance portability. We will discuss the current status of the software packages and future plans.
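
As a small illustration of how a convolution can "translate to FFT calls", the sketch below computes a circular convolution through the convolution theorem. It is not the FFTX or SpectralPACK API: the function names `circular_convolve_fft` and `circular_convolve_direct` are hypothetical, and NumPy's FFT routines stand in for whatever backend those packages provide.

```python
import numpy as np


def circular_convolve_fft(x, h):
    """Circular convolution of two equal-length real sequences via the FFT."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    if x.ndim != 1 or x.shape != h.shape:
        raise ValueError("x and h must be 1-D arrays of equal length")
    n = x.size
    # Convolution theorem: a circular convolution in the time domain is a
    # pointwise product in the frequency domain.
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(h), n=n)


def circular_convolve_direct(x, h):
    """Reference O(n^2) circular convolution, used only to check the result."""
    n = len(x)
    return np.array(
        [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal(8)
    b = rng.standard_normal(8)
    # The two results should agree up to floating-point rounding.
    print(np.allclose(circular_convolve_fft(a, b), circular_convolve_direct(a, b)))
```

The FFT route costs O(n log n) operations instead of O(n^2), which is why a library can profitably express such higher-level operations in terms of FFT calls.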

19th (Tue) November 2019

13:00-13:15 Takeshi Fukaya (Hokkaido University, Japan)
- "Efficient Tall-Skinny QR Factorization using the Cholesky QR Algorithm"
  Abstract: We consider computing the QR factorization of a tall and skinny matrix, which is one of the basic building blocks in various numerical algorithms. The Cholesky QR algorithm is suitable for recent computer systems, including large-scale distributed parallel systems, but it suffers from serious numerical instability. Recently, we proposed a technique to improve its numerical stability while retaining its advantages in high-performance computing. In this talk, we will present an overview of the Cholesky QR algorithm, our proposed idea, and performance results that show the effectiveness of the improved algorithm. This is joint work with R. Kannan, Y. Nakatsukasa, Y. Yamamoto, and Y. Yanagisawa.
13:15-13:30 Kenji Ono (Kyushu University, Japan)
- "Scalable Direct-Iterative Hybrid Sparse Matrix Solver for coming "FUGAKU" computer"
  Abstract: We propose an efficient direct-iterative hybrid solver for sparse matrices that can draw out the scalability of the latest multi/many-core and vector architectures, and we examine the execution performance of the proposed SLOR-PCR method. We also present an efficient implementation of the PCR algorithm for SIMD and vector architectures that makes it easy for the compiler to emit optimized instructions. The proposed hybrid method has high cache reusability, which is favorable for modern low-B/F architectures because efficient use of the cache can mitigate the memory bandwidth limitation. The measured performance showed that the SLOR-PCR solver achieved excellent scalability of 90% on 88 cores in a cc-NUMA environment.
13:30-13:45 Toru Nagai (ITC, Nagoya University, Japan)
- "Development of a new method for solving the wave equation, DOWT (Discrete Operational Wave Theory)"
  Abstract: The establishment of ACROSS (Accurately Controlled, Routinely Operated, Signal System), an observation technology in geophysical exploration that enables us to acquire very accurate data in the frequency domain, urged us to develop theoretical support for wavefield analysis that is capable of solving, in the frequency domain, the wave equation for a large body with the most general structures. Here, we propose a new method for solving the wave equation, DOWT (Discrete Operational Wave Theory).
13:45-14:00 Satoshi Ohshima (ITC, Nagoya University, Japan)
- "Optimization of Numerous Small Dense-Matrix-Vector Multiplications on GPU"
  Abstract: Dense-matrix-vector multiplication is one of the well-known matrix calculations. Because there is no data reuse, it is difficult to accelerate this calculation, and few studies focus on it. However, some applications require numerous small dense-matrix-vector multiplications at once. Current high-performance processors with many computation cores are not good at small matrix calculations because these calculations cannot fill all the computation cores. On the other hand, numerous small dense-matrix-vector multiplications together involve a large total number of calculations. In this work, to accelerate these calculations, we propose new optimization strategies and measure their performance on a GPU.
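
To make the workload in the last abstract concrete, the sketch below expresses a batch of small dense matrix-vector products as a single call. It is a CPU-side NumPy illustration only, not the GPU optimization presented in the talk; the function name `batched_matvec` and the batch and matrix sizes are assumptions chosen for illustration.

```python
import numpy as np


def batched_matvec(mats, vecs):
    """Compute y[b] = mats[b] @ vecs[b] for every batch index b in one call.

    mats has shape (batch, m, n); vecs has shape (batch, n).
    The result has shape (batch, m).
    """
    mats = np.asarray(mats)
    vecs = np.asarray(vecs)
    if mats.ndim != 3 or vecs.ndim != 2:
        raise ValueError("expected mats of shape (batch, m, n) and vecs of shape (batch, n)")
    if mats.shape[0] != vecs.shape[0] or mats.shape[2] != vecs.shape[1]:
        raise ValueError("batch size or inner dimension mismatch")
    # Each matrix entry is read exactly once, so there is no data reuse
    # within a single product; the parallelism comes from the batch.
    return np.einsum("bij,bj->bi", mats, vecs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mats = rng.standard_normal((10_000, 8, 8))  # many small matrices (hypothetical sizes)
    vecs = rng.standard_normal((10_000, 8))
    y = batched_matvec(mats, vecs)
    print(y.shape)
```

A single 8 x 8 product involves only 2 x 8 x 8 = 128 floating-point operations, far too few to occupy a many-core processor, whereas the whole batch of 10,000 products amounts to about 1.3 million operations. Treating the batch as the unit of work is what exposes enough parallelism.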

20th (Wed) November 2019

13:30-13:45 Hiroyuki Takizawa (Tohoku University, Japan)
- "Memory-centric performance tuning for modern processors with high bandwidth memory"
  Abstract: It is well known that the performance of a scientific application is often limited by the sustained memory bandwidth. Since the memory subsystem of a modern processor is complicated, a high sustained memory bandwidth can be attained only for specific memory access patterns. Therefore, this talk shows the importance of memory access patterns for memory-intensive kernels and discusses a memory-centric performance tuning strategy for modern processors with high bandwidth memory.
13:45-14:00 Ryo Yoda (Kogakuin University, Japan),
　　　　　　Akihiro Fujii (Kogakuin University, Japan),
　　　　　　Teruo Tanaka (Kogakuin University, Japan)
- "Linear MGRIT preconditioned inexact Newton-Krylov method for Nonlinear Time Integration Problems"
  Abstract: Multigrid Reduction in Time (MGRIT) is a parallel-in-time multigrid method for time integration. There are two variants: linear MGRIT and nonlinear MGRIT with the Full Approximation Scheme (FAS). In previous work, we proposed linear MGRIT preconditioning, which mitigated the instability caused by an enlarged time-step width. In this work, we extend MGRIT preconditioning to nonlinear time integration in a different way from MGRIT with FAS. We apply the Newton-Krylov method to the nonlinear system over all time steps and use linear MGRIT preconditioning for the inner iterations (Newton-MGRIT-Krylov). Numerical experiments compare Newton-MGRIT-Krylov and MGRIT with FAS on the one-dimensional heat diffusion and Burgers' equations.
14:00-14:15 Rio Yokota (Tokyo Institute of Technology, Japan)
- "Training ImageNet on Thousands of GPUs with Second Order Optimization"
  Abstract: Scalable deep learning methods have gone through a rapid evolution during the past few years to accommodate the demand from the high performance computing field. When scaling deep learning applications to thousands of GPUs on leadership-class systems, a common problem arises in all of these applications. Since there exists a lower bound on the number of steps required for a given model to converge, distributing the data over ever more GPUs will not yield ideal strong scaling indefinitely, even in the absence of communication and load imbalance. This poses a fundamental limit on the parallelism achievable in distributed deep learning. To solve this problem, an optimization method that can take larger steps in the correct direction is necessary. We will discuss such optimization methods in this booth talk.
14:15-14:30 Takahiro Katagiri (ITC, Nagoya University, Japan)
- "Automatic Preconditioner Selection for Sparse Iterative Libraries by Deep Learning"
  Abstract: Preconditioner selection is one of the crucial steps in using a sparse iterative library to solve linear equations. In this research, we present our latest deep-learning-based auto-tuning method for selecting preconditioners for a restarted GMRES library. The performance evaluation indicates that the proposed method, which uses sparsity information from each divided section of the sparse matrix shape, improves the rate of correct answers.