Triangle counting is a fundamental graph analytic operation that is used extensively in network science and graph mining. As the size of the graphs that needs to be analyzed continues to grow, there is a requirement in developing scalable algorithms for distributed-memory parallel systems. To this end, we present a distributed-memory triangle counting algorithm, which uses a 2D cyclic decomposition to balance the computations and reduce the communication overheads. The algorithm structures its communication and computational steps such that it reduces its memory overhead and includes key optimizations that leverage the sparsity of the graph and the way the computations are structured. Experiments on synthetic and real-world graphs show that our algorithm obtains an average relative speedup range between 3.24 to 7.22 out of 10.56 across the datasets using 169 MPI ranks over the performance achieved by 16 MPI ranks. Moreover, we obtain an average speedup of 10.2 times on comparison with previously developed distributed-memory parallel algorithms.