This paper presents an energy-efficient, deep parallel Convolutional Neural Network (CNN) accelerator. By adopting a recently proposed binary weight method, the CNN computations are converted into multiplication-free processing. To allow parallel accessing and storing of data, we use two RAM banks, where each bank is composed of NRAM blocks corresponding to N-parallel processing. We also design a reconfigurable CNN computing unit in a divide-and-reuse to support a variable-size convolutional filter. Compared with full-precision computing on the MNIST and CIFAR-10 classification tasks, the inference Top-1 accuracy of the binary weight CNN has dropped by 1.21% and 1.34%, respectively. The hardware implementation results show that the proposed design can achieve 2100 GOPs with a 4.6 millisecond processing latency. The deep parallel accelerator exhibits 3X energy efficiency compared to a GPU-based design.