The techniques for fine-grained data sharing are generally available only on specialized architectures, usually involving a shared-bus. The CIRculating Common-Update Snaring (CIRCUS) mechanism has low latency user-level contention-free access to a set of shared circulating data registers. The local access latency is near zero for both read and write operations. These operations can be mapped into more complex operations, such as arithmetic, logical, or data reduction operations such as minimum or sum to be performed by the circulating register hardware (CRH) on the circulating copy of a register. The CRH can also be used to perform atomic operations, such as fetch&add or swap. For a two-dimensional hierarchy of N processing elements (PEs), the write-latency (until the circulating register is updated with a new value) and the update-latency (when all CRH modules can see the updated value) have an optimum cluster size proportional to (N ⋯ I/D)1/2, where I is the intercluster time and D is the inter-PE time, including the time between and through one node. The latencies, when optimally clustered, are proportional to (N ⋯ I ⋯ D)1/2. Sub-micro-second write-latency is expected for up to 15,255 PEs or 660 workstations. For higher levels of hierarchy, the expected write-latency is shown to be proportional to the sum of the latencies of all loop hierarchies. CIRCUS is applicable to a wide variety of system architectures and topologies.