If you're using C++, check out Boost.Compute [1]. It provides a high-level STL-like API for OpenCL (without preventing you from directly using the low-level OpenCL APIs). It simplifies common tasks such as copying data to/from the device and also provides a number of built-in algorithms (e.g. sorting, reducing, and transforming).
Hey, that's pretty cool, and it would probably make OpenCL usable by mere mortals. One improvement that I see you could borrow from vector is getting rid of this explicit copying business. Take a look at the array implementation in our runtime library.
Basically, the VectorArray class contains both the host array pointer and the device array pointer. There are also two boolean flags, h_dirty and d_dirty. When you modify array elements on the host, h_dirty is set. Then, when you run a kernel, the data is copied to the device if h_dirty is set, after which h_dirty is cleared and d_dirty is set. When you next read an array element on the CPU, the data is copied from device to host if d_dirty is set, and d_dirty is then cleared.
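To make the dirty-flag scheme concrete, here is a minimal sketch in plain C++. It is not the actual runtime library code: the class name LazySyncArray, its methods, and the copy counters are all made up for illustration, and the "device" is simulated by a second host-side buffer so the example runs without OpenCL. One addition beyond the description above: host writes also pull back pending device results first, so a partial write cannot clobber them.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration of the h_dirty / d_dirty scheme.
// The "device" is just a second std::vector standing in for device memory.
class LazySyncArray {
public:
    explicit LazySyncArray(std::size_t n) : host_(n, 0.0f), device_(n, 0.0f) {}

    // Host write: the host copy becomes newer than the device copy.
    void set(std::size_t i, float v) {
        sync_to_host();  // assumption: fetch pending device results first
        host_[i] = v;
        h_dirty_ = true;
    }

    // Host read: copy back from the device only if a kernel changed the data.
    float get(std::size_t i) {
        sync_to_host();
        return host_[i];
    }

    // "Run a kernel": upload only if the host copy changed, then apply the
    // element-wise function to the simulated device buffer.
    template <typename Kernel>
    void run_kernel(Kernel kernel) {
        if (h_dirty_) {
            device_ = host_;  // host -> device copy
            ++uploads_;
            h_dirty_ = false;
        }
        for (float& x : device_) {
            x = kernel(x);
        }
        d_dirty_ = true;  // the device copy is now newer than the host copy
    }

    int uploads() const { return uploads_; }
    int downloads() const { return downloads_; }

private:
    void sync_to_host() {
        if (d_dirty_) {
            host_ = device_;  // device -> host copy
            ++downloads_;
            d_dirty_ = false;
        }
    }

    std::vector<float> host_;
    std::vector<float> device_;
    bool h_dirty_ = false;
    bool d_dirty_ = false;
    int uploads_ = 0;
    int downloads_ = 0;
};
```

With this sketch, writing a few elements and then running two kernels back to back triggers a single host-to-device copy, and nothing is copied back until the first get() call. That is the whole point of the flags: transfers happen only when the other side actually needs fresh data.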
It's cool because it offers C++ developers an easy path to running code on GPUs and multi-core CPUs via an STL-like API. It's similar to NVIDIA's Thrust library but supports all OpenCL-compatible devices (including AMD GPUs and Intel CPUs/accelerators).
P.S. It's still under active development and we're looking for more contributors with an interest in parallel computing and C++. Send me an e-mail if you're interested!
Check it out here: https://github.com/kylelutz/compute