You are half correct and half wrong. We also want to speed up programs that handle largely independent events without using threads, which is a much bigger "market" in Python than allowing existing multithreaded programs to run faster. The idea is to create threads under the hood, and have each one run a complete event in a transaction -- basically, in standard Python terms, forcing the "GIL release" to not occur randomly. This means adding threads and then controlling the transaction boundaries with a new built-in function. This is a change that can be done in the core of the event system only (Twisted, etc.) without affecting the user programs at all.
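To make this concrete, here is a rough sketch of what the core of an event system could do. The names `run_in_transaction` and `dispatch_events` are made up for illustration, and the "transaction" is faked with an ordinary lock: that gives the same guarantee (each event runs as one indivisible unit) but none of the parallelism that a real transactional implementation would provide.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a transaction boundary (an assumption for this sketch).
# A real STM would run events speculatively in parallel and abort on
# conflict; a plain lock only reproduces the "runs as one unit" semantics.
_transaction_lock = threading.RLock()


def run_in_transaction(handler, *args):
    """Run one complete event handler as a single indivisible unit."""
    with _transaction_lock:
        return handler(*args)


def dispatch_events(events, num_threads=4):
    """Hypothetical event-loop core: one thread-pool task per event.

    `events` is a list of (handler, argument) pairs. The handlers are
    ordinary single-threaded code and never see the threads.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(run_in_transaction, handler, arg)
                   for handler, arg in events]
        return [future.result() for future in futures]


# User code: written as if no threads existed at all.
counter = {"hits": 0}


def on_request(amount):
    # Unsynchronized read-modify-write; safe only because the whole
    # handler runs inside one "transaction".
    counter["hits"] += amount
    return counter["hits"]
```

The point is that `on_request` needs no changes: only `dispatch_events`, which lives inside the event system, knows about threads and transaction boundaries.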
Two details. First, cloning a complete page and calling remap_file_pages() only occurs once on a given page, so if a number of transactions repeatedly modify the same objects, they will already be in un-shared pages (the pages are tentatively shared again during major collection only).
And second, about the exact time at which "read-write" conflicts are found: when we commit, we know which objects we modified, so we can check if someone else has read them and abort that other thread instead. We never need to walk the list of all read objects: it is enough to check whether the (much smaller number of) objects we modified were read by another thread. So the "list" of all read objects is implemented as a byte mask over all objects.
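As a toy model of that check (not the actual implementation: the class, the method names and the idea of identifying objects by a small integer index are all assumptions made for the sketch), it could look like this:

```python
class Transaction:
    """Toy transaction: objects are identified by an integer index."""

    def __init__(self, name, num_objects):
        self.name = name
        # One byte per object: nonzero means "this transaction read it".
        self.read_mask = bytearray(num_objects)
        # Only the objects we modified are kept as an explicit collection.
        self.modified = set()
        self.aborted = False

    def read(self, obj_index):
        # Recording a read is a single byte store, not a list append.
        self.read_mask[obj_index] = 1

    def write(self, obj_index):
        self.modified.add(obj_index)


def commit(committer, all_transactions):
    """Abort every other transaction that read an object we modified.

    The cost is proportional to the number of modified objects,
    never to the number of objects anybody has read.
    """
    victims = []
    for other in all_transactions:
        if other is committer or other.aborted:
            continue
        if any(other.read_mask[index] for index in committer.modified):
            other.aborted = True
            victims.append(other.name)
    return victims
```

For example, if `t1` wrote object 3, `t2` read object 3 and `t3` read only object 5, then committing `t1` aborts `t2` and leaves `t3` alone.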
The HTM buffer is too small for any language for doing what I'm talking about, which is long transactions. Haswell uses its L1 Data cache as buffer --- which means at most 32KB or 64KB, and only if you're on a lucky day. (Actually it depends on cache collision and alignment issues: it can in theory overflow with only 9 memory accesses, but I'd expect the numbers to be generally more than half the full cache size.)
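The "9 memory accesses" figure can be reproduced with a little arithmetic. The sketch below assumes a 32KB, 8-way set-associative L1 data cache with 64-byte lines; that geometry is my assumption for the example (it matches the 9-access worst case), and the model is deliberately simplified: it only counts distinct cache lines per set and ignores how the hardware really tracks read and write sets.

```python
# Assumed cache geometry (not taken from any specification here).
CACHE_SIZE = 32 * 1024   # bytes
LINE_SIZE = 64           # bytes per cache line
WAYS = 8                 # lines that fit in one set
NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)   # 64 sets


def cache_set(address):
    """Index of the set that the given byte address maps to."""
    return (address // LINE_SIZE) % NUM_SETS


def first_overflow(addresses):
    """Return the 1-based position of the first access that needs more
    than WAYS distinct lines in a single set, or None if all of them fit."""
    lines_per_set = {}
    for count, address in enumerate(addresses, start=1):
        lines = lines_per_set.setdefault(cache_set(address), set())
        lines.add(address // LINE_SIZE)
        if len(lines) > WAYS:
            return count
    return None


# Worst case: every access lands in the same set (stride of 4096 bytes).
stride = LINE_SIZE * NUM_SETS
worst_case = [i * stride for i in range(20)]

# Best case: touch each line of a contiguous 32KB region exactly once.
sequential = [i * LINE_SIZE for i in range(CACHE_SIZE // LINE_SIZE)]
```

With these numbers, `first_overflow(worst_case)` gives 9, while `first_overflow(sequential)` gives `None`: the full 32KB is usable only when the accesses happen to spread evenly over all the sets.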
The HTM buffer is big enough if the goal is "only" to remove the GIL. It's going to be a 10-line patch to CPython or PyPy (yes, really!). The GIL is often made into a big issue, but in my opinion removing it isn't the final solution to all multicore troubles in Python: that's just the first step.