I'd like to see the before and after code. Such difference is not impossible between languages, but it's possible that a better implementation/algorithm in the original code wouldn't bring a similar speedup...
>Such difference is not impossible between languages
Not only it's not impossible, but it's also quite common, even for programs having the same program-level algorithmic complexity.
Except if the programs are IO bound (which happens with most of them), you can see real differences between various languages. It doesn't even have to be compiled vs interpreted (well, if you exclude JITs). A V8 JS program can smoke a Python program (when the latter doesn't reach for some C builtin function).
Of course, a Python program can also smoke a V8 JS program, this happens too. And it is not realistic to compare Python to JS while excluding any commonly used optimization strategy for the former, whether that be C or PyPy or using different language constructs.
Neither C nor PyPy is "commonly used optimization strategy" so the comparison is very much correct.
People reach for C when they are desperate for performance, have the skills to do C, have the time to do C and are ok with loosing the cross-platform capabilities. Pure python code runs on mac/windows/unix (90% of the time). The moment you add a C module to the mix, you end up with the usual cross-platform mess (separate installers for each platform, complicated build processes etc.).
>Of course, a Python program can also smoke a V8 JS program, this happens too.
A cleverly written Python program can indeed beat a naively written V8/JS program.
But if they have the same program-level complexity and the Python program doesn't delegate to C libraries, V8 wins. That is: when using the same algorithms in both languages, V8 is faster.
We have h264 decoders using Javascript, and even ports of the QT widget stuck into pure JS (using emscripten). And they run in acceptable speed. Those would be hopeless in pure Python without adding new ad-hoc C extensions.
>And it is not realistic to compare Python to JS while excluding any commonly used optimization strategy for the former, whether that be C or PyPy or using different language constructs.
Well, it's fair, in the sense that mainstream JS is fast, whereas mainstream Python (CPython) is not. And you don't have to resort to stuff like Cython, SWIG and such with JS to get that performance.
This is the sort of algorithm that really beats on CPython's weaknesses. Strings are immutable, so you're sitting there and just creating little tiny string object after little tiny string object, themselves made out of characters that are fairly heavyweight. You could hope that PyPy or something could work out how to optimize it, but I'm not sure PyPy would optimize around the constant strings.
Meanwhile, since Go has true arrays, and since you can probably just assume we're in English only and let byte[] substitute for real unicode, you can write very nearly what you'd write in C, albeit more safely, and not be creating string objects by the thousands, and it wouldn't be particularly unnatural Go code. A 12x differential is very plausible. I'd believe you could even go higher.
This is not to say that Go is awesome or that (C)Python is a bad language in general. I like (C)Python. But it does have some bad cases in it if you need lots of performance, fiddly little string algorithms and lots of numeric computation being two of the big ones. (By which I mean numeric computation in pure CPython, not with Numeric Python or PyPy.) In both cases the sheer rate of object creation and destruction work dwarfs the real work being done.
This gist of what you said was correct, but you made a few technical errors.
First of all, Go's strings are also immutable. To get around this, you can use the []byte type, as you said, but Go is built with the assumption that all text is UTF-8. Frankly, I find the suggestion that ASCII is good enough, even for Americans, absurd.
Second, Python's "list" type and Go's slice type (what you're calling a "true array") are almost the same under the hood. Both are variable length arrays. (I'm not sure why Python calls it a "list" since it's not a linked list.) The only major difference is that Python's lists are always filled with pointers to the underlying objects. In Go, you have the choice of whether you want to use pointers to the underlying objects or to put the objects into a contiguous block of memory.
Speaking more broadly, there's no reason you couldn't use a lot of different Go practices in Python and end up with much more efficient Python code: pre-allocating lists of the proper size, for example, or making sure to use mutable buffers instead of immutable str objects. However, Go makes it natural to be concerned with these efficiencies, and Python makes it natural to ignore them. Each language has its strengths.
"Such difference is not impossible between languages" You are correct in wanted to make sure the algorithms and data-structures used still have the same run-time complexity. But saying this is impossible is not accurate.
This kind of difference is not uncommon when BOTH the before and after program optimally handle IO (network,file,db,etc), BUT the before program is from the interpreted performance tier (PHP,Perl,Ruby,Python) and the after program is from the compiler tier (C/C++,Java,C#,Go,Scala,Haskell).
You find it unlikely that a compiled language might be 15x faster than a interpreted language doing a compute-bound workload, when said compiled language is often 30x faster in posted synthetic benchmarks... I think you are extremely skeptical...
I have ported node.js applications to Go and seen similar substantial speedups. In my cases though I was running in a multi-threading enviroment. not GAE and the performance difference (improvement) was much greater under concurrent load versus single request latency. But that seems reasonable since node.js is closer to Go in performance than Python is.
I guess I just believe it because I have seen it. :)
Sometimes, with smart data structures for specialized cases (persistency, better asymptotic behaviour etc.), it can happen that memory accesses dominate the actual execution. If that happens, and you can't saturate CPU's execution units, the difference between an efficient interpreter and a compiler gets smeared very quickly.