What are the main differences of this architecture when compared with the Apache Spark ? Something that I see as a nice advantage is the Python -> LLVM IR, but I can't see what are the main advantages over Spark.
In particular, we're working on byte-level shared-memory integration with Impala (which is implemented in C++ with LLVM runtime codegen — the project's tech lead, Marcel Kornacker, was the tech lead for Google F1's query engine) to run user-defined logic without data serialization / memory usage overhead. This also opens up Python's HPC / scientific computing stack and existing data libraries to be run in a Hadoop setting without Python-JVM interoperability issues.