About Cloudera Impala
Cloudera Impala is a high-performance MPP (massively parallel processing) query engine. It is written in C++ and uses LLVM in performance-critical parts. It is open source. “Cloudera Impala” is a trademark of Cloudera.
Words of thanks
Many thanks to Cloudera, who developed this beautiful codebase. It is one of the best solutions for analyzing data inside a Hadoop cluster.
Why ImpalaToGo
The main motivation is to free Impala from the storage layer.
– Hadoop hardware usually has a lot of HDD drives, but only modest CPU and RAM.
– Impala likes machines with high IO, a lot of RAM and a lot of CPU.
– We want to make it possible to run Impala on the best possible hardware.
Some hardware examples
Typical Hadoop server: 32-48 GB RAM, 4 HDD drives, dual six-core CPU.
Machine optimal for Impala: 128-256 GB RAM, 4 SSD or 12 HDD drives, dual 12-core CPU.
Architecture – diagram
Hadoop cluster or S3 -> caching layer on local SSD drives -> ImpalaToGo cluster.
Architecture – in words
Cloudera Impala currently works with HDFS only. ImpalaToGo has its own caching layer on local drives and works with any DFS, such as S3.
Caching algorithm – LRU
We write to local drives as long as there is space. When space is about to run out, we delete the files that have gone unused for the longest time.
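The eviction rule above can be sketched in a few lines of Python. This is only an illustration, not the ImpalaToGo implementation: the class name `LruFileCache`, its methods and the byte-based capacity are all assumptions made for the example, and the sketch evicts only when a new file would no longer fit, whereas the real cache may start earlier. It tracks file sizes in memory and returns the paths that the caller would have to delete from the local drive.

```python
from collections import OrderedDict


class LruFileCache:
    """Hypothetical bookkeeping for a local-drive cache with LRU eviction."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        # Maps file path -> size in bytes; least recently used entry first.
        self._files = OrderedDict()

    def touch(self, path):
        """Record a read of a cached file; return False on a cache miss."""
        if path not in self._files:
            return False
        self._files.move_to_end(path)  # now the most recently used entry
        return True

    def add(self, path, size_bytes):
        """Register a newly cached file, evicting old files to make room.

        Returns the evicted paths; the caller would delete them from disk.
        """
        if size_bytes > self.capacity_bytes:
            raise ValueError("file is larger than the whole cache")
        if path in self._files:
            # Re-adding a file replaces its previous entry.
            self.used_bytes -= self._files.pop(path)
        evicted = []
        while self.used_bytes + size_bytes > self.capacity_bytes:
            old_path, old_size = self._files.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_path)
        self._files[path] = size_bytes
        self.used_bytes += size_bytes
        return evicted
```

For example, with a 100-byte cache holding two 40-byte files, reading the first file and then adding a third 40-byte file evicts the second one, because it is the file that has gone unused for the longest time.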
When is it applicable?
You have data on hardware that is not ready for Impala because:
– It is not yours (S3).
– There is not enough RAM (old Hadoop machines).
– The RAM is occupied (MapReduce already uses it).
– Your Hadoop version does not support Impala.
– You do not want to risk running anything alongside your critical processes.
ImpalaToGo solution
You get hold of a bunch of good machines and run ImpalaToGo on them. ImpalaToGo accesses data from S3 or from a remote HDFS. At the same time, it keeps hot data on local drives, thus improving performance and reducing the load on the storage cluster as much as possible.
What hardware to use?
We suggest using the same hardware as recommended for Cloudera Impala, with one change. Instead of using a lot of HDD drives to get both space and bandwidth, you can install a few SSDs. You will get even better bandwidth, and ImpalaToGo will keep hot data there.
How much does it differ from Cloudera Impala?
We did not change much. We replaced HDFS access with our caching layer. We replaced the data locality information, which used to come from the NameNode, with consistent hashing. So 99% of the code is original Cloudera Impala code.
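To illustrate how consistent hashing can stand in for NameNode locality, here is a minimal Python sketch of a hash ring. It is an assumption-laden example rather than the ImpalaToGo code: the `HashRing` class, the `replicas` parameter (virtual nodes per machine), the use of MD5 and the node and file names are all hypothetical. The idea is that every node computes the same file-to-node mapping on its own, so a given file is always read and cached by the same machine.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring mapping file paths to nodes."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring at `replicas` points so that
        # files spread evenly across nodes.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, path):
        """Return the node owning the first ring point after the path's hash."""
        idx = bisect.bisect(self._points, _hash(path)) % len(self._points)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("/warehouse/sales/part-00001.parquet")
```

A useful property of this scheme is that removing a node only reassigns the files that node owned. Files mapped to the remaining nodes stay where they are, so their cached copies remain valid.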
Want to know more?