Benchmark Details

  • The TPC-H benchmark was used to evaluate the Hungry Hippos application, with both 50 GB and 100 GB of TPC-H data.
  • Query performance on a Hungry Hippos cluster was compared against a Hadoop cluster that stores data in ORC format.
  • A cluster of twenty machines was used for both applications so that the query-performance comparison would be fair.
  • A replication factor of three was used for both the Hadoop Distributed File System (HDFS) and the Hungry Hippos File System.
  • Local client machines (separate from the cluster) were used to load the data into both the Hadoop and Hungry Hippos file systems.
  • Test data was generated in text format using the TPC-H tool on the client machine. The generated text data was pushed into HDFS. A Spark driver program then converted the text data into ORC format and stored the converted data in HDFS. The same generated text data was pushed from the client machine into the Hungry Hippos File System.
  • Separate Spark driver programs were prepared for HDFS and Hungry Hippos. Both programs ran all the TPC-H queries in the same sequence.
  • Hungry Hippos and its helper programs were shut down while queries were run or data was pushed for HDFS, and vice versa.
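
The driver programs themselves are not reproduced here. As a rough sketch of how a fixed query sequence could be timed identically on both systems, the Python below runs the 22 TPC-H queries in order and records the elapsed time of each. The names `time_queries` and `run_query` are hypothetical and stand in for whatever the real Spark drivers do. Stopping at the first failing query is also an assumption made for this illustration.

```python
import time
from typing import Callable, Dict, List, Optional


def time_queries(
    query_ids: List[int],
    run_query: Callable[[int], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[int, Optional[float]]:
    """Run queries in the given order and return elapsed seconds per query.

    A query that raises is recorded as None and the run stops there,
    so any later queries are absent from the result.
    """
    timings: Dict[int, Optional[float]] = {}
    for qid in query_ids:
        start = clock()
        try:
            run_query(qid)  # hypothetical hook: submit the query to the cluster
        except Exception:
            timings[qid] = None  # mark the failing query, skip the rest
            break
        timings[qid] = clock() - start
    return timings


if __name__ == "__main__":
    # Stand-in for a real driver: does no work, only demonstrates the call.
    results = time_queries(list(range(1, 23)), lambda qid: None)
    for qid, seconds in results.items():
        print(f"Q{qid}: {seconds:.6f} s")
```

Passing the same `query_ids` list to both runs keeps the sequence identical, which is the property the comparison relies on.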

Results:

Results for 50 GB TPC-H data

Hungry Hippos Table Configuration

Table Name        Sharding Dimension                             Number of Buckets  cut-off-percent
/input2/SUPPLIER  S_NATIONKEY,S_NATIONKEY,S_NATIONKEY            2                  1
/input2/REGION    R_REGIONKEY,R_REGIONKEY,R_REGIONKEY            1                  1
/input2/PARTSUPP  PS_SUPPKEY,PS_AVAILQTY,PS_SUPPKEY              20                 1
/input2/PART      P_SIZE,P_BRAND,P_SIZE                          20                 1
/input2/ORDERS    O_ORDERPRIORITY,O_ORDERSTATUS,O_ORDERPRIORITY  20                 1
/input2/NATION    N_NATIONKEY,N_NATIONKEY,N_NATIONKEY            1                  1
/input2/LINEITEM  L_SHIPMODE,L_SUPPKEY,L_SHIPMODE                20                 1
/input2/CUSTOMER  C_NATIONKEY,C_MKTSEGMENT,C_NATIONKEY           20                 1

Hungry Hippos shard column chosen while running Spark SQL

Table Name        Shard Column
/input2/SUPPLIER  S_NATIONKEY
/input2/REGION    R_REGIONKEY
/input2/PARTSUPP  PS_SUPPKEY
/input2/PART      P_SIZE
/input2/ORDERS    O_ORDERPRIORITY
/input2/NATION    N_NATIONKEY
/input2/LINEITEM  L_SHIPMODE
/input2/CUSTOMER  C_NATIONKEY
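
One property visible in the two 50 GB tables is that each query-time shard column also appears among that table's sharding dimensions. The sketch below checks this mechanically. It treats the sharding dimension purely as a comma-separated list of column names, which is an assumption: how Hungry Hippos itself interprets the repeated names is not stated here. The helper names are hypothetical.

```python
from typing import Dict, List

# Values copied from the 50 GB configuration tables above.
SHARDING_DIMENSIONS: Dict[str, str] = {
    "/input2/SUPPLIER": "S_NATIONKEY,S_NATIONKEY,S_NATIONKEY",
    "/input2/REGION": "R_REGIONKEY,R_REGIONKEY,R_REGIONKEY",
    "/input2/PARTSUPP": "PS_SUPPKEY,PS_AVAILQTY,PS_SUPPKEY",
    "/input2/PART": "P_SIZE,P_BRAND,P_SIZE",
    "/input2/ORDERS": "O_ORDERPRIORITY,O_ORDERSTATUS,O_ORDERPRIORITY",
    "/input2/NATION": "N_NATIONKEY,N_NATIONKEY,N_NATIONKEY",
    "/input2/LINEITEM": "L_SHIPMODE,L_SUPPKEY,L_SHIPMODE",
    "/input2/CUSTOMER": "C_NATIONKEY,C_MKTSEGMENT,C_NATIONKEY",
}

SHARD_COLUMNS: Dict[str, str] = {
    "/input2/SUPPLIER": "S_NATIONKEY",
    "/input2/REGION": "R_REGIONKEY",
    "/input2/PARTSUPP": "PS_SUPPKEY",
    "/input2/PART": "P_SIZE",
    "/input2/ORDERS": "O_ORDERPRIORITY",
    "/input2/NATION": "N_NATIONKEY",
    "/input2/LINEITEM": "L_SHIPMODE",
    "/input2/CUSTOMER": "C_NATIONKEY",
}


def distinct_dimensions(spec: str) -> List[str]:
    """Split a comma-separated dimension string, dropping repeats in order."""
    return list(dict.fromkeys(spec.split(",")))


def tables_with_unknown_shard_column(
    dimensions: Dict[str, str], shard_columns: Dict[str, str]
) -> List[str]:
    """Return tables whose shard column is not one of their sharding dimensions."""
    return [
        table
        for table, column in shard_columns.items()
        if column not in distinct_dimensions(dimensions[table])
    ]


if __name__ == "__main__":
    print(tables_with_unknown_shard_column(SHARDING_DIMENSIONS, SHARD_COLUMNS))  # prints []
```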

Results for 100 GB TPC-H data
The Hadoop (Spark on ORC) results for queries 21 and 22 are unavailable because the Spark program failed at query 21.

Hungry Hippos Table Storage Configuration

Table Name       Sharding Dimension                             Number of Buckets  cut-off-percent
/input/SUPPLIER  S_NATIONKEY,S_NATIONKEY,S_NATIONKEY            2                  1
/input/REGION    R_REGIONKEY,R_REGIONKEY,R_REGIONKEY            1                  1
/input/PARTSUPP  PS_SUPPKEY,PS_AVAILQTY,PS_SUPPKEY              20                 1
/input/PART      P_SIZE,P_BRAND,P_SIZE                          20                 1
/input/ORDERS    O_ORDERPRIORITY,O_ORDERSTATUS,O_ORDERPRIORITY  20                 1
/input/NATION    N_NATIONKEY,N_NATIONKEY,N_NATIONKEY            1                  1
/input/LINEITEM  L_SHIPMODE,L_SUPPKEY,L_ORDERKEY                20                 1
/input/CUSTOMER  C_NATIONKEY,C_MKTSEGMENT,C_NATIONKEY           20                 1

Hungry Hippos shard column chosen while running Spark SQL

Table Name       Shard Column
/input/SUPPLIER  S_NATIONKEY
/input/REGION    R_REGIONKEY
/input/PARTSUPP  PS_SUPPKEY
/input/PART      P_SIZE
/input/ORDERS    O_ORDERPRIORITY
/input/NATION    N_NATIONKEY
/input/LINEITEM  L_SHIPMODE
/input/CUSTOMER  C_NATIONKEY

Cluster Configuration

Machine         Model Name                               Architecture  CPU(s)  Mem (GB)  Swap (GB)  Storage (GB)
Machine 1       Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz  x86_64        8       7         15         443
Machine 2       Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz  x86_64        8       7         15         443
Machine 3       Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz  x86_64        8       7         7          451
Machine 4       Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz  x86_64        8       7         15         443
Machine 5       Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz  x86_64        8       7         7          227
Machine 6       Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz   x86_64        8       7         7          139
Machine 7       Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz  x86_64        8       7         7          451
Machine 8       Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz  x86_64        8       7         7          451
Machine 9       Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz  x86_64        8       7         7          451
Machine 10      Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz  x86_64        8       7         15         443
Machine 11      Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz  x86_64        8       7         7          451
Machine 12      Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz  x86_64        8       7         7          451
Machine 13      Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz   x86_64        8       7         7          139
Machine 14      Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz  x86_64        8       7         7          451
Machine 15      Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz   x86_64        8       7         7          139
Machine 16      Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz  x86_64        8       7         7          451
Machine 17      Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz  x86_64        8       7         15         443
Machine 18      Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz  x86_64        8       7         15         443
Machine 19      Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz  x86_64        8       7         7          451
Machine 20      Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz  x86_64        8       7         7          451
Client Machine  Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz  x86_64        8       7         7          451
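
To put the per-machine figures in context, the following sketch totals the resources of the twenty cluster machines (the client machine is excluded) and estimates the on-disk footprint of the larger data set at the replication factor of three stated earlier. It assumes the stored size equals the raw 100 GB text size, which ignores ORC compression and the Hungry Hippos on-disk format, so the footprint is only a rough estimate. All names are hypothetical.

```python
# Per-machine values copied from the cluster table (Machines 1 to 20).
CPUS_PER_MACHINE = 8
MEM_GB_PER_MACHINE = 7
STORAGE_GB = [
    443, 443, 451, 443, 227, 139, 451, 451, 451, 443,
    451, 451, 139, 451, 139, 451, 443, 443, 451, 451,
]
REPLICATION_FACTOR = 3


def replicated_footprint_gb(dataset_gb: float, replication: int = REPLICATION_FACTOR) -> float:
    """Disk space needed if every byte is stored `replication` times (no compression)."""
    return dataset_gb * replication


machines = len(STORAGE_GB)                      # 20
total_cpus = machines * CPUS_PER_MACHINE        # 160
total_mem_gb = machines * MEM_GB_PER_MACHINE    # 140
total_storage_gb = sum(STORAGE_GB)              # 7812
footprint_gb = replicated_footprint_gb(100)     # 300

if __name__ == "__main__":
    print(f"machines={machines} cpus={total_cpus} mem={total_mem_gb} GB")
    print(f"storage={total_storage_gb} GB, 100 GB x{REPLICATION_FACTOR} = {footprint_gb} GB")
    print(f"share of cluster storage: {footprint_gb / total_storage_gb:.1%}")
```

Under these assumptions the replicated 100 GB data set exceeds the 140 GB of combined cluster memory but occupies only a few percent of the cluster's disk capacity.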

Network Switch: 24-port, 1 Gbps switch

Application Configuration

Applications used for Hadoop Cluster  Version
OpenJDK                               1.8.0_171
Apache Hadoop                         2.7.2
Apache Spark                          2.2.0 (prebuilt for Apache Hadoop 2.7 and later)

Applications used for Hungry Hippos Cluster  Version
OpenJDK                                      1.8.0_171
Apache ZooKeeper                             3.5.1-alpha
Apache Spark                                 2.2.0 (prebuilt for Apache Hadoop 2.7 and later)
Hungry Hippos                                0.7.0