NOTE: For prerequisites, basic installation guidelines, and an end-to-end run of the Hungry Hippos application, check out the source code at https://github.com/Talentica/HungryHippos. For a quick start, the README explains how to run Hungry Hippos on a Digital Ocean cluster. Currently, no binary distributions are available.

Hungry Hippos Cluster Installation Manual

Prerequisites:

  • All the nodes in the cluster are running on Linux OS (all the scripts are written only for Linux based OS)
  • Java 8 is installed on all the nodes
  • Each node has a user named “hhuser” (currently, “hhuser” user is used in all the scripts)
  • SSH login to these nodes as “hhuser” is passwordless, so the scripts can run unattended (see the sketch after this list)
  • Each node has the following entries in the files mentioned below:
  • File “/etc/sudoers” (essential for clearing the cache of the nodes)

hhuser  ALL = NOPASSWD: /bin/sync
hhuser  ALL = NOPASSWD: /sbin/sysctl vm.drop_caches=3

  • File “/etc/security/limits.conf” (essential for opening a large number of files simultaneously)

* soft nofile 500000
* hard nofile 500000
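
A minimal sketch of setting up the passwordless “hhuser” SSH login, assuming the keys are generated on the node from which the cluster scripts will be launched (the node IPs below are placeholders):

# run as hhuser on the node that will launch the cluster scripts
ssh-keygen -t rsa                  # accept the defaults and leave the passphrase empty
ssh-copy-id hhuser@10.0.0.1        # repeat for every node in the cluster
ssh-copy-id hhuser@10.0.0.2
ssh hhuser@10.0.0.1 'echo ok'      # must succeed without prompting for a password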

Step 1 (Apache Zookeeper Cluster Installation)

  • Choose any set of nodes from the Hungry Hippos cluster to form the Zookeeper Cluster.
NOTE: The number of nodes for Zookeeper Cluster should be greater than or equal to the replication factor to be used for the files in the Hungry Hippos cluster.
  • Download Apache ZooKeeper from http://zookeeper.apache.org.
  • Install ZooKeeper version 3.5.1-alpha on the chosen nodes.
  • Start the ZooKeeper cluster as described in the Apache ZooKeeper manual (a minimal configuration sketch follows below).
NOTE: Zookeeper Cluster must be running before you start the Hungry Hippos Cluster.
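
A minimal configuration sketch for a three-node ensemble; the hostnames, ports and dataDir below are illustrative, not values required by Hungry Hippos. The same conf/zoo.cfg goes on every ZooKeeper node, each node additionally needs a myid file under dataDir containing its server number, and the ensemble is then started on every node with bin/zkServer.sh start.

# conf/zoo.cfg (identical on every ZooKeeper node)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hhuser/zookeeper/data
clientPort=2181
server.1=zk-node-1:2888:3888
server.2=zk-node-2:2888:3888
server.3=zk-node-3:2888:3888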

Step 2 (Apache Spark Cluster Installation)

  • For maximum computing performance, choose all the nodes as workers for Apache Spark Cluster.
NOTE: All the nodes in the Hungry Hippos cluster are eligible to become workers of Apache Spark Cluster.
  • To download and install Apache Spark cluster, visit https://spark.apache.org.
  • Download and install Apache Spark 2.2.0, pre-built for Apache Hadoop 2.7 or later.
NOTE: The Spark cluster must be configured in standalone mode (a minimal start-up sketch follows below).
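
A minimal start-up sketch for the standalone cluster, assuming Spark 2.2.0 is unpacked to the same path on every node and passwordless SSH is already in place; the worker IPs below are placeholders:

# on the designated master node, inside the Spark installation directory
printf '10.0.0.1\n10.0.0.2\n10.0.0.3\n' > conf/slaves   # one worker host per line
./sbin/start-all.sh                                     # starts the master and all listed workers over SSH
# a worker can also be started by hand on any node:
# ./sbin/start-slave.sh spark://<master-host>:7077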

Step 3 (Hungry Hippos Cluster Installation)

  • Login as hhuser on each node.
  • Download the distribution file from http://hungryhippos.io.
  • Unzip the distribution file to the home directory of hhuser.

Cluster Configuration files

  • client-config.xml : Used by the application to connect and retrieve data from the Zookeeper cluster.
  • cluster-config.xml : Used to store Hungry Hippos cluster details: each node's ID, a name identifying the node, its IP address, and the port on which the Hungry Hippos application will run on that node. This is a common configuration file uploaded to the Zookeeper during cluster initialization.
  • filesystem-config.xml :  Used to store file system configuration on each node. This is also a common configuration file uploaded to the Zookeeper during cluster initialization.

Configuration steps

  • Go to config directory of each node.
  • Prepare a file named hhnodes containing the IP addresses of all the nodes of the Hungry Hippos cluster.
  • Create a file cluster-config.xml from the cluster-config.xml.template. Update the cluster details in the cluster-config.xml.
  • Create a file filesystem-config.xml from the filesystem-config.xml.template. Update the filesystem details in the filesystem-config.xml.
  • Create a file client-config.xml from the client-config.xml.template. Update the Zookeeper details in the client-config.xml. (A sketch of these steps is shown below.)
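
A sketch of these steps on one node, assuming the distribution was unpacked to /home/hhuser/HungryHippos-0.7.0 (the path also used in the sample code later in this manual) and that hhnodes lists one IP address per line:

cd /home/hhuser/HungryHippos-0.7.0/config
printf '10.0.0.1\n10.0.0.2\n10.0.0.3\n' > hhnodes        # IPs of all Hungry Hippos nodes
cp cluster-config.xml.template    cluster-config.xml     # then edit the cluster details
cp filesystem-config.xml.template filesystem-config.xml  # then edit the file system details
cp client-config.xml.template     client-config.xml      # then edit the Zookeeper details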

Cluster Managing Scripts

After the configuration is complete, run the following scripts, found in the sbin directory of the distribution, from any one node in the cluster. These scripts must be run from inside the sbin directory.

  • clean-cluster.sh : used to format the data stored in the Hungry Hippos cluster.
  • start-cluster.sh : used to start the Hungry Hippos cluster.
  • stop-cluster.sh : used to stop the Hungry Hippos Cluster.
NOTE: Before running start-cluster.sh for the first time, run the clean-cluster.sh script at least once after all the configuration files have been updated (typical usage is shown below).
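
A typical first run, assuming the same distribution path as above:

cd /home/hhuser/HungryHippos-0.7.0/sbin
./clean-cluster.sh     # required at least once after the configuration files are updated
./start-cluster.sh
# ... publish data and run jobs ...
./stop-cluster.sh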

Additional Important Configuration script:

  • hh-env.sh : This file is present in the bin directory of the distribution directory. This script stores the path of the distribution home directory; the path can be updated to use another location as the distribution home directory.
    NOTE: All nodes in the Hungry Hippos cluster must use the same distribution directory path.

Data Storage Procedure

Data is stored in Hungry Hippos files in two steps. The data-publisher.jar is required for both steps; the jar file is present in the lib directory of the distribution.

Step 1 (Sharding)

This is a one-time step for each Hungry Hippos file. It evaluates the distribution of the data over the columns on which the data is to be sharded and assigns shards to the nodes.

The following files are required to perform sharding:

  • sample : a file containing the sample data (actual data to be stored). The sample data will be used to determine the distribution of the data on the shard columns.
  • client-config.xml : the same configuration file used for cluster configuration.
  • sharding-client-config.xml : the template file found in the config directory. This file contains the following details:
    • Path to the sample file
    • Distributed file path (the directory and file representation of the data in the Hungry Hippos cluster)
    • Schema details of the data
    • Delimiter character of input file
    • Comma separated Shard Column names. The replication factor is equal to the number of Shard Column names. Repeating a column name multiple times will create multiple shard replicas.
NOTE: Number of Shard columns must be less than or equal to the number of nodes in the cluster.
  • sharding-server-config.xml : The template file found in config directory. This file contains the following details:
    • Number of buckets. The number of shards into which the data will be partitioned on each column.
      • If the number is equal to or greater than the number of nodes in the cluster, the data is evenly distributed.
      • It directly affects the number of file chunks created in the Hungry Hippos cluster, so the number shouldn’t be chosen arbitrarily large.
      • The number of file chunks generated for each Hungry Hippos file can be calculated using the following formula:

The number of buckets = P

The number of distinct shard columns = Q

The number of shard columns (replication factor) = R

Total number of file chunks = R * P^Q

For example, with P = 8 buckets, Q = 2 distinct shard columns and R = 3 shard columns, a file is split into 3 * 8^2 = 192 chunks.

  • Cut-off percentage: the minimum percentage presence of a shard column value in the sample file for that value to be considered in the sharding computation.

Use the following command to perform sharding:

java -cp data-publisher-0.7.0.jar com.talentica.hungryHippos.sharding.main.ShardingStarter <path to the client-config.xml> <path to the folder containing sharding config xml files>
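
For example, assuming the distribution was unpacked to /home/hhuser/HungryHippos-0.7.0 and both sharding configuration XML files reside in its config directory:

cd /home/hhuser/HungryHippos-0.7.0/lib
java -cp data-publisher-0.7.0.jar com.talentica.hungryHippos.sharding.main.ShardingStarter \
     /home/hhuser/HungryHippos-0.7.0/config/client-config.xml \
     /home/hhuser/HungryHippos-0.7.0/config/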

Step 2 (Data Publish)

This step can be performed multiple times for the same Hungry Hippos file to append data into the same file.

The following files are required to perform data publish:

  • data: a file containing the data (actual input data to be stored). This data will be appended to the Hungry Hippos file.
  • client-config.xml : The same configuration file used for cluster configuration.

Use the following command to perform data publish:

java -cp data-publisher-0.7.0.jar com.talentica.hungryHippos.master.DataPublisherStarter <path to the client-config.xml> <path to actual data file> <file path of Hungry Hippos File>
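
For example, to append a local file /home/hhuser/input/data.csv (an illustrative path) to the Hungry Hippos file /ParentDirectory/FileName used in the sample code below:

cd /home/hhuser/HungryHippos-0.7.0/lib
java -cp data-publisher-0.7.0.jar com.talentica.hungryHippos.master.DataPublisherStarter \
     /home/hhuser/HungryHippos-0.7.0/config/client-config.xml \
     /home/hhuser/input/data.csv \
     /ParentDirectory/FileName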

Running Jobs on Hungry Hippos files

We currently use an Apache Spark cluster to run jobs on Hungry Hippos files.

The following files are required to run the Spark Jobs on Hungry Hippos files.

NOTE: The jars can be found in the lib directory of the distribution. Add the paths of these jars to the --jars list when submitting Spark jobs (see the spark-submit sketch below).
  • hhrdd.jar : contains the API classes for Spark to interact with the Hungry Hippos File System.
  • node.jar : contains the API classes to interact with the Hungry Hippos File.
  • client-config.xml : the same file used for Cluster Configuration. It is used to connect to the Zookeeper.

Jobs on Hungry Hippos files can be written in either Scala or Java; hhrdd.jar must additionally be on the classpath when the job code is compiled.
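
A hedged spark-submit sketch for the SampleApp class shown in the sample code below; the master URL, jar locations and the application jar name (sample-app.jar) are placeholders for illustration:

spark-submit \
  --class SampleApp \
  --master spark://<spark-master-host>:7077 \
  --jars /home/hhuser/HungryHippos-0.7.0/lib/hhrdd.jar,/home/hhuser/HungryHippos-0.7.0/lib/node.jar \
  sample-app.jar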

 

——————————————- Start of Sample Code ——————————————-

import com.talentica.hungryhippos.datasource.HHSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
public class SampleApp {
   public static void main(String[] args) throws Exception {
       // path to the Hungry Hippos File
       String hungryHipposFilePath = "/ParentDirectory/FileName";
       String clientConfigPath = "/home/hhuser/HungryHippos-0.7.0/config/client-config.xml";
       String appName = "Row Counter";
        // create the SparkSession for the application.
        SparkSession sparkSession =
                SparkSession.builder().appName(appName).getOrCreate();
        // use the com.talentica.hungryhippos.datasource.HHSparkContext class to create the JavaSparkContext.
       HHSparkContext context = new HHSparkContext(sparkSession.sparkContext(), clientConfigPath);
       // use the HHSparkContext class to create the org.apache.spark.sql.SQLContext Instance.
       SQLContext sqlContext = new SQLContext(context);
       // create Dataset<Row> Object for the Hungry Hippos File.
        // The "dimension" option specifies the column number of the shard column to use when
        // generating the Hungry Hippos RDD for the Dataset. If no column number is specified,
        // the Hungry Hippos API defaults to the first shard column.

       Dataset<Row> df = sqlContext.read().
               format("com.talentica.hungryhippos.datasource").
               option("dimension", "0").
               load(hungryHipposFilePath);
       // register the Dataset Object to a View
       String viewName = "TB_DATA";
       df.createOrReplaceTempView(viewName);
       // run your queries on the view
       Dataset<Row>
               query = sparkSession.sql("SELECT COUNT(*) FROM "+viewName);
       query.show();
   }
}

——————————————- End of Sample Code ——————————————-