Jaguar Supports Spark

Now that Jaguar provides JDBC connectivity, developers can use Apache Spark to load data from Jaguar and perform data analytics and machine learning. The advantage of Jaguar is that Spark can load data from it faster than from other data sources, especially when the data must satisfy complex conditions. The following code is based on two tables with the following structure:

create table int10k ( key: uid int(16), score float(16.3), value: city char(32) );
create table int10k_2 ( key: uid int(16), score float(16.3), value: city char(32) );
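Before turning to Spark, the JDBC connectivity can be exercised directly from plain Scala. The sketch below is illustrative only: the driver class, URL, host/port, and credentials reuse the values from the Spark example that follows, and the `int10k` table is assumed to already exist and be populated.

```scala
// Minimal sketch of direct JDBC access to Jaguar (values assumed from the
// Spark example below; adjust host, port, and credentials for your cluster).
import java.sql.DriverManager

object JaguarJdbcCheck {
  def main(args: Array[String]): Unit = {
    Class.forName("")   // load the Jaguar JDBC driver
    val conn = DriverManager.getConnection(
      "jdbc:jaguar://", "test", "test")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("select uid, score, city from int10k")
    while ( {
      println(rs.getString("uid") + " " + rs.getString("score") + " " + rs.getString("city"))
    }
    conn.close()
  }
}
```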

Scala program:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import scala.collection._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.log4j.Logger
import org.apache.log4j.Level

object TestScalaJDBC {
  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("TestScalaJDBC")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val people ="jdbc").options(
      Map( "url" -> "jdbc:jaguar://",
        "dbtable" -> "int10k",
        "user" -> "test",
        "password" -> "test",
        "partitionColumn" -> "uid",
        "lowerBound" -> "2",
        "upperBound" -> "2000000",
        "numPartitions" -> "4",
        "driver" -> "" ) ).load()

    val people2 ="jdbc").options(
      Map( "url" -> "jdbc:jaguar://",
        "dbtable" -> "int10k_2",
        "user" -> "test",
        "password" -> "test",
        "partitionColumn" -> "uid",
        "lowerBound" -> "2",
        "upperBound" -> "2000000",
        "numPartitions" -> "4",
        "driver" -> "" ) ).load()

    // sort by columns
    people.sort($"score".desc, $"uid".asc).show()
    people.orderBy($"score".desc, $"uid".asc).show()

    // select by expression
    people.selectExpr("score", "uid").show()
    people.selectExpr("score", "uid as keyone").show()
    people.selectExpr("score", "uid as keyone", "abs(score)").show()

    // select a few columns
    val uid2 ="uid", "score")

    // filter rows
    people.filter(people("uid") > 20990397).show()

    // group by (grouping on "city" is assumed here)
    people.groupBy("city").count().show()

    // group by and aggregate: average score, maximum uid
    people.groupBy("city").agg(Map(
      "score" -> "avg",
      "uid" -> "max" )).show()

    // rollup
    people.rollup("city").agg(Map(
      "uid" -> "avg",
      "score" -> "max" )).show()

    // cube
    people.cube("city").agg(Map(
      "uid" -> "avg",
      "score" -> "max" )).show()

    // describe statistics
    people.describe("uid", "score").show()

    // find frequent items
    people.stat.freqItems(Seq("uid")).show()

    // join two tables
    people.join(people2, "uid").show()
    people.join(people2, "score").show()
    people.join(people2).where( people("uid") === people2("uid") ).show()
    people.join(people2).where( people("city") === people2("city") ).show()
    people.join(people2).where( people("uid") === people2("uid") && people("city") === people2("city") ).show()
    people.join(people2).where( people("uid") === people2("uid") && people("city") === people2("city") ).limit(3).show()

    // union
    people.unionAll(people2).show()

    // intersection
    people.intersect(people2).show()

    // except (set difference)
    people.except(people2).show()

    // take samples
    people.sample(true, 0.1, 100).show()

    // distinct
    people.distinct().show()

    // same as distinct
    people.dropDuplicates().show()

    // cache and persist
    people.cache()
    people2.persist()

    // SQL on the DataFrame registered as a temporary table
    people.registerTempTable("int10k")
    val df = sqlContext.sql("SELECT * FROM int10k where uid < 200000000 and city between 'Alameda' and 'Berkeley' ")
  }
}

The class generated from the above Scala program can be submitted to Spark as follows:

./bin/spark-submit --class TestScalaJDBC \
  --master spark://masterhost:7077 \
  --driver-class-path /path/to/your/jaguar-jdbc-2.0.jar \
  --driver-library-path $HOME/jaguar/lib \
  --conf spark.executor.extraClassPath=/path/to/your/jaguar-jdbc-2.0.jar \
  --conf spark.executor.extraLibraryPath=$HOME/jaguar/lib \
  /path/to/your/application.jar



A Very Useful Tool in a Cluster Environment

Distributed Shell (dsh) is a very powerful tool for system administrators in a cluster environment. Here are some tips for installing and using it.

On Debian/Ubuntu:

sudo apt-get install dsh

On RedHat/CentOS:

sudo yum install dsh

In /etc/dsh/dsh.conf, change remoteshell:

remoteshell = ssh

Here is how to make your public key if you do not have one yet (~/.ssh/

$ ssh-keygen -t rsa -P ""
(no passphrase; ~/.ssh/ will be created)
$ ssh-copy-id -i ~/.ssh/   ALL_OTHER_HOSTS

Then in /etc/dsh/machines.list  put all your hosts:
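For example, machines.list holds one hostname per line (the names below are hypothetical):

```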


Finally you can issue commands to ALL the hosts in your cluster (-a targets all machines, -M prepends each machine's name to its output, -c runs the command concurrently):

$ dsh -aM -c YOUR_COMMAND

For example:
$ dsh -aM -c uptime



Jaguar Benchmark Against Spark

Jaguar is the distributed version of ArrayDB. By simply installing ArrayDB over a distributed file system such as Gluster or Ceph, ArrayDB performs extremely well.

We set up a cluster of servers running GlusterFS, a distributed file system capable of scaling to several petabytes and handling thousands of clients, and mounted the Gluster volume on the clustered servers. Spark 1.3.1 was also installed on the same servers to benchmark SQL operations on a data set consisting of two million key-value pairs.

In Spark testing, the procedure to compile and execute a Spark Scala program is as follows:

$ vim MyTest.scala

$ sbt package

$ spark-submit --class MyTest --master yarn-client   target/scala-2.11/mytest_2.11-1.0.jar
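For reference, the `sbt package` step assumes a minimal build.sbt along these lines. The project name and versions here are assumptions, chosen only to match the jar path and the Spark 1.3.1 setup mentioned above:

```scala
// build.sbt -- illustrative sketch; name and versions are assumptions
name := "mytest"
version := "1.0"
scalaVersion := "2.11.7"
// "provided" because the Spark jars are supplied by the cluster at run time
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.1" % "provided"
```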

In Jaguar testing, the data directory in $HOME/arraydb/ is soft-linked to the mounted directory of the Gluster volume. Then client programs are started on the different clustered hosts.

1. Joining two tables, each consisting of 2,000,000 data items (32-byte keys, 48-byte values).


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType}

object MyTest {
  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("MyTest")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    val people1 = sc.textFile("hdfs://HD3:9000/home/exeray/2M.txt")
    val people2 = sc.textFile("hdfs://HD3:9000/home/exeray/2M2.txt")

    val schemaString = "uid v1 v2 v3"
    val schema = StructType( schemaString.split(" ").map( fieldName => StructField( fieldName, StringType, true ) ) )
    val rowRDD1 = _.split(",") ).map( p => Row( p(0), p(1), p(2), p(3) ) )
    val rowRDD2 = _.split(",") ).map( p => Row( p(0), p(1), p(2), p(3) ) )
    val peopleSchemaRDD1 = sqlContext.applySchema( rowRDD1, schema )
    val peopleSchemaRDD2 = sqlContext.applySchema( rowRDD2, schema )
    peopleSchemaRDD1.registerTempTable( "people1" )
    peopleSchemaRDD2.registerTempTable( "people2" )

    val res = sqlContext.sql("SELECT * FROM people1 join people2 on people1.uid=people2.uid ")
    res.count()   // action that forces the join to execute
  }
}




The equivalent join in Jaguar:

adb> select * join ( 2M, 2M2 );

Result: Spark took 356 seconds, Jaguar took 168 seconds.


2. Joining two tables with condition

The Spark Scala program has added a where clause:

val res = sqlContext.sql("SELECT * FROM people1 join people2 on people1.uid=people2.uid where people1.uid >= 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' and people1.uid <= 'gggggggggggggggggggggggggggggggg' ")

So does the Jaguar query:

adb> select * join ( 2M, 2M2) where 2M.uid >= 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' and 2M.uid <= 'gggggggggggggggggggggggggggggggg';

Result: Spark took 85 seconds, Jaguar took 13 seconds.


3. Counting items by key range

Spark:  SELECT count(*) FROM people1 where uid >= 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk' and uid <= 'mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm'

Jaguar:  select count(*) from 2M2 where uid >= 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk' and uid <= 'mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm' limit 999999999;

Result: Spark took 52 seconds, Jaguar took 0.1 seconds.

4. Point queries


Spark:

val res1 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'yGW4r5thqpu7Bb4TCmxtdTpxXTxcOjhk' ")
val res2 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'lZ1wt3llixT0r5jujuwfcKYb0Og2JF05' ")
val res3 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'YKyoRLuBuYBGTpmQauGgnPZg3FGI3GxZ' ")
val res4 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'YOzKDhmtCN095BVtyJRESRjhamhbJD1H' ")
val res5 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'w0zDgzD2BdWE5sgFxgEL6zBjZckY6mnA' ")
val res6 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'HhB6p8srwRG4PpHCgT1IG1jKJU0PXDJE' ")
val res7 = sqlContext.sql("SELECT count(*) FROM people1 where uid = '4Fpqf8JLORNavhwnthF7olySkAk0ggOj' ")
val res8 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'tvApMnTzxc8SCkyRiSnTWtIYUHJQc91E' ")
val res9 = sqlContext.sql("SELECT count(*) FROM people1 where uid = '1Sg72G7ubanKSiYkzOqaGf9VvQjIVDLV' ")
val res10 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'omo64Q5VxjzhDs148tNzrW4sGk4ouASS' ")


Jaguar:

select * from 2M where uid=yGW4r5thqpu7Bb4TCmxtdTpxXTxcOjhk;
select * from 2M where uid=lZ1wt3llixT0r5jujuwfcKYb0Og2JF05;
select * from 2M where uid=YKyoRLuBuYBGTpmQauGgnPZg3FGI3GxZ;
select * from 2M where uid=YOzKDhmtCN095BVtyJRESRjhamhbJD1H;
select * from 2M where uid=w0zDgzD2BdWE5sgFxgEL6zBjZckY6mnA;
select * from 2M where uid=HhB6p8srwRG4PpHCgT1IG1jKJU0PXDJE;
select * from 2M where uid=4Fpqf8JLORNavhwnthF7olySkAk0ggOj;
select * from 2M where uid=tvApMnTzxc8SCkyRiSnTWtIYUHJQc91E;
select * from 2M where uid=1Sg72G7ubanKSiYkzOqaGf9VvQjIVDLV;
select * from 2M where uid=omo64Q5VxjzhDs148tNzrW4sGk4ouASS;

Result: Spark took 134 seconds, Jaguar took 0.7 seconds.


Conclusion: For conditional queries, especially when indexes are used, Jaguar performs much faster than Spark.

ArrayDB 1.0 Official Release

After six months of hard work on ArrayDB, we are proud to announce ArrayDB 1.0, the next-generation NewSQL data store that delivers high performance based on our revolutionary array-indexing technology.

Some key features of ArrayDB are:

  • High Performance: 5,000,000 records per minute of data ingestion with simultaneous index building. High write performance allows storing data that arrives at high velocity.
  • Fast Join: joins multiple tables at the same time, and at high speed, thanks to the fast merge-join operation over the unique array-indexed tables.
  • Configurable memory usage: easy configuration of memory usage for fast data loads and table joins in environments where DRAM resources can be leveraged.
  • More client bindings: in addition to the C binding, Java and JDBC client APIs are provided. Any Java application can call the native ArrayDB Java API or ArrayDB JDBC to query the fast ArrayDB server.
  • Semi-structured data support: keys in a table have a schema, but the value fields are schema-less. This feature allows flexible storage of unstructured data as well as fast lookup by key.

We will continue to improve our product and make ArrayDB scalable.  Future work will include integration of our fast indexing engine with big data platforms to offer a spectrum of computing functionality.

ArrayDB Beta Version has been released

Today we proudly announce that the beta version of the ArrayDB analytical database has been released to the general public. In the past few months, Exeray has developed ArrayDB, a new cloud-based enterprise database for analytical processing of big data. By leveraging our breakthrough technological invention (ArrayIndexing™), which can speed up the indexing of data by orders of magnitude, we provide customers with an exceptionally fast query engine for gaining deep insights into their data. Our software is valuable as both a transaction engine and an analytical engine, and our ArrayDB product is clearly superior to its competitors in big-data analytics. For the first time, a database that employs revolutionary array-based indexing technology has been developed and released. The ArrayDB product package can be downloaded from our GitHub repository.

Exeray has released ArrayDB Embedded Database

ArrayDB Embedded Database is useful for embedded devices such as smart phones or smart meters. In such cases, all you need is a fast, file-based, stand-alone data storage engine that can manage your data with high performance. The embedded ArrayDB storage engine can store up to 36,000 records per second and read data at 35 million records per second. Even better, in addition to its high speed, its memory consumption is minimal.

Exeray has released new version (V5) of Bigdata Enabler

Exeray has released version 5 of BigData Enabler. This new version changes the method of low-level data storage: it has been redesigned and developed from scratch and uses pure array-based indexing technology. Most new containers maintain both key order and hash lookup of keys, so they are extremely fast at point searches and ordered range searches.

On top of the existing data containers, new classes such as AbaxCounter and AbaxGraph have been added to the family. The AbaxCounter container collects data and maintains ordered counts for each key in real time. The AbaxGraph container dynamically manages directed as well as undirected graphs: one can easily add nodes and edges to a graph, and methods are provided to detect adjacency relationships, retrieve neighbor lists, and find a minimum spanning tree in a graph.