I would like to use Hbase to store network events, the rowkey will be the timestamp of the event. This will create hotspotting in Hbase, to avoid this i am going to add a salt prefix to the rowkey. This will distribute the rows evenly across all the regions.
so far no problem.
But i also want to use Cloudera Impala to query this data with SQL. Is it possible to use Impala with salted rowkeys? or do i need * sql queries to get all the results?
Unfortunately Impala does not support this. I've also found that the performance of Impala on HBase is very poor. I'm still trying to tweak my configurations so that the performance is better. Parquet in my experiments seems to be the fastest (by 100s of x). In fact event a flat tsv file with no compression was performing better than HBase. The following describes the use cases for Impala with HBase (directly from Cloudera documentation):
Use Cases for Querying HBase through Impala
The following are popular use cases for using Impala to query HBase tables:
Keeping large fact tables in Impala, and smaller dimension tables in HBase. The fact tables use Parquet or other binary file format optimized for scan operations. Join queries scan through the large Impala fact tables, and cross-reference the dimension tables using efficient single-row lookups in HBase. Using HBase to store rapidly incrementing counters, such as how many times a web page has been viewed, or on a social network, how many connections a user has or how many votes a post received. HBase is efficient for capturing such changeable data: the append-only storage mechanism is efficient for writing each change to disk, and a query always returns the latest value. An application could query specific totals like these from HBase, and combine the results with a broader set of data queried from Impala. Storing very wide tables in HBase. Wide tables have many columns, possibly thousands, typically recording many attributes for an important subject such as a user of an online service. These tables are also often sparse, that is, most of the columns values are NULL, 0, false, empty string, or other blank or placeholder value. (For example, any particular web site user might have never used some site feature, filled in a certain field in their profile, visited a particular part of the site, and so on.) A typical query against this kind of table is to look up a single row to retrieve all the information about a specific subject, rather than summing, averaging, or filtering millions of rows as in typical Impala-managed tables.
Or the HBase table could be joined with a larger Impala-managed table. For example, analyze the large Impala table representing web traffic for a site and pick out 50 users who view the most pages. Join that result with the wide user table in HBase to look up attributes of those users. The HBase side of the join would result in 50 efficient single-row lookups in HBase, rather than scanning the entire user table.
Specifically on running SQL queries against HBase. Have you looked into Pheonix? It does support salted tables and provides SQL syntax. I don't know how fast it is or how it compares to Dremel implementations though.