Owing to its SQL-like nature, Hive makes it simple to frame jobs over structured data, which justifies its extensive use in Hadoop. That said, another fact merits attention: joins over large data-sets in Hive frequently fail with out-of-memory or Java heap space errors. But don't fret. Hive offers some smart options that can be used to avoid such scenarios. Let us look into a few of them.
Enable map joins
When you are joining multiple tables in Hive and one or more of them has a small data volume, it is well worth enabling map joins. This ensures that even before the actual MapReduce job starts, a local MapReduce task distributes the smaller tables to the distributed cache as hash tables. In effect, the join runs map-side rather than as the common reduce-side join (the normal Hive join), saving a good amount of time for the job(s).
To achieve that, set the following at the Hive CLI before running the join query:
set hive.auto.convert.join = true;
There is a catch in this approach, however: performance. Hive is intelligent about map-side joins; when the data volume is too large to fit in a map-side join, it falls back to the common join (a full MapReduce execution) to accomplish the job. What this means, though, is that the time spent checking whether the map join is feasible becomes pure overhead for the job. Therefore, it may not be advisable to take this approach if the tables in the join hold huge volumes of data.
Further, it is advisable to set a cut-off on the data volume for adopting this approach (enabling map joins). You can raise the small-table threshold to a decently larger size (above the conservative 25 MB default), beyond which the Hive common join is triggered instead.
The threshold can be modified by setting the following configuration variable (named hive.smalltable.filesize in older Hive releases):
set hive.mapjoin.smalltable.filesize = 40000000;
Further, we can also force a map join with a hint in the join query, as under:
e.g.: select /*+ MAPJOIN(dept) */ emp.name, dept.deptname from emp join dept on emp.deptid=dept.deptid;
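Putting the pieces together, a typical session might look like this (a sketch assuming the emp and dept tables from the hint example above, with dept being the small table):

```sql
-- Let Hive convert eligible joins to map joins automatically
set hive.auto.convert.join = true;
-- Treat tables up to ~40 MB as small enough for a map join
set hive.mapjoin.smalltable.filesize = 40000000;

-- dept is small, so a local task loads it into the distributed cache
-- as a hash table and the join runs map-side, with no reduce phase
SELECT emp.name, dept.deptname
FROM emp JOIN dept ON emp.deptid = dept.deptid;
```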
Exploiting EQUI Join support in Hive
Hive queries that run interminably make out-of-memory errors almost a given, and are behind several other common failure scenarios as well. It is therefore well worth employing the EQUI joins that Hive supports to avoid these problems.
It gets better if your join query has several equality conditions in the 'WHERE' clause, which can then be moved inside the 'ON' clause of the join. This reduces the number of records selected for the join by a considerable measure, which means fewer records reach the sort phase.
Let us consider a query as under:
Select Table1.Column1, Table2.Column2 FROM Table1 JOIN Table2 ON (Table1.Column5 = Table2.Column7 AND Table1.Column9 = Table2.Column3) WHERE Table1.Column1 = '1024' AND Table2.Column2 > 5;
As you can see, this query has an equality expression in the 'WHERE' clause involving one of the tables in the join. To optimize the Hive query in this case, include the equality filter condition in the join as:
Select Table1.Column1, Table2.Column2 FROM Table1 JOIN Table2 ON (Table1.Column5 = Table2.Column7 AND Table1.Column9 = Table2.Column3 AND Table1.Column1 = '1024') WHERE Table2.Column2 > 5;
The difference this slight modification makes is in when the filter is applied during query execution. What we are doing is simple: instead of applying the filter condition after the join (as in the first query above), it is applied while joining (in the second query), so relatively fewer records pass through the join.
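You can check the effect on the query plan with EXPLAIN (a sketch using the illustrative table and column names from the queries above):

```sql
EXPLAIN
SELECT Table1.Column1, Table2.Column2
FROM Table1 JOIN Table2
  ON (Table1.Column5 = Table2.Column7
      AND Table1.Column9 = Table2.Column3
      AND Table1.Column1 = '1024')
WHERE Table2.Column2 > 5;
```

In the plan for this second variant, the predicate on Table1.Column1 should appear on the scan side feeding the join operator rather than after it, confirming that fewer rows enter the join.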
Increase the heap size
Your Hive query already has the necessary modifications (1 and 2 above) for join optimization, but you are still facing issues, for whatever reason (heap size, etc.). Don't worry: there is yet another way to get optimal performance from your Hive query.
After enabling the map join and/or the EQUI join, one of the things you may want to look at is increasing the heap size of the MapReduce child tasks.
You can do so by setting the property 'mapred.child.java.opts' to a higher value. For example, to increase the heap size to 1 GB, set it as:
set mapred.child.java.opts = -Xmx1024m;
There may still be other knobs for dealing with issues such as these, like io.sort.mb, but the choice is best left to you, so you can get the best out of your Hive query.
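For reference, both knobs mentioned in this section can be set per session at the Hive CLI (a sketch; the 1 GB heap and 256 MB sort buffer are illustrative values, not tuned recommendations):

```sql
-- Heap size for each MapReduce child task
set mapred.child.java.opts = -Xmx1024m;
-- Memory buffer used while sorting map output
set io.sort.mb = 256;
```

As a rule of thumb, io.sort.mb must stay well below the child-task heap, since the sort buffer is allocated inside that heap.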
General tips to improve Hive queries to run faster
- 1. Use ORC File
Hive supports the ORC file format – a table storage format that delivers substantial speed improvements through techniques like predicate push-down, compression, and more.
Using ORC files for every Hive table should really be a no-brainer, and is extremely beneficial for getting fast response times from your Hive queries. For example, a table with these columns can be stored as ORC as under:
CREATE TABLE customer (customerID int, name string, age int, address string) STORED AS ORC;
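An existing table can be migrated to ORC by creating an ORC-backed copy and populating it in one step (a sketch assuming a plain-text customer table already exists):

```sql
-- CTAS: create the ORC table and load it from the existing table
CREATE TABLE customer_orc STORED AS ORC AS
SELECT * FROM customer;
```

Queries can then be pointed at customer_orc, and the original table dropped once the data is verified.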
- 2. Use Vectorization
Vectorized query execution improves the performance of operations like scans, aggregations, filters, and joins by performing them on batches of 1024 rows at a time instead of one row at a time.
Introduced in Hive 0.13, this feature significantly improves query execution time, and is easily enabled with two parameter settings:
- I. set hive.vectorized.execution.enabled = true;
- II. set hive.vectorized.execution.reduce.enabled = true;
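As a quick sketch, the two settings above can be enabled per session before running a scan-heavy query (the orders table here is hypothetical):

```sql
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;

-- The scan, filter, and aggregation now process rows in batches of 1024
SELECT state, COUNT(*)
FROM orders
WHERE amount > 100
GROUP BY state;
```

Note that vectorization works on columnar formats such as ORC, which is another reason to adopt ORC storage as in tip 1.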
- 3. Partition Based Joins:
To optimize joins in Hive, we have to reduce the query scan time. For that, we can create partitioned Hive tables and specify the partition predicates in the 'WHERE' clause or the ON clause of a JOIN. For example: the table 'state_view' is partitioned on the column 'state'.
The below query retrieves rows for only the given states:
SELECT state_view.* FROM state_view WHERE state_view.state = 'State-1' OR state_view.state = 'State-3';
If the table state_view is joined with another table, city_users, you can restrict the join to matching partitions in the ON clause as follows:
SELECT state_view.* FROM state_view JOIN city_users ON (state_view.state = city_users.state);
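For completeness, a partitioned table like state_view declares the partition column outside the regular column list (a sketch with illustrative columns):

```sql
CREATE TABLE state_view (user_id int, city string)
PARTITIONED BY (state string);

-- With a partition predicate, only the 'State-1' partition is scanned;
-- the remaining partitions are pruned before the join
SELECT sv.*
FROM state_view sv JOIN city_users cu
  ON (sv.state = cu.state)
WHERE sv.state = 'State-1';
```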
Hope this post helped you with all your joins optimization needs in Hive.