Apache Drill is very efficient and fast, until you use it on a single huge file (a few GB or more) or attempt to query a complex data structure with nested data. That is exactly what I am attempting right now: querying large segments of data with a dynamic structure and a nested schema.
I can construct a Parquet data source from a nested array, as below:
create table dfs.tmp.camic as ( select camic.geometry.coordinates[0][0] as geo_coordinates from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic);
Here I am giving the indices of the array.
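For context, a minimal, hypothetical sketch of what I assume a record in camic.json looks like (simplified so that geometry.coordinates is a two-level array, which makes coordinates[0][0] a scalar, consistent with the query above; the real files carry far larger coordinate arrays):
{"type": "Feature",
 "geometry": {"coordinates": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]},
 "properties": {"id": "sample-1"}}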
Then I can query the data efficiently. For example,
select * from dfs.tmp.camic;
However, giving the indices won't work for my use case, as I don't just need the first element. I need all the elements of a large, dynamic array representing the coordinates of the GeoJSON.
create table dfs.tmp.camic as ( select camic.geometry.coordinates[0] as geo_coordinates from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic);
Error: SYSTEM ERROR: UnsupportedOperationException: Unsupported type LIST
Fragment 0:0
[Error Id: a6d68a6c-50ea-437b-b1db-f1c8ace0e11d on llovizna:31010]
(java.lang.UnsupportedOperationException) Unsupported type LIST
org.apache.drill.exec.store.parquet.ParquetRecordWriter.getType():225
org.apache.drill.exec.store.parquet.ParquetRecordWriter.newSchema():187
org.apache.drill.exec.store.parquet.ParquetRecordWriter.updateSchema():172
org.apache.drill.exec.physical.impl.WriterRecordBatch.setupNewSchema():155
org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():103
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():744 (state=,code=0)
Here, I am trying to query a multi-dimensional array, which is not straightforward.
(I set the error messages to be verbose using SET `exec.errors.verbose` = true; above.)
The commonly suggested options to query multi-dimensional arrays are:
1. Using the array indices in the select query: This is impractical. I do not know how many elements the geojson coordinates will have - it may be millions, or as few as 3.
2. The FLATTEN keyword: I am using Drill on top of Mongo, and I am finding an interesting case where Drill's distributed execution outperforms plain Mongo on certain queries. Using FLATTEN basically kills all the performance benefits I otherwise have with Drill. FLATTEN is just a plain expensive operation at the scale of my data (around 48 GB, though I can split it into files of a few GB each); see the sketch after this list.
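For completeness, a minimal sketch of option 2, under the same assumption as above that geometry.coordinates is a two-level array of coordinate pairs; the lon/lat column names and the camic_flat table name are my own. FLATTEN unnests the array into one row per coordinate pair, and indexing the flattened pair yields scalars that the Parquet writer accepts:
create table dfs.tmp.camic_flat as (
  select t.coord[0] as lon, t.coord[1] as lat
  from (select flatten(camic.geometry.coordinates) as coord
        from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic) t
);
This avoids the "Unsupported type LIST" error because each row now carries only scalars, but it multiplies the row count by the number of coordinates per geometry, which is exactly where the cost explodes at this scale.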
This is a known limitation of Drill. However, this significantly reduces its usability, as the proposed workarounds are either impractical or inefficient.