Friday, August 12, 2016

[GSoC 2016] MediCurator : Near Duplicate Detection for Medical Data Warehouse Construction

This summer, at the Department of Biomedical Informatics, Emory University (Emory BMI), we have another set of intelligent students working on interesting projects. I have been mentoring Yiru Chen (Irene) from Peking University, on the project "MediCurator: Near Duplicate Detection for Medical Data Warehouse Construction" for the past couple of months. Currently we have reached the final stages of the project, as the student evaluation period starts on the 15th of August. This post is a summary of this successful GSoC, as well as a history behind the near duplicate detection efforts.

The early history of MediCurator
MediCurator was a research prototype that I initially developed based on my paper ∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data (CoopIS'15) as part of my data quality research, along with my GSoC 2015 work on data integration. The early results were presented as a poster at AMIA 2016 in San Francisco.

MediCurator and Infinispan
Now we have a more complete implementation of MediCurator and a use case for medical data, thanks to the support provided by GSoC. For her implementation, Irene ran some benchmarks before choosing Infinispan's latest distributed streams for the distributed execution. (You may find some interesting discussion on the Infinispan distributed streams here.)

MediCurator Use Case
MediCurator is a data quality platform for the ETL workflows in data warehouse construction. It optimizes bandwidth usage by avoiding duplicate downloads, and optimizes storage by eliminating the near duplicates in the warehouse, thus increasing the data quality. When data is downloaded, the source locations are tracked, and when data is updated in the source at a later time, the subsequent download process will download only the new data.
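To illustrate the download tracking described above, here is a minimal sketch in Python. It is not MediCurator's actual code: the DownloadTracker class, its method names, and the integer "stamp" standing in for a source-side modification marker are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DownloadTracker:
    """Hypothetical tracker: remembers what was already fetched from a source."""

    # Maps a source item ID to the modification stamp seen at download time.
    seen: dict = field(default_factory=dict)

    def pending(self, source_index):
        """Return the IDs that are new or changed since the last download."""
        return [
            item_id
            for item_id, stamp in source_index.items()
            if self.seen.get(item_id) != stamp
        ]

    def mark_downloaded(self, source_index, item_ids):
        """Record the stamps of the items that were just downloaded."""
        for item_id in item_ids:
            self.seen[item_id] = source_index[item_id]


tracker = DownloadTracker()

# First run: nothing is tracked yet, so every item is pending.
first_index = {"img-001": 1, "img-002": 1}
first_batch = tracker.pending(first_index)
tracker.mark_downloaded(first_index, first_batch)

# Later run: img-002 was updated at the source and img-003 is new.
second_index = {"img-001": 1, "img-002": 2, "img-003": 1}
second_batch = tracker.pending(second_index)
```

In this sketch the second run skips img-001, which is unchanged, and fetches only the updated and the newly added items.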

Similarly, data is deduplicated at the data warehouse, as near duplicates could be present there since data is integrated from multiple data sources. Here the data pairs are evaluated for near duplicates in a distributed manner, with duplicate pairs stored separately, while the clean data stays in the warehouse. The duplicate detection workflow also considers the corrupted data/metadata, and synchronizes/downloads the clean data from the source.
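The pairwise evaluation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Jaccard similarity over metadata tokens and the 0.8 threshold are stand-ins chosen for the example, not necessarily the measure MediCurator uses, and the loop runs in a single process rather than over Infinispan distributed streams. The function names and sample records are hypothetical.

```python
from itertools import combinations


def jaccard(text_a, text_b):
    """Jaccard similarity between the token sets of two metadata strings."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def split_near_duplicates(records, threshold=0.8):
    """Return (duplicate_pairs, clean_ids) for a dict of id -> metadata text.

    For every pair scoring at or above the threshold, the pair is recorded
    and the later ID (in sorted order) is treated as the duplicate.
    """
    duplicate_pairs = []
    duplicate_ids = set()
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(records.items()), 2):
        if jaccard(text_a, text_b) >= threshold:
            duplicate_pairs.append((id_a, id_b))
            duplicate_ids.add(id_b)
    clean_ids = [rid for rid in sorted(records) if rid not in duplicate_ids]
    return duplicate_pairs, clean_ids


# Hypothetical metadata records, invented for this example.
records = {
    "a": "chest ct scan patient 17 axial series",
    "b": "chest ct scan patient 17 axial series v2",
    "c": "brain mri patient 42 sagittal",
}
pairs, clean = split_near_duplicates(records)
```

Here records "a" and "b" share seven of eight distinct tokens, so they are reported as a near duplicate pair and kept apart from the clean set, while "c" stays in the warehouse untouched.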

This is useful for medical images due to the large scale of the data, often binary in nature along with textual metadata. Efficiency of MediCurator is ensured through its in-memory data grid-based architecture. MediCurator fits well with the landscape of distributed data integration and federation platforms developed at Emory BMI.

More Details on GSoC 2016
Irene developed the entire code base from scratch as an open source project. MediCurator also has ReadTheDocs*-based documentation, which gives a more detailed description of the project. In addition, you may find a summary of the weekly progress at Irene's blog. MediCurator's scope remained dynamic throughout the project. In addition to near duplicate detection, MediCurator offers download tracking and duplicate detection across datasets, both online and offline. Most of the code was developed with The Cancer Imaging Archive (TCIA) as the core data source and DICOM as the default data format, while maintaining relevant interfaces and APIs for extension to other data sources and data types.

Future Work
The summer was productive. It included both research and implementation. The GSoC time is limited to 4 months (including the community bonding period), and we are reaching a successful end to yet another Google Summer of Code. Nevertheless, we hope to work on a research publication in November with combined results on MediCurator, along with the previous ∂u∂u** and SDN-based Mayan (presented at ICWS 2016) approaches. This will be our first publication with Irene on her findings and implementations, with further evaluations on the clusters at INESC-ID Lisboa. More updates on this later (possibly after publishing the paper ;)).

Concluding Remarks
This is my 4th time in the Google Summer of Code as a mentor, and 3rd time as the primary mentor for a project. Previously I mentored 2 successful students in 2011 and 2012 for AbiWord. I wish every student success as they reach the end of their summer of code.

* I recommend ReadTheDocs. You should give it a try!
** You may find the paper on ∂u∂u interesting, if you are into data quality or distributed near duplicate detection.


  1. Great!
    Wish it a success and very grateful to you, my mentor!

    1. Good luck for your final few days with GSoC and wish you all the best for your studies too, 陈一茹. :)

