Tuesday, August 16, 2016

How not to send a follow up (marketing / sales) email

Today I received yet another follow-up email. Ever since I came to the US, I have been getting many of these. This is probably the worst follow-up email I have received from a big company. The emails I had previously received from other companies were better. In one instance, a company was too eager with its sales emails and kept sending more even after I clearly indicated that I was no longer evaluating them since I was busy. But this one takes everything to a new level.
A creepy and lazy follow-up email hurts more than it helps

The subject line says "follow up from download". First, this is not a proper title: there is no correct capitalization.

The email starts with "Hi {FirstName}". They were too lazy to check the email template. Just as you would spend time testing the product, please spend time testing these email-sending programs. That will help you a lot.
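A check like this is cheap to automate. The following is only a minimal sketch, assuming the mailer uses simple {Name}-style merge fields; the function names are hypothetical and not taken from any particular email tool. It refuses to produce a message if any merge field survives rendering.

```python
import re

# Matches leftover merge fields such as {FirstName} or {{ first_name }}.
PLACEHOLDER = re.compile(r"\{\{?\s*[A-Za-z_][A-Za-z0-9_]*\s*\}?\}")


def unfilled_placeholders(rendered):
    """Return every merge field that is still present in the rendered text."""
    return PLACEHOLDER.findall(rendered)


def render(template, fields):
    """Fill {Name}-style fields (hypothetical helper, for illustration only).

    Raises ValueError instead of returning a body with leftover fields,
    so a broken template never reaches a customer.
    """
    body = template
    for name, value in fields.items():
        body = body.replace("{" + name + "}", value)
    leftover = unfilled_placeholders(body)
    if leftover:
        raise ValueError("unfilled placeholders: %s" % leftover)
    return body
```

Running render("Hi {FirstName},", {}) raises an error before anything is sent, which is exactly the failure that slipped through in the email above.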

"Saw you had a chance to download.."
Comes to the point too quickly. It is somewhat scary, and reads as if someone was monitoring my whole activity.

".. download our Community Edition (M3)"
I swear to God that I do not remember what product he is referring to. It would be better to mention the product name instead of only "Community Edition". M3 can even be omitted.

"A lot of our customers are happy with community support. However, some have come back asking to provide enterprise grade support. We're pleased to announce we now support enterprise grade.."
The fact that "some have come back asking for paid support" does not motivate me to get it. No new information is provided to arouse my curiosity or interest, only filler sentences.

"I can fill you in on details if you're interested."
No thanks. I am not going to waste time dealing with automated telephone calls or more emails. In fact, I am not going to reply with a "No thanks" either. This blog post should be sufficient.

"Do you have time to connect sometime this week?"
Certainly not. Why would I connect with some emailing bot for no reason?

The sender has a very common name, with no results in a Google or LinkedIn search for the full name together with the company. It is probably a made-up name.

I opened this email despite its uninteresting title only because of the company that sent it. Now they have lost my reply. However, I decided to write a blog post to summarize what I learned. If you are going to send a customer a follow-up email, use it as an opportunity. Don't send a random email just because you learned somewhere that sending a follow-up email would earn you a customer. It is better to wait for the right moment and send the right email than to send a random one, like the one I dissected above.

Friday, August 12, 2016

[GSoC 2016] MediCurator : Near Duplicate Detection for Medical Data Warehouse Construction

This summer, at the Department of Biomedical Informatics, Emory University (Emory BMI), we have another set of intelligent students working on interesting projects. I have been mentoring Yiru Chen (Irene) from Peking University, on the project "MediCurator: Near Duplicate Detection for Medical Data Warehouse Construction" for the past couple of months. Currently we have reached the final stages of the project, as the student evaluation period starts on the 15th of August. This post is a summary of this successful GSoC, as well as a history behind the near duplicate detection efforts.

The early history of MediCurator
MediCurator was a research prototype that I initially developed based on my paper ∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data (CoopIS'15) as part of my data quality research, along with my GSoC 2015 work on data integration. The early results were presented as a poster at AMIA 2016 in San Francisco.

MediCurator and Infinispan
Now we have a more complete implementation of MediCurator and a use case for medical data, thanks to the support provided by GSoC. For her implementation, Irene ran some benchmarks before choosing Infinispan's latest distributed streams for the distributed execution. (You may find some interesting discussion on the Infinispan distributed streams here.)

MediCurator Use Case
MediCurator is a data quality platform for the ETL workflows in data warehouse construction. It optimizes bandwidth usage by avoiding duplicate downloads, and optimizes storage by eliminating near duplicates in the warehouse, thus increasing the data quality. When data is downloaded, the source locations are tracked, and when data is updated in the source at a later time, the subsequent download process will download only the new data.
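The download tracking idea can be illustrated with a small sketch. This is not MediCurator's code; it assumes, purely for illustration, that each item at the source exposes some version marker (such as a checksum or a last-modified timestamp), and the function names are hypothetical.

```python
def plan_download(source_index, tracked):
    """Return the keys that are new or changed since the last download.

    source_index maps an item key to its current version marker at the
    source (assumed to be a checksum or timestamp); tracked maps keys to
    the markers recorded during earlier downloads.
    """
    return sorted(
        key for key, marker in source_index.items()
        if tracked.get(key) != marker
    )


def record_download(tracked, source_index, keys):
    """Remember the markers of the items that were just downloaded."""
    for key in keys:
        tracked[key] = source_index[key]
```

On the first run everything is planned for download; on later runs only items whose marker changed, or that did not exist before, are fetched again.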

Similarly, data is deduplicated at the data warehouse, as near duplicates could be present there since data is integrated from multiple data sources. Here the data pairs are evaluated for near duplicates in a distributed manner, with duplicate pairs stored separately, while the clean data stays in the warehouse. The duplicate detection workflow also considers the corrupted data/metadata, and synchronizes/downloads the clean data from the source.
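To make the pairwise evaluation concrete, here is a minimal single-process sketch. It is an assumption-laden illustration, not the MediCurator algorithm: it compares only textual metadata, uses Jaccard similarity over lowercase tokens, and the 0.8 threshold and function names are arbitrary choices for the example. The real system distributes this work rather than looping over all pairs on one machine.

```python
from itertools import combinations


def tokens(text):
    """Lowercase, whitespace-separated tokens of a metadata string."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def split_near_duplicates(records, threshold=0.8):
    """Separate near-duplicate pairs from the clean records.

    records maps a record id to its textual metadata. Returns a tuple
    (clean_ids, duplicate_pairs); the first id of each pair stays in the
    clean set and the second one is set aside.
    """
    duplicate_pairs = []
    dropped = set()
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(records.items()), 2):
        if id_a in dropped or id_b in dropped:
            continue
        if jaccard(tokens(text_a), tokens(text_b)) >= threshold:
            duplicate_pairs.append((id_a, id_b))
            dropped.add(id_b)
    clean_ids = [rid for rid in sorted(records) if rid not in dropped]
    return clean_ids, duplicate_pairs
```

The all-pairs loop is quadratic in the number of records, which is the main reason a distributed execution becomes attractive at warehouse scale.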

This is useful for medical images due to the large scale of the data, often binary in nature along with textual metadata. Efficiency of MediCurator is ensured through its in-memory data grid-based architecture. MediCurator fits well with the landscape of distributed data integration and federation platforms developed at Emory BMI.

More Details on GSoC 2016
Irene developed the entire code base from scratch as an open source project. MediCurator also has ReadTheDocs*-based documentation, which describes the project in more detail. In addition, you can find summaries of the weekly progress on Irene's blog. MediCurator's scope remained dynamic throughout the project. In addition to near duplicate detection, MediCurator tracks downloads and detects duplicates across datasets, both online and offline. Most of the code was developed with The Cancer Imaging Archive (TCIA) as the core data source and DICOM as the default data format, while maintaining the relevant interfaces and APIs for extension to other data sources and data types.

Future Work
The summer was productive, covering both research and implementation. GSoC is limited to 4 months (including the community bonding period), and we are reaching a successful end to yet another Google Summer of Code. Nevertheless, we hope to work on a research publication with combined results on MediCurator, along with the previous ∂u∂u** and SDN-based Mayan (presented at ICWS 2016) approaches, in November. This will be our first publication with Irene on her findings and implementations, with further evaluations on the clusters at INESC-ID Lisboa. More updates on this later (possibly after publishing the paper ;)).

Concluding Remarks
This is my 4th time in the Google Summer of Code as a mentor, and 3rd time as the primary mentor for a project. Previously I mentored 2 successful students in 2011 and 2012 for AbiWord. I wish every student success as they reach the end of their summer of code.

* I recommend ReadTheDocs. You should give it a try!
** You may find the paper on ∂u∂u interesting, if you are into data quality or distributed near duplicate detection.

Friday, August 5, 2016

The new laptop and the elephant speaker..

The new Asus laptop and the elephant speaker
Being a student volunteer at a conference was a remarkable experience. I was able to get deeply involved in organizing the conference on-site, and experience how things are done. I got first-hand experience and a different perspective, as I had never been on the other side of the table. I was always a participant. This 2nd of July, I also won a laptop in a lucky draw. It was interesting, as I had never won a lucky draw before. :) Also, all of the student volunteers received a beautiful elephant speaker.

It currently runs Windows 10. I should probably install Ubuntu as a dual boot, or rather wipe Windows and install only Ubuntu. But I am too lazy to mess with the computer yet, so I am mostly letting it rest. I now have 3 laptops: the one that I am using, an HP with a dual boot of Ubuntu 16.04 and Windows 10; this new Asus with Windows 10; and my old one with Ubuntu 13.10, if I remember correctly. In those days I upgraded the OS with each Ubuntu release. Now I prefer stability and upgrade only to LTS releases.

Configuring API Umbrella with LDAP - with containers

This is a live post on an on-going effort, configuring API Umbrella with LDAP, both in Docker containers.

Make the configuration file:
$ mkdir config && touch config/api-umbrella.yml

web:
  admin:
    initial_superusers:
      - pkathi2@emory.edu
    auth_strategies:
      enabled:
        - github
        - google
        - persona
        - ldap
      github:
        client_id: xxxxxxxxxxxxx
        client_secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      google:
        client_id: yyyyyy-yyyy.apps.googleusercontent.com
        client_secret: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
      ldap:
        options:
          host: lion.bmi.emory.edu
          port: 389
          base: dc=example, dc=org

Follow the wiki on configuring GitHub and Google authentication.

Run the OpenLDAP Docker container:
sudo docker run --hostname lion.bmi.emory.edu  -p 389:389 -p 636:636 --name my-openldap-container --detach osixia/openldap:1.1.5

Run the API Umbrella Docker container:
sudo docker run -d --name=api-umbrella -p 80:80 -p 443:443  --link my-openldap-container:ldap -v $PWD/config:/etc/api-umbrella \
-v /var/log/api-umbrella/nginx:/var/log/api-umbrella/nginx \
           -v /var/log/api-umbrella/web-puma/current:/var/log/api-umbrella/web-puma/current \
           -v /var/log/api-umbrella/trafficserver/access.blog:/var/log/api-umbrella/trafficserver/access.blog \

Unfortunately, when ldap is enabled in the configuration file, https://lion.bmi.emory.edu/admin/ throws a gateway timeout. This might have something to do with the OmniAuth LDAP configuration. Documentation is still an issue with API Umbrella, just as with the other API gateways that I tried recently.

This issue has also been reported to the API Umbrella team.
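While debugging a timeout like this, one basic thing worth ruling out is whether the LDAP port is reachable at all from where the gateway runs. The sketch below is a generic TCP check and says nothing about OmniAuth itself; the helper name is hypothetical, and the host and port in the usage line are simply the values from the configuration above.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # Values taken from the configuration in this post.
    print(port_open("lion.bmi.emory.edu", 389))
```

A False result would point to networking or container linking rather than to the authentication settings; a True result only shows that the port accepts connections, not that the LDAP bind works.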

Upgrade to Xenial Xerus, Virtualbox, and more..

Upgrading Ubuntu is a challenge each time. When I bought this laptop in February 2014, I installed Ubuntu 13.10, and later upgraded it to the next LTS version (14.04.1) in August. This is now my second upgrade on this laptop, again to the next LTS version (16.04.1, Xenial Xerus). While this is all exciting, every upgrade breaks some previously working and stable software. This time, VirtualBox is broken. I am sure it is not the only one, though.

After the upgrade, I got this error when I tried to open an existing VM or create a new VM in VirtualBox:

Kernel driver not installed (rc=-1908)

The VirtualBox Linux kernel driver (vboxdrv) is either not loaded or there is a permission problem with /dev/vboxdrv. Please reinstall the kernel module by executing

'/etc/init.d/vboxdrv setup'

as root. If it is available in your distribution, you should install the DKMS package first. This package keeps track of Linux kernel changes and recompiles the vboxdrv kernel module if necessary.

When I tried to do as suggested:

root@llovizna:/home/pradeeban# /etc/init.d/vboxdrv setup
Stopping VirtualBox kernel modules ...done.
Uninstalling old VirtualBox DKMS kernel modules ...done.
Trying to register the VirtualBox kernel modules using DKMSERROR: Cannot create report: [Errno 17] File exists: '/var/crash/virtualbox-4.3.0.crash'
Error! Bad return status for module build on kernel: 4.4.0-31-generic (x86_64)
Consult /var/lib/dkms/vboxhost/4.3.28/build/make.log for more information.
  (Failed, trying without DKMS)
Recompiling VirtualBox kernel modules ...failed!
  (Look at /var/log/vbox-install.log to find out what went wrong)

Finally, I had to uninstall and reinstall everything:
sudo apt-get remove virtualbox-\*

Luckily, all the VMs and virtual hard disks were safe and ready to use (thanks to virtualization!).