Thursday, August 18, 2016

SDN helps other Vs in BigData

I will be working on a book chapter, "SDN helps other Vs in BigData". You may find more details at the book's web page, Big Data and Software Defined Networks, and the Table of Contents.

It is going to be an interesting book, as I am interested in both SDN and BigData, and this topic covers both.

Monday, August 15, 2016

How not to send a follow-up (marketing/sales) email

Today I received another follow-up email. Ever since I came to the US, I have been receiving many of these. This is probably the worst follow-up email I have received from a big company. I had previously received emails from some other companies, and they were better in nature. In one instance, a company was so eager that it kept sending sales emails even after I clearly indicated that I was no longer evaluating them since I was busy. But this one takes everything to a new level.
A creepy and lazy follow-up email hurts more than it helps

So the title says "follow up from download". First, it is not a proper title - not even correctly capitalized.

The email starts with "Hi {FirstName}". They were too lazy to check the email formatter. Just as you would spend time testing your product, please spend time testing these email sender programs. That will help you a lot.

"Saw you had a chance to download.."
It comes to the point too quickly. It is somewhat scary, and reads as if someone had been monitoring/watching my entire activity.

".. download our Community Edition (M3)"
I swear I do not remember what product he is referring to. It would be better to mention the product name instead of just "Community Edition". The "M3" can even be omitted.

"A lot of our customers are happy with community support. However, some have come back asking to provide enterprise grade support. We're pleased to announce we now support enterprise grade.."
The fact that "some have come back asking for paid support" does not motivate me to get it. There is no new information provided to arouse my curiosity or interest, except some filler sentence.

"I can fill you in on details if you're interested."
No thanks. I am not going to waste time dealing with automated telephone calls or more emails. In fact, I am not even going to reply with a "No thanks". This blog post should be sufficient.

"Do you have time to connect sometime this week?"
Certainly not. Why would I connect with an emailing bot for no reason?

"Best,
David"
Too common a name, with no results in a Google or LinkedIn search for the full name together with the company. Probably a made-up name.
The not-so-improved further follow-up emails.

I did open this email (and the further follow-up emails from the same salesperson), despite its uninteresting title, only because of the company that sent it. Now they have lost my reply. However, I decided to write a blog post to summarize what I learned. If you are going to send a customer a follow-up email, use it as an opportunity. Don't send a random email just because you learned somewhere that sending a follow-up email would earn you a customer. Better to wait for the right moment and send the right mail than a random one, like the one I dissected above.

Later Update
As you can see above, I received further follow-up emails. In the second email, he just says,

"I wanted to follow up from my previous email. 
 
Are you available for a quick call this week?
 
This does not add any value to the previous email, and vague at its best. 


The third email goes on to say, 
 
"If anything, I’d love to provide you with resources to help educate you and your team on what’s new in the space. "
 
It is fine that you would like to "educate" your customers. But do not say so explicitly. First, I am an individual - rather, a poor student; I am not a team. Moreover, I do not think I need a sales team to "educate" me on this domain. I am pretty good. ;)


Finally, the fourth (and potentially last) email goes,
 
"Since I haven’t heard back from you, I thought I might be heading down the wrong path.

Is there a better time frame or a different person I should be reaching out to discuss Big Data/Hadoop with?

If so, I’d appreciate you pointing me in the right direction."

So the sudden realization comes that the "potential lead" is possibly uninterested. Only from this last email do I learn that, all this time, he just wanted to discuss Big Data and Hadoop with me. It might be a good idea to mention that in the first email itself, rather than being so vague. I mean, come on - I am not going to let you spam another person by forwarding you to them. :)

So next time, kids, don't just keep emailing. I know, you must have learned that "persistence is the key", or that "potential clients reply 85% of the time to the 5th email", in your sales and marketing training. But don't just send meaningless emails. Put in some real effort, and send when you are confident. Hope that helps. Let me know if you need any help writing an email, of course. ;)

Friday, August 12, 2016

[GSoC 2016] MediCurator : Near Duplicate Detection for Medical Data Warehouse Construction

This summer, at the Department of Biomedical Informatics, Emory University (Emory BMI), we have another set of intelligent students working on interesting projects. I have been mentoring Yiru Chen (Irene) from Peking University on the project "MediCurator: Near Duplicate Detection for Medical Data Warehouse Construction" for the past couple of months. We have now reached the final stages of the project, as the student evaluation period starts on the 15th of August. This post is a summary of this successful GSoC, as well as the history behind the near duplicate detection efforts.

The early history of MediCurator
MediCurator was a research prototype that I initially developed based on my paper "∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data" (CoopIS'15), as part of my data quality research, along with my GSoC 2015 work on data integration. The early results were presented as a poster at AMIA 2016 in San Francisco.


MediCurator and Infinispan
Now we have a more complete implementation of MediCurator and a use case for medical data, thanks to the support provided by GSoC. For her implementation, Irene ran some benchmarks before choosing Infinispan's latest distributed streams for the distributed execution. (You may find some interesting discussion on Infinispan distributed streams here.)

MediCurator Usecase
MediCurator is a data quality platform for ETL workflows in data warehouse construction. It optimizes bandwidth usage by avoiding duplicate downloads, and optimizes storage by eliminating near duplicates in the warehouse, thus increasing data quality. When data is downloaded, the source locations are tracked; when data is updated at the source at a later time, the subsequent download process downloads only the new data.

Similarly, data is deduplicated at the data warehouse, as near duplicates can be present there when data is integrated from multiple data sources. The data pairs are evaluated for near duplicates in a distributed manner, with duplicate pairs stored separately while the clean data stays in the warehouse. The duplicate detection workflow also considers corrupted data/metadata, and synchronizes/downloads the clean data from the source.

This is useful for medical images due to the large scale of the data, which is often binary in nature with accompanying textual metadata. The efficiency of MediCurator is ensured through its in-memory data grid-based architecture. MediCurator fits well into the landscape of distributed data integration and federation platforms developed at Emory BMI.

More Details on GSoC 2016
Irene developed the entire code base from scratch as an open source project. MediCurator also has ReadTheDocs-based* documentation, which gives a more detailed description of the project. In addition, you may read the weekly progress summaries on Irene's blog. MediCurator's scope remained dynamic throughout the project. In addition to near duplicate detection, MediCurator supports download tracking and detecting duplicates across datasets, both online and offline. Most of the code was developed with the cancer imaging archive (TCIA) as the core data source and DICOM as the default data format, while maintaining relevant interfaces and APIs for extension to other data sources and data types.

Future Work
The summer was productive; it included both research and implementation. The GSoC time is limited to 4 months (including the community bonding period), and we are reaching a successful end to yet another Google Summer of Code. Nevertheless, we hope to work on a research publication in November, with combined results on MediCurator along with the previous ∂u∂u** and SDN-based Mayan (presented at ICWS 2016) approaches. This will be our first publication with Irene on her findings and implementations, with further evaluations on the clusters at INESC-ID Lisboa. More updates on this later (possibly after publishing the paper ;)).

Concluding Remarks
This is my 4th time in the Google Summer of Code as a mentor, and 3rd time as the primary mentor for a project. Previously I mentored 2 successful students in 2011 and 2012 for AbiWord. I wish every student success as they reach the end of their summer of code.

* I recommend ReadTheDocs. You should give it a try!
** You may find the paper on ∂u∂u interesting, if you are into data quality or distributed near duplicate detection.

Friday, August 5, 2016

The new laptop and the elephant speaker..

The new Asus laptop and the elephant speaker
Being a student volunteer at a conference was a remarkable experience. I was able to get deeply involved in organizing the conference event on-site, and experience how things are done. It gave me first-hand experience and a different perspective, as I had never been on the other side of the table before - I was always a participant. This 2nd of July, I also won a laptop in a lucky draw. It was interesting, as I have never won a lucky draw before. :) All of the student volunteers also received a beautiful elephant speaker.

Now it has Windows 10. I should probably install Ubuntu in a dual boot, or rather wipe Windows and install Ubuntu alone. But I am too lazy to mess with the computer yet, so I am mostly letting it rest. Currently I have 3 laptops: the one that I am using - an HP with a dual boot of Ubuntu 16.04 and Windows 10, this new Asus with Windows 10, and my old one with Ubuntu 13.10 - if I remember correctly. In those days I upgraded the OS for each Ubuntu release. Now I like the stability, and upgrade only for the LTS releases.

Configuring API Umbrella with LDAP - with containers

This is a live post on an ongoing effort: configuring API Umbrella with LDAP, both in Docker containers.


Make the configuration file:
$ mkdir config && touch config/api-umbrella.yml

web:
  admin:
    initial_superusers:
      - pkathi2@emory.edu
    auth_strategies:
      enabled:
        - github
        - google
        - persona
        - ldap
      github:
        client_id: xxxxxxxxxxxxx
        client_secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      google:
        client_id: yyyyyy-yyyy.apps.googleusercontent.com
        client_secret: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
      ldap:
        options:
          host: lion.bmi.emory.edu
          port: 389
          base: dc=example, dc=org


Follow the wiki on configuring GitHub and Google authentication:
http://api-umbrella.readthedocs.io/en/latest/getting-started.html

Run the OpenLDAP Docker container:
sudo docker run --hostname lion.bmi.emory.edu  -p 389:389 -p 636:636 --name my-openldap-container --detach osixia/openldap:1.1.5
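
To verify that the LDAP server is up, run a quick search against it. This is a minimal sketch assuming the osixia/openldap image defaults (admin DN cn=admin,dc=example,dc=org with password admin); adjust to your actual credentials:

# Query the base DN as the admin user (requires the ldap-utils package)
ldapsearch -x -H ldap://localhost:389 \
           -D "cn=admin,dc=example,dc=org" -w admin \
           -b "dc=example,dc=org"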


Run the API Umbrella Docker container:
sudo docker run -d --name=api-umbrella \
           -p 80:80 -p 443:443 \
           --link my-openldap-container:ldap \
           -v $PWD/config:/etc/api-umbrella \
           -v /var/log/api-umbrella/nginx:/var/log/api-umbrella/nginx \
           -v /var/log/api-umbrella/web-puma/current:/var/log/api-umbrella/web-puma/current \
           -v /var/log/api-umbrella/trafficserver/access.blog:/var/log/api-umbrella/trafficserver/access.blog \
           nrel/api-umbrella




Unfortunately, when ldap is enabled in the configuration file, https://lion.bmi.emory.edu/admin/ throws a gateway timeout. This might have something to do with the OmniAuth LDAP configuration. Documentation is still an issue with API Umbrella, just like the other API gateways that I have tried recently.
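
To narrow this down, check the container logs; the web-puma log mounted above is where an OmniAuth/LDAP stack trace would surface (paths as bind-mounted in the run command):

# Overall container output
sudo docker logs api-umbrella

# Web application (Puma) log
sudo tail -n 100 /var/log/api-umbrella/web-puma/current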

This issue has also been reported to the API Umbrella team.

Thursday, August 4, 2016

Upgrade to Xenial Xerus, Virtualbox, and more..

Upgrading Ubuntu is like taking on a challenge each time. When I bought this laptop in February 2014, I installed Ubuntu 13.10, and later upgraded it to the next LTS version in August (14.04.1). Now this is my second upgrade on this laptop, again to the next LTS version (16.04.1 - Xenial Xerus). While this is all exciting, every upgrade comes with some breakage of previously working and stable software. This time, VirtualBox broke. I am sure it is not the only one, though.

So I got this error when I tried to open an existing VM or create a new one in VirtualBox following my upgrade.

Kernel driver not installed (rc=-1908)

The VirtualBox Linux kernel driver (vboxdrv) is either not loaded or there is a permission problem with /dev/vboxdrv. Please reinstall the kernel module by executing

'/etc/init.d/vboxdrv setup'

as root. If it is available in your distribution, you should install the DKMS package first. This package keeps track of Linux kernel changes and recompiles the vboxdrv kernel module if necessary.




When I tried to do as suggested:


root@llovizna:/home/pradeeban# /etc/init.d/vboxdrv setup
Stopping VirtualBox kernel modules ...done.
Uninstalling old VirtualBox DKMS kernel modules ...done.
Trying to register the VirtualBox kernel modules using DKMSERROR: Cannot create report: [Errno 17] File exists: '/var/crash/virtualbox-4.3.0.crash'
Error! Bad return status for module build on kernel: 4.4.0-31-generic (x86_64)
Consult /var/lib/dkms/vboxhost/4.3.28/build/make.log for more information.
 ...failed!
  (Failed, trying without DKMS)
Recompiling VirtualBox kernel modules ...failed!
  (Look at /var/log/vbox-install.log to find out what went wrong)


Finally, I had to uninstall and reinstall everything:
 
sudo apt-get remove virtualbox-\*
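
followed by a reinstall from the Ubuntu repositories. A minimal sketch (virtualbox-dkms pulls in the DKMS support mentioned in the error message):

sudo apt-get update
sudo apt-get install virtualbox virtualbox-dkms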
 

Luckily, all the VMs and virtual hard disks were safe and ready to use (thanks to virtualization!).

Wednesday, August 3, 2016

Bad guy Windows and good guy Ubuntu

A few moments ago I upgraded my Ubuntu 14.04 LTS to Ubuntu 16.04.1 LTS (Xenial Xerus). I waited for the first point release to arrive at the end of July, hence the delay. Everything was smooth, and my dual boot runs perfectly well with Windows and Ubuntu, just as before. On the other hand, every time Windows does even a minor upgrade (not just the Windows 8 to Windows 10 upgrade), it breaks my GRUB and makes Windows the default and only option, leaving Ubuntu unreachable. Luckily I have a Lubuntu-based tool, named boot-repair-disk, that can fix these boot issues. The current version of the tool does not work with Windows 10; however, I have a previous version with me that works just fine.
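
For the record, the command-line route for such an LTS-to-LTS upgrade is simple (a sketch assuming Prompt=lts in /etc/update-manager/release-upgrades, the default on an LTS install):

# Bring the current release fully up to date first
sudo apt-get update && sudo apt-get dist-upgrade

# Then upgrade to the next LTS release
sudo do-release-upgrade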

I am not entirely sure why each Windows upgrade breaks GRUB. Is it intentional, or is it really that hard for the engineers to fix this minor bug (or may I say, feature)? It is so annoying and inconvenient, and it makes Windows look like a big bully.

So now I have the latest LTS version, and I will upgrade again in August 2018, to 18.04.1 LTS. I made some interesting observations during the upgrade.

Configuring encfs

"Encfs security information

According to a security audit by Taylor Hornby (Defuse Security), the current implementation of Encfs is vulnerable or potentially vulnerable to multiple types of attacks. For example, an attacker with read/write access to encrypted data might lower the decryption complexity for subsequently encrypted data without this being noticed by a legitimate user, or might use timing analysis to deduce information.

Until these issues are resolved, encfs should not be considered a safe home for sensitive data in scenarios where such attacks are possible."


So, EncFS is not secure after all.


The other prompt that caught my attention was the one below:


Configuring davfs2

"The file /sbin/mount.davfs must have the SUID bit set if you want to allow unprivileged (non-root) users to mount WebDAV resources.

If you do not choose this option, only root will be allowed to mount WebDAV resources. This can later be changed by running 'dpkg-reconfigure davfs2'.

Should unprivileged users be allowed to mount WebDAV resources?"

As of now, everything seems to work just fine after the upgrade. I will post an update if I later find that something broke due to the upgrade.

Installing DCM4CHEE using Docker

DCM4CHEE Admin Page.
The DCM4CHEE installation through Docker was the smoothest installation/configuration I have done recently. Despite its complicated setup environment with WildFly, OpenLDAP, and PostgreSQL, everything was simple and to the point, thanks to Docker and the compact documentation provided by DCM4CHEE.
This blog post is based on the wiki page, which is itself well written.

Add the docker host to the /etc/hosts file:
127.0.0.1 dockerhost
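
One way to append the entry:

echo "127.0.0.1 dockerhost" | sudo tee -a /etc/hosts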

Make sure that Docker is started:
sudo service docker start


Run OpenLDAP Server

sudo docker run --name slapd \
           -p 389:389 \
           -e LDAP_BASE_DN=dc=dcm4che,dc=org \
           -e LDAP_ORGANISATION=dcm4che.org \
           -e LDAP_ROOTPASS=secret \
           -e LDAP_CONFIGPASS=secret \
           -e DEVICE_NAME=dcm4chee-arc \
           -e AE_TITLE=DCM4CHEE \
           -e DICOM_HOST=dockerhost \
           -e DICOM_PORT=11112 \
           -e HL7_PORT=2575 \
           -e SYSLOG_HOST=logstash \
           -e SYSLOG_PORT=8512 \
           -e SYSLOG_PROTOCOL=UDP \
           -e STORAGE_DIR=/storage/fs1 \
           -v /var/local/dcm4chee-arc/ldap:/var/lib/ldap \
           -v /var/local/dcm4chee-arc/slapd.d:/etc/ldap/slapd.d \
           -d dcm4che/slapd-dcm4chee:5.5.2
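
To confirm that the LDAP server came up with the expected base DN, run a quick search (a sketch assuming the usual cn=admin,<base> root DN convention, with the LDAP_ROOTPASS set above; requires ldap-utils):

ldapsearch -x -H ldap://localhost:389 \
           -D "cn=admin,dc=dcm4che,dc=org" -w secret \
           -b "dc=dcm4che,dc=org"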




Run PostgreSQL Server

sudo docker run --name postgres \
           -p 5432:5432 \
           -e POSTGRES_DB=pacsdb \
           -e POSTGRES_USER=pacs \
           -e POSTGRES_PASSWORD=pacs \
           -v /var/local/dcm4chee-arc/db:/var/lib/postgresql/data \
           -d dcm4che/postgres-dcm4chee:5.2
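
A quick connectivity check against the database (a sketch; requires the postgresql-client package, and the password is pacs as set above):

psql -h localhost -p 5432 -U pacs -d pacsdb -c '\conninfo'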



Run DCM4CHEE Archive 5

We choose the version with the secured UI and secured RESTful services (tag name: 5.5.2-secure).

sudo docker run --name dcm4chee-arc \
           -p 8080:8080 \
           -p 9990:9990 \
           -p 11112:11112 \
           -p 2575:2575 \
           -e LDAP_BASE_DN=dc=dcm4che,dc=org \
           -e LDAP_ROOTPASS=secret \
           -e LDAP_CONFIGPASS=secret \
           -e DEVICE_NAME=dcm4chee-arc \
           -e POSTGRES_DB=pacsdb \
           -e POSTGRES_USER=pacs \
           -e POSTGRES_PASSWORD=pacs \
           -e JAVA_OPTS="-Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.byteman -Djava.awt.headless=true" \
           -e WILDFLY_CHOWN="/opt/wildfly/standalone /storage" \
           -v /var/local/dcm4chee-arc/wildfly:/opt/wildfly/standalone \
           -v /var/local/dcm4chee-arc/storage:/storage \
           --link slapd:ldap \
           --link postgres:db \
           -d dcm4che/dcm4chee-arc-psql:5.5.2-secure
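
WildFly can take a minute or two to deploy the archive. A simple probe of the UI (with the -secure image, expect a redirect to the Keycloak login rather than a 200):

curl -I http://localhost:8080/dcm4chee-arc/ui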


Make sure that there are no port conflicts (checking the ports published by the containers above):
sudo netstat -anp | grep -E '389|2575|5432|8080|9990|11112'

Due to failed attempts, there may be naming conflicts when starting a Docker container.
docker: Error response from daemon: Conflict. The name "/dcm4chee-arc" is already in use by container 537bd21a41bb01680ea598ad35a33a1cc07d1d222dc75605c64398c7a43fb73c. You have to remove (or rename) that container to be able to reuse that name..

Find and remove the container and re-attempt if this happens.
sudo docker ps -a
537bd21a41bb        dcm4che/dcm4chee-arc-psql:5.5.2-secure   "/docker-entrypoint.s"   3 minutes ago       Created                                              dcm4chee-arc

sudo docker rm 537bd21a41bb

or just,
sudo docker rm slapd
sudo docker rm postgres
sudo docker rm dcm4chee-arc

Check the status of the Docker containers after everything has started
 sudo docker ps -a
CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS                     PORTS                                                                                              NAMES
6cbfcb336f65        dcm4che/dcm4chee-arc-psql:5.5.2-secure   "/docker-entrypoint.s"   6 seconds ago       Up 3 seconds               0.0.0.0:2575->2575/tcp, 0.0.0.0:8080->8080/tcp, 0.0.0.0:9990->9990/tcp, 0.0.0.0:11112->11112/tcp   dcm4chee-arc
a5cf7e96ba0e        dcm4che/postgres-dcm4chee:5.2            "/docker-entrypoint.s"   8 minutes ago       Up 8 minutes               0.0.0.0:5432->5432/tcp                                                                             postgres
43d3eb7e1237        dcm4che/slapd-dcm4chee:5.5.2             "/docker-entrypoint.s"   25 minutes ago      Up 25 minutes              0.0.0.0:389->389/tcp                                                                               slapd




Web Service URLs

    Archive UI: http://localhost:8080/dcm4chee-arc/ui - if secured, login with:
        Username: user, Password: user, Role: user
        Username: admin, Password: admin, Roles: user + admin
    Keycloak Administration Console: http://localhost:8080/auth, login with Username: admin, Password: admin.
    Wildfly Administration Console: http://localhost:9990, login with Username: admin, Password: admin.
    Kibana UI: http://localhost:5601
    DICOM QIDO-RS Base URL: http://localhost:8080/dcm4chee-arc/aets/DCM4CHEE/rs
    DICOM STOW-RS Base URL: http://localhost:8080/dcm4chee-arc/aets/DCM4CHEE/rs
    DICOM WADO-RS Base URL: http://localhost:8080/dcm4chee-arc/aets/DCM4CHEE/rs
    DICOM WADO-URI: http://localhost:8080/dcm4chee-arc/aets/DCM4CHEE/wado
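
As an example of the RESTful services, the studies in the archive can be queried over QIDO-RS using the base URL listed above (with the -secure image, a bearer token from Keycloak would additionally be needed):

# List studies as JSON via QIDO-RS
curl -H "Accept: application/json" \
     http://localhost:8080/dcm4chee-arc/aets/DCM4CHEE/rs/studies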



Check the logs of the containers
sudo docker logs 6cbfcb336f65


Restart a container
sudo docker restart 6cbfcb336f65


Entering a containerized instance for debugging, etc.

sudo docker exec -i -t dcm4chee-arc /bin/bash

*********************

Optionally, to store the log and audit messages in Elasticsearch, run these additional containers.

Run Elasticsearch

sudo docker run --name elasticsearch \
 -p 9200:9200 \
 -p 9300:9300 \
 -v /var/local/dcm4chee-arc/elasticsearch:/usr/share/elasticsearch/data \
 -d elasticsearch:2.2
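
Elasticsearch can be verified with a plain HTTP request, which should return the cluster name and version as JSON:

curl http://localhost:9200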

 

Run Logstash

sudo docker run --name logstash \
 -p 12201:12201/udp \
 -p 8514:8514/udp \
 -p 8514:8514 \
 -v /var/local/dcm4chee-arc/elasticsearch:/usr/share/elasticsearch/data \
 --link elasticsearch:elasticsearch \
 -d dcm4che/logstash-dcm4chee
(The official documentation points to version 5.5.2, which does not exist.)

 

 Run Kibana

sudo docker run --name kibana \
 -p 5601:5601 \
 --link elasticsearch:elasticsearch \
  -d kibana:4.4

Note that you now need to run DCM4CHEE linked to the optional containers above to be able to retrieve the logs (remove the earlier dcm4chee-arc container first, as shown above):

sudo docker run --name dcm4chee-arc \
           -p 8080:8080 \
           -p 9990:9990 \
           -p 11112:11112 \
           -p 2575:2575 \
           -e LDAP_BASE_DN=dc=dcm4che,dc=org \
           -e LDAP_ROOTPASS=secret \
           -e LDAP_CONFIGPASS=secret \
           -e DEVICE_NAME=dcm4chee-arc \
           -e POSTGRES_DB=pacsdb \
           -e POSTGRES_USER=pacs \
           -e POSTGRES_PASSWORD=pacs \
           -e JAVA_OPTS="-Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.byteman -Djava.awt.headless=true" \
           -e WILDFLY_CHOWN="/opt/wildfly/standalone /storage" \
           -v /var/local/dcm4chee-arc/wildfly:/opt/wildfly/standalone \
           -v /var/local/dcm4chee-arc/storage:/storage \
           --link slapd:ldap \
           --link postgres:db \
           --link logstash:logstash \
           -d dcm4che/dcm4chee-arc-psql:5.5.2-logstash-secure