Friday, May 30, 2014

~~ Where did the Sun go?

The legendary blueberry sauce. ;)
This summer seems colder than last year's. Perhaps spring hasn't ended yet. The year is going by pretty fast. These are the final moments of EMDC, and I am working on the thesis.

I am also eagerly waiting to start my PhD - Erasmus Mundus Joint Doctorate in Distributed Computing (EMJD-DC), with IST Lisboa and UPC Barcelona. My track will be IST - UPC - IST over the 3 to 4 years of the doctorate. EMJD-DC is a continuation of EMDC (European Master in Distributed Computing), the master's program that I am following. Our research group is Grupo de Sistemas Distribuídos, INESC-ID (commonly referred to as GSD or the Distributed Systems Group).

Early this year, I did not blog much, and I even started to wonder whether my blog was going to die shortly. But thanks to my projects, I am back in business lately. :D I hope to blog more on some random stuff too, once in a while, similar to my recent rant on LinkedIn. ;)

I do not have much time to cook these days. Still, I made this blueberry sauce for some boiled leaves. The ingredients of the sauce were blueberries, rice wine, garlic, soy sauce, salt, pepper powder, lemon juice, and cheese with herbs. It tasted good. :D

Thursday, May 29, 2014

Fault-tolerant data replication and synchronization with Infinispan

Figure 1. Deployment
Fault-tolerance
Having multiple instances running on different nodes provides fault-tolerance: when one node terminates, the other nodes hold backup replicas of the partitions that were stored on the terminated node. Figure 1 shows the high-level deployment view of the solution.
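
To illustrate the idea of backup replicas, here is a minimal plain-Java sketch. It is only a simulation under my own assumptions: each "node" is an in-memory map, every key is written to a primary node and to the next node as its backup, and terminate() drops a node. The class and method names are hypothetical, and this is not how Infinispan actually assigns partitions to owners.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulation only: each node is a local map, and a terminated node is null.
class BackupReplicaSketch {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    BackupReplicaSketch(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
    }

    // Assumed placement rule: primary by hash, backup on the next node.
    private int[] ownersOf(String key) {
        int primary = Math.floorMod(key.hashCode(), nodes.size());
        int backup = (primary + 1) % nodes.size();
        return new int[] {primary, backup};
    }

    // Writes the entry to every owner that is still alive.
    void put(String key, String value) {
        for (int owner : ownersOf(key)) {
            Map<String, String> node = nodes.get(owner);
            if (node != null) {
                node.put(key, value);
            }
        }
    }

    // Reads from the first live owner that holds the key, or returns null.
    String get(String key) {
        for (int owner : ownersOf(key)) {
            Map<String, String> node = nodes.get(owner);
            if (node != null && node.containsKey(key)) {
                return node.get(key);
            }
        }
        return null;
    }

    // Simulates a node terminating and losing its local data.
    void terminate(int index) {
        nodes.set(index, null);
    }
}
```

With at least two nodes, terminating any single node still leaves one live owner for every key, which is the property the deployment relies on.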

Design
Two distributed cache instances exist in InfDataAccessIntegration.
    protected static Cache userReplicasMap;
    protected static Cache replicaSetsMap;
userReplicasMap is a mapping of userId -> array of replicaSetIDs. The userId could be the logged-in user name (for now, I am testing with random strings).
replicaSetsMap is a mapping of replicaSetID -> replicaSet
Figure 2. Core class hierarchy

Though this could be replaced with a single cache instance mapping userID -> replicaSets, I decided to go with the two-cache design, as having two cache instances will be more efficient when searching, handling duplicates, and pushing changes.
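
A rough sketch of the two-map design in plain Java follows. ConcurrentHashMap stands in for the Infinispan Cache so that the example runs on its own, a replicaSet is modelled as a list of query strings, and the class and method names (ReplicaStoreSketch, putReplicaSet, share, getReplicaSets) are hypothetical, not the ones in InfDataAccessIntegration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the two caches; not the actual InfDataAccessIntegration code.
class ReplicaStoreSketch {
    // userId -> replicaSetIDs
    private final Map<String, List<Long>> userReplicasMap = new ConcurrentHashMap<>();
    // replicaSetID -> replicaSet (modelled here as a list of query strings)
    private final Map<Long, List<String>> replicaSetsMap = new ConcurrentHashMap<>();
    private long nextId = 0;

    // Stores a new replicaSet, links it to the user, and returns its ID.
    synchronized long putReplicaSet(String userId, List<String> replicaSet) {
        long id = nextId++;
        replicaSetsMap.put(id, new ArrayList<>(replicaSet));
        userReplicasMap.computeIfAbsent(userId, k -> new ArrayList<>()).add(id);
        return id;
    }

    // Links an existing replicaSet to another user without copying it.
    synchronized void share(String userId, long replicaSetId) {
        if (!replicaSetsMap.containsKey(replicaSetId)) {
            throw new IllegalArgumentException("Unknown replicaSetID: " + replicaSetId);
        }
        List<Long> ids = userReplicasMap.computeIfAbsent(userId, k -> new ArrayList<>());
        if (!ids.contains(replicaSetId)) {
            ids.add(replicaSetId);
        }
    }

    // Resolves the user's IDs to the stored replicaSets.
    synchronized List<List<String>> getReplicaSets(String userId) {
        List<List<String>> result = new ArrayList<>();
        for (Long id : userReplicasMap.getOrDefault(userId, Collections.emptyList())) {
            result.add(replicaSetsMap.get(id));
        }
        return result;
    }
}
```

Because users only hold IDs, a replicaSet that several users refer to is stored once, which is one way the indirection can help with duplicates. Note that with a real distributed cache, a modified list would have to be put back into the cache for the change to propagate.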

InfDataAccessIntegration provides the API for the publisher/consumer. TCIAInvoker (which extends InterfaceManager, an abstract class I created) implements the TCIA integration to invoke these methods. Figure 2 shows the core class hierarchy of the system.
 
Figure 3. Execution Flow
Execution Flow
The execution flow is depicted by Figure 3.
* User logs in -> logIn() checks the Infinispan distributed cache for replicaSets that the user has already stored. If there are any, it executes them all again. This will change later, as we do not have to execute them all; we only need to execute the diffs.

* The user performs new searches for the images, series, collections, and other metadata. Each new search creates a replicaSet and writes it to the distributed cache before returning the results.

The replicaSet for an image takes the form,
TCIAConstants.IMAGE_TAG + "getImage?SeriesInstanceUID=" + seriesInstanceUID

For other information (metadata), such as collections, series, etc.,
TCIAConstants.META_TAG + query;
Here, query is something like,
"getSeries?format=" + format +
                "&Collection=" + collection +
                "&PatientID=" + patientID +
                "&StudyInstanceUID=" + studyInstanceUID +
                "&Modality=" + modality;
When a new instance starts and invokes the log-in action for the same user, it will execute the queries of the stored replicaSets again and reproduce the same results.
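
The record-and-replay flow described above could be sketched roughly as follows. This is a self-contained illustration under my own assumptions: the tag values are placeholders (the real ones live in TCIAConstants), a local HashMap stands in for the distributed cache, and the executor function stands in for the actual TCIA call. The names ReplaySketch, record, imageEntry, and seriesEntry are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustration of storing tagged queries and replaying them at log-in.
class ReplaySketch {
    // Placeholder tag values; the real ones are defined in TCIAConstants.
    static final String IMAGE_TAG = "IMAGE:";
    static final String META_TAG = "META:";

    // userId -> stored entries; stands in for the distributed cache.
    private final Map<String, List<String>> stored = new HashMap<>();

    static String imageEntry(String seriesInstanceUID) {
        return IMAGE_TAG + "getImage?SeriesInstanceUID=" + seriesInstanceUID;
    }

    static String seriesEntry(String format, String collection, String patientID,
                              String studyInstanceUID, String modality) {
        String query = "getSeries?format=" + format
                + "&Collection=" + collection
                + "&PatientID=" + patientID
                + "&StudyInstanceUID=" + studyInstanceUID
                + "&Modality=" + modality;
        return META_TAG + query;
    }

    // Called on every new search, before the results are returned.
    void record(String userId, String entry) {
        stored.computeIfAbsent(userId, k -> new ArrayList<>()).add(entry);
    }

    // Replays every stored entry for the user and collects the results.
    List<String> logIn(String userId, Function<String, String> executor) {
        List<String> results = new ArrayList<>();
        for (String entry : stored.getOrDefault(userId, Collections.emptyList())) {
            // Strip the tag to recover the query that was originally executed.
            String query = entry.startsWith(IMAGE_TAG)
                    ? entry.substring(IMAGE_TAG.length())
                    : entry.substring(META_TAG.length());
            results.add(executor.apply(query));
        }
        return results;
    }
}
```

This version replays everything on log-in, as the current implementation does; executing only the diffs would need some record of what each instance has already seen.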

Further updates will be posted, when they are available. :-)

Tuesday, May 20, 2014

Summer in Lisboa.. ^_^

After some sunny days, it has started to rain heavily again in Lisboa. The coding period began yesterday, and I have committed some interesting bits to the project. Please have a look at the status update presentation below. Wait for more updates. :)

Wednesday, May 14, 2014

Initial moments with the Data Replication System

Thursday, May 8, 2014

MapReduce Implementations - Hazelcast Vs Infinispan

I was testing the MapReduce implementation of Hazelcast with the recent release of Hazelcast 3.2. Then I decided to compare its performance with the Infinispan 6.0.2 MapReduce implementation.
Infinispan outperforming Hazelcast MapReduce implementation
Infinispan outperformed Hazelcast in the sample MapReduce implementation, tested in different scenarios on a single instance, as shown in the figure. Infinispan still outperformed Hazelcast on up to 6 nodes.

Is Infinispan really faster than Hazelcast? It probably is, as shown by scala-map-benchmarks. The gap may also have something to do with the scenarios, as discussed in the Hazelcast group. However, the difference here is huge, unlike in the previous benchmarks. In my opinion, it has to do with the still immature MapReduce implementation of Hazelcast, as Hazelcast has proven to be quite effective for my other distributed execution tasks.

If your use case is centred around MapReduce, I would suggest Infinispan over Hazelcast, as the Hazelcast implementation is quite buggy as of 3.2. I have encountered 3 issues so far: a known issue, #2105, that was reproduced during MapReduce executions, and two other (probably MapReduce-specific) issues that I reported, #2354 (Update: This issue has been fixed for the 3.2.2 and 3.3 versions of Hazelcast. Thanks to Noctarius for attending to this) and #2359. Hazelcast MapReduce might turn out to be more scalable and better performing once these issues are addressed.

It should be noted that the API of the original Hazelcast MapReduce implementation (code-named CastMapR) was heavily inspired by that of the stable and mature MapReduce implementation of Infinispan. The Hazelcast word-count MapReduce example hence follows the same design as the one from Infinispan.
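
For readers unfamiliar with the word-count example, here is a minimal plain-Java sketch of its shape: a map phase that emits (word, 1) pairs and a reduce phase that sums the counts per word. It runs in a single JVM and uses neither the Infinispan nor the Hazelcast API; the class and method names are my own, and it is not the code that was benchmarked.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-JVM illustration of the word-count map and reduce phases.
class WordCountSketch {
    // Map phase: emit (word, 1) into the collector for every word in a line.
    static void map(String line, Map<String, List<Integer>> collector) {
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                collector.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
    }

    // Reduce phase: sum all the counts emitted for one word.
    static int reduce(String word, Iterator<Integer> counts) {
        int sum = 0;
        while (counts.hasNext()) {
            sum += counts.next();
        }
        return sum;
    }

    // Runs the map phase over all lines, then reduces each group of counts.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> intermediate = new TreeMap<>();
        for (String line : lines) {
            map(line, intermediate);
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : intermediate.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue().iterator()));
        }
        return result;
    }
}
```

In a distributed implementation, the map calls would run on the nodes that hold the data and the grouped values would be shipped to reducers, but the two functions keep the same roles.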

I am using Hazelcast 3.2 and Infinispan 6.0.2 for my master's thesis at INESC-ID Lisboa. Wait for more updates from awesome Lisbon. ^_^


Note:
These results are part of the paper given below, which was published in December 2014. Please cite the paper if you use these results in your research work.
Kathiravelu, P. & L. Veiga (2014). An Adaptive Distributed Simulator for Cloud and MapReduce Algorithms and Architectures. In IEEE/ACM 7th International Conference on Utility and Cloud Computing (UCC 2014), London, UK. pp. 79 – 88. IEEE Computer Society.

Wednesday, May 7, 2014

Publish/Consumer API for the Data Replication - Synchronization Tool