Thursday, October 29, 2015

∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data

Our paper "∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data" (Pradeeban Kathiravelu, Helena Galhardas, and Luís Veiga) was presented at the 23rd International Conference on Cooperative Information Systems (CoopIS 2015), 28-30 October 2015, Rhodes, Greece.

Due to a clash in my travel schedule, I could not attend the conference and present the paper myself. Hence, my supervisor and co-author Prof. Luís Veiga presented the paper in Greece on the 29th of October.

Near duplicate detection algorithms have been proposed and implemented in order to detect and eliminate duplicate entries from massive datasets. Due to the differences in data representation (such as measurement units) across different data sources, potential duplicates may not be textually identical, even though they refer to the same real-world entity. As data warehouses typically contain data coming from several heterogeneous data sources, detecting near duplicates in a data warehouse requires considerable memory and processing power.
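To make the idea of "not textually identical, yet the same entity" concrete, here is a minimal sketch of one common approach: comparing records by the Jaccard similarity of their token sets. This is only an illustration, not the algorithm from the paper; the function names, the sample records, and the threshold values are all assumptions made up for this example.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase a record and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(records, threshold=0.4):
    """Return index pairs whose token sets meet the similarity threshold."""
    token_sets = [tokens(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]


# Hypothetical records: the first two describe the same item with
# different measurement units, so they are not textually identical.
records = [
    "Acme Widget 2 kg blue",
    "ACME widget, blue, 2000 g",
    "Globex Gadget 5 m cable",
]
print(near_duplicates(records, threshold=0.4))
```

The comparison is quadratic in the number of records, which hints at why memory and processing power become a concern on large datasets.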

Traditionally, near duplicate detection algorithms are sequential and operate on a single computer. While parallel and distributed frameworks have recently been exploited in scaling the existing algorithms to operate over larger datasets, they are often focused on distributing a few chosen algorithms using frameworks such as MapReduce. A common distribution strategy and framework to parallelize the execution of the existing similarity join algorithms is still lacking.

In-Memory Data Grids (IMDG) offer distributed storage and execution, giving the illusion of a single large computer over multiple computing nodes in a cluster. This paper presents the research, design, and implementation of ∂u∂u, a distributed near duplicate detection framework, with preliminary evaluations measuring its performance and achieved speedup. ∂u∂u leverages the distributed shared memory and execution model provided by IMDG to execute existing near duplicate detection algorithms in a parallel and multi-tenanted environment. As a unified near duplicate detection framework for big data, ∂u∂u efficiently distributes the algorithms over utility computers in research labs and private clouds and grids.
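As a rough illustration of splitting the comparison work across workers, the sketch below partitions the candidate pairs into chunks and evaluates each chunk in a separate thread. It is not ∂u∂u and does not use an IMDG: a local thread pool merely stands in for the cluster nodes, and the function names, whitespace tokenization, and round-robin partitioning are assumptions chosen for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def compare_chunk(token_sets, pairs, threshold):
    """Check one partition of candidate pairs against the threshold."""
    return [
        (i, j)
        for i, j in pairs
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]


def parallel_near_duplicates(records, threshold=0.5, workers=4):
    """Partition all index pairs across workers and merge the matches."""
    token_sets = [set(r.lower().split()) for r in records]
    all_pairs = list(combinations(range(len(records)), 2))
    # Round-robin partitioning: worker k gets pairs k, k + workers, ...
    chunks = [all_pairs[k::workers] for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda chunk: compare_chunk(token_sets, chunk, threshold), chunks
        )
        return sorted(pair for part in parts for pair in part)


# Hypothetical sample data.
sample = ["a b c d", "a b c e", "x y z", "x y w"]
print(parallel_near_duplicates(sample, threshold=0.5))
```

In a real deployment the chunks would be shipped to different nodes and the token sets would live in distributed memory; here everything shares one process.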

The full paper can be accessed here.

Friday, October 16, 2015

Git Rebase for OpenDaylight/Gerrit

I hate merge conflicts, and luckily, until now I had not needed to do a git rebase for Gerrit in OpenDaylight. This time, however, I had to. Here are the steps I followed.

pradeeban@llovizna:~/OpenDaylight/distribution$ git-review -d  28471
Creating a git remote called "gerrit" that maps to:
Downloading refs/changes/71/28471/1 from gerrit
Switched to branch "review/pradeeban_kathiravelu/28471"

pradeeban@llovizna:~/OpenDaylight/distribution$ git rebase origin/master
First, rewinding head to replay your work on top of it...
Applying: Add messaging4transport to integration
Using index info to reconstruct a base tree...
M    features/index/pom.xml
M    features/index/src/main/resources/features.xml
M    features/test/src/main/resources/features.xml
M    pom.xml
Falling back to patching base and 3-way merge...
Auto-merging pom.xml
CONFLICT (content): Merge conflict in pom.xml
Auto-merging features/test/src/main/resources/features.xml
CONFLICT (content): Merge conflict in features/test/src/main/resources/features.xml
Auto-merging features/index/src/main/resources/features.xml
CONFLICT (content): Merge conflict in features/index/src/main/resources/features.xml
Auto-merging features/index/pom.xml
CONFLICT (content): Merge conflict in features/index/pom.xml
Failed to merge in the changes.
Patch failed at 0001 Add messaging4transport to integration
The copy of the patch that failed is found in:

When you have resolved this problem, run "git rebase --continue".
If you prefer to skip this patch, run "git rebase --skip" instead.
To check out the original branch and stop rebasing, run "git rebase --abort".

pradeeban@llovizna:~/OpenDaylight/distribution$ git status
rebase in progress; onto b524a56
You are currently rebasing branch 'review/pradeeban_kathiravelu/28471' on 'b524a56'.
  (fix conflicts and then run "git rebase --continue")
  (use "git rebase --skip" to skip this patch)
  (use "git rebase --abort" to check out the original branch)

Unmerged paths:
  (use "git reset HEAD ..." to unstage)
  (use "git add ..." to mark resolution)

    both modified:      features/index/pom.xml
    both modified:      features/index/src/main/resources/features.xml
    both modified:      features/test/src/main/resources/features.xml
    both modified:      pom.xml

no changes added to commit (use "git add" and/or "git commit -a")

Resolve the conflicts in the offending files, that is, the files marked as 'both modified'.
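With only four conflicted files, reading them off the status output is easy. For longer lists, a small helper can pick out the 'both modified' entries. This is a sketch that assumes the status text is passed in as a string in the format shown above; the function name is made up for this example.

```python
def conflicted_files(status_text):
    """Return the paths listed as 'both modified' in git status output."""
    marker = "both modified:"
    paths = []
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(marker):
            # Drop the marker and the padding that follows it.
            paths.append(stripped[len(marker):].strip())
    return paths


# Excerpt of the status output from this rebase.
status = """\
Unmerged paths:
  (use "git add ..." to mark resolution)

    both modified:      features/index/pom.xml
    both modified:      pom.xml
"""
print(conflicted_files(status))
```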

Add the modified files.
pradeeban@llovizna:~/OpenDaylight/distribution$ git add features/index/pom.xml features/index/src/main/resources/features.xml features/test/src/main/resources/features.xml pom.xml

pradeeban@llovizna:~/OpenDaylight/distribution$ git rebase --continue
Applying: Add messaging4transport to integration

pradeeban@llovizna:~/OpenDaylight/distribution$ git status
On branch review/pradeeban_kathiravelu/28471
nothing to commit, working directory clean

To amend or modify the commit message,
pradeeban@llovizna:~/OpenDaylight/distribution$ git commit --amend -s

Finally, to submit the changes to Gerrit,
pradeeban@llovizna:~/OpenDaylight/distribution$ git review
Your change was committed before the commit hook was installed.
Amending the commit to add a gerrit change id.
remote: Processing changes: updated: 1, refs: 1, done   
remote: Updated Changes:
remote: Adding messaging4transport features.
To ssh://
 * [new branch]      HEAD -> refs/publish/master/28471

Saturday, October 3, 2015

2015 and the Flowers of Autumn

Flowers once more.
Each year is different, especially in Europe. This year, it started to get colder in August. But summer returned in September. It is getting a bit colder, yet still sunny, in Lisbon in October. The small plants in my apartment have colourful flowers once again, just like in summer.

This summer was pleasant. Spring was equally comfortable. Autumn has just started, so I will refrain from complimenting it yet. I kinda miss the early days of 2015. :) They went smoothly and I had a good time.

The trip to Frankfurt was the first this summer. It is just a small city. Until early September, I was travelling a lot this year. Even after that, I had 3 papers accepted to 3 different conferences - that means, in 3 different countries. Due to clashes in my schedule, visa delays, or other reasons, I could not attend any of them. Hence, I missed the opportunity to travel to Sydney - Australia, Rhodes - Greece, and Vancouver - Canada this year. However, that does not demotivate me, as I am enjoying sunny Lisbon, working on my EMJD-DC projects. :)
Frankfurt river view.

Also, I just completed the first year of my PhD and entered the second year a few days ago. I have a long way to go. This year felt long, but at the same time, it also felt like the time passed pretty quickly. It is contradictory, I know. Winter is not so cool in Lisbon, especially when it just rains - wet and cold. There is no proper central heating system either. There is no snow, and so none of the fun associated with it, unless I decide to travel to some other country. More or less, I have been living in Portugal for 3 years. However, with the mandatory mobility and internships, I foresee my next migration next year. I am starting to fear how I will pack all my things for the move. :P Probably I will have to throw away a considerable amount of things before moving. I also contribute to Humana Portugal regularly. That is really a great initiative, reducing wastage, especially when we leave the country.