Monday, December 13, 2010

SoCPaR2010

SoCPaR (International Conference on Soft Computing and Pattern Recognition) is an annual conference that brings Soft Computing and Pattern Recognition together: "Innovating and Inspiring Soft Computing and Intelligent Pattern Recognition". SoCPaR has now been successfully conducted for the second consecutive year. SoCPaR 2009 was held in Malacca, Malaysia on December 4th - 7th, 2009, and was followed by SoCPaR2010 at Université de Cergy-Pontoise in Cergy-Pontoise / Paris, France on December 7th - 10th, 2010. Following these two successful years, SoCPaR2011 has been scheduled to be held in Dalian, China on October 14th - 16th, 2011.


Presenting the paper on Association Rule Mining
It was a really pleasant experience for me to present our research paper, "Horizontal Format Data Mining with Extended Bitmaps [1]", at SoCPaR2010. Our paper is listed as Paper 113 in the proceedings. I presented it on Dec 8th, 2.30 - 3.00 p.m., at the E1 auditorium of the university; the conference ran 3 parallel sessions at E1, E2, and Colloque. The paper comes from the same team of undergraduates from the University of Moratuwa who published a paper on their product, "Mooshabaya - A Mashup Generator for XBaya [2]". Our paper received positive and constructive feedback, which gives us more ideas for taking the algorithm forward. We have benchmarked our implementation with the FIMI datasets and, as suggested by the chair, the door is also open to entering it in the Frequent Itemset Mining Implementations competition.

Paris (6th - 11th, December 2010)
Apart from the paper presentations and tech talks, we also had social events organized by the committee, such as 'Paris by Night', 'Wine Tasting', a visit to the Château de Chantilly, and a banquet at the Abbey of Royaumont [3][4]. It was a nice learning experience, with days filled with fun. We also got to face the heaviest snowfall that Paris had experienced since 1986. After the conference, we were able to enjoy two more days in Paris, and were lucky enough to visit the Louvre (the museum where the Mona Lisa and many other masterpieces live), Notre Dame Cathedral, Montmartre hill with the big white Basilica of Sacré-Cœur on its summit, the Eiffel Tower, and a few other places of interest.

[1] Buddhika De Alwis, Supun Malinga, Kathiravelu Pradeeban, Denis Weerasiri, Shehan Perera. "Horizontal Format Data Mining with Extended Bitmaps," in Proceedings of the 2010 International Conference on Soft Computing and Pattern Recognition (SoCPaR2010), Cergy-Pontoise, Paris, France. pp. 220-223, Dec. 2010.

[2] Buddhika De Alwis, Supun Malinga, Kathiravelu Pradeeban, Denis Weerasiri, Srinath Perera, Vishaka Nanayakkara. "Mooshabaya: mashup generator for XBaya," in Proceedings of the 8th International Workshop on Middleware for Grids, Clouds and e-Science (MGC '10), Bangalore, India. ISBN: 978-1-4503-0453-5. doi:10.1145/1890799.1890807

[3] The abbey
[4] Photos of the Abbey

Wednesday, December 8, 2010

Horizontal Format Data Mining with Extended Bitmaps

Abstract
Analysing data warehouses to foresee transaction patterns often needs high computational power and memory space because of the huge history of past transactions. The Apriori algorithm is a widely studied and implemented algorithm that mines data warehouses to find associations. Frequent itemset mining with the vertical data format has been proposed as an improvement over the basic Apriori algorithm.
In this paper we propose an algorithm as an alternative to the Apriori algorithm. It uses bitmap indices in conjunction with a horizontal format data set converted to a vertical format data structure to mine frequent itemsets, leveraging the efficiencies of bitmap-based operations and vertical format data orientation.

Categories and Subject Descriptors
[Data Mining] Association Rule, Apriori, Bitmap Indices.
[Data Analysis] Data warehousing, Data Analysis.
General Terms - Data Analysis and Mining
Keywords - Data mining, Association Rule, Apriori, Vertical format mining, Bitmap Indices



Here we propose an algorithm, "Horizontal Format Data Mining with Extended Bitmaps", for association rule mining. First we will look at association rule mining and the roots of our algorithm. What is association rule mining? It is the task of finding interesting relationships between variables. Association rule mining is often explained through market basket analysis, where customers' purchase details are analyzed to find interesting relationships between the items. Here we find the sets of variables that frequently appear together. These sets are called frequent itemsets, and finding them is an active research topic because it is computationally expensive.

The Apriori algorithm is a fundamental algorithm for association rule mining. It mines frequent patterns from data presented in a horizontal format, where the items are listed against their respective transactions. The algorithm relies on the Apriori property: any subset of a frequent itemset is also frequent. Each pass of the Apriori algorithm has to scan the whole data set, so it becomes expensive on large data sets. Many improvements to the Apriori algorithm have been suggested.
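To make the Apriori property and the repeated passes concrete, here is a minimal sketch of the basic level-wise Apriori loop. The function name `apriori`, the `min_support` parameter, and every sample transaction other than T100 are made up for illustration; this is not the implementation benchmarked in the paper.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count the single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Each pass scans the whole data set again to count the candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

# Hypothetical horizontal data; only T100 = {I1, I2, I5} comes from the post.
sample = [
    {"I1", "I2", "I5"},  # T100
    {"I2", "I4"},        # T200 (made up)
    {"I2", "I3"},        # T300 (made up)
    {"I1", "I2", "I4"},  # T400 (made up)
    {"I1", "I3"},        # T500 (made up)
]
frequent_itemsets = apriori(sample, min_support=2)
```

With this toy data, the candidate {I1, I2, I4} is pruned without being counted because its subset {I1, I4} is not frequent, which is exactly what the Apriori property buys us.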

Transaction data mostly occurs in the horizontal format; the vertical format is an alternative way of looking at it, where the transactions are listed against their respective items. Since data may not arrive in this format, we may need to reorganize it into the vertical format before mining it for associations. Many effective algorithms are built on top of vertical format data mining.
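The reorganization from horizontal to vertical format can be sketched in a few lines. The helper name `to_vertical` and all transactions other than T100 are hypothetical and only serve as an illustration.

```python
from collections import defaultdict

def to_vertical(horizontal):
    """Map each item to the set of transaction ids (its tid-set) that contain it."""
    vertical = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

# Horizontal format: items listed against each transaction.
# Only T100 comes from the post; the rest is made-up sample data.
horizontal = {
    "T100": ["I1", "I2", "I5"],
    "T200": ["I2", "I4"],
    "T300": ["I2", "I3"],
    "T400": ["I1", "I2", "I4"],
    "T500": ["I1", "I3"],
}

# Vertical format: transactions listed against each item.
vertical = to_vertical(horizontal)

# The support of an itemset is the size of the intersection of its tid-sets.
support_i1_i2 = len(vertical["I1"] & vertical["I2"])
```

Once the data is in vertical form, counting the support of an itemset no longer needs a scan over all the transactions; it only needs the tid-sets of the items involved.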

Now let's look at the next interesting term in our algorithm - bitmaps. Bitmaps store individual bits compactly. They are sequences of 0's and 1's, where a 1 depicts existence. The major advantage of using bitmaps is the possibility of effectively exploiting bit-level parallelism. We have now seen the vertical data format and bitmaps, which leaves us with a question: is it possible to get the benefits of both the vertical format representation and the bitmap operations to find frequent itemsets in a distributed environment?
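As a small illustration of bit-level parallelism, the sketch below stores one bitmap per item in a Python integer and counts the transactions containing both items with a single AND. The helper names `make_bitmap` and `popcount` and the bit positions are assumptions chosen for this example.

```python
def make_bitmap(positions):
    """Build a bitmap (stored in an int) with a 1 at each given transaction position."""
    bitmap = 0
    for p in positions:
        bitmap |= 1 << p
    return bitmap

def popcount(bitmap):
    """Count the 1 bits, i.e. the number of transactions marked in the bitmap."""
    return bin(bitmap).count("1")

# Hypothetical data: positions 0..4 stand for five transactions.
i1 = make_bitmap([0, 3, 4])     # I1 occurs in transactions 0, 3 and 4
i2 = make_bitmap([0, 1, 2, 3])  # I2 occurs in transactions 0, 1, 2 and 3

# One AND compares all transaction positions at once.
both = i1 & i2
support_i1_i2 = popcount(both)
```

The AND works on whole machine words at a time, so many transactions are compared in one operation instead of one by one.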

Here we propose the algorithm 'Horizontal Format Data Mining with Extended Bitmaps'. The algorithm takes a data set organized in the horizontal format. With one pass over the data set, we construct a bitmap-based data structure that is in the vertical format. This structure makes the mining efficient.
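The general idea of building vertical bitmaps in a single pass over horizontal data can be sketched as follows. This is a simplified illustration only: it does not reproduce the extended bitmap structure with linked items from the paper, and the function names, the `min_support` parameter, and all transactions other than T100 are made up.

```python
from itertools import combinations

def build_bitmaps(transactions):
    """One pass over horizontal data -> {item: bitmap}; bit n marks the n-th transaction."""
    bitmaps = {}
    for n, (_tid, items) in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << n)
    return bitmaps

def support(itemset, bitmaps):
    """AND the bitmaps of all items in the itemset and count the remaining 1 bits."""
    items = iter(itemset)
    acc = bitmaps[next(items)]
    for item in items:
        acc &= bitmaps[item]
    return bin(acc).count("1")

def frequent_pairs(bitmaps, min_support):
    """Frequent 2-itemsets, found without going back to the original data set."""
    frequent_items = sorted(
        item for item in bitmaps if support([item], bitmaps) >= min_support
    )
    pairs = {}
    for pair in combinations(frequent_items, 2):
        count = support(pair, bitmaps)
        if count >= min_support:
            pairs[pair] = count
    return pairs

# Only T100 comes from the post; the other transactions are made-up sample data.
transactions = [
    ("T100", ["I1", "I2", "I5"]),
    ("T200", ["I2", "I4"]),
    ("T300", ["I2", "I3"]),
    ("T400", ["I1", "I2", "I4"]),
    ("T500", ["I1", "I3"]),
]
bitmaps = build_bitmaps(transactions)
pairs = frequent_pairs(bitmaps, min_support=2)
```

The data set is read only once, in `build_bitmaps`; all the later support counts are computed from the bitmaps alone.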

First we take the transaction T100: (T100, {I1, I2, I5}). We mark the items that appear in T100 in the ordered array. At the same time we link the associated items to the ordered array. Hence I2 will be linked to I1 in the master array, while I5 will be linked to I2. I5 will also be linked to I2 in the ordered array. Linking I1 to I2 or I2 to I5 in the ordered array is avoided to prevent redundancy. I1, I2, and I5 are marked 1 to represent their existence, which constructs their bitmaps.

Refer to the slides for a simple explanation of the algorithm "Horizontal Format Data Mining with Extended Bitmaps" itself.

T - Average size of transaction (Transactions).
I - Average size of maximal potentially-large itemsets (Itemsets).
D - Number of Transactions (Datasets).


Image: http://en.wikipedia.org/wiki/File:Storeisle.png

Tuesday, December 7, 2010

Mooshabaya and the Detailed Story..

Abstract
Visual composition of workflows enables end users to visually depict a workflow as a graph of activities in a process. Tools that support visual composition translate those visual models to traditional workflow languages such as BPEL and execute them, thus freeing the end user from the need to know workflow languages. Mashups, on the other hand, provide a lightweight mechanism for ordinary user-centric service composition and creation, and are hence considered to have an active role in the web 2.0 paradigm. In this paper, we extend a visual workflow composition tool to support mashups, thus providing a comprehensive tooling platform for mashup development backed by workflow-style modeling capabilities, while expanding the reach of the workflow domain into web 2.0 resources with the potential of mashups. Furthermore, our work opens up a new possibility of converging the mashup domain and the workflow domain, thus capturing beneficial aspects from each domain.

Full Text



Before going into the Mooshabaya research, it is worth looking at its roots: mashups and workflows. JavaScript-based mashups are a content aggregation technology used to compose services by remixing two or more sources [sample mashup site]. Workflows represent real-world processes as a sequence of operations. In practice, mashups and workflows each have their own application domains. Workflows have gained their major use cases from the research communities, as scientific workflows, and from the business communities, as business workflows. Mashups mostly target end users and empower the user-oriented web.


What if the two domains could be merged? Can the domains use each other? This became the core research question of Mooshabaya.

Mooshabaya composes workflows graphically and exports them as mashups. The mashups can then be executed and monitored at run time.
What motivated us to build Mooshabaya? We had two major goals. 1) A mashup composer - mashups can be created graphically by dragging and dropping components. 2) Extending the reach of the workflow domain to web-based APIs and web content such as feeds. Mashups also become an alternative lightweight medium for workflow execution.

Before going further into Mooshabaya, we look at the related work. Yahoo Pipes, JackBe Presto, and Serena are some of the mashup composers that allow visual mashup composition through a graphical user interface. XBaya, Taverna, Triana, CAT (Composition Analysis Tool), Kepler, and Pegasus are some of the graphical workflow composers. Each of these tools has its own set of features and target use cases.

So what is special about Mooshabaya? Visual mashup composers mostly restrict themselves to Web 2.0 APIs, and so far mashups have been seen mostly as a content aggregation technology. Existing mashup composers do not support monitoring the execution of mashups at run time. As for the workflow composers, the traditional workflow languages have a steep learning curve, and content aggregation is only minimally supported in the existing tools. Mooshabaya is expected to fill these gaps in both domains.

Let's have a look at the implementation of Mooshabaya. Registry Integration, Mashup Generation, Deployment, Monitoring, Security and the user interface are the major components.
Here we have the major use case of Mooshabaya. 

First the user wants to compose workflows. He integrates a web service registry into the system, then discovers and fetches the service metadata from the registry. By dragging and dropping the metadata files and other service components onto the Mooshabaya canvas, he composes the workflows.

Then he exports the workflow as a mashup and deploys it to a mashup server. During deployment, the mashup goes to the mashup server, while the relevant workflow files and service metadata are stored in a registry. After the deployment, he can execute the mashups deployed in the mashup server and monitor them at execution time. Secured services found in an Identity Server can also be used in the workflow composition.

As we discussed, Mooshabaya provides a solution for the complete mashup life cycle, which the existing products do not provide at the moment. The monitoring phase of the mashups is one example.

Our research was mostly focused on two areas. The first is the integration of a WS-Eventing based notification system for remote monitoring of the mashups. The second is mashup security, based on username and password authentication.

Mooshabaya uses XBaya as its core workflow composer; in fact, it can be considered a mashup generator for XBaya. WSO2 Mashup Server is used as the mashup server for Mooshabaya, and WSO2 Governance Registry as the web services registry. For the performance analysis, we benchmarked Mooshabaya against XBaya's existing export options. The performance graphs compare the file sizes and the generation times. The mashup file was considerably smaller than the respective BPEL file, which reduces the time needed to deploy the file to a remote server. The mashup generation time and the BPEL generation time were close to each other, without much difference.

Scientific research projects such as the LEAD system, business processes, and educational research are the major target users of Mooshabaya. As future work, Mooshabaya can be extended with further research, specifically a web-based interface and support for delegated authentication scenarios.