# Issue with eclipse debugger while creating plugins for GeoKettle

With regards to the issue with the debugger not able to install breakpoints, I believe we have found a solution. If you face the same issue where you cannot debug with Eclipse due the error I had described in my previous post, try switching to Eclipse Kepler R2. That worked for us. Also make sure, you have the following set:

• Java version to 1.6.
• Javac set to debug=true in build.xml.

Update

For those interested, here’s the line(s) of code that throws the error:

in here.

# Upper bound to the number of nodes and height in a full binary tree

Definitions

A full m-arr tree is one which has either m children or no children at all.

Extending that definition a Binary Tree (hereby referred to as BT) should have either 2 children or no children at all.

Now let’s make a claim.

Claim: A full m-arr tree with i internal nodes have mi+1 nodes in total.

Now in a full BT, there can be two types of nodes: nodes with a parent and nodes without a parent. The later can only amount to 1 (the root node). To find the former, we can take the number of parents in tree (i) and multiplying them with the branching factor (m).

To find the number of leaves, we can say

$Number of leaves = (mi + 1) - i = (m-i) + 1$

Now to find the height of the BT.

Definition: The height of a tree is the maximum level of any leaf node.

So what do we mean by a level? – It is the number of edges in the path from it to the root.

Now if there is a BT of height h, how many nodes and leaves can it contain? – There is no straight answer because this would depend directly on the structure of the tree. However, we can try and find an upper bound. Is the bound tight? – Hardly.

To find the upper bound, we will use structural induction. This can be done using numeral induction but it would probably be more complex and I am a lazy person :). However, for all those math geeks out there, structural induction can be converted to numeral induction pretty easily. The structural induction proof is based on the recursive representation of the tree. For that we need to consider two points:

• A single node is a full BT (just the root).
• If X and Y are two full BT’s, then say we have a new root (x), whose children are X and Y, the this new tree (X) is also a full BT.

Claim 1: Let T be a full BT, with height h and n nodes. Then $n <= 2^{h+1} - 1$

Base: If a BT had just one node, then n =1 and h =0. Then $1 = 2^{h+1} -1$, or $h=0$.

Inductive step: Let X and Y be two full BT’s for which this holds true. We need to show the claim also holds true for a new tree T of whose X and Y are children.

Let,

height of T = h

height of X = p, and

height of Y = q

Since, h is greater than p and q, the height of the larger one between p and q must be exactly $h-1$ and the other one $<=(h-1)$.

By induction, the number of nodes in X is $<= 2^{p+1} - 1$ which is $<= 2^{(h-1)+1}-1 = 2^{h}-1$. Similarly, for Y the number of nodes is $<= 2^{h}-1$.

The number of nodes in T is $<= p+q+1$, which is

$2^{h}-1 + 2^{h}-1 + 1 = 2.2^{h}-1 = 2^{h+1} -1$, which is what we wanted to show in the first place.

# Order statistic trees

What is it and why should I care?

Order statistic trees are a modification of Binary Search Trees or in generate B-Trees. It has the same structure of a BST with just one addition structurally and as a result supports two additional methods: Select and Rank.

Structurally each node of an order statistic tree has an extra value: size of its subtrees, where

$size[x] = size[left[x]] + size[right[x]] + 1$

The Select method can be used to find the ith smallest element in the tree and the Rank method can be used to find the rank of an element in the tree.

Thus the code for select is as follows:

And the code for rank is as follows:

The running time of Select and Rank are both $O(log n)$

# 0-1 Knapsack problem

The 0-1 Knapsack problem is a very famous problem that can be solved using dynamic programming in pseudo-polynomial time. A variation of the problem can be defined as follows:

You have a Knapsack that can store M items. You have N items with weights W and values V each respectively. Which items would you choose to maximize your profit.

Before we start discussing the problem, lets talk about what makes this algorithm a pseudo-polynomial time algorithm rather than a polynomial time algorithm aka what is meant by pseudo-polynomial time?

For an instance take my word for it that the running time of this algorithm is O(nW). Now for a 32-bit machine and 64-bit machine, the number of computation we would have can be calculated as follows:

As you can see the running time increases instead of decreasing on a 64-bit machine when compared to a 32-bit machine. This is not expected of an algorithm running in polynomial time since such an algorithm would scale linearly.

Now to the main problem.

Let’s take an example:

If the Knapsack can take a maximum weight of 10, and we have the following 3 items, what would we do.

Item    Value    Weight

1           \$30       6lb

2          \$14        3lb

3          \$16        4lb

Let K(W) be our final solution. So how would we define K(W)? Our target is to accumulate maximum value in treasure while keeping the total weight <= W.

Thus, K(W) can be defined as:

K(W) = maximum value achievable with capacity W.

Now there are two ways to look at this problem. If you look closely at it, the first thing that will probably come to your mind is, hey, its a maximization problem based on constraints such as the capacity of the Knapsack. And right you are, while solving via DP (Dynamic Programming), it does turn out to be a maximization problem.

So what’s out optimal sub-structure and does it repeat? Let’s take a look at this graph to figure that out:

Let’s have a graph with the weights as columns and the capacity of the Knapsack as rows. At any point of time, an edge has two choices, either choose to pick that item or not choose to do so. If it picks that item, it goes to a certain state, if it doesn’t, well, it goes to a different state. Of course, they cannot go to just about anywhere. For instance, they cannot go to a state where the weight of the Knapsack is smaller than the total weight of the items.

However, please notice that whilst we started off saying that we are trying to solve a maximization problem, we are no longer doing so. The reason behind this is, although we can build a graph that makes it look like a maximization problem, it is much easier to solve as a minimization problem.

Now that we have the graph it is fairly easy to solve. Use any path finding algorithm and find the shortest path from the start to the end. If we sort it topologically and then solve the running time comes to O(V+E) whereas algorithms like Bellman-Form would give a running time of O(VE). I would leave the code for that for another post or if you can be a good guy/girl and post it in the comments.

We are going to figure out how to get our optimal sub-structure from this graph and also turn it from a minimization to a maximization problem.

To do so, let’s figure out which one thing are we repeating at every step. We are deciding at each step whether to select that item or not. At every step, we do not know what we have already selected and we do not know what we are going to select in the future. All we know is that we are either going to select this item or we are not going to select it. That’s all and that is our optional sub-structure. Mathematically, we can define it as:

Like we discussed earlier we have two choices. The first part of the representation defines if we didn’t have to choose the item and the second part vice versa. In the second part, if we do choose an item, we need to add the value of that item and reduce our Knapsack’s capacity.

Now that we got the representation, how are we going to use it? Let’s take a 2D matrix with the sum as rows and the items (weights) as columns with one extra column as we took for our graph representation. Now we can recurse in a topological sort manner where we start from the rightmost column and the topmost row and proceed. With that in mind, here’s the JAVA code for the algorithm:

In a variation of this problem, you might be asked, what if instead of the weight accumulated being less than and equal to the capacity of the Knapsack, you have to find the one with the exact weight as that of the capacity. The difference between the two lies in the fact that if its less than and equal the value might be different from if its exactly equal. What would you do in that case?

I will leave that to be covered in another post. Till then, bye!

# Breakpoints error while debugging

Recently I was working on a GeoKettle plugin and faced the following error:

Unable to install breakpoints due to missing line number attributes. Modify compiler options to generate line number attributes.

GeoKettle was compiled on version 1.6 and thus, if you are on a version greater than that, you might face this error. To fix it, just set your target version to 1.6.

# Create pairs of parenthesis

Question: Given a number N, create all possible valid combinations of parenthesis. For example, given 2, all valid combinations would be: ()() and (()) – equal opening and closing parens.

Solution

This question is a very nice template of the build on top of base cases approach where we create the simplest case, build on top of it and find a way to do so recursively for all higher cases. Lets follow that approach:

For n = 0, we can only have an empty set

For n = 1, the best we can do is

()

For n = 2, how can we build based on n = 1?

Observe that for n = 2, the valid combinations would be ()(), (()). This we could come up intuitively. Now if we look at the result for n = 1, we can see that what we did was basically insert parens in between parens or add a new set of parens if we couldn’t find a pair to insert in-between.

Thus, the recursive code would be:

# Store elements at every level separately during level order traversal

To explain the problem, say we have the following tree:

And we want to store elements at every level in a separate list. For this  example, we would have the following three lists:

1

2 3

4 5 6 7

We will use the usual BFS algorithm but with a small modification. To know the level, we will keep two variables: one to store the numbers of nodes in the current level and the second to store the number of nodes in the next level. We will also take one list to store the nodes at each level and another list to store all the lists.

The basic idea of the algorithm is, whenever a node is popped out of the queue, we store it in the list pertaining to that level and decrement the counter for the number of nodes for that level. When the counter goes down to zero, we know we have reached the end of the level.

Similarly, when we add a new node to the queue, we increment the counter for number of nodes in the next level. When the current level is exhausted, we store the list in the bigger list, start a new list, and make the next level the current level and repeat the process.

# Why the inability to access ElastiCache from outside is a deal breaker

AWS ElastiCache cannot be accessed from the outside. Amazon cites latency issues as the cause for it but this may turn out to be a deal breaker for many customers. Here’s why:

The entire infrastructure might not be AWS based

Private-public hybrid clouds is a very real scenario and most companies prefer to have their test infrastructure on their own hardware and deploy on the cloud. While it shouldn’t be a problem with unit tests, integration tests would be a total bummer.

We are aware of latency issues, but we still want access to it from outside the subnet

Amazon’s concern over not allowing Elasticache outside the subnet is inherent latency issues. That is definitely a concern, I agree but a concern in production systems. I still want to run integration tests on my laptop/local server and I hate you for not letting me do that. So much for developer’s happiness over bad press.

Security! Security! Security!

We know and if somebody wants to keep a production server open to the world, he deserves to be hacked. Wait, let me put the last one I forked behind a firewall. BRB!

Thy shalt not be in the same subnet

Even if I am on EC2, I might not be in the same subnet as the cache cluster. That’s a very real scenario and the whole design would fail right then and there. Go figure!

Conclusion

Build your own cluster manually with a lot of pain, sleepless nights, load balancers, energy drinks…. oh wait, there is Google. Bye and see you on the other side!

# GeoKettle version mismatch while developing plugins

I faced this error yesterday while working on a GeoKettle plugin:

Exception in thread “main” java.lang.NoSuchMethodError: org.pentaho.di.i18n.BaseMessages.getString(Ljava/lang/Class;Ljava/lang/String;[Ljava/lang/String;)Ljava/lang/String;

Apparently this happens when there is a version mismatch of the jars included in your project and those used by GeoKettle/Spoon.

# Share photos with friends and family

There are so many options for this right? But lets look at the facts. Nothing is for free and there are so many places where you don’t want your photos to be even if its for free. Major services like Dropbox, OneDrive etc would charge you anywhere between \$5-\$10 a month once you cross their free tier, which is rather easy to do if you have photos from a couple of trips to share (photos from my recent trip to Austin and San Antonio was 1.37GB alone). Come in S3, Amazon’s Object storage service, which in other words mean you can store whatever you like. Great, so what’s the big deal? You don’t need to buy in bulk. S3 charges customers 3c per GB and it gets cheaper by the TB. So the more you upload, the less you pay. How great is that!

So now that there is a place to upload for cheap, how to do it? Now S3 is no Dropbox. So here comes my little Java code that does it all for you: upload all photos, create a nice index page, archive all photos and let users download it as one downloadable. The source code is still in the works so expect bugs. Nonetheless, its available here. The JAR can be downloaded from here.

So how to use it?

First you need to create a AWS account if you already don’t have one. Next you need to create a S3 bucket from your AWS account.

Next you need to setup your account so its accessible to the outside world. If you’d rather like it private, you may create AWS IAM users and give them access to read the bucket. To make your bucket accessible to the outside world, enable “Static Hosting” and put index.html as your index document.

Now to grant the application access to your S3 bucket. First create a new AWS IAM user and grant it full access to S3. Second create the configuration file for the application. It should be in the format:

{
“awsKey”: “xxxxxxxxxxxxxxx”,
“awsSecret”: “xxxxxxxxxxxx”,
“bucketName”: “myawesomebucket”,
}

Now download the application jar. You will require JAVA to run the app, so if you don’t have it already, install JAVA. The application works with JAVA 1.7 or higher.

Put the jar and the configuration file in the same directory and issue the following command to share your first photo gallery via S3.

java -jar photosharer-1.0-SNAPSHOT.jar path/to/gallery path/to/configuration/file galleryname

And that should do it. You can now share your photos using the bucket URL you had copied earlier. The entire URL would be http://your_bucket_url/gallery_name.

I am eager to know if this helps in sharing photos with friends and family. And as always do let me know if you have a suggestion or a criticism to make it better.

# Avoiding race conditions in your code

Recently I was writing some code that had to be used by multiple threads in the same application. That reminded me of writing about the oh, so important race condition and how to avoid it in a multi-threaded application.

What is race condition and why should I care about it?

A race condition occurs when more than one thread tries to access the same resource and the order in which it is accessed matters. Say for instance you have a static variable and that resource is accessed from multiple threads. However, when one thread is modifying it, another thread tries to access it and reads a stale value. The static variable in this context is a critical section and the phenomenon is called a race condition.

Great but my application is single threaded, why should I bother?

Is it? What if you are trying to access a memcached cluster. Would you rather queue multiple requests or fork separate connections for each request. Here a memcached connection object is your critical section. Consider yourself doing this for a web portal receiving a million hits at any instant of time. That means a million connections to your memcached cluster. That can’t be good, right?

Ok you got me hooked, how do I avoid it?

I suppose the solution is absolutely intuitive. Don’t provide concurrent access. In Java, this can be easily done with the synchronized keyword. You may make your method, variable or a certain scope synchronized. I won’t go into the details since there are a ton of articles on the Internet but I hope this little snippet of programmer rant helped you out in some way.

# Fun with Microsoft “support technicians”

I had been getting calls from somebody called Sylvestar from Microsoft “tech support” for a few days now and since I had some free time today, decided to have some fun with the scammers. The guy was Indian and judging by his accent probably from the southern part of the country. So here goes a brief description of the call.

After Sylvestar, a lady got on the line. She had a taught american accent. That made me think how many of these scammers are ex-BPO or current BPO employees. Most of the support work is outsourced to India and I wouldn’t be surprised if some of the more shady companies sold user information. Anyways, back to our story. She tried to scare me about viruses and how I could lose all data and that my warranty was nearing an end (I still have 3 more years of warranty on that computer). Next she wanted me to download and run an application called PC Check. From the looks of it, it screamed malware, so I decided to draw the line there and call her bluff. Fortunately, I didn’t end up being verbally abused like so many others who have been victims of this scam and drew the line before the scammers ripped them off their money.

As a conclusion I would ask you to beware of such scammers. First of all, ask for their customer service number and tell them that you will call them back. A legitimate technician wouldn’t mind giving their number, a scammer would. To be on the safe side, always Google up the number and verify it. The number I got the call from is 4044447233. Of course the number is a virtual one but if you get a call from this number, you know who’s calling.

# Is your webservice really RESTful

REST is the new buzzword in town and more often than not I have seen people claiming their webservices being RESTful when they are not. So here is a list of things that make your webservice NOT RESTful.

It should not perform an action

The REST architecture is resource based, meaning it should provide access only to resources and not actions. If an end-point performs an action and changes the state, your webservice is not RESTful.

There should be clear separation of concerns

This is easier said than done but a RESTful service should have a clear separation of concern. In other words, if your client and server are both writing to the same database, your service is definitely not RESTful.

Should be stateless

If, for example, you have a service that check for authentication, it should not change state if a user logs in. Session management should be done on the client side and not on the server side. If your server changes state, your webservice is not RESTful.

Layered system

The biggest advantage, in my opinion, of the RESTful architecture is the idea of building it in layers. In other words, your server or client should have no knowledge of the path through which it receives requests or responds. And this brings us to the next point: Cacheability.

Cacheability

A layered system allows cacheability to the system. Consider putting in a firewall between your server and client in a non-layered system.

Uniform interface

The most important thing in this is the representation of a resource. A resource should be self sufficient enough to identify itself to the client. Once a client has a resource, it should have the power to modify or delete that resource. Another thing of note here is that the resources are conceptually different from the representations that are sent back to client. For instance, a resource might be a database view to the server but  an array of names to the client.

Apart from this, I have also noticed that people liberally use POST. Some developers have the notion that POST is “better” than GET. This is no where near the truth. When creating a RESTful service, specifications dictate that if you are overwriting a complete representation of the resource, PUT should be used. For partial overwrites, POST may be used. For retrieval, GET should be used.

Hope you enjoyed this post and please drop a line if I have missed something.

Picture courtesy: drdobbs.com

# Rajnikanth humor

Found this old and gold article at Funspace. Great actor but even better is the humor surrounding him. Enjoy!

# Notes on Jackson version mismatch with Stormpath SDK v1.0.beta

One of our projects at Tango uses Stormpath and who doesn’t use Jackson, right. So while using version 1.0.beta of their JAVA SDK, I came across this problem.

My project used jackson-asl version 0.9.5 and the Stormpath JAVA SDK uses version 1.9.11 of jackson-mapper-asl. This might cause a version mismatch conflict while constructing a new Client, typically on the third line of this piece of code:

and may look something like this:

The easiest way to get out of it is use jackson-core-asl (JSON processor) and jackson-mapper-asl (Data mapper) instead of jackson-asl and exclude the above artifacts from the stormpath dependencies in your pom.

If it still doesn’t work, look through the dependency graph and see where else jackson-mapper-asl is used causing the version mismatch.

Thanks to Jose from the Stormpath engineering team for his help on this one.

# Amazon Zocalo: first experiences

Zocalo is a file sharing service from Amazon. Its still in limited preview but I just got access to it. So here’s a video of my first experience of Amazon Zocalo.

# Integration testing with JUnit and Maven failsafe plugin

I am sure you know the differences between integration and unit testing but for the sake of completeness I am going to re-iterate the major difference from the perspective of a test designer.

When designing an unit test, you are concerned only about that one class, so all data coming into the class are usually mocked. However, during integration tests you need to use “real” data from the intended source.

Now as it is evident from the title of the post, I am going to go through a small “Hello World” project to cite the differences while writing unit and integration tests along with how to use the Maven fail safe plugin along with JUnit to achieve it.

1. Defining the goals

Unit tests and integration tests run at different phases of the project build. Unit tests may stop a compilation from completion, however integration tests are usually run after the compilation cycle of the module. In Maven lingo all phases such as test or install will run the unit tests whereas integration tests will only run under the verify phase. Thus the Maven fail safe plugin comes as a build plugin and runs as an exclusive phase. Thus the pom.xml goes as follows:

2. Defining the project structure

Version 4 and above of Junit defines test cases as follows:

• Class name: ClassNameTest.java
• Method signature: public void testSomeLogic()

For integration tests, however the Class name becomes ClassNameIT.java. Below is a depiction of a typical project file structure:

The business logic, as usual, goes in src/main/java. For our little project, below are the classes for that:

DataService.java – Class to mimic data received from a remote source

4. Writing the unit and integration tests

As we talked about earlier, unit test classes end with “Test” whereas integration test classes end with “IT”. Here are the definitions for our example:

and the integration test class:

5. Running the tests

As we discussed earlier, unit tests are run before the build phase and integration tests after. On our example project, if we run mvn test, it would run just the unit test case. A (truncated) output for that would be as follows:

——————————————————-

T E S T S

——————————————————-

Hello

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec

Please note that it ran the test on the mocked input. Now for the goal verify,

——————————————————-

T E S T S

——————————————————-