
Running a Local ElasticSearch Cluster for Development

ElasticSearch is a document database built on Lucene, a full-text search engine. It clusters and is useful in a variety of scenarios. If you want to run it locally and test some of the clustering features, here are some things I learned from my experience.

Install with your preferred package manager, or from source. In my case I use homebrew, so install is as easy as:

brew install elasticsearch

You can run multiple nodes from one install. First, you will want to tweak the config. For me, the elasticsearch.yml config file was located in /usr/local/Cellar/elasticsearch/1.4.0/config/, but that may vary based on your OS, package manager, or version.

Mainly, I set a value for cluster.name:

cluster.name: some_cluster_name_of_your_choosing
marvel.agent.enabled: false

By default, elasticsearch joins any cluster with the same name, so you do not want to run with the default cluster name or you will be syncing with other local developers on your network.

Then you want to configure your startup mechanism. For some users, that would mean configuring the elasticsearch file in /etc/init.d, but for OS X homebrew users like me, the startup file is ~/Library/LaunchAgents/homebrew.mxcl.elasticsearch.plist.

First, set up extra nodes. In OS X, that means making copies of the LaunchAgent file in the same directory with unique names. It’s worth noting that if you’re using homebrew, the original LaunchAgent plist file is a symlink, so you’ll want to copy its contents to a new file rather than copy the link. I numbered my extra nodes, so the filenames were homebrew.mxcl.elasticsearch2.plist, and so on. I altered each plist file to have a couple of extra arguments, specified in the ProgramArguments node, and removed the XML nodes that specify keeping the worker alive. The results were:

I set custom node names and ports via the plist file. Note that on a Linux-based machine, you would alter the service to run multiple ElasticSearch nodes with those custom parameters.

Finally, since I didn’t want to run those scripts individually each time, I created a script to launch all nodes at once. Note that you’ll want to add execute permissions and put it somewhere in your path.
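For readers who are not on OS X, a launcher along those lines could be sketched in Python as follows. This is not my original script. The node names and the idea of starting the processes directly (rather than through launchctl) are assumptions for illustration, as is passing settings as -Des.* system properties on the 1.x elasticsearch command line. It also assumes the elasticsearch binary is on your path.

```python
import subprocess

ES_BINARY = "elasticsearch"  # assumption: the binary is on your PATH
BASE_HTTP_PORT = 9200        # Elasticsearch's default HTTP port


def node_commands(node_count):
    """Build one command line per node, each with a unique name and HTTP port."""
    commands = []
    for i in range(node_count):
        commands.append([
            ES_BINARY,
            "-Des.node.name=dev_node_%d" % (i + 1),      # hypothetical node names
            "-Des.http.port=%d" % (BASE_HTTP_PORT + i),  # 9200, 9201, ...
        ])
    return commands


def launch_all(node_count):
    """Start every node as a background process and return the process handles."""
    return [subprocess.Popen(cmd) for cmd in node_commands(node_count)]


# Dry run: print the commands instead of starting anything.
for cmd in node_commands(3):
    print(" ".join(cmd))
```

Calling launch_all(3) would actually start the three nodes. The dry run is the default here so that you can check the arguments against your own version first.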

I also recommend installing Marvel, particularly for Sense, a nice tool for running commands on ElasticSearch. You can install Marvel with this command in the ElasticSearch root directory (for me, /usr/local/Cellar/elasticsearch/1.4.0/):

bin/plugin -i elasticsearch/marvel/latest

Now you can bring the cluster up with es_cluster


Working With Social Network APIs

Creating Vicinity Buzz naturally involved working with the APIs of social networks. That information seemed worth sharing for those of you interested in writing any type of application that would integrate with a social network.

Developer Documentation

Most of the social networking sites you would want to integrate with have well-documented developer APIs. Here are the starting points for a variety of services:

Working With JSON

All of these APIs are best used with JSON. If you’re not familiar with it, it’s the notation for serialization of JavaScript objects and object literals.

Where To Make the Call From

If you are working in a standard web page, you could call the API from document.ready (assuming you are using jQuery). This is the approach I take on my personal homepage, which has a Twitter feed on the right side.

If you have a bit more of an application, you may want to look at one of the many javascript frameworks that help you route events to actions. These are frameworks like backbone, knockout, spine, etc. There are also commercial variants like kendo, dojo, and sencha.

jQueryMobile is commonly paired with PhoneGap, and in that scenario, using something like backbone is a bit tricky. You may want to bring in a template binding library, but avoid routing.


jQuery.templates was one of the first good javascript template binders that I’m aware of, but there are now many different options. In the jQuery world, most of the momentum seems aimed at jsrender. Recently I’ve considered bringing in knockout and only using the binding part, but I’m not far enough in to evaluate that direction.

API Keys

Unless you’re using the most basic parts of the API, you’ll probably need to register your app and get an API key. It’s a token that identifies your application. In the event of API abuse (too many calls, etc.), the network has the information to contact you, along with analytics around the issue.

Open Authentication

This is a big topic, but if your application wants to use a social network to identify your users, this is possible via open authentication. If you are interested in this, get started here.

What Do You Think?

Are there any particular areas of the APIs that you’d like to see more detail about? Any conceptual parts that would warrant their own post? Let me know what you think below.

Dealing With An ORA-01440 When Altering a Table in Oracle

When altering an Oracle table to reduce a column’s size, you’ll get an ORA-01440 error if the column contains data, indicating that you can’t make the adjustment because of the potential data loss.

I had to do this recently to a bunch of columns, as I had incorrectly specified the size of numbers that would end up as integers in the application. Note that Oracle can treat any number column as an integer by specifying a scale of 0. But for entity framework (via the Devart providers) to map to a long (Int64), you want the precision and scale to be Number(19,0).
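The choice of 19 follows from the range of a 64-bit signed integer, which a quick Python check makes concrete. This is only illustrative arithmetic. The helper name fits_in_number is made up for the example, and it treats Number(p,0) as "any integer with at most p digits".

```python
INT64_MAX = 2**63 - 1   # largest value of a long (Int64)
INT64_MIN = -2**63      # smallest value of a long (Int64)


def fits_in_number(value, precision):
    """True if an integer has at most `precision` digits, i.e. fits Number(precision,0)."""
    return abs(value) < 10**precision


print(len(str(INT64_MAX)))                                           # 19
print(fits_in_number(INT64_MAX, 19), fits_in_number(INT64_MIN, 19))  # True True
print(fits_in_number(INT64_MAX, 18))                                 # False
```

So 19 digits is the smallest precision that can hold every Int64 value.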

Since the fields were left at the default size, they were too large. And since many of these fields were a primary key or foreign key, constraints were also an issue. So the table had to be backed up and restored. In order to do this safely, any triggers needed to be disabled.

Rather than do this repeatedly for many tables and risk missing something, or making typos, I created a script generator. You enter the table, and add any foreign key constraints, and then generate a script. All that is left to do is to add the columns to be re-sized to the alter statement in the middle of the script.
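To give a feel for what such a generated script contains, here is a minimal sketch in Python. It is not the generator itself (that one is javascript). The function name, the _BAK suffix for the backup table, and the example table and constraint names are all hypothetical, and the sketch takes the column changes as an argument instead of leaving them to be filled in by hand.

```python
def resize_script(table, foreign_keys, column_changes):
    """Return the SQL statements, in order, for shrinking columns on a populated table.

    foreign_keys: (child_table, constraint_name) pairs that reference `table`.
    column_changes: (column_name, new_type) pairs, e.g. ("ID", "NUMBER(19,0)").
    """
    backup = table + "_BAK"  # hypothetical naming convention for the backup copy
    modify = ", ".join("%s %s" % (col, new_type) for col, new_type in column_changes)

    # Back up the rows and switch off anything that would react to the delete.
    script = ["CREATE TABLE %s AS SELECT * FROM %s" % (backup, table),
              "ALTER TABLE %s DISABLE ALL TRIGGERS" % table]
    script += ["ALTER TABLE %s DISABLE CONSTRAINT %s" % (child, name)
               for child, name in foreign_keys]
    # Empty the table, resize the columns, then restore the rows.
    script += ["DELETE FROM %s" % table,
               "ALTER TABLE %s MODIFY (%s)" % (table, modify),
               "INSERT INTO %s SELECT * FROM %s" % (table, backup)]
    # Put the constraints and triggers back and drop the backup.
    script += ["ALTER TABLE %s ENABLE CONSTRAINT %s" % (child, name)
               for child, name in foreign_keys]
    script += ["ALTER TABLE %s ENABLE ALL TRIGGERS" % table,
               "DROP TABLE %s" % backup]
    return script


statements = resize_script("ORDERS",
                           [("ORDER_ITEMS", "FK_ITEMS_ORDER")],
                           [("ID", "NUMBER(19,0)")])
print(";\n".join(statements) + ";")
```

The ordering is the important part: the alter statement only succeeds once the table is empty, which is why it sits in the middle of the script between the delete and the restore.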

The script and generation code is all written in javascript, using jquery and jquery templates. If you’re interested in the code, I have it up on github.

On Terminology: “Single Source of the Truth”

According to Wikipedia, Single Source of the Truth “refers to the practice of structuring information models and associated schemata, such that every data element is stored exactly once” (emphasis is mine). This would mean, for example, that a customer’s first name is stored in one repository, not in every system that refers to the customer.

First, it’s a concept that is both difficult and subject to various interpretations and implementations. The Wikipedia page does a nice job of mentioning the difficult parts, like dealing with the schemata of vendor products, etc. As for the variety of implementations, you can enforce this in a dogmatic way where data is truly only stored in one place. Or you can implement it with policy, having a location for each piece of data that is considered the master, while other locations are responsible for publishing changes to that source and updating periodically from it. Either way, it is clear that this is a strategy to choose judiciously.

Additionally, choosing this strategy requires strong consideration of the effects on performance, reliability, and caching. If secondary storage is allowed, then stale data and concurrency issues arise. If secondary storage is prohibited, then you now have a single point of failure for many applications. Using the example of a CRM system being the single source of a customer’s first name, imagine the impact of that CRM system being down if other applications are not allowed to store that data.

So why this post? Why all this time and effort to define the term and discuss some of its nuances? In a variety of workplaces, I’ve seen this catch on as part of the lingua franca between business and IT workers, but used carelessly. And the number one problem is that I’ve seen Type A managers use this term to justify their oversimplified view of information management.

Notice the emphasis on information, and that I emphasized data in the definition “refers to the practice of structuring information models and associated schemata, such that every data element is stored exactly once.” Information is data within a context, and that’s the key problem when you get sloppy with the concept of “Single Source of the Truth.”

In one example, a particular manager had a problem with the fact that weather data was being stored in many different systems across the enterprise. I was part of a team tasked with creating a single consolidated data store and import program for all weather data across the company, because of his goal of having a single source of the truth. Shortly after looking into the other systems, it was clear that he didn’t grasp the ramifications of the concept.

Weather is a key factor in the demand for this customer, and so it is the basis of historical analysis, contract bidding, and countless other aspects of their business. To our anonymous manager, that meant it was crucial to consolidate this information and have only one source. He was certain that people were out there using inconsistent sources that were causing efficiency problems, among other things.

Let’s start with the different types of weather data. There are forecasts and actuals. There is daily weather and hourly (and daily sources are peak for some uses, average for others). Finally, it’s worth noting that weather data is often corrected later, when the real-time value provided was measured incorrectly, or some other type of error occurred.

So let’s assume that we’re trying to consolidate hourly actual data. All applications should use this source. And let’s look at a couple of those uses:

  1. A bid for service is based on historical data, where the agent writing the bid used that weather data to evaluate the customer’s demand sensitivity to weather, and to evaluate the company’s supply as a function of weather.
  2. A report on the effect of weather on supply is regularly supplied to operations managers.
  3. The accuracy of this data supplied by an outside vendor is to be regularly audited by Supply Chain.

Now, let’s assume those activities have taken place for the month of April, and it’s the middle of May. Now the vendor comes in with correction data for the middle of April. For the first purpose (the contract), I want to store what information was used to write the contract at the time. It’s the only fair way to evaluate the agent, as he wrote the bid based on the best available information.

Because supply and production are naturally affected by weather, for the second purpose (operations evaluation), I want to rerun those reports based on new, more accurate information.

Even more disruptive is the fact that in order to evaluate the variances in the accuracy of the vendor data, the company should be storing both values.

This leaves you at a decision point: do you handle this by declaring these to be different information, or by versioning the information? In the first approach, the bid history is linked to uncorrected vendor data, and the updates are used to create a corrected data source that can be used for the operations purpose.

The alternative is that corrections cause the creation of a new set, but all sets are retained. Differentiation is handled with a version number or timestamp, and all the above problems are solved. While this sounds simple, versioned data grows quickly, and is difficult to query and understand.

Due to the timestamp, each record can now be referred to as the single source of weather data for that location and occurrence date and time (date of the weather), as provided on said date and time (import time). But for each location and time, there are multiple potential values as corrections are entered. And there is forecast vs. actual data.

So to be precise, I still can’t say “give me Cleveland’s weather for March 7, 2011.” I would have to say “give me the actual weather value for Cleveland on March 7, 2011 that was available when I wrote a bid on April 5th.” Or in the case of an operations manager, they would request “the latest value of actual weather for Cleveland on March 7, 2011.”
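As a rough illustration of those two requests, here is a small Python sketch of an "as of" lookup over versioned readings. It is not code from the project. The Reading fields, the value_as_of helper, and the temperature values are all made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Reading:
    location: str
    observed_at: datetime   # when the weather occurred
    imported_at: datetime   # when this value (or correction) was received
    value: float


def value_as_of(readings: List[Reading], location: str, observed_at: datetime,
                as_of: Optional[datetime] = None) -> Optional[Reading]:
    """Return the newest reading imported on or before `as_of` (newest overall if None)."""
    candidates = [r for r in readings
                  if r.location == location and r.observed_at == observed_at
                  and (as_of is None or r.imported_at <= as_of)]
    return max(candidates, key=lambda r: r.imported_at) if candidates else None


march_7 = datetime(2011, 3, 7, 12)
readings = [
    Reading("Cleveland", march_7, datetime(2011, 3, 7, 13), 41.0),  # original vendor value (made up)
    Reading("Cleveland", march_7, datetime(2011, 4, 20), 38.5),     # later correction (made up)
]

# What the agent could see when writing the bid on April 5th.
bid_view = value_as_of(readings, "Cleveland", march_7, as_of=datetime(2011, 4, 5))
# What an operations manager sees today: the latest correction.
ops_view = value_as_of(readings, "Cleveland", march_7)
print(bid_view.value, ops_view.value)  # 41.0 38.5
```

Same location, same date, two correct answers, and the only thing separating them is the context of the question.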

Those are different pieces of data. But I don’t think that’s what our manager had in mind when he requested a single source of the truth for weather data. Because he meant weather information. Context / details / reality didn’t fit the mental model he had of weather data.

In the case of this project, we were able to slightly reduce the amount of weather data stored. And we certainly reduced the number of batch jobs involved in fetching that data from external sources. But we also created a performance and reliability bottleneck. That may or may not have been the right decision. My point is that it is worth taking some time to think through and understand the terms you are using. Sometimes simple answers are great, but sometimes they are really just a sign of naivety.

On Commuting and The Economy

Yesterday, I left downtown Cleveland at 3:45 headed to a 4 o’clock meeting. I was probably going to be 5-10 minutes late. Instead I ended up calling to reschedule, and still didn’t make it home till 6:15. Two and a half hours, for a drive that usually takes me 45-55 minutes. Google maps says 38 minutes, but that’s not realistic on a weekday. I was virtually at a standstill on I-77, and I saw some helicopters coming and going, so I assume this was a very bad day for someone up ahead. As frustrating as the experience was, delaying my day (and many others’) is a small price for life-saving flights to the hospital.

But the traffic did get me thinking that most traffic jams are a multifaceted problem. I’m referring to those simply caused by congestion, fender-benders, or traffic stops (and the associated gawking). Wikipedia does a nice job of listing the negative effects. But I think in the current context of the US today, it’s worse than what they list. And I think we have the power to mitigate some of this.

As already mentioned in the Wikipedia list, there are opportunity costs, a massive pollution increase, and psychological effects from traffic problems. What about our current time period makes this worse? Try the housing market. How? Workers commuting a long distance who want to avoid the risks of long traffic tie-ups aren’t nearly as free to move closer to their jobs. Or they aren’t looking at far away jobs merely because of the commute. The tie-in between housing (mobility of workers) and jobs is clear; add urban congestion to that fire. Also, construction and maintenance projects that can make for better commutes aren’t exactly popular, particularly when tied to state and local budgets. Unlike the Federal Government, these state and local entities can’t run large deficits during times of tax revenue decline. Finally, consider the wasted fuel (which is getting more expensive with turmoil in the Middle East) and its effects on household budgets that are already stretched thin.

So what remedies exist?

The White House has been pushing for high speed rail projects across the country, but some states have turned the money down fearing the investment they would have to put with it. There are a lot of questions about the value, but it’s hard to imagine that making people more fluid is a bad thing for commute times and the job market. With this in mind, I asked a question about the speed of Ohio’s rail on

Telecommuting has gained a lot of momentum, although I expect there has been some reduction during the recession (office space is not as much of an issue with a contracting workforce). While I’ve never been a big fan of working from home, it’s clear that it can save both the individual and company time and dollars.

GPS Systems are increasingly integrating traffic data. Just like emissions standards, having smart traffic systems as mandatory in cars could go a long way to assist in intelligent rerouting of commuters in the event of a backup. How many times have you been in a traffic jam and felt like you were rolling the dice when deciding whether to get off the highway and try another way? There are even social GPS systems like Waze that attempt to address this.

Google has been working on driver-less cars for a while now. Certainly safety, reliability and such are an early concern during testing. Once refined and proven, however, this technology would drastically reduce accidents and traffic stops, and save lives. If everyone were driving such cars (this is a loooong way out), speed limits could be drastically increased with little additional safety risk.

IBM, under their Smart Planet initiative, have been researching and implementing smart traffic systems.

I can only hope that some of these advancements lead to the kind of information available to drivers that is portrayed in this video “A Day Made of Glass” by Corning.

I think the challenge is finding reasonable first steps and getting some coordination between these initiatives. Given a recession, and global competition from rising powers like China and India, the US could gain a lot of output from simple efforts to improve traffic scenarios. And maybe civil engineers (specifically transportation engineers) are on top of these ideas, but for now it certainly doesn’t look like the US is leading the way with solving these issues. Even if commute times aren’t drastically reduced, with solutions like the Google car or high speed rail, imagine the productivity increase of free commuting time with internet available. You could use the time to pay your bills, catch up on news, do correspondence course work, etc. For some commuters, this is already a reality.

Some of the stimulus money was aimed at these kinds of projects, but in my mind, not enough. The long term economic effects of a mobile workforce are undeniable. And these efforts could pay off in terms of global competition for years to come.

What do you think? Does your city have solutions or efforts under way for this? Do you see a particular effort or company leading the way with this? I see mostly efforts coming from technology companies, but are there other significant efforts to address these issues?