Tag Archives: data

Working With Social Network APIs

Creating Vicinity Buzz naturally involved working with a the APIs of social networks. That information seemed worth sharing for those of you interested in writing any type of application that would integrate with a social network.

Developer Documentation

Any of the social networking sites you probably want to integrate with have developer api’s that are well documented. Here’s the starting points for a variety of services:

Working With JSON

All of these APIs are best used with JSON. If you’re not familiar, you can read up at json.org. It’s the notation for serialization of javascript objects, and object literals.

Where To Make the Call From

If you are working in a standard web page, you could call the api from document.ready (assuming you are using jquery). This is the approach I take on hoolihan.net, my personal homepage. There is a twitter feed on the right side.

If you have a bit more of an application, you may want to look at one of the many javascript frameworks that help you route events to actions. These are frameworks like backbone, knockout, spine, etc. There are also commercial variants like kendo, dojo, and sencha.

jQueryMobile is commonly paired with PhoneGap, and in that scenario, using something like backbone is a bit tricky. You may want to bring in a template binding library, but avoid routing.


jQuery.templates was one of the first good javascript template binders that I’m aware of, but there are now many different options. In the jQuery world, most of the momentum seems aimed at jsrender. Recently I’ve considered bring in knockout and only using the binding part, but I’m not far enough in to evaluate that direction.

API Keys

Unless you’re using the most basic parts of the API, you’ll probably need to register your app and get an API key. It’s a token that identifies your application. In the event of API abuse (too many calls, etc), they have information to contact you and analytics around the issue.

Open Authentication

This is a big topic, but if your application wants to use a social network to identify your users, this is possible via open authentication. If you are interested in this, get started here.

What Do You Think?

Are there any particular areas of the APIs that you’d like to see more detail about? Any conceptual parts that would warrant their own post? Let me know what you think below.

Boiling the Ocean (or Attempting to Keep Up With Tech)

I was asked by a fellow developer how I keep up to date on the variety of technologies that I’m expected to understand when doing my job. I intended to write a short answer, and generate a blog post as a longer answer. I got this idea from Scott Hanselman, who in his talks about productivity mentions saving keystrokes, and the idea that when you answer a question for someone, you should share that in a way that is useful to you and others. With that in mind, here’s the advice I would give if you find yourself taking on a job that requires a broad knowledge of the technology and software.

Question: Somebody in your position obviously needs to understand a lot of technologies to be able to pick the best solution for a project. How do you approach learning and having an understanding of all the new stuff that is constantly coming out? Obviously you can’t sit down and learn every tiny detail, but do you just obtain a high level understanding of how things work and then flush out the details once you start a project in a technology that you’ve never used before?

I use Google Reader a lot, I just checked and I have 121 rss/atom feeds going into it. They are not all work related, and it’s not like I read every article. But I star stuff I want to read, or put it in Instapaper and then go back to it. I try to read a mix of updates on what I already know and interesting things about what I don’t.

As for how to learn and retain, and relate technology, that is a harder question. I’d love to tell you that you just get a high level exposure first and then fill in the details when needed, but my mind doesn’t work that way. I hated classes like Systems Analysis in college that only talked about systems in generic terms, because I couldn’t relate those terms to specific examples.

My own take is that in the beginning of your career, you just have to learn practical skills and learn the tools you work with well. When you start wanting to learn about systems and being able to evaluate, compare, select, recommend, etc, then you pick up as follows:

Let’s say I’ve never used a database, and I’m assigned to work on a project that will use Ruby on Rails and PostgreSQL. Take the on the job time to learn the tool as well as you can, focusing on the aspects you need for your job. Where possible, try to understand and separate product names / features from conceptual names / features. In this example, that means understanding what database and schema mean in generic terms, and why they mean different things in the PostgreSQL than they do in some other products (like Oracle or SQL Server). Spend a little bit of off time playing around with an easily attainable (usually open source) alternative. Spend a night or two doing those same types of tasks in MySQL. You’re not looking to be a dba, just doing similar tasks as to what you do. Finally, talk to people who use other products and compare. This works great with products that are hard to get a hold of because of cost, etc. In this example, talk to an Oracle DBA or App Developer. What features do they like about their product that would prevent them from switching. If you don’t know one, ask the question online. Quora, LinkedIn, Twitter are all great places for this kind of question.

At this point, you’re out of work time (so not counting your project time working with PostgreSQL because you would have done that anyway) is 4-6 hours on MySQL and maybe another hour in conversation. But you should have a really good idea of the concepts surrounding relational databases.And you should know what the different projects compete on in that arena and what some of their strengths and weaknesses. If you spend time away from that field and come back in, you should have the conceptual understanding and just need to buff up on the implementation details and latest trends.

Two other quick pieces of advice. First – After getting the practical experience on a type of tool, read the Wikipedia page. Most pages are at most a 15 min read and they lay out the purpose and strategy behind any tool type, and usually list the major players in that area. Second – Try to keep personal opinion from having to much sway. We all prefer different tools, but every tool can be criticized. That’s important to do, and know it’s weakness, but don’t dismiss the tool outright. They were built for a reason, a context. And many of the weaknesses relate to some tradeoff the programmers / vendor made that has a good reason behind it.

On Terminology: “Single Source of the Truth”

According to Wikipedia, Single Source of the Truth “refers to the practice of structuring information models and associated schemata, such that every data element is stored exactly once” (emphasis is mine). This would mean, for example, a customer’s first name to be stored in once repository, not in every system that refers to the customer.

First, it’s a concept that is both difficult, and subject to various interpretations and implementations. The Wikipedia page does a nice job of mentioning the difficult parts, like dealing with the schemata of Vendor products, etc. As for the variety of implementations, you can enforce this in a dogmatic way where data is truly only stored in one place. Or you can implement with policy, having a location for each piece of data that is considered the master, and other pieces of data are responsible for publishing changes and updating periodically from that source. Either way, it is clear that this is a strategy to choose judiciously.

Additionally, choosing this strategy requires strong consideration of the effects on performance, reliability, and caching. If secondary storage is allowed, then stale data and concurrency issues arise. If secondary storage is prohibited, then you now have a single point of failure for many applications. Using the example of a CRM system being the single source of a customer’s first name, imagine the impact of that CRM system being down if other applications are not allowed to store that data.

So why this post? Why all this time and effort to define the term and discuss some of it’s nuances? In a variety of work places, I’ve seen this catch on as part of the lingua franca between business and IT workers, but used carelessly. And the number one problem is that I’ve seen Type A managers use this term to justify their oversimplified view of information management.

Notice the emphasis on information, and that I emphasized data in the definition “refers to the practice of structuring information models and associated schemata, such that every data element is stored exactly once.” Information is data within a context, and that’s the key problem when you get sloppy with the concept of “Single Source of the Truth.”

In one example, a particular manager had a problem with the fact that weather data was being stored in many different systems across the enterprise. I was part of a team tasked with creating a single consolidated data store and import program for all weather data across the company, because of his goal of having a single source of the truth. Briefly after looking into the other systems, it was clear that he didn’t grasp the ramifications of the concept.

Weather is a key factor in the demand for this customer, and so it is the basis of historical analysis, contract bidding, countless other aspects of their business. To our anonymous manager, that meant it was crucial to consolidate this information and have only one source. He was certain that people were out there using inconsistent sources that were causing efficiency problems, among other things.

Let’s start with the different types of weather data. There are forecasts and actuals. There is daily weather and hourly (and daily sources are peak for some uses, average for others). Finally, it’s worth noting that weather data is often corrected later, when the real-time value provided was measured incorrectly, or some other type of error occurred.

So let’s assume that we’re trying to consolidate hourly actual data. All applications should use this source. And let’s look at a couple of those uses:

  1. A bid for service is based on historical data, where the agent writing the bid used that weather data to evaluate the customers demand sensitivity to weather, and to evaluate the companies supply as trend of weather.
  2. A report on the effect on weather on supply is regularly supplied to operations managers.
  3. The accuracy of this data supplied by an outside vendor is to be regularly audited by Supply Chain.

Now, let’s assume those activities have taken place for the month of April, and it’s the middle of May. Now the vendor comes in with correction data for the middle of April. For the first purpose (the contract), I want to store what information was used to write the contract at the time. It’s the only fair way to evaluate the agent, as he wrote the bid based on the best available information.

Because supply and production is naturally affected by weather, for the second purpose (operations evaluation), I want to rerun those reports based on new, more accurate information.

Even more disruptive is the fact that in order to evaluate the variances in the accuracy of the vendor data, the company should be storing both values.

This leaves you at a decision point: Do you handle this by declaring these as different information, or version the information. In other words, the bid history is linked to uncorrected vendor data, and the updates are used to create a corrected data source that can be used for the operations purpose.

The alternative is that that corrections cause the creation a new set, but all sets are retained. Differentiation is handled with a version number or timestamp, and all the above problems are solved. While this sounds simple, versioned data grows quickly, and is difficult to query and understand.

Due to the timestamp, each record can now be referred to as the single source of weather data, for that location, occurance date and time (date of the weather), as provided on said date and time (import time). But for each location and time, there are multiple potential values as corrections are entered. And there is forecast vs actual data.

So to be precise, I still can’t say “give me Cleveland’s weather for March 7, 2011.” I would have to say “give me the actual weather value for Cleveland on March 7, 2011 that was available when I wrote a bid on April 5th.” Or in the case of an operations manager, they would request “the latest value of actual weather for Cleveland on March 7, 2011.”

Those are different pieces of data. But I don’t think that’s what our manager had in mind when he requested a single source of the truth for weather data. Because he meant weather information. Context / details / reality didn’t fit the mental model he had of weather data.

In the case of this project, we were able to slightly reduce the amount of weather data stored. And we certainly reduced the amount of batch jobs involved in fetching that data from external sources. But we also created a performance and reliability bottleneck. That may or may not have been the right decision. My point is that it is worth taking some time to think through and understand the terms you are using. Sometimes simple answers are great, but sometimes they are really just a sign of naivety.