Twitter data stream redux

A couple of weeks ago I posted my slides from Social Media Jungle Boston, "Twitter as a Universal Data Stream." Jared Rosoff responded with a series of thoughtful questions and let me know it was fine for me to post and answer them here.

Jared's questions are marked with "--", and my responses are flagged RMD.

1) Centralized vs. Distributed event stream architecture

-- Unclear whether Twitter can scale to handle the flood of data that would come with "automated publishing" from things like sensors and systems. Not sure if this is a systemic problem of this kind of architecture, or rather a Twitter-specific deficiency.

RMD: Clearly Twitter can't scale to handle its own organic growth right now. I think this is a Twitter-specific deficiency. However, putting "all the world's messages" (automated and not) onto Twitter would increase the load by several orders of magnitude. That could push the load to a level where the problems really are systemic. Hopefully this is not the case ... see Sigma portfolio company Tervela ... perhaps they can help.

-- Do we, as a community, really want to endow any single company with all of the value that this data holds?

RMD: Great question. Google already has "all" the public data (effectively), but this goes to a new level when I suggest proprietary streams be published using Twitter as well. Answer: probably not, at least not without assurance that Twitter has no ownership rights it can assert. However, charging for access to such a rich *and integrated* data stream seems OK to me. If the data provider wants to charge for access, perhaps Twitter can be a pass-through biller (with appropriate markup). Perhaps the source data provider would also offer other means of access to assure a competitive marketplace for the data. Then Twitter's specific value becomes the fact that "all" the data is available in one place.

-- Corollary: Are companies that have valuable data going to be willing to hand over the keys to someone else?

RMD: probably not! However, if the value to their customers of integrating their own data with a wealth of other data is sufficient, then perhaps.

-- With Twitter's focus on social updates, do you think a different player will emerge that will focus on more "machine readable" event streams?

RMD: very possibly ... want to pitch a startup?

2) Search / Filter methodology

-- Search on Twitter is pretty limited. To do the kind of analytical reasoning you suggest in your prezo, it strikes me that you need a more sophisticated query language. Something beyond keyword search. Need to be able to deal with structured and semi-structured data as well as calculate statistics over the set of data in the streams. Do you have thoughts on what kind of search technology is needed for a system like Twitter?

RMD: You are right. The 140 character limit is also terrible for this ... back to "want to pitch a startup?" The character limit is probably what does this in ... a longer (or unlimited) message can be tagged (either with folksonomy-based hashtags, or with references to published ontologies). Once properly tagged, the data becomes searchable, and we are back to "just" a scale problem.
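
To make the tagging idea concrete, here is a minimal sketch of how hashtag-tagged messages could be indexed and then searched by tag rather than by free-text keyword. It assumes messages are plain strings carrying folksonomy-style hashtags. The function names (`extract_tags`, `build_tag_index`, `search`) and the sample messages are hypothetical illustrations, not part of any Twitter API.

```python
import re
from collections import defaultdict

# Assumption: a tag is "#" followed by word characters.
TAG_PATTERN = re.compile(r"#(\w+)")

def extract_tags(message):
    """Return the set of lowercase hashtags found in a message."""
    return {tag.lower() for tag in TAG_PATTERN.findall(message)}

def build_tag_index(messages):
    """Map each tag to the set of message positions that carry it."""
    index = defaultdict(set)
    for position, message in enumerate(messages):
        for tag in extract_tags(message):
            index[tag].add(position)
    return index

def search(index, *tags):
    """Return sorted positions of messages carrying every requested tag."""
    if not tags:
        return []
    hits = [index.get(tag.lower(), set()) for tag in tags]
    return sorted(set.intersection(*hits))

# Hypothetical sample messages, made up for illustration only.
messages = [
    "Sensor 12 reports 71F #temperature #boston",
    "Traffic slow on I-93 #traffic #boston",
    "Sensor 40 reports 64F #temperature #seattle",
]
index = build_tag_index(messages)
print(search(index, "temperature", "boston"))
```

An in-memory dictionary like this obviously does not address scale. It only shows that once messages are tagged, the lookup itself is simple, which is why the remaining problem is "just" scale.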

-- Is analytical reasoning separate from the event stream aggregator? Or does it need to be part of it? In other words, can I plug a tool like Visual Sciences or SAS into the data I get out of the event-stream or am I limited to the analytical tools that Twitter (or a Twitter-like business) provides me? The benefit of having the tools at Twitter is that they can access the "whole data set" whereas if I'm using SAS or VS, I'm probably working on some subset of data that I've downloaded...

RMD: I definitely want to be able to use third-party tools to do the search/filter/analysis. My original assertion was based on Twitter's very simple, easy and *open* API, which (assuming volume access is allowed) enables you to use whatever tools you want. Your notion of tools built into Twitter makes sense (another revenue idea for them). If this all ever really happened I imagine you would be able to subscribe to sub-streams from Twitter for real-time analysis, as well as going back later for data exploration.
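
As a rough illustration of the sub-stream idea, the sketch below filters a full event stream down to the events a subscriber cares about and hands the result to an ordinary analysis step. Everything here is an assumption made for the example: the events are modeled as Python dictionaries, the `substream` and `is_boston_temperature` names are hypothetical, and the data is invented. A real service would expose this through its own API.

```python
from statistics import mean

def substream(events, predicate):
    """Lazily yield only the events that match the subscriber's predicate."""
    for event in events:
        if predicate(event):
            yield event

def is_boston_temperature(event):
    """Hypothetical subscription: temperature readings from Boston."""
    return event.get("kind") == "temperature" and event.get("city") == "boston"

# Invented sample events standing in for the full stream.
full_stream = [
    {"kind": "temperature", "city": "boston", "value": 70.0},
    {"kind": "traffic", "city": "boston", "value": 12.0},
    {"kind": "temperature", "city": "seattle", "value": 64.0},
    {"kind": "temperature", "city": "boston", "value": 74.0},
]

# The filtered values could be passed to any third-party tool;
# here a simple average stands in for the analysis step.
readings = [event["value"] for event in substream(full_stream, is_boston_temperature)]
print(mean(readings))
```

Because `substream` is a generator, the same filter works on a live feed for real-time analysis or on a stored archive for later exploration.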

All great questions - thanks Jared!
