Data Streams and Big Data Feeds

As data science evolves, and Open Data initiatives, notably in the US and the UK, gain enough traction to feature in the headlines, we need to take a step back and rethink our paradigm. The "download, clean, analyse, report" paradigm is too minor a variation on old practices. We have non-linearised and high-dimensionalised our modelling tools, we have built huge industries around automated data preprocessing, we have switched from dull reports to dynamic infographics on iPads, and we can certainly handle more data than ever before. But we still largely think of data analysis as a procedure with a beginning (the data) and an end (the report).

And yet we know we can do better. We don't really need to look at web giants such as Facebook, Twitter and Google to appreciate the massive added value of constantly revised, up-to-date insights over that of occasional analyses. Far more traditional industries have been playing the "on-the-fly" game for a long time - adaptive monitoring and control of complex systems in real time is the norm in signal processing and industrial process control, and traders were employing "sliding windows" to view only the most recent chunk of data long before machine learning and Hadoop came into being. Indeed, some basic ideas borrowed from such fields can take you a surprisingly long way in continual data analysis - but, naturally, it only gets interesting after the low-hanging fruit has been plucked.
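To make the "sliding window" idea concrete, here is a minimal Python sketch of a running mean over the most recent observations. The class name `SlidingWindowMean` and the window size are illustrative assumptions, not taken from any particular trading or signal-processing library.

```python
from collections import deque


class SlidingWindowMean:
    """Running mean over the most recent `size` observations (illustrative sketch)."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        # A deque with maxlen drops the oldest item automatically on append.
        self.window = deque(maxlen=size)
        self.total = 0.0

    def update(self, value):
        # If the window is full, the oldest value is about to be evicted,
        # so remove its contribution from the running total first.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)


if __name__ == "__main__":
    mean = SlidingWindowMean(size=3)
    for observation in [1, 2, 3, 4]:
        print(observation, mean.update(observation))
```

Each call to `update` costs constant time regardless of how much data has gone by, which is what lets this kind of summary run indefinitely on a feed rather than being recomputed from scratch.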

I am enthusiastic about this field, and it's what my research is all about. But it presupposes constant access to up-to-date data - and that's a surprisingly rare animal. I don't need larger datasets; I need data feeds - much like upgrading from a puddle to a swimming pool doesn't do you much good if what you need is a river source. At high frequencies, data feeds are often referred to as data streams - but high frequency is less crucial than continual updates (even weekly updates can prove challenging if the data is big enough). To my mind, the criterion that differentiates a data feed from a dataset is operational: can I build an analytics tool around it that will operate continually without my having to rebuild the thing after each data update? A naive representation of a data feed could be a file where each line is one observation, and one line is added every day, in exactly the same format, continually. But naturally real datasets do not look like this. Formats inevitably change, variables are revised, new information is added and some information is aggregated. And yet I am sure that such modifications could cause the data scientist much less hassle if implemented with care and consideration.
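As a rough illustration of the naive feed described above (one observation per line, appended over time), the following Python sketch reads only the rows added since the last visit. The function name `read_new_rows`, the CSV layout with a header row, and the UTF-8 encoding are all assumptions made for the example; it also assumes no field contains an embedded newline.

```python
import csv


def read_new_rows(path, offset=0, fieldnames=None):
    """Return (rows, new_offset, fieldnames) for lines appended since `offset`.

    Hypothetical helper: assumes an append-only CSV file with a header line.
    """
    with open(path, "rb") as f:
        if fieldnames is None:
            # First visit: learn the column names from the header line.
            header = f.readline().decode("utf-8")
            fieldnames = next(csv.reader([header]))
            offset = f.tell()
        f.seek(offset)
        chunk = f.read()
    # Consume complete lines only; a half-written last line waits for the next call.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").splitlines()
    rows = list(csv.DictReader(lines, fieldnames=fieldnames))
    return rows, offset + end, fieldnames
```

A tool built this way keeps `offset` and `fieldnames` between runs and calls `read_new_rows` on a schedule, so a daily update never forces a full reload. Keying each row by column name rather than by position is one small way to soften the format changes mentioned above, although it does not handle renamed or revised variables.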

So the Continual Data Blog is about the Data Feed revolution: the type of data sources that we can rely on to build "always-on" analytics on top of, enabling computational intelligence to permeate our environment in the same seamless way that weather forecasts, news headlines and Twitter feeds have. For those of you who care to join me, our objectives will be to: a) identify existing data feeds (as opposed to one-off open data releases), and b) brainstorm on how existing practices in this field can be improved.

1 comment:

George Dementis, 11 March 2012 at 09:14
I came across the Open Data Protocol (http://www.odata.org/). It helps with filtering the data returned from a data source. Microsoft supports it in order to simplify Web APIs (the Facebook API, for instance, is a Web API) (http://weblogs.asp.net/scottgu/archive/2012/02/23/asp-net-web-api-part-1.aspx), and I would think such APIs could be used to implement data-stream-like services. I'm not sure whether the protocol supports the type of functionality you have in mind. Have a look and let us know if it's in the right direction.

Friday, 24 February 2012
