Structured data vs. unstructured data: The two pillars of successful big data analysis

There’s so much noise and misunderstanding in the tech world now that it can be hard to sort through it all.

Most of the time, this kind of noise corrects itself over the course of the hype cycle: if a service or innovation doesn’t work well, it falls out of fashion and others replace it.

But then you get events like the recent dump of Facebook user data. One thing that seems clear is that we are drowning in data, and we need data scientists to manage it. They face enormous problems.

Consider, for example, the statistics surrounding developing economies.

Big data brings both good news and bad news. Start with how things used to be: not so long ago, forecasting earthquakes meant flicking through seismograms.

Now, thanks to analysts’ ability to sift through the terabytes of data that drive global communications, these natural disasters can be classified into asynchronous and synchronous seismic events, along with their associated flows, and forecasts can be made on that basis.

Demographic and economic issues

In India, geo-demographic and economic issues combine to create a great deal of noise. Pakistan’s army helped spread the government’s message in 2008 by setting up a YouTube channel, which received millions of views.

India now has hundreds of similar channels, which have been hugely influential in shaping public opinion there.

But the problem here is that the gap between unstructured and structured data is a wide one. Much of the information being monitored for that purpose can be fully detailed yet useless to the data scientist. For example, the Irish government’s data on gun control was so detailed that I believed I had access to a fictitious gunman called “Phyllis Ellen.”

In fully structured form, meaning data held in “traditional” sources such as official government archives and census records, Phyllis could have been identified. But unstructured data may be geocoded and scattered across different sources. How does a data scientist know, for example, that a Holocaust survivor living in the highly urbanised Chinese city of Xi’an belongs to the same ethnic group as his or her grandchildren, who arrived later in an equally densely populated village in the province of Hunan?
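To make the structured/unstructured gap concrete, here is a minimal sketch of matching a name in free text against a structured registry. Everything in it is an assumption for illustration: the registry rows, the sample note, and the helper names `normalize_name` and `find_mentions` are made up, and real record linkage needs far more than exact matching on cleaned-up names.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    letters_only = re.sub(r"[^a-z\s]", " ", without_accents.lower())
    return " ".join(letters_only.split())


# Hypothetical structured registry, standing in for rows of a census table.
registry = [
    {"id": 1, "name": "Phyllis Ellen"},
    {"id": 2, "name": "Seán O'Brien"},
]


def find_mentions(text: str, records: list[dict]) -> list[dict]:
    """Return the records whose normalized name appears in the free text."""
    haystack = f" {normalize_name(text)} "
    return [
        record
        for record in records
        if f" {normalize_name(record['name'])} " in haystack
    ]


# Hypothetical unstructured input: inconsistent case, spacing and punctuation.
note = "Witness statement: PHYLLIS  ELLEN was seen near the quay."
matches = find_mentions(note, registry)
print([m["id"] for m in matches])
```

The structured side is trivial to query by field; the unstructured side only becomes searchable after normalization, and even then a nickname, a misspelling or a transliteration would slip through.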

There are lots of interesting places to look for the answer, but the question shows how tricky the discipline of big data analytics is.

Given the huge number of challenges that big data poses, one thing in particular seems very promising: models for analysing unstructured data, whose results can be compiled into a coherent, predictable and effective set of outputs.

General purpose models

There are several general purpose models that can take data from multiple sources and reconcile it. This is a big advantage in the era of mega data because a single combined dataset can yield a wide range of conclusions from data that would be unusable on its own.
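As a rough illustration of what reconciling sources involves, the sketch below merges rows from two sources that describe the same entities under different field names. The sources, field names, values and the `reconcile` helper are all hypothetical, and the conflict rule (the first source wins, later sources only fill gaps) is just one possible choice, not a description of any particular model.

```python
from collections import defaultdict

# Two hypothetical sources describing the same districts with different schemas.
census = [
    {"district_code": "D01", "population": 120_000},
    {"district_code": "D02", "population": 85_000},
]
survey = [
    {"code": "d01", "median_income": 31_500},
    {"code": "D03", "median_income": 27_000},
]


def reconcile(sources: dict[str, tuple[list[dict], str]]) -> dict[str, dict]:
    """Merge rows from several sources on a normalized key.

    `sources` maps a source name to (rows, key_field). Each merged record
    keeps a `_sources` list so every record can be traced back to its origins.
    """
    merged: dict[str, dict] = defaultdict(lambda: {"_sources": []})
    for source_name, (rows, key_field) in sources.items():
        for row in rows:
            key = row[key_field].strip().upper()  # "d01" and "D01" are one key
            record = merged[key]
            record["_sources"].append(source_name)
            for field, value in row.items():
                if field != key_field:
                    # Earlier sources win on conflict; later ones fill gaps.
                    record.setdefault(field, value)
    return dict(merged)


combined = reconcile(
    {"census": (census, "district_code"), "survey": (survey, "code")}
)
for code in sorted(combined):
    print(code, combined[code])
```

Neither source can say anything about income per head on its own; only the merged record for a district present in both carries the two fields needed to ask that question.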

But these general purpose models are tied to the hardware and software tools we currently have at our disposal, and those tools are usually locked up as well.

One potential solution is that data science should be spun off from this specialized pool of expertise and connected with the larger superstructure of analytic tools.

We could create a school of data science that could be funded through a foundation, much the way that there is a school of biomedical science. If there were such a school, it would be funded because the data science community sees this as a vital adjunct to innovation.

If we did this, such a school could integrate seamlessly with the sets of tools used by other faculties. And those tools would be liberated from command-and-control servers and would no longer be closed off from developers.

The intended implications of this change are significant: more sophisticated tools would be freed, and new innovations would spring up because new problems could be foreseen.

There is, however, a huge catch. The situation would be similar to the problem many sports science students face when they are out on the field or the court: can the experts identify the full extent of the problem?
