What is Big Data?
Posted On: Nov 24, 2014
Big Data refers to data and information whose magnitude far exceeds the traditional kilobytes, megabytes, gigabytes, or even terabytes. Big Data is all about petabytes (1,000 terabytes), exabytes (1,000 petabytes), zettabytes (1,000 exabytes), yottabytes (1,000 zettabytes), and so on.
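To make these scales concrete, here is a minimal Python sketch of the decimal (SI) unit ladder, where each unit is 1,000 times the previous one; the `UNITS` list and `size_in_bytes` helper are illustrative names, not part of any standard library.

```python
# Illustrative sketch: data-size units, each step 1,000x the previous.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

def size_in_bytes(unit: str) -> int:
    """Return the size of one `unit` in bytes, using decimal (SI) prefixes."""
    return 1000 ** (UNITS.index(unit) + 1)

print(size_in_bytes("petabyte"))                               # 10**15 bytes
print(size_in_bytes("petabyte") // size_in_bytes("terabyte"))  # 1000
```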
In the data and information age, with the advent of powerful data storage and analysis mechanisms, businesses have profited greatly and continue to draw valuable inferences from archived data; hence the need for Big Data. Every byte of data is important, and advances in data processing engines have given way to Big Data. It is not just about the magnitude of the data: Big Data is about four dimensions, called the 4 V's – Volume, Velocity, Variety, and Veracity.
Big Data is always large in volume, anywhere from petabytes to yottabytes in size. The problem is simple: although the storage capacity of hard drives has increased significantly over the years, access speed, i.e. the rate at which data can be read from a drive, has not increased proportionately. The obvious way to reduce read time is to read from multiple disks at once. To store and retrieve a large amount of data in less time (that is, to increase the velocity of data fetching), the data is stored in chunks and processors work in parallel, so that all the chunks can be fetched in less time. Big Data processing techniques also include tools that can handle a wide variety of data, ranging from structured (tabular formats, comma-separated text, etc.) through semi-structured (e.g. XML, JSON) to unstructured data (audio and video streams). The last dimension of Big Data is Veracity: a Big Data system must be smart enough to separate useful data from junk, so that a decision can be made about which data to keep and which to discard.
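The chunk-and-parallelize idea above can be sketched in a few lines of Python. This is a toy model, not a real storage engine: in-memory byte strings stand in for disks, and `read_chunk` is a hypothetical helper that a real system would replace with an I/O request to a specific drive.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model: one logical file split into chunks, each stored on a
# separate "disk". Reading the chunks concurrently approximates how
# parallel reads reduce total fetch time.
chunks = [b"chunk-0 ", b"chunk-1 ", b"chunk-2 ", b"chunk-3 "]

def read_chunk(i: int) -> bytes:
    # In a real system this would issue an I/O request to disk i.
    return chunks[i]

# Fetch all chunks in parallel; map() preserves chunk order.
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    parts = list(pool.map(read_chunk, range(len(chunks))))

data = b"".join(parts)  # reassemble the original file in order
```

Because `ThreadPoolExecutor.map` returns results in submission order, reassembly is a simple join even though the reads complete in any order.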
What may concern us in the first place is hardware failure, because as soon as we employ multiple pieces of hardware, the probability that one of them fails becomes high. A typical way of avoiding data loss is replication: redundant copies of the data are kept in the system so that if one copy is lost to a failure, another is still available. Another concern is that most data analysis procedures need to be able to combine the data in some way, and data read from one piece of hardware may need to be combined with data from any of the others. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is quite difficult.
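A minimal sketch of replication-based fault tolerance, under assumed names (`Node`, `write_block`, `read_block`, `REPLICATION_FACTOR` are all hypothetical): each block is written to several nodes, and a read falls back to the next replica when a node has failed.

```python
import random

# Assumed policy: keep this many redundant copies of each block.
REPLICATION_FACTOR = 3

class Node:
    """A toy storage node that can hold blocks and can fail."""
    def __init__(self):
        self.alive = True
        self.blocks = {}

def write_block(nodes, key, value):
    # Place copies of the block on REPLICATION_FACTOR distinct nodes.
    for node in random.sample(nodes, REPLICATION_FACTOR):
        node.blocks[key] = value

def read_block(nodes, key):
    # Try replicas in turn; skip nodes that have failed.
    for node in nodes:
        if node.alive and key in node.blocks:
            return node.blocks[key]
    raise IOError("all replicas lost")

nodes = [Node() for _ in range(5)]
write_block(nodes, "blk-1", b"payload")

# Simulate the failure of one node that holds a copy.
for node in nodes:
    if "blk-1" in node.blocks:
        node.alive = False
        break

print(read_block(nodes, "blk-1"))  # still readable from another replica
```

With three copies, any single failure leaves two live replicas, which is why the read above still succeeds.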
There are many Big Data programming models available today that address all four of these dimensions and can be used to solve the concerns stated above.