A Dummies’ Guide to Big Data and Its Impact on Today’s World
The title has probably already given you an idea of what this blog is about, so I will start with a question.
How many of you reading this currently use Google, Instagram, or Facebook?
Almost all of us, right?
But have you ever wondered where the searches we make on Google, the pictures we post on Instagram, and the posts we like or the statuses we upload on Facebook actually go?
After all, it is all data, and it must be stored somewhere so that any of us can view it in the future.
In the last two years alone, over 90% of the world’s data was created. It is clear that the future holds even more data, which can also mean more data problems.
Now you might say: sure, there will be huge stacks of data, but these companies will also be making huge profits out of it, right?
Well, yes, you are absolutely right, but you are missing a point: these companies are running a business. They like making more money, but constantly growing data demands more storage and processing resources, which takes more money out of their pockets.
The problems that companies face because of these huge stacks of data are known as Big Data problems.
Types of Big Data Problems
The primary Big Data problems are commonly summed up as the 4 Vs: volume, velocity, variety, and veracity.
Solution: The solution to these problems also has to be cost-effective because, like I said, it is all business.
The 4 Vs, and storage for structured data, are now most cheaply handled by big data technologies such as Hadoop clusters, which work on the concept of distributed data storage. Distributed computing, together with management and parallel processing principles, allows us to acquire and analyze intelligence from Big Data, making Big Data analytics a reality.
Why Hadoop?
Well, I say, why not? Platform infrastructure: the big data “platform” is typically the collection of functions that make up high-performance processing of big data. The platform includes capabilities to integrate, manage, and apply sophisticated computational processing to the data. For that reason, big data platforms usually include a Hadoop (or similar open-source project) foundation. Hadoop was designed and built to optimize complex manipulation of large amounts of data while vastly exceeding the price/performance of traditional databases. When you need more storage or computing capacity, all you need to do is add more nodes to the cluster. Hadoop is a unified storage and processing environment that scales well to large and complex data volumes.
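To make the idea of distributed storage plus parallel processing a bit more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API: the function names (split_into_blocks, map_phase, reduce_phase, word_count) and the sample posts are made up for illustration, and the "nodes" are just Python lists. It only shows the pattern: spread the data across nodes, let each node work on its own share, then merge the partial results.

```python
from collections import defaultdict


def split_into_blocks(records, num_nodes):
    """Deal records round-robin across nodes, mimicking distributed storage."""
    blocks = [[] for _ in range(num_nodes)]
    for i, record in enumerate(records):
        blocks[i % num_nodes].append(record)
    return blocks


def map_phase(block):
    """Each (simulated) node counts words in its own block only."""
    counts = defaultdict(int)
    for line in block:
        for word in line.lower().split():
            counts[word] += 1
    return dict(counts)


def reduce_phase(partials):
    """Merge the per-node counts into one combined result."""
    total = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            total[word] += n
    return dict(total)


def word_count(records, num_nodes):
    """Run the split -> map -> reduce pipeline on a given number of nodes."""
    blocks = split_into_blocks(records, num_nodes)
    return reduce_phase(map_phase(block) for block in blocks)


# Hypothetical sample data, standing in for posts stored on a cluster.
posts = ["big data is big", "data needs storage", "storage costs money"]

# Adding a node changes how the work is divided, not the answer.
print(word_count(posts, num_nodes=2) == word_count(posts, num_nodes=3))
```

The point of the sketch is the last line: the result stays the same whether the work is split across two nodes or three, which is why capacity can be added simply by adding nodes.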
Example of Hadoop being cost-effective:
One company’s cost comparison, for example, estimated that the cost of storing one terabyte for a year was $37,000 for a traditional relational database, $5,000 for a database appliance, and only $2,000 for a Hadoop cluster.
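To see what those per-terabyte figures add up to, here is a small arithmetic sketch. The three prices come from the comparison above; the 10 TB workload and the assumption that cost grows linearly with size are mine, purely for illustration.

```python
# Cost of storing one terabyte for a year, in US dollars (figures quoted above).
COST_PER_TB_YEAR = {
    "relational database": 37_000,
    "database appliance": 5_000,
    "Hadoop cluster": 2_000,
}


def yearly_cost(terabytes):
    """Yearly cost per option, ASSUMING cost scales linearly with data size."""
    return {option: cost * terabytes for option, cost in COST_PER_TB_YEAR.items()}


def cost_ratio_vs_hadoop():
    """How many times more expensive each option is than the Hadoop cluster."""
    hadoop = COST_PER_TB_YEAR["Hadoop cluster"]
    return {option: cost / hadoop for option, cost in COST_PER_TB_YEAR.items()}


# Hypothetical workload of 10 TB.
for option, cost in yearly_cost(10).items():
    print(f"{option}: ${cost:,} per year")

for option, ratio in cost_ratio_vs_hadoop().items():
    print(f"{option}: {ratio}x the Hadoop cost")
```

On these numbers, the relational database costs 18.5 times as much per terabyte as the Hadoop cluster, and the appliance 2.5 times as much.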
Other Big Data problems include:
1. Low-Quality and Inaccurate Data
If you are receiving large chunks of data, it comes as no surprise that some of it may be of low quality.
But what exactly is low-quality data, you may ask?
Anything from inconsistently formatted data to duplicate, inaccurate, or missing data falls into the category of low-quality data.
Solution: Cleansing the data and managing it regularly before analysis can solve this problem.
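As a rough idea of what cleansing can look like in practice, here is a minimal sketch in plain Python. The record layout (a name and an email field), the clean_records function, and the sample rows are all hypothetical. It handles three of the problems named above: inconsistent formatting, missing values, and duplicates.

```python
def clean_records(records):
    """Normalize formatting, drop incomplete rows, and remove duplicates."""
    seen_emails = set()
    cleaned = []
    for record in records:
        # Inconsistent formatting: trim whitespace, lowercase the email.
        name = (record.get("name") or "").strip()
        email = (record.get("email") or "").strip().lower()
        # Missing data: skip rows without a name or an email.
        if not name or not email:
            continue
        # Duplicates: keep only the first row seen for each email.
        if email in seen_emails:
            continue
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


# Hypothetical raw input with the typical quality problems.
raw = [
    {"name": " Asha ", "email": "ASHA@example.com"},
    {"name": "Asha", "email": "asha@example.com "},
    {"name": "Ravi", "email": None},
    {"name": "Meera", "email": "meera@example.com"},
]

for row in clean_records(raw):
    print(row)
```

Here the second row is dropped as a duplicate of the first once both are normalized, and the third is dropped because its email is missing. Real pipelines would usually log or repair such rows rather than silently discard them.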
2. Paying Loads of Money
Whether you go on-premises or opt for a cloud solution, either approach comes with substantial requirements that can take a huge amount of money out of your wallet. Moreover, in both cases, you will need to allow for future expansion so that big data growth does not get out of hand and cost you a fortune.
Solution: The right technology for a specific company depends on its requirements.
For instance, companies that want flexibility benefit from the cloud, while companies with extremely strict security requirements go on-premises.
There is also another option: a hybrid solution, in which part of the data is stored and processed in the cloud and part on-premises, which can also be cost-effective. Resorting to data lakes or algorithm optimizations (if done properly) can also save money.
All in all, the key is to analyze your needs and then choose the option that fits them.
3. Lack of Skills
Another commonly faced issue is employees’ lack of skills in Big Data technologies.
Solution: Hold workshops, raise awareness, and encourage employees to learn the technology on their own.
4. Issues with Upscaling
With data size constantly growing, one simply can’t rely on an old framework or architecture. You may have sorted out the new processing and storage requirements, but you also have to make sure that, as the data grows more complex, your system’s performance does not decline and the costs stay within budget.
Solution: You can take care of this problem by keeping future upscaling in mind while designing your big data solution architecture and its algorithms. Besides that, you also need to plan for your system’s maintenance and support so that any changes related to data growth are properly attended to. On top of that, holding systematic performance audits can help identify weak spots and address them in a timely manner.
Case Study: Tricky process of converting big data into valuable insights
Here’s an example: your super-cool big data analytics looks at what items in combination people buy (say, a needle and thread) solely based on your historical data about customer behavior. Meanwhile, on Instagram, a certain soccer player posts his new look, and the two characteristic things he’s wearing are white Nike sneakers and a beige cap. He looks good in them, and people who see that want to look this way too. Thus, they try to recreate the same look. But in your store, you have only the sneakers. As a result, you lose revenue and maybe some loyal customers.
Solution:
The reason you failed to have the needed items in stock is that your big data tool doesn’t analyze data from social networks or competitors’ web stores, while your rival’s big data tool, among other things, does note trends in social media in near-real time. Their shop has both items and even offers a 15% discount if you buy both.
The idea here is that you need to create a proper system of factors and data sources, whose analysis will bring the needed insights, and ensure that nothing falls out of scope. Such a system should often include external sources, even if it may be difficult to obtain and analyze external data.
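Here is a tiny sketch of the difference between the two approaches in this case study. Everything in it is hypothetical: the baskets, the list of trending items (imagine it came from a social media feed), the stock list, and both function names. The first function is the "historical data only" analysis; the second adds the external signal that the store in the example was missing.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each pair of items was bought together (internal data)."""
    pairs = Counter()
    for basket in baskets:
        # Sort so that the same pair always gets the same key.
        for pair in combinations(sorted(set(basket)), 2):
            pairs[pair] += 1
    return pairs


def missing_trending_items(trending, in_stock):
    """Items that are trending externally but that the store does not carry."""
    return sorted(set(trending) - set(in_stock))


# Hypothetical historical purchases.
baskets = [
    ["needle", "thread"],
    ["needle", "thread", "buttons"],
    ["thread", "buttons"],
]
# Hypothetical external signal and current stock.
trending = ["white sneakers", "beige cap"]
in_stock = ["white sneakers", "needle", "thread", "buttons"]

print(co_purchase_counts(baskets).most_common(1))
print(missing_trending_items(trending, in_stock))
```

The pair counts can only ever tell you about items you have already sold. The gap (the beige cap) shows up only once the external source is part of the analysis.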
Case Study: Facebook Hadoop Architecture
Goals of HDFS
1. Very Large Distributed File System
– 10K nodes, 100 million files, 10–100 PB
2. Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
3. Optimized for Batch Processing
– Data locations exposed so that computations can move to where data resides
– Provides very high aggregate bandwidth
4. Runs in user space on heterogeneous operating systems
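The replication goal above can be illustrated with a short sketch. This is not how HDFS actually chooses where to put replicas (the real placement policy takes racks into account); it is a simplified round-robin placement I am assuming for illustration, with made-up node names and function names. The replication factor of 3 matches the HDFS default.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simplified round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    failed = set(failed_nodes)
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


# Hypothetical four-node cluster storing six blocks.
nodes = ["n1", "n2", "n3", "n4"]
placement = place_replicas(6, nodes)

# With three replicas per block, any two node failures lose no data.
print(len(readable_blocks(placement, ["n1", "n2"])))
```

With three copies of every block, a block becomes unreadable only if all three nodes holding it fail at once, which is what lets the file system run on cheap commodity hardware.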
Lots of data is generated on Facebook
– 300+ million active users
– 30 million users update their statuses at least once each day
– More than 1 billion photos uploaded each month
– More than 10 million videos uploaded each month
– More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
Statistics per day:
– 4 TB of compressed new data added per day
– 135 TB of compressed data scanned per day
– 7500+ Hive jobs on production cluster per day
– 80K compute hours per day
With these numbers, one can estimate the amount and variety of data Facebook deals with. Even so, the Hadoop architecture helps them manage their data in a cost-effective way.
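For a feel of the scale, here is a back-of-the-envelope calculation using only the per-day statistics listed above. The averages are my own rough derivations, not figures from Facebook: they assume 1 TB = 1000 GB, a 365-day year, and that load is spread evenly over jobs and hours. Since the job count is given as "7500+", the per-job figure is an upper bound.

```python
# Per-day statistics quoted above.
NEW_DATA_TB_PER_DAY = 4
SCANNED_TB_PER_DAY = 135
HIVE_JOBS_PER_DAY = 7500
COMPUTE_HOURS_PER_DAY = 80_000


def new_data_per_year_tb(days=365):
    """Compressed data added per year, assuming a constant daily rate."""
    return NEW_DATA_TB_PER_DAY * days


def avg_scan_per_job_gb():
    """Average data scanned per Hive job, taking 1 TB as 1000 GB."""
    return SCANNED_TB_PER_DAY * 1000 / HIVE_JOBS_PER_DAY


def avg_parallel_compute_units():
    """Compute units that must run around the clock to supply the daily hours."""
    return COMPUTE_HOURS_PER_DAY / 24


print(f"New data per year: {new_data_per_year_tb()} TB")
print(f"Average scan per job: {avg_scan_per_job_gb()} GB")
print(f"Work running in parallel: about {avg_parallel_compute_units():.0f} units")
```

In other words, roughly 1.46 PB of new compressed data per year, about 18 GB scanned per job on average, and the equivalent of a few thousand compute units busy around the clock.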
Wrap Up
As you may have noticed, most of the challenges reviewed here can be foreseen and dealt with if your big data solution has a decent, well-organized, and well-thought-out architecture.
In today’s world, there are more mobile devices than people on the planet. Harnessing data from a range of new technology sources gives companies a richer understanding of consumer behaviors and preferences, irrespective of whether those consumers are existing or future customers. Big data technologies not only scale to larger data volumes more cost-effectively, they also support a range of new data and device types. The flexibility of these technologies is limited only by the organization’s vision.
So, could there be a simpler technology that solves all these problems? Think about it.
–Arjun Nigam