Big Data Definition
What is Data?
Data is a set of characters used to collect, store and transmit information for a specific purpose. Data can be in any form, e.g., text, image, audio, etc. The word 'data' comes from the Latin 'datum', which means 'something given'. Once data is processed, it is termed 'information'.
What is Big Data?
Big Data refers to a collection of data so large and complex that it becomes difficult to process with traditional or manual database management tools. Such data grows rapidly and is generally measured in terabytes or more.
For example, over 500 million Tweets are generated on Twitter daily; Netflix has over 220 million paid memberships globally; Facebook has over 2 billion daily users. These figures are large and keep growing every year, so the data behind them can be classified as Big Data.
How Do We Classify Any Data as Big Data?
To classify any data set as Big Data, the 3 V's of Big Data were introduced in 2001 and later extended to 5 V's. These 5 V's are:
- Volume: Volume refers to the 'size' or amount of data. For instance, YouTube has over 2.6 billion monthly active users and generates a large amount of data daily, which can't be processed manually; thus, modern techniques and tools are used to handle such voluminous data.
- Velocity: Velocity refers to the 'speed' or rate at which data accumulates. In 2010, YouTube had 200 million monthly active users, which increased to 2.6 billion in 2022; more users means data arriving at a faster rate.
- Variety: Variety refers to the 'heterogeneity' or diversity of data. The data can be structured, unstructured, or semi-structured.
- Veracity: Veracity refers to the 'trustworthiness' or quality of data, that is, whether the data is free from ambiguities, errors, and inconsistencies.
- Value: Value refers to the 'insights' gained from the data, that is, whether the data set produces any useful result. Raw data by itself yields little, but once processed efficiently, it can give important insights that help in decision-making.
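The Velocity figures above can be turned into an average yearly growth rate. The sketch below is only an illustration: it uses the monthly-active-user counts quoted in the list as a rough proxy for how fast data accumulates, and the function name `compound_annual_growth_rate` is introduced here for the example.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Return the average yearly growth rate that turns start_value into end_value."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the Velocity item: monthly active users in 2010 and 2022.
users_2010 = 200_000_000
users_2022 = 2_600_000_000

rate = compound_annual_growth_rate(users_2010, users_2022, years=2022 - 2010)
print(f"Average growth: {rate:.1%} per year")
```

A 13-fold increase over 12 years works out to an average growth of roughly 24% per year, which is what "growing exponentially" means in practice: a roughly constant percentage increase each year rather than a constant absolute increase.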
Types of Big Data
There are three types of Big Data: Structured, Semi-structured and Unstructured data.
- Structured Data: Any data in a fixed format is known as structured data. It can only be accessed, stored, or processed in a particular format. This type of data is stored in the form of tables with rows and columns. Any Excel file or SQL file is an example of structured data.
- Unstructured Data: Unstructured data does not have a fixed format or a predefined schema, so its structure has to be inferred at processing time. An example of unstructured data is a web page with text, images, videos, etc.
- Semi-structured Data: Semi-structured data combines features of structured and unstructured data. It does not use tables to show relations; instead, it contains tags or other markers that show hierarchy. JSON files and XML files are examples of semi-structured data. CSV (comma-separated values) files are sometimes placed in this group as well, although their row-and-column layout is closer to structured data. The e-mails we send or receive are also an example of semi-structured data.
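The difference between the three types is easier to see with small samples. The following sketch uses only the Python standard library; the table name, field names, and sample values are invented for illustration.

```python
import json
import sqlite3

# Structured: a table with a fixed schema (rows and columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Asha", "IN"), (2, "Liam", "US")],
)
structured_rows = conn.execute(
    "SELECT name, country FROM users ORDER BY id"
).fetchall()
conn.close()

# Semi-structured: keys mark the hierarchy, but records need not share fields.
json_text = """
[
  {"id": 1, "name": "Asha", "tags": ["sports", "music"]},
  {"id": 2, "name": "Liam", "address": {"city": "Austin"}}
]
"""
records = json.loads(json_text)
fields_per_record = [sorted(record.keys()) for record in records]

# Unstructured: free text with no schema; any structure has to be inferred.
review = "Loved the show, but the second season felt rushed."
word_count = len(review.split())

print(structured_rows)
print(fields_per_record)
print(word_count)
```

Every row of the table has the same columns, the two JSON records carry different fields, and the review text has no fields at all until some processing step (here, a simple word split) imposes them.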
Use Cases of Big Data
- Social Media and Entertainment: You have probably seen streaming services such as Netflix recommend shows and movies based on your previous searches and what you have watched. This is done using Big Data. Netflix and other streaming services build a custom profile for each user, where they store data such as search history, watch history, the genres watched most often, the preferred time of day for watching, and the daily streaming time. They analyze this data and give recommendations accordingly, which results in a better streaming experience for the users.
- Shopping: Websites like Amazon, Flipkart, etc., also use Big Data to recommend products based on your previous purchases, search history, and interests. It is done to maximize their profits and provide a better shopping experience to their customers.
- Education: Big Data helps in analyzing and monitoring the behavior and activities of students, like the time they need to answer a question, the number of questions skipped, and the difficulty level of the questions that are skipped, and thus helps students to analyze their overall preparation, weak topics, strong topics, etc.
- Healthcare: The healthcare sector uses Big Data to track and analyze the health and fitness of patients, the number of visits, the number of appointments a patient skips, etc. Mass outbreaks of diseases can be predicted by analyzing the data and using algorithms.
- Transportation: Traffic can be controlled by collecting and analyzing data from sensors and cameras installed on roads and highways. Accident-prone areas can be detected with the help of Big Data analysis, so the required measures can be taken to avoid accidents.
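The recommendation idea from the Social Media and Entertainment item can be sketched in a few lines. This is a deliberately simple toy, not how Netflix or any real service works: it recommends unseen titles from the genre a user has watched most. The titles, genres, and the `recommend` function are all hypothetical.

```python
from collections import Counter

# Hypothetical watch history for one user: (title, genre) pairs.
watch_history = [
    ("Title A", "thriller"),
    ("Title B", "comedy"),
    ("Title C", "thriller"),
]

# Hypothetical catalogue mapping each title to its genre.
catalogue = {
    "Title A": "thriller",
    "Title D": "thriller",
    "Title E": "drama",
    "Title F": "comedy",
    "Title G": "thriller",
}


def recommend(history, catalogue, limit=2):
    """Suggest unseen titles from the genre the user has watched most."""
    if not history:
        return []
    genre_counts = Counter(genre for _, genre in history)
    favourite_genre, _ = genre_counts.most_common(1)[0]
    seen = {title for title, _ in history}
    candidates = [
        title
        for title, genre in catalogue.items()
        if genre == favourite_genre and title not in seen
    ]
    return sorted(candidates)[:limit]


recommendations = recommend(watch_history, catalogue)
print(recommendations)
```

Real systems combine many more signals (search history, time of day, behavior of similar users) and run over the data of millions of users, which is where Big Data tooling becomes necessary.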
Evolution of Big Data
- The practice of tracking and analyzing data goes back not decades but thousands of years, to when accounting was first introduced in Mesopotamia.
- In the 20th century, IBM developed one of the first large-scale data projects, a punched-card system that tracked information on millions of Americans.
- With the emergence of the World Wide Web and supercomputers in the 1990s, the creation of data on a large scale started to grow at an exponential rate. It was in the early 1990s when the term 'Big Data' was first used.
- The two main challenges regarding 'Big Data' were storing and processing such a huge volume of data.
- In 2005, Doug Cutting and Mike Cafarella created the open-source framework Hadoop, which stores and processes large data sets; Yahoo became one of its earliest and largest backers.
- The storage solution in Hadoop was named HDFS (Hadoop Distributed File System), and the processing solution was named MapReduce.
- Later, Hadoop was handed over to the Apache Software Foundation, a non-profit organization that supports open-source software.
- In 2008, Cloudera became the first company to provide a commercial Hadoop distribution.
- In 2013, the creators of Apache Spark founded Databricks, a company that offers a platform for Big Data and Machine Learning solutions.
- Over the past few years, top Cloud providers such as Microsoft, Google, and Amazon also started to provide Big Data solutions. These Cloud providers made it much easier for users and companies to work on Big Data.
Did You Know?
In 2009, the Indian government launched a program to record the fingerprints and iris scans of its residents, which has since grown into one of the largest biometric databases ever created.
A Brief Introduction to Hadoop
Created by Doug Cutting and Mike Cafarella in 2005, Hadoop is an open-source, Java-based framework that efficiently stores and processes Big Data. It is managed by the Apache Software Foundation. The main components of Hadoop are HDFS (Hadoop Distributed File System), which handles storage, and MapReduce, which handles processing. Because it is open source and runs on clusters of ordinary machines, Hadoop is cost-efficient. Its speed and capacity to store large volumes of data make it popular among many top-tier companies. Companies such as Facebook, Twitter, and LinkedIn use Hadoop to handle Big Data.
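The MapReduce processing model can be illustrated with the classic word-count example. The sketch below is plain Python running in a single process, not the Hadoop API: the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are chosen here only to mirror the stages of the model, and in a real cluster each input split would be processed on a different node.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key; here, sum the counts per word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    mapped = []
    for document in documents:  # each split is mapped independently
        mapped.extend(map_phase(document))
    return reduce_phase(shuffle_phase(mapped))


splits = ["big data is big", "data is everywhere"]
counts = word_count(splits)
print(counts)
```

Because each call to `map_phase` looks at only one split and each key is reduced independently, both stages can be spread across many machines, which is what lets the model scale to data sets too large for a single computer.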
Importance of Big Data
- A better understanding of market conditions.
- Time and cost saving.
- Solving advertisers' problems.
- Offering better market insights.
- Boosting customer acquisition and retention.
Applications of Big Data
Big Data finds applications in various sectors, such as:
- Banking and Security
- Social Media and Entertainment
- E-commerce websites
Big Data Analytics
Big Data Analytics uses modern tools and techniques to extract valuable insights, trends, hidden patterns, and relations with the help of large sets of data, which can be structured, semi-structured, or unstructured. It helps in better decision-making and optimizes business operations.
Let's consider the example of YouTube, which has over 2.6 billion monthly active users. It generates a huge amount of data every day. With the help of this data, it recommends videos based on what you have watched previously, your likes, shares, etc. The tools and frameworks of Big Data Analytics make this possible.
Types of Big Data Analytics
- Descriptive Analytics: This type of analytics summarizes the incoming data and extracts insights from it, producing a description of what has happened. For example, the insights shown for your YouTube channel are based on data such as the likes, shares, and views on your videos.
- Predictive Analytics: This type of analytics predicts what might happen. Questions such as 'how' and 'why' reveal particular patterns that help predict future trends. Machine Learning concepts are also used for such types of analysis. For example, prediction of weather, prediction of malfunctioning in the parts of an airplane, etc.
- Prescriptive Analytics: This type of analytics is based on rules and recommendations and thus prescribes a course of action. The analysis generally answers the question, 'what actions should be taken?' Google's self-driving car is an example of prescriptive analytics.
- Diagnostic Analytics: These analytics look into past trends and diagnose questions such as how and why something happened. It is also called behavioral analytics. This analysis aims to answer the question, 'why did this happen?' For example, if the sales report of a company shows a rise in sales, then the company can analyze the internal and external causes responsible for the increase.
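Two of the types above, descriptive and predictive analytics, can be contrasted on a tiny data set. The sketch below is an illustration under simple assumptions: the monthly sales figures are invented, and the prediction is a plain straight-line fit (ordinary least squares) rather than any particular Machine Learning method.

```python
from statistics import mean

# Hypothetical monthly sales figures (units sold), invented for illustration.
monthly_sales = [100, 110, 120, 130, 140, 150]

# Descriptive analytics: summarize what has already happened.
average_sales = mean(monthly_sales)
best_month = monthly_sales.index(max(monthly_sales)) + 1  # months numbered from 1


def fit_line(values):
    """Fit y = slope * x + intercept by ordinary least squares, x = 1, 2, ..."""
    xs = list(range(1, len(values) + 1))
    x_mean, y_mean = mean(xs), mean(values)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Predictive analytics: extend the fitted trend one month into the future.
slope, intercept = fit_line(monthly_sales)
next_month_forecast = slope * (len(monthly_sales) + 1) + intercept

print(f"Average monthly sales: {average_sales}")
print(f"Best month so far: {best_month}")
print(f"Forecast for month {len(monthly_sales) + 1}: {next_month_forecast:.0f}")
```

The first two numbers describe the past; the forecast is a statement about the future and is only as good as the assumption that the trend continues. Diagnostic analytics would ask why sales rose, and prescriptive analytics would recommend what to do about it.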