Big Data Analytics is “the process of examining large data sets containing a variety of data types – i.e., Big Data – to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information.”
Big Data Analytics offers a nearly endless source of business and informational insight that can lead to operational improvements and new opportunities for companies to capture unrealized revenue across almost every industry. With use cases ranging from customer personalization to risk mitigation, fraud detection, and internal operations analysis, and with new ones arising almost daily, the value hidden in company data has companies looking to build cutting-edge analytics operations.
Discovering value within raw data poses many challenges for IT teams. Every company has different needs and different data assets. Business initiatives change quickly in an ever-accelerating marketplace and keeping up with new directives can require agility and scalability. On top of that, a successful Big Data Analytics operation requires enormous computing resources, technological infrastructure, and highly skilled personnel.
All of these challenges can cause operations to fail before they deliver value. In the past, a lack of computing power and limited access to automation put a true production-scale analytics operation beyond the reach of most companies: Big Data was too expensive and too much of a hassle, and it offered no clear ROI. With the rise of cloud computing and new technologies in compute resource management, Big Data tools are more accessible than ever before.
Where did Big Data originate?
Big Data emerged from the early-2000s data boom, driven forward by many of the early internet and technology companies. Software and hardware capabilities could, for the first time in history, keep up with the massive amounts of unstructured information produced by consumers. New technologies like search engines, mobile devices, and industrial machines provided as much data as companies could handle—and the scale continues to grow.
A study by the market intelligence firm IDC estimated that global data production would grow 10x between 2015 and 2020.
With the astronomical growth in collectible data, it soon became evident that traditional data technologies such as data warehouses and relational databases were not well suited to the influx of unstructured data. The early Big Data innovation projects were open sourced under the Apache Software Foundation, with the most significant contributions coming from the likes of Google, Yahoo, Facebook, IBM, and academia. Some of the most widely used engines are:
- Apache Hive/Hadoop (Hadoop was developed largely at Yahoo!, building on papers published by Google; Hive was developed at Facebook) is the workhorse for complex ETL and data preparation, serving data to many analytics environments or data stores for further analysis.
- Apache Spark (developed at the University of California, Berkeley) tends to be used for compute-heavy jobs, typically batch ETL and ML workloads, but is also used in conjunction with technologies such as Apache Kafka.
- Presto (developed by Facebook) is a SQL engine that is lightning fast and reliable for reporting and ad-hoc analytics.
What is different with Big Data today?
As data grows exponentially, enterprises need to continuously scale their infrastructure to maximize the economic value of the data. In the early years of Big Data (roughly 2008), when Hadoop was first gaining recognition among larger enterprises, it was extremely expensive and inefficient to stand up a useful production system. Using Big Data also required the right people and software, as well as hardware that could handle the volume of data and the velocity of incoming queries. Aligning everything to operate in concert was an extremely daunting task and caused many Big Data projects to fail.
By 2013, Amazon Web Services (AWS) was popularizing the notion of the enterprise cloud for analytics, and a handful of other technology companies (VMware, Microsoft, and IBM) began to emerge with their own enterprise solutions for companies that wanted to take advantage of cloud computing. It wasn’t until 2015, when AWS reported nearly $5 billion in annual revenue, that the world truly started to take notice.
The cloud has become a market-changer: businesses large and small can now have instantaneous access to infrastructure and advanced technologies with a few clicks. It helps with each of the V’s of Big Data:
- Volume – information is growing, and the value of data has an expiration date. Cheap cloud storage enables companies to take on massive amounts of data without worrying up front about what is and isn’t valuable.
- Variety – demand for analyzing unstructured data is growing, which is driving the need for different processing frameworks such as Deep Learning. Ephemeral cloud computing servers allow companies to test different big data engines against the same data iteratively.
- Velocity – complex analytics problems require several stages of big data processing (e.g., Machine Learning is estimated to be ~80% ETL in terms of compute resources), and cloud computing lets companies scale resources up or down according to demand.
- Value – demand for AI-driven applications is pushing demand for modern big data architectures, which allow applications, storage, and compute resources each to be scaled out individually.
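The scale-up and scale-down behavior described under Velocity can be sketched as a simple sizing rule. This is a minimal illustration, not how any particular cloud provider or platform implements autoscaling; the function name `desired_nodes` and the parameters `tasks_per_node`, `min_nodes`, and `max_nodes` are hypothetical.

```python
import math


def desired_nodes(pending_tasks: int, tasks_per_node: int,
                  min_nodes: int = 1, max_nodes: int = 20) -> int:
    """Return how many compute nodes a cluster should run for the current queue.

    Hypothetical sizing rule: enough nodes to hold every pending task,
    clamped to a configured floor and ceiling.
    """
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Demand over a day: quiet overnight, a heavy ETL window, then ad-hoc queries.
for label, pending in [("overnight", 0), ("etl window", 130), ("ad hoc", 12)]:
    print(label, desired_nodes(pending, tasks_per_node=8))
```

When storage is kept separate from compute, shrinking a cluster back to its floor in this way should not affect the data itself, only the capacity available to process it.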
Big Data Analytics vs. Business Intelligence
Business Intelligence is often described as the first two of the four stages of analytics: descriptive and diagnostic. BI is often hosted in a data warehouse, where data is highly structured, and only explains “what, where, and how” something happened (for example: 10 of the same shoes were purchased from 3 different stores that ran the same promotion, while the other 2 stores sold no shoes). This data is often used for reporting and for gathering insights into popularity trends and interactions based on recent events.
Big Data Analytics takes this a step further, as the technology can access a variety of both structured and unstructured datasets (such as user behavior or images). Big Data Analytics tools can combine this data with historical information to estimate the probability that an event will happen, based on past experience.
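The difference can be sketched with the shoe example. The store names and records below are invented for illustration. The first function produces a BI-style summary of what happened; the second uses past outcomes to estimate how likely a sale is under a given condition, which is about the simplest possible stand-in for a predictive model.

```python
from collections import defaultdict

# Hypothetical sales records: (store, ran_promotion, units_sold)
sales = [
    ("store_a", True, 4),
    ("store_b", True, 3),
    ("store_c", True, 3),
    ("store_d", False, 0),
    ("store_e", False, 0),
]


def units_by_promotion(records):
    """Descriptive (BI-style): total units sold, grouped by promotion flag."""
    totals = defaultdict(int)
    for _store, promo, units in records:
        totals[promo] += units
    return dict(totals)


def sale_probability(history, promo):
    """Predictive (very naive): share of past records with the given
    promotion flag that recorded at least one sale."""
    matching = [units for _store, p, units in history if p == promo]
    if not matching:
        return None  # no past evidence for this condition
    return sum(1 for units in matching if units > 0) / len(matching)


print(units_by_promotion(sales))
print(sale_probability(sales, promo=True))
print(sale_probability(sales, promo=False))
```

A real predictive workload would draw on far more history and on unstructured signals such as user behavior, but the shape is the same: summarize the past, then score a future event.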
Why Do You Need Big Data in the Cloud Today?
The 4 V’s have been a well-known catalyst for the growth of Big Data analysis in the last decade. We have now entered a new era with new challenges, such as the “variety” of open source technologies, Machine Learning use cases, and rapid development across the big data ecosystem. These raise the question of how to keep up with ever-growing information while ensuring that advanced analytics remains effective in such a noisy environment.
Predictive and Prescriptive analytics are in a transitional state and require modern infrastructure that traditional data warehouses can’t provide. A big data platform that gives teams appropriate self-service access to unstructured data enables companies to run more innovative data operations.
- Descriptive analytics (What Happened and When) – This is common in traditional Business Intelligence and reporting analytics.
- Diagnostic analytics (Where and How it Happened) – This takes Business Intelligence a step further, where the end user could be given a report or have a set of actions sent to them based on the results of the data.
- Predictive analytics (What Will Happen and How) – A model is applied to the data and produces a decision or probability score based on historical events. This output can also be fed back into Business Intelligence systems to help with future decision making.
- Prescriptive analytics (What Should We Do) – Takes the predicted output of the data and places it into a practical application that makes recommendations or alerts end-users (such as with fraud detection or ecommerce shopping). This data usually needs to be put into a data mart that can feed out to an application in near-real time.
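The four stages can be walked through on a toy fraud-detection dataset. Everything below is an assumption made for illustration: the transaction fields, the scoring rule (the rate of past fraud per merchant category), and the 0.5 alert threshold. A production system would use a trained model rather than a frequency table.

```python
from collections import Counter

# Hypothetical labelled history: (merchant_category, amount, was_fraud)
history = [
    ("electronics", 900.0, True),
    ("electronics", 120.0, False),
    ("electronics", 1500.0, True),
    ("grocery", 45.0, False),
    ("grocery", 60.0, False),
    ("travel", 700.0, False),
]


def descriptive(records):
    """What happened: count of fraudulent transactions."""
    return sum(1 for _cat, _amt, fraud in records if fraud)


def diagnostic(records):
    """Where it happened: fraud counts per merchant category."""
    return Counter(cat for cat, _amt, fraud in records if fraud)


def predictive(records, category):
    """What will happen: historical fraud rate for a category, used as a score."""
    outcomes = [fraud for cat, _amt, fraud in records if cat == category]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def prescriptive(records, category, threshold=0.5):
    """What should we do: turn the score into a recommended action."""
    return "alert" if predictive(records, category) >= threshold else "approve"


print(descriptive(history))
print(dict(diagnostic(history)))
print(round(predictive(history, "electronics"), 2))
print(prescriptive(history, "electronics"), prescriptive(history, "grocery"))
```

The first two functions only summarize data that already exists, which is why they fit a traditional warehouse; the last two score an event that has not happened yet and act on that score.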
Use Cases (Data Science, ETL, interactive analytics, BI)
Qubole is useful at any scale because the technology separates storage from compute and provides managed autoscaling for Apache Hadoop, Apache Spark, Presto, and TensorFlow. The software automates cloud infrastructure provisioning, which saves data teams a great deal of time that would otherwise go to administrative tasks such as cluster configuration and workload monitoring.
- ETL – build and schedule pipelines for recurring data transformations
- Data Science and Machine Learning – explore, develop and test models at scale before putting them into production
- Interactive Analytics – enable data teams and less technical users to analyze less structured or raw data that otherwise can’t fit in a data warehouse
- Data Visualization – format and present analytics for business insight and intuitive dashboards