Hacking Hive Metadata for Presto and Hadoop, 7Park Data’s Head Engineer Talks Shop
Following the recent data architecture redesign at 7Park Data, Mike MacMillan, the Head of Engineering, shares the latest innovations in scalability.
Watching huge volumes of complex data come to life within seconds is magical. Transforming that data into actionable insight is empowering. While our engineering team at 7Park Data makes it look effortless, we know there’s a lot of thought, planning and execution that goes into processing massive amounts of data that our clients receive as accurate, timely and ready-to-use insight.
What are the recent changes in data architecture at 7Park Data?
The volume of data powering our data intelligence products has increased by more than 1,500% in the last three months alone, so we needed every component of our architecture to be horizontally scalable to keep up. We switched to Hadoop, Spark, and Presto deployed directly on Amazon Elastic Compute Cloud (Amazon EC2), which immediately allowed us to unlock massive amounts of our data previously buried behind slow processing times or technology lock-in. Our primary goal with that migration was to make data accessible to many different technologies and employees, and, ultimately, to deliver products and insights to our clients faster than ever before.
Our team spent several weeks evaluating massive in-memory SQL analysis technologies. We tried Cloudera Impala and Spark. Then came our test of Presto, an in-memory query engine that connects to many different data sources, including Hive, MySQL, PostgreSQL, and others. Each connected source is exposed as a “catalog.” Facebook built Presto to solve its need to query petabytes of data and open-sourced it for others. Presto enables us to query any data wherever it lives, instead of having to put it into yet another system. It’s lightning fast. Presto runs as a separate cluster outside Hadoop, but is able to reach inside to get answers faster than Hive or many other solutions.
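To make the catalog idea concrete: Presto addresses every table as catalog.schema.table, so a single SQL statement can join data that lives in different systems. The sketch below only builds such a statement as a string and does not connect to a cluster. The catalog, schema, table, and column names (hive.default.app_events, mysql.reference.apps, app_id) are hypothetical placeholders, not 7Park Data’s actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """A Presto table reference, addressed as catalog.schema.table."""
    catalog: str
    schema: str
    name: str

    def qualified(self) -> str:
        # Presto resolves tables by their fully qualified three-part name.
        return f"{self.catalog}.{self.schema}.{self.name}"


def federated_join_sql(left: Table, right: Table, key: str) -> str:
    """Build a SQL statement that joins two tables from different catalogs.

    Identifiers are interpolated directly, so this assumes trusted,
    hard-coded names rather than user input.
    """
    return (
        f"SELECT r.{key}, count(*) AS row_count\n"
        f"FROM {left.qualified()} AS l\n"
        f"JOIN {right.qualified()} AS r ON l.{key} = r.{key}\n"
        f"GROUP BY r.{key}"
    )


# Hypothetical tables: event data in Hive, reference data in MySQL.
events = Table("hive", "default", "app_events")
apps = Table("mysql", "reference", "apps")

sql = federated_join_sql(events, apps, "app_id")
print(sql)
```

The resulting statement could be submitted through any Presto client. Because both tables are read in place, no copy of the MySQL data has to be loaded into Hadoop first.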
While we decided to build our own primary Hadoop cluster, we have gained immensely from having the Amazon Elastic MapReduce (EMR) platform at our disposal to augment our workloads. In an upcoming blog post we’ll talk about our design decisions and the pros and cons of different options.
What were we doing previously?
Like other fast-moving teams, we built much of our processing with some combination of Cron, Python, and AWS’s PaaS products (Redshift, PostgreSQL). They were great for enabling us to serve client needs quickly without a steep learning curve. Being able to deploy a 40-node Amazon Redshift cluster in 15 minutes is a big advantage for an organization of any size, and the speed of data querying and processing on Redshift made it an attractive choice. Though these products allowed fast growth with minimal complexity, they gave us only one tool (SQL) to conduct analysis and couldn’t scale for our use case.
Conducting data processing on Redshift came with a steep performance penalty for users also trying to perform analysis. We needed a different solution that gave us more options out of the gate — the Hadoop community has many, so that’s where we began.
What can we do now that we couldn’t do before?
We have trillions of rows of mobile user behavior data and insights, but our previous data tools allowed us to explore only 5-10% of them reliably. Presto (on Hive, Redshift, and MySQL) lets us avoid unnecessary ETL jobs that exist only to copy data into new systems for analysis. Having fewer steps in our pipeline means faster turnaround for our clients and the ability to make decisions based on what the data actually indicates.
How is 7Park Data different from other research companies?
Now that all our tools can interact with the data in Hadoop, our product capabilities extend beyond descriptive analytics (what happened in the past) to predictive analytics (what we expect to happen in the future) on massive amounts of data at scale. Our technology and architecture allow us to continuously innovate and maximize analytics capabilities in ways no other firm can. And, unlike other research firms, we also empower our clients with the tools and know-how to conduct their own analysis of data.
What does the future hold for 7Park Data now that we have rolled out the Presto platform?
In 2016, we’re operating at a breakneck pace, leveraging technology developed and battle-hardened by data engineering teams at Facebook, Netflix, Airbnb, and Spotify. The most important thing to us is to have ways to consistently and reliably look at data and deliver meaningful insights from it as fast as possible. Our stack enables our machine learning engineers and data scientists to explore data without the help of engineers. What’s even more exciting is that we’ll be bringing the same tools and techniques to our clients, who will be able to better understand large amounts of data and extract invaluable insight on their own.
The best technology makes big problems trivial and enables users to make sophisticated decisions using simple inputs. That’s our mission for the technology powering 7Park Data. Our clients can expect to see even more power unleashed in our API, Dashboard, and other channels in the coming weeks and months.