Data Lakes: Beyond the Buzzwords

The biggest challenge with understanding big data is not the technology itself. It’s simply getting past the jargon.

Jason Byrne
FloSports Engineering

--

I walked into the AWS Summit in Atlanta and forced myself to sit through sessions about data warehouses and ETL. It was a topic I had never needed to delve into, but I was feeling increasingly compelled to cure my ignorance of it. It is such an important and growing area of engineering!

After attending session number one, I walked the vendor floor. I boldly jumped into conversations with big data sales engineers. Within seconds I was out of my element, and I’m sure I looked like an idiot. Several more sessions and many hours of research followed until I finally felt self-assured in the concepts!

But the thing is… they are actually nothing any senior-level engineer can’t quickly acclimate to. Largely the same concepts you need to know about building and scaling applications carry over. It’s just these darn buzzwords that make us feel completely dumb around these data engineers!

So let’s demystify some of the big ones…

Big Data — Most large web applications might have data in the tens-of-gigabytes range. Big data is going to be at least hundreds of gigabytes, if not terabytes or even petabytes. It is often coming from lots of sources and may not all be consistent or structured, so it can be hard to work with.

Data Lake — A big repository where you just dump a bunch of raw or massaged data. It is not a true database that you would typically query; it’s more like a file store. Often this will be an S3 bucket. That’s it!
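
For a concrete picture, here is a minimal sketch of dumping raw JSON events into an S3 data lake with boto3. The bucket name and key prefix are hypothetical:

```python
# A minimal sketch: dumping raw JSON events into an S3 "data lake" bucket.
# The bucket name and key prefix are hypothetical.
import json
import boto3

s3 = boto3.client("s3")

raw_events = [
    {"user_id": 42, "action": "video_view", "ts": "2019-06-01T12:00:00Z"},
    {"user_id": 7, "action": "signup", "ts": "2019-06-01T12:01:30Z"},
]

# Store the raw data as-is; no schema is enforced at write time.
s3.put_object(
    Bucket="my-company-data-lake",
    Key="raw/events/2019/06/01/events.json",
    Body="\n".join(json.dumps(e) for e in raw_events),
)
```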

Data Warehouse — This refers to a large amount of data that is structured and in a state that is typically ready to be queried.

ETL — Extract, transform, load. Pull the data from somewhere, do something with it to get it normalized and into a usable format, and then load the data into a database or data warehouse you can query… or sometimes just another data lake. Pretty simple.
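
As a tiny sketch of the idea, here is an ETL pipeline in Python, assuming a hypothetical CSV export as the source and a local SQLite table standing in for the warehouse:

```python
# A minimal ETL sketch: extract from a CSV export, transform/normalize,
# load into a SQLite table standing in for the warehouse. Names are hypothetical.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Normalize: lowercase emails, drop rows with no user id.
    return [
        {"user_id": int(r["user_id"]), "email": r["email"].strip().lower()}
        for r in rows
        if r.get("user_id")
    ]

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS users (user_id INTEGER, email TEXT)")
    conn.executemany(
        "INSERT INTO users (user_id, email) VALUES (:user_id, :email)", rows
    )
    conn.commit()

conn = sqlite3.connect("warehouse.db")
load(transform(extract("raw_users.csv")), conn)
```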

Enrichment — Take your raw data and make it more valuable. It could be something as simple as this: the raw data has user IDs, so you fetch each user’s email address from another system and push it into your data set so it’s available to query against later.
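
That user ID example might look something like the sketch below, where the lookup endpoint is entirely hypothetical (any internal user service would play the same role):

```python
# A minimal enrichment sketch: raw click events only carry a user id, so we
# look up the email from a (hypothetical) internal users API and attach it.
import requests

raw_events = [{"user_id": 42, "action": "video_view"}]

def lookup_email(user_id):
    # Hypothetical endpoint; swap in whatever system owns your user records.
    resp = requests.get(f"https://users.internal.example.com/api/users/{user_id}")
    resp.raise_for_status()
    return resp.json()["email"]

# Merge the looked-up email into each event so it can be queried later.
enriched = [{**event, "email": lookup_email(event["user_id"])} for event in raw_events]
```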

BI — Stands for business intelligence. This represents taking all of that raw data, turning it into something useful, and ultimately making it actionable. Make it understandable to management so it tells a story they can use to make decisions. Often BI will refer to the analytics/presentation system that gives them a user interface with pretty graphs and charts.

Schema — A set of rules and structure that is placed on data so that it can be queried and accessed in a consistent way.

Schema-on-write — The data either comes already structured into the prescribed format or it gets transformed and then is written into the data warehouse with that schema already applied, ready to be queried.

Schema-on-read — The data is stored in a data lake in a certain predictable format, but without any queryable data structure (maybe CSV or JSON). An application layer will pull out a subset of that semi-structured data and transform it on the fly into a more structured format that can be queried.
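
Here is a minimal sketch of that schema-on-read idea, assuming made-up JSON lines sitting in the lake. Notice the schema lives in the reading code, not in the storage layer, which is the contrast with schema-on-write:

```python
# A minimal schema-on-read sketch: the lake holds raw JSON lines with no
# enforced structure; a schema is only applied at read time.
import json

# Pretend these lines came out of a file in the data lake.
raw_lines = [
    '{"user_id": 42, "status": 200, "path": "/live"}',
    '{"user_id": 7, "status": 500, "path": "/replay", "extra_field": true}',
]

def read_with_schema(lines):
    # The "schema" is defined here, in the application layer.
    for line in lines:
        record = json.loads(line)
        yield {"user_id": int(record["user_id"]), "status": int(record["status"])}

errors = [r for r in read_with_schema(raw_lines) if r["status"] >= 500]
```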

MPP — Stands for massively parallel processing. Basically taking a data processing job and splitting it into a bunch of asynchronous mini-jobs in order to get through a huge amount of data quickly.
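
A toy sketch of that split-and-combine idea, using Python’s multiprocessing as a stand-in for a real MPP engine:

```python
# A toy sketch of the idea behind MPP: split one big job into chunks, fan the
# chunks out across parallel workers, then combine the partial results.
from multiprocessing import Pool

def count_errors(chunk):
    # Each worker counts 500-level responses in its own slice of the data.
    return sum(1 for status in chunk if status >= 500)

if __name__ == "__main__":
    status_codes = [200, 500, 404, 200, 503, 200] * 1_000_000
    chunks = [status_codes[i::4] for i in range(4)]  # four roughly equal slices
    with Pool(processes=4) as pool:
        total_errors = sum(pool.map(count_errors, chunks))
    print(total_errors)
```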

MapReduce — Data most often exists across many files in an unstructured or semi-structured format. In order to get meaningful data from it, you need to first go through it all to process it into something usable and then apply some formula to it. So let’s say we have server logs, one file per day, and wanted to find out how many 500 errors we had per day on average. A MapReduce product would create a bunch of map jobs that would parse the server logs asynchronously and compile the count of each HTTP status code into key-value pairs. Then the reducer would take that compiled data and get you the average number of 500s per day like you wanted.
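
Here is a toy, single-machine sketch of that exact server-log example, with made-up log lines, just to show the map and reduce roles:

```python
# A toy MapReduce sketch of the server-log example above: the map step counts
# HTTP status codes per daily log file, the reduce step combines those counts
# into the average number of 500s per day.
from collections import Counter

daily_logs = {
    "2019-06-01.log": ["200 GET /", "500 GET /live", "200 GET /schedule"],
    "2019-06-02.log": ["500 GET /replay", "500 GET /live"],
}

def map_log(lines):
    # Emit a {status_code: count} mapping for one file (would run in parallel).
    return Counter(line.split()[0] for line in lines)

def reduce_counts(mapped):
    # Combine the per-file counts and compute the average 500s per day.
    total_500s = sum(counts["500"] for counts in mapped)
    return total_500s / len(mapped)

mapped = [map_log(lines) for lines in daily_logs.values()]
print(reduce_counts(mapped))  # 1.5 500s per day on average
```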

Hadoop — A well-known open source product that at its core is a MapReduce framework, but with storage and other tools built in.

AWS Redshift — A data warehouse product by Amazon based on Postgres. It differs from a standard database in its ability to scale a query out horizontally, which lets it query large datasets quickly.

AWS Kinesis — This is an Amazon product that allows you to take in a large amount of data from an input source and set up a workflow for it. Through the console you can configure where the data should come from and in what size batches it should be processed, pass it off to a Lambda for custom transformation, and then either query it in real time from there and/or push it into other storage like an S3 bucket, Redshift, or even something like Elasticsearch.
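
From the application side, producing into Kinesis is just a put call. A minimal sketch, assuming a stream named "clickstream" already exists and the downstream Lambda/S3/Redshift wiring is configured separately:

```python
# A minimal producer sketch for a Kinesis data stream. The stream name is
# hypothetical; downstream consumers are configured outside this code.
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"user_id": 42, "action": "video_view", "ts": "2019-06-01T12:00:00Z"}

kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=str(event["user_id"]),  # determines which shard gets the record
)
```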

AWS Glue — A managed ETL product by Amazon. It can act in different ways, but basically it has crawlers to extract the data from the source, then it can transform it on the fly and catalog it for storage and querying from other products like Athena. It can also push that transformed data into another permanent data store. Basically its purpose is to let you set up a complicated data workflow in an automated or wizard-like way that would otherwise take a lot of custom programming of many separate pieces.

AWS EMR — Stands for Elastic MapReduce. So this is just a managed MapReduce product by Amazon.

AWS Athena — A schema-on-read product from Amazon. Basically what you do is define a certain data schema, but it just sits there as a generic definition not applied to the semi-structured data, which likely lives in S3 (maybe in CSV, JSON, or Parquet format). Once you tell it the source bucket and prefix of that data source, you run a SQL query based on your schema. Athena will grab the appropriate data, filter it, and apply your query as if it were already sitting there in a SQL database.
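
A minimal sketch of kicking off an Athena query with boto3. The database, table, and results bucket are hypothetical and assumed to already be defined over files in S3:

```python
# A minimal sketch of running an Athena query from code. Database, table,
# and output bucket names are hypothetical.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT status, COUNT(*) AS hits
        FROM server_logs
        WHERE status >= 500
        GROUP BY status
    """,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Athena runs asynchronously; poll get_query_execution / get_query_results
# with this id to fetch the result set.
print(response["QueryExecutionId"])
```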

Firehose — A source that pushes a lot of data at you at one time. Drinking from the firehose!

Streams — A data source that maintains a connection and sends a continual (and often predictable) series of data.

Time Series Database — A database that is optimized to store data indexed by time and able to query it quickly based on a time range. Often the data stored in it is immutable, meaning it cannot later be altered or updated.

Parquet — An open source data storage format. It stores data column by column rather than row by row. In a data lake scenario, the raw data might be transformed into Parquet so that each column’s values are stored together inside the file. This allows the data to be queried more efficiently when filtering on the value in a single column, because the reader can simply skip the other unneeded columns.
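
A minimal sketch of writing and selectively reading Parquet with pandas (assuming pyarrow is installed); the file name and columns are made up:

```python
# A minimal Parquet sketch with pandas (requires pyarrow or fastparquet).
# File name and columns are hypothetical.
import pandas as pd

df = pd.DataFrame(
    {
        "user_id": [42, 7, 13],
        "status": [200, 500, 200],
        "path": ["/live", "/replay", "/schedule"],
    }
)

df.to_parquet("events.parquet")

# The columnar layout lets a reader pull only the columns it actually needs.
statuses = pd.read_parquet("events.parquet", columns=["status"])
```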

This article is by no means complete or a textbook. Its purpose is just to give you a quick understanding so that, as a software engineer, you can walk into these big data conversations with your company’s data engineers sporting a little confidence. That way you avoid that stupefying and perhaps embarrassing feeling of buzzword overload that I had in Atlanta!
