ADLS Gen2 combines the capabilities of Gen1 storage accounts with Blob Storage. The really hot feature is hierarchical namespaces. When the feature is enabled, the data store can be treated like a file system. If you've used Blob Storage prior to Gen2, you'll recognize the folder-like paths; the difference is that with a hierarchical namespace the directories are real, not just prefixes in blob names.
There are a lot of options for integrating Gen2 storage with other Azure products. I hope the title didn't give it away, but I'm going to focus on integrating with Azure Databricks.
Why use Gen2?
Azure Databricks can integrate with a handful of data stores. Some are niche, like the Event Hubs connector. Others, like Azure SQL Database and Cosmos DB, are more general purpose. So why should you choose ADLS Gen2 storage over anything else?
Cost
First, it's cheap. Redundancy is the primary driver behind spend but, even if you opt for geo-redundant replication, you'll pay a fraction of what a managed database service costs.
Usability
Second, it behaves like a file system. This is particularly important when you want to work with the Databricks File System (DBFS). I'll get into some juicy stuff on this topic later. Basically, you can treat your Gen2 storage as a drive mounted to your Databricks workspace. You'll never want to toy with other connectors once you can say spark.read.json('/mnt/source/*.json', multiLine=True).
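For future me (and you), here's roughly what wiring that up looks like. This is a minimal sketch, assuming you've already created a service principal with access to the storage account; everything in angle brackets is a placeholder, not a real value.

# Mount ADLS Gen2 to DBFS using a service principal (OAuth).
# dbutils is provided automatically inside Databricks notebooks.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    # Keep the client secret in a Databricks secret scope, not in the notebook.
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/source",
    extra_configs=configs,
)

Once the mount exists, every cluster in the workspace sees /mnt/source as if it were a local directory, and the one-liner above just works.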
Drawbacks
Obviously there are drawbacks. For me, the problem is invariably about configuration. I can spin up Databricks workspaces and all the storage accounts I want but, when I need them to talk to each other, I always have to relearn how to do it.
The official documentation, true to brand, is comprehensive. It will get you where you need to go, but the article is dense. I'd rather not read a long how-to article when all I need is a reminder of that one sticking point and how to get around it.
To wit
Databricks is great. Treating Azure storage accounts like file systems is great. Azure Databricks documentation is great…until it isn’t.
I’ll be back with at least one article about what I keep getting stuck on. Spoiler alert: service principals are awful.