ADLS Gen2 and Azure Databricks – Part 4 – Mounting to DBFS

Now you know why I use Gen2 with Databricks, my struggle with service principals, and how I configure the connection between the two. I’m finally going to mount the storage account to the Databricks file system (DBFS) and show a couple of things I do once the mount is available.

Databricks Utilities

Save a few keystrokes and access the file system utilities in dbutils directly through a dbfs alias. The method names will be familiar to anyone who works in the command line, but fear not; it’s all fairly self-describing.
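
Something like this at the top of a notebook does the trick (assuming dbutils is in scope, which it is in any Databricks notebook):

dbfs = dbutils.fs  # alias the file system utilities so later calls are shorter

# Quick sanity checks: list the DBFS root and any existing mounts.
display(dbfs.ls("/"))
display(dbfs.mounts())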

Mounting Gen2 storage

For my own sanity, I always unmount from a location before trying to mount to it.

def mount(source_url, configs, mount_point):
  # Clear out anything already mounted at this mount point, then (re)mount it.
  try:
    dbfs.unmount(mount_point)
  except Exception:
    print("The mount point didn't have anything mounted to it.")
  finally:
    dbfs.mount(source=source_url,
               mount_point=mount_point,
               extra_configs=configs)

I went over creating the source URL and the configs map in the previous installment. The mount point is a *nix-style path specification; by convention the root is /mnt. Tweak this method to handle exceptions however you prefer, then call it for however many file systems you want to mount, as in the sketch below.
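
For example, borrowing the storage account and file system names from the next section (they’re placeholders; configs is the map from the previous installment):

mount(source_url="abfss://awesome-vendor@transactionvendors001.dfs.core.windows.net/",
      configs=configs,            # the config map built in the previous installment
      mount_point="/mnt/awesome")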

Navigating mounted storage

The path names in DBFS won’t exactly match the paths in your storage account, which is a side effect of whatever you name the mount point. Say you’ve got a storage account called transactionvendors001, a file system named awesome-vendor, and a mount point called /mnt/awesome.

When navigating Gen2 in something like Azure Storage Explorer, your paths will be transactionvendors001/awesome-vendor/…/… but the equivalent in Databricks will be /mnt/awesome/…/…
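
Listing the mount is an easy way to confirm the mapping (this assumes the dbfs alias and the mount from above):

# The storage account and file system names don't appear anywhere in the DBFS path.
display(dbfs.ls("/mnt/awesome/"))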

A really nice feature is wildcard syntax in path specifications, which is basically a requirement for directories containing some arbitrary number of files. You might read in raw JSON using something like clients_df = spark.read.json('/mnt/awesome/clients/*.json', multiLine=True)

Series recap

I sang the praises of ADLS Gen2 storage combined with Azure Databricks, illustrated how I suck at service principals (and how to keep you from sucking at them), set up a boilerplate configuration for the Databricks handshake with storage accounts, and now actually mounted storage to DBFS.

This is all basically a tax we’ve got to pay to get on the highway. It’s not fun and it doesn’t show off any of the awesome stuff Databricks can actually do per se. Stay tuned and I’ll shovel up some more exciting stuff from a novice perspective.

ADLS Gen2 and Azure Databricks – Part 3 – Spark Configuration

I went over why I use ADLS Gen2 with Databricks and how to set up a service principal to mediate permissions between them. Now we’ll configure the connection between Databricks and the storage account.

Config

This part is simple and mostly rinse-and-repeat. You need to set up a map of config values that identifies the Databricks workspace to the storage account.

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.createRemoteFileSystemDuringInitialization": "true",
    "fs.azure.account.oauth2.client.id": "<SP application ID>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "gen2", key = "databricks workspace"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<AAD tenant ID>/oauth2/token"
}

There’s nothing too magical going on here. I’ve never had a reason to change the auth type, provider type, or createRemoteFileSystemDuringInitialization settings. Honestly, I’ve never needed Databricks to tell the storage account to create the file system, so I could probably do away with that last one.

The client ID, secret, and endpoint values might need some tweaks depending on how many service principals, tenants, and secrets you use. I work in the same tenant for everything and use the same service principal for all of the assorted ETL stuff I work on. Lastly, I have a one-to-one relationship between secrets and Databricks workspaces. With that arrangement, I can use a single config map for all of my storage accounts.

Take the above with a grain of salt. I don’t know if this could be more secure. Make sure you understand your own requirements.

The fill-in-the-blanks stuff threw me for a loop the first time. I always have trouble figuring out which settings or text boxes want which IDs.

Client ID

"fs.azure.account.oauth2.client.id" is the application ID of your service principal.

  • Go to the Azure portal
  • Go to Azure Active Directory
  • Select “App registrations” in the left blade
  • Select your service principal from the list of registrations
  • Copy the “Application (client) ID”

Client endpoint

"fs.azure.account.oauth2.client.endpoint" is the same basic URL every time, you just plug in your Azure Active Directory tenant ID where the code snippet specifies.

  • Go to the Azure portal
  • Go to Azure Active Directory
  • Select “Properties” in the left blade
  • Copy the “Directory ID”
You’ll end up with something like https://login.microsoftonline.com/c559656c-8c6a-43b2-8ac4-c7f169fca936/oauth2/token when you insert your Directory ID
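
If you’d rather not splice the URL together by hand, a tiny snippet keeps the template obvious (the GUID is the sample one above, not a real tenant):

tenant_id = "c559656c-8c6a-43b2-8ac4-c7f169fca936"  # your directory/tenant ID goes here
endpoint = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"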

Client secret

"fs.azure.account.oauth2.client.secret" is the secret you made in the Azure portal for the service principal. You can type this in clear text but even I know enough about security to tell you not to do that. Databricks secrets are really easy and might be the only turnkey way to do this.

Managing secrets is a whole process on its own. I’ll dive into that soon.

Storage location

This was another “duh” moment for me once I sorted it out. My problem was figuring out the protocol for the URL.

"abfss://file-system@storageaccount.bfs.core.windows.net/"

  • You can use abfs instead of abfss if you want to forego SSL
  • file-system is the name you gave a file system within a storage account
  • storageaccount is the name of the storage account file-system belongs to
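
A quick sketch of assembling the URL from its parts, with placeholder names:

file_system = "awesome-vendor"             # placeholder file system name
storage_account = "transactionvendors001"  # placeholder storage account name

# abfss is the TLS-secured flavor of the ABFS driver; drop the trailing "s" for plain abfs.
source_url = f"abfss://{file_system}@{storage_account}.dfs.core.windows.net/"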

Recap

This time, we created the Spark configuration. We didn’t persist or use the config yet: look for an upcoming installment on mounting storage in Databricks.

ADLS Gen2 and Azure Databricks – Part 2 – Service Principals

In part one, I gave a quick justification for using ADLS Gen2 with Azure Databricks. I also promised some juicy details about getting these players to cooperate.

There’s a great guide out there and I’m not going to try to replace it. Instead, I want to supplement the guide. The rest of this series will focus on the problems I keep running into and how I get around them.

I’ll cover creating a service principal and a secret but I won’t go into how to add that secret to Databricks as part of this article.

Service principals

I’ve had a difficult relationship with service principals. This may not be the canonical vocab, but I think of it as creating an application identity that can be granted assorted roles in one or more Azure products. Products can then use that identity to perform handshakes with each other.

Create the service principal

  • Sign in to the Azure Portal
  • Navigate to Azure Active Directory
  • Select “App registrations” in the left blade
  • Select the “+ New registration” button in the top toolbar
  • Provide a name for the app
  • Select the “Register” button at the bottom of the form

Create a secret

You should’ve been redirected to the app’s Overview page after you registered it. If not, you can get there from the App registrations window in Azure Active Directory.

  • Click “Certificates & secrets” in the app’s blade
  • Click “+ New client secret”
  • Choose a description and expiration for the secret
  • Click “Add”
  • Copy the secret value
  • Add the secret to your Databricks workspace
If you don’t copy the secret before you navigate away from the secrets list, you won’t be able to retrieve it again.

You can store this secret any way you like. If you want a turnkey and secure method, consider using a password manager. Key Vault is a good option for some use cases but this isn’t one of them.

Assign a role for the service principal

The storage accounts need the service principal added to their access control (IAM) role assignments as a Contributor. Giving Databricks the secret isn’t enough.

I’ve had some trouble finding the correct role assignment. The guide I linked in the intro says Storage Blob Data Contributor is the correct role but that didn’t work for me.

Contributor may be too permissive for your business rules. I encourage you to investigate which roles will do the trick, if there are any others.

Recap

Service principals are a necessary evil to give certain Azure products an identity they can use with one another. Databricks can assume that identity using a secret generated for the service principal and, as long as the storage account has the correct role assigned to that principal, Databricks will then have access to the storage.

ADLS Gen2 and Azure Databricks – Part 1 – Overview

ADLS Gen2 combines the capabilities of Gen1 storage accounts with Blob Storage. The really hot feature is hierarchical namespaces: when the feature is enabled, the data store can be treated like a file system. If you’ve used Blob Storage prior to Gen2, that style of navigation will feel familiar.

There are a lot of integration options between other Azure products and Gen2 storage. I hope the title didn’t give it away, though, because I’m going to focus on integrating with Azure Databricks.

Why use Gen2?

Azure Databricks can integrate with a handful of data stores. Some are niche, like the Event Hubs connector. Others, like Azure SQL Database and Cosmos DB, are more general purpose. So why should you choose ADLS Gen2 storage over anything else?

Cost

First, it’s cheap. Redundancy is the primary driver behind spend, but region and file structure also affect the cost. Use the pricing calculator to estimate your own usage.

Even a fairly high-cost configuration in the pricing calculator comes out dirt cheap.

Usability

Second, it behaves like a file system. This is particularly important when you want to work with the Databricks File System (DBFS). I’ll get into some juicy stuff on this topic later. Basically, you can treat your Gen2 storage as a drive mounted to your Databricks workspace. You’ll never want to toy with other connectors once you can say spark.read.json('/mnt/source/*.json', multiLine=True).

Drawbacks

Obviously there are drawbacks. For me, the problem is invariably about configuration. I can spin up Databricks workspaces and all the storage accounts I want but, when I need them to communicate, I always have to relearn how to do it.

The official documentation, true to brand, is comprehensive. It will get you where you need to go but the article is dense. I’d rather not read a long how-to article when all I need is a reminder of that one sticking point and how to get around it.

To wit

Databricks is great. Treating Azure storage accounts like file systems is great. Azure Databricks documentation is great…until it isn’t.

I’ll be back with at least one article about what I keep getting stuck on. Spoiler alert: service principals are awful.