In part one, I gave a quick justification for using ADLS Gen2 with Azure Databricks. I also promised some juicy details about getting these players to cooperate.
There’s a great guide out there and I’m not going to try to replace it. Instead, I want to supplement the guide. The rest of this series will focus on the problems I keep running into and how I get around them.
Service principals
I’ve had a difficult relationship with service principals. This may not be the canonical vocabulary, but I think of a service principal as an application identity that can be granted assorted roles in one or more Azure products. Those products can then use that identity to perform handshakes with each other.
Create the service principal
- Sign in to the Azure Portal
- Navigate to Azure Active Directory
- Select “App registrations” in the left blade
- Select the “+ New registration” button in the top toolbar
- Provide a name for the app
- Select the “Register” button at the bottom of the form
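If you’d rather script the registration than click through the portal, the Azure CLI can do the same thing. A sketch, assuming the CLI is installed and you’ve signed in with `az login`; the app name is a placeholder:

```shell
# "databricks-adls-app" is a hypothetical name; use whatever fits your convention.
az ad app create --display-name "databricks-adls-app"

# Create the service principal for the registered app
# (substitute the appId returned by the previous command).
az ad sp create --id <app-id>
```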
Create a secret
You should’ve been redirected to the app’s Overview page after you registered it. If not, you can get there from the App registrations window in Azure Active Directory.
- Click “Certificates & secrets” in the app’s blade
- Click “+ New client secret”
- Choose a description and expiration for the secret
- Click “Add”
- Copy the secret value
- Add the secret to your Databricks workspace
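The secret steps above also have a CLI equivalent. A sketch, assuming you have the application (client) ID from the app’s Overview page:

```shell
# Generate a new client secret for the app registration.
# <app-id> is a placeholder for the application (client) ID.
az ad app credential reset --id <app-id> --years 1
# The JSON output includes the secret in the "password" field;
# copy it now, since you won't be able to view it again later.
```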
You can store this secret any way you like. If you want a turnkey and secure method, consider using a password manager. Key Vault is a good option for some use cases but this isn’t one of them.
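One way to get the secret into your workspace is a Databricks secret scope. A sketch using the legacy Databricks CLI; the scope and key names here are placeholders:

```shell
# Assumes the Databricks CLI is installed and configured
# (e.g. via `databricks configure --token`).
databricks secrets create-scope --scope adls-sp

# Prompts for the secret value interactively,
# so it never lands in your shell history.
databricks secrets put --scope adls-sp --key client-secret
```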
Assign a role to the service principal
Giving Databricks the secret isn’t enough. The storage accounts also need the service principal added under Access control (IAM) with the Contributor role.
Contributor may be too permissive for your business rules. I encourage you to investigate which roles will do the trick, if there are any others.
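The role assignment can also be scripted. A sketch with placeholder IDs; the commented-out alternative is the narrower, data-plane-only role worth evaluating:

```shell
# Grant the service principal a role on the storage account.
# All IDs and names below are placeholders.
az role assignment create \
  --assignee "<app-client-id>" \
  --role "Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"

# A narrower alternative to investigate:
#   --role "Storage Blob Data Contributor"
```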
Recap
Service principals are a necessary evil: they give certain Azure products an identity they can use with one another. Databricks can assume that identity using a secret generated for the service principal, and, provided the storage account has assigned it the correct role, Databricks then has access to that storage.