Doing, knowing, and grokking

“Grok” is a great word we’ve co-opted from a great novel. It adds some very important nuance when we talk about how intimately we understand what we’re doing.

This is why I write.

Script kiddies and code monkeys

First off, let me say that we’re all script kiddies and code monkeys. I wouldn’t believe anyone who says they’ve skipped this stage for everything they’ve done.

Many people won’t get past this point. Some don’t need or want to. These people do; they copy snippets and scripts and hope they work out of the box. They don’t really grasp why all of it works.

This is a great place to be. Unless you found a way to get paid only for knowing stuff (call me if you have), you’ll have to do something with what you know. On the flip side, you need a way to get from doing to knowing.

Knowing what you do

What the hell was I doing?

If there was a word cloud for my thoughts, this would probably be the biggest of them all. By a long shot.

I usually hit a downward spiral of reading commit messages, looking for documentation I generated, and rereading stuff I found online.

I don’t know anything yet

Eventually, I spend enough time in the traffic circle that I learn to get in and out more quickly. Just like real life, I usually have to go through the same circle at least twice before I know for sure how best to navigate it.

Maybe I’m a bad driver. But anyway.

Once I can tighten up this loop of seeing a problem and solving it, I feel pretty confident that I finally know at least the fundamentals.

Grokking in fullness

Eventually, I can visualize specific traffic circles, recognize patterns, and move between the general and the specific. Being able to make accurate assumptions about traffic circles I’ve never seen is the first indicator that I’m grokking the problem.

Grokking is an intuitive understanding, often hard-won.

I’m a writer, I guess?

I write a lot of content. I probably write about what I do more than I actually…do what I do. I usually gain insights into what I’m writing about and I always get further into grokking it.

My naïve motivation is to get others to understand what I’ve done. Hopefully my work is self-descriptive enough that I don’t need to write volumes to explain what’s going on or why I think my approach is good or appropriate.

A mature approach is writing to teach about the underlying problem and how solutions can be approached. It’s a whole “teaching someone to fish” thing, I guess. If I’ve successfully taught an audience, I probably grok what I’m writing or talking about.

So here I am, writing to learn, to teach, and to entertain at least as much as dad jokes.

– Hey, I was thinking.

– I thought I smelled something burning!

ADLS Gen2 and Azure Databricks – Part 2 – Service Principals

In part one, I gave a quick justification for using ADLS Gen2 with Azure Databricks. I also promised some juicy details about getting these players to cooperate.

There’s a great guide out there and I’m not going to try to replace it. Instead, I want to supplement the guide. The rest of this series will focus on the problems I keep running into and how I get around them.

I’ll cover creating a service principal and a secret but I won’t go into how to add that secret to Databricks as part of this article.

Service principals

I’ve had a difficult relationship with service principals. This may not be the canonical vocab, but I think of it as creating an application identity that can be granted assorted roles in one or more Azure products. Products can then use that identity to perform handshakes with each other.
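
In case it helps to see that handshake concretely, here’s a rough sketch of the OAuth client-credentials exchange a service principal makes possible. The tenant ID, application (client) ID, and secret are placeholders, and you’d never do this by hand for the Databricks integration; the connector handles it for you.

  import requests

  # Placeholder values: substitute your own tenant ID, application (client) ID,
  # and the client secret created below.
  tenant_id = "<tenant-id>"
  client_id = "<application-client-id>"
  client_secret = "<client-secret>"

  # The client-credentials grant: trade the app identity and its secret for a
  # token scoped to Azure Storage. This is the handshake Databricks performs
  # on your behalf when it talks to ADLS Gen2.
  response = requests.post(
      f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
      data={
          "grant_type": "client_credentials",
          "client_id": client_id,
          "client_secret": client_secret,
          "resource": "https://storage.azure.com/",
      },
  )
  response.raise_for_status()
  access_token = response.json()["access_token"]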

Create the service principal

  • Sign in to the Azure Portal
  • Navigate to Azure Active Directory
  • Select “App registrations” in the left blade
  • Select the “+ New registration” button in the top toolbar
  • Provide a name for the app
  • Select the “Register” button at the bottom of the form
“+ New registration” is near the top-center

Create a secret

You should’ve been redirected to the app’s Overview page after you registered it. If not, you can get there from the App registrations window in Azure Active Directory.

  • Click “Certificates & secrets” in the app’s blade
  • Click “+ New client secret”
  • Choose a description and expiration for the secret
  • Click “Add”
  • Copy the secret value
  • Add the secret to your Databricks workspace
If you don’t copy the secret before you navigate away from the secrets list, you won’t be able to retrieve it again.

You can store this secret any way you like. If you want a turnkey and secure method, consider using a password manager. Key Vault is a good option for some use cases but this isn’t one of them.

Assign a role for the service principal

The storage accounts need the service principal added as a Contributor under “Access control (IAM)”. Giving Databricks the secret isn’t enough.

I’ve had some trouble finding the correct role assignment. The guide I linked in the intro says Storage Blob Data Contributor is the correct role but that didn’t work for me.

Contributor may be too permissive for your business rules. I encourage you to investigate which roles will do the trick, if there are any others.

Recap

Service principals are a necessary evil to give certain Azure products an identity they can use with one another. Databricks can assume that identity using a secret generated for the service principal and, assuming the correct role has been assigned on the storage account, Databricks will then have access to that storage.
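
To make that concrete, here’s roughly what assuming the identity looks like from a Databricks notebook, following the guide linked in the intro. The storage account name, directory (tenant) ID, application (client) ID, and secret scope/key are all placeholders you’d fill in from your own setup.

  # Sketch of the Spark configuration that lets Databricks assume the service
  # principal for direct access to the storage account. Everything in angle
  # brackets is a placeholder.
  storage_account = "<storage-account-name>"
  tenant_id = "<directory-tenant-id>"
  application_id = "<application-client-id>"

  spark.conf.set(
      f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net",
      "OAuth")
  spark.conf.set(
      f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
      "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
  spark.conf.set(
      f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
      application_id)
  spark.conf.set(
      f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
      dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>"))
  spark.conf.set(
      f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
      f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")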

ADLS Gen2 and Azure Databricks – Part 1 – Overview

ADLS Gen2 combines the capabilities of Gen1 storage accounts with Blob Storage. The really hot feature is the hierarchical namespace. When the feature is enabled, the data store can be treated like a file system. If you’ve used Blob Storage prior to Gen2, the folder-style layout will feel familiar, except the directories are now real instead of simulated with name prefixes.

There are a lot of integration options between other Azure products and Gen2 storage. I hope the title didn’t give it away, though, because I’m going to focus on integrating with Azure Databricks.

Why use Gen2?

Azure Databricks can integrate with a handful of data stores. Some are niche, like the Event Hubs connector. Others, like Azure SQL Database and Cosmos DB, are more general purpose. So why should you choose ADLS Gen2 storage over anything else?

Cost

First, it’s cheap. Redundancy is the primary driver behind spend, but region and file structure also affect the cost. Use the pricing calculator to estimate your own usage.

This is a fairly high-cost configuration. Storage is dirt cheap

Usability

Second, it behaves like a file system. This is particularly important when you want to work with the Databricks File System (DBFS). I’ll get into some juicy stuff on this topic later. Basically, you can treat your Gen2 storage as a drive mounted to your Databricks workspace. You’ll never want to toy with other connectors once you can say spark.read.json('/mnt/source/*.json', multiLine=True).
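
Here’s a hedged sketch of what that mount looks like, using dbutils.fs.mount and the ABFS driver. The container, storage account, tenant ID, application ID, and secret scope/key are placeholders, and the extra_configs carry the service principal credentials, which is exactly the configuration I keep having to relearn.

  # Rough sketch: mount a Gen2 container once, then read it like a local path.
  # Everything in angle brackets is a placeholder.
  configs = {
      "fs.azure.account.auth.type": "OAuth",
      "fs.azure.account.oauth.provider.type":
          "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
      "fs.azure.account.oauth2.client.id": "<application-client-id>",
      "fs.azure.account.oauth2.client.secret":
          dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>"),
      "fs.azure.account.oauth2.client.endpoint":
          "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
  }

  dbutils.fs.mount(
      source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
      mount_point="/mnt/source",
      extra_configs=configs)

  # After mounting, Gen2 storage reads like any other directory.
  df = spark.read.json("/mnt/source/*.json", multiLine=True)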

Drawbacks

Obviously there are drawbacks. For me, the problem is invariably about configuration. I can spin up Databricks workspaces and all the storage accounts I want but, when I need them to talk to each other, I always have to relearn how to do it.

The official documentation, true to brand, is comprehensive. It will get you where you need to go but the article is dense. I’d rather not read a long how-to article when all I need is a reminder of that one sticking point and how to get around it.

To wit

Databricks is great. Treating Azure storage accounts like file systems is great. Azure Databricks documentation is great…until it isn’t.

I’ll be back with at least one article about what I keep getting stuck on. Spoiler alert: service principals are awful.