Reading from MongoDB with Azure Functions

As part of my Cardalog web app, I’m writing an API to interact with a MongoDB using Azure Functions. In the series introduction, I set up the app’s components. In this installment, I’ll implement a Function to read all cards from the database.

I’ve got the API and UI repositories out on GitHub. This Gist contains the complete implementation I’ll reference throughout this post.

Connecting to MongoDB

First up, install the Mongo driver package with func extensions install -p MongoDB.Driver -v 2.9.3.

Since I’m using Functions, I’m not retaining any state or resources like connections to the database. The connection is easy to spin up, so there’s just a little boilerplate needed.

var client = new MongoClient("mongodb://127.0.0.1:27017");
var db = client.GetDatabase("cardalog");
var coll = db.GetCollection<BsonDocument>("cards");

Once I mature the APIs a bit, I won’t use magic strings for configuration details and I’ll harden access rights. As long as I’m prototyping, though, I’ll do it the quick and dirty way.
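
When I do harden it, the first step will be reading the connection string from app settings instead of hard-coding it. Here's a minimal sketch; the setting name MongoConnectionString is a placeholder I made up, and you'd define it in local.settings.json locally and in the Function App settings in Azure.

// Hypothetical setting name; define it in local.settings.json / Application Settings.
var connectionString = Environment.GetEnvironmentVariable("MongoConnectionString")
  ?? "mongodb://127.0.0.1:27017"; // fall back to the local instance while prototyping
var client = new MongoClient(connectionString);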

coll is how I’ll interact with the cards collection. For my use case, this collection is similar to a relational DB table. There are some fundamental differences worth knowing. For now, just keep in mind that the collection doesn’t have a schema.

Seeding the database

Before going further, I’ll seed the database with some dummy data I created with Mockaroo. You can copy the schema for your own use, or just run curl "https://api.mockaroo.com/api/bb6e1fd0?count=1000&key=d0846c90" > "MtG.json" in a terminal.

Once you have the mocked data, use mongoimport -d cardalog -c cards .\MtG.json to write it to the cards collection.

Compass has an “import data” option but, for whatever reason, the data never appeared in my collection. The CLI command has always worked for me.

Be careful with the word “schema” in this context. I mentioned that Mongo is schemaless but, for Mockaroo to generate data, I had to define a schema. You can create a different Mockaroo schema to mock up data from a different card game and those can live side-by-side in the same collection.

Reading cards

This first read implementation will be sort of naive. I want it to respond with all cards in JSON format so I’ll just create a list of BsonDocument, convert it to JSON, and ship it back to the client.

var projection = Builders<BsonDocument>.Projection.Exclude("_id");
var cardsBson = await coll
  .Find(new BsonDocument())
  .Project(projection).ToListAsync();
var cards = new BsonArray(cardsBson);
return (ActionResult)new OkObjectResult(cards.ToJson());

I had some trouble deserializing the auto-generated _id so I used a projection to exclude that from the results. Then I used IMongoCollection<T>.Find with no filter specified to get everything back.

cardsBson is a List<BsonDocument>. I can’t hand that list straight back to the client as JSON, so I wrap it in a single BsonArray, then call ToJson, which serializes the array of cards as a valid JSON string.

I intend to have thousands of cards in here so I’m going to implement paging eventually. For now, I’m happy sending everything back.

Routing the read call

Before I can move on to implementing the write Function, I want to make a couple of tweaks to the request parameter’s decoration. The Function template allows GET and POST calls and doesn’t specify a route. This is all controlled by HttpTriggerAttribute.

[HttpTrigger(AuthorizationLevel.Anonymous, "get", "post", Route = null)]

If the route is left null, Functions will use the FunctionNameAttribute to create the URI. This first Function would then live at http://localhost:7071/api/ReadCards. I’m doing my best to adhere to REST conventions so I want to take the verb out.

Change null to "cards" and remove "post" because this Function is only for reading the collection. I now have this.

[HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "cards")]
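
Putting the connection code, the read logic, and the trigger together, the whole Function looks roughly like the sketch below. The Gist linked above has the authoritative version; this just assembles the snippets from this post into the standard HTTP-trigger template.

using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using MongoDB.Bson;
using MongoDB.Driver;

namespace Cardalog.Api
{
  public static class ReadCards
  {
    [FunctionName("ReadCards")]
    public static async Task<IActionResult> Run(
      [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "cards")] HttpRequest req,
      ILogger log)
    {
      // Spin up the connection; the Function doesn't hold onto it between invocations.
      var client = new MongoClient("mongodb://127.0.0.1:27017");
      var db = client.GetDatabase("cardalog");
      var coll = db.GetCollection<BsonDocument>("cards");

      // Exclude the auto-generated _id and fetch everything else.
      var projection = Builders<BsonDocument>.Projection.Exclude("_id");
      var cardsBson = await coll
        .Find(new BsonDocument())
        .Project(projection).ToListAsync();

      // Wrap the documents in a single BsonArray and serialize it to JSON.
      var cards = new BsonArray(cardsBson);
      return (ActionResult)new OkObjectResult(cards.ToJson());
    }
  }
}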

Test it out

The code’s in place so all that’s left is to give it a whirl. From the CLI, run the app with func host start. The runner will print the URI to any Functions in the project once they’re listening. This first one listens on ReadCards: [GET] http://localhost:7071/api/cards. Fire up a browser or your favorite tool for playing with HTTP requests (I like Postman) and send a GET request to http://localhost:7071/api/cards.

Recap

I’ve got my database seeded with dummy data and implemented a basic “read everything” Function which crams all of my cards into a JSON array. I’ll need to add paging at some point and my error handling is really primitive but I’m ready to create my first UI to display my cards.

Cardalog: Blazor, MongoDB, and Azure Functions

I’ve played collectible card games (CCGs) for over twenty years and, in that time, have gotten a lot of cards. My collection is a mess of abandoned cataloging implementations so finding any particular card is next to impossible.

Instead of relying on some complex physical storage scheme for my cards, I’m going to make an API to manage the data about my collection and bolt on other applications and products to do stuff with that data. All together, this is my Cardalog.

The application stack

I tried a few combinations of technologies before settling into Blazor WebAssembly for the web GUI, Azure Functions for a serverless API, and MongoDB for storage. By the end of this article, I’ll have installed all the necessary tech and I’ll be ready to start the real implementation.

While this is the stack I’m starting out with, I’m not going to paint myself into a corner with it. I’ve got a lot of ideas bouncing around about other potential front ends and some side projects I can implement once I’ve got a good volume of data.

Unless otherwise noted, I use Visual Studio Code and PowerShell as my primary development tools.

Blazor WebAssembly

Blazor is a framework for building SPAs using C# instead of JavaScript. WebAssembly (Wasm) provides the means for .NET code to run in browsers. A key advantage is that, after the initial download of the binaries from the server, the application runs client-side at near-native speed.

Blazor WebAssembly is in preview.
  • Get started by installing the .NET Core 3.1 preview SDK
  • Install the Blazor Wasm template at the terminal with dotnet new -i Microsoft.AspNetCore.Blazor.Templates::3.1.0-preview3.19555.2
  • Navigate to the directory you want the project to live in and execute the command dotnet new blazorwasm -o Cardalog
  • Run the application from the terminal with dotnet run

The Blazor Wasm template scaffolds a couple of things for you: a home page, a “counter” app, and a weather app which reads JSON from a local file. Check out the code and the site to see some of the basic concepts available. They’re a decent launchpad into working with Blazor but we’ll throw it all away very soon.

Azure Functions: a serverless API

There’s a lot to say about Functions Apps and serverless tech in general. While I love the managed product, the thing that really brings me back to Functions is that it forces me to think atomically. Instead of god classes and huge services, each Function in a Functions App is meant to serve one purpose. The first one I’ll create, for instance, is meant only to read all of the cards from my database.

  • First, install the Azure Functions extension for VS Code.
  • Open the command palette in Code (ctrl + shift + p) and run the Azure Functions: Create New Project... command. It’ll take you through an interactive process for the rest.
  • Choose your root directory (I chose the same directory the Cardalog GUI lives in) and create a folder called Cardalog.Api
  • Choose C# as the language unless you want to translate the API to another language.
  • Select HttpTrigger as the template for your first Function.
  • Name the Function ReadCards
  • Use the namespace Cardalog.Api
  • Select Anonymous for the access rights

We’ll be throwing out all of the scaffolded code when I go over the read implementation in my next article.

MongoDB

Lastly, install MongoDB Community Server and Compass. The server, by default, listens on port 27017 and requires no authentication. Open Compass and connect to it using those defaults (localhost, port 27017, no authentication).

After you’ve connected, find the green “CREATE DATABASE” button near the top left of the main window. Name the database cardalog and the first collection cards. Later in the series, we’ll seed the collection with some JSON.

Recap

By now, we’ve got the Blazor front end, Azure Functions API, and MongoDB created. None of the components are talking yet, but it doesn’t take a lot to get there.

Coming up, I’ll seed the database with some test data, write a Function which replies to the caller with those data, and write a new page in the UI to get the card data from the API.

Learning Databricks – Part 2 – Getting Data

There are a bunch of ways to make data available but, if you’re just starting out or building a pet project, you probably want to keep it simple. If you don’t have your own data to start with, check out Kaggle, get some datasets, and start cooking.

Small data sample

Databricks is powerful – it’s made to be used on large and complex data but that doesn’t mean it’s no good for tiny data. For illustration, my samples throughout this series will use the following tiny CSVs.

Items.csv

id,name,description
1,Widget,"Made from cheap plastic"
2,Gadget,

Pricing.csv

id,widget,amount,currency
1,1,0.25,USD
2,1,0.20,EUR
3,1,27,JPY
4,2,1.75,USD
5,2,1.40,EUR
6,2,189,JPY

Uploading the Data

Save those as Items.csv and Pricing.csv and upload them both.

The first stop is the Add Data button and Upload File interface. It’s a standard upload experience so I won’t cover it.

Once you’ve uploaded the file, it’ll give you the chance to create a table in the UI or a notebook. The notebook option spits out some boilerplate code which can create a persistent table from the file but it’s very basic. It’s a good idea to see the boilerplate but, even for a toy project, you’ll probably want to do a little more.

The important thing to remember is your files will be available at /FileStore/tables/Items.csv and /FileStore/tables/Pricing.csv.

Serializing the data

The files are now available via Databricks’ built-in storage but they still need to be read into language-level objects to be useful. If you checked the boilerplate, you saw one of the options for the reader was infer_schema.

This can be useful if you want to read in the file with very few assumptions and start tweaking the structure on the fly. I do this when I read in different datasets for every job and hand off the results to some other subsystem.

In this case, however, I’m going to use the same data across multiple notebooks using different languages.
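
For comparison, an inferred-schema read is only a few lines; this is roughly what the boilerplate boils down to, using the upload path from earlier.

# Quick exploratory read: let Spark guess the column types from the data.
items_inferred = spark.read.csv('/FileStore/tables/Items.csv',
                                header=True,
                                inferSchema=True)
items_inferred.printSchema()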

Schema specification

These schemas specify column names, types, and nullability. The pyspark.sql.types module provides the type classes; read up on it to see what else it offers.

from pyspark.sql.types import (StructType,
  StructField,
  StringType,
  IntegerType,
  DoubleType)

items_schema = StructType([
  StructField('ItemId', IntegerType(), nullable = False),
  StructField('Name', StringType(), nullable = False),
  StructField('Description', StringType(), nullable = True)
])

pricing_schema = StructType([
  StructField('PriceId', IntegerType(), nullable = False),
  StructField('ItemPriceId', IntegerType(), nullable = False),
  StructField('Amount', DoubleType(), nullable = False),
  StructField('Currency', StringType(), nullable = False)
])

Column specifications are positional. You could switch ‘PriceId’ and ‘ItemPriceId’ and you’d end up with misnamed columns. Even worse, Spark might try to cram a value into an incompatible type.

A schema object’s structure can be arbitrarily complex. In this case, both schemas are just a few columns of simple types, but it’s possible to embed structures within structures, which can model semi-structured data like JSON.
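
As a quick, hypothetical illustration of that nesting (not part of the Items or Pricing schemas), a JSON document like {"ItemId": 1, "Price": {"Amount": 0.25, "Currency": "USD"}} could be described with a StructType inside a StructField.

# Hypothetical nested schema for semi-structured data.
nested_schema = StructType([
  StructField('ItemId', IntegerType(), nullable = False),
  StructField('Price', StructType([
    StructField('Amount', DoubleType(), nullable = False),
    StructField('Currency', StringType(), nullable = False)
  ]), nullable = True)
])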

Creating DataFrames

The CSV files are next read into DataFrames. This step applies the schema to the dataset and creates a language-level object we can interact with.

item_df = spark.read.csv('/FileStore/tables/Items.csv',
  schema = items_schema,
  header = True)

price_df = spark.read.csv('/FileStore/tables/Pricing.csv',
  schema = pricing_schema,
  header = True)

There isn’t much magic at play. DataFrameReader can read a bunch of stuff in addition to CSVs. The first parameter is the location of the data to read, the second is the schema I defined, and the third indicates that the first row of the file is a header.

Creating Tables

Finally, the DataFrames know how to create tables out of their data.

item_df.write.format('parquet') \
  .mode('overwrite') \
  .saveAsTable('items')

price_df.write.format('parquet') \
  .mode('overwrite') \
  .saveAsTable('pricing')

That’s mostly self-describing but, for reference, the save modes (append, overwrite, error, ignore) are documented. Other available formats include JSON, text, CSV, and ORC.
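
As a sanity check, the persisted tables are queryable by name from any notebook attached to the cluster. Something like this joins the two using the column names from the schemas above.

# Join items to their prices using the saved table names.
display(spark.sql("""
  SELECT i.Name, p.Amount, p.Currency
  FROM items i
  JOIN pricing p ON p.ItemPriceId = i.ItemId
"""))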

Recap

I essentially took the raw data and moved it as-is into tables named items and pricing. From here I can perform transformations, create visualizations, and store the data in new tables.

Check out my series on ADLS Gen2 and Azure Databricks to see one example of using Databricks as one component in a larger system.

I published a complete notebook using large(ish) datasets with wide schemas. Databricks kills those links after six months so it’ll go down eventually. Rest assured the samples in this article cover all the same stuff and are much more readable.

Learning Databricks – Part 1 – A Brief Introduction

Databricks is one of my favorite tools. It’s a managed Apache Spark platform used for data engineering and data science. Before I dive in to using it, I’ll highlight some of the key features.

Jupyter is similar to Databricks in a lot of respects so, if you’re familiar with the former, I doubt this article will be informative. Keep an eye out for further installments in this series.

A full-fledged IDE

The Databricks web GUI is a first-class IDE. Files, folders, and dependencies are all navigable. Data connections, jobs, and clusters can be managed and inspected.

IDE navigation

There are separate Users and Shared areas in the workspace. When I’m working on a team project, I like to do my prototyping, debugging, and scratch pad stuff in my Users area and, once code is ready to go to a live environment, I put it in Shared.

This is an awesome feature for collaborative development and, conveniently, means you don’t need an IDE on your local machine. It also drastically reduces the chances of a “works on my machine” situation.

Notebooks

Notebooks are special files which the Databricks interface parses into command windows. The actual files are a combination of the code and markup you write in the GUI and metadata which, among other things, indicate how the commands are divided. In the notebook analogy, command windows are like pages.

A basic notebook example

Each notebook has a default programming language: one of Python, Scala, SQL, or R. A command window’s language can be overridden with a magic command like %sql on its first line.

You can’t switch languages in the middle of a command window.

Languages run on separate kernels. If you create a DataFrame (a handy data abstraction) in Python, it won’t be available to commands in other languages as-is.
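
A common workaround, sketched below, is to register a temporary view; views live in the Spark session, so every language in the notebook can query them.

# In a Python cell: build a DataFrame and expose it as a temp view.
df = spark.range(5)
df.createOrReplaceTempView("numbers")

# In a separate cell, switch languages with a magic command on its first line:
# %sql
# SELECT * FROM numbers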

Version control

As a bonus, notebooks have built-in version control and can be connected to Git repositories. This comes complete with diff views between any revision and its immediate predecessor.

I added line 2 to demonstrate the in-line diff view

The built-in version control capabilities are somewhat limited when compared to Git. I definitely recommend integrating with a Git repository for a mature and flexible process.

Try it out

If you’ve been thinking of trying it out, there’s a free community edition. It’s limited to a low-end cluster and doesn’t provide certain features like Git integration, but it has all the basic stuff needed to see if Databricks is right for you.

Next up

I’ll cover practical topics like acquiring and serializing/deserializing data, transformation, and analysis. I hope to get into machine learning and visualizations too. If you like a “come learn with me” experience, look out for those.

Scalability in Cosmos DB

Even when it’s sitting idle, Cosmos DB racks up charges. You set the throughput in Request Units (RUs) and that capacity is constantly provisioned so Cosmos never needs to “warm up”. Naturally, you get billed by the hour.

I’ve spent more money than I had to by manually provisioning the throughput or, more often, failing to scale down during off-peak hours. For my ETL to run effectively without running up the bill, I decided to set up some programmatic scaling.

Use case: scaling before large writes

My Cosmos consumers don’t need a lot of power, so I only need to ramp up the RUs for a roughly 90-minute window in the early morning when I throw a lot of writes at it. Altogether, I need the following so the throughput is only increased while it’s needed.

  • My load subsystem must be able to increase the throughput just before it starts loading.
  • It must be able to decrease the throughput as soon as it’s done loading.
  • Since Azure Data Factory orchestrates my ETL, it must be able to invoke the solution.

Cosmos APIs

I had a lot of choices for programmatically interacting with Cosmos including a RESTful API, the Azure CLI, PowerShell, and .NET. Look for other options in the Quickstarts in the official documentation.

Some of these, like PowerShell, can’t modify the throughput. Make sure to RTFM.

Azure Functions as a proxy

ADF has a first-class Azure Functions Activity. Functions are a lightweight compute product, which makes them perfect for my on-demand requirement. As a managed product, I don’t have to worry about configuration beyond choosing my language.

It was easiest for me to implement this in C#, so I chose the .NET library and created a .NET Core Function with an HTTP trigger.

Implementation

My implementation is stateless and is not coupled to any of my ETL subsystems. Microsoft.Azure.DocumentDB.Core is the only dependency not already bundled with the Functions App template. Ultimately, the app is plug-and-play.

Here are a couple of edited excerpts from the Function to show the most important parts. The repo is linked in the last section for your free use.

Communicating with Cosmos

Cosmos references are wired up in two stages. First, the DocumentClient (i.e. a reference to the Cosmos DB) is spun up using the Cosmos instance’s URL.

using (var client = new DocumentClient(
  new Uri(dbUrl),
  key,
  new ConnectionPolicy { UserAgentSuffix = " samples-net/3" }))
{
  // Make sure the database exists, grab the collection, then swap its offer.
  client.CreateDatabaseIfNotExistsAsync(new Database { Id = dbName }).Wait();
  var coll = CreateCollection(client).Result;
  ChangeThroughput(client, coll).Wait();
}

CreateCollection is a helper method which gets a reference to a collection within the DB.

async Task<DocumentCollection> CreateCollection(DocumentClient client) {
  return await client.CreateDocumentCollectionIfNotExistsAsync(
    UriFactory.CreateDatabaseUri(dbName),
    new DocumentCollection {
      Id = collectionName
    },
    new RequestOptions {
      OfferThroughput = throughput
    });
}

CreateDocumentCollectionIfNotExistsAsync is part of the Cosmos .NET API. As the name implies, it conditionally creates the collection. This is necessary even if you know the collection already exists – it’s how the Function gets a reference to the collection.

Modifying the offer

The throughput is changed first by getting the existing offer for the Cosmos DB, then replacing that offer with the throughput you want.

OfferV2.OfferType is always “Invalid”. It doesn’t mean the Function failed to fetch or replace the offer.

async Task ChangeThroughput(DocumentClient client,
  DocumentCollection simpleCollection)
{
  var offer = client.CreateOfferQuery().Where(o =>
    o.ResourceLink == simpleCollection.SelfLink)
    .AsEnumerable().Single();
  var newOffer = await client.ReplaceOfferAsync(
    new OfferV2(offer, throughput));
}

There’s some more stuff to glue it together and some error handling I’ve added. The repo layers all of that in.

Using the Function

You’ll need to send along five values in the request headers.

  • The name of the Cosmos DB
  • The name of the collection
  • The Cosmos’ URL
  • One of the read-write keys
  • The number of RUs you want to set.
    • This must be a multiple of 100.

The read-write key is transmitted in plain text in my current implementation. This is an obvious security flaw which I haven’t gotten around to fixing.
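
For reference, a call to the Function might look like the sketch below. The URL and header names are placeholders I chose to mirror the Function’s variable names; check the repo for the exact names it reads.

using System;
using System.Net.Http;
using System.Threading.Tasks;

class ScaleCosmosExample
{
  static async Task Main()
  {
    using var http = new HttpClient();

    // Placeholder URL and header names; match them to your deployed Function.
    var request = new HttpRequestMessage(
      HttpMethod.Post, "https://<function-app>.azurewebsites.net/api/<function-name>");
    request.Headers.Add("dbName", "my-cosmos-db");
    request.Headers.Add("collectionName", "my-collection");
    request.Headers.Add("dbUrl", "https://<cosmos-account>.documents.azure.com:443/");
    request.Headers.Add("key", "<read-write key>"); // sent in plain text, per the caveat above
    request.Headers.Add("throughput", "2000");      // must be a multiple of 100

    var response = await http.SendAsync(request);
    Console.WriteLine(response.StatusCode);
  }
}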

Source code

The source is on GitHub. Feel free to use it. If you have ideas for improvements, I’d love to hear them.

ADLS Gen2 and Azure Databricks – Part 4 – Mounting to DBFS

Now you know why I use Gen2 with Databricks, my struggle with service principals, and how I configure the connection between the two. I’m finally going to mount the storage account to the Databricks file system (DBFS) and show a couple of things I do once the mount is available.

Databricks Utilities

Save a few keystrokes by aliasing the file system utilities (dbfs = dbutils.fs) so you can call them directly. The method names will be familiar to those who work in the command line but fear not; it’s all fairly self-describing.

Mounting Gen2 storage

For my own sanity, I always unmount from a location before trying to mount to it.

# Alias the built-in Databricks file system utilities.
dbfs = dbutils.fs

def mount(source_url, configs, mount_point):
  try:
    dbfs.unmount(mount_point)
  except Exception:
    print("The mount point didn't have anything mounted to it.")
  finally:
    dbfs.mount(source=source_url,
               mount_point=mount_point,
               extra_configs=configs)

I went over creating the source URL and the configs map in the previous installment. The mount point is a *nix-style path specification; by convention, mounts live under /mnt. Tweak this method to manage your own exception handling, then call it for however many file systems you want to mount.
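
Calling it then looks something like this, borrowing the example names from the next section and the configs map from the previous installment.

# Example values only; substitute your own file system and storage account.
source_url = "abfss://awesome-vendor@transactionvendors001.dfs.core.windows.net/"
mount_point = "/mnt/awesome"

mount(source_url, configs, mount_point)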

Navigating mounted storage

The path names in DBFS won’t exactly match the paths in your storage account; they’re based on the name of the mount point instead. Say you’ve got a storage account called transactionvendors001, a file system named awesome-vendor, and a mount point called /mnt/awesome.

When navigating Gen2 in something like Azure Storage Explorer, your paths will be transactionvendors001/awesome-vendor/…/… but the equivalent in Databricks will be /mnt/awesome/…/…

A really nice feature is wildcard syntax in path specifications, which is basically a requirement for directories containing some arbitrary number of files. You might read raw JSON in using something like clients_df = spark.read.json('/mnt/awesome/clients/*.json', multiLine = True)

Series recap

I sang the praises of ADLS Gen2 storage combined with Azure Databricks, illustrated how I suck at service principals (and how to keep you from sucking at them), set up a boilerplate configuration for the Databricks handshake with storage accounts, and now actually mounted storage to DBFS.

This is all basically tax we’ve got to pay to get on the highway. It’s not fun and it doesn’t show off any of the awesome stuff Databricks can actually do per se. Stay tuned and I’ll shovel up some more exciting stuff from a novice perspective.

ADLS Gen2 and Azure Databricks – Part 3 – Spark Configuration

I went over why I use ADLS Gen2 with Databricks and how to set up a service principal to mediate permissions between them. Now we’ll configure the connection between Databricks and the storage account.

Config

This part is simple and mostly rinse-and-repeat. You need to set up a map of config values to use which identify the Databricks workspace with the storage account.

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.createRemoteFileSystemDuringInitialization": "true",
    "fs.azure.account.oauth2.client.id": "<SP application ID>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "gen2", key = "databricks workspace"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<AAD tenant ID>/oauth2/token"
}

There’s nothing too magical going on here. I’ve never had a reason to change the auth type, provider type, or createRemoteFileSystemDuringInitialization settings, though, honestly, I’ve never needed Databricks to tell the storage account to create the file system, so I could probably do away with that last one.

The client ID, client secret, and client endpoint settings might need some tweaks depending on how many service principals, tenants, and secrets you use. I work in the same tenant for everything and I use the same service principal for all of the assorted ETL stuff I work on. Lastly, I have a one-to-one relationship between secrets and Databricks workspaces. With that arrangement, I can use a single config map for all of my storage accounts.

Take the above with a grain of salt. I don’t know if this could be more secure. Make sure you understand your own requirements.

The fill-in-the-blanks stuff threw me for a loop the first time. I always have trouble figuring out which settings or text boxes want which IDs.

Client ID

"fs.azure.account.oauth2.client.id" is the application ID of your service principal.

  • Go to the Azure portal
  • Go to Azure Active Directory
  • Select “App registrations” in the left blade
  • Select your service principal from the list of registrations
  • Copy the “Application (client) ID”

You’ll need the “Application (client) ID” GUID

Client endpoint

"fs.azure.account.oauth2.client.endpoint" is the same basic URL every time, you just plug in your Azure Active Directory tenant ID where the code snippet specifies.

  • Go to the Azure portal
  • Go to Azure Active Directory
  • Select “Properties” in the left blade
  • Copy the “Directory ID”

You’ll end up with something like https://login.microsoftonline.com/c559656c-8c6a-43b2-8ac4-c7f169fca936/oauth2/token when you insert your Directory ID

Client secret

"fs.azure.account.oauth2.client.secret" is the secret you made in the Azure portal for the service principal. You can type this in clear text but even I know enough about security to tell you not to do that. Databricks secrets are really easy and might be the only turnkey way to do this.

Managing secrets is a whole process on its own. I’ll dive in to that soon.

Storage location

This was another “duh” moment for me once I sorted it out. My problem was figuring out the protocol for the URL.

"abfss://file-system@storageaccount.bfs.core.windows.net/"

  • You can use abfs instead of abfss if you want to forgo SSL
  • file-system is the name you gave a file system within a storage account
  • storageaccount is the name of the storage account file-system belongs to

Recap

This time, we created the Spark configuration. We didn’t persist or use the config yet: look for an upcoming installment on mounting storage in Databricks.

Doing, knowing, and grokking

“Grok” is a great word we’ve co-opted from a great novel. It adds some very important nuance when we talk about how intimately we understand what we’re doing.

This is why I write.

Script kiddies and code monkeys

First off, let me say that we’re all script kiddies and code monkeys. I wouldn’t believe anyone who says they’ve skipped this stage for everything they’ve done.

Many people won’t get past this point, and some don’t need or want to. They copy snippets and scripts and hope they work out of the box, without really grasping why any of it works.

This is a great place to be. Unless you’ve found a way to get paid only for knowing stuff (call me if you have), you’ll have to do something with what you know. On the flip side, you need a way to get from doing to knowing.

Knowing what you do

What the hell was I doing?

If there was a word cloud for my thoughts, this would probably be the biggest of them all. By a long shot.

I usually hit a downward spiral of reading commit messages, looking for documentation I generated, and rereading stuff I found online.

Eventually, I spend enough time circling this roundabout of rediscovery that I learn to get in and out more quickly. Just like real life, I usually have to go through the same traffic circle at least twice before I know for sure how best to navigate it.

Maybe I’m a bad driver. But anyway.

Once I can tighten up this loop of seeing a problem and solving it, I feel pretty confident that I finally know at least the fundamentals.

Grokking in fullness

Eventually, I can visualize specific traffic circles, recognize patterns, and move between the general and the specific. Being able to make accurate assumptions about traffic circles I’ve never seen is the first indicator that I’m grokking the problem.

Grokking is an intuitive understanding, often hard-won.

I’m a writer, I guess?

I write a lot of content. I probably write about what I do more than I actually…do what I do. I usually gain insights into what I’m writing about and I always get further into grokking it.

My naïve motivation is to get others to understand what I’ve done. Hopefully my work is self-descriptive enough that I don’t need to write volumes to explain what’s going on or why I think my approach is good or appropriate.

A mature approach is writing to teach about the underlying problem and how solutions can be approached. It’s a whole “teaching someone to fish” thing, I guess. If I’ve successfully taught an audience, I probably grok what I’m writing or talking about.

So here I am, writing to learn, to teach, and to entertain at least as much as dad jokes.

– Hey, I was thinking.

– I thought I smelled something burning!

ADLS Gen2 and Azure Databricks – Part 2 – Service Principals

In part one, I gave a quick justification for using ADLS Gen2 with Azure Databricks. I also promised some juicy details about getting these players to cooperate.

There’s a great guide out there and I’m not going to try to replace it. Instead, I want to supplement the guide. The rest of this series will focus on the problems I keep running into and how I get around them.

I’ll cover creating a service principal and a secret but I won’t go into how to add that secret to Databricks as part of this article.

Service principals

I’ve had a difficult relationship with service principals. This may not be the canonical vocab, but I think of it as creating an application identity that can be granted assorted roles in one or more Azure products. Products can then use that identity to perform handshakes with each other.

Create the service principal

  • Sign in to the Azure Portal
  • Navigate to Azure Active Directory
  • Select “App registrations” in the left blade
  • Select the “+ New registration” button in the top toolbar
  • Provide a name for the app
  • Select the “Register” button at the bottom of the form

“+ New registration” is near the top-center

Create a secret

You should’ve been redirected to the app’s Overview page after you registered it. If not, you can get there from the App registrations window in Azure Active Directory.

  • Click “Certificates & secrets” in the app’s blade
  • Click “+ New client secret”
  • Choose a description and expiration for the secret
  • Click “Add”
  • Copy the secret value
  • Add the secret to your Databricks workspace

If you don’t copy the secret before you navigate away from the secrets list, you won’t be able to retrieve it again.

You can store this secret any way you like. If you want a turnkey and secure method, consider using a password manager. Key Vault is a good option for some use cases but this isn’t one of them.

Assign a role for the service principal

The storage accounts need a role assignment granting the service principal Contributor access. Giving Databricks the secret isn’t enough.

I’ve had some trouble finding the correct role assignment. The guide I linked in the intro says Storage Blob Data Contributor is the correct role but that didn’t work for me.

Contributor may be too permissive for your business rules. I encourage you to investigate which roles will do the trick, if there are any others.

Recap

Service principals are a necessary evil to give certain Azure products an identity they can use with one another. Databricks can assume that identity using a secret generated by the service principal and, assuming the storage account has assigned the correct role to it, Databricks will then have access to that storage.

ADLS Gen2 and Azure Databricks – Part 1 – Overview

ADLS Gen2 combines the capabilities of Data Lake Storage Gen1 with Blob Storage. The really hot feature is hierarchical namespaces. When the feature is enabled, the data store can be treated like a file system. If you’ve used Blob Storage prior to Gen2, the directory-style layout will feel familiar.

There are a lot of integration options between other Azure products and Gen2 storage. I hope the title didn’t give it away, though, because I’m going to focus on integrating with Azure Databricks.

Why use Gen2?

Azure Databricks can integrate with a handful of data stores. Some are niche, like the Event Hubs connector. Others, like Azure SQL Database and Cosmos DB, are more general purpose. So why should you choose ADLS Gen2 storage over anything else?

Cost

First, it’s cheap. Redundancy is the primary driver behind spend, but region and file structure also affect the cost. Use the pricing calculator to estimate your own usage.

This is a fairly high-cost configuration. Storage is dirt cheap

Usability

Second, it behaves like a file system. This is particularly important when you want to work with the Databricks File System (DBFS). I’ll get into some juicy stuff on this topic later. Basically, you can treat your Gen2 storage as a drive mounted to your Databricks workspace. You’ll never want to toy with other connectors once you can say spark.read.json('/mnt/source/*.json', multiLine=True).

Drawbacks

Obviously there are drawbacks. For me, the problem is invariably configuration. I can spin up Databricks workspaces and all the storage accounts I want but, when I need them to communicate, I always have to relearn how to do it.

The official documentation, true to brand, is comprehensive. It will get you where you need to go but the article is dense. I’d rather not read a long how-to article when all I need is a reminder of that one sticking point and how to get around it.

To wit

Databricks is great. Treating Azure storage accounts like file systems is great. Azure Databricks documentation is great…until it isn’t.

I’ll be back with at least one article about what I keep getting stuck on. Spoiler alert: service principals are awful.