Datasets stored in Azure Blob Storage can be queried and analyzed in place using Python from Workbench (in a subsequent post I will show how to do the same with SAS code). This is a simple way to analyze large datasets without having to download them to your local machine. In this tech tip, I will show you how to use the DuckDB Python API to query and analyze parquet datasets stored in Azure Blob Storage. DuckDB's Python API is based on the Python DB API 2.0 specification and is designed to be simple and easy to use, while still providing a powerful set of features.
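If you have used other DB API 2.0 style database drivers, the connect/execute/fetch pattern will feel familiar. Here is a minimal sketch:

```python
import duckdb

# Connect, execute, fetch -- the standard DB API 2.0 pattern
conn = duckdb.connect(":memory:")  # in-memory database
result = conn.execute("SELECT 42 AS answer").fetchall()
print(result)  # [(42,)]
```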
Here is an example of how you can use the DuckDB Python API to query and analyze parquet datasets stored in Azure Blob Storage. This approach makes it easy to perform a first-pass analysis in place on the raw datasets, so that only the relevant data needs to be downloaded for further analysis.
First, install the DuckDB Python package using pip:
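```bash
pip install duckdb
```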
Here we are going to use the New York Taxi dataset, which is stored as parquet files in Azure Blob Storage, and query and analyze it with the DuckDB Python API. I downloaded the full Yellow Cab dataset for the year 2012 from the [NYC Taxi and Limousine Commission](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) and uploaded it to Azure Blob Storage. The dataset consists of 12 files, one for each month of the year.
```python
import duckdb
# Instantiate a DuckDB connection
conn = duckdb.connect(":memory:")
# Azure Blob Store Connection String
# Replace this with your Azure Blob Connection String
blob_connection_string = "<read azure blob connection string from secret file>"
# Install/Load Azure Extension and Authenticate
queries = [ "INSTALL azure;", "LOAD azure;"
f"CREATE SECRET secret1 ( TYPE AZURE, CONNECTION_STRING
'{blob_connection_string}' );"
]
# Execute the setup queries
for q in queries:
    conn.execute(q)
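# Alternative (an assumption based on the DuckDB Azure extension docs): instead
# of a connection string, you can authenticate with the Azure credential chain, e.g.
#   CREATE SECRET secret2 (TYPE AZURE, PROVIDER CREDENTIAL_CHAIN,
#                          ACCOUNT_NAME '<your storage account>');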
# Build the az:// path to the parquet files in Azure Blob Storage;
# the glob pattern reads all 12 monthly files as a single logical table
container_name = "<your container name>"  # <-- Replace this with your container name
table = f"'az://{container_name}/path/to/your/data/*.parquet'"  # <-- Replace this with the path to your parquet files
# Total Revenue in 2012 NYC Yellow Cab By Month
q = f"""
SELECT
datepart('month', tpep_pickup_datetime) as month,
sum(total_amount) as revenue,
count(*) as num_rides
FROM {table}
GROUP BY month
ORDER BY month
;
"""
# Fetch the result as a pandas DataFrame
df = conn.execute(q).fetch_df()
```
And here is the result of the query:
```
    month       revenue  num_rides
0       1  1.542250e+08   13058348
1       2  1.598778e+08   13361345
2       3  1.987735e+08   16146923
3       4  1.664712e+08   13427802
4       5  1.771515e+08   13965253
5       6  1.914485e+08   15096468
6       7  1.793094e+08   14379307
7       8  1.816333e+08   14381752
8       9  2.136541e+08   14546854
9      10  2.153781e+08   14522315
10     11  2.001106e+08   13776030
11     12  2.143429e+08   14696610
```
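As mentioned above, once a first pass like this has identified the data you actually need, you can download just that subset rather than the whole year. Here is a minimal sketch, reusing the `conn` and `table` variables from above; the March filter and the output file name are purely illustrative:

```python
# Copy only the relevant subset (here: March rides) from Azure Blob Storage
# to a local parquet file for further analysis
q = f"""
COPY (
    SELECT *
    FROM {table}
    WHERE datepart('month', tpep_pickup_datetime) = 3
) TO 'yellow_2012_march.parquet' (FORMAT PARQUET);
"""
conn.execute(q)
```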