Laurie Smith
(Cincinnati Children's Hospital Medical Center)
Big Data is everywhere and available to the public. It is easy for us to access this data by use of Application Programming Interface (API). From United States census data to FacebookTM and GoogleTM data, there is also much public data useful for studying Health Sciences. In addition to social media and other sites, healthdata.gov allows access via API to city, state, and country wide CoVID-19, Hospital-Acquired Infections, and Influenza Vaccination Rates data, amongst many other datasets. There are many health related datasets available to access through API. Analysis on this type of data can be incorporated in many studies and can be accessed very easily using simple base SAS® code.
Watch Big Public Data as presented by the author on the SAS Users YouTube Channel.
This paper came about due to my personal interest in learning how to use SAS® to access data from a database using API. Many processes are using this method to extract data. Many GoogleTM searches and SAS® papers later, I discovered I can access data available to the public via API using very simple Base SAS® code. I have previously written a SAS® papers that accessed data from a webpage by scraping the html code. The method investigated here seems to be a much easier process when available.
Below are a couple of examples of public data that can be accessed through API.
The United States Census Bureau data, like many sites that offer access to data via API, do offer some assistance for a novice by providing examples of how to build the url needed for access. Although it is open to the public, you must register as a developer and obtain an API key (http://api.census.gov/data/key_signup.html). Here is a link to a list of datasets available https://api.census.gov/data.html. The 2020 census data should be available soon, but currently there is data available from past years.
A dataset to access was selected randomly: “Population Estimates: Estimates by Age, Sex, Race, and Hispanic Origin from 2019”. Here’s what the data dictionary looks like, also available on the site:
Name |
Label |
Concept |
Required |
Attributes |
Limit |
Predicate Type |
Group |
AGE |
Age |
|
default displayed |
|
0 |
int |
N/A |
AGE_DESC |
Description of AGE values |
|
not required |
|
0 |
string |
N/A |
DATE_CODE |
Estimate Date |
|
default displayed |
|
0 |
int |
N/A |
DATE_DESC |
Description of DATE values |
|
not required |
|
0 |
string |
N/A |
for |
Census API FIPS 'for' clause |
Census API Geography Specification |
predicate-only |
0 |
fips-for |
N/A |
|
GEO_ID |
Geography |
not required |
NAME |
0 |
string |
N/A |
|
GEOCOMP |
GEO_ID Component |
default displayed |
0 |
string |
N/A |
||
HISP |
Hispanic Origin |
default displayed |
0 |
int |
N/A |
||
in |
Census API FIPS 'in' clause |
Census API Geography Specification |
predicate-only |
0 |
fips-in |
N/A |
|
LASTUPDATE |
Last Update |
not required |
0 |
string |
N/A |
||
NATION |
Geography |
not required |
0 |
(not a predicate) |
N/A |
||
POP |
Population |
not required |
0 |
(not a predicate) |
N/A |
||
RACE |
Race |
default displayed |
0 |
int |
N/A |
||
SEX |
Sex |
default displayed |
0 |
int |
N/A |
||
STATE |
Geography |
not required |
0 |
(not a predicate) |
N/A |
||
SUMLEVEL |
Summary Level code |
not required |
0 |
string |
N/A |
||
ucgid |
Uniform Census Geography Identifier clause |
Census API Geography Specification |
predicate-only |
0 |
ucgid |
N/A |
|
UNIVERSE |
Population Universe |
not required |
0 |
string |
N/A |
Table 1. Data Dictionary for Population Estimates: Estimates by Age, Sex, Race, and Hispanic Origin from 2019
Here is what the SAS code could look like:
filename cens temp;
Define url with variables to include in resulting table (this table is limited to the bolded variables, also where the key is used):
proc http
url='https://api.census.gov/data/2019/pep/charage?get=POP,AGE,AGE_DESC,DATE_CODE,DATE_DESC,HISP,RACE,SEX&for=us:*&key=05520f15ee12351332aedee15265113e5bd6e88c'
method="GET"
out=cens;
run;
Assign a JSON library to the HTTP response:
libname resp JSON fileref=cens;
Copy datasets from JSON library to work:
proc datasets library=resp;
copy out=work;
run;
quit;
The resulting dataset is named ‘Root’ with 132192 observations (there is also a long version of data created named ‘AllData’).
Figure 1. Dataset root.sas7bdat
Figure 2. Dataset alldata.sas7bdat
The SAS® dataset ‘Root’ is not in the format expected but additional simple SAS® code can be used to manipulate the structure to give a more useful dataset. The following code can be added and executed to transform the data into a more usable format:
Use AllData.sas7bdat to help rename variables that correspond to the variable names in the first row of Root:
proc sort nodupkey data=alldata out=alldataX; by P1; run;
proc sql noprint ;
select catx('=',P1,Value)
into :rename separated by ' '
from alldataX ;
quit;
Reformat variables that are character that should be numeric:
data excensus (drop=ordinal_root us iage ipop);
set root (firstobs=2
rename=(&rename pop=ipop age=iage));
pop=input(strip(trim(ipop)),best12.);
age=input(strip(trim(iage)),best12.);
run;
Clear the response library reference:
libname resp clear;
The resulting dataset has variables defined as expected in the data dictionary:
Figure 3. Dataset root.sas7bdat
The United States Department of Health & Human Services provides health data accessible through API. Unlike the US Census Bureau data, no API Key is needed to access datasets. Here is a link to a listing of the datasets available https://healthdata.gov/browse.
Again, a dataset to access was selected randomly: “Deaths involving coronavirus disease 2019 (COVID-19) by race and Hispanic origin group and age, by state”. Here’s what the data dictionary looks like, also available on the site:
Column Name |
Description |
Type |
Data as of |
Date of analysis |
Date & Time |
Start Date |
First date of data period |
Date & Time |
End Date |
Last date of data period |
Date & Time |
State |
Jurisdiction of occurrence |
Plain Text |
Age group |
Age group |
Plain Text |
Race and Hispanic Origin Group |
Race or Hispanic Origin Group |
Plain Text |
COVID-19 Deaths |
Deaths involving COVID-19 (ICD-code U07.1) |
Number |
Total Deaths |
Deaths from all causes of death |
Number |
Pneumonia Deaths |
Pneumonia Deaths (ICD-10 codes J12.0-J18.9) |
Number |
Pneumonia and COVID-19 Deaths |
Deaths with Pneumonia and COVID-19 (ICD-10 codes J12.0-J18.9 and U07.1) |
Number |
Influenza Deaths |
Influenza Deaths (ICD-10 codes J09-J11) |
Number |
Pneumonia, Influenza, or COVID-19 Deaths |
Deaths with Pneumonia, Influenza, or COVID-19 (ICD-10 codes U07.1 or J09-J18.9) |
Number |
Footnote |
Suppressed counts (1-9) |
Plain Text |
Table 2. Data Dictionary for Deaths involving coronavirus disease 2019 (COVID-19) by race and Hispanic origin group and age, by state
Here is what the very simple SAS code could look like:
filename hdata temp;
Define url to access the data. This time the url is much less complicated importing all defined variables from the data dictionary:
proc http
url='https://data.cdc.gov/resource/ks3g-spdg.json'
method= "GET"
out=hdata;
run;
Assign a JSON library to the HTTP response:
libname hresp JSON fileref=hdata;
Copy datasets from JSON library to work:
proc datasets library=hresp;
copy out=work;
run;
quit;
Clear the response library reference:
libname hresp clear;
The resulting dataset is named ‘Root’ with 1000 observations (there is also a long version of data created named ‘AllData’). The only structural changes that may be needed are conversion of some variables from character to numeric).
Figure 5. Dataset root.sas7bdat
There are many sources of big public data that can be accessed using very simple base SAS® code through an API, requiring very little manipulation of the data obtained for use in analysis.
Henry, Joseph. 2020. “REST Just Got Easy with SAS® and PROC HTTP”, Virtual: SAS Global Forum.
Available at https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/4426-2020.pdf.
de Walick, Laurent. 2020. “A Beginners Guide to Consuming RESTful Web Services in SAS®”, Virtual: SAS Global Forum.
Available at https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/4209-2020.pdf.
United States Department of Health & Human Services. 2021. “HealthData.gov” Accessed April 15, 2021. https://healthdata.gov
Marr, Bernard. Feb 12, 2016. “Big Data: 33 Brilliant And Free Data Sources Anyone Can Use” Accessed April 15, 2021. https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-...
United States Census Bureau. 2021. “Available APIs” Accessed April 15, 2021. https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-...
Your comments and questions are valued and encouraged. Contact the author at:
Laurie Smith
Cincinnati Children’s Hospital Medical Center
Laurie.Bishop@cchmc.org
/*United States Census Data*/
filename cens temp;
/*Define api with variables to incllude in resulting table*/
proc http
url='https://api.census.gov/data/2019/pep/charage?get=POP,AGE,AGE_DESC,DATE_CODE,DATE_DESC,HISP,RACE,SEX&for=us:*&key=05520f15ee12351332aedee15265113e5bd6e88c'
method= "GET"
out=cens;
run;
/* Assign a JSON library to the HTTP response */
libname resp JSON fileref=cens;
/*Copy datasets from JSON library to work*/
proc datasets library=resp;
copy out=work;
run;
quit;
/*Use AllData to help rename variables given variable names in the first row*/
proc sort nodupkey data=alldata out=alldataX; by P1; run;
proc sql noprint ;
select catx('=',P1,Value)
into :rename separated by ' '
from alldataX ;
quit;
/*Format variables that should be numeric and drop variable name observation*/
data excensus (drop=ordinal_root us iage ipop);
set root (firstobs=2
rename=(&rename pop=ipop age=iage));
pop=input(strip(trim(ipop)),best12.);
age=input(strip(trim(iage)),best12.);
run;
libname resp clear;
/*HealthData.gov*/
filename hdata temp;
/*Define url to access data*/
proc http
url='https://data.cdc.gov/resource/ks3g-spdg.json'
method= "GET"
out=hdata;
run;
/* Assign a JSON library to the HTTP response */
libname hresp JSON fileref=hdata;
/*Copy datasets from JSON library to work*/
proc datasets library=hresp;
copy out=work;
run;
quit;
libname hresp clear;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Ready to level-up your skills? Choose your own adventure.