Paper 1169-2021
Authors
Laurie Smith
(Cincinnati Children's Hospital Medical Center)
Abstract
Big Data is everywhere and available to the public. It is easy for us to access this data by use of Application Programming Interface (API). From United States census data to Facebook TM and Google TM data, there is also much public data useful for studying Health Sciences. In addition to social media and other sites, healthdata.gov allows access via API to city, state, and country wide CoVID-19, Hospital-Acquired Infections, and Influenza Vaccination Rates data, amongst many other datasets. There are many health related datasets available to access through API. Analysis on this type of data can be incorporated in many studies and can be accessed very easily using simple base SAS ® code.
Watch the presentation
Watch Big Public Data as presented by the author on the SAS Users YouTube Channel.
INTRODUCTION
This paper came about due to my personal interest in learning how to use SAS® to access data from a database using API. Many processes are using this method to extract data. Many GoogleTM searches and SAS® papers later, I discovered I can access data available to the public via API using very simple Base SAS® code. I have previously written a SAS® papers that accessed data from a webpage by scraping the html code. The method investigated here seems to be a much easier process when available.
Below are a couple of examples of public data that can be accessed through API.
UNITED STATES CENSUS
The United States Census Bureau data, like many sites that offer access to data via API, do offer some assistance for a novice by providing examples of how to build the url needed for access. Although it is open to the public, you must register as a developer and obtain an API key (http://api.census.gov/data/key_signup.html). Here is a link to a list of datasets available https://api.census.gov/data.html. The 2020 census data should be available soon, but currently there is data available from past years.
A dataset to access was selected randomly: “Population Estimates: Estimates by Age, Sex, Race, and Hispanic Origin from 2019”. Here’s what the data dictionary looks like, also available on the site:
Name
Label
Concept
Required
Attributes
Limit
Predicate Type
Group
AGE
Age
default displayed
0
int
N/A
AGE_DESC
Description of AGE values
not required
0
string
N/A
DATE_CODE
Estimate Date
default displayed
0
int
N/A
DATE_DESC
Description of DATE values
not required
0
string
N/A
for
Census API FIPS 'for' clause
Census API Geography Specification
predicate-only
0
fips-for
N/A
GEO_ID
Geography
not required
NAME
0
string
N/A
GEOCOMP
GEO_ID Component
default displayed
0
string
N/A
HISP
Hispanic Origin
default displayed
0
int
N/A
in
Census API FIPS 'in' clause
Census API Geography Specification
predicate-only
0
fips-in
N/A
LASTUPDATE
Last Update
not required
0
string
N/A
NATION
Geography
not required
0
(not a predicate)
N/A
POP
Population
not required
0
(not a predicate)
N/A
RACE
Race
default displayed
0
int
N/A
SEX
Sex
default displayed
0
int
N/A
STATE
Geography
not required
0
(not a predicate)
N/A
SUMLEVEL
Summary Level code
not required
0
string
N/A
ucgid
Uniform Census Geography Identifier clause
Census API Geography Specification
predicate-only
0
ucgid
N/A
UNIVERSE
Population Universe
not required
0
string
N/A
Table 1. Data Dictionary for Population Estimates: Estimates by Age, Sex, Race, and Hispanic Origin from 2019
Here is what the SAS code could look like:
filename cens temp;
Define url with variables to include in resulting table (this table is limited to the bolded variables, also where the key is used):
proc http
url='https://api.census.gov/data/2019/pep/charage?get=POP,AGE,AGE_DESC,DATE_CODE,DATE_DESC,HISP,RACE,SEX&for=us:*&key=05520f15ee12351332aedee15265113e5bd6e88c'
method="GET"
out=cens;
run;
Assign a JSON library to the HTTP response:
libname resp JSON fileref=cens;
Copy datasets from JSON library to work:
proc datasets library=resp;
copy out=work;
run;
quit;
The resulting dataset is named ‘Root’ with 132192 observations (there is also a long version of data created named ‘AllData’).
Figure 1. Dataset root.sas7bdat
Figure 2. Dataset alldata.sas7bdat
The SAS ® dataset ‘Root’ is not in the format expected but additional simple SAS ® code can be used to manipulate the structure to give a more useful dataset. The following code can be added and executed to transform the data into a more usable format:
Use AllData.sas7bdat to help rename variables that correspond to the variable names in the first row of Root:
proc sort nodupkey data=alldata out=alldataX; by P1; run;
proc sql noprint ;
select catx('=',P1,Value)
into :rename separated by ' '
from alldataX ;
quit;
Reformat variables that are character that should be numeric:
data excensus (drop=ordinal_root us iage ipop);
set root (firstobs=2
rename=(&rename pop=ipop age=iage));
pop=input(strip(trim(ipop)),best12.);
age=input(strip(trim(iage)),best12.);
run;
Clear the response library reference:
libname resp clear;
The resulting dataset has variables defined as expected in the data dictionary:
Figure 3. Dataset root.sas7bdat
HEALTHDATA.GOV
The United States Department of Health & Human Services provides health data accessible through API. Unlike the US Census Bureau data, no API Key is needed to access datasets. Here is a link to a listing of the datasets available https://healthdata.gov/browse.
Again, a dataset to access was selected randomly: “Deaths involving coronavirus disease 2019 (COVID-19) by race and Hispanic origin group and age, by state”. Here’s what the data dictionary looks like, also available on the site:
Column Name
Description
Type
Data as of
Date of analysis
Date & Time
Start Date
First date of data period
Date & Time
End Date
Last date of data period
Date & Time
State
Jurisdiction of occurrence
Plain Text
Age group
Age group
Plain Text
Race and Hispanic Origin Group
Race or Hispanic Origin Group
Plain Text
COVID-19 Deaths
Deaths involving COVID-19 (ICD-code U07.1)
Number
Total Deaths
Deaths from all causes of death
Number
Pneumonia Deaths
Pneumonia Deaths (ICD-10 codes J12.0-J18.9)
Number
Pneumonia and COVID-19 Deaths
Deaths with Pneumonia and COVID-19 (ICD-10 codes J12.0-J18.9 and U07.1)
Number
Influenza Deaths
Influenza Deaths (ICD-10 codes J09-J11)
Number
Pneumonia, Influenza, or COVID-19 Deaths
Deaths with Pneumonia, Influenza, or COVID-19 (ICD-10 codes U07.1 or J09-J18.9)
Number
Footnote
Suppressed counts (1-9)
Plain Text
Table 2. Data Dictionary for Deaths involving coronavirus disease 2019 (COVID-19) by race and Hispanic origin group and age, by state
Here is what the very simple SAS code could look like:
filename hdata temp;
Define url to access the data. This time the url is much less complicated importing all defined variables from the data dictionary:
proc http
url='https://data.cdc.gov/resource/ks3g-spdg.json'
method= "GET"
out=hdata;
run;
Assign a JSON library to the HTTP response:
libname hresp JSON fileref=hdata;
Copy datasets from JSON library to work:
proc datasets library=hresp;
copy out=work;
run;
quit;
Clear the response library reference:
libname hresp clear;
The resulting dataset is named ‘Root’ with 1000 observations (there is also a long version of data created named ‘AllData’). The only structural changes that may be needed are conversion of some variables from character to numeric).
Figure 5. Dataset root.sas7bdat
CONCLUSION
There are many sources of big public data that can be accessed using very simple base SAS® code through an API, requiring very little manipulation of the data obtained for use in analysis.
References
Henry, Joseph. 2020. “REST Just Got Easy with SAS ® and PROC HTTP”, Virtual: SAS Global Forum.
Available at https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/4426-2020.pdf.
de Walick, Laurent. 2020. “A Beginners Guide to Consuming RESTful Web Services in SAS ® ”, Virtual: SAS Global Forum.
Available at https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/4209-2020.pdf.
United States Department of Health & Human Services. 2021. “HealthData.gov” Accessed April 15, 2021. https://healthdata.gov
Marr, Bernard. Feb 12, 2016. “Big Data: 33 Brilliant And Free Data Sources Anyone Can Use” Accessed April 15, 2021. https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016
United States Census Bureau. 2021. “Available APIs” Accessed April 15, 2021. https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Laurie Smith
Cincinnati Children’s Hospital Medical Center
Laurie.Bishop@cchmc.org
ResearchGate
LinkedIn
FULL CODE:
/*United States Census Data*/
filename cens temp;
/*Define api with variables to incllude in resulting table*/
proc http
url='https://api.census.gov/data/2019/pep/charage?get=POP,AGE,AGE_DESC,DATE_CODE,DATE_DESC,HISP,RACE,SEX&for=us:*&key=05520f15ee12351332aedee15265113e5bd6e88c'
method= "GET"
out=cens;
run;
/* Assign a JSON library to the HTTP response */
libname resp JSON fileref=cens;
/*Copy datasets from JSON library to work*/
proc datasets library=resp;
copy out=work;
run;
quit;
/*Use AllData to help rename variables given variable names in the first row*/
proc sort nodupkey data=alldata out=alldataX; by P1; run;
proc sql noprint ;
select catx('=',P1,Value)
into :rename separated by ' '
from alldataX ;
quit;
/*Format variables that should be numeric and drop variable name observation*/
data excensus (drop=ordinal_root us iage ipop);
set root (firstobs=2
rename=(&rename pop=ipop age=iage));
pop=input(strip(trim(ipop)),best12.);
age=input(strip(trim(iage)),best12.);
run;
libname resp clear;
/*HealthData.gov*/
filename hdata temp;
/*Define url to access data*/
proc http
url='https://data.cdc.gov/resource/ks3g-spdg.json'
method= "GET"
out=hdata;
run;
/* Assign a JSON library to the HTTP response */
libname hresp JSON fileref=hdata;
/*Copy datasets from JSON library to work*/
proc datasets library=hresp;
copy out=work;
run;
quit;
libname hresp clear;
... View more