BookmarkSubscribeRSS Feed

Big Public Data

Started ‎04-15-2021 by
Modified ‎05-10-2021 by
Views 1,364
Paper 1169-2021
Authors 

Laurie Smith 

(Cincinnati Children's Hospital Medical Center)

Abstract

Big Data is everywhere and available to the public.  It is easy for us to access this data by use of Application Programming Interface (API).  From United States census data to FacebookTM and GoogleTM data, there is also much public data useful for studying Health Sciences.  In addition to social media and other sites, healthdata.gov allows access via API to city, state, and country wide CoVID-19, Hospital-Acquired Infections, and Influenza Vaccination Rates data, amongst many other datasets.  There are many health related datasets available to access through API.  Analysis on this type of data can be incorporated in many studies and can be accessed very easily using simple base SAS® code.

Watch the presentation

Watch Big Public Data as presented by the author on the SAS Users YouTube Channel.

 

INTRODUCTION

This paper came about due to my personal interest in learning how to use SAS® to access data from a database using API. Many processes are using this method to extract data. Many GoogleTM searches and SAS® papers later, I discovered I can access data available to the public via API using very simple Base SAS® code. I have previously written a SAS® papers that accessed data from a webpage by scraping the html code. The method investigated here seems to be a much easier process when available.


Below are a couple of examples of public data that can be accessed through API.

 

UNITED STATES CENSUS

The United States Census Bureau data, like many sites that offer access to data via API, do offer some assistance for a novice by providing examples of how to build the url needed for access.  Although it is open to the public, you must register as a developer and obtain an API key (http://api.census.gov/data/key_signup.html).  Here is a link to a list of datasets available https://api.census.gov/data.html.  The 2020 census data should be available soon, but currently there is data available from past years.

A dataset to access was selected randomly:  “Population Estimates: Estimates by Age, Sex, Race, and Hispanic Origin from 2019”.  Here’s what the data dictionary looks like, also available on the site:

 

Name

Label

Concept

Required

Attributes

Limit

Predicate Type

Group

AGE

Age

 

default displayed

 

0

int

N/A

AGE_DESC

Description of AGE values

 

not required

 

0

string

N/A

DATE_CODE

Estimate Date

 

default displayed

 

0

int

N/A

DATE_DESC

Description of DATE values

 

not required

 

0

string

N/A

for

Census API FIPS 'for' clause

Census API Geography Specification

predicate-only

 

0

fips-for

N/A

GEO_ID

Geography

 

not required

NAME

0

string

N/A

GEOCOMP

GEO_ID Component

 

default displayed

 

0

string

N/A

HISP

Hispanic Origin

 

default displayed

 

0

int

N/A

in

Census API FIPS 'in' clause

Census API Geography Specification

predicate-only

 

0

fips-in

N/A

LASTUPDATE

Last Update

 

not required

 

0

string

N/A

NATION

Geography

 

not required

 

0

(not a predicate)

N/A

POP

Population

 

not required

 

0

(not a predicate)

N/A

RACE

Race

 

default displayed

 

0

int

N/A

SEX

Sex

 

default displayed

 

0

int

N/A

STATE

Geography

 

not required

 

0

(not a predicate)

N/A

SUMLEVEL

Summary Level code

 

not required

 

0

string

N/A

ucgid

Uniform Census Geography Identifier clause

Census API Geography Specification

predicate-only

 

0

ucgid

N/A

UNIVERSE

Population Universe

 

not required

 

0

string

N/A

 

Table 1. Data Dictionary for Population Estimates: Estimates by Age, Sex, Race, and Hispanic Origin from 2019

 

Here is what the SAS code could look like:

filename cens temp;

 

Define url with variables to include in resulting table (this table is limited to the bolded variables, also where the key is used):

proc http 
  url='https://api.census.gov/data/2019/pep/charage?get=POP,AGE,AGE_DESC,DATE_CODE,DATE_DESC,HISP,RACE,SEX&for=us:*&key=05520f15ee12351332aedee15265113e5bd6e88c'
  method="GET"
  out=cens;
run;

 

Assign a JSON library to the HTTP response:

libname resp JSON fileref=cens;

 

Copy datasets from JSON library to work:

proc datasets library=resp;
	copy out=work;
run;
quit;

 

The resulting dataset is named ‘Root’ with 132192 observations (there is also a long version of data created named ‘AllData’).

 

Figure1.png

Figure 1. Dataset root.sas7bdat

 

Figure2.png

Figure 2. Dataset alldata.sas7bdat

 

The SAS® dataset ‘Root’ is not in the format expected but additional simple SAS® code can be used to manipulate the structure to give a more useful dataset.  The following code can be added and executed to transform the data into a more usable format:

 

Use AllData.sas7bdat to help rename variables that correspond to the variable names in the first row of Root:

proc sort nodupkey data=alldata out=alldataX; by P1; run;
proc sql noprint ;
  select catx('=',P1,Value) 
    into :rename separated by ' '
    from alldataX ;
quit;

 

Reformat variables that are character that should be numeric:

data excensus (drop=ordinal_root us iage ipop);
  set root (firstobs=2
		rename=(&rename pop=ipop age=iage));
pop=input(strip(trim(ipop)),best12.);
age=input(strip(trim(iage)),best12.);
run;

 

Clear the response library reference:

libname resp clear;

 

The resulting dataset has variables defined as expected in the data dictionary:

Figure3.png

Figure 3. Dataset root.sas7bdat

 

HEALTHDATA.GOV

The United States Department of Health & Human Services provides health data accessible through API.  Unlike the US Census Bureau data, no API Key is needed to access datasets.  Here is a link to a listing of the datasets available https://healthdata.gov/browse.   

 

Again, a dataset to access was selected randomly:  “Deaths involving coronavirus disease 2019 (COVID-19) by race and Hispanic origin group and age, by state”.  Here’s what the data dictionary looks like, also available on the site:

Column Name

Description

Type

Data as of

Date of analysis

Date & Time

Start Date

First date of data period

Date & Time

End Date

Last date of data period

Date & Time

State

Jurisdiction of occurrence

Plain Text

Age group

Age group

Plain Text

Race and Hispanic Origin Group

Race or Hispanic Origin Group

Plain Text

COVID-19 Deaths

Deaths involving COVID-19 (ICD-code U07.1)

Number

Total Deaths

Deaths from all causes of death

Number

Pneumonia Deaths

Pneumonia Deaths (ICD-10 codes J12.0-J18.9)

Number

Pneumonia and COVID-19 Deaths

Deaths with Pneumonia and COVID-19 (ICD-10 codes J12.0-J18.9 and U07.1)

Number

Influenza Deaths

Influenza Deaths (ICD-10 codes J09-J11)

Number

Pneumonia, Influenza, or COVID-19 Deaths

Deaths with Pneumonia, Influenza, or COVID-19 (ICD-10 codes U07.1 or J09-J18.9)

Number

Footnote

Suppressed counts (1-9)

Plain Text

 

Table 2. Data Dictionary for Deaths involving coronavirus disease 2019 (COVID-19) by race and Hispanic origin group and age, by state

 

Here is what the very simple SAS code could look like:

 

filename hdata temp;

 

Define url to access the data.  This time the url is much less complicated importing all defined variables from the data dictionary:

 

proc http 
  url='https://data.cdc.gov/resource/ks3g-spdg.json'
  method= "GET"
  out=hdata;
run;

 

Assign a JSON library to the HTTP response:

libname hresp JSON fileref=hdata;

 

Copy datasets from JSON library to work:

proc datasets library=hresp;
	copy out=work;
run;
quit;

 

Clear the response library reference:

libname hresp clear;

 

The resulting dataset is named ‘Root’ with 1000 observations (there is also a long version of data created named ‘AllData’).  The only structural changes that may be needed are conversion of some variables from character to numeric).

 

Figure4.pngFigure5.png

Figure 5. Dataset root.sas7bdat

 

CONCLUSION

There are many sources of big public data that can be accessed using very simple base SAS® code through an API, requiring very little manipulation of the data obtained for use in analysis.

 

References

Henry, Joseph. 2020. “REST Just Got Easy with SAS® and PROC HTTP”, Virtual: SAS Global Forum.

Available at https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/4426-2020.pdf.

 

de Walick, Laurent. 2020. “A Beginners Guide to Consuming RESTful Web Services in SAS®”, Virtual: SAS Global Forum.

Available at https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/4209-2020.pdf.

 

United States Department of Health & Human Services. 2021. “HealthData.gov” Accessed April 15, 2021.  https://healthdata.gov

 

Marr, Bernard. Feb 12, 2016. “Big Data: 33 Brilliant And Free Data Sources Anyone Can Use” Accessed April 15, 2021.  https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-...

 

United States Census Bureau. 2021. “Available APIs” Accessed April 15, 2021.  https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-...

 

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Laurie Smith

Cincinnati Children’s Hospital Medical Center

Laurie.Bishop@cchmc.org

ResearchGate

LinkedIn

 

FULL CODE:

/*United States Census Data*/
filename cens temp;
 
/*Define api with variables to incllude in resulting table*/
proc http 
  url='https://api.census.gov/data/2019/pep/charage?get=POP,AGE,AGE_DESC,DATE_CODE,DATE_DESC,HISP,RACE,SEX&for=us:*&key=05520f15ee12351332aedee15265113e5bd6e88c'
method= "GET"
 out=cens;
run;
 
/* Assign a JSON library to the HTTP response */
libname resp JSON fileref=cens;

/*Copy datasets from JSON library to work*/
proc datasets library=resp;
	copy out=work;
run;
quit;


/*Use AllData to help rename variables given variable names in the first row*/
proc sort nodupkey data=alldata out=alldataX; by P1; run;
proc sql noprint ;
  select catx('=',P1,Value) 
    into :rename separated by ' '
    from alldataX ;
quit;

/*Format variables that should be numeric and drop variable name observation*/
data excensus (drop=ordinal_root us iage ipop);
  set root (firstobs=2
				rename=(&rename pop=ipop age=iage));
pop=input(strip(trim(ipop)),best12.);
age=input(strip(trim(iage)),best12.);
run;

libname resp clear;

/*HealthData.gov*/
filename hdata temp;
 
/*Define url to access data*/
proc http 
  url='https://data.cdc.gov/resource/ks3g-spdg.json'
method= "GET"
 out=hdata;
run;
 
/* Assign a JSON library to the HTTP response */
libname hresp JSON fileref=hdata;

/*Copy datasets from JSON library to work*/
proc datasets library=hresp;
	copy out=work;
run;
quit;

libname hresp clear;
Version history
Last update:
‎05-10-2021 05:28 PM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Article Labels
Article Tags