Solved: COLLEGE SCORECARD CSV data import

mccl4488 · Posted 10-06-2015 06:10 PM

Multiple yearly csv files are available at https://collegescorecard.ed.gov/data/, each with 1729 columns & 7,000ish rows each. My goal is to reconstruct/stack the 10 CSV files along with a new field in each identifying the year into a single dataset. Many of the standard solutions from Google searches seem to involve a PROC IMPORT and adjusting the resulting code, however, that seems unreasonable given the large number of columns to format.

One attempt using PROC IMPORT: imports each year then stacks the results in a data step....

PROC IMPORT OUT= WORK.MERGE_20XX
DATAFILE= "I:\Wes\SCORECARD\CollegeScorecard_Raw_Data\MERGED20XX_PP.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
GUESSINGROWS=40000;
RUN;

DATA ALL: set merge_2009 merge_2010 ...merge_20xx ; run;

Problems/Issues:

1. Numeric columns contain text, i.e., "DataSuppressed" and "NULL". (So increasing GUESSINGROWS is not helpful in assigning numeric/character although it does help to actually read in the file since otherwise you get invalid data errors.)

2. Apparently, there are so many fields that SAS is not reading them all in and some of the later fields are labelled VAR1727, VAR1728, VAR1729, etc. along with the error message: "Number of names found is less than number of variables found."

Question: Is it possible to read in these many-columned datasets with their proper variable names and field types without manually adjusting each variable? At present, I am using a large value for GUESSINGROWS (which results in almost all character fields), then stacking and manually adjusting each variable to numeric where appropriate, applying longer lengths to get the full variable name, etc., but this seems really painstaking and I would like to know how an experienced SAS programmer would tackle this problem.

Thank-you for any suggestions.

ballardw · Posted 10-07-2015 11:35 AM

1. Numeric columns contain text, i.e., "DataSuppressed" and "NULL". (So increasing GUESSINGROWS is not helpful in assigning numeric/character although it does help to actually read in the file since otherwise you get invalid data errors.)

This is actually one of the reasons to create data step code. When you identify the offending variables you change in the informat for them to a custom numeric informat that assigns missing to those values.

Example:

Proc format;
invalue BadVal (upcase just) /* upcase treats all of the values as upper cass for comparison and assignment in case you get NULL, Null, null etc*/
"DATASUPPRESSED" = .D
"NULL" = .N
_other_ = [best32.]
;
run;

and use BADVAL. as the informat for those variables in the data step.

2. Apparently, there are so many fields that SAS is not reading them all in and some of the later fields are labelled VAR1727, VAR1728, VAR1729, etc. along with the error message: "Number of names found is less than number of variables found."

This is two issues. The VAR names come about because your column headers have duplicate values exceeding the length of SAS variable names ( in some cases by what appears to be close to 100 characters if I read that txt file correctly). In a data step you either create the variable names you want (once) and/or assign meaningful labels using the information from the txt file.

The "number of names" issue may mean that you do not have a column heading for some of your columns OR that your data contains Pipes somewhere in the body where you didn't expect it, and possibly both. Possible solutions include: adding column header(s) and if there are pipe characters in the data using another delimiter.

With long header text it often is not easy to see where one may miss. Copy the header row and then paste transposed into a different spreadsheet. Look for blank rows. That will tell about where a column heading may be missing.

View solution in original post

Reeza · Posted 10-06-2015 06:15 PM

PROC IMPORT and adjusting the resulting code 😄

Do you have a data dictionary that shows the variables types/formats?

mccl4488 · Posted 10-06-2015 06:44 PM

Yes, I've attached it (after resaving as .txt) and here's what it looks like in general as a .csv (no column for width):

Reeza · Posted 10-07-2015 04:40 PM

I would use that to generate a SAS data step input statement and then fix any fields that weren't importing correctly.

ballardw · Posted 10-07-2015 11:35 AM

1. Numeric columns contain text, i.e., "DataSuppressed" and "NULL". (So increasing GUESSINGROWS is not helpful in assigning numeric/character although it does help to actually read in the file since otherwise you get invalid data errors.)

This is actually one of the reasons to create data step code. When you identify the offending variables you change in the informat for them to a custom numeric informat that assigns missing to those values.

Example:

Proc format;
invalue BadVal (upcase just) /* upcase treats all of the values as upper cass for comparison and assignment in case you get NULL, Null, null etc*/
"DATASUPPRESSED" = .D
"NULL" = .N
_other_ = [best32.]
;
run;

and use BADVAL. as the informat for those variables in the data step.

2. Apparently, there are so many fields that SAS is not reading them all in and some of the later fields are labelled VAR1727, VAR1728, VAR1729, etc. along with the error message: "Number of names found is less than number of variables found."

This is two issues. The VAR names come about because your column headers have duplicate values exceeding the length of SAS variable names ( in some cases by what appears to be close to 100 characters if I read that txt file correctly). In a data step you either create the variable names you want (once) and/or assign meaningful labels using the information from the txt file.

The "number of names" issue may mean that you do not have a column heading for some of your columns OR that your data contains Pipes somewhere in the body where you didn't expect it, and possibly both. Possible solutions include: adding column header(s) and if there are pipe characters in the data using another delimiter.

With long header text it often is not easy to see where one may miss. Copy the header row and then paste transposed into a different spreadsheet. Look for blank rows. That will tell about where a column heading may be missing.

COLLEGE SCORECARD CSV data import

Re: COLLEGE SCORECARD CSV data import

Re: COLLEGE SCORECARD CSV data import

Re: COLLEGE SCORECARD CSV data import

Re: COLLEGE SCORECARD CSV data import

Re: COLLEGE SCORECARD CSV data import

COLLEGE SCORECARD CSV data import

Re: COLLEGE SCORECARD CSV data import

Re: COLLEGE SCORECARD CSV data import

Re: COLLEGE SCORECARD CSV data import

Re: COLLEGE SCORECARD CSV data import

Re: COLLEGE SCORECARD CSV data import

Click image to register for webinar

Classroom Training Available!