BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mccl4488
Fluorite | Level 6

Multiple yearly csv files are available at https://collegescorecard.ed.gov/data/, each with 1729 columns & 7,000ish rows each.  My goal is to reconstruct/stack the 10 CSV files along with a new field in each identifying the year into a single dataset.  Many of the standard solutions from Google searches seem to involve a PROC IMPORT and adjusting the resulting code, however, that seems unreasonable given the large number of columns to format.

 

One attempt using PROC IMPORT: imports each year then stacks the results in a data step....

 

PROC IMPORT OUT= WORK.MERGE_20XX
DATAFILE= "I:\Wes\SCORECARD\CollegeScorecard_Raw_Data\MERGED20XX_PP.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
GUESSINGROWS=40000;
RUN;

 

DATA ALL: set merge_2009 merge_2010 ...merge_20xx ; run; 

 

Problems/Issues: 

  1. Numeric columns contain text, i.e., "DataSuppressed" and "NULL". (So increasing GUESSINGROWS is not helpful in assigning numeric/character although it does help to actually read in the file since otherwise you get invalid data errors.) 

  2. Apparently, there are so many fields that SAS is not reading them all in and some of the later fields are labelled VAR1727, VAR1728, VAR1729, etc. along with the error message: "Number of names found is less than number of variables found."

 

Question: Is it possible to read in these many-columned datasets with their proper variable names and field types without manually adjusting each variable?  At present, I am using a large value for GUESSINGROWS (which results in almost all character fields), then stacking and manually adjusting each variable to numeric where appropriate, applying longer lengths to get the full variable name, etc., but this seems really painstaking and I would like to know how an experienced SAS programmer would tackle this problem.

 

Thank-you for any suggestions.

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

 1. Numeric columns contain text, i.e., "DataSuppressed" and "NULL". (So increasing GUESSINGROWS is not helpful in assigning numeric/character although it does help to actually read in the file since otherwise you get invalid data errors.) 

 

This is actually one of the reasons to create data step code. When you identify the offending variables you change in the informat for them to  a custom numeric informat that assigns missing to those values.

Example:

Proc format;
invalue BadVal (upcase just) /* upcase treats all of the values as upper cass for comparison and assignment in case you get NULL, Null, null etc*/
"DATASUPPRESSED" = .D
"NULL" = .N
_other_ = [best32.]
;
run;

 

and use BADVAL. as the informat for those variables in the data step.

 

 2. Apparently, there are so many fields that SAS is not reading them all in and some of the later fields are labelled VAR1727, VAR1728, VAR1729, etc. along with the error message: "Number of names found is less than number of variables found."

This is two issues. The VAR names come about because your column headers have duplicate values exceeding the length of SAS variable names ( in some cases by what appears to be close to 100 characters if I read that txt file correctly). In a data step you either create the variable names you want (once) and/or assign meaningful labels using the information from the txt file.

The "number of names" issue may mean that you do not have a column heading for some of your columns OR that your data contains Pipes somewhere in the body where you didn't expect it, and possibly both. Possible solutions include: adding column header(s) and if there are pipe characters in the data using another delimiter.

 

With long header text it often is not easy to see where one may miss. Copy the header row and then paste transposed into a different spreadsheet. Look for blank rows. That will tell about where a column heading may be missing.

View solution in original post

4 REPLIES 4
Reeza
Super User

PROC IMPORT and adjusting the resulting code 😄

 

Do you have a data dictionary that shows the variables types/formats?

mccl4488
Fluorite | Level 6

Yes, I've attached it (after resaving as .txt) and here's what it looks like in general as a .csv (no column for width):
Capture.JPG

 

 

Reeza
Super User
I would use that to generate a SAS data step input statement and then fix any fields that weren't importing correctly.
ballardw
Super User

 1. Numeric columns contain text, i.e., "DataSuppressed" and "NULL". (So increasing GUESSINGROWS is not helpful in assigning numeric/character although it does help to actually read in the file since otherwise you get invalid data errors.) 

 

This is actually one of the reasons to create data step code. When you identify the offending variables you change in the informat for them to  a custom numeric informat that assigns missing to those values.

Example:

Proc format;
invalue BadVal (upcase just) /* upcase treats all of the values as upper cass for comparison and assignment in case you get NULL, Null, null etc*/
"DATASUPPRESSED" = .D
"NULL" = .N
_other_ = [best32.]
;
run;

 

and use BADVAL. as the informat for those variables in the data step.

 

 2. Apparently, there are so many fields that SAS is not reading them all in and some of the later fields are labelled VAR1727, VAR1728, VAR1729, etc. along with the error message: "Number of names found is less than number of variables found."

This is two issues. The VAR names come about because your column headers have duplicate values exceeding the length of SAS variable names ( in some cases by what appears to be close to 100 characters if I read that txt file correctly). In a data step you either create the variable names you want (once) and/or assign meaningful labels using the information from the txt file.

The "number of names" issue may mean that you do not have a column heading for some of your columns OR that your data contains Pipes somewhere in the body where you didn't expect it, and possibly both. Possible solutions include: adding column header(s) and if there are pipe characters in the data using another delimiter.

 

With long header text it often is not easy to see where one may miss. Copy the header row and then paste transposed into a different spreadsheet. Look for blank rows. That will tell about where a column heading may be missing.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 1719 views
  • 0 likes
  • 3 in conversation