- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Good afternoon, All:
I have two different CSV files, both identical from what I can tell in terms of data structure save out of excel, but one is reporting correctly and the other is not. Long story short there was an original CSV file, I reran the process that produces the file once some new values were added (new values that aren't a part of this particular issue from what I can tell), and now the new data set only brings up the first value on certain fields once it imports into SAS. I assume there is an issue with my CSV, but I have yet to be able to identify what it might be. I just wanted to see if anyone had a solution to the problem. Below is one example of the issue.
Original import:
proc import datafile="\\this_is_the_location_of_my_data\original_file.csv"
out=basefile_original
dbms=csv
replace;
run;
Original log:
/**********************************************************************
* PRODUCT: SAS
* VERSION: 9.4
* CREATOR: External File Interface
* DATE: 09FEB18
* DESC: Generated SAS Datastep Code
* TEMPLATE SOURCE: (None Specified.)
***********************************************************************/
data WORK.BASEFILE_ORIGINAL ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile '\\this_is_where_my_data_is\original.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat EMPLID best32. ;
informat APPL_NBR best32. ;
informat LAST_NAME $17. ;
informat FIRST_NAME $8. ;
informat MIDDLE_INIT $1. ;
informat SEX $1. ;
informat ETHNIC_CD $1. ;
informat ACAD_CAREER $4. ;
informat ACAD_PROG $5. ;
informat ADMIT_TERM best32. ;
informat ADMIT_TYPE $3. ;
informat ADMITTED $1. ;
informat PROG_ACTION $4. ;
informat PROG_REASON $4. ;
informat PROG_STATUS $2. ;
informat CENSUS_DT mmddyy10. ;
informat MAJOR $9. ;
informat ACADEM_LEV best32. ;
informat FICE best32. ;
informat HS_CODE best32. ;
informat HS_GPA best32. ;
informat CLASS_RANK best32. ;
informat CLASS_SIZE best32. ;
informat ACT_COMP best32. ;
informat ACT_READ best32. ;
informat ACT_MATH best32. ;
informat ACT_WRIT best32. ;
informat SAT_VERB best32. ;
informat SAT_ESSAY $1. ;
informat SAT_MATH best32. ;
informat SAT_WRITE best32. ;
informat SAT_EBRWS best32. ;
informat SAT_MATHS best32. ;
informat SAT_TOT02 best32. ;
informat GRE_VERB $1. ;
informat GRE_QUAN $1. ;
informat GRE_ANLY best32. ;
informat GRE_VERBR best32. ;
informat GRE_QUANR best32. ;
informat GMA_QUAN $1. ;
informat GMA_TOTAL $1. ;
informat GMA_ANLY $1. ;
informat UTT_AMIND best32. ;
informat UTT_ASIAN best32. ;
informat UTT_BLACK best32. ;
informat UTT_HAWAIIAN best32. ;
informat UTT_HISPA best32. ;
informat UTT_NONRESAL best32. ;
informat UTT_NSPEC best32. ;
informat UTT_WHITE best32. ;
format EMPLID best12. ;
format APPL_NBR best12. ;
format LAST_NAME $17. ;
format FIRST_NAME $8. ;
format MIDDLE_INIT $1. ;
format SEX $1. ;
format ETHNIC_CD $1. ;
format ACAD_CAREER $4. ;
format ACAD_PROG $5. ;
format ADMIT_TERM best12. ;
format ADMIT_TYPE $3. ;
format ADMITTED $1. ;
format PROG_ACTION $4. ;
format PROG_REASON $4. ;
format PROG_STATUS $2. ;
format CENSUS_DT mmddyy10. ;
format MAJOR $9. ;
format ACADEM_LEV best12. ;
format FICE best12. ;
format HS_CODE best12. ;
format HS_GPA best12. ;
format CLASS_RANK best12. ;
format CLASS_SIZE best12. ;
format ACT_COMP best12. ;
format ACT_READ best12. ;
format ACT_MATH best12. ;
format ACT_WRIT best12. ;
format SAT_VERB best12. ;
format SAT_ESSAY $1. ;
format SAT_MATH best12. ;
format SAT_WRITE best12. ;
format SAT_EBRWS best12. ;
format SAT_MATHS best12. ;
format SAT_TOT02 best12. ;
format GRE_VERB $1. ;
format GRE_QUAN $1. ;
format GRE_ANLY best12. ;
format GRE_VERBR best12. ;
format GRE_QUANR best12. ;
format GMA_QUAN $1. ;
format GMA_TOTAL $1. ;
format GMA_ANLY $1. ;
format UTT_AMIND best12. ;
format UTT_ASIAN best12. ;
format UTT_BLACK best12. ;
format UTT_HAWAIIAN best12. ;
format UTT_HISPA best12. ;
format UTT_NONRESAL best12. ;
format UTT_NSPEC best12. ;
format UTT_WHITE best12. ;
input
EMPLID
APPL_NBR
LAST_NAME $
FIRST_NAME $
MIDDLE_INIT $
SEX $
ETHNIC_CD $
ACAD_CAREER $
ACAD_PROG $
ADMIT_TERM
ADMIT_TYPE $
ADMITTED $
PROG_ACTION $
PROG_REASON $
PROG_STATUS $
CENSUS_DT
MAJOR $
ACADEM_LEV
FICE
HS_CODE
HS_GPA
CLASS_RANK
CLASS_SIZE
ACT_COMP
ACT_READ
ACT_MATH
ACT_WRIT
SAT_VERB
SAT_ESSAY $
SAT_MATH
SAT_WRITE
SAT_EBRWS
SAT_MATHS
SAT_TOT02
GRE_VERB $
GRE_QUAN $
GRE_ANLY
GRE_VERBR
GRE_QUANR
GMA_QUAN $
GMA_TOTAL $
GMA_ANLY $
UTT_AMIND
UTT_ASIAN
UTT_BLACK
UTT_HAWAIIAN
UTT_HISPA
UTT_NONRESAL
UTT_NSPEC
UTT_WHITE
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
Updated import:
proc import datafile="\\this_is_the_location_of_my_data\updated_file.csv"
out=basefile
dbms=csv
replace;
run;
Updated log:
/**********************************************************************
* PRODUCT: SAS
* VERSION: 9.4
* CREATOR: External File Interface
* DATE: 09FEB18
* DESC: Generated SAS Datastep Code
* TEMPLATE SOURCE: (None Specified.)
***********************************************************************/
data WORK.BASEFILE ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile '\\this_is_where_my_data_is\updated_file.csv' delimiter = ',' MISSOVER DSD lrecl=32767
firstobs=2 ;
informat EMPLID best32. ;
informat APPL_NBR best32. ;
informat LAST_NAME $16. ;
informat FIRST_NAME $9. ;
informat MIDDLE_INIT $1. ;
informat SEX $1. ;
informat ETHNIC_CD $1. ;
informat ACAD_CAREER $4. ;
informat ACAD_PROG $5. ;
informat ADMIT_TERM best32. ;
informat ADMIT_TYPE $3. ;
informat ADMITTED $1. ;
informat PROG_ACTION $4. ;
informat PROG_REASON $4. ;
informat PROG_STATUS $2. ;
informat CENSUS_DT mmddyy10. ;
informat MAJOR $10. ;
informat ACADEM_LEV best32. ;
informat FICE best32. ;
informat HS_CODE $1. ;
informat HS_GPA $1. ;
informat CLASS_RANK $1. ;
informat CLASS_SIZE $1. ;
informat ACT_COMP best32. ;
informat ACT_READ best32. ;
informat ACT_MATH best32. ;
informat ACT_WRIT $1. ;
informat SAT_VERB best32. ;
informat SAT_ESSAY best32. ;
informat SAT_MATH best32. ;
informat SAT_WRITE best32. ;
informat SAT_EBRWS $1. ;
informat SAT_MATHS $1. ;
informat SAT_TOT02 $1. ;
informat GRE_VERB $1. ;
informat GRE_QUAN $1. ;
informat GRE_ANLY best32. ;
informat GRE_VERBR best32. ;
informat GRE_QUANR best32. ;
informat GMA_QUAN $1. ;
informat GMA_TOTAL $1. ;
informat GMA_ANLY $1. ;
informat UTT_AMIND best32. ;
informat UTT_ASIAN best32. ;
informat UTT_BLACK best32. ;
informat UTT_HAWAIIAN best32. ;
informat UTT_HISPA best32. ;
informat UTT_NONRESAL best32. ;
informat UTT_NSPEC best32. ;
informat UTT_WHITE best32. ;
format EMPLID best12. ;
format APPL_NBR best12. ;
format LAST_NAME $16. ;
format FIRST_NAME $9. ;
format MIDDLE_INIT $1. ;
format SEX $1. ;
format ETHNIC_CD $1. ;
format ACAD_CAREER $4. ;
format ACAD_PROG $5. ;
format ADMIT_TERM best12. ;
format ADMIT_TYPE $3. ;
format ADMITTED $1. ;
format PROG_ACTION $4. ;
format PROG_REASON $4. ;
format PROG_STATUS $2. ;
format CENSUS_DT mmddyy10. ;
format MAJOR $10. ;
format ACADEM_LEV best12. ;
format FICE best12. ;
format HS_CODE $1. ;
format HS_GPA $1. ;
format CLASS_RANK $1. ;
format CLASS_SIZE $1. ;
format ACT_COMP best12. ;
format ACT_READ best12. ;
format ACT_MATH best12. ;
format ACT_WRIT $1. ;
format SAT_VERB best12. ;
format SAT_ESSAY best12. ;
format SAT_MATH best12. ;
format SAT_WRITE best12. ;
format SAT_EBRWS $1. ;
format SAT_MATHS $1. ;
format SAT_TOT02 $1. ;
format GRE_VERB $1. ;
format GRE_QUAN $1. ;
format GRE_ANLY best12. ;
format GRE_VERBR best12. ;
format GRE_QUANR best12. ;
format GMA_QUAN $1. ;
format GMA_TOTAL $1. ;
format GMA_ANLY $1. ;
format UTT_AMIND best12. ;
format UTT_ASIAN best12. ;
format UTT_BLACK best12. ;
format UTT_HAWAIIAN best12. ;
format UTT_HISPA best12. ;
format UTT_NONRESAL best12. ;
format UTT_NSPEC best12. ;
format UTT_WHITE best12. ;
input
EMPLID
APPL_NBR
LAST_NAME $
FIRST_NAME $
MIDDLE_INIT $
SEX $
ETHNIC_CD $
ACAD_CAREER $
ACAD_PROG $
ADMIT_TERM
ADMIT_TYPE $
ADMITTED $
PROG_ACTION $
PROG_REASON $
PROG_STATUS $
CENSUS_DT
MAJOR $
ACADEM_LEV
FICE
HS_CODE $
HS_GPA $
CLASS_RANK $
CLASS_SIZE $
ACT_COMP
ACT_READ
ACT_MATH
ACT_WRIT $
SAT_VERB
SAT_ESSAY
SAT_MATH
SAT_WRITE
SAT_EBRWS $
SAT_MATHS $
SAT_TOT02 $
GRE_VERB $
GRE_QUAN $
GRE_ANLY
GRE_VERBR
GRE_QUANR
GMA_QUAN $
GMA_TOTAL $
GMA_ANLY $
UTT_AMIND
UTT_ASIAN
UTT_BLACK
UTT_HAWAIIAN
UTT_HISPA
UTT_NONRESAL
UTT_NSPEC
UTT_WHITE
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
When looking at one result from the original import I get this:
SAT_MATH | SAT_WRITE | SAT_EBRWS | SAT_MATHS | SAT_TOT02 |
550 | 490 | 650 | 590 | 1240 |
All numbers are as desired and match the CSV that I imported. All three fields are informat BEST32., format BEST12.
When looking at the same result in the updated imported data set, I get this:
SAT_MATH | SAT_WRITE | SAT_EBRWS | SAT_MATHS | SAT_TOT02 |
550 | 490 | 6 | 5 | 1 |
As you can see it is only producing the first value of the field for the three columns to the right. The two fields to the left are coming in the same informat and being put into the same format as the original file, but the three fields to the right have an informat of $1. This is where the problem lies, but the updated CSV itself is no different from the original from what I can tell (all fields are in 'General' format when viewed in Excel).
Thank you in advance for your help.
SAS 9.4 TS1M1
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I wasnt't asking for it, I was saying you can solve your problem with it.
But anyways, look at the format/informat/type of SAT_EBRWS in each set of code. You'll see they're different.
PROC IMPORT is a guessing procedure. It can and regularly does guess wrong. In these cases it's best to specify your data types yourself using a data step instead. Rather than writing your own data step from scratch though, you can take one of the imports above and fix it and then use that code on both files, just changing the file reference.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
When you run PROC IMPORT the data step code is generated behind the scenes in the log. Use that code instead and modify as needed.
You can select the code without line numbers by holding down ALT+SHIFT while selecting the code from the log.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
By request I have added the info from the log. Not sure if I added it to the post in the correct way or not, but that's the information the log put out.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I wasnt't asking for it, I was saying you can solve your problem with it.
But anyways, look at the format/informat/type of SAT_EBRWS in each set of code. You'll see they're different.
PROC IMPORT is a guessing procedure. It can and regularly does guess wrong. In these cases it's best to specify your data types yourself using a data step instead. Rather than writing your own data step from scratch though, you can take one of the imports above and fix it and then use that code on both files, just changing the file reference.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Okay. I see what you are saying now. I will do exactly that - specify the formats for the fields as they come in. While I have you though - do you know why PROC IMPORT would guess one way for a particular file and then a completely different way when the structure of the file itself is the same? That just seems rather strange to me since I haven't seen it do that prior to this instance.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
There's an option GUESSINGROWS that sets how many rows are scanned. If it happens that this particular file only has 1 value for say the first 1000 records then SAS will guess incorrectly. I did forget that you can set GUESSINGROWS=MAX in your PROC IMPORT to force it to scan the whole file before importing. This usually fixes most problems but it doesn't guarantee that your fields will be read correctly as numeric/character. For example IDs are often numeric but are really character type variables.
If you're doing this consistently for a file, you should use a data step. This helps in the long run because you can then know that all your variables are the same format/type which helps if you need to combine the data together. If they're different you can run into errors or 'unexpected behaviour'.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
CSV files do not contain any metadata about what the fields contain. The only metadata they have is possible variable names if the file contains a header row. So it must GUESS what data is in the file. It looks like for your problem file the values for the three new variables were probably empty on the first few lines of the file that PROC IMPORT used to make its guess. It can make a better guess if you add guessingrows=max statement to the proc.
But the real solution is just write your own data step to read the file. It is not hard, especially if you don't waste time adding informats and formats to variables that don't need them. You can even force your identifier variables to be read as strings instead of numbers. This is critical if they really are strings, but also useful since then you cannot accidentally generate meaningless statistics on them.
data WORK.BASEFILE_ORIGINAL ;
infile '\\this_is_where_my_data_is\original.csv' dsd firstobs=2 truncover;
length
EMPLID $12
APPL_NBR $12
LAST_NAME $30
FIRST_NAME $30
MIDDLE_INIT $1
SEX $1
ETHNIC_CD $1
ACAD_CAREER $4
ACAD_PROG $5
ADMIT_TERM 8
ADMIT_TYPE $3
ADMITTED $1
PROG_ACTION $4
PROG_REASON $4
PROG_STATUS $2
CENSUS_DT 8
MAJOR $20
ACADEM_LEV 8
FICE 8
HS_CODE 8
HS_GPA 8
CLASS_RANK 8
CLASS_SIZE 8
ACT_COMP 8
ACT_READ 8
ACT_MATH 8
ACT_WRIT 8
SAT_VERB 8
SAT_ESSAY $1
SAT_MATH 8
SAT_WRITE 8
SAT_EBRWS 8
SAT_MATHS 8
SAT_TOT02 8
GRE_VERB $1
GRE_QUAN $1
GRE_ANLY 8
GRE_VERBR 8
GRE_QUANR 8
GMA_QUAN $1
GMA_TOTAL $1
GMA_ANLY $1
UTT_AMIND 8
UTT_ASIAN 8
UTT_BLACK 8
UTT_HAWAIIAN 8
UTT_HISPA 8
UTT_NONRESAL 8
UTT_NSPEC 8
UTT_WHITE 8
;
informat CENSUS_DT mmddyy10. ;
format CENSUS_DT yymmdd10. ;
input EMPLID -- UTT_WHITE ;
run;