BookmarkSubscribeRSS Feed
rhilty
Obsidian | Level 7

I'm struggling with importing a CSV file to SAS. I think my code is mostly correct, but I'm getting a huge string of errors re: duplicate names. It seems like SAS is reading WAY more variables than I actually have in my datafile. I tried adding a keep statement to specify which columns are variables and which are just blank space, but I'm not sure if it's working. 

 

Here's my code:

/*Importing from dataset*/
libname DA "Z:\RI PK Eval RIDE-CT\Data\Raw Data\Double Entry Files\Crosschecks\Direct Assessments\";

proc import 
datafile="Z:\RI PK Eval RIDE-CT\Data\Raw Data\Double Entry Files\Crosschecks\Direct Assessments\DirectAssessmentsFall1.csv"
out = work.DirectAssessmentsFall1 
(keep = child_id	classroom_id_number	program_id_number	wave	date_conducted	
childs_assessment_language	rowpvt_lang_of_administration	rowpvt_ceiling_item_number	rowpvt_number_of_errors	
eowpvt_lang_of_administration	eowpvt_ceiling_item_number	eowpvt_number_of_errors	
htks_language_of_administration	htks_part_1_score	htks_part_2_score	
htks_part_2_not_administered	htks_part_3_score	htks_part_3_not_administered	
wjiv_checklist_convers_prof	wjiv_checklist_cooperation	wjiv_checklist_activity	wjiv_checklist_attention	
wjiv_checklist_self_confidence	wjiv_checklist_care	wjiv_checklist_difficult_tasks	wjiv_lw_number_correct	
wjiv_ap_number_correct	wmiii_lw_number_correct	wmiii_ap_number_correct)
dbms= csv 
REPLACE;
GETNAMES = YES;
DATAROW = 2;
run;

 

And here's a sample of the error log:

1723  proc import
1724  datafile="Z:\RI PK Eval RIDE-CT\Data\Raw Data\Double Entry Files\Crosschecks\Direct
1724! Assessments\DirectAssessmentsFall1.csv"
1725  out = work.DirectAssessmentsFall1
1726  dbms= csv
1727  REPLACE;
1728  GETNAMES = YES;
1729  DATAROW = 2;
1730  run;

Number of names found is greater than number of variables found.
Name wmiii_ap_number_correct
1002 is not a valid SAS name.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _0 is a duplicate.
Name _0 is a duplicate.
Name _ is a duplicate.
Name .
1003 is not a valid SAS name.
Name _09_01 is a duplicate.
Name _09 is a duplicate.
Name _ is a duplicate.
Name _0_16_2017 is a duplicate.
[etc.]

Problems were detected with provided names.  See LOG.
1731   /**********************************************************************
1732   *   PRODUCT:   SAS
1733   *   VERSION:   9.3
1734   *   CREATOR:   External File Interface
1735   *   DATE:      02OCT18
1736   *   DESC:      Generated SAS Datastep Code
1737   *   TEMPLATE SOURCE:  (None Specified.)
1738   ***********************************************************************/
1739      data WORK.DIRECTASSESSMENTSFALL1    ;
1740      %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
1741      infile 'Z:\RI PK Eval RIDE-CT\Data\Raw Data\Double Entry Files\Crosschecks\Direct
1741! Assessments\DirectAssessmentsFall1.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2
1741! ;
1742         informat child_id best32. ;
1743         informat classroom_id_number best32. ;
1744         informat program_id_number mmddyy10. ;
1745         informat wave best32. ;
1746         informat date_conducted best32. ;
1747         informat childs_assessment_language best32. ;
1748         informat rowpvt_lang_of_administration best32. ;
1749         informat rowpvt_ceiling_item_number best32. ;
1750         informat rowpvt_number_of_errors best32. ;
1751         informat eowpvt_lang_of_administration best32. ;
1752         informat eowpvt_ceiling_item_number best32. ;
1753         informat eowpvt_number_of_errors best32. ;
1754         informat htks_language_of_administration best32. ;
1755         informat htks_part_1_score best32. ;
1756         informat htks_part_2_score best32. ;
1757         informat htks_part_2_not_administered best32. ;
1758         informat htks_part_3_score best32. ;
1759         informat htks_part_3_not_administered best32. ;
1760         informat wjiv_checklist_convers_prof best32. ;
1761         informat wjiv_checklist_cooperation best32. ;
1762         informat wjiv_checklist_activity best32. ;
1763         informat wjiv_checklist_attention best32. ;
1764         informat wjiv_checklist_self_confidence best32. ;
1765         informat wjiv_checklist_care best32. ;
1766         informat wjiv_checklist_difficult_tasks best32. ;
1767         informat wjiv_lw_number_correct best32. ;
1768         informat wjiv_ap_number_correct $6. ;
1769         informat wmiii_lw_number_correct $7. ;
1770         informat VAR29 best32. ;
1771         informat _09_01 best32. ;
1772         informat _09 mmddyy10. ;
1773         informat _ best32. ;
1774         informat _0_16_2017 best32. ;
1775         informat VAR34 best32. ;
1776         informat VAR35 best32. ;
1777         informat _9 best32. ;
1778         informat _7 best32. ;
1779         informat VAR38 best32. ;
1780         informat _0 best32. ;
1781         informat _2 best32. ;
[etc.]

Thanks in advance for any help or suggestions!

 

4 REPLIES 4
Kurt_Bremser
Super User

When proc import can't make sense out of the first line and the file structure, it's better to write the data step yourself, according to the description you got with the csv file.

ballardw
Super User

I have dealt with this many times. There are some pretty poor programs or options that create output "data" with repeated columns like Total. Or from the number of _ you are getting single digit or entirely blank columns(or at least blanks on the header row). If the _ were all at the right of the data then those are easy to ignore, just remove from the Informat, format and input statements. If the values are also all blank then this may be "ghost" columns created when Excel is used. Deleted columns and rows when exported to CSV appear as Excel seems to be programmed such that once a row or column is used then it is always "used" and needs to be included when saved. Usually the _ appears when a character not acceptable as a SAS variable name appears. Variable names can't start with digits for instance. So If you have columns with headers like 1 2 3 4 etc. They are going to all be replaced as _ and get duplicates.

 

One solution is to edit the file before attempting to import it to have better column headers. If you do actually have 1 2 3 and such then prefix them with a character such as C1 C2 C3. If you blocks of related variables you might also have D1 D2 D3, E1 E2 E3 and so on.

I would guess that your _9 are coming from actual column headers like 19 29 39. Same fix, make them C19 or similar.

 

Another potential issue is attempting to create variables from column headings that exceed 32 characters are identical for the first 32 (or more) characters. That is a likely source of your VarXX variables.

Tom
Super User Tom
Super User

Don't use PROC IMPORT.  It has to guess at what is in the file. It has to guess at what to call the variables. If has to guess how to define the variables. It has to guess if you need to use a special informat to read the values from the text file (note most variables do NOT need a special informat attached).  It has to guess if a variable needs a format attached (note most variable do NOT need a format attached).   It has to guess if you want to attach a label to your variable.

 

Write your own data step to read the file.  In many cases that is easier than writing the PROC IMPORT step anyway. And you will definitely get a better dataset. And one that you know how the variables are defined.

 

Tom
Super User Tom
Super User

Seeing more column headers than columns could be related to a few different issues.

First is could just be that the last few columns are completely empty and whoever (whatever process) wrote the data did not bother to add the extra delimiters for the empty values are the end of the lines.

 

Second it could be that the header line includes some type of title information, such as a date when the file was produced, that looked to PROC IMPORT like more headers.

 

Third, and most likely and most troubling, is that the values of one or more headers included the delimiter, but the generator of the file forgot to include quotes around the values with embedded delimiters.  So the extra emedded delimiters make it look like there are more headers than there really were. If it only happened in the header line then just figure out what columns you have and assign your own names to them.  The troubling part would that this might have also happened to the data values themselves and not just to the column headers.  Then the lines of data will become impossible to parse into columns.

 

SAS INNOVATE 2024

Innovate_SAS_Blue.png

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Get the $99 certification deal.jpg

 

 

Back in the Classroom!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 3752 views
  • 4 likes
  • 4 in conversation