I'm struggling with importing a CSV file to SAS. I think my code is mostly correct, but I'm getting a huge string of errors re: duplicate names. It seems like SAS is reading WAY more variables than I actually have in my datafile. I tried adding a keep statement to specify which columns are variables and which are just blank space, but I'm not sure if it's working.
Here's my code:
/*Importing from dataset*/
libname DA "Z:\RI PK Eval RIDE-CT\Data\Raw Data\Double Entry Files\Crosschecks\Direct Assessments\";

proc import datafile="Z:\RI PK Eval RIDE-CT\Data\Raw Data\Double Entry Files\Crosschecks\Direct Assessments\DirectAssessmentsFall1.csv"
        out = work.DirectAssessmentsFall1
              (keep = child_id classroom_id_number program_id_number wave
                      date_conducted childs_assessment_language
                      rowpvt_lang_of_administration rowpvt_ceiling_item_number
                      rowpvt_number_of_errors eowpvt_lang_of_administration
                      eowpvt_ceiling_item_number eowpvt_number_of_errors
                      htks_language_of_administration htks_part_1_score
                      htks_part_2_score htks_part_2_not_administered
                      htks_part_3_score htks_part_3_not_administered
                      wjiv_checklist_convers_prof wjiv_checklist_cooperation
                      wjiv_checklist_activity wjiv_checklist_attention
                      wjiv_checklist_self_confidence wjiv_checklist_care
                      wjiv_checklist_difficult_tasks wjiv_lw_number_correct
                      wjiv_ap_number_correct wmiii_lw_number_correct
                      wmiii_ap_number_correct)
        dbms = csv REPLACE;
    GETNAMES = YES;
    DATAROW = 2;
run;
And here's a sample of the error log:
1723  proc import
1724       datafile="Z:\RI PK Eval RIDE-CT\Data\Raw Data\Double Entry Files\Crosschecks\Direct
1724!  Assessments\DirectAssessmentsFall1.csv"
1725       out = work.DirectAssessmentsFall1
1726       dbms= csv
1727       REPLACE;
1728  GETNAMES = YES;
1729  DATAROW = 2;
1730  run;

Number of names found is greater than number of variables found.
Name wmiii_ap_number_correct 1002 is not a valid SAS name.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _ is a duplicate.
Name _0 is a duplicate.
Name _0 is a duplicate.
Name _ is a duplicate.
Name . 1003 is not a valid SAS name.
Name _09_01 is a duplicate.
Name _09 is a duplicate.
Name _ is a duplicate.
Name _0_16_2017 is a duplicate.
[etc.]
Problems were detected with provided names.  See LOG.

1731 /**********************************************************************
1732  *   PRODUCT:   SAS
1733  *   VERSION:   9.3
1734  *   CREATOR:   External File Interface
1735  *   DATE:      02OCT18
1736  *   DESC:      Generated SAS Datastep Code
1737  *   TEMPLATE SOURCE:  (None Specified.)
1738  ***********************************************************************/
1739  data WORK.DIRECTASSESSMENTSFALL1 ;
1740  %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
1741  infile 'Z:\RI PK Eval RIDE-CT\Data\Raw Data\Double Entry Files\Crosschecks\Direct
1741! Assessments\DirectAssessmentsFall1.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
1742     informat child_id best32. ;
1743     informat classroom_id_number best32. ;
1744     informat program_id_number mmddyy10. ;
1745     informat wave best32. ;
1746     informat date_conducted best32. ;
1747     informat childs_assessment_language best32. ;
1748     informat rowpvt_lang_of_administration best32. ;
1749     informat rowpvt_ceiling_item_number best32. ;
1750     informat rowpvt_number_of_errors best32. ;
1751     informat eowpvt_lang_of_administration best32. ;
1752     informat eowpvt_ceiling_item_number best32. ;
1753     informat eowpvt_number_of_errors best32. ;
1754     informat htks_language_of_administration best32. ;
1755     informat htks_part_1_score best32. ;
1756     informat htks_part_2_score best32. ;
1757     informat htks_part_2_not_administered best32. ;
1758     informat htks_part_3_score best32. ;
1759     informat htks_part_3_not_administered best32. ;
1760     informat wjiv_checklist_convers_prof best32. ;
1761     informat wjiv_checklist_cooperation best32. ;
1762     informat wjiv_checklist_activity best32. ;
1763     informat wjiv_checklist_attention best32. ;
1764     informat wjiv_checklist_self_confidence best32. ;
1765     informat wjiv_checklist_care best32. ;
1766     informat wjiv_checklist_difficult_tasks best32. ;
1767     informat wjiv_lw_number_correct best32. ;
1768     informat wjiv_ap_number_correct $6. ;
1769     informat wmiii_lw_number_correct $7. ;
1770     informat VAR29 best32. ;
1771     informat _09_01 best32. ;
1772     informat _09 mmddyy10. ;
1773     informat _ best32. ;
1774     informat _0_16_2017 best32. ;
1775     informat VAR34 best32. ;
1776     informat VAR35 best32. ;
1777     informat _9 best32. ;
1778     informat _7 best32. ;
1779     informat VAR38 best32. ;
1780     informat _0 best32. ;
1781     informat _2 best32. ;
[etc.]
Thanks in advance for any help or suggestions!
When PROC IMPORT can't make sense of the first line and the file structure, it's better to write the DATA step yourself, based on the description you got with the CSV file.
I have dealt with this many times. Some pretty poor programs or export options create output "data" with repeated columns like Total. Judging from the number of _ names you are getting, you have single-digit or entirely blank columns (or at least blank cells in the header row). If the _ columns are all at the right of the data, they are easy to ignore: just remove them from the INFORMAT, FORMAT, and INPUT statements. If their values are also all blank, these may be "ghost" columns created when Excel is involved: deleted columns and rows still show up when the sheet is exported to CSV, because Excel seems to treat a row or column as "used" forever once it has held anything, and includes it when saving. Usually _ appears when a header contains a character that is not acceptable in a SAS variable name. Variable names can't start with digits, for instance, so if you have columns with headers like 1, 2, 3, 4, etc., they will all be replaced with _ and become duplicates.
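Since PROC IMPORT already wrote a complete DATA step to your log, one way to apply this is to copy that generated step out of the log and delete the ghost columns. A trimmed sketch (your log has the full list of 28 real informats; delete every informat/format/input entry for _, _09_01, VARxx, and the like):

```
/* Sketch: generated DATA step copied from the log, with the        */
/* informat/format/input entries for the ghost columns deleted.     */
data work.DirectAssessmentsFall1;
    infile "Z:\RI PK Eval RIDE-CT\Data\Raw Data\Double Entry Files\Crosschecks\Direct Assessments\DirectAssessmentsFall1.csv"
        delimiter=',' missover dsd lrecl=32767 firstobs=2;
    informat child_id best32. ;
    informat classroom_id_number best32. ;
    /* ...keep the informats for the remaining real columns only... */
    input
        child_id
        classroom_id_number
        /* ...the remaining real columns, with nothing after them... */
    ;
run;
```

Because INPUT stops after the last listed variable, anything to the right of your real columns is simply never read.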
One solution is to edit the file before attempting to import it so that it has better column headers. If you actually do have headers like 1, 2, 3, prefix them with a character, such as C1, C2, C3. If you have blocks of related variables, you might also use D1, D2, D3, then E1, E2, E3, and so on.
I would guess that your _9 names are coming from actual column headers like 19, 29, 39. Same fix: make them C19 or similar.
Another potential issue is attempting to create variables from column headings that exceed 32 characters and are identical for the first 32 (or more) characters. That is a likely source of your VARxx variables.
Don't use PROC IMPORT. It has to guess at what is in the file. It has to guess at what to call the variables. It has to guess how to define the variables. It has to guess whether you need a special informat to read the values from the text file (note that most variables do NOT need a special informat attached). It has to guess whether a variable needs a format attached (most variables do NOT). It has to guess whether you want a label attached to your variable.
Write your own data step to read the file. In many cases that is easier than writing the PROC IMPORT step anyway. And you will definitely get a better dataset. And one that you know how the variables are defined.
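A minimal hand-written step for this file might look like the following. This is a sketch only: I'm assuming date_conducted is the date column and that the other columns are plain numerics; check the file description you were given and adjust the informats and the variable list to match your actual 28 columns in file order.

```
data work.DirectAssessmentsFall1;
    infile "Z:\RI PK Eval RIDE-CT\Data\Raw Data\Double Entry Files\Crosschecks\Direct Assessments\DirectAssessmentsFall1.csv"
        dsd truncover firstobs=2;       /* dsd handles quoted commas; skip header */
    input child_id
          classroom_id_number
          program_id_number
          wave
          date_conducted :mmddyy10.     /* assumed date layout - verify */
          childs_assessment_language
          /* ...list the rest of your 28 real columns in file order... */
          wmiii_ap_number_correct
    ;
    format date_conducted mmddyy10.;    /* display the date readably */
run;
```

TRUNCOVER is generally safer than MISSOVER for lines that end early, and listing only your real columns means any ghost columns to the right are ignored.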
Seeing more column headers than columns could be related to a few different issues.
First, it could just be that the last few columns are completely empty and whoever (or whatever process) wrote the data did not bother to add the extra delimiters for the empty values at the end of the lines.
Second, it could be that the header line includes some type of title information, such as the date when the file was produced, that looked to PROC IMPORT like more headers.
Third, and most likely and most troubling, is that the values of one or more headers included the delimiter, but the generator of the file forgot to put quotes around the values with embedded delimiters. The extra embedded delimiters make it look like there are more headers than there really are. If it only happened in the header line, then just figure out what columns you have and assign your own names to them. The troubling part would be that this might also have happened to the data values themselves, not just the column headers. Then the lines of data become impossible to parse into columns.
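If the damage really is confined to the header line, one way to sidestep it (a sketch) is to skip the headers entirely with GETNAMES=NO, let PROC IMPORT assign the default names VAR1, VAR2, ..., and then rename them yourself:

```
proc import datafile="Z:\RI PK Eval RIDE-CT\Data\Raw Data\Double Entry Files\Crosschecks\Direct Assessments\DirectAssessmentsFall1.csv"
        out=work.raw dbms=csv replace;
    getnames=no;    /* ignore the malformed header line        */
    datarow=2;      /* data still starts on the second line    */
run;

/* assign your own names to the columns you actually have */
data work.DirectAssessmentsFall1;
    set work.raw;
    rename VAR1=child_id
           VAR2=classroom_id_number
           /* ...and so on for the remaining real columns... */;
run;
```

This only works if the data lines themselves parse cleanly; if the embedded-delimiter problem also affects the data rows, you are back to writing your own DATA step.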