Hello all,
I have hundreds of datasets of the same structure (i.e., all the same variable names) that I need to set into one dataset. I used the very helpful SAS macro to import all of the csv files. They are each named DSN and appended with a count number (DSN1, DSN2, etc. up to 189). Now in attempting to set them, I am running into the dreaded error where a variable (in this case, Reaction_Time) is both character and numeric.
Instead of going into every single dataset and converting the variable before setting, I am hoping to be able to deal with this issue within the Set step.
I've run this code and many tweaks of this code and I am still getting "ERROR: Variable rt has been defined as both character and numeric."
data combined; set dsn: (rename=(Reaction_Time=rt)); rt = input(Reaction_Time, 8.); run;
Can someone please point me in the right direction?
Import while very helpful for single file reading is not the way to read multiple files of the same structure. Proc Import will make separate guesses based on the content of each file as to variable type and length. Different files, different properties.
If all of the files are supposed to be the same then the proper approach would be to write a data step that reads the variables with properties needed, which hopefully you have some documentation as to content to help with setting maximum lengths and proper informats for things like date or time values.
Then once you have that program working replace the PROC IMPORT code in that macro with your data step to read the files and create the sets.
Or using wildcards in the infile statement a single data step can read all the files but which approach to take there can vary a bit depending on the actual content of your files (multiple header rows for example add complexity [as well as typically messing with proc import])
Generate a data step that can read one file. You could start with the code that PROC IMPORT generates, but it is ugly and overcomplicated. You could use the code that this macro generates instead https://github.com/sasutils/macros/blob/master/csv2ds.sas.
Then just modify it to handle all of the files in one step. You might need to adjust the code so that it is compatible with all of the files (change the length of the character variables to handle the largest value for example).
Let's work an example. Say you exported SASHELP.CARS to a CSV file and wanted to read it in. You could use code like:
data fromcsv;
infile '/mydir/cars.csv' dsd truncover firstobs=2 ;
length Make $13 Model $39 Type $6 Origin $6 DriveTrain $5 MSRP 8
Invoice 8 EngineSize 8 Cylinders 8 Horsepower 8 MPG_City 8
MPG_Highway 8 Weight 8 Wheelbase 8 Length 8
;
informat MSRP comma. Invoice comma. ;
format MSRP comma8. Invoice comma8. ;
input Make -- Length ;
run;
To adjust it to read multiple files. You could change the filename to use a wildcard so it matches multiple files. Then add the FILENAME= option to the INFILE statement and remove the FIRSTOBS=2 option. Then add data step code to skip the first line of each file.
data fromcsv;
infile '/mydir/car*.csv' dsd truncover filename=fname ;
length Make $13 Model $39 Type $6 Origin $6 DriveTrain $5 MSRP 8
Invoice 8 EngineSize 8 Cylinders 8 Horsepower 8 MPG_City 8
MPG_Highway 8 Weight 8 Wheelbase 8 Length 8
;
informat MSRP comma. Invoice comma. ;
format MSRP comma8. Invoice comma8. ;
input @;
if fname ne lag(fname) then delete;
input Make -- Length ;
run;
Tom,
That seems like a useful start. The reason I used the other macro is that the csv files do not share a common prefix, so I cannot use a wildcard to read in all csv's in the directory. Do you have a suggestion for how to handle that within the example you posted? [Edit, I see I may be able to do something like infile '(M:\FPST-VR Pilot Testing\*.csv' dsd truncover filename=fname; to gather all csv's.)
My current outstanding issue is that the variable names have spaces within them. How do I use the Length statement with variables that will be renamed by SAS? Do I use the name SAS will give them?
If there is no pattern to the names how do you know which ones you want?
If you can make logic to select the files then just use a list of filenames. You can easily use DREAD() function to get filenames and then apply logic to select the ones you want. Don't do it in macro logic however, keep in it normal data step code so you have a dataset of the filenames and not a bunch of macro variables.
You could then use the data step to drive the process. So assume you had a dataset named LIST with variable named FNAME your code would look like like:
data fromcsv;
set list;
infile dummy dsd truncover firstobs=2 filevar=fname end=eof;
length Make $13 Model $39 Type $6 Origin $6 DriveTrain $5 MSRP 8
Invoice 8 EngineSize 8 Cylinders 8 Horsepower 8 MPG_City 8
MPG_Highway 8 Weight 8 Wheelbase 8 Length 8
;
informat MSRP comma. Invoice comma. ;
format MSRP comma8. Invoice comma8. ;
do while (not eof);
input Make -- Length ;
output;
end;
run;
Just because the column headers in the CSV file have spaces in them there is no reason to make the variable names have spaces. So if the header text is 'First Name' you could name the variable firstname or FirstName or first_name.
If you really do want to make variable names with sapces you will first have to set the VALIDVARNAME option to ANY instead of the normal V7 setting. And then use name literals. Then you could use 'First Name'n as the variable name in your SAS code.
@davidsmarch1 wrote:
Tom,
That seems like a useful start. The reason I used the other macro is that the csv files do not share a common prefix, so I cannot use a wildcard to read in all csv's in the directory. Do you have a suggestion for how to handle that within the example you posted? [Edit, I see I may be able to do something like infile '(M:\FPST-VR Pilot Testing\*.csv' dsd truncover filename=fname; to gather all csv's.)
My current outstanding issue is that the variable names have spaces within them. How do I use the Length statement with variables that will be renamed by SAS? Do I use the name SAS will give them?
If all of the CSV files in the folder have the same structur then M:\FPST-VR Pilot Testing\*.csv is indeed valid.
I tend to place those into a FILENAME statement:
Filename mycsv "M:\FPST-VR Pilot Testing\*.csv" ; /* and read using*/ data mydata; infile mycsv dsd dlm=',' <other infile options>; < rest of the data step> run;
One reason is changing the filename lets me test code with just one (or a subset of the files) without modifying the data step (unless needed).
Pick reasonable names, Add reasonable labels to the variables.
Also, depending on your data you may end up considering custom informats to read certain types of data. If you have values like "Yes"/"No", "True"/"false" you might want to consider an informat that reads them into a numeric 1/0 (for true/false) as there are many things that run much nicer with the numeric values. (Not to mention a properly designed informat will handle random case differences like Yes/yes/yEs and ONE informat can handle the yes/no, true/false values).
Also if you have some source columns that insist on putting things like N/A or NULL into columns that should really be numeric you can fix that with a proper informat to read as numeric and assign missing (not to mention special missing values if you need to know the difference be N/A and NULL (or Refused , Not Collected or similar).
Thank you for all of your suggestions. I ended up taking your advice and using a more controllable method. Here is what worked for me:
data combined;
LENGTH fname $200;
infile 'M:\FPST-VR Pilot Testing\TestLogs Data Collection\*.csv' delimiter = ',' dsd truncover filename=fname;
informat Participant_ID best32. Dominant_Hand $10. Player_Height best32. Trial_Count best32. Ambient_light best32. Debris best32.
Hallway_Length best32. Background $29. Ethnicity $5. Gender $4. Weapon $15. Spawn_Point $3. Distance_From_Player best32.
Target_Height best32. Height_Difference best32. Prefade_Time best32. Shadow_Fade_Time best32. Reaction $15. Reaction_Time $5.
Miliseconds best32. Left_hand_pitch best32. Left_hand_Yaw best32. Left_hand_Roll best32. Left_hand_Trigger_Press__ best32. Left_hand_Grip_Press__ best32.
Left_Hand_position_X best32. Left_Hand_position_Y best32. Left_Hand_position_Z best32. Right_Hand_pitch best32. Right_Hand_Yaw best32.
Right_Hand_Roll best32. Right_Hand_Trigger_Press__ best32. Right_Hand_Grip_Press__ best32. Right_Hand_position_X best32. Right_Hand_position_Y best32.
Right_Hand_position_Z best32. Head_pitch best32. Head_Yaw best32. Head_Roll best32.;
input @;
if eov then input;
input Participant_ID Dominant_Hand Player_Height Trial_Count Ambient_light Debris
Hallway_Length Background Ethnicity Gender Weapon Spawn_Point Distance_From_Player
Target_Height Height_Difference Prefade_Time Shadow_Fade_Time Reaction Reaction_Time
Miliseconds Left_hand_pitch Left_hand_Yaw Left_hand_Roll Left_hand_Trigger_Press__ Left_hand_Grip_Press__
Left_Hand_position_X Left_Hand_position_Y Left_Hand_position_Z Right_Hand_pitch Right_Hand_Yaw
Right_Hand_Roll Right_Hand_Trigger_Press__ Right_Hand_Grip_Press__ Right_Hand_position_X Right_Hand_position_Y
Right_Hand_position_Z Head_pitch Head_Yaw Head_Roll ;
subjpath = fname;
eov=0;
run;
I doubt that will work since you did not include the EOV= option on the INFILE statement.
But you don't need it since you did include the FILENAME= option.
So replace this line:
if eov then input;
With
if fname ne lag(fname) then input;
And remove this line:
eov=0;
Also not sure why you have the INFORMAT statement. None of the variables listed require any special informats to be read properly. The informats you are attaching are what will be used anyway (BEST is really the name of a FORMAT and using it as an INFORMAT name will just result in the normal numeric infomat being used).
It would be much clearer to have a LENGTH (or ATTRIB) statement to directly set the variable type and length instead of forcing SAS to guess what type and length to define them as based on information available at the time they are first used in some other statement (in this case the INFORMAT statement).
A minor point that may save some typing, when you have multiple variables with the same INFORMAT (or format) you can list all of the variables and only type the Informat (or format) name one time.
This currently works and may address some of your suggestions:
data combined;
LENGTH fname $200 Dominant_Hand $10 Background $29 Ethnicity $5 Gender $4 Weapon $15 Spawn_Point $3 Reaction $15 Reaction_Time $5;
infile 'M:\FPST-VR Pilot Testing\TestLogs Data Collection\*.csv' delimiter = ',' dsd truncover filename=fname;
informat Participant_ID best32. Player_Height Trial_Count Ambient_light Debris
Hallway_Length Distance_From_Player Target_Height Height_Difference Prefade_Time
Shadow_Fade_Time Miliseconds Left_hand_pitch Left_hand_Yaw Left_hand_Roll
Left_hand_Trigger_Press__ Left_hand_Grip_Press__ Left_Hand_position_X Left_Hand_position_Y
Left_Hand_position_Z Right_Hand_pitch Right_Hand_Yaw Right_Hand_Roll
Right_Hand_Trigger_Press__ Right_Hand_Grip_Press__ Right_Hand_position_X Right_Hand_position_Y
Right_Hand_position_Z Head_pitch Head_Yaw Head_Roll;
input @;
if fname ne lag(fname) then input;
input Participant_ID Dominant_Hand Player_Height Trial_Count Ambient_light Debris
Hallway_Length Background Ethnicity Gender Weapon Spawn_Point Distance_From_Player
Target_Height Height_Difference Prefade_Time Shadow_Fade_Time Reaction Reaction_Time
Miliseconds Left_hand_pitch Left_hand_Yaw Left_hand_Roll Left_hand_Trigger_Press__ Left_hand_Grip_Press__
Left_Hand_position_X Left_Hand_position_Y Left_Hand_position_Z Right_Hand_pitch Right_Hand_Yaw
Right_Hand_Roll Right_Hand_Trigger_Press__ Right_Hand_Grip_Press__ Right_Hand_position_X Right_Hand_position_Y
Right_Hand_position_Z Head_pitch Head_Yaw Head_Roll ;
subjpath = fname;
run;
Can you explain what the line "input @" is doing?
@davidsmarch1 wrote:
This currently works and may address some of your suggestions:
You want to read multiple .csv with a single data step where you need to skip (not read) the header line of each of these .csv.
infile 'M:\FPST-VR Pilot Testing\TestLogs Data Collection\*.csv' delimiter = ',' dsd truncover FILENAME=FNAME;
.....
if fname ne lag(fname)....
FNAME contains the name of the source file that currently gets processed. If the name from the previous iteration of the data step is different - lag(fname) - then you are currently reading the first line of a new .csv which is the header line that you want to skip. That's the case where IF FNAME NE LAG(FNAME) becomes true.
Each input statement moves the cursor to the next line of source text file unless you use the @argument. That's fully documented for the INPUT Statement .
You need input @; to read the next line to populate FNAME with the current name so you can compare it to the name of the previous line. But you need to keep the pointer on the current line so you can actually read this line into SAS variables in case it's not the header line.
If it is the header line then you issue another input statement so the pointer moves to the next line where you've got data.
if fname ne lag(fname) then input;
...but: Above logic will fail should you have a source .csv that only contains a header line without any data. To circumvent such an issue you could instead code as below:
input @; if fname ne lag(fname) then RETURN; input Participant_ID best32. ......
I also strongly suggest to use INFORMATS when reading source strings into SAS variables. Informats instruct SAS how to map a source string into a SAS variable.
attrib .... dominant_hand length=$10 INFORMAT=$10. ...and so on....
...or alternatively use the informat directly within the input statement.
input ..... dominant_hand :$10. ....
...and if there are a lot of variables and you're lazy then use the EG/Studio import wizard or Proc Import for one .csv and then copy/paste the generated data step code (with Proc Import from the SAS log) and amend it to what will work for all .csv's.
I usually use
infile .... filename=fname;
input @;
if fname ne lag(fname) then delete;
input ....
The will handle both empty files and files with only one line.
As @Tom rightfully flagged: Don't use RETURN but DELETE to always ensure the desired result.
Using RETURN will only work if you also add statement OUTPUT before the run statement.
@Patrick wrote:
As @Tom rightfully flagged: Don't use RETURN but DELETE to always ensure the desired result.
Using RETURN will only work if you also add statement OUTPUT before the run statement.
Actually RETURN or DELETE will have the same effect as they will both end the current iteration of the data step. I think you meant don't use INPUT as that will then allow the iteration to continue to the rest of the code statements.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.