BookmarkSubscribeRSS Feed
davidsmarch1
Fluorite | Level 6

Hello all,

 

I have hundreds of datasets of the same structure (i.e., all the same variable names) that I need to set into one dataset. I used the very helpful SAS macro to import all of the csv files. They are each named DSN and appended with a count number (DSN1, DSN2, etc. up to 189). Now in attempting to set them, I am running into the dreaded error where a variable (in this case, Reaction_Time) is both character and numeric.

 

Instead of going into every single dataset and converting the variable before setting, I am hoping to be able to deal with this issue within the Set step.

 

I've run this code and many tweaks of this code and I am still getting "ERROR: Variable rt has been defined as both character and numeric."

 

data combined;
set dsn: (rename=(Reaction_Time=rt));
rt = input(Reaction_Time, 8.);
run;

Can someone please point me in the right direction?

 

13 REPLIES 13
ballardw
Super User

Import while very helpful for single file reading is not the way to read multiple files of the same structure. Proc Import will make separate guesses based on the content of each file as to variable type and length. Different files, different properties.

 

If all of the files are supposed to be the same then the proper approach would be to write a data step that reads the variables with properties needed, which hopefully you have some documentation as to content to help with setting maximum lengths and proper informats for things like date or time values.

Then once you have that program working replace the PROC IMPORT code in that macro with your data step to read the files and create the sets.

 

Or using wildcards in the infile statement a single data step can read all the files but which approach to take there can vary a bit depending on the actual content of  your files (multiple header rows for example add complexity [as well as typically messing with proc import])

Tom
Super User Tom
Super User

Generate a data step that can read one file.  You could start with the code that PROC IMPORT generates, but it is ugly and overcomplicated.  You could use the code that this macro generates instead https://github.com/sasutils/macros/blob/master/csv2ds.sas.

 

Then just modify it to handle all of the files in one step.  You might need to adjust the code so that it is compatible with all of the files (change the length of the character variables to handle the largest value for example).

 

Let's work an example.  Say you exported SASHELP.CARS to a CSV file and wanted to read it in.  You could use code like:

data fromcsv;
  infile '/mydir/cars.csv' dsd truncover firstobs=2 ;
  length Make $13 Model $39 Type $6 Origin $6 DriveTrain $5 MSRP 8
    Invoice 8 EngineSize 8 Cylinders 8 Horsepower 8 MPG_City 8
    MPG_Highway 8 Weight 8 Wheelbase 8 Length 8
  ;
  informat MSRP comma. Invoice comma. ;
  format MSRP comma8. Invoice comma8. ;
  input Make -- Length ;
run;

To adjust it to read multiple files.  You could change the filename to use a wildcard so it matches multiple files.  Then add the FILENAME= option to the INFILE statement and remove the FIRSTOBS=2 option.  Then add data step code to skip the first line of each file.

data fromcsv;
  infile '/mydir/car*.csv' dsd truncover filename=fname ;
  length Make $13 Model $39 Type $6 Origin $6 DriveTrain $5 MSRP 8
    Invoice 8 EngineSize 8 Cylinders 8 Horsepower 8 MPG_City 8
    MPG_Highway 8 Weight 8 Wheelbase 8 Length 8
  ;
  informat MSRP comma. Invoice comma. ;
  format MSRP comma8. Invoice comma8. ;
  input @;
  if fname ne lag(fname) then delete;
  input Make -- Length ;
run;

 

davidsmarch1
Fluorite | Level 6

Tom,

 

That seems like a useful start. The reason I used the other macro is that the csv files do not share a common prefix, so I cannot use a wildcard to read in all csv's in the directory. Do you have a suggestion for how to handle that within the example you posted? [Edit, I see I may be able to do something like infile '(M:\FPST-VR Pilot Testing\*.csv' dsd truncover filename=fname; to gather all csv's.)

 

My current outstanding issue is that the variable names have spaces within them. How do I use the Length statement with variables that will be renamed by SAS? Do I use the name SAS will give them? 

Tom
Super User Tom
Super User

If there is no pattern to the names how do you know which ones you want?

If you can make logic to select the files then just use a list of filenames.  You can easily use DREAD() function to get filenames and then apply logic to select the ones you want.  Don't do it in macro logic however, keep in it normal data step code so you have a dataset of the filenames and not a bunch of macro variables.

 

You could then use the data step to drive the process.  So assume you had a dataset named LIST with variable named FNAME your code would look like like:

data fromcsv;
  set list;
  infile dummy dsd truncover firstobs=2 filevar=fname end=eof;
  length Make $13 Model $39 Type $6 Origin $6 DriveTrain $5 MSRP 8
    Invoice 8 EngineSize 8 Cylinders 8 Horsepower 8 MPG_City 8
    MPG_Highway 8 Weight 8 Wheelbase 8 Length 8
  ;
  informat MSRP comma. Invoice comma. ;
  format MSRP comma8. Invoice comma8. ;
  do while (not eof);
    input Make -- Length ;
    output;
  end;
run;

Just because the column headers in the CSV file have spaces in them there is no reason to make the variable names have spaces.  So if the header text is 'First Name'  you could name the variable firstname or FirstName or first_name.

 

If you really do want to make variable names with sapces you will first have to set the VALIDVARNAME option to ANY instead of the normal V7 setting.  And then use name literals.  Then you could use 'First Name'n as the variable name in your SAS code.

ballardw
Super User

@davidsmarch1 wrote:

Tom,

 

That seems like a useful start. The reason I used the other macro is that the csv files do not share a common prefix, so I cannot use a wildcard to read in all csv's in the directory. Do you have a suggestion for how to handle that within the example you posted? [Edit, I see I may be able to do something like infile '(M:\FPST-VR Pilot Testing\*.csv' dsd truncover filename=fname; to gather all csv's.)

 

My current outstanding issue is that the variable names have spaces within them. How do I use the Length statement with variables that will be renamed by SAS? Do I use the name SAS will give them? 


If all of the CSV files in the folder have the same structur then M:\FPST-VR Pilot Testing\*.csv is indeed valid.

I tend to place those into a FILENAME statement:

Filename mycsv "M:\FPST-VR Pilot Testing\*.csv" ;

/* and read using*/
data mydata;
   infile mycsv dsd dlm=',' <other infile options>;
  < rest of the data step>
run;

One reason is changing the filename lets me test code with just one (or a subset of the files) without modifying the data step (unless needed).

 

Pick reasonable names, Add reasonable labels to the variables.

Also, depending on your data you may end up considering custom informats to read certain types of data. If you have values like "Yes"/"No", "True"/"false" you might want to consider an informat that reads them into a numeric 1/0 (for true/false) as there are many things that run much nicer with the numeric values. (Not to mention a properly designed informat will handle random case differences like Yes/yes/yEs and ONE informat can handle the yes/no, true/false values).

Also if you have some source columns that insist on putting things like N/A or NULL into columns that should really be numeric you can fix that with a proper informat to read as numeric and assign missing (not to mention special missing values if you need to know the difference be N/A and NULL (or Refused , Not Collected or similar).

davidsmarch1
Fluorite | Level 6

Thank you for all of your suggestions. I ended up taking your advice and using a more controllable method. Here is what worked for me:

data combined;
	LENGTH fname $200;
    infile 'M:\FPST-VR Pilot Testing\TestLogs Data Collection\*.csv' delimiter = ',' dsd truncover filename=fname;
	informat 	Participant_ID best32.	Dominant_Hand $10.	Player_Height best32.	Trial_Count	best32. Ambient_light	best32. Debris best32.
		Hallway_Length best32.	Background	$29. 	Ethnicity $5.	Gender $4.	Weapon $15. 	Spawn_Point $3.	Distance_From_Player best32.
		Target_Height best32.		Height_Difference best32. Prefade_Time best32.	Shadow_Fade_Time best32.	Reaction $15.	Reaction_Time $5.	
		Miliseconds best32.	Left_hand_pitch best32.	Left_hand_Yaw best32.	Left_hand_Roll best32.	Left_hand_Trigger_Press__ best32.	Left_hand_Grip_Press__ best32.	
		Left_Hand_position_X best32.	Left_Hand_position_Y best32.	Left_Hand_position_Z best32.	Right_Hand_pitch best32.	Right_Hand_Yaw	best32.
		Right_Hand_Roll best32.	Right_Hand_Trigger_Press__ best32.	Right_Hand_Grip_Press__	best32. Right_Hand_position_X best32.	Right_Hand_position_Y best32.	
		Right_Hand_position_Z best32.	Head_pitch best32.	Head_Yaw best32.	Head_Roll best32.;
    input @;
	if eov then input;
	input 	Participant_ID 	Dominant_Hand 	Player_Height 	Trial_Count	 Ambient_light	 Debris 
		Hallway_Length 	Background		Ethnicity 	Gender 	Weapon  	Spawn_Point 	Distance_From_Player 
		Target_Height 		Height_Difference  Prefade_Time 	Shadow_Fade_Time 	Reaction 	Reaction_Time 	
		Miliseconds 	Left_hand_pitch 	Left_hand_Yaw 	Left_hand_Roll 	Left_hand_Trigger_Press__ 	Left_hand_Grip_Press__ 	
		Left_Hand_position_X 	Left_Hand_position_Y 	Left_Hand_position_Z 	Right_Hand_pitch 	Right_Hand_Yaw	
		Right_Hand_Roll 	Right_Hand_Trigger_Press__ 	Right_Hand_Grip_Press__	 Right_Hand_position_X 	Right_Hand_position_Y 	
		Right_Hand_position_Z 	Head_pitch 	Head_Yaw 	Head_Roll ;
	subjpath = fname;
	eov=0;
    run;
Tom
Super User Tom
Super User

I doubt that will work since you did not include the EOV= option on the INFILE statement.

But you don't need it since you did include the FILENAME= option.

So replace this line:

if eov then input;

With

if fname ne lag(fname) then input;

And remove this line:

eov=0;

 

Also not sure why you have the INFORMAT statement.  None of the variables listed require any special informats to be read properly. The informats you are attaching are what will be used anyway (BEST is really the name of a FORMAT and using it as an INFORMAT name will just result in the normal numeric infomat being used).

 

It would be much clearer to have a LENGTH (or ATTRIB) statement to directly set the variable type and length instead of forcing SAS to guess what type and length to define them as based on information available at the time they are first used in some other statement (in this case the INFORMAT statement).

ballardw
Super User

A minor point that may save some typing, when you have multiple variables with the same INFORMAT (or format) you can list all of the variables and only type the Informat (or format) name one time.

 

 

davidsmarch1
Fluorite | Level 6

This currently works and may address some of your suggestions:

 

data combined;
	LENGTH fname $200 Dominant_Hand $10 Background	$29	Ethnicity $5 Gender $4	Weapon $15 Spawn_Point $3 Reaction $15	Reaction_Time $5;
    infile 'M:\FPST-VR Pilot Testing\TestLogs Data Collection\*.csv' delimiter = ',' dsd truncover filename=fname;
	informat 	Participant_ID best32. Player_Height Trial_Count Ambient_light  Debris 
		Hallway_Length Distance_From_Player Target_Height Height_Difference Prefade_Time 	
		Shadow_Fade_Time Miliseconds Left_hand_pitch Left_hand_Yaw Left_hand_Roll 	
		Left_hand_Trigger_Press__ Left_hand_Grip_Press__  Left_Hand_position_X Left_Hand_position_Y 	
		Left_Hand_position_Z Right_Hand_pitch Right_Hand_Yaw Right_Hand_Roll 	
		Right_Hand_Trigger_Press__ Right_Hand_Grip_Press__	Right_Hand_position_X Right_Hand_position_Y 	
		Right_Hand_position_Z Head_pitch Head_Yaw Head_Roll;
    input @;
	if fname ne lag(fname) then input;
	input 	Participant_ID 	Dominant_Hand 	Player_Height 	Trial_Count	 Ambient_light	 Debris 
		Hallway_Length 	Background		Ethnicity 	Gender 	Weapon  	Spawn_Point 	Distance_From_Player 
		Target_Height 		Height_Difference  Prefade_Time 	Shadow_Fade_Time 	Reaction 	Reaction_Time 	
		Miliseconds 	Left_hand_pitch 	Left_hand_Yaw 	Left_hand_Roll 	Left_hand_Trigger_Press__ 	Left_hand_Grip_Press__ 	
		Left_Hand_position_X 	Left_Hand_position_Y 	Left_Hand_position_Z 	Right_Hand_pitch 	Right_Hand_Yaw	
		Right_Hand_Roll 	Right_Hand_Trigger_Press__ 	Right_Hand_Grip_Press__	 Right_Hand_position_X 	Right_Hand_position_Y 	
		Right_Hand_position_Z 	Head_pitch 	Head_Yaw 	Head_Roll ;
	subjpath = fname;
    run;

Can you explain what the line "input @" is doing? 

Patrick
Opal | Level 21

@davidsmarch1 wrote:

This currently works and may address some of your suggestions:

 


You want to read multiple .csv with a single data step where you need to skip (not read) the header line of each of these .csv. 

infile 'M:\FPST-VR Pilot Testing\TestLogs Data Collection\*.csv' delimiter = ',' dsd truncover FILENAME=FNAME;
.....
if fname ne lag(fname)....

FNAME contains the name of the source file that currently gets processed. If the name from the previous iteration of the data step is different - lag(fname) - then you are currently reading the first line of a new .csv which is the header line that you want to skip. That's the case where IF FNAME NE LAG(FNAME) becomes true.

Each input statement moves the cursor to the next line of source text file unless you use the @argument. That's fully documented for the INPUT Statement

Patrick_0-1713140222240.png

You need input @;   to read the next line to populate FNAME with the current name so you can compare it to the name of the previous line. But you need to keep the pointer on the current line so you can actually read this line into SAS variables in case it's not the header line. 

 

If it is the header line then you issue another input statement so the pointer moves to the next line where you've got data.

if fname ne lag(fname) then input;

 

...but: Above logic will fail should you have a source .csv that only contains a header line without any data. To circumvent such an issue you could instead code as below:

input @;
if fname ne lag(fname) then RETURN;
input Participant_ID best32. ......

 

I also strongly suggest to use INFORMATS when reading source strings into SAS variables. Informats instruct SAS how to map a source string into a SAS variable. 

attrib .... dominant_hand length=$10 INFORMAT=$10.  ...and so on....

...or alternatively use the informat directly within the input statement.

input ..... dominant_hand :$10.  ....

...and if there are a lot of variables and you're lazy then use the EG/Studio import wizard or Proc Import for one .csv and then copy/paste the generated data step code (with Proc Import from the SAS log) and amend it to what will work for all .csv's. 

Tom
Super User Tom
Super User

I usually use

infile .... filename=fname;
input @;
if fname ne lag(fname) then delete;
input ....

The will handle both empty files and files with only one line.

Patrick
Opal | Level 21

@davidsmarch1 

As @Tom rightfully flagged: Don't use RETURN but DELETE to always ensure the desired result. 

Using RETURN will only work if you also add statement OUTPUT before the run statement.

Tom
Super User Tom
Super User

@Patrick wrote:

@davidsmarch1 

As @Tom rightfully flagged: Don't use RETURN but DELETE to always ensure the desired result. 

Using RETURN will only work if you also add statement OUTPUT before the run statement.


Actually RETURN or DELETE will have the same effect as they will both end the current iteration of the data step.  I think you meant don't use INPUT as that will then allow the iteration to continue to the rest of the code statements.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 713 views
  • 6 likes
  • 4 in conversation