Re: Run data step on all files in directory

pamplemouse22 · Posted 02-12-2020 04:48 PM

Hello,

I am trying to run the following code through all the data sets in my work directory (about 3000 files). All of these files begin with "_" (e.g., _01012003, _01022003, etc). Can you please advise on a simple way to rename the fifth variable in all datasets?

Currently, each data set looks something like this, named by the date, and the date is included in the fifth variable. (It's the third in this example data set). The x and y variables are not sorted.

data _01012003; 
	input x y mean_01012003; 
	datalines ; 
	-117.18604, 32.752861, 29.082
	-117.173935, 32.74743, 29.580
	-117.116278, 32.744611, 29.580
	;
run;

The original code I have to import these datafiles is copied below . Maybe there is a way to tweak the original code to rename the 5th variable.

%macro impt(fname,path);
	proc import DATAFILE="&path" OUT=&fname DBMS=xls REPLACE;
	run;
%mend impt;

%let path=D:\User\Data\Output\ap\;
data _null_;
	length fname $100;
   command = "dir &path.*.xls /b";;
   infile dummy pipe filevar=command end=eof truncover filename=fname;
   do i = 1 by 1 while(not eof);
      input path $128.;
      path=catt("&path.",path);
	  fname = scan(scan(path,-1,'\'),1,'.');
      call execute(cats('%nrstr(%impt)(fname=',fname,',path=',path,');'));
      end;
   stop;
run;

The final goal is to have a combined dataset that looks like this: I was having issues with transposing it directly because the variable names did not match and the rows were unsorted.

data goal; 
	input x y mean date; 
	datalines; 
	-117.18604, 32.752861, 29.082, 01012003
	-117.173935, 32.74743, 29.580, 01012003
	-117.116278, 32.744611, 29.580, 01012003
	;
run;

Thank you.

ballardw · Posted 02-12-2020 05:43 PM

Using a data step to rename variables is inefficient because the data step would read each record from each set. And your code creates a new variable , not renames it.
Proc Data sets with the MODIFY statement will change a variable name in place without the potential for changing any data values.

If the variable of interest truly is the 5th variable, as reported by proc contents, then you select all of the 5th variables from sets in a library with code like

data example;
   set sashelp.vcolumn (where=(libname = 'YOURLIB' and varnum=5 ));
RUN;

The libname must be uppercase as that is how it is stored in the meta data.

You can then use that data set with a data step to create proc datasets code to modify the variable names.

data _null_;
   set example end=last;
   file print;
   length codestr $ 100;
   if _n_=1 then codestr =catx(' ','Proc datasets library=',libname,'nodetails nolist;');
   put codestr;
   codestr =catx (' ','Modify',memname,';');
   put codestr;
   codestr = catx(' ','Rename',name,'=mean;' );
   put codestr;
   if last then do;
      codestr='run;quit;';
      put codestr;
   end;
run;

This will write Proc Datasets code to the results window that will look something like

Proc datasets library= WORK nodetails nolist;                                                     
Modify CARS ;                                                                                     
Rename DriveTrain =mean;                                                                          
                                                                                                  
Modify CLASS ;                                                                                    
Rename Weight =mean;                                                                              
run;quit;

using some temporary sets I made to demonstrate the code. You can verify that the needed variables appear in the rename statement. If so then copy that into the editor and run the generated lines of code.

Another option is to modify the data _null_ code so that each Put Codestr; line is replaced with

call execute(codestr);

which will stack of the lines into the execution queue and execute the code after the data step finishes.

mkeintz · Posted 02-12-2020 05:57 PM

You would be better off doing renames instead of assignments. Let's say dataset _01012003 has the fifth variable named X and datasets _01012999 has 5th variable named Y. Then you would want;

data want;

set _01012003 (rename=x=mean)

_01012003 (rename=y=mean)

….

;

YOu can use the DICTIONARIES capability of sql which access metadata, like the variable names, as in:

data _one ;
  input x1 x2 x3 x4 x;
datalines;
11 12 13 14 15
;
data _two;
  input x1 x2 x3 x4 y;
datalines;
21 22 23 24 25
;

proc sql noprint;
  select cats (memname,' (rename=',name,'=mean)') 
  into :dslist separated by ' '
  from dictionary.columns where libname='WORK' and char(memname,1)='_' and varnum=5;
%put &=dslist;
quit;

data want;
  set &dslist;
run;

But remember, this depends CRUCIALLY on the variable in question ALWAYS being the 5th variable (not just the fifty numeric variable). One could add some messy code to the above to get the first numeric, but this gives the basic concept.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ballardw · Posted 02-12-2020 06:48 PM

@pamplemouse22 didn't give a very complete description of the data. If there is exactly one variable whose name begins with "mean_" then you could search for that regardless of the position in the data set:

data example;
   set sashelp.vcolumn (where=(libname = 'WORK' ));
   if upcase(name)=: 'MEAN_';
RUN;

pamplemouse22 · Posted 02-12-2020 07:31 PM

I updated the post! Thank you. I will try your suggestion.

pamplemouse22 · Posted 02-12-2020 07:28 PM

Thank you! Rename won't work because i would have thousands of variables to rename.

I will try the dictionaries feature in proc sql! thanks!

Patrick · Posted 02-13-2020 12:42 AM

@pamplemouse22

You could try something like below.

/* create some sample data */
proc datasets lib=work nolist nowarn;
  delete _prodsavail;
  run;
quit;
data 
  _abc(rename=(weight=_a_weight)) 
  _xyz(rename=(weight=_bweight));
  set sashelp.class;
run;


/* generate code for matching tables */
proc sql;
  create table members as
  select libname, memname, name
  from dictionary.columns 
  where 
    libname='WORK' 
    and memname like '^_%' escape '^' 
    and varnum=5
  ;
quit;

filename codegen temp;
data test;
/*  file print;*/
  file codegen;
  set members end=last;
  if _n_=1 then 
    put 
      @3 'length _sourcefile_name $41 sourcefile_name $41 ;' / 
      @3 'set';
  put 
    @5 libname +(-1) '.' memname '(rename=(' name '=common_name))';
  if last then 
    put 
      @5 'indsname=_sourcefile_name;' /
      @5 'sourcefile_name=_sourcefile_name;' ;
run;

/* execute generated code to create a table with the combined data */
data want;
  %include codegen / source2;
run;

proc print data=want;
run;

Tom · Posted 02-12-2020 10:32 PM

This would be much easier if you actually had datafiles (either actual SAS datasets or delimited text files) but it looks like you instead of a bunch of XLS files instead.

For now let's assume you are lucky and the XLS files do not have stuff that will cause PROC IMPORT to generate incompatible datasets.

Looks like you just want to do the following.

data all ;
  length dsname $41 ;
  set _: indsname=dsname ;
  mean=coalesce(of mean_:);
  date = input(scan(dsname,-1,'_'),mmddyy8.);
  format date yymmdd10.;
  drop mean_:;
run;

So read all of the datasets whose name starts with an underscore. Put the NAME of the dataset into the variable DSNAME. Read the last 8 characters of the name as a date (Make the informat match the style used in the names). Use the COALESCE() function to get the only non-missing value of all of those MEAN_: variables into a new variable.

Reeza · Posted 02-12-2020 11:39 PM

How were those files created? Can you back up this process and fix that step instead?

pamplemouse22 · Posted 02-13-2020 11:33 AM

They were created with a different software and took a really long time to run - would prefer not to revisit that.

Kurt_Bremser · Posted 02-13-2020 02:23 AM

Your problem is called "Excel", and as long as you try to work with files in this unusable format, you will waste your time on fixing problems that shouldn't be there in the first place.

INSIST on the data being sent in a usable format (delimited text files), where you simply ignore the header line and set correct column attributes in the data step that reads the files; concatenating the data into one dataset will then be a breeze, and if you get incompatible data (wrong order of columns or such), the data step will throw an error.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

pamplemouse22 · Posted 02-13-2020 11:34 AM

I'm very open to doing this! Especially since I only have 5 variables in each file. How would I modify the original import code to do this?

Reeza · Posted 02-13-2020 11:46 AM

GETNAMES=NO if you know all the files are the same format and then stack them and rename your 5 variables.

pamplemouse22 · Posted 02-13-2020 12:15 PM

so simple, i can't believe I didn't think of this! Thank you!

Classroom Training Available!