Solved: Combine files from different source folders

NewUsrStat · Posted 10-30-2024 04:56 AM

Hi guys,

suppose to have a number of folders:

Folder Tokyo: "C:\Users\....\Tokyo"

Folder NYC: "C:\Users\....\NYC"

Folder Singapore: "C:\Users\....\Singapore"

Each folder contains different files named identical across the folders: Hospitalization_flu, Hospitalization_Covid, ....

The files appear like this:

data Hospitalization_flu  ;
  input ID:$20. Admission :date09. Discharge :date09. Sex  Morbidity;
  format Admission date9. Discharge date9.;
cards;
0001  13JAN2017 25JAN2017  M   Yes
0001  22FEB2018 03MAR2018  M   No
0001  22FEB2018 03MAR2018  M   Yes
0002  01DEC2016 14DEC2016  F   No
0002  01DEC2016 14DEC2016  F   No
0002  25DEC2017 02JAN2018  F   No
0002  06JAN2018 09JAN2018  F   Yes
;run;

I would like to append (vertically with set) all files with the same name from different folders, for example all "Hospitalization_flu" files to generate a unique Hospitalization_flu file but:

the files, for example Hospitalization_flu files have a variable number of columns but a set of common columns like ID, Admission, Discharge, .... I'm not interested in extra columns but I would like to take only the common variables.
the order of variables is not the same so the files should be sorted to have the same order
I would like to add an extra variable that takes the name of the folder of origin in order to track the source folder

The folders are around 20 and so totally I have 20 files named for example "Hospitalization_flu", 20 named "Hospitalization_Covid" and so on. Moreover the files contain around 70 variables and to sort them manually is not feasible.

Note that IDs present in different files/folders are unique since from an anonymization process and files are SAS datasets.

Can anyone help me please to automate the process?

Thank you in advance

Tom · Posted 10-30-2024 10:25 AM

Is this a one time thing? or will the list of folders (or list of datasets) change over time and you need something that will detect the new folders and/or new datasets and generate the appropriate code?

You need to break the problem into pieces and figure out how you will solve each one.

If you know the names of the datasets and have librefs already defined pointing to the folders then the code you want to run will be something like this:

data Hospitalization_flu  ;
  length source $8 ;
  set Tokyo.Hospitalization_flu
      NYC.Hospitalization_flu
      Singapor.Hospitalization_flu
      indsname=dsname
  ;
  source = scan(indsname,1,'.');
run;

Which will use the INDSNAME= option of the SET statement to get the name of the dataset that contributed the current observation into a temporary variable. The SCAN() function can then be used to pull out the libref part of the name. Note that the libref values are limited to 8 characters.

If you need help finding out what datasets exist so that the code can be automated then you probably want to use a tool that can find the names of all of the files in a directory tree. So something like: https://github.com/sasutils/macros/blob/master/dirtree.sas

Once you have the list you can use it to generate code by just using data steps (macros are not really needed if the goal is just to combine all of them).

So something like:

* Get list of files ;
%dirtree(\\Users\xxx\TopFolder,out=files,maxdepth=2)

* Subset to SAS datasets ;
* Generate libref for each path ;
data datasets;
  set files;
  where scan(file,-1,'.')='sas7bdat';
  by path;
  length libref $8 ;
  retain libref rc;
  if first.path then do;
     llbref=' ';
     rc=libname(libref,path);
  end;
run;

Now that you re-order the data by member name instead and use it to generate the data step.

proc sort data=datasets;
  by filename path;
run;

filename code temp;
data _null_;
  file code lrecl=75 ;
  set datasets;
  by filename ;
  length memname $32 dsname $41 ;
  memname=scan(filename,1,'.');
  dsname=catx('.',libef,memname);
  if first.filename then put 
  'data ' memname ';' 
/ '  set ' @
  ;
  put dsname @ ;
  if last.filename then put
  'indsname=dsname;'
/ '  length libref $8;'
/ '  libref=scan(dsname,1,".");'
/ 'run;'
  ;
run;

View solution in original post

Kurt_Bremser · Posted 10-30-2024 05:23 AM

Define "file".
If these are SAS datasets, then the horizontal position of columns is irrelevant, as the variables are addressed by name.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

NewUsrStat · Posted 10-30-2024 06:26 AM

Hi, yes they are sas datasets

LinusH · Posted 10-30-2024 08:25 AM

Definie "automate".

If you need to repeat the jobs and perdform updates due to new folders and variables, sure, I would build macro programs to do the job.

But it's not a super quick task, but here are som general thoughts.

You need a libname for each folder, so have a list/dataset with paths and name for each libref, use a data step with call execute to assign them.

If they are already assigned, you could go the other way around and fetch this informaiton from DICTIONARY.LIBNAMES using proc sql;

In macro lop over the dat stes from each folder, potentially build a view on each table where you add the folder path in a new variable, and at the end concatenate all of them. If your datsets are not big you would be fine using a KEEP statement in the data step that concatenets the data. Otherwise generate a KEEP= data set option when you first read each data set.

Data never sleeps

NewUsrStat · Posted 10-30-2024 10:29 AM

Thank you very much for your comment. "Automate" in the sense that I cannot remove extra variables by hand, sort or do additional operations on such a huge number of variables and datasets because it is error prone.

ballardw · Posted 10-30-2024 10:08 AM

@NewUsrStat wrote:
Hi, yes they are sas datasets

I hope someone made sure that all of the like named data sets have the same named variables and that the same named variables all have the same type, numeric or character, and that the Lengths of the variables are the same.

Otherwise combining data set with different types for the same variable will fail with errors and different lengths may result in truncation of data values.

NewUsrStat · Posted 10-30-2024 10:26 AM

This is the case. All datasets have coherent variables in names, length etc as they result from other coding.

Tom · Posted 10-30-2024 10:25 AM

Is this a one time thing? or will the list of folders (or list of datasets) change over time and you need something that will detect the new folders and/or new datasets and generate the appropriate code?

You need to break the problem into pieces and figure out how you will solve each one.

If you know the names of the datasets and have librefs already defined pointing to the folders then the code you want to run will be something like this:

data Hospitalization_flu  ;
  length source $8 ;
  set Tokyo.Hospitalization_flu
      NYC.Hospitalization_flu
      Singapor.Hospitalization_flu
      indsname=dsname
  ;
  source = scan(indsname,1,'.');
run;

Which will use the INDSNAME= option of the SET statement to get the name of the dataset that contributed the current observation into a temporary variable. The SCAN() function can then be used to pull out the libref part of the name. Note that the libref values are limited to 8 characters.

If you need help finding out what datasets exist so that the code can be automated then you probably want to use a tool that can find the names of all of the files in a directory tree. So something like: https://github.com/sasutils/macros/blob/master/dirtree.sas

Once you have the list you can use it to generate code by just using data steps (macros are not really needed if the goal is just to combine all of them).

So something like:

* Get list of files ;
%dirtree(\\Users\xxx\TopFolder,out=files,maxdepth=2)

* Subset to SAS datasets ;
* Generate libref for each path ;
data datasets;
  set files;
  where scan(file,-1,'.')='sas7bdat';
  by path;
  length libref $8 ;
  retain libref rc;
  if first.path then do;
     llbref=' ';
     rc=libname(libref,path);
  end;
run;

Now that you re-order the data by member name instead and use it to generate the data step.

proc sort data=datasets;
  by filename path;
run;

filename code temp;
data _null_;
  file code lrecl=75 ;
  set datasets;
  by filename ;
  length memname $32 dsname $41 ;
  memname=scan(filename,1,'.');
  dsname=catx('.',libef,memname);
  if first.filename then put 
  'data ' memname ';' 
/ '  set ' @
  ;
  put dsname @ ;
  if last.filename then put
  'indsname=dsname;'
/ '  length libref $8;'
/ '  libref=scan(dsname,1,".");'
/ 'run;'
  ;
run;

NewUsrStat · Posted 10-30-2024 10:42 AM

"Is this a one time thing? or will the list of folders (or list of datasets) change over time and you need something that will detect the new folders and/or new datasets and generate the appropriate code?" No, nothing will never change.

NewUsrStat · Posted 10-30-2024 11:34 AM

Thank you very much Tom for your help!

Eric_Lee · Posted 10-30-2024 09:24 AM

The attachment works when the type of variables across all datasets is same. If not, the code must be modified to handle this mismatch, and errors will occur while the code is running.

Please update the rows about the 'libname' as you need.

Edit by KB:

The code in a code window, so nobody needs to view/download attachments:

libname folder1 "path_to_folder1";
libname folder2 "path_to_folder2";
libname folder3 "path_to_folder3";
libname folder4 "path_to_folder4";

libname placed "path_to_you_wanna_place_dataset";


options validvarname=upcase;


/*get the informations about all datasets of all folder*/
proc sql noprint;
  create table vcolumn as
    select distinct libname,memname,name,type,length from sashelp.vcolumn
      where libname like 'FOLDER%' and memtype='DATA' ;
quit;


/*get properties of all unique variables*/
proc sql noprint;
  create table vcol as
    select distinct memname,name,type,max(length) as length 
      from vcolumn group by memname,name;
quit;
proc sort;by memname name type;run;

data attributes_var;
  length attr $10000 attr_var $200;
  retain attr;
  set vcol;
  by memname name type;
  if type in ('char') then attr_var=catx(' ',name,cats('char(',put(length,best.),')'));
  else attr_var=catx(' ',name,'num');

  if first.memname then attr=strip(attr_var);
  else attr=catx(' , ',attr,attr_var);
  if last.memname then attr=cats('(',attr,');');
  if last.memname then output;
run;


/*get datasets to set*/
proc sql noprint;
  create table vname as
    select distinct memname,libname from vcolumn;
quit;
proc sort;by memname libname;run;

data attributes_dataset;
  length attr $10000 attr_dataset $200;
  retain attr;
  set vname;
  by memname libname;
  attr_dataset=catx('.',libname,memname);

  if first.memname then attr=strip(attr_dataset);
  else attr=catx(' ',attr,attr_dataset);
  if last.memname then output;
run;


proc sql noprint;
  select count(distinct memname) into: nloop trimmed from attributes_dataset;
quit;
%put &nloop.;


%macro loop(placed_folder=);

%do i=1 %to &nloop.;

  data _null_;
    set attributes_dataset;
    if _n_=&i.;
    call symputx('name',strip(memname),'G');
    call symputx('datasets',strip(attr),'G');
  run;

  proc sql noprint;
    select distinct strip(attr) into: vars
      from attributes_var where memname in ("&name.");
  quit;

  %put &i./&nloop.==>&name. &datasets.<==>&vars.;


  proc sql noprint;
    create table temp
      &vars.;
  quit;

  data &placed_folder..&name.;
    set temp &datasets.;
  run;

%end;


%mend loop;


/*the arguement of 'placed_folder' is the libname to place the combined dataset*/
%loop(placed_folder=placed);

NewUsrStat · Posted 10-30-2024 10:30 AM

Thank you very much Eric for your help. I will try to modify the script accordingly.

Combine files from different source folders

Re: Combine files from different source folders

Re: Combine files from different source folders

Re: Combine files from different source folders

Re: Combine files from different source folders

Re: Combine files from different source folders

Re: Combine files from different source folders

Re: Combine files from different source folders

Re: Combine files from different source folders

Re: Combine files from different source folders

Re: Combine files from different source folders

Re: Combine files from different source folders

Re: Combine files from different source folders

The 2025 SAS Hackathon has begun!