Solved: Re: How to merge all files in one folder VERTICALLY

Phil_NZ · Posted 02-04-2021 11:16 PM

Hi SAS Users,

I know how to merge all sheets in one file horizontally, but today I face a problem of merging files vertically, I am wondering if you can help me to sort it out.

So my code to import all 64 files sas7bdat is as below

/*Inserting all files*/

%macro importdata(File=,outf=);
Libname input 'C:\Users\pnguyen\OneDrive - Massey University\PhD JOURNEY\3calculation';

    data &outf.;

    set input.&outf._;

  run;
%mend;

filename mydir 'C:\Users\pnguyen\OneDrive - Massey University\PhD JOURNEY\3calculation';
data _null_;
did = dopen('mydir');
do i = 1 to dnum(did);
  fname = scan(dread(did,i),1,'.');
  /*At this point I assign that the filename has the suffix is xlsx or sas7bdat
  fname= ARGENTINA_*/
  length short_fn $29;
  short_fn= substr(fname, 1,length(fname)-1);
  /*short_fn=ARGENTINA*/

  cmd=cats('%importdata(File=C:\Users\pnguyen\OneDrive - Massey University\PhD JOURNEY\3calculation\',
      strip(fname),',outf=',short_fn,');');
  call execute(cmd);
end;
keep fname;
run;

And the results I have is as below

All 64 files having the same column, I want to merge these 64 files together vertically. Simply speaking, converting 64 countries to one country.

And I also used Proc contents to retrieve the format of all variables in one file if it facilitates you in coding

And I am wondering if you can please explain to me what does the "#" column (first column) of the

Alphabetic List of Variables and Attributes

table above indicate?

Many thanks and warmest regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

Tom · Posted 02-06-2021 07:56 AM

Are all of these dataset supposed to contain the same variables? If so then why do they have different lengths for some of the variables (or worse different types)? How were these datasets created? Can you fix the step that created them to insure that they are all defining the variables in the same way? For example if you created the datasets originally by converting a text file, like a CSV file, into the SAS datasets then use the same data step code to define for each. Or better just read all of the source text files in a single data step.

If you cannot fix the issue earlier in the process then you will need to figure what length (and type) to use for each variable. For character variables just define the length as the maximum length the variable has in any of the datasets. For type mismatches you will want to pick a type and convert them. How to pick a type depends on what is in the data. If the most datasets have the variable as numeric and the few that have it as character all of the actual non-blank values n those dataset are digit strings that could be convert into a number then make the variable numeric. (Note this is also true for date values. If all of the strings in the datasets where the date variable is character are strings that can be converted into a date then define the variable as numeric and convert the variable to numeric using an appropriate date informat.). But if some of the strings cannot be numbers than convert the variables that are numeric to character. Note that if one dataset has the variable of the wrong type but all of the variables are missing (or blank for character variables) then just drop the variable from that dataset.

So how do you figure it out? Look at the list of variables. For eaxmple by using PROC CONTENTS.

You can program some of the checks . So this code will make a list all of the variable names that includes a count of how many times it is numeric and how many times it is character and the maximum length. It then prints just the names that have any type mismatches .

proc contents data=INPUT._ALL_ noprint out=contents;
run;
proc sql ;
create table variables as 
  select upcase(name) as name
          , count(type=1) as n_num
         ,  count(type=2) as n_char
         , max(length) as max_length
 from contents
  group by 1
;
quit;
proc print data=variables;
  where n_num>0 and n_char>0 ;
run;

You can use that to generate code to define the type/length of every variable.

filename code temp;
data _null_;
  set variables end=eof;
  if _n_=1 then put 'length';
  put @3 name @ ;
  if n_char then put '$' @;
  put max_length ;
  if eof then put ';' ;
run;
data skeleton ;
  %include code / source2;
  stop;
run;

Now to make your overall dataset just add that empty SKELETON dataset as the first in the list. Make sure to remove any formats attached to the character variables.

data want;
  set skeleton &dslist ;
  format _character_ ;
run;

View solution in original post

andreas_lds · Posted 02-05-2021 12:27 AM

And I am wondering if you can please explain to me what does the "#" column (first column) of the

Alphabetic List of Variables and Attributes

table above indicate?

The number is the position of the variable in the dataset.

ChrisNZ · Posted 02-05-2021 12:38 AM

Mock-up code to retrieve the table names and then concatenate them:

proc sql noprint:
  select MEMNAME into :names separated by ' ' from DICTIONARY.TABLES where LIBNAME='MAPS';
quit;
data WANT;
  set &names.;
run;

Why would you do this?

Maybe a view is better?

data V_WANT/view=V_WANT;
  set &names.;
run;

High-Performance SAS Coding - Third Edition

Phil_NZ · Posted 02-05-2021 04:32 AM

Hi @ChrisNZ , thank you for your suggestion.

I run the code given by you and change the ":" to ";" at the first line, the result has some errors like mentioned in the log, can you please tell me what does the error mean?

28         proc sql noprint;
29           select MEMNAME into :names separated by ' ' from DICTIONARY.TABLES where LIBNAME='MAPS';
30         quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
      

31         data WANT;
32           set &names.;
ERROR: File WORK.AFGHANI2.DATA does not exist.
ERROR: File WORK.AFGHANIS.DATA does not exist.
ERROR: File WORK.AFRICA.DATA does not exist.
ERROR: File WORK.AFRICA2.DATA does not exist.
ERROR: File WORK.ALGERIA.DATA does not exist.
ERROR: File WORK.ALGERIA2.DATA does not exist.
ERROR: File WORK.ANDORRA.DATA does not exist.
ERROR: File WORK.ANDORRA2.DATA does not exist.
ERROR: File WORK.ANOMALY.DATA does not exist.
ERROR: File WORK.ANTARCT2.DATA does not exist.
ERROR: File WORK.ANTARCTI.DATA does not exist.
ERROR: File WORK.ARGENTI2.DATA does not exist.

Apart from that, I have a look on .(An Introduction to PROC SQL ), and I saw a part called concatenate the data, and I think it is similar to my work, which is

I think we can do the same thing by using UNION ALL like

PROC SQL;

CREATE TABLE MERGEALL AS
SELECT *
FROM ARGENTINA
UNION ALL
/* the same code for other 62 countries*/
SELECT *
FROM ZAMBIA;
QUIT;

I know it is very suboptimal but I am wondering if it is a solution, and if it is a solution, is there anyway to optimize the code by using UNION ALL in PROC SQL like that.

Many thanks and warmest regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

Kurt_Bremser · Posted 02-05-2021 05:22 AM

You need to slightly expand the SQL code:

proc sql noprint:
  select catx(".",libname,MEMNAME) into :names separated by ' '
  from DICTIONARY.TABLES where LIBNAME='MAPS';
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ · Posted 02-06-2021 02:35 AM

> ERROR: File WORK.AFGHANI2.DATA does not exist.
Come on. You need to try harder and understand what the code does.
Modify the values that go INTO : the macro variable so it also includes the library name instead of just MEMNAME.

High-Performance SAS Coding - Third Edition

Phil_NZ · Posted 02-06-2021 04:00 AM

Hi @ChrisNZ

Yes, I am trying to read again and make sure I do not miss any information and trying to fully understand the code given.

Back to the code suggested by you

proc sql noprint:
  select MEMNAME into :names separated by ' ' from DICTIONARY.TABLES where LIBNAME='MAPS';
quit;
data WANT;
  set &names.;
run;

I mostly understand the code, except the LIBNAME='MAPS'

I try to google but there is no result so far regarding 'MAPS', could you please give me a source that I can read about 'MAPS'?

Many thanks for your dedicated help so far and warm regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

ChrisNZ · Posted 02-06-2021 05:52 AM

You need something like:

proc sql noprint:
  select cats(LIBNAME,'.',MEMNAME) into :names separated by ' ' from DICTIONARY.TABLES where LIBNAME='WORK';
quit;
data WANT;
  set &names.;
run;

MAPS is standard SAS library where country maps are found. I picked it just because you had country data set names.

You are asking about a simple WHERE clause used to select which tables you want to keep.

Since your tables seem to be in the WORK library, maybe that's the library you need in the WHERE clause.

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 02-06-2021 05:55 AM

You might also want to use the INDSNAME option if the country name is not in the source tables.

High-Performance SAS Coding - Third Edition

andreas_lds · Posted 02-05-2021 12:41 AM

So my code to import all 64 files sas7bdat is as below

You don't "import" datasets, see https://communities.sas.com/t5/SAS-Programming/Is-there-any-way-to-import-all-files-in-one-folder-wi....

The following code is hardly tested:

libname input 'C:\Users\pnguyen\OneDrive - Massey University\PhD JOURNEY\3calculation' access=readonly; 

data _null_;
   set sashelp.vtable(where=(LibName = 'INPUT'));
   
   length dataset $ 42;
   
   if _n_ = 1 and exist('work.AllCountries') then do;
      call execute('proc delete data=work.AllCountries;run;');
   end;
   
   dataset = cats(LibName, '.', MemName);
   
   call execute(cats('proc append base=work.AllCountries data=', dataset, ';run;'));
run;

EDIT: The code suggested by @ChrisNZ is, of course, much easier to read, and avoids unnecessary consumption of disk space.

ChrisNZ · Posted 02-05-2021 02:27 AM

EDIT: The code suggested by @ChrisNZ is, of course, much easier to read, and avoids unnecessary consumption of disk space.

SAS is not very efficient dealing with a large number of open tables. 64 tables is probably starting to get there.
This, and proc append processing data in bulk, might mean that your solution is more efficient than my bare-bone code.

High-Performance SAS Coding - Third Edition

Phil_NZ · Posted 02-05-2021 05:03 AM

Hi @andreas_lds !

Thank you for your suggestion. I try to run your code and it announce something like that

28         libname input 'C:\Users\pnguyen\OneDrive - Massey University\PhD JOURNEY\3calculation' access=readonly;
NOTE: Libref INPUT was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: C:\Users\pnguyen\OneDrive - Massey University\PhD JOURNEY\3calculation
29         
30         data _null_;
31            set sashelp.vtable(where=(LibName = 'INPUT'));
32         
33            length dataset $ 42;
34         
35            if _n_ = 1 and exist('work.AllCountries') then do;
36               call execute('proc delete data=work.AllCountries;run;');
37            end;
38         
39            dataset = cats(LibName, '.', MemName);
40         
41            call execute(cats('proc append base=work.AllCountries data=', dataset, ';run;'));
42         run;

NOTE: There were 64 observations read from the data set SASHELP.VTABLE.
      WHERE LibName='INPUT';
NOTE: DATA statement used (Total process time):
      real time           0.55 seconds
      cpu time            0.04 seconds
      

NOTE: CALL EXECUTE generated line.
1         + proc append base=work.AllCountries data=INPUT.ARGENTINA_;run;

NOTE: Appending INPUT.ARGENTINA_ to WORK.ALLCOUNTRIES.
2                                                          The SAS System                             22:57 Friday, February 5, 2021

NOTE: BASE data set does not exist. DATA file is being copied to BASE file.
NOTE: There were 1013 observations read from the data set INPUT.ARGENTINA_.
NOTE: The data set WORK.ALLCOUNTRIES has 1013 observations and 108 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
      

2         + proc append base=work.AllCountries data=INPUT.AUSTRALIA_;run;

NOTE: Appending INPUT.AUSTRALIA_ to WORK.ALLCOUNTRIES.
WARNING: Variable ENAME has different lengths on BASE and DATA files (BASE 62 DATA 82).
WARNING: Variable MNEM has different lengths on BASE and DATA files (BASE 6 DATA 9).
WARNING: Variable WC07536 has different lengths on BASE and DATA files (BASE 15 DATA 19).
WARNING: Variable WC09802 has different lengths on BASE and DATA files (BASE 4 DATA 22).
WARNING: Variable TYPE_1 has different lengths on BASE and DATA files (BASE 2 DATA 4).
WARNING: Variable EXNAME has different lengths on BASE and DATA files (BASE 12 DATA 21).
WARNING: Variable GEOGN has different lengths on BASE and DATA files (BASE 18 DATA 16).
WARNING: Variable acc_sta has different lengths on BASE and DATA files (BASE 43 DATA 62).
ERROR: No appending done because of anomalies listed above. Use FORCE option to append these files.
NOTE: 0 observations added.

Can you please give me a hint to sort it out? Apart from that, can I ask from where you have the library

work.AllCountries

And could you please guide me why you delete in this code ?

if _n_ = 1 and exist('work.AllCountries') then do;
      call execute('proc delete data=work.AllCountries;run;');
   end;

Warmest regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

andreas_lds · Posted 02-05-2021 05:32 AM

For "vertical merging" (aka appending) data, it is necessary that each variable is defined in exactly the same way in all datasets. The code to fix such issues is more or less trivial, you should request that such issues are fixed by the person creating the datasets.

Tom · Posted 02-06-2021 07:56 AM

Are all of these dataset supposed to contain the same variables? If so then why do they have different lengths for some of the variables (or worse different types)? How were these datasets created? Can you fix the step that created them to insure that they are all defining the variables in the same way? For example if you created the datasets originally by converting a text file, like a CSV file, into the SAS datasets then use the same data step code to define for each. Or better just read all of the source text files in a single data step.

If you cannot fix the issue earlier in the process then you will need to figure what length (and type) to use for each variable. For character variables just define the length as the maximum length the variable has in any of the datasets. For type mismatches you will want to pick a type and convert them. How to pick a type depends on what is in the data. If the most datasets have the variable as numeric and the few that have it as character all of the actual non-blank values n those dataset are digit strings that could be convert into a number then make the variable numeric. (Note this is also true for date values. If all of the strings in the datasets where the date variable is character are strings that can be converted into a date then define the variable as numeric and convert the variable to numeric using an appropriate date informat.). But if some of the strings cannot be numbers than convert the variables that are numeric to character. Note that if one dataset has the variable of the wrong type but all of the variables are missing (or blank for character variables) then just drop the variable from that dataset.

So how do you figure it out? Look at the list of variables. For eaxmple by using PROC CONTENTS.

You can program some of the checks . So this code will make a list all of the variable names that includes a count of how many times it is numeric and how many times it is character and the maximum length. It then prints just the names that have any type mismatches .

proc contents data=INPUT._ALL_ noprint out=contents;
run;
proc sql ;
create table variables as 
  select upcase(name) as name
          , count(type=1) as n_num
         ,  count(type=2) as n_char
         , max(length) as max_length
 from contents
  group by 1
;
quit;
proc print data=variables;
  where n_num>0 and n_char>0 ;
run;

You can use that to generate code to define the type/length of every variable.

filename code temp;
data _null_;
  set variables end=eof;
  if _n_=1 then put 'length';
  put @3 name @ ;
  if n_char then put '$' @;
  put max_length ;
  if eof then put ';' ;
run;
data skeleton ;
  %include code / source2;
  stop;
run;

Now to make your overall dataset just add that empty SKELETON dataset as the first in the list. Make sure to remove any formats attached to the character variables.

data want;
  set skeleton &dslist ;
  format _character_ ;
run;

Phil_NZ · Posted 02-06-2021 08:00 PM

Hi @Tom !

Thank you for your kindness and dedicated explanation. However, just stop by at the "pointing out numeric and character" types of variable:

From coding, I compare the results from your code and @ballardw 's code:

Your code is

proc contents data=INPUT._ALL_ noprint out=contents;
run;
proc sql ;
create table variables as 
  select upcase(name) as name
          , count(type=1) as n_num
         ,  count(type=2) as n_char
         , max(length) as max_length
 from contents
  group by 1
;
quit;
proc print data=variables;
  where n_num>0 and n_char>0 ;
run;

and @ballardw 's code is

Libname input 'C:\Users\pnguyen\OneDrive - Massey University\PhD JOURNEY\3calculation - Copy' access=readonly;

proc format;
value showx
low-high='x'
;
run;
proc tabulate data=sashelp.vcolumn;
   where libname='INPUT' and memtype='DATA';
   class memname name type length;
   table name*type*length,
         memname*n=' '*f=showx.
         /misstext=' '
   ;
run;

I add results of both of you and @ballardw as the additional file, here I just screenshot quite a bit

Your approach:

The result is funny because in contents Table, at least in Argentina_ file, I can see there are variables that receive value 1 and other receive value 2 in "TYPE", so there should be no way that n_num=n_char=64

And it is the result from @ballardw

From ballardw's approach, there are 10 countries having the BDATE as numeric, while the result from your approach just say each variable in 64 files all have char and numeric, I am quite confused about that.

And from @ballardw 's approach, there is only BDATE having char and num, the others are consistent in data type across all 64 files.

And could you please explain more about the code, and can you please make an example to define the type/length of a specific variable, for example BDATE?

filename code temp;
data _null_;
  set variables end=eof;
  if _n_=1 then put 'length';
  put @3 name @ ;
  if n_char then put '$' @;
  put max_length ;
  if eof then put ';' ;
run;
data skeleton ;
  %include code / source2;
  stop;
run;

Many thanks and cheers!

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!