Solved: Re: PROC SQL to load column from multiple tables

jpprovost · Posted 10-21-2019 03:26 PM

Hi folks,

I have more than 50 tables with identical structure (they all have the same number of columns and same datatype). I would like to do the following code which returns 1 line of data, but for all my 50+ tables. Then, I would also like to regroup all those lines into one dataset.

PROC SQL;
CREATE TABLE DASHBOARD2 AS
(	SELECT AVG(VALUE) AS MOYENNE, MAX(VALUE) AS MAXIMUM, MIN(VALUE) AS MINIMUM, SUM(VALUE) AS SOMME, COUNT(*) AS NBOBS
	FROM TEST01
);
QUIT;

Is there a way to do so for all the 50+ tables and to get one dataset with 50+ lines with thoses informations (avg, max, min, sum, nbos)?

ScottBass · Posted 10-23-2019 03:33 AM

@jpprovost wrote:

Hi folks,

I have more than 50 tables with identical structure (they all have the same number of columns and same datatype). I would like to do the following code which returns 1 line of data, but for all my 50+ tables. Then, I would also like to regroup all those lines into one dataset.
PROC SQL;
CREATE TABLE DASHBOARD2 AS
(	SELECT AVG(VALUE) AS MOYENNE, MAX(VALUE) AS MAXIMUM, MIN(VALUE) AS MINIMUM, SUM(VALUE) AS SOMME, COUNT(*) AS NBOBS
	FROM TEST01
);
QUIT;
Is there a way to do so for all the 50+ tables and to get one dataset with 50+ lines with thoses informations (avg, max, min, sum, nbos)?

@novinosrin is correct...

Your proc sql is going to return one line, with avg, max, min, sum, and count for all the rows in test01.

Get proc summary to create the same results for your single test01 dataset. Read the doc on proc summary if you need to.

Now that it's working for one dataset, just append all your datasets into one, using the indsname option to capture the contributing dataset name. You could even use a data step view if your datasets are really large. Read the doc on the indsname option if you need to.

Then, just use the dataset name as a class variable in proc summary. This is analogous to group by in SQL.

If you insist on proc sql, then append the datasets as above, and use the dataset name in a group by in proc sql:

data test01 (where=(x between 0 and 10))
     test02 (where=(x between 10 and 20))
     test03 (where=(x between 20 and 30))
     ;
   do x=0 to 30;
      output;
   end;
run;

data append / view=append;
   * or test: if you don't have any extra files named test... ;   
   set test01-test03 indsname=n;  
   name=n;
run;

proc sql;
   create table foo as
   select 
      name
      ,avg(x) as avg
      ,min(x) as min
      ,max(x) as max
   from
      append
   group by
      name
   ;
quit;

proc summary data=append nway noprint;
   class name;
   var x;
   output out=foo2 (drop=_type_ _freq_)
      mean=avg 
      min=min
      max=max 
   ;
run;

Macros are great - I use them every day. But it's just as important to know when NOT to use a macro.

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

View solution in original post

novinosrin · Posted 10-21-2019 03:31 PM

HI @jpprovost I think

1. Append 50 tables to 1

2. Data step/proc summary (By groups)

should be a better approach than proc sql(which warrants 48 union all, or a macro loop one by one)

jpprovost · Posted 10-21-2019 03:34 PM

@novinosrin wrote:

HI @jpprovost I think

1. Append 50 tables to 1

2. Data step/proc summary (By groups)

should be a better approach than proc sql(which warrants 48 union all, or a macro loop one by one)

Hi @novinosrin ,

I did the following macro :

%macro dash;
%do i = 1 %to 3;
	PROC SQL;
	CREATE TABLE DASHBOARDk&i AS
	(	SELECT AVG(VALUE) AS MOYENNE, MAX(VALUE) AS MAXIMUM, MIN(VALUE) AS MINIMUM, SUM(VALUE) AS SOMME, COUNT(*) AS NBOBS
		FROM TEST0&i
	);
	QUIT;
run ;

%end;
%mend dash;

It extracts the wanted informations into 50+ different tables. Now, I struggle to get them all together.

How can I use your proposed solution? I'm new with SAS and I'm trying to "Google" as many things as I can.

Thanks!

novinosrin · Posted 10-21-2019 03:39 PM

data append;
set test1-test50 indsname=n;/*if it is in continuous sequence suffix*/
nameof_dsn=n;
run;


proc summary data=append nway noprint;
class nameof_dsn;
var value;
output out=want mean=moyenne max=maximum; /*and so forth, i am lazy to write the rest*/
run;

Edited: to include nameof_dsn as Class variable

s_lassen · Posted 10-22-2019 09:18 AM

I think this can be done quite simply by using an INSERT statement in SQL:

%macro dash;
	PROC SQL;
	CREATE TABLE DASHBOARD AS
		SELECT AVG(VALUE) AS MOYENNE, MAX(VALUE) AS MAXIMUM, MIN(VALUE) AS MINIMUM, SUM(VALUE) AS SOMME, COUNT(*) AS NBOBS
		FROM TEST01;
	;
	QUIT;

%do i = 2 %to 3;
	PROC SQL;
	Insert into dashboard
		SELECT AVG(VALUE) AS MOYENNE, MAX(VALUE) AS MAXIMUM, MIN(VALUE) AS MINIMUM, SUM(VALUE) AS SOMME, COUNT(*) AS NBOBS
		FROM TEST0&i
	);
	QUIT;
run ;

%end;
%mend dash;

Shmuel · Posted 10-21-2019 03:39 PM

I assume that each table have 1 or more observations.

When you ask for max(value) - or any other statistics - do you mean per table or per all tables as a unit ?

Do you want the statistics of one specific variable or on all variables ?

It seems to me that PROC MEANS will be more efficient than PROC SQL.

jpprovost · Posted 10-21-2019 03:47 PM

In fact, my 50+ tables represents some the 50+ last months. Of course, I have more observations each month the file is generated. What I want is to define limits from the past 50+ months usings statistics of those months to verify that my new data (next months) are within those limits.

So the answer to your question is the a specific statistic on one variable (value).

If it's not clear, do not hesitate.

Shmuel · Posted 10-21-2019 06:11 PM

@jpprovost wrote:
In fact, my 50+ tables represents some the 50+ last months. Of course, I have more observations each month the file is generated. What I want is to define limits from the past 50+ months usings statistics of those months to verify that my new data (next months) are within those limits.

So the answer to your question is the a specific statistic on one variable (value).

If it's not clear, do not hesitate.

In such case the code should look like:

proc datasets lib=work nolist; delete final; quit; 

%macro stat;
    %do i=1 %to 50;
     proc means data=test&i noprint nway;
           class <month>;
           var value;
           output out=temp mean=moyenne max=maximum 
                          min=minimum sum=somme n=nobs;
     run;
     proc append base=final data=temp; run;
     proc datasets lib=work nolist; delete temp; quit; 
%mend;
%stat;

jpprovost · Posted 10-22-2019 10:30 AM

@Shmuel wrote:

@jpprovost wrote:
In fact, my 50+ tables represents some the 50+ last months. Of course, I have more observations each month the file is generated. What I want is to define limits from the past 50+ months usings statistics of those months to verify that my new data (next months) are within those limits.

So the answer to your question is the a specific statistic on one variable (value).

If it's not clear, do not hesitate.

In such case the code should look like:
proc datasets lib=work nolist; delete final; quit; 

%macro stat;
    %do i=1 %to 50;
     proc means data=test&i noprint nway;
           class <month>;
           var value;
           output out=temp mean=moyenne max=maximum 
                          min=minimum sum=somme n=nobs;
     run;
     proc append base=final data=temp; run;
     proc datasets lib=work nolist; delete temp; quit; 
%mend;
%stat;

Thank you @Shmuel for your help and time.

Really appreciated.

ScottBass · Posted 10-23-2019 03:33 AM

@jpprovost wrote:

Hi folks,

I have more than 50 tables with identical structure (they all have the same number of columns and same datatype). I would like to do the following code which returns 1 line of data, but for all my 50+ tables. Then, I would also like to regroup all those lines into one dataset.
PROC SQL;
CREATE TABLE DASHBOARD2 AS
(	SELECT AVG(VALUE) AS MOYENNE, MAX(VALUE) AS MAXIMUM, MIN(VALUE) AS MINIMUM, SUM(VALUE) AS SOMME, COUNT(*) AS NBOBS
	FROM TEST01
);
QUIT;
Is there a way to do so for all the 50+ tables and to get one dataset with 50+ lines with thoses informations (avg, max, min, sum, nbos)?

@novinosrin is correct...

Your proc sql is going to return one line, with avg, max, min, sum, and count for all the rows in test01.

Get proc summary to create the same results for your single test01 dataset. Read the doc on proc summary if you need to.

Now that it's working for one dataset, just append all your datasets into one, using the indsname option to capture the contributing dataset name. You could even use a data step view if your datasets are really large. Read the doc on the indsname option if you need to.

Then, just use the dataset name as a class variable in proc summary. This is analogous to group by in SQL.

If you insist on proc sql, then append the datasets as above, and use the dataset name in a group by in proc sql:

data test01 (where=(x between 0 and 10))
     test02 (where=(x between 10 and 20))
     test03 (where=(x between 20 and 30))
     ;
   do x=0 to 30;
      output;
   end;
run;

data append / view=append;
   * or test: if you don't have any extra files named test... ;   
   set test01-test03 indsname=n;  
   name=n;
run;

proc sql;
   create table foo as
   select 
      name
      ,avg(x) as avg
      ,min(x) as min
      ,max(x) as max
   from
      append
   group by
      name
   ;
quit;

proc summary data=append nway noprint;
   class name;
   var x;
   output out=foo2 (drop=_type_ _freq_)
      mean=avg 
      min=min
      max=max 
   ;
run;

Macros are great - I use them every day. But it's just as important to know when NOT to use a macro.

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

jpprovost · Posted 10-23-2019 09:17 AM

@ScottBass wrote:
@jpprovost wrote:

Hi folks,

I have more than 50 tables with identical structure (they all have the same number of columns and same datatype). I would like to do the following code which returns 1 line of data, but for all my 50+ tables. Then, I would also like to regroup all those lines into one dataset.
PROC SQL;
CREATE TABLE DASHBOARD2 AS
(	SELECT AVG(VALUE) AS MOYENNE, MAX(VALUE) AS MAXIMUM, MIN(VALUE) AS MINIMUM, SUM(VALUE) AS SOMME, COUNT(*) AS NBOBS
	FROM TEST01
);
QUIT;
Is there a way to do so for all the 50+ tables and to get one dataset with 50+ lines with thoses informations (avg, max, min, sum, nbos)?
@novinosrin is correct...

Your proc sql is going to return one line, with avg, max, min, sum, and count for all the rows in test01.

Get proc summary to create the same results for your single test01 dataset. Read the doc on proc summary if you need to.

Now that it's working for one dataset, just append all your datasets into one, using the indsname option to capture the contributing dataset name. You could even use a data step view if your datasets are really large. Read the doc on the indsname option if you need to.

Then, just use the dataset name as a class variable in proc summary. This is analogous to group by in SQL.

If you insist on proc sql, then append the datasets as above, and use the dataset name in a group by in proc sql:
data test01 (where=(x between 0 and 10))
     test02 (where=(x between 10 and 20))
     test03 (where=(x between 20 and 30))
     ;
   do x=0 to 30;
      output;
   end;
run;

data append / view=append;
   * or test: if you don't have any extra files named test... ;   
   set test01-test03 indsname=n;  
   name=n;
run;

proc sql;
   create table foo as
   select 
      name
      ,avg(x) as avg
      ,min(x) as min
      ,max(x) as max
   from
      append
   group by
      name
   ;
quit;

proc summary data=append nway noprint;
   class name;
   var x;
   output out=foo2 (drop=_type_ _freq_)
      mean=avg 
      min=min
      max=max 
   ;
run;
Macros are great - I use them every day. But it's just as important to know when NOT to use a macro.

Thank to you @ScottBass.

Of course macros are great, I try to use is as much as I can.

However, do you have any tips/hints of when macros should NOT be used?

Regards,

ScottBass · Posted 10-24-2019 12:04 AM

Of course macros are great, I try to use is as much as I can.

No! I disagree. I would change this to "I try to use only when required". So, what is the definition of "required"? Obviously you'll get different answers, my quick one off the top of my head is 1) when you want to parameterize a block of working code (if you find yourself cut-and-pasting a block of code with the same edits, that's a good candidate), or 2) you want to encapsulate a block of code that you will call often (this could also be an external %include block of code).

However, do you have any tips/hints of when macros should NOT be used?

Macro is analogous to the pre-processor in C. It does a lot more, but let's run with this analogy. All it does is control the code that gets sent to the compiler (for now I'm ignoring data step interfaces like symget and call symputx).

I like to think of macro as "If I could type this code at 10,000 words a second, and never make a mistake, then I wouldn't need macro". Silly perhaps, but macro is just controlling the code that gets sent to the compiler (or a proc which isn't compiled, but I digress).

Also, macro is MUCH slower than if you could do it in the data step or a proc. Another example is if you're using macro to process in groups, when that processing would be supported via BY group processing in SAS.

If you find you're writing really klunky, hard to follow macro code for your problem, step back and see if you really need macro.

A quick example and I've got to run:

Say you have 1000 external files you need to concatenate into a single dataset. They are all in one directory and have the same format.

You could get the list of files, then loop over that list in a macro, generating 1000 data steps, appending each input to a target table.

Or, you could use an aggregated filename, plus the filename and/or filevar options to the infile statement.

I'd say the key to knowing when not to use macro is simply increasing your overall knowledge of alternative, better approaches via SAS, esp. the data step.

I can usually tell a "newbie" (I'm not saying you're one, by the way) when they think macro is the panacea for everything.

HTH...

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

jpprovost · Posted 10-24-2019 09:04 AM

I define myself as a newbie, don't worry.
Thanks for all those explanations. They are very useful and I will keep it in mind in the future.

Thanks 🙂

Classroom Training Available!