Re: Creating multiple tables from a larger dataset

mhoward2 · Posted 12-06-2019 03:37 PM

Whats the most efficient way to split a large table into smaller tables based on varying number of values in a column? Lets say I have the following data, but want to split out each color into their own table.

Green	5
Green	8
Green	3
Green	1
Green	10
Yellow	9
Yellow	2
Yellow	6
Yellow	8
Yellow	12
Yellow	5
Yellow	8
Purple	5
Purple	2
Purple	6

My first thought it that you would first need to get a list of all distinct values from the column you are filtering on, sometimes there's 3 sometimes there's 15. I thought about using a macro to run a Proc Sql create table select * from 'have' where color = %Macro_variable. But dont know how to assign the macro variable?

Thanks in advance!

PaigeMiller · Posted 12-06-2019 03:44 PM

First, please consider the option of leaving everything in one large data set, and then any analyses you want to do can be performed by using BY statements, and then you don't even have to know how many groups there are. This is a much simpler option, that requires much less programming, than splitting up the data and then writing code to handle each of the split data sets.

--
Paige Miller

mhoward2 · Posted 12-06-2019 03:49 PM

I appreciate the advice as I am still a novice at this but it needs to split out as it is the end product for multiple reports going out to different people!

PaigeMiller · Posted 12-06-2019 03:55 PM

@mhoward2 wrote:
I appreciate the advice as I am still a novice at this but it needs to split out as it is the end product for multiple reports going out to different people!

No, you don't need to split the data out the way you have planned.

If you are creating tables or reports, then for example if you are going to use PROC EXPORT, this works with a single data set, no splitting needed

proc export data=have(where=(color='Green')) outfile='green.xlsx';
run;

proc export data=have(where=(color='Yellow')) outfile='yellow.xlsx';
run;

But that's still a lot of manual typing, a macro would work without splitting the data set.

%macro dothis;
    proc sql noprint;
        select distinct color into :colors separated by ' ' from have;
    quit;
    %do i=1 %to %sysfunc(countw(&colors));
    %let thiscolor=%scan(&colors,&i,%str( ));
    proc export data=have(where=(color="&thiscolor")) outfile="&thiscolor..xlsx";
    run;
    %end;
%mend;
%dothis

The conclusion that you need to split the data set is not warranted, and splitting is just extra work.

--
Paige Miller

mhoward2 · Posted 12-06-2019 04:39 PM

the Tables are inserted into excel via VBA and reports are created in excel with VBA from them. I don't want them to be exported to an excel workbook. I want them to be on the SAS server so my excel template can use VBA to go grab them depending on which table is needed. Does that make sense or am I missing something here?

PeterClemmensen · Posted 12-06-2019 03:44 PM

Here is one way

data have;
input color $ val;
datalines;
Green 5
Green 8
Green 3
Green 1
Green 10
Yellow 9
Yellow 2
Yellow 6
Yellow 8
Yellow 12
Yellow 5
Yellow 8
Purple 5
Purple 2
Purple 6
;

data _null_;
   if _n_=1 then do;                                
      declare hash h(multidata:'y'); 
      h.definekey('color');                                     
      h.definedata('color', 'val');                                    
      h.definedone();                                           
   end;
   do until(last.color);                                     
      set have;                                             
      by color notsorted;
      h.add();                                                  
   end;
   h.output(dataset:color);                                   
   h.clear();                                                   
run;

The DATA to DATA Step Macro
Blog: SASnrd

mhoward2 · Posted 12-06-2019 03:53 PM

This is great! How would I go about selecting all columns in the "h.definedata('Color', 'val');" statement? the table has like 20 column. would I have to list all column headers there? Thanks again!

PeterClemmensen · Posted 12-06-2019 03:54 PM

Not a problem. Do like this

data _null_;
   if _n_=1 then do;                                
      declare hash h(dataset:'have(obs=0)', multidata:'y'); 
      h.definekey('color');                                     
      h.definedata(all:'Y');                                    
      h.definedone();                                           
   end;
   do until(last.color);                                     
      set have;                                             
      by color notsorted;
      h.add();                                                  
   end;
   h.output(dataset:color);                                   
   h.clear();                                                   
run;

The DATA to DATA Step Macro
Blog: SASnrd

ballardw · Posted 12-06-2019 04:08 PM

If there is only procedure involved BY groups are still the way to go. Example:

data have;
input color $ val;
datalines;
Green 5
Green 8
Green 3
Green 1
Green 10
Yellow 9
Yellow 2
Yellow 6
Yellow 8
Yellow 12
Yellow 5
Yellow 8
Purple 5
Purple 2
Purple 6
;

proc sort data=have;
   by color;
run;

ods rtf file="c:\data\test1.rtf" newfile=Bygroup;

proc print data=have;
   by color;
   var val;
run;
ods rtf close;

title;

For example will create three separate output RTF files. Each one will have the output for only one of the BY group values.

Using a numbered output file name I did means the first created is Test1, then incremented at each by value change creating Test2 and Test3.

Note this is for ONE procedure. If you have multiple procedures then you would likely be better off to go through a macro approach to select the appropriate values for each of the different procedures.

Depending on the ODS destination the option to do this will change somewhat.

PaigeMiller · Posted 12-06-2019 04:13 PM

@ballardw wrote:

If there is only procedure involved BY groups are still the way to go. Example:

data have;
input color $ val;
datalines;
Green 5
Green 8
Green 3
Green 1
Green 10
Yellow 9
Yellow 2
Yellow 6
Yellow 8
Yellow 12
Yellow 5
Yellow 8
Purple 5
Purple 2
Purple 6
;

proc sort data=have;
   by color;
run;

ods rtf file="c:\data\test1.rtf" newfile=Bygroup;

proc print data=have;
   by color;
   var val;
run;
ods rtf close;

title;

I wish I had said that! Great example!

--
Paige Miller

novinosrin · Posted 12-06-2019 04:40 PM

Sir @ballardw Very clever. Brilliant presence of mind! Kudos!

I wish clever ways strikes my mind like that when needed and not offer the obvious solutions. Very nice thinking! but I am jealous though!

ballardw · Posted 12-06-2019 04:45 PM

@novinosrin wrote:

Sir @ballardw Very clever. Brilliant presence of mind! Kudos!

I wish clever ways strikes my mind like that when needed and not offer the obvious solutions. Very nice thinking! but I am jealous though!

No reason to be jealous. Sympathy for the time I have occasionally spent going through documentation looking for "there has to be a better way", perhaps, but not jealousy.

novinosrin · Posted 12-06-2019 04:28 PM

Hi @mhoward2

@mhoward2 wrote:

My first thought it that you would first need to get a list of all distinct values from the column you are filtering on, sometimes there's 3 sometimes there's 15. I thought about using a macro to run a Proc Sql create table select * from 'have' where color = %Macro_variable. But dont know how to assign the macro variable?

Thanks in advance!

IMHO Your above statement sounds you are not quite a novice if you can think at macro level. You basically just need to get used to the syntax. That's all. So to help your logical thinking, I merely wrote the syntax


data have;
input color $ val;
datalines;
Green 5
Green 8
Green 3
Green 1
Green 10
Yellow 9
Yellow 2
Yellow 6
Yellow 8
Yellow 12
Yellow 5
Yellow 8
Purple 5
Purple 2
Purple 6
;


proc sql noprint;
select distinct color into :c1-
from have;
quit;

%put &=sqlobs;

%macro t;
 %do i=1 %to &sqlobs;
  proc sql;
  create table &&c&i as
  select *
  from have
  where color="&&c&i";
  quit;
%end;
%mend t;

%t

HTH

mhoward2 · Posted 12-06-2019 04:49 PM

This is exactly what I was looking for! Can you give me a quick explanation of how the "&&c&i" works? What the that telling sas? How would I name the tables along the lines of "LibName.Color_TotalValues" For each different color of course

novinosrin · Posted 12-06-2019 05:04 PM

&& resolves & so &&c resolves &c /*this is called indirect macro variable reference*/

&i resolves to the index var values . for example 1..

So the whole macro variable reference &&c&i would initially resolves to &c1, then &c1 resolves to whatever values stored in the macro symbol table from the DISTINCT SQL idea of yours in the 1st step.

For How would I name the tables along the lines of "LibName.Color_TotalValues" For each different color of course?

Revise the create table statement to

  create table libname.&&c&i.._TotalValues as

Some notes I wrote for a friend about INDIRECT MACROVARIABLE REFERENCES a while ago:

NOTE: PLEASE READ IF and ONLY WHEN YOU HAVE TIME

INDIRECT MACRO REFERENCES

Objective: To generate a set of executable texts through INDIRECT MACRO REFERENCES as a derivative of the Macro processor at compile time that is to be executed at executive time operation by data step compiler/SQL processor or perhaps interestingly even by SQL optimizer.

So let’s recap, what’s an indirect macro reference? For those who already know, please skip the details with an implied GOTO and go the examples.

In simple terms, An indirect macro reference typically triggers the macro processor to perform the task again or in formal words known as RE-SCAN. This essentially means, the macro processor reads or scans the macro reference more than once every time it encounters more than one macro consecutive trigger(&) in succession. For each execution, the macro processor takes the macro reference from the token and looks up to the global/symbol table where the macro variable is stored and resolves the value placing it back in the input stack which is your pile of executable tokens.

Mind you, this is a pretty sequential process and I would stress the fact not to use too many macro variables/macro definitions unless parameterization is the sole way to design and execute a process in a production environment. First, brevity in terms of code, execution is challenged. Second, it is highly memory intensive. Third, in essence it’s actually wasting resources when there are highly efficient and often optimized solutions available. Well, for those whose familiarity with technology is limited, it goes back to the old saying-“When all you have is a hammer, everything looks like a nail”. Trust me, you are rather on the contrary belonging to smart category. Anyway, the topic of optimization is something beyond the scope of the objective of this thread but I wanted to bring that awareness unless you already know it.

So once the work of the macro processor is complete, the resolved text(tokens) are executed by the compiler. Now let’s take some example to understand all the verbose that’s been discussed above.

To begin with, we shall see how a direct reference works-

%let need=novinosrin is crazy;

%put &=need;

Yep straight forward as it is to expect the obvious in the log

1288 %put &=need;

NEED=novinosrin is crazy

But the point is, it is important to understand, the operation of the macro processor i.e SCAN THE WORD(TOKEN)àFIND THE MACRO TRIGGER, IF PRESENTàLOOK UPàSYMBOL TABLEàGET THE VALUEàRESOLVE THE TEXT(TOKEN) and then send to the compiler for execution.

Now, moving on Indirect Macro reference aka MORE THAN ONE AMPERSAND(& to &&&&&&& and so on) as PREFIX. So here when a macro processor finds multiple triggers, the macro processor work in fact loops iteratively or in other words redo, redo and redo until all triggers are resolved to plain text/token for execution by the compiler. Yeah right, this would start to make sense how much time, energy and cost is spent during a compile time operation.

RE-SCAN Thumb rule: Two ampersands resolve to one ampersand, i.e. && resolution= &

Now let’s take the same example above and add another & as prefix making it double.

%let need=novinosrin is crazy;

%put &&need;

We see in the log

1294 %put &&need;

novinosrin is crazy

So what happens here is &&need was scanned first. && became &. So &&need became &need. Now, scan again from left to right, &need’s value is retrieved.

Now let’s add two more ampersands as prefix to get better intuition.

%let need=novinosrin is crazy;

%put &&&&need;

Log: 1303 %put &&&&need;

novinosrin is crazy

So we get to learn, &&&&need becomes &&&need. This becomes &&need. This becomes &need. J Fun stuff eh? lol That’s how it works.

Moving on- CREAM OF THE CROP- the real indirect reference that references a “macro variable with another macro variable reference”. Let’s take the same example,

%let need=novinosrin is crazy;

%let required=need;

Here you we would like to reference the value of need with the macro variable reference required. So let’s try resolving it,

%put &&required;

As we learned earlier, && to &, so &&required would become &required. So resolution is of course

1312 %put &&required;

Need

So what are we missing here. Pretty intuitively, we need an extra ampersand that should be left over so that the macro processor can use that trigger to resolve for &need. In other words, we need the required macro reference resolve to &need. Therefore, adding another ampersand would suffice-

%put &&&required;

1313 %put &&&required;

novinosrin is crazy

So does that make us happy? Hmm, not quite.

Furthermore, what if one is a SAS addict and wants to get to the bottom of it going one step further in referencing a macro variable with indirect reference through another indirect reference. Yep, some teaser after allJ So let’s take the same example

%let need=novinosrin is crazy;

%let required=need;

%let want=required;

Now, the crazy addict wants to reference need with a macro variable reference want. By now, I am sure you smart folks would do it in a breeze. So yes it’s a simple arithmetic. Take a chance

1345 %put &want;

required

1346 %put &&want;

required

1347 %put &&&want;

need

1348

1349 %put &&&&want;

required

1350 %put &&&&&want;

need

1351

1352 %put &&&&&&want;

need

1353

1354 %put &&&&&&&want;

novinosrin is crazy

Finally, we can conclude the RE-SCAN is actually a loop work of the macro processor depending on the number of indirect references represented by number of macro triggers as prefix following the thumb execution principle && to & while the derivatives of the one macro reference adding a & trigger as prefix resolves to another macro variable value which goes until the end of all macro triggers being resolved. I hope this helps. Thank you for reading.

Kind regards,

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!