- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Whats the most efficient way to split a large table into smaller tables based on varying number of values in a column? Lets say I have the following data, but want to split out each color into their own table.
Green | 5 |
Green | 8 |
Green | 3 |
Green | 1 |
Green | 10 |
Yellow | 9 |
Yellow | 2 |
Yellow | 6 |
Yellow | 8 |
Yellow | 12 |
Yellow | 5 |
Yellow | 8 |
Purple | 5 |
Purple | 2 |
Purple | 6 |
My first thought it that you would first need to get a list of all distinct values from the column you are filtering on, sometimes there's 3 sometimes there's 15. I thought about using a macro to run a Proc Sql create table select * from 'have' where color = %Macro_variable. But dont know how to assign the macro variable?
Thanks in advance!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
First, please consider the option of leaving everything in one large data set, and then any analyses you want to do can be performed by using BY statements, and then you don't even have to know how many groups there are. This is a much simpler option, that requires much less programming, than splitting up the data and then writing code to handle each of the split data sets.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@mhoward2 wrote:
I appreciate the advice as I am still a novice at this but it needs to split out as it is the end product for multiple reports going out to different people!
No, you don't need to split the data out the way you have planned.
If you are creating tables or reports, then for example if you are going to use PROC EXPORT, this works with a single data set, no splitting needed
proc export data=have(where=(color='Green')) outfile='green.xlsx';
run;
proc export data=have(where=(color='Yellow')) outfile='yellow.xlsx';
run;
But that's still a lot of manual typing, a macro would work without splitting the data set.
%macro dothis;
proc sql noprint;
select distinct color into :colors separated by ' ' from have;
quit;
%do i=1 %to %sysfunc(countw(&colors));
%let thiscolor=%scan(&colors,&i,%str( ));
proc export data=have(where=(color="&thiscolor")) outfile="&thiscolor..xlsx";
run;
%end;
%mend;
%dothis
The conclusion that you need to split the data set is not warranted, and splitting is just extra work.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is one way
data have;
input color $ val;
datalines;
Green 5
Green 8
Green 3
Green 1
Green 10
Yellow 9
Yellow 2
Yellow 6
Yellow 8
Yellow 12
Yellow 5
Yellow 8
Purple 5
Purple 2
Purple 6
;
data _null_;
if _n_=1 then do;
declare hash h(multidata:'y');
h.definekey('color');
h.definedata('color', 'val');
h.definedone();
end;
do until(last.color);
set have;
by color notsorted;
h.add();
end;
h.output(dataset:color);
h.clear();
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Not a problem. Do like this
data _null_;
if _n_=1 then do;
declare hash h(dataset:'have(obs=0)', multidata:'y');
h.definekey('color');
h.definedata(all:'Y');
h.definedone();
end;
do until(last.color);
set have;
by color notsorted;
h.add();
end;
h.output(dataset:color);
h.clear();
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If there is only procedure involved BY groups are still the way to go. Example:
data have; input color $ val; datalines; Green 5 Green 8 Green 3 Green 1 Green 10 Yellow 9 Yellow 2 Yellow 6 Yellow 8 Yellow 12 Yellow 5 Yellow 8 Purple 5 Purple 2 Purple 6 ; proc sort data=have; by color; run; ods rtf file="c:\data\test1.rtf" newfile=Bygroup; proc print data=have; by color; var val; run; ods rtf close; title;
For example will create three separate output RTF files. Each one will have the output for only one of the BY group values.
Using a numbered output file name I did means the first created is Test1, then incremented at each by value change creating Test2 and Test3.
Note this is for ONE procedure. If you have multiple procedures then you would likely be better off to go through a macro approach to select the appropriate values for each of the different procedures.
Depending on the ODS destination the option to do this will change somewhat.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@ballardw wrote:
If there is only procedure involved BY groups are still the way to go. Example:
data have; input color $ val; datalines; Green 5 Green 8 Green 3 Green 1 Green 10 Yellow 9 Yellow 2 Yellow 6 Yellow 8 Yellow 12 Yellow 5 Yellow 8 Purple 5 Purple 2 Purple 6 ; proc sort data=have; by color; run; ods rtf file="c:\data\test1.rtf" newfile=Bygroup; proc print data=have; by color; var val; run; ods rtf close; title;
I wish I had said that! Great example!
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Sir @ballardw Very clever. Brilliant presence of mind! Kudos!
I wish clever ways strikes my mind like that when needed and not offer the obvious solutions. Very nice thinking! but I am jealous though!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@novinosrin wrote:
Sir @ballardw Very clever. Brilliant presence of mind! Kudos!
I wish clever ways strikes my mind like that when needed and not offer the obvious solutions. Very nice thinking! but I am jealous though!
No reason to be jealous. Sympathy for the time I have occasionally spent going through documentation looking for "there has to be a better way", perhaps, but not jealousy.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @mhoward2
@mhoward2 wrote:
My first thought it that you would first need to get a list of all distinct values from the column you are filtering on, sometimes there's 3 sometimes there's 15. I thought about using a macro to run a Proc Sql create table select * from 'have' where color = %Macro_variable. But dont know how to assign the macro variable?
Thanks in advance!
IMHO Your above statement sounds you are not quite a novice if you can think at macro level. You basically just need to get used to the syntax. That's all. So to help your logical thinking, I merely wrote the syntax
data have;
input color $ val;
datalines;
Green 5
Green 8
Green 3
Green 1
Green 10
Yellow 9
Yellow 2
Yellow 6
Yellow 8
Yellow 12
Yellow 5
Yellow 8
Purple 5
Purple 2
Purple 6
;
proc sql noprint;
select distinct color into :c1-
from have;
quit;
%put &=sqlobs;
%macro t;
%do i=1 %to &sqlobs;
proc sql;
create table &&c&i as
select *
from have
where color="&&c&i";
quit;
%end;
%mend t;
%t
HTH
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
&& resolves & so &&c resolves &c /*this is called indirect macro variable reference*/
&i resolves to the index var values . for example 1..
So the whole macro variable reference &&c&i would initially resolves to &c1, then &c1 resolves to whatever values stored in the macro symbol table from the DISTINCT SQL idea of yours in the 1st step.
For How would I name the tables along the lines of "LibName.Color_TotalValues" For each different color of course?
Revise the create table statement to
create table libname.&&c&i.._TotalValues as
Some notes I wrote for a friend about INDIRECT MACROVARIABLE REFERENCES a while ago:
NOTE: PLEASE READ IF and ONLY WHEN YOU HAVE TIME
INDIRECT MACRO REFERENCES
Objective: To generate a set of executable texts through INDIRECT MACRO REFERENCES as a derivative of the Macro processor at compile time that is to be executed at executive time operation by data step compiler/SQL processor or perhaps interestingly even by SQL optimizer.
So let’s recap, what’s an indirect macro reference? For those who already know, please skip the details with an implied GOTO and go the examples.
In simple terms, An indirect macro reference typically triggers the macro processor to perform the task again or in formal words known as RE-SCAN. This essentially means, the macro processor reads or scans the macro reference more than once every time it encounters more than one macro consecutive trigger(&) in succession. For each execution, the macro processor takes the macro reference from the token and looks up to the global/symbol table where the macro variable is stored and resolves the value placing it back in the input stack which is your pile of executable tokens.
Mind you, this is a pretty sequential process and I would stress the fact not to use too many macro variables/macro definitions unless parameterization is the sole way to design and execute a process in a production environment. First, brevity in terms of code, execution is challenged. Second, it is highly memory intensive. Third, in essence it’s actually wasting resources when there are highly efficient and often optimized solutions available. Well, for those whose familiarity with technology is limited, it goes back to the old saying-“When all you have is a hammer, everything looks like a nail”. Trust me, you are rather on the contrary belonging to smart category. Anyway, the topic of optimization is something beyond the scope of the objective of this thread but I wanted to bring that awareness unless you already know it.
So once the work of the macro processor is complete, the resolved text(tokens) are executed by the compiler. Now let’s take some example to understand all the verbose that’s been discussed above.
To begin with, we shall see how a direct reference works-
%let need=novinosrin is crazy;
%put &=need;
Yep straight forward as it is to expect the obvious in the log
1288 %put &=need;
NEED=novinosrin is crazy
But the point is, it is important to understand, the operation of the macro processor i.e SCAN THE WORD(TOKEN)àFIND THE MACRO TRIGGER, IF PRESENTàLOOK UPàSYMBOL TABLEàGET THE VALUEàRESOLVE THE TEXT(TOKEN) and then send to the compiler for execution.
Now, moving on Indirect Macro reference aka MORE THAN ONE AMPERSAND(& to &&&&&&& and so on) as PREFIX. So here when a macro processor finds multiple triggers, the macro processor work in fact loops iteratively or in other words redo, redo and redo until all triggers are resolved to plain text/token for execution by the compiler. Yeah right, this would start to make sense how much time, energy and cost is spent during a compile time operation.
RE-SCAN Thumb rule: Two ampersands resolve to one ampersand, i.e. && resolution= &
Now let’s take the same example above and add another & as prefix making it double.
%let need=novinosrin is crazy;
%put &&need;
We see in the log
1294 %put &&need;
novinosrin is crazy
So what happens here is &&need was scanned first. && became &. So &&need became &need. Now, scan again from left to right, &need’s value is retrieved.
Now let’s add two more ampersands as prefix to get better intuition.
%let need=novinosrin is crazy;
%put &&&&need;
Log: 1303 %put &&&&need;
novinosrin is crazy
So we get to learn, &&&&need becomes &&&need. This becomes &&need. This becomes &need. J Fun stuff eh? lol That’s how it works.
Moving on- CREAM OF THE CROP- the real indirect reference that references a “macro variable with another macro variable reference”. Let’s take the same example,
%let need=novinosrin is crazy;
%let required=need;
Here you we would like to reference the value of need with the macro variable reference required. So let’s try resolving it,
%put &&required;
As we learned earlier, && to &, so &&required would become &required. So resolution is of course
1312 %put &&required;
Need
So what are we missing here. Pretty intuitively, we need an extra ampersand that should be left over so that the macro processor can use that trigger to resolve for &need. In other words, we need the required macro reference resolve to &need. Therefore, adding another ampersand would suffice-
%put &&&required;
1313 %put &&&required;
novinosrin is crazy
So does that make us happy? Hmm, not quite.
Furthermore, what if one is a SAS addict and wants to get to the bottom of it going one step further in referencing a macro variable with indirect reference through another indirect reference. Yep, some teaser after allJ So let’s take the same example
%let need=novinosrin is crazy;
%let required=need;
%let want=required;
Now, the crazy addict wants to reference need with a macro variable reference want. By now, I am sure you smart folks would do it in a breeze. So yes it’s a simple arithmetic. Take a chance
1345 %put &want;
required
1346 %put &&want;
required
1347 %put &&&want;
need
1348
1349 %put &&&&want;
required
1350 %put &&&&&want;
need
1351
1352 %put &&&&&&want;
need
1353
1354 %put &&&&&&&want;
novinosrin is crazy
Finally, we can conclude the RE-SCAN is actually a loop work of the macro processor depending on the number of indirect references represented by number of macro triggers as prefix following the thumb execution principle && to & while the derivatives of the one macro reference adding a & trigger as prefix resolves to another macro variable value which goes until the end of all macro triggers being resolved. I hope this helps. Thank you for reading.
Kind regards,