Splitting/Joining datasets dynamically (macro)

Ody · Posted 01-27-2017 02:45 PM

Hello all,

I have a 2 part question.

The first part is that I want my testdata to be split into different datasets based on the order field. I can do this statically by writing something like this:


data testdata;
infile datalines dsd truncover;
input 
ordernum:$3.
order:1.
code:$3.	
;

datalines;
111, 1, qwe
111, 2, gfd
111, 3, sdf
222, 1, erw
222, 2, fgh
333, 1, saa
333, 2, ath
333, 3, gdw
444, 1, etg
444, 2, hef
444, 3, sfy
444, 4, hrl
555, 1, mmj
555, 2, bhg
555, 3, dgy
555, 4, ggd
555, 5, dss
;;;



data ordernum1 ordernum2 ordernum3 ordernum4 ordernum5;
set testdata;
if Order = 1 then output ordernum1;
if Order = 2 then output ordernum2;
if Order = 3 then output ordernum3;
if Order = 4 then output ordernum4;
if Order = 5 then output ordernum5;
run;

However, with this approach I can't be assured that there will always only be a total of 5 orders; there could be any number.

I'm having trouble getting a macro working and was looking for some additional eyes. This may not even be the best way to do this but I appreciate any input.



proc sql noprint;
select max(Order) into : order_num
from testdata;
quit;

%put Max  Count = &order_num;


options mlogic mprint;
%macro split;

data %do i = 1 %to &order_num;
ordernum&i %end;
;

set test_Data;
select(Order);
%do i = 1 %to &order_num;
when Order = &i then output ordernum&i; 
%end;
otherwise;
end;
run;

%mend split;
%split;

And lastly, I have a master table of ordernums I want to join these separated datasets back to. How do I approach writing the join logic dynamically? For example, I could do this manually if I knwo the number of tables the testdata will be split into by writing something like this:

proc sql;
create table New_Table as
select 
a.*,
b.code  as code1,
c.code  as code2,
d.code  as code3,
e.code  as code4,
f.code  as code5

from master a
left join ordernum1 b on a.ordernum = b.ordernum
left join ordernum2 c on a.ordernum = c.ordernum
left join ordernum3 d on a.ordernum = d.ordernum
left join ordernum4 e on a.ordernum = e.ordernum
left join ordernum5 f on a.ordernum = f.ordernum

;
quit;

However, if I don't know how many datasets will be created how can I include this join logic into the macro?

As always, appreciate any assistance. Thanks!

collinelliot · Posted 01-27-2017 03:11 PM

I'd worry about whether the ordernum was always sequential, so I'd do the following for the splitting apart. The remerging is a bit more macro work than I have time for.

data testdata;
infile datalines dsd truncover;
input 
ordernum:$3.
order:1.
code:$3.	
;

datalines;
111, 1, qwe
111, 2, gfd
111, 3, sdf
222, 1, erw
222, 2, fgh
333, 1, saa
333, 2, ath
333, 3, gdw
444, 1, etg
444, 2, hef
444, 3, sfy
444, 4, hrl
555, 1, mmj
555, 2, bhg
555, 3, dgy
555, 4, ggd
555, 5, dss
;;;

%macro orderSplit;

proc sql;
    SELECT DISTINCT orderNum INTO :ord1-:ord999
    FROM testData;
quit;

%let cnt = &sqlObs;
    
%do i = 1 %to &cnt;
    
    data order&i;
        set testData;
        where orderNum = "&&ord&i";
    run;

%end;

%mend orderSplit;
%orderSplit;

Ody · Posted 01-27-2017 03:41 PM

There is a little massaging of the data prior to this processing. The ordernum may not always be sequential. The order field is manufactored. I made up the testdata dataset as a small example. I create the order field using this code:

data testdata;
set testdata;
by ordernum notsorted;
if first.ordernum then Order = 0;
Order + 1;
run;

I think I understand the logic behind how you'd split apart the dataset.

proc sql;
    SELECT DISTINCT orderNum INTO :ord1-:ord999
    FROM testData;
quit;

However my original dataset has hundreds of thousands of records and I dont think that approach would be the best one which is why I was opting for seperating it by the order field I create.

ballardw · Posted 01-27-2017 03:19 PM

May I ask what you are doing that requires splitting the data apart especially since you are then merging back? It may be something that can be handled with BY group processing.

Does your master ordernumr table have one record per ordernum or multiple? A small example set of the ordernumber and some of the "split" data and what the result after the combination should look like may be helpful.

Ody · Posted 01-27-2017 03:57 PM

The master table has a distinct ordernum per record. There are many "testdata" datasets with multiple records per ordernum. My goal is to join them to the master table to maintain that distinct ordernum per record and essentially flatten out the data by creating additional columns for each ordernum.

As an example, creating table Test by joining the master table with the testdata table from above results in a data set with 17 records istead of 5 which is not what I'm aiming to do.

data master;
infile datalines dsd truncover;
input 
ordernum:$3.
;

datalines;
111
222
333
444
555
;

proc sql;
create table test as
select 
a.*,
b.code
from master a
left join testdata b on a.ordernum = b.ordernum;
quit;

ballardw · Posted 01-27-2017 04:28 PM

@Ody wrote:

The master table has a distinct ordernum per record. There are many "testdata" datasets with multiple records per ordernum. My goal is to join them to the master table to maintain that distinct ordernum per record and essentially flatten out the data by creating additional columns for each ordernum.

As an example, creating table Test by joining the master table with the testdata table from above results in a data set with 17 records istead of 5 which is not what I'm aiming to do.

The part I highlight above in bold text is very often a less than ideal approach. The main exception is when it is the very last step of a report for human consumption. The process you are describing creates a data structure that can be extremely hard to work with as the numbers, and hence names, of variables changes each time this process is followed. You might want to google "normalize data" for details.

Ody · Posted 01-30-2017 12:43 PM

The main exception is why I'm asking. This is the last step of a program I'm writing for a report.

Haikuo · Posted 01-27-2017 03:48 PM

May I answer your part2 first, as you may not need part1 to get part2.

This is your provided data file (Thank you for the ready-to-go input data):

data testdata;
	infile datalines dsd truncover;
	input 
		ordernum:$3.
		order:1.
		code:$3.	
	;
	datalines;
111, 1, qwe
111, 2, gfd
111, 3, sdf
222, 1, erw
222, 2, fgh
333, 1, saa
333, 2, ath
333, 3, gdw
444, 1, etg
444, 2, hef
444, 3, sfy
444, 4, hrl
555, 1, mmj
555, 2, bhg
555, 3, dgy
555, 4, ggd
555, 5, dss
;;;

Now this to answer the your part2:

/*transpose to get all needed elements*/
PROC SORT
	DATA=WORK.TESTDATA(KEEP=code order ordernum)
	OUT=WORK.Sorted
;
	BY ordernum;
RUN;

PROC TRANSPOSE DATA=WORK.Sorted
	OUT=WORK.TRNS(drop=_name_)
	PREFIX=code
;
	BY ordernum;
	ID order;
	VAR code;
RUN;

QUIT;
/*merge ONE table*/
proc sql;
	create table New_Table as
		select 
			*
		from master a
			left join WORK.TRNS b on a.ordernum = b.ordernum
	;
quit;

If for some reason splitting is still needed, for part1, here is one dynamic non-macro splitting approach:

/*Hash split*/
PROC SORT
	DATA=WORK.TESTDATA(KEEP=code order ordernum)
	OUT=WORK.Hash_sorted
;
	BY order;
RUN;

data _null_;
	if _n_=1 then
		do;
			/*if 0 then set hash_sorted;*/
			declare hash h(dataset:'hash_sorted', multidata:'y');
			h.definekey('order');
	h.definedata(all:
			'y');
			h.definedone();
		end;

	rc=h.clear();

	do until (last.order);
		set hash_sorted;
		by order;
		rc=h.add();
	end;

	rc=h.output(dataset:cats('ordernum',order));
run;

Ody · Posted 01-27-2017 04:03 PM

Your code is a little over my head so it'll take me some time to absorb it. But I really appreciate the help, thanks.

mkeintz · Posted 01-27-2017 04:49 PM

Yes you can read all the data once via sql just to get the names of the datasets to be created. And then read a second time to output those data sets.

However, it appears that testdata is sorted by ordernum. If master (one record per ordernum, right?) is also sorted (or just indexed) by ordernum, then you can do it all in a single pass, by combining a merge statement with a hash table to dynamically output appropriately named data sets.

The hash table H below holds all the merged data for a given ordernum (see the h.add method). The program, upon encountering the last record for the ordernum, outputs the data to a correspondingly named data set (h.output method), and then clears the hash table to receive data for the next ordernum (h.clear):

/* Make an empty dataset with all the vars needed for the hash table*/
data dummy; 
  if 0 then set testdata master;
  n=_n_;
  stop;
run;

data _null_;
  merge testdata master;
  by ordernum;
  n=_n_;
  if _n_=1 then do;
    declare hash h (dataset:'dummy',ordered:'a');
      h.definekey('n');
      h.definedata(all:'Y');
      h.definedone();
  end;
  rc=h.add();
  if last.ordernum then do;
     h.output(dataset:cats('ordernum',ordernum,'(drop=n)'));
     h.clear();
  end;
run;

Notes:

The variable N (=_N_) is used as the key to the hash table to insure unique keys (thereby avoiding some extraneous coding to handle duplicate keys. You'll notice N is dropper in the output method.
I didn't HAVE to make the DUMMY dataset. I could have dropped reference to DUMMY in the "declare H" statement, and then individually listed the name of each var (each name in quotes) from both datasets as well as var N in the .definedata method, but that's sometime too much typing (and too many typing errors). I chose instead to use the "dataset:dummy" parameter, and the nthe "all:'Y'" parameter in the defindata method.
Note the "data dummy" doesn't actually read data. It just creates a dataset with no observations but all the needed metdata.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ody · Posted 01-30-2017 12:37 PM

Thanks for the explanation; it was very helpful! It's still something I'm going to have to read up on but this is a really good start. Thanks again!

Ody · Posted 01-31-2017 11:51 AM

After working at my problem a little more I figured out my issue was the syntax in the select part of the data step. For anyone interested here is the updated code:

proc sql noprint;
select max(Order) into : order_num
from testdata;
quit;

%put Max  Count = &order_num;


options mlogic mprint;
%macro split;

data %do i = 1 %to &order_num;
ordernum&i %end;
;

set testdata;
select(Order);
%do i = 1 %to &order_num;
when ("&i") output ordernum&i; 
%end;
otherwise;
end;
run;

%mend split;
%split;

Now, I need to figure out how to join these data sets back to my master table dynamically. If anyone has any suggestions or examples I'd be curious to see. Thanks.

Splitting/Joining datasets dynamically (macro)

Re: Splitting/Joining datasets dynamically (macro)

Re: Splitting/Joining datasets dynamically (macro)

Re: Splitting/Joining datasets dynamically (macro)

Re: Splitting/Joining datasets dynamically (macro)

Re: Splitting/Joining datasets dynamically (macro)

Re: Splitting/Joining datasets dynamically (macro)

Re: Splitting/Joining datasets dynamically (macro)

Re: Splitting/Joining datasets dynamically (macro)

Re: Splitting/Joining datasets dynamically (macro)

Re: Splitting/Joining datasets dynamically (macro)

Re: Splitting/Joining datasets dynamically (macro)

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away