Solved: Re: Split data set equally, remove unneeded data

deleted_user · Posted 05-08-2010 06:42 AM

Hello,
I hope you can help me:
I have one company A with a certain sales figure X in dataset 1.
I have n potential peer companies in dataset 2.
I have to match my company to a peer company out of dataset 2 based on sales number X.
Thereby I have to take the following steps:
1) Sort dataset 2 by Sales number
=> This is straight forward, no help needed for that step
2) Split dataset 2 into three equal portfolios based on Sales (this means that I have n observations, so each portfolio will contain n/3 observations which are sorted by Sales. So the first portfolio will contain the companies with the lowest sales numbers, etc.)
=> I tried to use a macro I found online with a "split function", however, it doesn't seem to work. Do you have something better?
3) Cross-check if company A from dataset 1 could fit into one of the three portfolios created. If the Sales figure of company A from dataset 1 is larger or smaller than the sales numbers of all potential peers in portfolio n, then delete portfolio n. => I should be left with only one portfolio
4) Out of this one portfolio that is left, find the peer company that has a sales figure that is closest to the sales figure for company A
=> both step 3 + 4 I don't know how to program, so some help would be great!
[I know that it looks like step 2 and 3 are redundant. However, in reality I have to match the company based on several figures, not just sales. That's why I need to create these three portfolios. It's just for simplicity that I leave this out for now].

Thanks you!

sbb · Posted 05-08-2010 10:40 AM

At the simplest level, suggest you create a Portfolio variable for identification and divide your "dataset 2" observations equally based on the value of Sales. A DATA step or PROC SQL can perform this task - a macro approach would only benefit if you had a large number of Portfolio splits to generate - so start simple and use the technique involving: IF THEN ; SAS coding.

Then create a SAS file with a subset of key variables needed to compare "dataset 1" to "dataset 2" -- one additional variable is OBSNUM = _N_; to track the source observation.

Then perform your cross-check between "dataset 1" and "dataset 2", using the key-variables to determine where you a relationship in your data.

Lastly, using the results of your cross-check, retrieve the observations from your "dataset 1" and identify your closest_peer within "dataset 2".

Suggest you start programming, one step at a time -- then come back to the forum for suggestions and guidance with any particular problem you might have. You may find PROC SQL to be more comfortable for coding, but do consider that you have more control when using individual DATA step and PROC (SORT or otherwise) invocations, in many cases.

Make use of the SAS support website http://support.sas.com/ for helpful SAS-hosted DOC and also supplemental technical /

conference subject-matter reference materials.

Editor's note: in the time since this original post, @Reeza has contributed a useful article with code you can use.

Scott Barry
SBBWorks, Inc.

View solution in original post

sbb · Posted 05-08-2010 10:40 AM

At the simplest level, suggest you create a Portfolio variable for identification and divide your "dataset 2" observations equally based on the value of Sales. A DATA step or PROC SQL can perform this task - a macro approach would only benefit if you had a large number of Portfolio splits to generate - so start simple and use the technique involving: IF THEN ; SAS coding.

Then create a SAS file with a subset of key variables needed to compare "dataset 1" to "dataset 2" -- one additional variable is OBSNUM = _N_; to track the source observation.

Then perform your cross-check between "dataset 1" and "dataset 2", using the key-variables to determine where you a relationship in your data.

Lastly, using the results of your cross-check, retrieve the observations from your "dataset 1" and identify your closest_peer within "dataset 2".

Suggest you start programming, one step at a time -- then come back to the forum for suggestions and guidance with any particular problem you might have. You may find PROC SQL to be more comfortable for coding, but do consider that you have more control when using individual DATA step and PROC (SORT or otherwise) invocations, in many cases.

Make use of the SAS support website http://support.sas.com/ for helpful SAS-hosted DOC and also supplemental technical /

conference subject-matter reference materials.

Editor's note: in the time since this original post, @Reeza has contributed a useful article with code you can use.

Scott Barry
SBBWorks, Inc.

deleted_user · Posted 05-18-2010 01:43 PM

Hello Scott,
thanks for your help. Unfortunately, I am not much further. I'm not very clear about how I can use the If...Then... statement to obtain three equally large portfolios.
One thought is that after sorting by Sales, I allocate numbers to my observations:
ex:
Obs Sales
1 100k
2 30k
3 80k
... ....
Then I specify that IF Obs =< n/3 THEN portfolio 1, IF Obs > n/3 and Obs =< 2n/3 THEN portolio 2, else portfolio 3. Is this what you had in mind? But how do I account for rounding issues then?
Also, can you specify in more detail how the cross-check between dataset 1 and dataset 2 is supposed to work? The SAS support website is helpful, but I need to know what function I am looking for.
Thanks,
Tina

deleted_user · Posted 05-18-2010 01:52 PM

Hello Scott,
Thanks for your help. Unfortunately, I am not much further. I'm not very clear about how I can use the If...Then... statement to obtain three equally large portfolios. One thought is that after sorting by Sales, I allocate numbers to my observations:
ex: Obs Sales
1 100k
2 30k
... ....
Then I specify that if Obs smaller/equal n/3 then portfolio 1, if Obs larger n/3 and Obs smaller/equal 2n/3 then portfolio 2, else portfolio 3. Is this what you had in mind? But how do I account for rounding issues then?
Also, can you specify in more detail how the cross-check between dataset 1 and dataset 2 is supposed to work? The SAS support website is helpful, but I need to know what function I am looking for.
Thanks,
Tina

deleted_user · Posted 05-18-2010 04:43 PM

Here is an example of spliting a dataset into 3 different datasets.
There might be a more elegant solution, but I put this together in 5 minutes as an example.

data sampledata;
format sales dollar10.2;
do sales=100 to 154;
output;
end;
run;
data sampledata;
set sampledata;
if first.sales then n=1;
else n+1;
run;
proc sql;
create table obs as
select floor(max(n)/3) as Obs
from sampledata;
quit;
data _null_;
set obs;
call symput('obs',obs);
run;
data sampledata;
set sampledata;
Group = 2;
if n le &obs. then Group = 1;
if n gt (&obs.*2) then Group = 3;
run;
data Group1;
set sampledata (where=(Group=1));
run;
data Group2;
set sampledata (where=(Group=2));
run;
data Group3;
set sampledata (where=(Group=3));
run;

ballardw · Posted 06-02-2010 06:39 PM

If the dataset is relatively large and multiple passes might take too long the last 4 datasteps in dblan's example can be reduced to:

data group1 (where=(Group=1))
Group2 (where=(Group=2))
Group3 (where=(Group=3))
;
set sampledata;
Group = 2;
if n le &obs. then Group = 1;
if n gt (&obs.*2) then Group = 3;
run;

Split data set equally, remove unneeded data

Re: Split data set equally, remove unneeded data

Re: Split data set equally, remove unneeded data

Re: Split data set equally, remove unneeded data

Re: Split data set equally, remove unneeded data

Re: Split data set equally, remove unneeded data

Re: Split data set equally, remove unneeded data

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away