BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
buddha_d
Pyrite | Level 9

Dear community member,

               I have a huge dataset with millions of records. I have to split that dataset into 2 datasets. First dataset will have bank on-site transactions ( 200000) and another will have all other on-line transactions. What I need to do is get 200000 randomly selected records for on-line transactions. 

             I can explain this with an example data below. 

 

/* this is the sample dataset */
data have_data;
input id $ account1 $ account2 $ ;
cards;
AAA A1234 B1456
AAA A1234 B14156
AAA A1234 A2345
AAA A1234 A23451
AAA A1234 A23452
BBB B2546 C1254
BBB B2546 C12456
BBB B2546 C14576
BBB B2546 C1259
BBB B2546 C12591
CCC D4568 F1254
CCC D4568 G1458
CCC D4568 D4568
CCC D4568 D45681
CCC D4568 D45682
CCC D4568 D45683
DDD G1234 I5678
DDD G1234 I89654
DDD G1234 J4586
DDD G1234 I56781
DDD G1234 I89655
DDD G1234 J45861
;
RUN;

/* Splitting two mutually exclusive datasets*/
proc sql;
	create table have_split as 
	select *
	from work.have_data 
	where account2 in ('B14156','C1254','C14576','F1254','D4568','I5678','I89654','J4586')
	;
	create table have_need as 
	select *
	from work.have_data 
	where account2 not in ('B14156','C1254','C14576','F1254','D4568','I5678','I89654','J4586')
	;
quit;

/* Use the have_need dataset to get the same number of records as in have_split dataset 
   with completely randomized way */
proc sort data=have_need; 
	by ID ACCOUNT1; 
run;

/* Get the random records from have_need dataset, but can't get unequal records as in the strata mentioned */

PROC SURVEYSELECT DATA=HAVE_need OUT=WANT SEED=7889 SAMPSIZE=2 ;
STRATA ID ACCOUNT1 ;
RUN;

      I created a have_data dataset, then I split into 2 datasets. One is have_split and the other is have_need. I want have_need dataset to have the same number of observations as in have_split dataset (for example, AAA id has only one record, but want dataset has 2 records). I need to find a way to match the number of records in have_split (ID & ACCOUNT1) matched with want dataset with randomized dataset. I used proc surveyselect, but it gives only equal number of records per ID & ACCOUNT1. 

        Can you please help me to accomplish this task. Remember, the real dataset is huge and I tried to mimic that data in this small dataset. 

Thanks in advance 

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

This program assumes that an online record that matches a given onsite record (i.e. has the same ID and ACCOUNT1) can appear anywhere in the dataset.

 

data have;
input id $ account1 $ account2 $ ;
cards;
AAA A1234 B1456
AAA A1234 B14156
AAA A1234 A2345
AAA A3425 A23451
AAA A3355 A23452
BBB B2546 C1254
BBB B2546 C12456
BBB B2546 C14576
BBB B3344 C1259
BBB B2546 C12591
CCC D4568 F1254
CCC D4568 G1458
CCC D4568 D4568
CCC D4568 D45681
CCC D4455 D45682
CCC D4455 D45683
DDD G1234 I5678
DDD G1234 I89654
DDD G1234 J4586
DDD G1234 I56781
DDD G1234 I89655
DDD G1234 J45861
DDD G3466 I4576
RUN;

data onsite online;
  set have (in=firstpass_all_records)
      have (in=secondpass_online_only
            where=(not(account2 in ('B14156','C1254','C14576','F1254','D4568','I5678','I89654','J4586')))
           );

  if _n_=1 then do;
    declare hash h ();
      h.definekey('id','account1');
      h.definedata('id','account1','needed_online','avail_online');
      h.definedone();
  end;

  if firstpass_all_records then do;
    if h.find() ^=0 then do; 
      needed_online=0;
      avail_online=0; 
      h.add(); 
    end;

    if account2 in ('B14156','C1254','C14576','F1254','D4568','I5678','I89654','J4586') then do;
      output onsite;
      needed_online=needed_online+1;
    end;
    else avail_online=avail_online+1;
    h.replace();  /*Update this key combination in the hash object */
  end;

  if secondpass_online_only;
  h.find();   /*Get appropriate avail_online and needed_online */
  if avail_online>0 and needed_online>0 then do;
    if rand('uniform')<= needed_online/avail_online then do;
      output online;
      needed_online=needed_online-1;
    end;
  end;
  avail_online=avail_online-1;
  h.replace();
  drop avail_online needed_online;
run;

 

It works by passing through the data twice:

  1. First pass.  Find and output all onsite records, and count the number of onsite and number on online records for each ID/ACCOUNT1 combination.
    1. Store those counts (as variables avail_online and needed_online) in a hash object (think lookup table keyed on ID/ACCOUNT1).
  2. Second pass (online  records only).  For each incoming online record fetch the corresponding avail_online and needed_online value.  If a uniform random value is less than or equal to needed_online/avail_online then output an online record and decrement needed_online.   Regardless of the uniform random value decrement avail_online.

 

Now if you want matches only within ACCOUNT1  (not ID/ACCOUNT1) then drop the reference to 'id' in the hash definekey and definedata methods.  

 

This reads the data twice, but does no sorting.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

10 REPLIES 10
PGStats
Opal | Level 21

Proc surveyselect allows you to define your SAMPSIZE by strata, in a dataset. That's what you need to do.

PG
ballardw
Super User

You code showing specific account number is likely a lot more work than needed.

 

Something like

data have_data;
input id $ account1 $ account2 $ ;
cards;
AAA A1234 B1456
AAA A1234 B14156
AAA A1234 A2345
AAA A1234 A23451
AAA A1234 A23452
BBB B2546 C1254
BBB B2546 C12456
BBB B2546 C14576
BBB B2546 C1259
BBB B2546 C12591
CCC D4568 F1254
CCC D4568 G1458
CCC D4568 D4568
CCC D4568 D45681
CCC D4568 D45682
CCC D4568 D45683
DDD G1234 I5678
DDD G1234 I89654
DDD G1234 J4586
DDD G1234 I56781
DDD G1234 I89655
DDD G1234 J45861
;
RUN;

data need;
   set have_data;
   accounttype = ( account2 in ('B14156','C1254','C14576','F1254','D4568','I5678','I89654','J4586'));
run;

proc sort data=need;
   by accounttype;
run;

proc surveyselect data=need out=want
   sampsize=(4 6);
   strata accounttype;
run;

You cleverly did not describe which is "on line" or "on site" so I picked different sizes for the sample from each type in your example data to show how to select different numbers.

 

For as many records as you say you have I really hope that you have a data set that indicates which accounts are which elsewhere.

 

Or perhaps you need to describe your problem a bit differently if an account might have both on line and on site activity, which seems quite possible.

 

buddha_d
Pyrite | Level 9

Ballardw, 

            on-site is the have_split dataset and on-line dataset is the have_need dataset. I tried the code you provided, but I wasn't able to get the different records as we have seen in have_split dataset. Thanks. 

mkeintz
PROC Star

It seems to me that you can avoid the proc sort (the expensive part of your program) entirely, while solving the problem of getting a sample of online transactions equal in size to the total population of onsite transactions (which is presumably about 200K out of millions of records):

 

data onsite (drop=n_onsite n_online);
  set have end=end_of_have;
  if account2 in (....) then do;
    output;
    n_onsite+1;
  end;
  else n_online+1;
  if end_of_have;
  call symput('N_onsite',cats(n_onsite));
  call symput('N_online',cats(n_online));
run;

data online (drop=remaining_:);
  set have;
  where not account2 in (....) ;
  retain remaining_sample &n_onsite
         remaining_available &n_online;
  if rand('uniform') <= remaining_sample/remaining_available then do;
    output;
    remaining_sample=remaining_sample-1;
    if remaining_sample=0 then stop;
  end;
  remaining_available=remaining_available-1;
run;

This program avoids sorting the online transactions (have_need in your program), which means it avoids writing out a big data set just for sampling purposes.

 

The first data step writes out all the onsite transactions, and counts the number of online and onsite transactions in the initial data set.  At the end of this step it puts those counts in macrovars N_ONSITE and N_ONLINE.

 

The second data step re-reads all the online transactions, and randomly selects (with equal probability) the sample of online transactions having N_ONSITE sample size.  It does this by counting down the number of available online transactions, and counting down the number of needed randomly selected transactions.

 

Edited additional note:  The second data step uses a WHERE statement to filter in only online transactions.  Using a WHERE to filter incoming instead of an IF allows the use of an unconditional decrement of REMAINING_AVAILABLE, since the WHERE statement (unlike an IF filter) never permits an online transaction to be seen by the data step.

 

The benefit is that the original data set is read twice, but no large intermediate dataset is written to disk - possibly very time consuming.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
buddha_d
Pyrite | Level 9
Azurite, Thank you for the code. I would agree with you that sorting would be really expensive. I would try the code and get back to you if I have additional questions. Thank you so much for the code. 
buddha_d
Pyrite | Level 9

mkeintz ,

          I ran the same code, but I have some changes in my sample data to replicate my real data. For example, ID (AAA) has 3 different account1s (A1234, A3425, A3355). Since I selected Account2 as B14156, I need to get the randomization pick from Account1 (A1234). I have 2 records to pick from the have dataset (AAA A1234 B1456 or AAA A1234 A2345). What I am basically saying is there is a secondary filter that needs to apply which doing randomization (that is Account1). If account2 belongs to what ever account1 number block, then the same sample number need to be picked from that Account1 block. 

          If you need more clarity please let me know. I tried below code, please take a look. 

thanks, 



data have;
input id $ account1 $ account2 $ ;
cards;
AAA A1234 B1456
AAA A1234 B14156
AAA A1234 A2345
AAA A3425 A23451
AAA A3355 A23452
BBB B2546 C1254
BBB B2546 C12456
BBB B2546 C14576
BBB B3344 C1259
BBB B2546 C12591
CCC D4568 F1254
CCC D4568 G1458
CCC D4568 D4568
CCC D4568 D45681
CCC D4455 D45682
CCC D4455 D45683
DDD G1234 I5678
DDD G1234 I89654
DDD G1234 J4586
DDD G1234 I56781
DDD G1234 I89655
DDD G1234 J45861
DDD G3466 I4576
;
RUN;

data onsite (drop=n_onsite n_online);
  set have end=end_of_have;
  if account2 in ('B14156','C1254','C14576','F1254','D4568','I5678','I89654','J4586') then do;
    output;
    n_onsite+1;
  end;
  else n_online+1;
  if end_of_have;
  call symput('N_onsite',cats(n_onsite));
  call symput('N_online',cats(n_online));
run;

data online (drop=remaining_:);
  set have;
  where  account2 not in ('B14156','C1254','C14576','F1254','D4568','I5678','I89654','J4586') ;
  retain remaining_sample &n_onsite
         remaining_available &n_online;
  if rand('uniform') <= remaining_sample/remaining_available then do;
    output;
    remaining_sample=remaining_sample-1;
    if remaining_sample=0 then stop;
  end;
  remaining_available=remaining_available-1;
run;
mkeintz
PROC Star

So now I better understand.

 

For any account2 onsite record selected, you want a randomly selected matching online record having the same account1.

 

Question 1: Can a given account1 be present in more than 1 ID?

Question 2: Is your data sorted by ID?  (I already see that it is not sorted by account1 within ID, but that is not a big problem).

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
buddha_d
Pyrite | Level 9

Answer1: Yes if that account1 is present in on-site dataset. But, if it isn't in that dataset then no new account1 should be in on-site  dataset.

Answer2: It is ok to be sorted, but we want to bypass that step as it takes more time. If it's absolutely necessary then we can do it. 

Please let me know if you need more information.

Thanks a bunch mkeintz. I appreciate your help. 

mkeintz
PROC Star

This program assumes that an online record that matches a given onsite record (i.e. has the same ID and ACCOUNT1) can appear anywhere in the dataset.

 

data have;
input id $ account1 $ account2 $ ;
cards;
AAA A1234 B1456
AAA A1234 B14156
AAA A1234 A2345
AAA A3425 A23451
AAA A3355 A23452
BBB B2546 C1254
BBB B2546 C12456
BBB B2546 C14576
BBB B3344 C1259
BBB B2546 C12591
CCC D4568 F1254
CCC D4568 G1458
CCC D4568 D4568
CCC D4568 D45681
CCC D4455 D45682
CCC D4455 D45683
DDD G1234 I5678
DDD G1234 I89654
DDD G1234 J4586
DDD G1234 I56781
DDD G1234 I89655
DDD G1234 J45861
DDD G3466 I4576
RUN;

data onsite online;
  set have (in=firstpass_all_records)
      have (in=secondpass_online_only
            where=(not(account2 in ('B14156','C1254','C14576','F1254','D4568','I5678','I89654','J4586')))
           );

  if _n_=1 then do;
    declare hash h ();
      h.definekey('id','account1');
      h.definedata('id','account1','needed_online','avail_online');
      h.definedone();
  end;

  if firstpass_all_records then do;
    if h.find() ^=0 then do; 
      needed_online=0;
      avail_online=0; 
      h.add(); 
    end;

    if account2 in ('B14156','C1254','C14576','F1254','D4568','I5678','I89654','J4586') then do;
      output onsite;
      needed_online=needed_online+1;
    end;
    else avail_online=avail_online+1;
    h.replace();  /*Update this key combination in the hash object */
  end;

  if secondpass_online_only;
  h.find();   /*Get appropriate avail_online and needed_online */
  if avail_online>0 and needed_online>0 then do;
    if rand('uniform')<= needed_online/avail_online then do;
      output online;
      needed_online=needed_online-1;
    end;
  end;
  avail_online=avail_online-1;
  h.replace();
  drop avail_online needed_online;
run;

 

It works by passing through the data twice:

  1. First pass.  Find and output all onsite records, and count the number of onsite and number on online records for each ID/ACCOUNT1 combination.
    1. Store those counts (as variables avail_online and needed_online) in a hash object (think lookup table keyed on ID/ACCOUNT1).
  2. Second pass (online  records only).  For each incoming online record fetch the corresponding avail_online and needed_online value.  If a uniform random value is less than or equal to needed_online/avail_online then output an online record and decrement needed_online.   Regardless of the uniform random value decrement avail_online.

 

Now if you want matches only within ACCOUNT1  (not ID/ACCOUNT1) then drop the reference to 'id' in the hash definekey and definedata methods.  

 

This reads the data twice, but does no sorting.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
buddha_d
Pyrite | Level 9

mkeintz, This is perfect program for what I am looking for. Thanks a lot. 👍

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 1825 views
  • 0 likes
  • 4 in conversation