BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Balli
Obsidian | Level 7

Hi Team,

Looking for some help here :

 

I've got two SQL tables

Table A: It contains a state and Phone numbers. Each state has got around 20 different numbers. 

Table B : This table got accounts, customers and states . It has around 40K accounts. 

 

I want to randomly assign a phone number from table A to each account in table B, based on their state.  

 

I think this can be achieved using first and last variables but unable to come up with a solution. Will be great help if someone can please provide me some guidance here.

 

Here is a rough idea of what the tables contain.

 

Table A:

Balli_0-1751248883054.png

Table B:

Balli_1-1751249556515.png

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You have a small dataset of phone numbers by state, and a "large" dataset of accounts.  I understand that you are fine with randomly assigning a given phone number to multiple qualifying accounts.  Here's a program that avoids the need to sort the large dataset just to facilitate a merge:

 

Using @Ksharp's sample data:

data phone_arrays (keep=state _nphones _col:);
  do _nphones=1 by 1 until (last.state);
    set tablea  (where=(not missing(phone)));
    by state;
    array _col {50};
    _col{_nphones}=phone;
  end;
run;

data want (drop=_:);
  set tableb;
  if _n_=1 then do;
    if 0 then set phone_arrays ;
    declare hash h (dataset:'phone_arrays');
      h.definekey('state'); 
      h.definedata(all:'Y');
      h.definedone();
  end;
  array col {*} _col: ;
  
  if h.find()=0 then phnum=col{ceil(_nphones*ranuni(1508915))};
run;

The _COL array is given an arbitrary size -- large enough to account for the largest group of available phone numbers.

This assumes the TABLEA dataset is already sorted by state.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

11 REPLIES 11
Tom
Super User Tom
Super User

You could just add a random number to the phone number dataset and sort by STATE and the new variable with the random number to get the phone numbers in random order.

 

For a more detailed answer we probably need more information. 

Do you want each account and each customer to have their own phone number?  Or do you need to use both account and customer together, like a combination key,  to assign a phone number?

Do you want each account/customer pair to get a different number?

Are there more accounts/customers for any state than the number of phone numbers you have for that state?  If so what do you want to do then? Is it ok if two account/customer get the same number?

Balli
Obsidian | Level 7

Hi Tom, thanks for your response. the problem is not about randomising the phone numbers, it is more about joining the two datasets. AccountIDs are unique and as I mentioned, there are around 40,00 accounts which means there will be many accounts per state and multiple accounts would have the same phone number. 

Patrick
Opal | Level 21

The approach Tom proposes is not that hard to implement. Please provide usable sample data if you're after code (=instead of screenshots two data steps that create the sample data). 

 

andreas_lds
Jade | Level 19

Please show the expected result making it easier to understand what you really want. 

Ksharp
Super User

That would be a lot of help if you could post your data as a Data Step or plain text, not just a picture.

Nobody would like to type it for you and you would miss the solution from someone.

data tablea;
input state $ phone;
cards;
NSW 111
NSW 112
NSW 123
NSW 114
QLD 121
QLD 211
QLD 141
QLD 411
;
data tableb;
input accountid state $;
cards;
1 NSW
2 QLD
3 VIC
;

proc surveyselect data=tablea out=phone seed=123 sampsize=1;
strata state;
run;
proc sort data=phone;by state;run;
proc sort data=tableb;by state;run;
data want;
 merge tableb(in=ina) phone;
 by state;
 if ina;
run;

 

Ksharp
Super User

If it is a replaced sampling ,and each accountid from the same state could have different phone.

Try this one :

 

data tablea;
input state $ phone;
cards;
NSW 111
NSW 112
NSW 123
NSW 114
QLD 121
QLD 211
QLD 141
QLD 411
;
data tableb;
input accountid state $;
cards;
1 NSW
8 NSW
2 QLD
4 QLD
5 QLD
3 VIC
;

proc sql;
create table SampleSize as
select state,count(*) as SampleSize
 from tableb
  group by state;
quit;

proc surveyselect data=tablea out=phone seed=123 sampsize=SampleSize selectall;
strata state;
run;
proc sort data=phone;by state;run;
proc sort data=tableb;by state;run;
data want;
 merge tableb(in=ina) phone;
 by state;
 if ina;
run;
FreelanceReinh
Jade | Level 19

@Ksharp wrote:
proc surveyselect data=tablea out=phone seed=123 sampsize=SampleSize selectall;
strata state;
run;

Great idea, @Ksharp, to use PROC SURVEYSELECT, but I'm pretty sure you actually wanted to use the options

method=urs outhits outrandom

instead of selectall.

FreelanceReinh
Jade | Level 19

Hi @Balli,

 


@Balli wrote:

(...)

I think this can be achieved using first and last variables ...


Here is an approach using the FIRST.STATE and LAST.STATE variables: The observation numbers of the first and last observation of each state in dataset HAVE_A (assuming they are grouped) are stored in a temporary lookup table (hash object). While reading dataset HAVE_B they are retrieved and a (uniformly distributed) random number between them is selected using the RAND function. This random number, in turn, is used in the POINT= option of the SET statement retrieving a phone number from dataset HAVE_A.

/* Create sample data for demonstration */

%let states='NSW','QLD','VIC','WA','ACT','NT','SA','TAS';

data have_A;
do State=&states;
  do _n_=1 to 23-mod(rank(md5(State)),7);
    Phone=ranuni(2718)*(2**31-1);
    output;
  end;
end;
run;

data have_B;
call streaminit(27182818);
do _n_=1 to 40000;
  AccountID=ranuni(3142)*(2**31-1);
  length State $3;
  State=choosec(rand('integer',8),&states);
  output;
end;
run;

proc sql;
insert into have_B
values(123456789,'XYZ');
quit;


/* Assign a randomly selected phone number from have_A to each account in have_B, based on their state */

data want(drop=n1 n2);
call streaminit(31415927);
if _n_=1 then do;
  if 0 then set have_B have_A;
  dcl hash h(); /* lookup table for the first and last obs. number per state in have_A */
  h.definekey('State');
  h.definedata('n1','n2');
  h.definedone();
  do _n_=1 by 1 until(last);
    set have_A end=last;
    by state notsorted; /* Dataset have_A must be grouped by State. */
    if first.state then n1=_n_;
    if last.state then do;
      n2=_n_;
      h.add();
    end;
  end;
end;
set have_b;
if h.find()=0 then do;
  _n_=rand('integer',n1,n2);
  set have_A(keep=phone) point=_n_;
end;
else call missing(phone);
run;

Alternatively, you could store the phone numbers in the hash object and retrieve them from there (e.g., using a sequence number as a second key item).

 

But I think a solution using PROC SURVEYSELECT (@Ksharp's idea) is easier to understand.

 

Edit: Inserted the ELSE statement at the end of the program to avoid retaining a phone number from the previous observation if an unexpected value of State occurs.

mkeintz
PROC Star

You have a small dataset of phone numbers by state, and a "large" dataset of accounts.  I understand that you are fine with randomly assigning a given phone number to multiple qualifying accounts.  Here's a program that avoids the need to sort the large dataset just to facilitate a merge:

 

Using @Ksharp's sample data:

data phone_arrays (keep=state _nphones _col:);
  do _nphones=1 by 1 until (last.state);
    set tablea  (where=(not missing(phone)));
    by state;
    array _col {50};
    _col{_nphones}=phone;
  end;
run;

data want (drop=_:);
  set tableb;
  if _n_=1 then do;
    if 0 then set phone_arrays ;
    declare hash h (dataset:'phone_arrays');
      h.definekey('state'); 
      h.definedata(all:'Y');
      h.definedone();
  end;
  array col {*} _col: ;
  
  if h.find()=0 then phnum=col{ceil(_nphones*ranuni(1508915))};
run;

The _COL array is given an arbitrary size -- large enough to account for the largest group of available phone numbers.

This assumes the TABLEA dataset is already sorted by state.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Balli
Obsidian | Level 7
Thanks. appreciate your response. This is very quick and elegant solution. Exactly what I was after 🙂
Balli
Obsidian | Level 7

Thanks everyone for the response and sorry for not including the sample data and the expected results. I will try to include more details next time.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 1958 views
  • 7 likes
  • 7 in conversation