BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Hhh111
Calcite | Level 5
Hi all,

I need opinions on how to read number of observation based on different customer. For example, my data as per below

Dataset A
Customer| customer no | count
A | 111 | 3
B | 222 | 2

Dataset B (Main)
Customer | customer no | sales
A | 111 | 12000
A | 111 | 2000
A | 111 | 10000
A | 111 | 7000
B | 222 | 1000
B | 222 | 5000
B | 222 | 6500

In dataset A, the count indicates how many observation i need to extract from dataset B (Main dataset). So i dont know how to randomly select the no of observation by each customer as shown above.

Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

The general method would be to assign a random number to each record, then sort by customer number and that random number, then select the first COUNT records of each Customer number.

 

Something liek this:

 

data both;
     merge a b;
    by customer_no;
    random_no=rand('uniform');
run;
proc sort data=both;
    by customer_no random_no;
run;
data want;
    set both;
    by customer_no;
    if first.customer_no then seq=0;
    seq+1;
    if seq<=count;
run;
--
Paige Miller

View solution in original post

10 REPLIES 10
PeterClemmensen
Tourmaline | Level 20

So in this case, you want to randomly pick 3 obs from group Customer=A and 2 obs from Customer=B?

PaigeMiller
Diamond | Level 26

The general method would be to assign a random number to each record, then sort by customer number and that random number, then select the first COUNT records of each Customer number.

 

Something liek this:

 

data both;
     merge a b;
    by customer_no;
    random_no=rand('uniform');
run;
proc sort data=both;
    by customer_no random_no;
run;
data want;
    set both;
    by customer_no;
    if first.customer_no then seq=0;
    seq+1;
    if seq<=count;
run;
--
Paige Miller
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Hhh111 

 

One way to do it is by sorting sales data in random order for each customer and then keep only the first n observarions from each customer, where n is the individual sample size for each customer:

 

* Get test data 1 - Sample size for each customer;
data sample_no;
	input Customer$ customer_no count;
	cards;
A 111 3
B 222 2
;
run;

* Get test data 2 - Sales data set;
data main;
	input Customer$ customer_no sales;
		cards;
A 111 12000
A 111 2000
A 111 10000
A 111 7000
B 222 1000
B 222 5000
B 222 6500
;
run;

* Add random number to sales data set to use for sorting;
data m2; set main;
	shuffle_id = ranuni(3);
run;

* Join sample size on all records in main and sort on customer - random number;
proc sql;
	create table m3 as 
	select a.Customer, a.customer_no, a.sales, a.shuffle_id, b.count
	from m2 as a inner join sample_no as b
	on a.Customer = b.Customer and a.customer_no = b.customer_no
	order by a.Customer, a.customer_no, a.shuffle_id;
quit;

* Keep only wanted count of records for each customer;
data want (drop=shuffle_id count rcnt);
	set m3;
	retain rcnt;
	by Customer customer_no;
	if first.customer_no then rcnt = 0;
	rcnt = rcnt + 1;
	if rcnt <= count then output;
run;
ErikLund_Jensen
Rhodochrosite | Level 12
Oh, didn't see the answer from PaigeMiller. Basically the same, but he came up with a more elegant solution with two steps only!
Reeza
Super User
PROC SURVEYSELECT has a sampsize option where you need to provide the sizes as a data set or values. I think you should be using that here instead of doing it manually via a data step.
PeterClemmensen
Tourmaline | Level 20
Data A;
input Customer $ customerno count;
datalines;
A 111 3
B 222 2
;

data B;
input Customer $ customerno sales;
datalines;
A 111 12000
A 111 2000
A 111 10000
A 111 7000
B 222 1000
B 222 5000
B 222 6500
;

data want(keep=Customer customerno sales);
	if _N_=1 then do;
		declare hash h();
		h.definekey('p');
		h.definedone();
	end;

	do _N_=1 by 1 until (last.Customer);
		set B;
		by Customer;
	end;

    set A;

	do until (c=count);
		p=rand('integer', 1, _N_);
		if h.check() ne 0 then do;
			n=_iorc_+p;
			set B point=n;
			output;
			c+1;h.ref();
		end;
	end;
	h.clear();
	_iorc_+count+1;
	c=0;
run;
PeterClemmensen
Tourmaline | Level 20

This is the simplest though

 

proc surveyselect data=B sampsize=A(rename=(count=_NSIZE_)) 
				      out=want(keep=Customer customerno sales) method=srs noprint;
   	strata Customer;
run;
ballardw
Super User

@PeterClemmensen wrote:

This is the simplest though

 

proc surveyselect data=B sampsize=A(rename=(count=_NSIZE_)) 
				      out=want(keep=Customer customerno sales) method=srs noprint;
   	strata Customer;
run;

And if the Customer value might repeat then add the Customerno to the strata statement.

Both data sets need to be sorted by the variables on the strata statement.

Hhh111
Calcite | Level 5

Thanks All for the response...i get the solution already Smiley Happy

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 2096 views
  • 6 likes
  • 6 in conversation