Solved: Re: How to read number of observation by variables

Hhh111 · Posted 09-18-2019 11:53 AM

Hi all,

I need opinions on how to read number of observation based on different customer. For example, my data as per below

Dataset A
Customer| customer no | count
A | 111 | 3
B | 222 | 2

Dataset B (Main)
Customer | customer no | sales
A | 111 | 12000
A | 111 | 2000
A | 111 | 10000
A | 111 | 7000
B | 222 | 1000
B | 222 | 5000
B | 222 | 6500

In dataset A, the count indicates how many observation i need to extract from dataset B (Main dataset). So i dont know how to randomly select the no of observation by each customer as shown above.

Thanks.

PaigeMiller · Posted 09-18-2019 12:37 PM

The general method would be to assign a random number to each record, then sort by customer number and that random number, then select the first COUNT records of each Customer number.

Something liek this:

data both;
     merge a b;
    by customer_no;
    random_no=rand('uniform');
run;
proc sort data=both;
    by customer_no random_no;
run;
data want;
    set both;
    by customer_no;
    if first.customer_no then seq=0;
    seq+1;
    if seq<=count;
run;

--
Paige Miller

View solution in original post

PeterClemmensen · Posted 09-18-2019 11:56 AM

So in this case, you want to randomly pick 3 obs from group Customer=A and 2 obs from Customer=B?

The DATA to DATA Step Macro
Blog: SASnrd

Hhh111 · Posted 09-18-2019 08:44 PM

Yes!!

PaigeMiller · Posted 09-18-2019 12:37 PM

The general method would be to assign a random number to each record, then sort by customer number and that random number, then select the first COUNT records of each Customer number.

Something liek this:

data both;
     merge a b;
    by customer_no;
    random_no=rand('uniform');
run;
proc sort data=both;
    by customer_no random_no;
run;
data want;
    set both;
    by customer_no;
    if first.customer_no then seq=0;
    seq+1;
    if seq<=count;
run;

--
Paige Miller

ErikLund_Jensen · Posted 09-18-2019 12:57 PM

Hi @Hhh111

One way to do it is by sorting sales data in random order for each customer and then keep only the first n observarions from each customer, where n is the individual sample size for each customer:

* Get test data 1 - Sample size for each customer;
data sample_no;
	input Customer$ customer_no count;
	cards;
A 111 3
B 222 2
;
run;

* Get test data 2 - Sales data set;
data main;
	input Customer$ customer_no sales;
		cards;
A 111 12000
A 111 2000
A 111 10000
A 111 7000
B 222 1000
B 222 5000
B 222 6500
;
run;

* Add random number to sales data set to use for sorting;
data m2; set main;
	shuffle_id = ranuni(3);
run;

* Join sample size on all records in main and sort on customer - random number;
proc sql;
	create table m3 as 
	select a.Customer, a.customer_no, a.sales, a.shuffle_id, b.count
	from m2 as a inner join sample_no as b
	on a.Customer = b.Customer and a.customer_no = b.customer_no
	order by a.Customer, a.customer_no, a.shuffle_id;
quit;

* Keep only wanted count of records for each customer;
data want (drop=shuffle_id count rcnt);
	set m3;
	retain rcnt;
	by Customer customer_no;
	if first.customer_no then rcnt = 0;
	rcnt = rcnt + 1;
	if rcnt <= count then output;
run;

ErikLund_Jensen · Posted 09-18-2019 01:03 PM

Oh, didn't see the answer from PaigeMiller. Basically the same, but he came up with a more elegant solution with two steps only!

Reeza · Posted 09-18-2019 01:22 PM

PROC SURVEYSELECT has a sampsize option where you need to provide the sizes as a data set or values. I think you should be using that here instead of doing it manually via a data step.

PeterClemmensen · Posted 09-18-2019 02:14 PM

Data A;
input Customer $ customerno count;
datalines;
A 111 3
B 222 2
;

data B;
input Customer $ customerno sales;
datalines;
A 111 12000
A 111 2000
A 111 10000
A 111 7000
B 222 1000
B 222 5000
B 222 6500
;

data want(keep=Customer customerno sales);
	if _N_=1 then do;
		declare hash h();
		h.definekey('p');
		h.definedone();
	end;

	do _N_=1 by 1 until (last.Customer);
		set B;
		by Customer;
	end;

    set A;

	do until (c=count);
		p=rand('integer', 1, _N_);
		if h.check() ne 0 then do;
			n=_iorc_+p;
			set B point=n;
			output;
			c+1;h.ref();
		end;
	end;
	h.clear();
	_iorc_+count+1;
	c=0;
run;

The DATA to DATA Step Macro
Blog: SASnrd

PeterClemmensen · Posted 09-18-2019 03:32 PM

This is the simplest though

proc surveyselect data=B sampsize=A(rename=(count=_NSIZE_)) 
				      out=want(keep=Customer customerno sales) method=srs noprint;
   	strata Customer;
run;

The DATA to DATA Step Macro
Blog: SASnrd

ballardw · Posted 09-18-2019 05:16 PM

@PeterClemmensen wrote:

This is the simplest though

proc surveyselect data=B sampsize=A(rename=(count=_NSIZE_)) 
				      out=want(keep=Customer customerno sales) method=srs noprint;
   	strata Customer;
run;

And if the Customer value might repeat then add the Customerno to the strata statement.

Both data sets need to be sorted by the variables on the strata statement.

Hhh111 · Posted 09-18-2019 10:47 PM

Thanks All for the response...i get the solution already

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away