Solved: Re: Select a certain observation within duplicate cases

Ruslan · Posted 05-22-2018 05:21 PM

Dear All,

I have the following cases:

var1 var2 var3 var4 var5

12040 10-K 33746 26098 29Dec2010

12040 10-K 32298 26098 29Dec2010

12040 10-K 33746 25206 04Jan2010

12040 10-K 32298 25206 04Jan2010

8504 10KSB 15831 11617 23Aug2006

8504 10KSB 14250 11617 23Aug2006

8504 10KSB 12775 11617 23Aug2006

8504 10KSB 12142 11617 23Aug2006

8504 10KSB 15831 10515 05Sep2006

8504 10KSB 14250 10515 05Sep2006

8504 10KSB 12775 10515 05Sep2006

8504 10KSB 12142 10515 05Sep2006

8504 10KSB 15831 10105 07Sep2006

8504 10KSB 14250 10105 07Sep2006

8504 10KSB 12775 10105 07Sep2006

8504 10KSB 12142 10105 07Sep2006

8328 10-K 60811 49348 28Jul2009

8328 10-K 39154 49348 28Jul2009

8328 10-K 60811 30424 15Jul2009

8328 10-K 39154 30424 15Jul2009

I hope these three cases are enough to give you the general picture of my problem. Now, I want to get the following dataset:

var1 var2 var3 var4 var5

12040 10-K 33746 26098 29Dec2010

12040 10-K 32298 25206 04Jan2010

8504 10KSB 15831 11617 23Aug2006

8504 10KSB 14250 10515 05Sep2006

8504 10KSB 12775 10105 07Sep2006

8328 10-K 60811 49348 28Jul2009

8328 10-K 39154 30424 15Jul2009

So, consider duplicates in the first case. I have two duplicates by var4/var5. So, I need to go to the first duplicate and select the row where the biggest value of var4 corresponds to the biggest value of var3 (i.e. 26098 is assigned to 33746), so I take the first row and drop the second row. Then, I go to the second duplicate and select the row where the SECOND biggest value of var4 corresponds to the SECOND biggest value of var3 (i.e. 25206 is assigned to 32298). Then, if needed, I go to the third duplicate case and select the row where the THIRD biggest value of var4 corresponds to the THIRD biggest value of var3. And so on.

The same principle is applied to the second and third cases mentioned above. I really hope my explanation is clear.

Your help would be very highly appreciated. I look forward to getting your reply.

Kind regards,

Ruslan

novinosrin · Posted 05-23-2018 10:42 AM

@Ruslan Good morning, see if the modified version with your modified sample below works. I haven't had a coffee yet and so please test thoroughly. I will get my head around it in a bit after some caffeine.

data have;
input (var1       var2     var3       var4      var5) (:$10.);
cards;
12040 10-K 56376 47611 28Dec2007
12040 10-K 33746 26098 29Dec2010
12040 10-K 32298 26098 29Dec2010
12040 10-K 33247 25722 27Dec2013
12040 10-K 33746 25206 04Jan2010
12040 10-K 32298 25206 04Jan2010
8504 10KSB 20618 16895 16Jul2001
8504 10KSB 15831 11617 23Aug2006
8504 10KSB 14250 11617 23Aug2006
8504 10KSB 12775 11617 23Aug2006
8504 10KSB 12142 11617 23Aug2006
8504 10KSB 15831 10515 05Sep2006
8504 10KSB 14250 10515 05Sep2006
8504 10KSB 12775 10515 05Sep2006
8504 10KSB 12142 10515 05Sep2006
8504 10KSB 15831 10105 07Sep2006
8504 10KSB 14250 10105 07Sep2006
8504 10KSB 12775 10105 07Sep2006
8504 10KSB 12142 10105 07Sep2006
8504 10KSB 8970 7027 30Jun2000
8328 10-K 60811 49348 28Jul2009
8328 10-K 39154 49348 28Jul2009
8328 10-K 48477 37048 27Mar2015
8328 10-K 43947 32737 14Mar2014
8328 10-K 60811 30424 15Jul2009
8328 10-K 39154 30424 15Jul2009
;

data temp;
set have;
by  var1 var4 var5 notsorted ;
if first.var1 then grp=0;
if first.var4 and first.var5 and not last.var4 and not last.var5 then  grp+1;
run;

data want;
do until(last.var4);
	do _n_=1 by 1 until(last.var5);
		set temp;
		by var1 var4 var5 notsorted;
		if _n_=grp then output;
		else if _n_=1 and last.var5 then output;
	end; 
end;
drop grp;
run;

View solution in original post

Ruslan · Posted 05-22-2018 06:05 PM

Dear All,

Please help me with this urgent issue. I very much rely on your assistance.

Kind regards,

Ruslan

art297 · Posted 05-22-2018 06:40 PM

I think that the following does what you want:

proc sort data=have;
  by var1 var2 var4 var5 descending var3;
run;

proc sort data=have out=want nodupkey;
  by var1 var2 var4 var5;
run;

Art, CEO, AnalystFinder.com

Ruslan · Posted 05-22-2018 06:48 PM

No, unfortunately, it does not. If I use your code, I would get this for my first case:

var1 var2 var3 var4 var5

12040 10-K 33746 26098 29Dec2010

12040 10-K 33746 25206 04Jan2010

But I need to get this:

var1 var2 var3 var4 var5

12040 10-K 33746 26098 29Dec2010

12040 10-K 32298 25206 04Jan2010

Please please help me, guys. Look forward to your answers.

Kind regards,

Ruslan

novinosrin · Posted 05-22-2018 07:32 PM

I hope i understood your req:

data have;
input (var1       var2     var3       var4      var5) (:$10.);
cards;
12040   10-K     33746    26098   29Dec2010
12040   10-K     32298    26098   29Dec2010
12040   10-K     33746    25206   04Jan2010
12040   10-K     32298    25206   04Jan2010
8504     10KSB  15831   11617   23Aug2006
8504     10KSB  14250   11617   23Aug2006
8504     10KSB  12775   11617   23Aug2006
8504     10KSB  12142   11617   23Aug2006
8504     10KSB  15831   10515   05Sep2006
8504     10KSB  14250   10515   05Sep2006
8504     10KSB  12775   10515   05Sep2006
8504     10KSB  12142   10515   05Sep2006
8504     10KSB  15831   10105   07Sep2006
8504     10KSB  14250   10105   07Sep2006
8504     10KSB  12775   10105   07Sep2006
8504     10KSB  12142   10105   07Sep2006
8328     10-K     60811    49348   28Jul2009
8328     10-K     39154    49348   28Jul2009
8328     10-K     60811    30424   15Jul2009
8328     10-K     39154    30424   15Jul2009
;

data temp;
set have;
by  var1 var4 var5 notsorted ;
if first.var1 then grp=0;
if first.var4 and first.var5 then  grp+1;
run;


data want;
do until(last.var4);
	do _n_=1 by 1 until(last.var5);
		set temp;
		by var1 var4 var5 notsorted;
		if _n_=grp then output;
	end;
end;
drop grp;
run;

PGStats · Posted 05-23-2018 01:17 AM

This ?

data have;
input case $ var1 var2 $ var3 var4 var5 :date9.;
format var5 date9.;
datalines;
case_1 12040 10-K 33746 26098 29Dec2010
case_1 12040 10-K 32298 26098 29Dec2010
case_1 12040 10-K 33746 25206 04Jan2010
case_1 12040 10-K 32298 25206 04Jan2010
case_2 8504 10KSB 15831 11617 23Aug2006
case_2 8504 10KSB 14250 11617 23Aug2006
case_2 8504 10KSB 12775 11617 23Aug2006
case_2 8504 10KSB 12142 11617 23Aug2006
case_2 8504 10KSB 15831 10515 05Sep2006
case_2 8504 10KSB 14250 10515 05Sep2006
case_2 8504 10KSB 12775 10515 05Sep2006
case_2 8504 10KSB 12142 10515 05Sep2006
case_2 8504 10KSB 15831 10105 07Sep2006
case_2 8504 10KSB 14250 10105 07Sep2006
case_2 8504 10KSB 12775 10105 07Sep2006
case_2 8504 10KSB 12142 10105 07Sep2006
case_3 8328 10-K 60811 49348 28Jul2009
case_3 8328 10-K 39154 49348 28Jul2009
case_3 8328 10-K 60811 30424 15Jul2009
case_3 8328 10-K 39154 30424 15Jul2009
;

data want;
set have; by case;
if var3 ne lag(var3) and var4 ne lag(var4) then do;
    gr + 1;
    row = 0;
    end;
if first.case then do;
    gr = 1;
    row = 0;
    end;
row + 1;
if gr = row then output;
drop gr row;
run;

proc print data=want noobs; by case; id case; run;

PG

Ruslan · Posted 05-23-2018 06:42 AM

Hi All,

Many thanks for your replies.

@PGStats and @novinosrin, your codes work perfectly, however I have both good cases and bad cases combined together. So, how can I extract only bad cases (create var case as @PGStats did) like those I have mentioned in my message? So my full data look like this:

12040 10-K 56376 47611 28Dec2007

12040 10-K 33746 26098 29Dec2010

12040 10-K 32298 26098 29Dec2010

12040 10-K 33247 25722 27Dec2013

12040 10-K 33746 25206 04Jan2010

12040 10-K 32298 25206 04Jan2010

8504 10KSB 20618 16895 16Jul2001

8504 10KSB 15831 11617 23Aug2006

8504 10KSB 14250 11617 23Aug2006

8504 10KSB 12775 11617 23Aug2006

8504 10KSB 12142 11617 23Aug2006

8504 10KSB 15831 10515 05Sep2006

8504 10KSB 14250 10515 05Sep2006

8504 10KSB 12775 10515 05Sep2006

8504 10KSB 12142 10515 05Sep2006

8504 10KSB 15831 10105 07Sep2006

8504 10KSB 14250 10105 07Sep2006

8504 10KSB 12775 10105 07Sep2006

8504 10KSB 12142 10105 07Sep2006

8504 10KSB 8970 7027 30Jun2000

8328 10-K 60811 49348 28Jul2009

8328 10-K 39154 49348 28Jul2009

8328 10-K 48477 37048 27Mar2015

8328 10-K 43947 32737 14Mar2014

8328 10-K 60811 30424 15Jul2009

8328 10-K 39154 30424 15Jul2009

I need to get this:

12040 10-K 56376 47611 28Dec2007

12040 10-K 33746 26098 29Dec2010

12040 10-K 33247 25722 27Dec2013

12040 10-K 32298 25206 04Jan2010

8504 10KSB 20618 16895 16Jul2001

8504 10KSB 15831 11617 23Aug2006

8504 10KSB 14250 10515 05Sep2006

8504 10KSB 12775 10105 07Sep2006

8504 10KSB 8970 7027 30Jun2000

8328 10-K 60811 49348 28Jul2009

8328 10-K 48477 37048 27Mar2015

8328 10-K 43947 32737 14Mar2014

8328 10-K 39154 30424 15Jul2009

I very much look forward to getting your reply. Your help and advice would be highly appreciated.

Kind regards,

Ruslan

novinosrin · Posted 05-23-2018 10:42 AM

@Ruslan Good morning, see if the modified version with your modified sample below works. I haven't had a coffee yet and so please test thoroughly. I will get my head around it in a bit after some caffeine.

data have;
input (var1       var2     var3       var4      var5) (:$10.);
cards;
12040 10-K 56376 47611 28Dec2007
12040 10-K 33746 26098 29Dec2010
12040 10-K 32298 26098 29Dec2010
12040 10-K 33247 25722 27Dec2013
12040 10-K 33746 25206 04Jan2010
12040 10-K 32298 25206 04Jan2010
8504 10KSB 20618 16895 16Jul2001
8504 10KSB 15831 11617 23Aug2006
8504 10KSB 14250 11617 23Aug2006
8504 10KSB 12775 11617 23Aug2006
8504 10KSB 12142 11617 23Aug2006
8504 10KSB 15831 10515 05Sep2006
8504 10KSB 14250 10515 05Sep2006
8504 10KSB 12775 10515 05Sep2006
8504 10KSB 12142 10515 05Sep2006
8504 10KSB 15831 10105 07Sep2006
8504 10KSB 14250 10105 07Sep2006
8504 10KSB 12775 10105 07Sep2006
8504 10KSB 12142 10105 07Sep2006
8504 10KSB 8970 7027 30Jun2000
8328 10-K 60811 49348 28Jul2009
8328 10-K 39154 49348 28Jul2009
8328 10-K 48477 37048 27Mar2015
8328 10-K 43947 32737 14Mar2014
8328 10-K 60811 30424 15Jul2009
8328 10-K 39154 30424 15Jul2009
;

data temp;
set have;
by  var1 var4 var5 notsorted ;
if first.var1 then grp=0;
if first.var4 and first.var5 and not last.var4 and not last.var5 then  grp+1;
run;

data want;
do until(last.var4);
	do _n_=1 by 1 until(last.var5);
		set temp;
		by var1 var4 var5 notsorted;
		if _n_=grp then output;
		else if _n_=1 and last.var5 then output;
	end; 
end;
drop grp;
run;

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away