BookmarkSubscribeRSS Feed
KevinC_
Fluorite | Level 6
Hello Everyone,

I have a table that contains hundreds of columns with duplicate unique keys (customer ID). I would like to get rid of the duplicates by selecting the row that contains the max values in a number of columns. For each customer, I would like to keep the row that has the max value in date1, date2, date3, date4, date5,.... etc.

The tricky part is there are hundreds of dates I need to check. Is there an efficient way to do this?


Here is a sample record:
cust_id date1 date2 date3 date4 date5 date6 date7 date8 ... date300
11111 2/2 3/3 4/4 5/5 6/6 7/7 8/8 9/9 .... 12/12
11111 2/1 3/1 4/1 5/1 6/1 7/1 8/1 9/1 .... 12/01

I would like to keep the first row since it has the max value in every column.


Any input will be greatly appreciated !
Thank you  Smiley Happy
8 REPLIES 8
DBailey
Lapis Lazuli | Level 10

If the fact that a row has a maximum value for one column implies that it has the maximum value for all columns..... then

data tmp;

set have;

maxval=max(of date1-date300);

run;

proc sort in=tmp out=tmp2;

by cust_id descending maxval;

run;

proc sort in=tmp2 out=want nodupkey;

by cust_id;

run;

KevinC_
Fluorite | Level 6

Thank you DBailey.

Unfortunately that is not the case.  You brought up a good point.  Now I think about it, this is more like the real data:

cust_id date1 date2 date3 date4 date5 date6 date7 date8 ... date300

11111 2/2 3/3 4/1 5/5 6/6 7/1 8/8 9/9 .... 12/12

11111 2/1 3/1 4/9 5/1 6/1 7/9 8/1 9/1 .... 12/01

Note  4/9 and 7/9.  What i really need to do is to select the max value from each column.  So the ideal output would be:

11111 2/2 3/3 4/9 5/5 6/6 7/9 8/8 9/9 .... 12/12

Is there a way to do this?

PaigeMiller
Diamond | Level 26

Proc Means or Proc Summary with a BY statement should get you the max value for each cust_id

--
Paige Miller
KevinC_
Fluorite | Level 6

Thank you PaigeMiller.

I need the output to be a dataset with all of the columns, not just some statistics. 
Can proc means or proc summary select the max value from each column and output the entire dataset in its original layout?

PaigeMiller
Diamond | Level 26

No, you'd have to combine the output data set from PROC MEANS or PROC SUMMARY with the original data. For example:

data combined;

     merge originaldata procmeansoutput;

     by cust_id;

run;

--
Paige Miller
VD
Calcite | Level 5 VD
Calcite | Level 5

See this example:

data t1;

input id d1 d2 d3 d4;

cards;

111 2 4 6 7

222 4 5 6 7

111 3 2 5 8

222 5 6 3 2

;

run;

proc sql;

      create table t2 as select

      id,

      max(d1) as d1,

      max(d2) as d2,

      max(d3) as d3,

      max(d4) as d4

      from t1

      group by id

      order by id

;

quit;

This will give the output:

111

3

4

6

8

222

5

6

6

7

It will involve writing all your variable names in the proc sql step. There are ways to get those in SAS. However, you can also do that in Excel or Word.

damanaulakh88
Obsidian | Level 7

Hi,

This logicworks for your problem, do that for other ids too:-

Here i have worked on 8 observations and you can change this number to 300 or whatever you require:-

=================================================================================================

data new;

informat cust_id 5. date1 mmddyy10. date2 mmddyy10. date3 mmddyy10. date4 mmddyy10.  date5 mmddyy10.  date6 mmddyy10.

date7 mmddyy10.  date8 mmddyy10.;

input cust_id date1 date2 date3 date4  date5  date6  date7  date8 ;

format date1-date8 DATE9.;

datalines;

11111 02/02/2012 03/03/2012 04/04/2012 5/5/2012 6/6/2011 7/7/2012 8/8/2010 9/9/2012

11111 02/01/2009 03/01/2000 04/01/2013 5/1/2012 6/1/2012 7/1/2013 8/1/2012 9/1/2010

;

run;

proc print ;

run;

%macro dset();

proc sql;

drop table final;

quit;

%do i=1 %to 8;

proc sql;

create table final&i. as select distinct cust_id, max(date&i.) as dt&i. format date9. from new group by cust_id;

quit;

proc sql;

select dt&i. into: date&i. from final&i.;

quit;

%end;

data final;

set final1-final%eval(&i.-1);

run;

%do i=1 %to 8;

data final;

set final;

cst_id=cust_id;

date&i.="&&date&i.";

run;

%end;

proc sort data=final nodupkey;

by cst_id;

run;

proc print data=final;

run;

%mend;

%dset;

===============================================================================================================

Haikuo
Onyx | Level 15

Here is another approach, using Array():

if your data is already sorted or clustered by cust_id,

data have;

input cust_id (date1 date2 date3 date4 date5 date6 date7 date8) (:mmddyy10.);

format date1-date8 mmddyy10.;

datalines;

11111 02/02/2012 03/03/2012 04/04/2012 5/5/2012 6/6/2011 7/7/2012 8/8/2010 9/9/2012

11111 02/01/2009 03/01/2000 04/01/2013 5/1/2012 6/1/2012 7/1/2013 8/1/2012 9/1/2010

;

data want;

do until (last.cust_id);

  set have;

  by cust_id;

  array h date1-date8;

array temp(8) _temporary_;

if first.cust_id then call missing(of temp(*));

  do i=1 to dim(h);

temp(i)=max(temp(i),h(i));

end;

end;

do i=1 to dim(h);

  h(i)=temp(i);

end;

drop i;

run;

Haikuo

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 28898 views
  • 0 likes
  • 6 in conversation