DATA Step, Macro, Functions and more

duplicate values

Reply
Occasional Contributor
Posts: 10

duplicate values

Hello,

 

I have overlapping data where my ID value is reused and continues for another or the same client. When sorted the duplicate can either be the first or last observation. I want keep the two observations separate but set the duplicate to "." in the observation which is retiring and leave the one which continues on to the most current date.

 

IDJanFebMarAprMayJun
1202533121011
230254351011
330355378.123..
3...78.1231530
4404563124563
5..23.789153023
5202523.789...

 

I want the data to look like this:

IDJanFebMarAprMayJun
1202533121011
230254351011
3303553...
3...78.1231530
4404563124563
5..23.789153023
52025....

 please help!

Super User
Posts: 19,869

Re: duplicate values

It seems weird to keep multiple records. 

 

Have you looked at the lag() function? 

Or transpose, de-duo and then transpose again?

Occasional Contributor
Posts: 10

Re: duplicate values

some of the clients are changing ownership and this dataset spans about 10 years. so the months under the old owner are valid and the months that continue under the new owner are also valid. or in a few cases its the same building but a different client moved in and all the infrastructure is the same

 

 

Super User
Posts: 19,869

Re: duplicate values

10 years. So do you have more columns or more records per row? You currently are showing only at most two rows per ID, is that the most you can have?

Occasional Contributor
Posts: 10

Re: duplicate values

I have a few IDs which have 3 or 4 rows of information but there is no limit to how many rows there could be

Contributor
Posts: 41

Re: duplicate values

Hi @Josie1

The two solutions that have been posted will both handle unlimited rows of information per ID.

 

PROC Star
Posts: 1,760

Re: duplicate values

An example of a read-ahead technique:

 

data HAVE;
  input ID JAN FEB MAR APR MAY JUN ;
  cards;
1 20 25 33 12 10 11 
2 30 25 43 5 10 11 
3 30 35 53 78.123 . . 
3 . . . 78.123 15 30 
4 40 45 63 12 45 63 
5 . . 23.789 15 30 23 
5 20 25 23.789 . . . 
run;
data WANT;    
  set HAVE ;         
  if ^LASTOBS then do;
    set HAVE(firstobs=2 
           rename=(ID =NEXTID
                   JAN=NEXTJAN
                   FEB=NEXTFEB
                   MAR=NEXTMAR
                   APR=NEXTAPR
                   MAY=NEXTMAY
                   JUN=NEXTJUN) )
            end=LASTOBS;              
    if ID=NEXTID then do;             
      if JAN=NEXTJAN then JAN=.;
      if FEB=NEXTFEB then FEB=.;
      if MAR=NEXTMAR then MAR=.;
      if APR=NEXTAPR then APR=.;
      if MAY=NEXTMAY then MAY=.;
      if JUN=NEXTJUN then JUN=.;
    end;
  end;
  drop NEXT:;
run;
Contributor
Posts: 41

Re: duplicate values

[ Edited ]

You said your dataset was 10 years wide with a variable for each month, so you will want to avoid writing separate lines of code for each of the 120 or so variables.

I changed the sample data slightly to reflect that the columns are months&years. I also added a extra row for id5 to test that the code worked if the most recent month is a duplicate.

For each monthyear variable and each id, there is only one non-missing value , except where there is a duplicate when the client changes.

You can sort by all of the month-year variables from the using the double-hyphen (for example OCT14--MAR15), then use a lag to remove the duplicates.

 

My first pass at this failed because of how the lag function works if it is executed conditionally. To fix it I had to create and load the lageachmonth array. This is hard-coded to be the number of month-year variables ; 6 in this example.

 

It works, but I don't like that it has to process through the every variable in the do i=1 to dim(eachmonth) loop. I'd prefer to use a leave statement after a duplicate has been found, but with the current structure this won't work.

 

The lines of code I that I marked as optional are to be  used if you want the current client to be the the last one listed for each id. If you want the most recent client  to be listed first ,or simply don't care about the order, then leave out the optional lines.

 

data  HAVE;
  input ID OCT14 NOV14 DEC14 JAN15 FEB15 MAR15 ;
  cards;
1 20 25 33 12 10 11 
2 30 25 43 5 10 11 
3 30 35 53 78.123 . . 
3 . . . 78.123 15 30 
4 40 45 63 12 45 63 
5 . . 23.789 15 30 38 
5 20 25 23.789 . . . 
5 . . . . . 38
run;  
 



proc sort data=have   ; by id oct14--mar15;run;
data want (drop=i   lageachmonth1-lageachmonth6);
set have;
by id;
n=_N_; *optional line of code;
array eachmonth{*} oct14--mar15;
array lageachmonth{*} lageachmonth1-lageachmonth6; 
do i=1 to dim(eachmonth);
lageachmonth[i]=lag(eachmonth[i]);
if not(first.id) and eachmonth[i]=lageachmonth[i]   then  eachmonth[i]=.; 
end;
run;
*optional sort;
proc sort data=want  ; by id descending n;
run; 

 

Ask a Question
Discussion stats
  • 7 replies
  • 428 views
  • 0 likes
  • 4 in conversation