BookmarkSubscribeRSS Feed
Josie1
Obsidian | Level 7

Hello,

 

I have overlapping data where my ID value is reused and continues for another or the same client. When sorted the duplicate can either be the first or last observation. I want keep the two observations separate but set the duplicate to "." in the observation which is retiring and leave the one which continues on to the most current date.

 

IDJanFebMarAprMayJun
1202533121011
230254351011
330355378.123..
3...78.1231530
4404563124563
5..23.789153023
5202523.789...

 

I want the data to look like this:

IDJanFebMarAprMayJun
1202533121011
230254351011
3303553...
3...78.1231530
4404563124563
5..23.789153023
52025....

 please help!

7 REPLIES 7
Reeza
Super User

It seems weird to keep multiple records. 

 

Have you looked at the lag() function? 

Or transpose, de-duo and then transpose again?

Josie1
Obsidian | Level 7

some of the clients are changing ownership and this dataset spans about 10 years. so the months under the old owner are valid and the months that continue under the new owner are also valid. or in a few cases its the same building but a different client moved in and all the infrastructure is the same

 

 

Reeza
Super User

10 years. So do you have more columns or more records per row? You currently are showing only at most two rows per ID, is that the most you can have?

Josie1
Obsidian | Level 7

I have a few IDs which have 3 or 4 rows of information but there is no limit to how many rows there could be

JohnHoughton
Quartz | Level 8

Hi @Josie1

The two solutions that have been posted will both handle unlimited rows of information per ID.

 

ChrisNZ
Tourmaline | Level 20

An example of a read-ahead technique:

 

data HAVE;
  input ID JAN FEB MAR APR MAY JUN ;
  cards;
1 20 25 33 12 10 11 
2 30 25 43 5 10 11 
3 30 35 53 78.123 . . 
3 . . . 78.123 15 30 
4 40 45 63 12 45 63 
5 . . 23.789 15 30 23 
5 20 25 23.789 . . . 
run;
data WANT;    
  set HAVE ;         
  if ^LASTOBS then do;
    set HAVE(firstobs=2 
           rename=(ID =NEXTID
                   JAN=NEXTJAN
                   FEB=NEXTFEB
                   MAR=NEXTMAR
                   APR=NEXTAPR
                   MAY=NEXTMAY
                   JUN=NEXTJUN) )
            end=LASTOBS;              
    if ID=NEXTID then do;             
      if JAN=NEXTJAN then JAN=.;
      if FEB=NEXTFEB then FEB=.;
      if MAR=NEXTMAR then MAR=.;
      if APR=NEXTAPR then APR=.;
      if MAY=NEXTMAY then MAY=.;
      if JUN=NEXTJUN then JUN=.;
    end;
  end;
  drop NEXT:;
run;
JohnHoughton
Quartz | Level 8

You said your dataset was 10 years wide with a variable for each month, so you will want to avoid writing separate lines of code for each of the 120 or so variables.

I changed the sample data slightly to reflect that the columns are months&years. I also added a extra row for id5 to test that the code worked if the most recent month is a duplicate.

For each monthyear variable and each id, there is only one non-missing value , except where there is a duplicate when the client changes.

You can sort by all of the month-year variables from the using the double-hyphen (for example OCT14--MAR15), then use a lag to remove the duplicates.

 

My first pass at this failed because of how the lag function works if it is executed conditionally. To fix it I had to create and load the lageachmonth array. This is hard-coded to be the number of month-year variables ; 6 in this example.

 

It works, but I don't like that it has to process through the every variable in the do i=1 to dim(eachmonth) loop. I'd prefer to use a leave statement after a duplicate has been found, but with the current structure this won't work.

 

The lines of code I that I marked as optional are to be  used if you want the current client to be the the last one listed for each id. If you want the most recent client  to be listed first ,or simply don't care about the order, then leave out the optional lines.

 

data  HAVE;
  input ID OCT14 NOV14 DEC14 JAN15 FEB15 MAR15 ;
  cards;
1 20 25 33 12 10 11 
2 30 25 43 5 10 11 
3 30 35 53 78.123 . . 
3 . . . 78.123 15 30 
4 40 45 63 12 45 63 
5 . . 23.789 15 30 38 
5 20 25 23.789 . . . 
5 . . . . . 38
run;  
 



proc sort data=have   ; by id oct14--mar15;run;
data want (drop=i   lageachmonth1-lageachmonth6);
set have;
by id;
n=_N_; *optional line of code;
array eachmonth{*} oct14--mar15;
array lageachmonth{*} lageachmonth1-lageachmonth6; 
do i=1 to dim(eachmonth);
lageachmonth[i]=lag(eachmonth[i]);
if not(first.id) and eachmonth[i]=lageachmonth[i]   then  eachmonth[i]=.; 
end;
run;
*optional sort;
proc sort data=want  ; by id descending n;
run; 

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1379 views
  • 0 likes
  • 4 in conversation