BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
UMAnalyst
Obsidian | Level 7

Hi everyone,

I would like to find duplicate data in street, city, and state by group (ID) and replace a value.
When found (see in table *,**), I would like to YREND value from the "duplicate" (or following line) to override the YREND value in the original line.

So, for ID A1007, _N_ 1 and 2 should really only be one line because the values in street, city, and state are identical. I want to replace 1937 with 1945 and remove _N_ = 2 for A1007.

 

For ID A1008, I would like to replace 1963 with 1977

I have found a way to identify the duplicate data in street, city, and state, but I'm struggling with how to replace the YREND value.

Thanks for your help.

 

 

 

 

 

data sample;
infile datalines dsd;
input @1 id $5. @7 entry 1. @9 start 4. @14 end 4. @20 street $20. @40 city $10. @52 state $2. ;
datalines;
A1007 1 1932 1937  111 HURON LN          WACO      TX
A1007 2 1938 1945  111 HURON LN          WACO      TX
A1007 3 1946 1953  789 HINES PL          ANN ARBOR MI
A1007 4 1954 1959  656 ELM ST            LALONDE   MI
A1007 5 1960 1988  555 WHO DAT AVE       ORLANDO   FL
A1008 1 1930 1944  323 PLYMOUTH RD 		 SEATTLE   WA
A1008 2 1945 1963  HOUSE                 PORTLAND  OR
A1008 3 1964 1977  HOUSE                 PORTLAND  OR
A1008 4 1978 1983  455 JOY RD            DETROIT   MI
A1008 5 1984 .     656 WISCONSIN AVE     LALONDE   MI
A1008 6 1995 9999  555 LAKE ST           LAPORTE   FL
;
run;
dm 'viewtable work._last_' viewtable;

 

 

 

 

 

 

WANT:
ID

 

 

 

 

 

A1007 1 1932 1945 111 HURON LN      WACO TX (YREND FROM DUPLICATE ROW)
A1007 3 1946 1953 789 HINES PL      ANN ARBOR MI
A1007 4 1954 1959 656 ELM ST        LALONDE MI
A1007 5 1960 1988 555 WHO DAT AVE   ORLANDO FL

A1008 1 1930 1944 323 PLYMOUTH RD   SEATTLE WA
A1008 2 1945 1977 HOUSE             PORTLAND OR (YREND FROM DUPLICATE ROW)
A1008 4 1978 1983 455 JOY RD        DETROIT MI
A1008 5 1984 .    656 WISCONSIN AVE LALONDE MI
A1008 6 1995 9999 555 LAKE ST       LAPORTE FL

 

 

 

I do not need to reset the var ENTRY ie., it's OK that A1007, 2 no long exists and the count is 1,3,4,5.

Thank you

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Read each group twice  (by group I mean a sequence of obs with matching ID/STATE/CITY/STREET values).

 

The first pass is to determine the revised END value (if needed).

The second pass is to output only the first obs (with modified END value) of each group:

 

 


data sample;
infile datalines dsd;
input @1 id $5. @7 entry 1. @9 start 4. @14 end 4. @20 street $20. @40 city $10. @52 state $2. ;
datalines;
A1007 1 1932 1937  111 HURON LN          WACO      TX
A1007 2 1938 1945  111 HURON LN          WACO      TX
A1007 3 1946 1953  789 HINES PL          ANN ARBOR MI
A1007 4 1954 1959  656 ELM ST            LALONDE   MI
A1007 5 1960 1988  555 WHO DAT AVE       ORLANDO   FL
A1008 1 1930 1944  323 PLYMOUTH RD 		 SEATTLE   WA
A1008 2 1945 1963  HOUSE                 PORTLAND  OR
A1008 3 1964 1977  HOUSE                 PORTLAND  OR
A1008 4 1978 1983  455 JOY RD            DETROIT   MI
A1008 5 1984 .     656 WISCONSIN AVE     LALONDE   MI
A1008 6 1995 9999  555 LAKE ST           LAPORTE   FL
;
run;

data want;
  if 0 then set sample;    /*Force compiler to keep variables in original order */
  do until (last.street);  /*Read a group of obs to get to desired END */
    set sample;
    by id state city street notsorted;
  end;
  do until (last.street);  /*Reread group (except END) and output only the first obs*/
    set sample (drop=end);
    by id state city street notsorted;
    if first.street=1 and last.street=0 then note='END from Duplicate Row';
    if first.street=1 then output;
  end;
run;

 

Editted note:  This program assumes, as per @UMAnalyst's note, that the data are sorted by ID/ENTRY.   So each "spell" of living at a given address will have a sequence of obs, in chronological order, with matching STATE/CITY/STREET.  So the data are grouped by state/city/street, but not sorted by those variables - hence the use of the NOTSORTED parameter in the BY statements.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

4 REPLIES 4
mkeintz
PROC Star

Read each group twice  (by group I mean a sequence of obs with matching ID/STATE/CITY/STREET values).

 

The first pass is to determine the revised END value (if needed).

The second pass is to output only the first obs (with modified END value) of each group:

 

 


data sample;
infile datalines dsd;
input @1 id $5. @7 entry 1. @9 start 4. @14 end 4. @20 street $20. @40 city $10. @52 state $2. ;
datalines;
A1007 1 1932 1937  111 HURON LN          WACO      TX
A1007 2 1938 1945  111 HURON LN          WACO      TX
A1007 3 1946 1953  789 HINES PL          ANN ARBOR MI
A1007 4 1954 1959  656 ELM ST            LALONDE   MI
A1007 5 1960 1988  555 WHO DAT AVE       ORLANDO   FL
A1008 1 1930 1944  323 PLYMOUTH RD 		 SEATTLE   WA
A1008 2 1945 1963  HOUSE                 PORTLAND  OR
A1008 3 1964 1977  HOUSE                 PORTLAND  OR
A1008 4 1978 1983  455 JOY RD            DETROIT   MI
A1008 5 1984 .     656 WISCONSIN AVE     LALONDE   MI
A1008 6 1995 9999  555 LAKE ST           LAPORTE   FL
;
run;

data want;
  if 0 then set sample;    /*Force compiler to keep variables in original order */
  do until (last.street);  /*Read a group of obs to get to desired END */
    set sample;
    by id state city street notsorted;
  end;
  do until (last.street);  /*Reread group (except END) and output only the first obs*/
    set sample (drop=end);
    by id state city street notsorted;
    if first.street=1 and last.street=0 then note='END from Duplicate Row';
    if first.street=1 then output;
  end;
run;

 

Editted note:  This program assumes, as per @UMAnalyst's note, that the data are sorted by ID/ENTRY.   So each "spell" of living at a given address will have a sequence of obs, in chronological order, with matching STATE/CITY/STREET.  So the data are grouped by state/city/street, but not sorted by those variables - hence the use of the NOTSORTED parameter in the BY statements.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
UMAnalyst
Obsidian | Level 7
This is the stuff of heroes. Thank you so much
PaigeMiller
Diamond | Level 26

May we assume that the data is sorted by id, state, city, street? May we assume that the last value of END is always the maximum of the two END values?

 

Please don't speak about variable YREND, there is no such variable. (I know what you mean, but you could help us by using the actual variable names).

--
Paige Miller
UMAnalyst
Obsidian | Level 7

The data are sorted by ID and ENTRY.

The YREND var in reality is called "END"

sas-innovate-white.png

Missed SAS Innovate in Orlando?

Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.

 

Register now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1216 views
  • 1 like
  • 3 in conversation