Hi everyone,
I would like to find duplicate data in street, city, and state by group (ID) and replace a value.
When found (see in table *,**), I would like to YREND value from the "duplicate" (or following line) to override the YREND value in the original line.
So, for ID A1007, _N_ 1 and 2 should really only be one line because the values in street, city, and state are identical. I want to replace 1937 with 1945 and remove _N_ = 2 for A1007.
For ID A1008, I would like to replace 1963 with 1977
I have found a way to identify the duplicate data in street, city, and state, but I'm struggling with how to replace the YREND value.
Thanks for your help.
data sample;
infile datalines dsd;
input @1 id $5. @7 entry 1. @9 start 4. @14 end 4. @20 street $20. @40 city $10. @52 state $2. ;
datalines;
A1007 1 1932 1937 111 HURON LN WACO TX
A1007 2 1938 1945 111 HURON LN WACO TX
A1007 3 1946 1953 789 HINES PL ANN ARBOR MI
A1007 4 1954 1959 656 ELM ST LALONDE MI
A1007 5 1960 1988 555 WHO DAT AVE ORLANDO FL
A1008 1 1930 1944 323 PLYMOUTH RD SEATTLE WA
A1008 2 1945 1963 HOUSE PORTLAND OR
A1008 3 1964 1977 HOUSE PORTLAND OR
A1008 4 1978 1983 455 JOY RD DETROIT MI
A1008 5 1984 . 656 WISCONSIN AVE LALONDE MI
A1008 6 1995 9999 555 LAKE ST LAPORTE FL
;
run;
dm 'viewtable work._last_' viewtable;
WANT:
ID
A1007 1 1932 1945 111 HURON LN WACO TX (YREND FROM DUPLICATE ROW)
A1007 3 1946 1953 789 HINES PL ANN ARBOR MI
A1007 4 1954 1959 656 ELM ST LALONDE MI
A1007 5 1960 1988 555 WHO DAT AVE ORLANDO FL
A1008 1 1930 1944 323 PLYMOUTH RD SEATTLE WA
A1008 2 1945 1977 HOUSE PORTLAND OR (YREND FROM DUPLICATE ROW)
A1008 4 1978 1983 455 JOY RD DETROIT MI
A1008 5 1984 . 656 WISCONSIN AVE LALONDE MI
A1008 6 1995 9999 555 LAKE ST LAPORTE FL
I do not need to reset the var ENTRY ie., it's OK that A1007, 2 no long exists and the count is 1,3,4,5.
Thank you
Read each group twice (by group I mean a sequence of obs with matching ID/STATE/CITY/STREET values).
The first pass is to determine the revised END value (if needed).
The second pass is to output only the first obs (with modified END value) of each group:
data sample;
infile datalines dsd;
input @1 id $5. @7 entry 1. @9 start 4. @14 end 4. @20 street $20. @40 city $10. @52 state $2. ;
datalines;
A1007 1 1932 1937 111 HURON LN WACO TX
A1007 2 1938 1945 111 HURON LN WACO TX
A1007 3 1946 1953 789 HINES PL ANN ARBOR MI
A1007 4 1954 1959 656 ELM ST LALONDE MI
A1007 5 1960 1988 555 WHO DAT AVE ORLANDO FL
A1008 1 1930 1944 323 PLYMOUTH RD SEATTLE WA
A1008 2 1945 1963 HOUSE PORTLAND OR
A1008 3 1964 1977 HOUSE PORTLAND OR
A1008 4 1978 1983 455 JOY RD DETROIT MI
A1008 5 1984 . 656 WISCONSIN AVE LALONDE MI
A1008 6 1995 9999 555 LAKE ST LAPORTE FL
;
run;
data want;
if 0 then set sample; /*Force compiler to keep variables in original order */
do until (last.street); /*Read a group of obs to get to desired END */
set sample;
by id state city street notsorted;
end;
do until (last.street); /*Reread group (except END) and output only the first obs*/
set sample (drop=end);
by id state city street notsorted;
if first.street=1 and last.street=0 then note='END from Duplicate Row';
if first.street=1 then output;
end;
run;
Editted note: This program assumes, as per @UMAnalyst's note, that the data are sorted by ID/ENTRY. So each "spell" of living at a given address will have a sequence of obs, in chronological order, with matching STATE/CITY/STREET. So the data are grouped by state/city/street, but not sorted by those variables - hence the use of the NOTSORTED parameter in the BY statements.
Read each group twice (by group I mean a sequence of obs with matching ID/STATE/CITY/STREET values).
The first pass is to determine the revised END value (if needed).
The second pass is to output only the first obs (with modified END value) of each group:
data sample;
infile datalines dsd;
input @1 id $5. @7 entry 1. @9 start 4. @14 end 4. @20 street $20. @40 city $10. @52 state $2. ;
datalines;
A1007 1 1932 1937 111 HURON LN WACO TX
A1007 2 1938 1945 111 HURON LN WACO TX
A1007 3 1946 1953 789 HINES PL ANN ARBOR MI
A1007 4 1954 1959 656 ELM ST LALONDE MI
A1007 5 1960 1988 555 WHO DAT AVE ORLANDO FL
A1008 1 1930 1944 323 PLYMOUTH RD SEATTLE WA
A1008 2 1945 1963 HOUSE PORTLAND OR
A1008 3 1964 1977 HOUSE PORTLAND OR
A1008 4 1978 1983 455 JOY RD DETROIT MI
A1008 5 1984 . 656 WISCONSIN AVE LALONDE MI
A1008 6 1995 9999 555 LAKE ST LAPORTE FL
;
run;
data want;
if 0 then set sample; /*Force compiler to keep variables in original order */
do until (last.street); /*Read a group of obs to get to desired END */
set sample;
by id state city street notsorted;
end;
do until (last.street); /*Reread group (except END) and output only the first obs*/
set sample (drop=end);
by id state city street notsorted;
if first.street=1 and last.street=0 then note='END from Duplicate Row';
if first.street=1 then output;
end;
run;
Editted note: This program assumes, as per @UMAnalyst's note, that the data are sorted by ID/ENTRY. So each "spell" of living at a given address will have a sequence of obs, in chronological order, with matching STATE/CITY/STREET. So the data are grouped by state/city/street, but not sorted by those variables - hence the use of the NOTSORTED parameter in the BY statements.
May we assume that the data is sorted by id, state, city, street? May we assume that the last value of END is always the maximum of the two END values?
Please don't speak about variable YREND, there is no such variable. (I know what you mean, but you could help us by using the actual variable names).
The data are sorted by ID and ENTRY.
The YREND var in reality is called "END"
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.