Solved: by-group processing and value replace

UMAnalyst · Posted 10-25-2023 12:36 PM

Hi everyone,

I would like to find duplicate data in street, city, and state by group (ID) and replace a value.
When found (see in table *,**), I would like to YREND value from the "duplicate" (or following line) to override the YREND value in the original line.

So, for ID A1007, _N_ 1 and 2 should really only be one line because the values in street, city, and state are identical. I want to replace 1937 with 1945 and remove _N_ = 2 for A1007.

For ID A1008, I would like to replace 1963 with 1977

I have found a way to identify the duplicate data in street, city, and state, but I'm struggling with how to replace the YREND value.

Thanks for your help.

data sample;
infile datalines dsd;
input @1 id $5. @7 entry 1. @9 start 4. @14 end 4. @20 street $20. @40 city $10. @52 state $2. ;
datalines;
A1007 1 1932 1937  111 HURON LN          WACO      TX
A1007 2 1938 1945  111 HURON LN          WACO      TX
A1007 3 1946 1953  789 HINES PL          ANN ARBOR MI
A1007 4 1954 1959  656 ELM ST            LALONDE   MI
A1007 5 1960 1988  555 WHO DAT AVE       ORLANDO   FL
A1008 1 1930 1944  323 PLYMOUTH RD 		 SEATTLE   WA
A1008 2 1945 1963  HOUSE                 PORTLAND  OR
A1008 3 1964 1977  HOUSE                 PORTLAND  OR
A1008 4 1978 1983  455 JOY RD            DETROIT   MI
A1008 5 1984 .     656 WISCONSIN AVE     LALONDE   MI
A1008 6 1995 9999  555 LAKE ST           LAPORTE   FL
;
run;
dm 'viewtable work._last_' viewtable;

WANT:
ID

A1007 1 1932 1945 111 HURON LN      WACO TX (YREND FROM DUPLICATE ROW)
A1007 3 1946 1953 789 HINES PL      ANN ARBOR MI
A1007 4 1954 1959 656 ELM ST        LALONDE MI
A1007 5 1960 1988 555 WHO DAT AVE   ORLANDO FL

A1008 1 1930 1944 323 PLYMOUTH RD   SEATTLE WA
A1008 2 1945 1977 HOUSE             PORTLAND OR (YREND FROM DUPLICATE ROW)
A1008 4 1978 1983 455 JOY RD        DETROIT MI
A1008 5 1984 .    656 WISCONSIN AVE LALONDE MI
A1008 6 1995 9999 555 LAKE ST       LAPORTE FL

I do not need to reset the var ENTRY ie., it's OK that A1007, 2 no long exists and the count is 1,3,4,5.

Thank you

mkeintz · Posted 10-25-2023 01:08 PM

Read each group twice (by group I mean a sequence of obs with matching ID/STATE/CITY/STREET values).

The first pass is to determine the revised END value (if needed).

The second pass is to output only the first obs (with modified END value) of each group:


data sample;
infile datalines dsd;
input @1 id $5. @7 entry 1. @9 start 4. @14 end 4. @20 street $20. @40 city $10. @52 state $2. ;
datalines;
A1007 1 1932 1937  111 HURON LN          WACO      TX
A1007 2 1938 1945  111 HURON LN          WACO      TX
A1007 3 1946 1953  789 HINES PL          ANN ARBOR MI
A1007 4 1954 1959  656 ELM ST            LALONDE   MI
A1007 5 1960 1988  555 WHO DAT AVE       ORLANDO   FL
A1008 1 1930 1944  323 PLYMOUTH RD 		 SEATTLE   WA
A1008 2 1945 1963  HOUSE                 PORTLAND  OR
A1008 3 1964 1977  HOUSE                 PORTLAND  OR
A1008 4 1978 1983  455 JOY RD            DETROIT   MI
A1008 5 1984 .     656 WISCONSIN AVE     LALONDE   MI
A1008 6 1995 9999  555 LAKE ST           LAPORTE   FL
;
run;

data want;
  if 0 then set sample;    /*Force compiler to keep variables in original order */
  do until (last.street);  /*Read a group of obs to get to desired END */
    set sample;
    by id state city street notsorted;
  end;
  do until (last.street);  /*Reread group (except END) and output only the first obs*/
    set sample (drop=end);
    by id state city street notsorted;
    if first.street=1 and last.street=0 then note='END from Duplicate Row';
    if first.street=1 then output;
  end;
run;

Editted note: This program assumes, as per @UMAnalyst's note, that the data are sorted by ID/ENTRY. So each "spell" of living at a given address will have a sequence of obs, in chronological order, with matching STATE/CITY/STREET. So the data are grouped by state/city/street, but not sorted by those variables - hence the use of the NOTSORTED parameter in the BY statements.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 10-25-2023 01:08 PM

Read each group twice (by group I mean a sequence of obs with matching ID/STATE/CITY/STREET values).

The first pass is to determine the revised END value (if needed).

The second pass is to output only the first obs (with modified END value) of each group:


data sample;
infile datalines dsd;
input @1 id $5. @7 entry 1. @9 start 4. @14 end 4. @20 street $20. @40 city $10. @52 state $2. ;
datalines;
A1007 1 1932 1937  111 HURON LN          WACO      TX
A1007 2 1938 1945  111 HURON LN          WACO      TX
A1007 3 1946 1953  789 HINES PL          ANN ARBOR MI
A1007 4 1954 1959  656 ELM ST            LALONDE   MI
A1007 5 1960 1988  555 WHO DAT AVE       ORLANDO   FL
A1008 1 1930 1944  323 PLYMOUTH RD 		 SEATTLE   WA
A1008 2 1945 1963  HOUSE                 PORTLAND  OR
A1008 3 1964 1977  HOUSE                 PORTLAND  OR
A1008 4 1978 1983  455 JOY RD            DETROIT   MI
A1008 5 1984 .     656 WISCONSIN AVE     LALONDE   MI
A1008 6 1995 9999  555 LAKE ST           LAPORTE   FL
;
run;

data want;
  if 0 then set sample;    /*Force compiler to keep variables in original order */
  do until (last.street);  /*Read a group of obs to get to desired END */
    set sample;
    by id state city street notsorted;
  end;
  do until (last.street);  /*Reread group (except END) and output only the first obs*/
    set sample (drop=end);
    by id state city street notsorted;
    if first.street=1 and last.street=0 then note='END from Duplicate Row';
    if first.street=1 then output;
  end;
run;

Editted note: This program assumes, as per @UMAnalyst's note, that the data are sorted by ID/ENTRY. So each "spell" of living at a given address will have a sequence of obs, in chronological order, with matching STATE/CITY/STREET. So the data are grouped by state/city/street, but not sorted by those variables - hence the use of the NOTSORTED parameter in the BY statements.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

UMAnalyst · Posted 10-25-2023 01:11 PM

This is the stuff of heroes. Thank you so much

PaigeMiller · Posted 10-25-2023 01:12 PM

May we assume that the data is sorted by id, state, city, street? May we assume that the last value of END is always the maximum of the two END values?

Please don't speak about variable YREND, there is no such variable. (I know what you mean, but you could help us by using the actual variable names).

--
Paige Miller

UMAnalyst · Posted 10-26-2023 10:05 AM

The data are sorted by ID and ENTRY.

The YREND var in reality is called "END"

by-group processing and value replace

Re: by-group processing and value replace

Re: by-group processing and value replace

Re: by-group processing and value replace

Re: by-group processing and value replace

Re: by-group processing and value replace

by-group processing and value replace

Re: by-group processing and value replace

Re: by-group processing and value replace

Re: by-group processing and value replace

Re: by-group processing and value replace

Re: by-group processing and value replace

Registration is open

SAS Training: Just a Click Away