Solved: Removing repeated values not using first.id

Mscarboncopy · Posted 07-26-2021 11:56 AM

Hi.

I need to remove repeating data that are being generated to fill in blanks in a data set. I have several numeric vars and characters that correlate to each X coded (using only a few in the example). Vars D and E below are the ones filling in blanks. The numeric vars are not in any order, including var X. So some ids might have X going from 1 to 10 others only 1 and 2, others 10 and 11 etc X is 1 to to N. Dates repeat and it is okay to repeat, as I need the dates to be linked to each of these variables that are linked to one another. But D and E should be "cleaned" and be only in the first record of each ID. This does not work as I need to keep the repeated Ids.
if not first.id then call missing (D, E);
Is there a way to remove the duplicates in this case? Thank you!
Example
ID         X          A           B            C                D              E
1          3        date3       1    ‘character varB1’     101          ‘character not linked to varB’
1          3       date 3        2   ‘character varB2’       101         ‘character not linked to varB’
1         10      date10       5    ‘character varB5’    300          ‘character not linked to varB’
2         1        date1       13    ‘character var13’      500        ‘character not linked to varB’
2         1        date1       10     ‘character varB10’   500 ‘character not linked to varB’
2         2        date2       10     ‘character varB10’   700         ‘character not linked to varB’
2         2        date2        1 ‘character varB1’     700           ‘character not linked to varB’
3       10 date10       5       ‘character varB5’     400    ‘character not linked to varB’
3         11      date11      6      ‘character varB6’      600             ‘character not linked to varB’
3         11       date11    3       ‘character varB3’      600              ‘character not linked to varB’

Should look like:
ID      X       A              B C    D    E
1        3      date3   1    ‘charactervarB1’              101    ‘character not linked to varB’
1        3       date 3     2          ‘character varB2’

1       10       date10    5         ‘character varB5’             300    ‘character not linked to varB’
2       1         date1      13 ‘character varB13’ 500                   ‘character not linked to varB’
2       1        date1 10         ‘character varB10’
2 2        date2        10 ‘character varB10’          700    ‘character not linked to varB’
2       2       date2         1     ‘character varB1’
3     10       date10       5    ‘character varB5’ 400 ‘character not linked to varB’
3    11         date11      6            'character varB6’        600              ‘character not linked to varB’
3    11           date11    3 'character varB3’

Reeza · Posted 07-26-2021 12:11 PM

Is this for presentation purposes? Do you need it this way in a report or something? For most programming purposes you need the first one. If it's just for display, then use PROC REPORT and assign D/E as GROUP variables.

View solution in original post

ballardw · Posted 07-26-2021 12:03 PM

It appears that you want to use A not ID for identifying duplicates.

BY ID A;

if not first.A then call missing(D, E);

maybe. You don't show actual values for date an may not work.

Without an actual data set can't test.

Mscarboncopy · Posted 07-26-2021 12:20 PM

Thank you. I can't use A, as I need all the dates, even the repeating ones.

Reeza · Posted 07-26-2021 12:24 PM

Using it to set the other values to missing doesn't affect the A value itself...if the code suggested doesn't work can you explain why with some data examples that illustrate those cases?

Mscarboncopy · Posted 07-26-2021 12:35 PM

When I use A all goes to missing except the very first row:

ID   X                      A             B       C           D            E
1      1 2/25/2004      2      ‘yellow’   100 ‘long text1’
1     1                  2/25/2004    14     ‘black’    100       ‘long text1’
1     2                   2/26/2004 2      ‘green’    300       ‘longtext2’
1     2 2/26/2004     13      ‘white’    300    ‘longtext2’
1     2                  2/26/2004       14      ‘blue’      300      ‘longtext2’

ballardw · Posted 07-26-2021 12:38 PM

@Mscarboncopy wrote:

When I use A all goes to missing except the very first row:

ID   X                      A             B       C           D            E
1      1 2/25/2004      2      ‘yellow’   100 ‘long text1’
1     1                  2/25/2004    14     ‘black’    100       ‘long text1’
1     2                   2/26/2004 2      ‘green’    300       ‘longtext2’
1     2 2/26/2004     13      ‘white’    300    ‘longtext2’
1     2                  2/26/2004       14      ‘blue’      300      ‘longtext2’

IF that is output you need to show the corresponding INPUT and the actual code you ran, not one line of code but the entire datastep. Your example here isn't very good at showing "all goes to missing".

Mscarboncopy · Posted 07-26-2021 12:47 PM

I don't have an input in my sas code, as I am using a very long file that is given to me in SPSS format which I transform it to a sas file. What I showed previously is my input for ID 1. I am trying to have the following output:

ID              X          A                 B           C             D           E
1               1           2/25/2004    2      ‘yellow’ 100    ‘long text1’
1               1           2/25/2004 14       ‘black’
1               2    2/26/2004     2       ‘green’       300       ‘longtext2’
1               2    2/26/2004    13        ‘white’
1               2          2/26/2004    14         ‘blue’

Thank you.

ballardw · Posted 07-26-2021 03:02 PM

@Mscarboncopy wrote:

I don't have an input in my sas code, as I am using a very long file that is given to me in SPSS format which I transform it to a sas file. What I showed previously is my input for ID 1. I am trying to have the following output:

ID              X          A                 B           C             D           E
1               1           2/25/2004    2      ‘yellow’ 100    ‘long text1’
1               1           2/25/2004 14       ‘black’
1               2    2/26/2004     2       ‘green’       300       ‘longtext2’
1               2    2/26/2004    13        ‘white’
1               2          2/26/2004    14         ‘blue’

Thank you.

And the code attempted? The entire data step please.

Mscarboncopy · Posted 07-26-2021 03:42 PM

I used your suggestion but made a slight mistake. When Reeza mentioned the output would have been correct, I went back and found my mistake. I answered above with my apologies. Your suggestion worked perfectly. It is the solution. Thank you.

Reeza · Posted 07-26-2021 12:39 PM

Then you aren't using BallardW code, I suspect you're still using If First.A

if not first.A then call missing(D, E);

Mscarboncopy · Posted 07-26-2021 01:31 PM

Thank you Reeza. My apologies to both of you, as I made a mistake when I tried BallardW code. it worked. That is what I needed.

Mscarboncopy · Posted 07-26-2021 01:32 PM

Thank you so much. I tried your suggestion but made a slight mistake at first.

It works perfectly.

Reeza · Posted 07-26-2021 12:11 PM

Is this for presentation purposes? Do you need it this way in a report or something? For most programming purposes you need the first one. If it's just for display, then use PROC REPORT and assign D/E as GROUP variables.

Mscarboncopy · Posted 07-26-2021 12:27 PM

Thank you. Yes, that would be another good way to go about it. I was under the impression that the data set could be cleaned, but it is okay if it is a limitation. I can definitely create reports that would take care of that as you suggested. Good point. Much appreciated.

Reeza · Posted 07-26-2021 12:35 PM

It's not a limitation, it's trivial to set this in a data set (see BallardW code), but then it becomes problematic if you want to identify all D=600 so it's not a super useful or efficient structure. The only practical use for such a data structure is typically reporting.

Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Re: Removing repeated values not using first.id

Classroom Training Available!