BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mscarboncopy
Pyrite | Level 9

Hi.

I need to remove repeating data that are being generated to fill in blanks in a data set. I have several numeric vars and characters that correlate to each X coded (using only a few in the example). Vars D and E below are the ones filling in blanks. The numeric vars are not in any order, including var X. So some ids might have X going from  1 to 10 others only 1 and 2, others 10 and 11 etc X is 1 to to N. Dates repeat and it is okay to repeat, as I need the dates to be linked to each of these variables that are linked to one another. But D and E should be "cleaned" and be only in the first record of each ID. This does not work as I need to keep the repeated Ids.
if not first.id then call missing (D, E);
Is there a way to remove the duplicates in this case? Thank you!
Example
ID         X          A           B                C                    D                                E
1          3        date3       1    ‘character varB1’       101          ‘character not linked to varB’
1          3       date 3        2   ‘character varB2’       101         ‘character not linked to varB’
1         10      date10       5    ‘character varB5’      300          ‘character not linked to varB’
2         1        date1       13    ‘character var13’      500          ‘character not linked to varB’
2         1        date1       10     ‘character varB10’   500         ‘character not linked to varB’
2         2        date2       10     ‘character varB10’   700         ‘character not linked to varB’
2         2        date2        1       ‘character varB1’     700           ‘character not linked to varB’
3       10       date10       5       ‘character varB5’     400            ‘character not linked to varB’
3         11      date11      6      ‘character varB6’      600             ‘character not linked to varB’
3         11       date11    3       ‘character varB3’      600              ‘character not linked to varB’


Should look like:
ID      X       A              B                     C                        D                            E
1        3      date3       1          ‘charactervarB1’              101                    ‘character not linked to varB’
1        3       date 3     2          ‘character varB2’ 

1       10       date10    5         ‘character varB5’             300                    ‘character not linked to varB’
2       1         date1      13       ‘character varB13’           500                   ‘character not linked to varB’
2       1        date1       10         ‘character varB10’
2       2        date2        10         ‘character varB10’          700                ‘character not linked to varB’
2       2       date2         1           ‘character varB1’
3     10       date10       5          ‘character varB5’             400               ‘character not linked to varB’
3    11         date11      6            'character varB6’            600              ‘character not linked to varB’
3    11           date11      3           'character varB3’

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User
Is this for presentation purposes? Do you need it this way in a report or something? For most programming purposes you need the first one. If it's just for display, then use PROC REPORT and assign D/E as GROUP variables.

View solution in original post

14 REPLIES 14
ballardw
Super User

It appears that you want to use A not ID for identifying duplicates.

 

BY ID A;

if not first.A then call missing(D, E);

 

maybe.  You don't show actual values for date an may not work.

 

Without an actual data set can't test.

Mscarboncopy
Pyrite | Level 9

Thank you. I can't use A,  as I need all the dates, even the repeating ones.

Reeza
Super User
Using it to set the other values to missing doesn't affect the A value itself...if the code suggested doesn't work can you explain why with some data examples that illustrate those cases?
Mscarboncopy
Pyrite | Level 9

When I use A all goes to missing except the very first row:

ID   X                      A             B       C             D            E
1      1               2/25/2004      2      ‘yellow’   100       ‘long text1’
1     1                  2/25/2004    14     ‘black’    100       ‘long text1’
1     2                   2/26/2004     2      ‘green’    300       ‘longtext2’
1     2                   2/26/2004     13      ‘white’    300      ‘longtext2’
1     2                  2/26/2004       14      ‘blue’      300      ‘longtext2’

ballardw
Super User

@Mscarboncopy wrote:

When I use A all goes to missing except the very first row:

ID   X                      A             B       C             D            E
1      1               2/25/2004      2      ‘yellow’   100       ‘long text1’
1     1                  2/25/2004    14     ‘black’    100       ‘long text1’
1     2                   2/26/2004     2      ‘green’    300       ‘longtext2’
1     2                   2/26/2004     13      ‘white’    300      ‘longtext2’
1     2                  2/26/2004       14      ‘blue’      300      ‘longtext2’


IF that is output you need to show the corresponding INPUT and the actual code you ran, not one line of code but the entire datastep. Your example here isn't very good at showing "all goes to missing".

Mscarboncopy
Pyrite | Level 9

I don't have an input in my sas code,  as I am using a very long file that is given to me in SPSS format which I transform it to a sas file.  What I showed previously is  my input for ID 1. I am trying to have the following output:

ID              X          A                 B           C             D           E
1               1           2/25/2004    2      ‘yellow’       100        ‘long text1’
1               1           2/25/2004   14       ‘black’      
1               2          2/26/2004     2       ‘green’       300       ‘longtext2’
1               2          2/26/2004    13        ‘white’       
1               2          2/26/2004    14         ‘blue’       

 

Thank you.

 

 

 

ballardw
Super User

@Mscarboncopy wrote:

I don't have an input in my sas code,  as I am using a very long file that is given to me in SPSS format which I transform it to a sas file.  What I showed previously is  my input for ID 1. I am trying to have the following output:

ID              X          A                 B           C             D           E
1               1           2/25/2004    2      ‘yellow’       100        ‘long text1’
1               1           2/25/2004   14       ‘black’      
1               2          2/26/2004     2       ‘green’       300       ‘longtext2’
1               2          2/26/2004    13        ‘white’       
1               2          2/26/2004    14         ‘blue’       

 

Thank you.

 

 

 


And the code attempted? The entire data step please.

Mscarboncopy
Pyrite | Level 9

I used your suggestion but made a slight mistake. When Reeza mentioned the output would have been correct, I went back and found my mistake. I answered above with my apologies. Your suggestion worked perfectly.  It is the solution. Thank you.

Reeza
Super User

Then you aren't using BallardW code, I suspect you're still using If First.A

if not first.A then call missing(D, E);

Mscarboncopy
Pyrite | Level 9

Thank you Reeza. My apologies to both of you, as I made a mistake when I tried BallardW code. it worked. That is what I needed.

Mscarboncopy
Pyrite | Level 9

Thank you so much. I tried your suggestion but made a slight mistake at first.

It works perfectly.

 

Reeza
Super User
Is this for presentation purposes? Do you need it this way in a report or something? For most programming purposes you need the first one. If it's just for display, then use PROC REPORT and assign D/E as GROUP variables.
Mscarboncopy
Pyrite | Level 9

Thank you. Yes, that would be another good way to go about it. I was under the impression that the data set could be cleaned, but it is okay if it is a limitation. I can definitely create reports that would take care of that as you suggested. Good point. Much appreciated. 

Reeza
Super User
It's not a limitation, it's trivial to set this in a data set (see BallardW code), but then it becomes problematic if you want to identify all D=600 so it's not a super useful or efficient structure. The only practical use for such a data structure is typically reporting.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 2373 views
  • 2 likes
  • 3 in conversation