Maisha_Huq
Quartz | Level 8

Hi All:

I have a dataset which has 3000+ observations but which has only 544 unique records; the rest are duplicate observations.

ID      in_sourcefile1   in_sourcefile2   in_sourcefile3   <list of other vars where each observation has appropriate data filled in>
1       1
2       1
3       1
4       1
36      1
52      1
101     1
1                        1
2                        1
3                        1
1                                         1
2                                         1
52                                        1
...

Each duplicate observation for a client exists only to indicate which other source file that client's record appears in.

I would like the final dataset to look like this:

ID      in_sourcefile1   in_sourcefile2   in_sourcefile3   ...   in_sourcefile30   <other vars>
1       1                1                1                                        x
2       1                1                1                                        x
3       1                1                1                      1                 x
4       1                .                .                      .                 .
...
36      1                .                .                      .                 x
...
52      1                .                .                      1                 x
...

Could you let me know the most efficient way to do this?

Is update in a data step better than proc transpose if I have a lot of these in_sourcefile variables?

Thank you very much

9 REPLIES
ballardw
Super User

Do any of the <other vars> have different values for one or more variables depending on which of the in_sourcefileX flags is indicated?

Maisha_Huq
Quartz | Level 8

Yes, they do actually...

Maisha_Huq
Quartz | Level 8

Will it be possible in this case?

The list of other vars is 1000+ variables long...

ballardw
Super User

The question actually becomes: do you want those other variables, and if so, from which record?

I submit that you actually do not have "only 544 unique records" but probably 544 unique ID values. The data structure you want may well depend upon what you are going to do next.

I suspect that you might be better off transposing the data so that you have a single SourceFile variable with values of 1 to 30, depending on which of the 30 in_sourcefile variables is indicated; see the sketch below.
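A minimal sketch of that idea, using a DATA step array rather than PROC TRANSPOSE, and assuming the stacked data are in a dataset named HAVE with flags named in_sourcefile1-in_sourcefile30 (hypothetical names):

data long;
    set have;
    array ins {30} in_sourcefile1-in_sourcefile30;
    /* write one output row per flagged source file, with SourceFile = 1 to 30 */
    do SourceFile = 1 to 30;
        if ins{SourceFile} = 1 then output;
    end;
    drop in_sourcefile1-in_sourcefile30;
run;

The other variables on each input record are carried along unchanged, so each output row pairs a SourceFile number with the values created in that source file.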

Maisha_Huq
Quartz | Level 8

Hi ballardw,

Yes, that's right 544 unique IDs, not records.  I want to keep all of the other variables.

Where an "in_sourcefile=1", a group of the "other" vars have values which I want to keep.

If an in_sourcefile=null (cannot equal 0), I do not want to keep the values for that group of other variables.

Any ideas?

RichardinOz
Quartz | Level 8

If the following conditions are met in your data

- values of in_sourcefilen are either 1 or null (not zero)

- the most up to date values of the 'other' variables are in the last (or in all) records for each ID

then the following approach might work:

proc sort equals            /* preserves relative order within ID */
          data = have
          out  = step1 ;
    by ID ;
run ;

/* create master dataset with unique IDs */
data want ;
    set step1 (keep = ID) ;
    by ID ;
    if first.ID ;
run ;

/* use update to add the remaining data */
data want ;
    update want
           step1 ;
    by ID ;
run ;

Warning - untested code

Even if it works, examine closely to ensure this is the result you want.
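For example, a quick sanity check (a sketch, assuming the collapsed dataset is named WANT as above) is to confirm that the result has exactly one row per ID, which in your case should be 544 rows:

proc sql;
    select count(*)           as n_rows,
           count(distinct ID) as n_ids
    from want;
quit;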

Richard

Maisha_Huq
Quartz | Level 8

Hi Richard,

While the first condition is met, the second isn't (the last observation for each ID doesn't have the most up-to-date values). I guess this means the steps you outlined won't work?

For background: the dataset is the result of stacking 30-ish datasets, where each dataset has a different set of unique IDs, its own in_sourcefile variable (values 1 or null), and its own set of created variables. When stacked, the output dataset therefore has multiple records for each ID, where each record carries one of those 30 datasets' in_sourcefile flags plus the data for the variables created within that dataset. The stacking is essentially the concatenation sketched below.
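Roughly, the stacking step looks like this (hypothetical file names):

data stacked;
    /* each fileN brings only its own in_sourcefileN flag and its own
       created variables; everything from the other files is missing
       on that file's rows */
    set file1 file2 file3 /* ... */ file30;
run;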

RichardinOz
Quartz | Level 8

Maisha

The method I outlined will work if those 'other' variables are null except where the relevant in_sourcefilen = 1.

However, there is probably a better way to merge the data in the first place, if you still have access to the original source files.  Assuming they are all sorted by ID:

data want ;
    length ID in_sourcefile1 - in_sourcefileN 8 ;
    merge file1 (in = in_file1 keep = ID <list of variables to keep>)
          file2 (in = in_file2 keep = ID <list of variables to keep>)
          ...
          ;
    by ID ;
    array in_source {*} in_sourcefile: ;   /* the colon selects every variable whose name starts with this prefix */
    array in_file   {*} in_file: ;         /* likewise for the in= flag variables */
    do k = 1 to dim(in_source) ;
        if in_file{k} then in_source{k} = 1 ;
    end ;
run ;

If there is no overlap between the variables in each of the source files, other than ID, you can omit the keep = options.

Richard

pradeepalankar
Obsidian | Level 7

Hi Maisha,

It seems you want to collapse your observations based on ID; if so, the code below should work for you. It uses UPDATE with an empty master dataset (test with obs=0), so within each ID the non-missing values from all of that ID's records are combined into a single observation:

data test;
    input ID in_sourcefile1 in_sourcefile2 in_sourcefile3;
cards;
1 1 . .
2 1 . .
3 1 . .
4 1 . .
36 1 . .
52 1 . .
101 1 . .
1 . 1 .
2 . 1 .
3 . 1 .
1 . . 1
2 . . 1
52 . . 1
;

proc sort data=test;
    by id;
run;

data test;
    update test(obs=0) test;
    by id;
run;

output:

ID     in_sourcefile1   in_sourcefile2   in_sourcefile3
1      1                1                1
2      1                1                1
3      1                1                .
4      1                .                .
36     1                .                .
52     1                .                1
101    1                .                .

