How to make dataset one record per client?

Reply
Frequent Contributor
Posts: 131

How to make dataset one record per client?

Hi All:

I have a dataset which has 3000+ observations but which has only 544 unique records; the rest are duplicate observations.

ID               in_sourcefile1    in_sourcefile2  in_sourcefile3                        <list of other vars where each observation has appropriate data filled in>
1                     1

2                     1

3                     1

4                     1

36                   1

52                   1     

101                 1

1                                                   1                 

2                                                   1

3                                                   1

1                                                                           1

2                                                                           1

52                                                                          1

.

.

.

Each duplicate observation for a client exists because it points out what other service file the record is in.

I would want the final dataset to look like this:

ID                    in_sourcefile1         in_sourcefile2          in_sourcefile3       ...       in_sourcefile30          <other vars>

1                          1                                    1                                 1                                                                   x

2                          1                                    1                                 1                                                                   x          

3                         1                                     1                                 1                                      1                           x                                      

4                         1                                   .                                   .                                        .                               . 

.

.

36                       1                                       .                                      .                                 .                           x

.

.

52                        1                                    .                                   .                                         1                      x

.

.

Could you let me know the most efficient way to do this?

Is update in a data step better than proc transpose if I have a lot of these in_sourcefile variables?

Thank you very much

Super User
Posts: 11,343

Re: How to make dataset one record per client?

Posted in reply to Maisha_Huq

Do any of the <other vars> have different values for one or more variables based on which of the in_sourcefileX is indicated?

Frequent Contributor
Posts: 131

Re: How to make dataset one record per client?

Yes, they do actually...

Frequent Contributor
Posts: 131

Re: How to make dataset one record per client?

Posted in reply to Maisha_Huq

Will it be possible in this case?

The list of other vars is 1000+ variables long...

Super User
Posts: 11,343

Re: How to make dataset one record per client?

Posted in reply to Maisha_Huq

The question actually becomes do you want those other variables, if so, the ones from which record.

I submit that you actually do not have "only 544 unique records" but probably 544 unique ID values. The data structure you want may well depend upon what you are going to do next.

I suspect that you might be better off transposing the data so that you you have a SourceField would have values of 1 to 30 depending on which of the 30 in_sourcefile variables is indicated.

Frequent Contributor
Posts: 131

Re: How to make dataset one record per client?

Hi ballardw,

Yes, that's right 544 unique IDs, not records.  I want to keep all of the other variables.

Where an "in_sourcefile=1", a group of the "other" vars have values which I want to keep.

If an in_sourcefile=null (cannot equal 0), I do not want to keep the values for that group of other variables.

Any ideas?

Super Contributor
Posts: 644

Re: How to make dataset one record per client?

Posted in reply to Maisha_Huq

If the following conditions are met in your data

- values of in_sourcefilen are either 1 or null (not zero)

- the most up to date values of the 'other' variables is in the last (or in all) record for each ID

then the following approach might work:

Proc sort equals /* preserves relative order within id */

          data = have

          out = step1

          ;

     By     ID ;

Run ;

/* create master dataset with unique ids */

Data want ;

     Set step1 (Keep = ID) ;

     By ID ;

     If First.ID ;

Run ;

/* use update to add the remaining data */

data want ;

     update want

               step1

               ;

     By ID ;

Run ;

Warning - untested code

Even if it works, examine closely to ensure this is the result you want.

Richard

Frequent Contributor
Posts: 131

Re: How to make dataset one record per client?

Posted in reply to RichardinOz

Hi Richard,

While the first condition is met, the second isn't (the last obs for each ID doesn't have the most up to date values.) I guess this means the mentioned steps won't work?

For background:  The dataset is the result of stacking 30ish datasets, where each dataset has a different set of unique IDs, its own in_sourcefile variable (values 1 or null), and its own set of created variables.  When stacked, the output dataset therefore has multiple records for each ID, where each additional record carries information for a one of those 30 datasets' "in_sourcefile" variable and the data for the variables created within that dataset.

Super Contributor
Posts: 644

Re: How to make dataset one record per client?

Posted in reply to Maisha_Huq

Maisha

The method I outlined will work if those 'other' variables are null except where the relevant in_sourcefilen = 1.

However, there is probably a better way to merge the data in the first place, if you still have access to the original source files.  Assuming they are all sorted by ID:

Data want ;

     Length    ID  in_sourcefile1 - in_sourcefileN      8 ;

     Merge     file1 (in = in_file1 keep = ID <list of variables to keep> )

                    file2 (in = In_file2 keep = ID <list of variables to keep> )

                    ...

                    ;

     By     ID ;

     Array in_source {*} in_sourcefile: ;     /* note semicolon modifier */

     Array in_file {*} in_file: ;                     /* note semicolon modifier */

     Do k = 1 to DIM (in_source) ;

          If in_file {k} then in_source {k} = 1 ;

     End ;

Run ;

If there is no overlap between the variables in each of the source files, other than ID, you can omit the keep = options.

Richard

Frequent Contributor
Posts: 106

Re: How to make dataset one record per client?

Posted in reply to Maisha_Huq

Hi Maisha,

it seems you want to collapse your observations based on ID then below code will work for you,

data test;

input ID in_sourcefile1    in_sourcefile2  in_sourcefile3 ;

cards;

1 1 . .

2 1 . .

3 1 . .

4 1 . .

36 1 . .

52 1 . .

101 1 . .

1 . 1 .

2 . 1 .

3 . 1 .

1 . . 1

2 . . 1

52 . . 1

;

proc sort data=test;by id;

data test;

update  test(obs=0) test;

by id;

run;

output:

IDin_sourcefile1in_sourcefile2in_sourcefile3
1111
2111
311
41
361
5211
1011
Ask a Question
Discussion stats
  • 9 replies
  • 299 views
  • 6 likes
  • 4 in conversation