Re: How to make dataset one record per client?

Maisha_Huq · Posted 03-24-2014 03:00 PM

Hi All:

I have a dataset which has 3000+ observations but which has only 544 unique records; the rest are duplicate observations.

ID in_sourcefile1 in_sourcefile2 in_sourcefile3 <list of other vars where each observation has appropriate data filled in>
1 1

2 1

3 1

4 1

36 1

52 1

101 1

1 1

2 1

3 1

1 1

2 1

52 1

.

Each duplicate observation for a client exists because it points out what other service file the record is in.

I would want the final dataset to look like this:

ID in_sourcefile1 in_sourcefile2 in_sourcefile3 ... in_sourcefile30 <other vars>

1 1 1 1 x

2 1 1 1 x

3 1 1 1 1 x

4 1 . . . .

.

36 1 . . . x

.

52 1 . . 1 x

.

Could you let me know the most efficient way to do this?

Is update in a data step better than proc transpose if I have a lot of these in_sourcefile variables?

Thank you very much

ballardw · Posted 03-24-2014 05:35 PM

Do any of the <other vars> have different values for one or more variables based on which of the in_sourcefileX is indicated?

Maisha_Huq · Posted 03-24-2014 06:20 PM

Yes, they do actually...

Maisha_Huq · Posted 03-24-2014 06:36 PM

Will it be possible in this case?

The list of other vars is 1000+ variables long...

ballardw · Posted 03-24-2014 06:46 PM

The question actually becomes do you want those other variables, if so, the ones from which record.

I submit that you actually do not have "only 544 unique records" but probably 544 unique ID values. The data structure you want may well depend upon what you are going to do next.

I suspect that you might be better off transposing the data so that you you have a SourceField would have values of 1 to 30 depending on which of the 30 in_sourcefile variables is indicated.

Maisha_Huq · Posted 03-24-2014 09:09 PM

Hi ballardw,

Yes, that's right 544 unique IDs, not records. I want to keep all of the other variables.

Where an "in_sourcefile=1", a group of the "other" vars have values which I want to keep.

If an in_sourcefile=null (cannot equal 0), I do not want to keep the values for that group of other variables.

Any ideas?

RichardinOz · Posted 03-24-2014 07:30 PM

If the following conditions are met in your data

- values of in_sourcefilen are either 1 or null (not zero)

- the most up to date values of the 'other' variables is in the last (or in all) record for each ID

then the following approach might work:

Proc sort equals /* preserves relative order within id */

data = have

out = step1

;

By ID ;

Run ;

/* create master dataset with unique ids */

Data want ;

Set step1 (Keep = ID) ;

By ID ;

If First.ID ;

Run ;

/* use update to add the remaining data */

data want ;

update want

step1

;

By ID ;

Run ;

Warning - untested code

Even if it works, examine closely to ensure this is the result you want.

Richard

Maisha_Huq · Posted 03-24-2014 08:47 PM

Hi Richard,

While the first condition is met, the second isn't (the last obs for each ID doesn't have the most up to date values.) I guess this means the mentioned steps won't work?

For background: The dataset is the result of stacking 30ish datasets, where each dataset has a different set of unique IDs, its own in_sourcefile variable (values 1 or null), and its own set of created variables. When stacked, the output dataset therefore has multiple records for each ID, where each additional record carries information for a one of those 30 datasets' "in_sourcefile" variable and the data for the variables created within that dataset.

RichardinOz · Posted 03-24-2014 10:19 PM

Maisha

The method I outlined will work if those 'other' variables are null except where the relevant in_sourcefilen = 1.

However, there is probably a better way to merge the data in the first place, if you still have access to the original source files. Assuming they are all sorted by ID:

Data want ;

Length ID in_sourcefile1 - in_sourcefileN 8 ;

Merge file1 (in = in_file1 keep = ID <list of variables to keep> )

file2 (in = In_file2 keep = ID <list of variables to keep> )

...

;

By ID ;

Array in_source {*} in_sourcefile: ; /* note semicolon modifier */

Array in_file {*} in_file: ; /* note semicolon modifier */

Do k = 1 to DIM (in_source) ;

If in_file {k} then in_source {k} = 1 ;

End ;

Run ;

If there is no overlap between the variables in each of the source files, other than ID, you can omit the keep = options.

Richard

pradeepalankar · Posted 03-25-2014 05:14 AM

Hi Maisha,

it seems you want to collapse your observations based on ID then below code will work for you,

data test;

input ID in_sourcefile1 in_sourcefile2 in_sourcefile3 ;

cards;

1 1 . .

2 1 . .

3 1 . .

4 1 . .

36 1 . .

52 1 . .

101 1 . .

1 . 1 .

2 . 1 .

3 . 1 .

1 . . 1

2 . . 1

52 . . 1

;

proc sort data=test;by id;

data test;

update test(obs=0) test;

by id;

run;

output:

ID	in_sourcefile1	in_sourcefile2	in_sourcefile3
1	1	1	1
2	1	1	1
3	1	1
4	1
36	1
52	1		1
101	1

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away