Hi All:
I have a dataset which has 3000+ observations but which has only 544 unique records; the rest are duplicate observations.
ID in_sourcefile1 in_sourcefile2 in_sourcefile3 <list of other vars where each observation has appropriate data filled in>
1 1
2 1
3 1
4 1
36 1
52 1
101 1
1 1
2 1
3 1
1 1
2 1
52 1
.
.
.
Each duplicate observation for a client exists because it points out what other service file the record is in.
I would want the final dataset to look like this:
ID in_sourcefile1 in_sourcefile2 in_sourcefile3 ... in_sourcefile30 <other vars>
1 1 1 1 x
2 1 1 1 x
3 1 1 1 1 x
4 1 . . . .
.
.
36 1 . . . x
.
.
52 1 . . 1 x
.
.
Could you let me know the most efficient way to do this?
Is update in a data step better than proc transpose if I have a lot of these in_sourcefile variables?
Thank you very much
Do any of the <other vars> have different values for one or more variables based on which of the in_sourcefileX is indicated?
Yes, they do actually...
Will it be possible in this case?
The list of other vars is 1000+ variables long...
The question actually becomes do you want those other variables, if so, the ones from which record.
I submit that you actually do not have "only 544 unique records" but probably 544 unique ID values. The data structure you want may well depend upon what you are going to do next.
I suspect that you might be better off transposing the data so that you you have a SourceField would have values of 1 to 30 depending on which of the 30 in_sourcefile variables is indicated.
Hi ballardw,
Yes, that's right 544 unique IDs, not records. I want to keep all of the other variables.
Where an "in_sourcefile=1", a group of the "other" vars have values which I want to keep.
If an in_sourcefile=null (cannot equal 0), I do not want to keep the values for that group of other variables.
Any ideas?
If the following conditions are met in your data
- values of in_sourcefilen are either 1 or null (not zero)
- the most up to date values of the 'other' variables is in the last (or in all) record for each ID
then the following approach might work:
Proc sort equals /* preserves relative order within id */
data = have
out = step1
;
By ID ;
Run ;
/* create master dataset with unique ids */
Data want ;
Set step1 (Keep = ID) ;
By ID ;
If First.ID ;
Run ;
/* use update to add the remaining data */
data want ;
update want
step1
;
By ID ;
Run ;
Warning - untested code
Even if it works, examine closely to ensure this is the result you want.
Richard
Hi Richard,
While the first condition is met, the second isn't (the last obs for each ID doesn't have the most up to date values.) I guess this means the mentioned steps won't work?
For background: The dataset is the result of stacking 30ish datasets, where each dataset has a different set of unique IDs, its own in_sourcefile variable (values 1 or null), and its own set of created variables. When stacked, the output dataset therefore has multiple records for each ID, where each additional record carries information for a one of those 30 datasets' "in_sourcefile" variable and the data for the variables created within that dataset.
Maisha
The method I outlined will work if those 'other' variables are null except where the relevant in_sourcefilen = 1.
However, there is probably a better way to merge the data in the first place, if you still have access to the original source files. Assuming they are all sorted by ID:
Data want ;
Length ID in_sourcefile1 - in_sourcefileN 8 ;
Merge file1 (in = in_file1 keep = ID <list of variables to keep> )
file2 (in = In_file2 keep = ID <list of variables to keep> )
...
;
By ID ;
Array in_source {*} in_sourcefile: ; /* note semicolon modifier */
Array in_file {*} in_file: ; /* note semicolon modifier */
Do k = 1 to DIM (in_source) ;
If in_file {k} then in_source {k} = 1 ;
End ;
Run ;
If there is no overlap between the variables in each of the source files, other than ID, you can omit the keep = options.
Richard
Hi Maisha,
it seems you want to collapse your observations based on ID then below code will work for you,
data test;
input ID in_sourcefile1 in_sourcefile2 in_sourcefile3 ;
cards;
1 1 . .
2 1 . .
3 1 . .
4 1 . .
36 1 . .
52 1 . .
101 1 . .
1 . 1 .
2 . 1 .
3 . 1 .
1 . . 1
2 . . 1
52 . . 1
;
proc sort data=test;by id;
data test;
update test(obs=0) test;
by id;
run;
output:
ID | in_sourcefile1 | in_sourcefile2 | in_sourcefile3 |
1 | 1 | 1 | 1 |
2 | 1 | 1 | 1 |
3 | 1 | 1 | |
4 | 1 | ||
36 | 1 | ||
52 | 1 | 1 | |
101 | 1 |
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.