DATA Step, Macro, Functions and more

update dataset

Reply
Super Contributor
Posts: 647

update dataset

After uploading the data from various sources we realized that patient birth year is wrong.How to update the dataset with new values for patient birth year without effecting the data and without having to upload each file to fix the wrong field.
The identifying key is by the provider.We have 15 providers like:
Aetna,Axium etc etc.So for each provider I need to update the patient birth year.
Super Contributor
Posts: 359

Re: update dataset

A starting point.

proc sql;
update datatable set birthdate = (select birthdate from source_table where
**match criteria**);
Quit;
Super Contributor
Posts: 647

Re: update dataset

I tried the following:
proc sql;
update final.spdata_2008_v1 as S set patient_birth_yr = (select r.patient_birth_yr from up_date as R where S.provider=R.provider);
Quit;

I got this error message:
5417 proc sql;
5418 update final.spdata_2008_v1 as S set patient_birth_yr = (select r.patient_birth_yr from try as R where S.provider=R.provider);
ERROR: Subquery evaluated to more than one row.
NOTE: Correlation values are: Provider='DIPLOMAT PHARMACY' .
5419 Quit;
Super Contributor
Posts: 359

Re: update dataset

Are all birth years for a given provider the same? That would seem strange. If not, you need more identifying info such as a patient identifier. If they are then a 'distinct' on the select might help.

The message is self explanitory. You are trying to update a single value with more than one result. Your subquery must return a single row.
Super Contributor
Posts: 647

Re: update dataset

Flip,
I tried this:
proc sql;
update final.spdata_2008_v1 as S set patient_birth_yr = (select distinct r.patient_birth_yr from up_date as R where S.provider=R.provider and s.patient_id=r.patient_id);
Quit;
Still gettig the error.
Super User
Posts: 5,260

Re: update dataset

The combination of provider and patient id must be unique in both tables to make this work, otherwise your subquery may result in with multiple rows.
/Linus
Data never sleeps
Contributor
Posts: 48

Re: update dataset

the problem when using distinct is that your subquery still gets more than one row for the same provider and patient_id. That's why distinct will not help.

Simple fix is that you update your where clause with some extra condition or group by statement.

The simliest way could be using min or max function with distinct.
But i don't think that is correct with some business logic.

(select distinct min(r.patient_birth_yr) from up_date as R where S.provider=R.provider and s.patient_id=r.patient_id);
Super Contributor
Posts: 359

Re: update dataset

Actually Linus, the unique condition only needs to be present on the subquery. It can update multiple records on the target table with a single record from the subquery.

That said, the thing that you must obtain is a unique record returned from the subquery for each record on the target table. We can only guess at the overall structure of your data, so you must figure out the best way to get a single record returned. Distinct may do it, assigning a record number and selecting the first or last one in a partition, are a couple of options.

Try collecting data from your source for a single patient to see the structure of the duplication. You may want to create a temporary dataset to eliminate dupplicates if that is easier for you to clean up.
Ask a Question
Discussion stats
  • 7 replies
  • 332 views
  • 0 likes
  • 4 in conversation