Hi! I have this dataset with old and new birth dates for each name. I'd like to get the birth date from the rows with Source = "NEW". But for names with no Source = "NEW" (as in the case below SARAH), I'd like to retain the old record.
I'd like to do it with only one SQL statement without need of joining tables, I'm thinking of a group by and min/ max statement.. but I can't get it right. Or is it not possible? Thank you.
data have;
infile datalines missover;
input NAME :$5. BIRTHDATE :$8. SOURCE :$3. ADDRESS :$8.;
datalines;
ANGEL 09122018 OLD USA
ANGEL 09112018 NEW
FELIX 07252017 OLD UK
FELIX 07252017 NEW
SARAH 09092018 OLD CANADA
;
run;
data want;
infile datalines missover;
input NAME :$5. BIRTHDATE :$8. SOURCE :$3. ADDRESS :$8.;
datalines;
ANGEL 09112018 NEW USA
FELIX 07252017 NEW UK
SARAH 09092018 OLD CANADA
;
run;
data have;
infile datalines missover;
input NAME :$5. BIRTHDATE :$8. SOURCE :$3. ADDRESS :$8.;
datalines;
ANGEL 09122018 OLD USA
ANGEL 09112018 NEW
FELIX 07252017 OLD UK
FELIX 07252017 NEW
SARAH 09092018 OLD CANADA
;
run;
proc sql;
create table want as
select *,max(ADDRESS) as new_ADDRESS
from have
group by name
having SOURCE=min(SOURCE) ;
quit;
If NEW is always the last record for the person, keep that last record.
data WANT; set HAVE; by NAME ; if last.NAME; run;
If not, sort and apply that same logic.
SAS BY grouping is really powerful in these situations. You can easily do it in SAS, in SQL it can be done but it's not as easy. I would do, two passes, one with the new only and then for records that have a single record, take that.
proc sql;
create table want as
/*select new records*/
select *
from have
where source='NEW'
union
/*add in records with a single entry and source='OLD', single NEW would already be included if possible*/
select *
from have
group by name
having count(name)=1 and source='OLD';
quit;
data have;
infile datalines missover;
input NAME :$5. BIRTHDATE :$8. SOURCE :$3. ADDRESS :$8.;
datalines;
ANGEL 09122018 OLD USA
ANGEL 09112018 NEW
FELIX 07252017 OLD UK
FELIX 07252017 NEW
SARAH 09092018 OLD CANADA
;
run;
proc sql;
create table want(drop=s) as
select *, SOURCE='NEW' as s
from have
group by name
having max(S)=s ;
quit;
Clever!
Thank you sir @PGStats One day I want to code like you. That's exactly the same I told my parents. i was advised by them to copy you shamelessly. Have a great night!
Note that @novinosrin's query can be extended to keep the address when there is one, like this:
proc sql;
create table want(drop=s) as
select
NAME,
Birthdate,
Source,
coalesce(address, max(address)) as address,
SOURCE='NEW' as s
from have
group by name
having max(S)=s ;
quit;
This works too ?
proc sql;
create table want(drop=s) as
select NAME,
Birthdate,
Source,
max(address) as address,
SOURCE='NEW' as s
from have
group by name
having max(S)=s ;
quit;
Sir, no need for coalesce coz implicit remerge will take care of it anyway. I was a little tipsy to not pay attention to it in the 1st place. Thanks as always and again
proc sql;
create table want(drop=s) as
select
NAME,
Birthdate,
Source,
max(address) as address,
SOURCE='NEW' as s
from have
group by name
having max(S)=s ;
quit;
@novinosrin, @ChrisNZ, well, I thought that if there was an address in a NEW record, I would want that one, even if the address in the OLD record was greater (lexicographically).
@PGStats Good thinking!
I did it slightly different, since the other solutions didn't seem to resolve keeping the address. I still think a solution using a data step with a by statement is probably better.
proc sql;
create table want as
select distinct
t1.name
,case when t2.name ne '' then t2.birthdate else t1.birthdate end as birthdate
,case when t2.name ne '' then t2.source else t1.source end as source
,t1.address
from have as t1
left join (select * from have where source = 'NEW') as t2
on t1.name=t2.name
where t1.address ne ''
;
quit;
data have;
infile datalines missover;
input NAME :$5. BIRTHDATE :$8. SOURCE :$3. ADDRESS :$8.;
datalines;
ANGEL 09122018 OLD USA
ANGEL 09112018 NEW
FELIX 07252017 OLD UK
FELIX 07252017 NEW
SARAH 09092018 OLD CANADA
;
run;
proc sql;
create table want as
select *,max(ADDRESS) as new_ADDRESS
from have
group by name
having SOURCE=min(SOURCE) ;
quit;
Thank you everyone here for presenting various solutions 🙂
I like the Coalesce idea. I forgot about using it!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.