Solved: Joining a data set with missing values, to a data set with ID Codes

geneshackman · Posted 12-23-2022 10:53 AM

Hi all. I hope everyone is healthy.

Suppose I have a data set like this:

county	Treatment	Raceth	Percent
CountyA	vaccine	All	43
CountyB	vaccine	All	48
CountyC	vaccine	All	27
CountyD	vaccine	All	31
CountyE	vaccine	All	12
CountyF	vaccine	All	52
CountyG	vaccine	All	18
CountyA	vaccine	White	76
CountyB	vaccine	White	56
CountyD	vaccine	White	94
CountyE	vaccine	White	25
CountyF	vaccine	White	88
CountyG	vaccine	White	44
CountyA	vaccine	Black	93
CountyD	vaccine	Black	54
CountyE	vaccine	Black	57
CountyG	vaccine	Black	81
CountyB	vaccine	Asian	76
CountyC	vaccine	Asian	6
CountyD	vaccine	Asian	34
CountyD	vaccine	Hispanic	12
CountyF	vaccine	Hispanic	77
CountyG	vaccine	Hispanic	70

And another data set with ID codes for county

county	CountyID
CountyA	c01
CountyB	c02
CountyC	c03
CountyD	c04
CountyE	c05
CountyF	c06
CountyG	c07

How do I get a data set that has, for each race ethnic group, one line for each county ID code, even when the first data set has missing lines for those counties. So for example, the output data set would include these lines

county	CountyID	Treatment	Raceth	Percent
CountyA	c01	vaccine	Asian
CountyB	c02	vaccine	Asian	76
CountyC	c03	vaccine	Asian	6
CountyD	c04	vaccine	Asian	34
CountyE	c05	vaccine	Asian
CountyF	c06	vaccine	Asian
CountyG	c07	vaccine	Asian
CountyA	c01	vaccine	Hispanic
CountyB	c02	vaccine	Hispanic
CountyC	c03	vaccine	Hispanic
CountyD	c04	vaccine	Hispanic	12
CountyE	c05	vaccine	Hispanic
CountyF	c06	vaccine	Hispanic	77
CountyG	c07	vaccine	Hispanic	70

Thanks

Ksharp · Posted 12-25-2022 04:49 AM

data county ;
	infile cards ;
	input county $ countyID $;
cards ;
CountyA	c01
CountyB	c02
CountyC	c03
CountyD	c04
CountyE	c05
CountyF	c06
CountyG	c07
;
run ;
data have;
infile cards expandtabs truncover;
input county $ Treatment $ Raceth $ Percent;
cards;
CountyA	vaccine	All	43
CountyB	vaccine	All	48
CountyC	vaccine	All	27
CountyD	vaccine	All	31
CountyE	vaccine	All	12
CountyF	vaccine	All	52
CountyG	vaccine	All	18
CountyA	vaccine	White	76
CountyB	vaccine	White	56
CountyD	vaccine	White	94
CountyE	vaccine	White	25
CountyF	vaccine	White	88
CountyG	vaccine	White	44
CountyA	vaccine	Black	93
CountyD	vaccine	Black	54
CountyE	vaccine	Black	57
CountyG	vaccine	Black	81
CountyB	vaccine	Asian	76
CountyC	vaccine	Asian	6
CountyD	vaccine	Asian	34
CountyD	vaccine	Hispanic	12
CountyF	vaccine	Hispanic	77
CountyG	vaccine	Hispanic	70
;

proc sql;
create table want as
select a.*,b.percent
from ( select * from
(select distinct county,countyID from county),(select distinct Treatment,Raceth from have)
) as a 
     natural left join
     have as b 
 order by Raceth,countyID;
quit;

View solution in original post

AMSAS · Posted 12-23-2022 11:29 AM

Including Nonmatching Rows with the Left Outer Join

Take a look at the link above, and here's a simpler example with your data:

data county ;
	infile cards ;
	input county $ countyID $;
cards ;
CountyA	c01
CountyB	c02
CountyC	c03
CountyD	c04
CountyE	c05
CountyF	c06
CountyG	c07
;
run ;

data treatment ;
	infile cards;
	input county $ percent ;
cards ;
CountyB	10
CountyD	30
CountyE	40
CountyG	20
;
run ;

proc sql ;
	create table 
		work.want as
	select 
		county.*,
		treatment.percent 
	from
		work.county left join work.treatment
	on
	county.county=treatment.county 
;
quit ;

geneshackman · Posted 12-23-2022 11:39 AM

Hi Amsas,

Thanks for your response. The example data set I used was a little complex specifically because I wanted to be able to do this when there were different subgroups in the same data set. Would your method work when there are subgroups in the data set that has the measure values?

Sajid01 · Posted 12-24-2022 08:02 PM

The following code will give the outcome what you have desired.
I have taken your first table as treatment and the second table as county.

/* In the following steps a table containing County,CountyID and raceth is created*/
proc sql;
create table race as
select distinct UPCASE(TRIM(RACETH)) as RACETH from treatment
/*where  upcase(raceth)  in ('ASIAN','HISPANIC') */
order by raceth;
quit;
proc sql;
create table c0 as
select * from county (keep=county), race;
quit;
proc sql;
create table c1 as
select a.County, b.CountyID, a.raceth from c0 a, county b 
where a.county=b.county;
quit;

/*Combining the above table C1 with treatment table to give the desired outcome */
proc sql;
create table combined as
select a.county,a.countyID,a.raceth,"Vaccine" as Treatment, B.Percent from c1 A left  join Treatment B
on A.county=B.county and upcase(a.raceth)=upcase(b.raceth)
order by  raceth,county;
quit;

The outcome is as you had desired.

proc sql;
create table race as
select distinct UPCASE(TRIM(RACETH)) as RACETH from treatment
/*where  upcase(raceth)  in ('ASIAN','HISPANIC') */
order by raceth;
quit;
proc sql;
create table c0 as
select * from county (keep=county), race;
quit;
proc sql;
create table c1 as
select a.County, b.CountyID, a.raceth from c0 a, county b 
where a.county=b.county;
quit;

proc sql;
create table race as
select distinct UPCASE(TRIM(RACETH)) as RACETH from treatment
/*where  upcase(raceth)  in ('ASIAN','HISPANIC') */
order by raceth;
quit;
proc sql;
create table c0 as
select * from county (keep=county), race;
quit;

proc sql;
create table race as
select distinct UPCASE(TRIM(RACETH)) as RACETH from treatment
/*where  upcase(raceth)  in ('ASIAN','HISPANIC') */
order by raceth;
quit;
proc sql;
create table c0 as
select * from county (keep=county), race;
quit;The outcome will be as follows. This is the same as you wanted.

Ksharp · Posted 12-25-2022 04:49 AM

data county ;
	infile cards ;
	input county $ countyID $;
cards ;
CountyA	c01
CountyB	c02
CountyC	c03
CountyD	c04
CountyE	c05
CountyF	c06
CountyG	c07
;
run ;
data have;
infile cards expandtabs truncover;
input county $ Treatment $ Raceth $ Percent;
cards;
CountyA	vaccine	All	43
CountyB	vaccine	All	48
CountyC	vaccine	All	27
CountyD	vaccine	All	31
CountyE	vaccine	All	12
CountyF	vaccine	All	52
CountyG	vaccine	All	18
CountyA	vaccine	White	76
CountyB	vaccine	White	56
CountyD	vaccine	White	94
CountyE	vaccine	White	25
CountyF	vaccine	White	88
CountyG	vaccine	White	44
CountyA	vaccine	Black	93
CountyD	vaccine	Black	54
CountyE	vaccine	Black	57
CountyG	vaccine	Black	81
CountyB	vaccine	Asian	76
CountyC	vaccine	Asian	6
CountyD	vaccine	Asian	34
CountyD	vaccine	Hispanic	12
CountyF	vaccine	Hispanic	77
CountyG	vaccine	Hispanic	70
;

proc sql;
create table want as
select a.*,b.percent
from ( select * from
(select distinct county,countyID from county),(select distinct Treatment,Raceth from have)
) as a 
     natural left join
     have as b 
 order by Raceth,countyID;
quit;

geneshackman · Posted 12-25-2022 05:09 PM

Hi. I think this gets me the output I want. The log also has this:

NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.

I don't know if this is important, though, as the output still looks like what I want.

Thanks

Gene

Ksharp · Posted 12-26-2022 04:35 AM

That doesn't matter. That note imply the code of SQL would cost you a lot of time if you have a big table .

geneshackman · Posted 12-29-2022 10:47 AM

Thanks! I had a chance to try this and it worked.

Joining a data set with missing values, to a data set with ID Codes

Re: Joining a data set with missing values, to a data set with ID Codes

Re: Joining a data set with missing values, to a data set with ID Codes

Re: Joining a data set with missing values, to a data set with ID Codes

Re: Joining a data set with missing values, to a data set with ID Codes

Re: Joining a data set with missing values, to a data set with ID Codes

Re: Joining a data set with missing values, to a data set with ID Codes

Re: Joining a data set with missing values, to a data set with ID Codes

Re: Joining a data set with missing values, to a data set with ID Codes

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away