I'm trying to figure out a way to compare the value of one variable with a different variable across different observations within a data step, and then output those observations where the data matches. I'm not quite sure if I'm going about this the right way, but I've been trying to do this with a hash table, where I essentially read the dataset in twice -- once in the hash table, and again through the data step. I'm then trying to find all observations in the hash table that match three different criteria -- where the PlantID and PrimeMover are the same, and where the primary fuel matches the secondary fuel. Here is my code so far:
DATA Test2;
if _N_=1 then do;
declare hash h(dataset:'Test', MULTIDATA:'YES');
h.definekey('PlantID','PrimeMover','Secondary');
h.definedata('Primary','Secondary');
h.Definedone();
end;
set Test;
if h.Find(key:PlantID,key:PrimeMover,key:Primary)=0 then output;
RUN;
And here's some sample data:
Year | PlantID | PrimeMover | Primary | Secondary |
2012 | 3 | ST | NG | NG |
2012 | 3 | ST | NG | FO2 |
2012 | 3 | ST | FO2 | NG |
2012 | 3 | ST | NG | BIT |
2012 | 4 | HY | WAT | WAT |
2012 | 7 | ST | BIT | BIT |
2012 | 7 | ST | BIT | NG |
2012 | 7 | ST | NG | BIT |
2012 | 8 | ST | BIT | WC |
2012 | 9 | GT | NG | DFO |
2012 | 9 | GT | NG | NG |
2012 | 10 | GT | NG | DFO |
2012 | 10 | GT | NG | NG |
2012 | 10 | ST | BIT | BIT |
2012 | 10 | ST | BIT | WC |
2012 | 26 | ST | DFO | DFO |
2012 | 26 | ST | BIT | DFO |
2012 | 26 | ST | BIT | BIT |
2012 | 26 | ST | BIT | WC |
If you take plant 3, for example, I want to output the first three observations, because the secondary fuel in the third observation matches the primary fuel in the first observation, and the secondary fuel in the second observation matches the primary fuel in the first observation. I don't care about the fourth observation because its secondary fuel is not overlapping with any of the primary fuels of the others. I know I'm not handling the hash multidata correctly at this point, but I'm mostly concerned with the find statement where I'm trying to get the observations where the secondary key matches the primary field of the current observation in the data step. I'm new to hashtags, so this may not be the best approach. I had also thought about writing a macro that would loop through the dataset twice to get those observations that match each other, but I figured I would try the hashtag approach first. Thanks for your help!
Is this close to what you're looking for? PS. Hashtags are different than Hash Tables, which is different from Hash
Basically join the table to itself and if the secondary matches the primary and the plant id is the same, but take only distinct records.
data have;
input Year PlantID PrimeMover $ Primary $ Secondary $;
cards;
2012 3 ST NG NG
2012 3 ST NG FO2
2012 3 ST FO2 NG
2012 3 ST NG BIT
2012 4 HY WAT WAT
2012 7 ST BIT BIT
2012 7 ST BIT NG
2012 7 ST NG BIT
2012 8 ST BIT WC
2012 9 GT NG DFO
2012 9 GT NG NG
2012 10 GT NG DFO
2012 10 GT NG NG
2012 10 ST BIT BIT
2012 10 ST BIT WC
2012 26 ST DFO DFO
2012 26 ST BIT DFO
2012 26 ST BIT BIT
2012 26 ST BIT WC
;;;;
proc sql;
create table want as
select distinct a.*
from have as a
join have as b
on a.Secondary=b.Primary
and a.PlantID=b.PlantID
and a.Primemover=b.primemover;
quit;
EDIT: add in condition for primemover to be the same as well as plant id.
Is this close to what you're looking for? PS. Hashtags are different than Hash Tables, which is different from Hash
Basically join the table to itself and if the secondary matches the primary and the plant id is the same, but take only distinct records.
data have;
input Year PlantID PrimeMover $ Primary $ Secondary $;
cards;
2012 3 ST NG NG
2012 3 ST NG FO2
2012 3 ST FO2 NG
2012 3 ST NG BIT
2012 4 HY WAT WAT
2012 7 ST BIT BIT
2012 7 ST BIT NG
2012 7 ST NG BIT
2012 8 ST BIT WC
2012 9 GT NG DFO
2012 9 GT NG NG
2012 10 GT NG DFO
2012 10 GT NG NG
2012 10 ST BIT BIT
2012 10 ST BIT WC
2012 26 ST DFO DFO
2012 26 ST BIT DFO
2012 26 ST BIT BIT
2012 26 ST BIT WC
;;;;
proc sql;
create table want as
select distinct a.*
from have as a
join have as b
on a.Secondary=b.Primary
and a.PlantID=b.PlantID
and a.Primemover=b.primemover;
quit;
EDIT: add in condition for primemover to be the same as well as plant id.
Thanks for this solution. I ended up using a different SQL approach where I counted the duplicate secondary sources within each primary source. This had the added benefit of giving me the number of overlapping sources, which I will need for later analysis.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.