Hello everyone,
I have a dataset with many columns, where I would like to compare the contents of each column with some other columns and create a variable if things do not match up. Specifically, create both a count variable and create a new variable that has the specific code that did not match up.
This is the code im currently using...
data adtn.ss7 (drop = i);
set adtn.ss6;
count=.;
check =.;
array items{35} sic36-sic70;
do i=1 to 35;
if items{i}~=sic1 then count=1;
if items{i}~=sic1 then check=sic1;
end;
run;
The problem Im having is that SAS counts a '1' if ANY of the variables in my items array (sic36-sic70) do not match sic1. However, I only want it to count a '1' and for my 'check' variable to populate if there isn't a match in my entire array.
To state it differently, I only want my counter variable to count if there is no match anywhere sic36-sic70. As in, it is ok if only one column within the array matches onto sic1. Furthermore, my array sic36-sic70 has a lot of missing values.
Your help is greatly appreciated!
Thanks!
I think you are facing endless problems as long as you keep that wide data structure. Here is how your problem would get simpler if you went to a long data structure :
/* Create two datasets. Sics with sics 1-35 and Items with sics 36-70 */
data Sics Items;
set adtc.ss;
array sics{35} sic1-sic35;
array items{35} sic36-sic70;
obs = _n_; /* Unique ID for each observation */
do i = 1 to 35;
	sic = sics{i};
	if not missing(sic) then output Sics;
	end;
do i = 1 to 35;
	sic = items{i};
	if not missing(sic) then output Items;
	end;
drop i sic1-sic70;
run;
/* Extract the sics from Sics that do not appear on the same observation in Items */
proc sql;
create table missSics as
select * from Sics as A
where sic not in (select sic from Items where obs=A.obs);
quit;
/* Count the distinct missing sics in each observation */
proc sql;
create table countSics as
select obs, count(distinct sic) as nMissSics
from missSics
group by obs;
quit; (untested)
Modify your loop to:
count = 1;
check = sic1;
do i = 1 to 35;
 if count then if items{i} = sic1 then call missing(count, check);
end;PG Stats, thanks so much for your quick reply. This has worked great. Another question for you...
What if I now wanted to figure out if a code (e.g., sic71) matched ANY of the codes within a range of columns (sic1-sic35)? I want to create a count variable if it does match (in a count variable) and I want it to regurgitate the code that it matched (in another variable) (just to double check the code it matched on). Furthermore, there code be blanks within the range of columns (sic1-sic35) and there could also be blanks on some of the variable's (sic71) observations. I tried to slightly modify the code you gave me. Let me know what you think. And thanks again!
data adtn.match1 (drop = i);
set adtn.match;
matchcount1 = 1;
sicmatch = sic71;
array items{35} sic1-sic35;
do i = 1 to 35;
if matchcount1 then items{i} = sic71 then call missing(matchcount1, sicmatch);
end;
run;
To get the first match:
data adtn.match1 (drop = i);
set adtn.match;
array items{35} sic1-sic35;
call missing(matchCount, sicMatch);
do i = 1 to 35;
	if not matchCount then if items{i} = sic71 then do;
		matchCount = 1;
		sicMatch = sic71;
		end;
	end;
run;PG,
Sorry for the delayed reply. Ive actually been pouring over this code / dataset trying to figure out how to get things correct.
So in reference to the first piece code that you helped me with. I used the following:
data adtc.ss7 (drop = i);
set adtc.ss6;
count1 = 1;
dm1 = sic1;
array items{35} sic36-sic70;
do i = 1 to 35;
if count1 then if items{i} = sic1 then call missing(count1, dm1);
end;
run;
data adtc.ss7 (drop = i);
set adtc.ss7;
count2 = 1;
dm2 = sic2;
array items{35} sic36-sic70;
do i = 1 to 35;
if count2 then if items{i} = sic2 then call missing(count2, dm2);
end;
run;
And I repeated the code for each SIC I need to compare for sic codes 1-35 (sic1-sic35). However, when Im done and add things
up across count1-35, Im not getting an accurate count. The reason is because the array (e.g., sic36-sic70) may have missing cells (sometimes only some missing, sometimes all) and in that case, any value (sic1-sic35) that is compared is treated as though it is not in the array and is counted. Second, the values Im comparing (sic1-sic35) may have duplicates. For example, sic1 may be 5149, sic2 may be 5149, sic3 may be 5149, etc.
So, I need to modify the code so that it takes into account missing values in the array (sic37-sic70) when comparing and I need to have it only count the unique sic's that didn't match. It seems like I need more complicated code and would need to do this in one step as opposed to the two steps Im currently doing it in. Or maybe, there's a better two step approach. However, Im not sure how to expand upon my code.
I think you are facing endless problems as long as you keep that wide data structure. Here is how your problem would get simpler if you went to a long data structure :
/* Create two datasets. Sics with sics 1-35 and Items with sics 36-70 */
data Sics Items;
set adtc.ss;
array sics{35} sic1-sic35;
array items{35} sic36-sic70;
obs = _n_; /* Unique ID for each observation */
do i = 1 to 35;
	sic = sics{i};
	if not missing(sic) then output Sics;
	end;
do i = 1 to 35;
	sic = items{i};
	if not missing(sic) then output Items;
	end;
drop i sic1-sic70;
run;
/* Extract the sics from Sics that do not appear on the same observation in Items */
proc sql;
create table missSics as
select * from Sics as A
where sic not in (select sic from Items where obs=A.obs);
quit;
/* Count the distinct missing sics in each observation */
proc sql;
create table countSics as
select obs, count(distinct sic) as nMissSics
from missSics
group by obs;
quit; (untested)
PG - That worked great! I really appreciate your help with all of this. 2 more questions -
1. In that last table created by the last line of code you just posted (countsics), how do I also get it to include the distinct SICs as columns for each observations.
For example,
obs nmisssics dm1 dm2 dm3 ........etc.
1 2 4959 4215 4800
2. What if I want to use similar code (similar to what you just sent me) to compare two arrays of numbers. The numbers are also currently arranged in columns (psic1-psic130 and match1-match30) However, this time I want it to count a comparison if there IS a match. For example, if the codes in the columns match1-match30 match ANY of the codes in psic1-psic130, I want it to create a table that counts how many matches there were and also list which codes were matched. Furthermore, there are repeats and missing values within psic1-psic390 and potentially missing values within match1-match30.
Thanks again for your invaluable help and patience, I dont take it lightly.
Question 1) I was hoping to entice you away from that wide data structure, but you want to go right back to it!
Since there is one obs in countSics for every obs in the original dataset and in the same order, just do a one-to-one merge:
data fullSics;
merge adtc.ss countSics;
run;(untested)
Question 2) Should be straitforward once you fully understand how the code works.
Haha, I follow your advice. Generally, I would like to keep wide format, for various reasons, but I dont mind doing certain computations long ways how you shown. Very helpful.
1 - So the merge code you just gave does not work well, since there is not an observation for every observation in the dataset (not every observation in the original dataset had an SIC that didnt match in the other array of SICs). Additionally, I need to have it matched up so that I know WHICH sic codes did not match, like so:
obs nmisssics dm1 dm2 dm3 ........etc.
1 2 4959 4215 4800
I suspect there is a line that could be added to the proc sql statement that created this table, but Im not sure what code to put in there:
/* Count the distinct missing sics in each observation */
proc sql;
create table adtc.countSics as
select obs, count(distinct sic) as nMissSics
from adtc.missSics
group by obs;
quit;
2 - I tried to modify the code you gave me in order to create a table that counts how many matches there were and also list which codes were matched. If you wouldnt mind letting me know this looks:
/* Create two datasets. sics1 with psics 1-130 and portfolio with psics 131-520 */
data adtc.sics1 adtc.portfolio;
set adtc.original;
array sics1{130} psic1-psic130;
array portfolio{390} psic131-psic520;
obs = _n_; /* Unique ID for each observation */
do i = 1 to 130;
sic = sics{i};
if not missing(sic) then output adtc.sics1;
end;
do i = 1 to 390;
sic = portfolio{i};
if not missing(sic) then output adtc.portfolio;
end;
drop i psic1-psic520;
run;
/* Extract the sics from Sics that appear on the same observation in Items */
proc sql;
create table adtc.matchSics as
select * from adtc.Sics1 as A
where sic in (select sic from adtc.portfolio where obs=A.obs);
quit;
/* Count the matched sics in each observation */
proc sql;
create table adtc.countmatchSics as
select obs, count(sic) as nMatches
from adtc.matchsics
group by obs;
quit;
Again, I appreciate your patience!
PG - Finally got things figured out based on your longways code. Big help, this code is changing my life! Anyhow, thanks again.
You don't mention whether your values are numeric or text but a similar approach with a different function will work for what I think you are looking for.
The functions WHICHN and WHICHC look to see which of a list of values matches a single value and returns that position or 0 if not found.
So you may want if the variables are numeric
If WhichN( sic1, of items[*]) = 0 then <whatever you want when SIC1 isn't found in the list>
or if character.
If WhichC( sic1, of items[*]) = 0 then <whatever you want when SIC1 isn't found in the list>
Note the comparison does not have to loop with any index it is using your array as a list.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
