I am trying to move completely to SAS (from other statistical softwares) but I am having a difficult time getting used to "row by row" operations. For example, in this particular case, I am trying to compare rows within a given group and keep the rows that have not the identical ID number but are within the same industry and group.
Here is my sample:
DATA work.sample;
INPUT ID Industry Group $ Control;
DATALINES;
55203610 2542 AB 1
16124130 8062 AB 1
16124130 8062 AB 0
80479512 2062 AB 1
70321918 2062 AB 1
17312410 6712 AB 1
17311710 6712 AB 0
74623123 2082 AC 1
24245321 2082 AC 0
;
RUN;
Here is what the output needs to look like:
Data work.want;
INPUT ID Group $;
DATALINES;
80479512 AB
70321918 AB
17311710 AB
17312410 AB
74623123 AC
24245321 AC
;
RUN;
Here is my code:
proc sort data=work.sample;
by Group Industry ID;
quit;
DATA work.identified;
set work.sample;
by Group;
if lag(ID)^=ID and lag(Industry)=Industry then keep=1; else keep=0;
RUN;
DATA work.identified;
set work.identified;
if keep=1;
Run;
With that code, I am incorrectly identifying some of the first observations within a given group and the lagged observations that have a subsequent non-identical ID which is within the same group and industry. For instance, with the code above, I have misidentified 70321918, 17311710, and 24245321.
Any suggestions / help will be greatly appreciated.
Next cofe will give you the wanted results:
DATA work.sample;
format group industry id control;
INPUT ID Industry Group $ Control;
DATALINES;
55203610 2542 AB 1
16124130 8062 AB 1
16124130 8062 AB 0
80479512 2062 AB 1
70321918 2062 AB 1
17312410 6712 AB 1
17311710 6712 AB 0
74623123 2082 AC 1
24245321 2082 AC 0
;
RUN;
proc sort data=work.sample;
by Group Industry ID;
quit;
DATA work.identified;
set work.sample;
by Group industry ID;
retain lagID;
if not (first.industry and last.industry);
if first.id and last.id;
if ID ^= lagID then output;
lagID=ID;
RUN;
Next cofe will give you the wanted results:
DATA work.sample;
format group industry id control;
INPUT ID Industry Group $ Control;
DATALINES;
55203610 2542 AB 1
16124130 8062 AB 1
16124130 8062 AB 0
80479512 2062 AB 1
70321918 2062 AB 1
17312410 6712 AB 1
17311710 6712 AB 0
74623123 2082 AC 1
24245321 2082 AC 0
;
RUN;
proc sort data=work.sample;
by Group Industry ID;
quit;
DATA work.identified;
set work.sample;
by Group industry ID;
retain lagID;
if not (first.industry and last.industry);
if first.id and last.id;
if ID ^= lagID then output;
lagID=ID;
RUN;
Thanks @Shmuel, this is a great way of identifying the observations! Although the code does exactly what I was looking for, I am not quiet sure why you had to create a new column of lagged variables that are not within the same group or industry. Could you please elaborate on that line?
@Yegen wrote:
Thanks @Shmuel, this is a great way of identifying the observations! Although the code does exactly what I was looking for, I am not quiet sure why you had to create a new column of lagged variables that are not within the same group or industry. Could you please elaborate on that line?
LAG function gives sometimes false result when it is part of IF statement,
therefore I retain the value into lagID for next itteration.
You can drop this variable. I left it just for control.
Exactly, that's where I was facing an issue. Thanks for this clarification @Shmuel, I have just added a few comments and a question to the code below:
DATA work.identified;
set work.sample;
by Group industry ID;
retain lagID;
if not (first.industry and last.industry); *Excludes unique observations within industry and group;
if first.id and last.id; *Question: Not sure what this line exactly does. Does it state that if it's not a unique observation, then keep it? Then in the last two lines, the code checks whether or not the lagged observation is identical to the one considered?;
if ID ^= lagID then output; *With the line below, this shifts the lagged value to the LagID column;
lagID=ID;
RUN;
Relating to : if first.id and last id; - thus eliminating choosing from lines like:
16124130 8062 AB 1 16124130 8062 AB 0
where same id exists in same group and same industry in two rows.
Without above IF selection, you will get the first row kept as ID differs from previous row ID (lagID).
Comment that IF staement and check results.
I see. This makes A LOT of sense. So basically you are requiring to remove the observations (which are within the same group and industry) whenever they have the same ID (i.e., exactly as in the exampe you have given). Now I understand the code well! Thanks for all of your super helpful clarifications @Shmuel.
proc sql;
create table want as select id,group from sample where industry in (select distinct industry from sample group by industry having count(Industry)>1) group by industry,id having count( Industry)<=1 order by group;
quit;
@Jagadishkatam, when I run your code on the following sample, the output does not look correct. In particular, the observation with the ID 17311711 should have not been selected. Do you know where the mistake is happening? I would also like to know how to write this code using PROC SQL. Thanks.
DATA work.sample;
format group industry id control;
INPUT ID Industry Group $ Control;
DATALINES;
55203610 2542 AB 1
16124130 8062 AB 1
16124130 8062 AB 0
80479512 2062 AB 1
70321918 2062 AB 1
17312410 6712 AB 1
17311710 6712 AB 0
17311711 2082 AB 0
74623123 2082 AC 1
24245321 2082 AC 0
;
RUN;
proc sql;
create table want as
select id,group
from sample
where industry in (select distinct industry from sample group by industry having count(Industry )>1)
group by industry,id
having count( Industry)<=1
order by group;
quit;
Thank you @Yegen for your response. I understood where i am going wrong. Please try this code
proc sql;
create table want as select id,group from sample where industry in
(select Industry from (select count(Industry) as c, industry,id from sample group by id,industry having count(Industry)<=1) group by Industry having sum(c) >1)
order by group;
quit;
@Jagadishkatam, thanks. I have just tried your code with the following data but the output still does not look correct. In particular, observations with the ID 22222222 and 22222223 are selected, whereas they should have not been selected.
DATA work.sample;
format group industry id control;
INPUT ID Industry Group $ Control;
DATALINES;
11111111 2542 AB 1
11111112 2542 AB 1
22222222 8062 AB 1
22222223 8062 AC 0
;
RUN;
DATA work.sample;
INPUT ID Industry Group $ Control;
DATALINES;
55203610 2542 AB 1
16124130 8062 AB 1
16124130 8062 AB 0
80479512 2062 AB 1
70321918 2062 AB 1
17312410 6712 AB 1
17311710 6712 AB 0
74623123 2082 AC 1
24245321 2082 AC 0
;
RUN;
proc sql;
select *
from sample
group by group,industry
having count(distinct id) ne 1;
quit;
That is such a novel SQL code that identifies the correct observations! The distinct command is indeed very handy in this case. Thanks, @Ksharp.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.