BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Yegen
Pyrite | Level 9

I am trying to move completely to SAS (from other statistical softwares) but I am having a difficult time getting used to "row by row" operations. For example, in this particular case, I am trying to compare rows within a given group and keep the rows that have not the identical ID number but are within the same industry and group. 

Here is my sample:

DATA work.sample;
	INPUT ID Industry Group $ Control;
	DATALINES; 
	55203610 2542 AB 1
	16124130 8062 AB 1
	16124130 8062 AB 0
	80479512 2062 AB 1
	70321918 2062 AB 1
	17312410 6712 AB 1
	17311710 6712 AB 0
	74623123 2082 AC 1
	24245321 2082 AC 0
;
RUN;

Here is what the output needs to look like:

 

 

Data work.want;
	INPUT ID Group $;
	DATALINES; 
	80479512 AB
	70321918 AB
	17311710 AB
	17312410 AB
	74623123 AC
	24245321 AC
;
RUN;

 

 

Here is my code:

 

proc sort data=work.sample;
	by Group Industry ID;
quit;


DATA work.identified;
	set work.sample;
	by Group;
	if lag(ID)^=ID and lag(Industry)=Industry then keep=1; else keep=0;	
RUN;


DATA work.identified;
	set work.identified; 
	if keep=1;
Run;

 

With that code, I am incorrectly identifying some of the first observations within a given group and the lagged observations that have a subsequent non-identical ID which is within the same group and industry. For instance, with the code above, I have misidentified 70321918, 17311710, and 24245321.

Any suggestions / help will be greatly appreciated. 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Shmuel
Garnet | Level 18

Next cofe will give you the wanted results:

 

DATA work.sample;
    format group industry id control;
    INPUT ID Industry Group $ Control;
    DATALINES; 
    55203610 2542 AB 1
    16124130 8062 AB 1
    16124130 8062 AB 0
    80479512 2062 AB 1
    70321918 2062 AB 1
    17312410 6712 AB 1
    17311710 6712 AB 0
    74623123 2082 AC 1
    24245321 2082 AC 0
;
RUN;
proc sort data=work.sample;
    by Group Industry ID;
quit;

DATA work.identified;
    set work.sample;
    by Group industry ID;
    retain lagID;
    if not (first.industry and last.industry);
    if first.id and last.id;
    if ID ^= lagID then output; 
    lagID=ID;
RUN; 

View solution in original post

12 REPLIES 12
Shmuel
Garnet | Level 18

Next cofe will give you the wanted results:

 

DATA work.sample;
    format group industry id control;
    INPUT ID Industry Group $ Control;
    DATALINES; 
    55203610 2542 AB 1
    16124130 8062 AB 1
    16124130 8062 AB 0
    80479512 2062 AB 1
    70321918 2062 AB 1
    17312410 6712 AB 1
    17311710 6712 AB 0
    74623123 2082 AC 1
    24245321 2082 AC 0
;
RUN;
proc sort data=work.sample;
    by Group Industry ID;
quit;

DATA work.identified;
    set work.sample;
    by Group industry ID;
    retain lagID;
    if not (first.industry and last.industry);
    if first.id and last.id;
    if ID ^= lagID then output; 
    lagID=ID;
RUN; 
Yegen
Pyrite | Level 9

Thanks @Shmuel, this is a great way of identifying the observations! Although the code does exactly what I was looking for, I am not quiet sure why you had to create a new column of lagged variables that are not within the same group or industry. Could you please elaborate on that line? 

Shmuel
Garnet | Level 18

@Yegen wrote:

Thanks @Shmuel, this is a great way of identifying the observations! Although the code does exactly what I was looking for, I am not quiet sure why you had to create a new column of lagged variables that are not within the same group or industry. Could you please elaborate on that line? 


LAG function gives sometimes false result when it is part of IF statement,

therefore I retain the value into lagID for next itteration.

You can drop this variable. I left it just for control.

Yegen
Pyrite | Level 9

Exactly, that's where I was facing an issue. Thanks for this clarification @Shmuel, I have just added a few comments and a question to the code below:

DATA work.identified;
    set work.sample;
    by Group industry ID;
    retain lagID;
    if not (first.industry and last.industry); *Excludes unique observations within industry and group;
    if first.id and last.id; *Question: Not sure what this line exactly does. Does it state that if it's not a unique observation, then keep it? Then in the last two lines, the code checks whether or not the lagged observation is identical to the one considered?;
    if ID ^= lagID then output;  *With the line below, this shifts the lagged value to the LagID column;
    lagID=ID;
RUN;

  

Shmuel
Garnet | Level 18

Relating to :  if first.id and last id;   -  thus eliminating choosing from lines like:

    16124130 8062 AB 1
    16124130 8062 AB 0

where same id exists in same group and same industry in two rows.

Without above IF selection, you will get the first row kept as ID differs from previous row ID (lagID).

 

Comment that IF staement and check results.

 

Yegen
Pyrite | Level 9

I see. This makes A LOT of sense. So basically you are requiring to remove the observations (which are within the same group and industry) whenever they have the same ID (i.e., exactly as in the exampe you have given). Now I understand the code well! Thanks for all of your super helpful clarifications @Shmuel

Jagadishkatam
Amethyst | Level 16
proc sql;
create table want as select id,group from sample where industry in (select distinct industry from sample group by industry having count(Industry)>1) group by industry,id having count( Industry)<=1 order by group;
quit;
Thanks,
Jag
Yegen
Pyrite | Level 9

@Jagadishkatam, when I run your code on the following sample, the output does not look correct. In particular, the observation with the ID 17311711 should have not been selected. Do you know where the mistake is happening? I would also like to know how to write this code using PROC SQL. Thanks. 

 

 

 

DATA work.sample;
    format group industry id control;
    INPUT ID Industry Group $ Control;
    DATALINES; 
    55203610 2542 AB 1
    16124130 8062 AB 1
    16124130 8062 AB 0
    80479512 2062 AB 1
    70321918 2062 AB 1 
    17312410 6712 AB 1
    17311710 6712 AB 0
    17311711 2082 AB 0
    74623123 2082 AC 1
    24245321 2082 AC 0
;
RUN;


proc sql;
  create table want as 
  select id,group 
  from sample 
  where industry in (select distinct industry from sample group by industry having count(Industry                     )>1) 
  group by industry,id 
  having count( Industry)<=1 
  order by group;
quit;

 

Jagadishkatam
Amethyst | Level 16

Thank you @Yegen for your response. I understood where i am going wrong. Please try this code

 

proc sql;
create table want as select id,group from sample where industry in 
(select Industry from (select count(Industry) as c, industry,id from sample group by id,industry having count(Industry)<=1) group by Industry having sum(c) >1) 
order by group;
quit;
Thanks,
Jag
Yegen
Pyrite | Level 9

@Jagadishkatam, thanks. I have just tried your code with the following data but the output still does not look correct. In particular, observations with the ID 22222222 and 22222223 are selected, whereas they should have not been selected. 

DATA work.sample;
    format group industry id control;
    INPUT ID Industry Group $ Control;
    DATALINES; 
    11111111 2542 AB 1
    11111112 2542 AB 1
    22222222 8062 AB 1
    22222223 8062 AC 0
;
RUN;
Ksharp
Super User
DATA work.sample;
	INPUT ID Industry Group $ Control;
	DATALINES; 
	55203610 2542 AB 1
	16124130 8062 AB 1
	16124130 8062 AB 0
	80479512 2062 AB 1
	70321918 2062 AB 1
	17312410 6712 AB 1
	17311710 6712 AB 0
	74623123 2082 AC 1
	24245321 2082 AC 0
;
RUN;
proc sql;
select *
 from sample
  group by group,industry
   having count(distinct id) ne 1;
quit;
Yegen
Pyrite | Level 9

That is such a novel SQL code that identifies the correct observations! The distinct command is indeed very handy in this case. Thanks, @Ksharp

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 31409 views
  • 6 likes
  • 4 in conversation