Solved: Re: Row by row comparison within group

Yegen · Posted 06-07-2017 01:49 AM

I am trying to move completely to SAS (from other statistical softwares) but I am having a difficult time getting used to "row by row" operations. For example, in this particular case, I am trying to compare rows within a given group and keep the rows that have not the identical ID number but are within the same industry and group.

Here is my sample:

DATA work.sample;
	INPUT ID Industry Group $ Control;
	DATALINES; 
	55203610 2542 AB 1
	16124130 8062 AB 1
	16124130 8062 AB 0
	80479512 2062 AB 1
	70321918 2062 AB 1
	17312410 6712 AB 1
	17311710 6712 AB 0
	74623123 2082 AC 1
	24245321 2082 AC 0
;
RUN;

Here is what the output needs to look like:

Data work.want;
	INPUT ID Group $;
	DATALINES; 
	80479512 AB
	70321918 AB
	17311710 AB
	17312410 AB
	74623123 AC
	24245321 AC
;
RUN;

Here is my code:

proc sort data=work.sample;
	by Group Industry ID;
quit;


DATA work.identified;
	set work.sample;
	by Group;
	if lag(ID)^=ID and lag(Industry)=Industry then keep=1; else keep=0;	
RUN;


DATA work.identified;
	set work.identified; 
	if keep=1;
Run;

With that code, I am incorrectly identifying some of the first observations within a given group and the lagged observations that have a subsequent non-identical ID which is within the same group and industry. For instance, with the code above, I have misidentified 70321918, 17311710, and 24245321.

Any suggestions / help will be greatly appreciated.

Shmuel · Posted 06-07-2017 02:30 AM

Next cofe will give you the wanted results:

DATA work.sample;
    format group industry id control;
    INPUT ID Industry Group $ Control;
    DATALINES; 
    55203610 2542 AB 1
    16124130 8062 AB 1
    16124130 8062 AB 0
    80479512 2062 AB 1
    70321918 2062 AB 1
    17312410 6712 AB 1
    17311710 6712 AB 0
    74623123 2082 AC 1
    24245321 2082 AC 0
;
RUN;
proc sort data=work.sample;
    by Group Industry ID;
quit;

DATA work.identified;
    set work.sample;
    by Group industry ID;
    retain lagID;
    if not (first.industry and last.industry);
    if first.id and last.id;
    if ID ^= lagID then output; 
    lagID=ID;
RUN;

View solution in original post

Shmuel · Posted 06-07-2017 02:30 AM

Next cofe will give you the wanted results:

DATA work.sample;
    format group industry id control;
    INPUT ID Industry Group $ Control;
    DATALINES; 
    55203610 2542 AB 1
    16124130 8062 AB 1
    16124130 8062 AB 0
    80479512 2062 AB 1
    70321918 2062 AB 1
    17312410 6712 AB 1
    17311710 6712 AB 0
    74623123 2082 AC 1
    24245321 2082 AC 0
;
RUN;
proc sort data=work.sample;
    by Group Industry ID;
quit;

DATA work.identified;
    set work.sample;
    by Group industry ID;
    retain lagID;
    if not (first.industry and last.industry);
    if first.id and last.id;
    if ID ^= lagID then output; 
    lagID=ID;
RUN;

Yegen · Posted 06-07-2017 02:58 AM

Thanks @Shmuel, this is a great way of identifying the observations! Although the code does exactly what I was looking for, I am not quiet sure why you had to create a new column of lagged variables that are not within the same group or industry. Could you please elaborate on that line?

Shmuel · Posted 06-07-2017 03:12 AM

@Yegen wrote:

Thanks @Shmuel, this is a great way of identifying the observations! Although the code does exactly what I was looking for, I am not quiet sure why you had to create a new column of lagged variables that are not within the same group or industry. Could you please elaborate on that line?

LAG function gives sometimes false result when it is part of IF statement,

therefore I retain the value into lagID for next itteration.

You can drop this variable. I left it just for control.

Yegen · Posted 06-07-2017 04:04 AM

Exactly, that's where I was facing an issue. Thanks for this clarification @Shmuel, I have just added a few comments and a question to the code below:

DATA work.identified;
    set work.sample;
    by Group industry ID;
    retain lagID;
    if not (first.industry and last.industry); *Excludes unique observations within industry and group;
    if first.id and last.id; *Question: Not sure what this line exactly does. Does it state that if it's not a unique observation, then keep it? Then in the last two lines, the code checks whether or not the lagged observation is identical to the one considered?;
    if ID ^= lagID then output;  *With the line below, this shifts the lagged value to the LagID column;
    lagID=ID;
RUN;

Shmuel · Posted 06-07-2017 04:26 AM

Relating to : if first.id and last id; - thus eliminating choosing from lines like:

    16124130 8062 AB 1
    16124130 8062 AB 0

where same id exists in same group and same industry in two rows.

Without above IF selection, you will get the first row kept as ID differs from previous row ID (lagID).

Comment that IF staement and check results.

Yegen · Posted 06-07-2017 04:31 AM

I see. This makes A LOT of sense. So basically you are requiring to remove the observations (which are within the same group and industry) whenever they have the same ID (i.e., exactly as in the exampe you have given). Now I understand the code well! Thanks for all of your super helpful clarifications @Shmuel.

Jagadishkatam · Posted 06-07-2017 02:33 AM

proc sql;
create table want as select id,group from sample where industry in (select distinct industry from sample group by industry having count(Industry)>1) group by industry,id having count( Industry)<=1 order by group;
quit;

Thanks,
Jag

Yegen · Posted 06-07-2017 03:05 AM

@Jagadishkatam, when I run your code on the following sample, the output does not look correct. In particular, the observation with the ID 17311711 should have not been selected. Do you know where the mistake is happening? I would also like to know how to write this code using PROC SQL. Thanks.

DATA work.sample;
    format group industry id control;
    INPUT ID Industry Group $ Control;
    DATALINES; 
    55203610 2542 AB 1
    16124130 8062 AB 1
    16124130 8062 AB 0
    80479512 2062 AB 1
    70321918 2062 AB 1 
    17312410 6712 AB 1
    17311710 6712 AB 0
    17311711 2082 AB 0
    74623123 2082 AC 1
    24245321 2082 AC 0
;
RUN;


proc sql;
  create table want as 
  select id,group 
  from sample 
  where industry in (select distinct industry from sample group by industry having count(Industry                     )>1) 
  group by industry,id 
  having count( Industry)<=1 
  order by group;
quit;

Jagadishkatam · Posted 06-07-2017 04:55 AM

Thank you @Yegen for your response. I understood where i am going wrong. Please try this code

proc sql;
create table want as select id,group from sample where industry in 
(select Industry from (select count(Industry) as c, industry,id from sample group by id,industry having count(Industry)<=1) group by Industry having sum(c) >1) 
order by group;
quit;

Thanks,
Jag

Yegen · Posted 06-07-2017 06:36 AM

@Jagadishkatam, thanks. I have just tried your code with the following data but the output still does not look correct. In particular, observations with the ID 22222222 and 22222223 are selected, whereas they should have not been selected.

DATA work.sample;
    format group industry id control;
    INPUT ID Industry Group $ Control;
    DATALINES; 
    11111111 2542 AB 1
    11111112 2542 AB 1
    22222222 8062 AB 1
    22222223 8062 AC 0
;
RUN;

Ksharp · Posted 06-07-2017 09:07 AM

DATA work.sample;
	INPUT ID Industry Group $ Control;
	DATALINES; 
	55203610 2542 AB 1
	16124130 8062 AB 1
	16124130 8062 AB 0
	80479512 2062 AB 1
	70321918 2062 AB 1
	17312410 6712 AB 1
	17311710 6712 AB 0
	74623123 2082 AC 1
	24245321 2082 AC 0
;
RUN;
proc sql;
select *
 from sample
  group by group,industry
   having count(distinct id) ne 1;
quit;

Yegen · Posted 06-07-2017 04:30 PM

That is such a novel SQL code that identifies the correct observations! The distinct command is indeed very handy in this case. Thanks, @Ksharp.

Classroom Training Available!