Solved: Re: Need SAS data step logic

neelima · Posted 01-25-2017 08:56 PM

Hi All,

I have data as below and need to find out which CUSTOMER has given a "5" score and then immediately followed with a missing score. can you please suggest how to achive this in Data step.

c	SCORE	COMMENT
A	4	I like the food
A	2	The service is very bad
A	3	The is no parking
A	5	The food is expensive
B	2	The food is cold
B	5	It tastes good
B	.
B	3	I like the drink
C	4	The dessert is tasty
C	2	I don't like the services
C	3	It is too expensive

Thanks in advance for your help!

Patrick · Posted 01-25-2017 09:08 PM

One way to go.

data have;
  infile datalines truncover dlm='|';
  input customer :$8.  SCORE :best32.  COMMENT $40.;
  datalines;
A|4|I like the food
A|2|The service is very bad
A|3|The is no parking
A|5|The food is expensive
B|2|The food is cold
B|5|It tastes good
B|.| 
B|3|I like the drink
C|4|The dessert is tasty
C|2|I don't like the services
C|3|It is too expensive
;
run;

data want;
  set have;
  if lag(customer)=customer and lag(score)=5 and missing(score) then output;
run;

View solution in original post

Patrick · Posted 01-25-2017 09:08 PM

One way to go.

data have;
  infile datalines truncover dlm='|';
  input customer :$8.  SCORE :best32.  COMMENT $40.;
  datalines;
A|4|I like the food
A|2|The service is very bad
A|3|The is no parking
A|5|The food is expensive
B|2|The food is cold
B|5|It tastes good
B|.| 
B|3|I like the drink
C|4|The dessert is tasty
C|2|I don't like the services
C|3|It is too expensive
;
run;

data want;
  set have;
  if lag(customer)=customer and lag(score)=5 and missing(score) then output;
run;

neelima · Posted 01-25-2017 09:22 PM

Hi Pat,

Thanks a lot for your help!.

neelima · Posted 01-25-2017 09:24 PM

Hi Pat,

Also can you please help how to find out which CUSTOMER has comment on "parking" AND "expensive" on same dataset.

Patrick · Posted 01-25-2017 09:56 PM

Here you go.

data have;
  infile datalines truncover dlm='|';
  input customer :$8.  SCORE :best32.  COMMENT $40.;
  datalines;
A|4|I like the food
A|2|The service is very bad
A|5|The is no parking
A|.|The food is expensive
B|3|The food is cold
B|5|It tastes good
B|.| 
B|3|I like the drink
C|4|The dessert is tasty
C|2|I don't like the services
C|3|It is too expensive
;
run;

data want;
  set have;
  by customer notsorted;
  _lag_score=lag(score);
  retain score_flg expensive_flg parking_flg;
  if not first.customer and _lag_score=5 and missing(score) then score_flg=1;
  if findw(comment,'expensive',' ','i') then expensive_flg=1;
  if findw(comment,'parking',' ','i') then parking_flg=1;
  if last.customer then
    do;
      output;
      call missing(score_flg, expensive_flg, parking_flg);
    end;
run;

For the future: Try to have a single question per thread. Start a new thread with new or additional questions; reference an existing thread if the question is related to it.

LaurieF · Posted 01-25-2017 09:09 PM

I'm not a big fan of the lag function, but this is a situation where it works well.

data source;
infile cards dsd dlm='09'x;   /* Separated by tabs, as in the posted data */
attrib c length=$ 1;
attrib score length=3;
attrib comment length=$ 30;
input c
      score
      comment;
cards;
A	4	I like the food
A	2	The service is very bad
A	3	The is no parking
A	5	The food is expensive
B	2	The food is cold
B	5	It tastes good
B	.	 
B	3	I like the drink
C	4	The dessert is tasty
C	2	I don't like the services
C	3	It is too expensive
;
run;

data score_plus_null;
set source;
if c = lag(c)                 /* If it's still the same customer */
 & missing(score)             /* if the current score is missing */
 & lag(score) = 5;            /* if the previous score is 5 */
run;

LaurieF · Posted 01-25-2017 09:34 PM

Here's one way of doing it.

data target;
set source;
by c;
if first.c then
   count = 0;
if indexw(comment, 'expensive')
 | indexw(comment, 'parking') then
   count + 1;
if last.c;
if count = 2;
keep c;
run;

neelima · Posted 01-25-2017 09:48 PM

Hi Laurie,

It works. thanks a lot for your shift response.

Patrick · Posted 01-25-2017 09:59 PM

@LaurieF

What happens if expensive and parking are in the same comment, or if there are two comments with parking but no expensive?

LaurieF · Posted 01-25-2017 10:13 PM

There's really no way of getting around that with (apparently) free-form text fields: it would require natural language parsing. If the responses however were from a finite unmodifiable list, then it would be quite straightforward but we'd need to know what the finite list contained. I always prefer to cope with codes, rather than text strings, for such a problem. But we don't always have that luxury, of course.

Your first problem: it'd work unmodified, but the first paragraph is important; the second would be a case of adding up the strings individually. Again, refer to first paragraph!

My first rule of data analysis: know thy data…

Patrick · Posted 01-25-2017 11:48 PM

Requirement

which CUSTOMER has comment on "parking" AND "expensive" on same dataset.

@LaurieF

What I've tried to point out to you is, that the code you've posted doesn't meet the requirements as posted by the OP.

You're just counting per record if there is at least one occurrence of a word and though you would return a "requirement met" if there are only two occurrences of word parking. On the other hand you would miss the cases where parking and expensive both exists in a single record (your "count" would be 1 instead of 2).

LaurieF · Posted 01-26-2017 01:55 AM

Yes. I almost completely agree. But this brings up one of the main problem with the puzzles we are set here, and one of the main problems I have in my career (touched on by "Know Thy Data"). Without a solid amount of data to test problems on, it's really easy to make assumptions.

In a perfect world, with lots of data, I would check the cardinality of the text field to check whether it was free-form text or not. If I could confirm that it was the case (very high cardinality), I would tailor my code to fit. With a small set of data, I made an assumption based on my understanding of the spec, and I would argue that I was correct, even if I wasn't right <grin>.

But that doesn't matter. I have (mostly silently) disagreed with some of the answers that people have given in the past, but the question setter has been happy so it is of no import. I think it is less likely that a single slanted interpretation of a specification is going to be accepted if there are extensive source data from which to formulate a solution. I blanch when I'm given test data which has been put together by a business analyst which may only have a hundred or a thousand observations in it. If it's a banking system, give me all the customers, all the accounts and all their transactions. It'll take me a wee bit longer to analyse it, but I will eventually come up with a solution which is more likely to last.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away