Hi All,
I have data as below and need to find out which CUSTOMER has given a "5" score and then immediately followed with a missing score. can you please suggest how to achive this in Data step.
c | SCORE | COMMENT |
A | 4 | I like the food |
A | 2 | The service is very bad |
A | 3 | The is no parking |
A | 5 | The food is expensive |
B | 2 | The food is cold |
B | 5 | It tastes good |
B | . | |
B | 3 | I like the drink |
C | 4 | The dessert is tasty |
C | 2 | I don't like the services |
C | 3 | It is too expensive |
Thanks in advance for your help!
One way to go.
data have;
infile datalines truncover dlm='|';
input customer :$8. SCORE :best32. COMMENT $40.;
datalines;
A|4|I like the food
A|2|The service is very bad
A|3|The is no parking
A|5|The food is expensive
B|2|The food is cold
B|5|It tastes good
B|.|
B|3|I like the drink
C|4|The dessert is tasty
C|2|I don't like the services
C|3|It is too expensive
;
run;
data want;
set have;
if lag(customer)=customer and lag(score)=5 and missing(score) then output;
run;
One way to go.
data have;
infile datalines truncover dlm='|';
input customer :$8. SCORE :best32. COMMENT $40.;
datalines;
A|4|I like the food
A|2|The service is very bad
A|3|The is no parking
A|5|The food is expensive
B|2|The food is cold
B|5|It tastes good
B|.|
B|3|I like the drink
C|4|The dessert is tasty
C|2|I don't like the services
C|3|It is too expensive
;
run;
data want;
set have;
if lag(customer)=customer and lag(score)=5 and missing(score) then output;
run;
Hi Pat,
Thanks a lot for your help!.
Hi Pat,
Also can you please help how to find out which CUSTOMER has comment on "parking" AND "expensive" on same dataset.
Here you go.
data have;
infile datalines truncover dlm='|';
input customer :$8. SCORE :best32. COMMENT $40.;
datalines;
A|4|I like the food
A|2|The service is very bad
A|5|The is no parking
A|.|The food is expensive
B|3|The food is cold
B|5|It tastes good
B|.|
B|3|I like the drink
C|4|The dessert is tasty
C|2|I don't like the services
C|3|It is too expensive
;
run;
data want;
set have;
by customer notsorted;
_lag_score=lag(score);
retain score_flg expensive_flg parking_flg;
if not first.customer and _lag_score=5 and missing(score) then score_flg=1;
if findw(comment,'expensive',' ','i') then expensive_flg=1;
if findw(comment,'parking',' ','i') then parking_flg=1;
if last.customer then
do;
output;
call missing(score_flg, expensive_flg, parking_flg);
end;
run;
For the future: Try to have a single question per thread. Start a new thread with new or additional questions; reference an existing thread if the question is related to it.
I'm not a big fan of the lag function, but this is a situation where it works well.
data source;
infile cards dsd dlm='09'x; /* Separated by tabs, as in the posted data */
attrib c length=$ 1;
attrib score length=3;
attrib comment length=$ 30;
input c
score
comment;
cards;
A 4 I like the food
A 2 The service is very bad
A 3 The is no parking
A 5 The food is expensive
B 2 The food is cold
B 5 It tastes good
B .
B 3 I like the drink
C 4 The dessert is tasty
C 2 I don't like the services
C 3 It is too expensive
;
run;
data score_plus_null;
set source;
if c = lag(c) /* If it's still the same customer */
& missing(score) /* if the current score is missing */
& lag(score) = 5; /* if the previous score is 5 */
run;
Here's one way of doing it.
data target;
set source;
by c;
if first.c then
count = 0;
if indexw(comment, 'expensive')
| indexw(comment, 'parking') then
count + 1;
if last.c;
if count = 2;
keep c;
run;
Hi Laurie,
It works. thanks a lot for your shift response.
What happens if expensive and parking are in the same comment, or if there are two comments with parking but no expensive?
There's really no way of getting around that with (apparently) free-form text fields: it would require natural language parsing. If the responses however were from a finite unmodifiable list, then it would be quite straightforward but we'd need to know what the finite list contained. I always prefer to cope with codes, rather than text strings, for such a problem. But we don't always have that luxury, of course.
Your first problem: it'd work unmodified, but the first paragraph is important; the second would be a case of adding up the strings individually. Again, refer to first paragraph!
My first rule of data analysis: know thy data…
Requirement
which CUSTOMER has comment on "parking" AND "expensive" on same dataset.
What I've tried to point out to you is, that the code you've posted doesn't meet the requirements as posted by the OP.
You're just counting per record if there is at least one occurrence of a word and though you would return a "requirement met" if there are only two occurrences of word parking. On the other hand you would miss the cases where parking and expensive both exists in a single record (your "count" would be 1 instead of 2).
Yes. I almost completely agree. But this brings up one of the main problem with the puzzles we are set here, and one of the main problems I have in my career (touched on by "Know Thy Data"). Without a solid amount of data to test problems on, it's really easy to make assumptions.
In a perfect world, with lots of data, I would check the cardinality of the text field to check whether it was free-form text or not. If I could confirm that it was the case (very high cardinality), I would tailor my code to fit. With a small set of data, I made an assumption based on my understanding of the spec, and I would argue that I was correct, even if I wasn't right <grin>.
But that doesn't matter. I have (mostly silently) disagreed with some of the answers that people have given in the past, but the question setter has been happy so it is of no import. I think it is less likely that a single slanted interpretation of a specification is going to be accepted if there are extensive source data from which to formulate a solution. I blanch when I'm given test data which has been put together by a business analyst which may only have a hundred or a thousand observations in it. If it's a banking system, give me all the customers, all the accounts and all their transactions. It'll take me a wee bit longer to analyse it, but I will eventually come up with a solution which is more likely to last.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.