Hello, I am a SAS novice and learning SAS on my own. Here is what I'm trying to do:
I have a data file, book.txt (attached), that records customer purchases at Amazon and Barnes&Nobel, along with some customer demographic variables. I want to write a program that reads the data in books.txt and generates a count dataset, that is, for each customer, counts the number of books purchased from Barnes&Noble. I want to keep the demographic variables, and then print the first 10 records of this new dataset.
Afterwards, I want to ignore the demographic variables, so I'm only left with the top 10 customers that bought the most book, and the number of books they purchased from Barnes&Nobel.
For reference, I'm using the University Edition on a Mac at the time. However, I do have the following code that was written for Windows (don't know which SAS edition) that I've been trying to modify to use in the University Edition, but I keep getting errors.
--------------------
libname ABC 'C:\ABC';
proc import datafile="C:\ABC\books.txt" out=mydata1 dbms=tab replace;
getnames=yes;
run;
proc import datafile="C:\ABC\books.txt" out=mydata1 dbms=dlm replace;
delimiter='09'x;
getnames=yes;
run;
proc sort data=Mydata;
by domain;
run;
PROC SQL;
CREATE TABLE BNN AS
SELECT t1.userid,
t1.education,
t1.region,
t1.hhsz,
t1.age,
t1.income,
t1.child,
t1.race,
t1.country,
t1.domain,
t1.date,
t1.product,
t1.qty,
t1.price
FROM ABC.mydata t1
WHERE t1.domain CONTAINS 'barnes';
QUIT;
proc sort data=abc.barnes;
by userid;
run;
data count;
set abc.barnes;
by userid;
cnt+1;
if first.userid then cnt=1;
run;
proc print data=count (obs=10);
title 'Top 10 Observations of Customers with Most Purchases from B&N';
run;
proc sort data=Mydata;
by domain;
run;
PROC SQL;
SELECT Domain, COUNT(userid)
FROM Mydata
GROUP BY domain
ORDER BY domain;
QUIT;
Thanks for your quick reply!
I manged to fix things so I don't get an error, however, the code is not doing what I want it to yet. All I get is the complete table, with the original number of rows. I am not able to find the customers that purchase the most books from B&N (the top 10.) It's as if none of the code after the first invoking of proc sql (where I create the table) happens. I don't even get an error. I just don't get the final result I want. My code was modified to this:
---------------
libname mydata '/folders/myfolders/';
proc import datafile="/folders/myfolders/books.txt" out=mydata1 dbms=tab replace;
getnames=yes;
run;
proc import datafile="/folders/myfolders/books.txt" out=mydata1 dbms=dlm replace;
delimiter='09'x;
getnames=yes;
run;
proc sort data=Mydata;
by domain;
run;
PROC SQL;
CREATE TABLE BNN AS
SELECT t1.userid,
t1.education,
t1.region,
t1.hhsz,
t1.age,
t1.income,
t1.child,
t1.race,
t1.country,
t1.domain,
t1.date,
t1.product,
t1.qty,
t1.price
FROM mydata t1
WHERE t1.domain CONTAINS 'barnes';
QUIT;
/* nothing works after this */;
proc sort data=mydata.BNN; /*I tried putting BNN, the name of the table, here*/;
by userid;
run;
data count;
set mydata.BNN; /*I put BNN for barnes*/;
by userid;
cnt+1;
if first.userid then cnt=1;
run;
proc print data=count (obs=10);
title 'First 10 Observations of customer count the number of books purchased from B&N in
2007';
run;
proc sort data=mydata.BNN; /*added .BNN*/;
by domain;
run;
PROC SQL;
SELECT Domain, COUNT(userid)
FROM mydata.BNN /*added .BNN*/;
GROUP BY domain
ORDER BY domain;
QUIT;
You should post your log so we can see the errors. You try and import the same file twice, you only need one proc import the one that works.
You our call your data mydata1 so the rest of your code should use that name to start off with - yourSQL query doesn't.
First fix fix pall your references so they point to th correct table names. Don't try and use a libname until you have it all working, so your code should have any LIBNAME.TABLE_NAME at this point. If yo still have errors post your log.
Thanks for the tips. So far, I've managed to import my text file into a table, and remove all unwanted rows (i.e., with domain "amazon.com".) Now, I need to add up the number of books (variable "qty") for each customer. After that, I need to sort the results so I could find out who bought the most books. I want the top 10 customers based on their purchase amount. Any help will be appreciated. My current code is below. The raw data file is attached to my first message in this thread. Thank you.
--------------------------------------------
libname mydata '/folders/myfolders/';
proc import datafile="/folders/myfolders/books.txt" out=books;
getnames=yes;
run;
proc sort data=books;
by userid;
run;
PROC SQL;
CREATE TABLE BNN AS
SELECT t1.userid,
t1.education,
t1.region,
t1.hhsz,
t1.age,
t1.income,
t1.child,
t1.race,
t1.country,
t1.domain,
t1.date,
t1.product,
t1.qty,
t1.price
FROM books t1
WHERE t1.domain CONTAINS 'barnes'; /* Only keeping purchase records for Barnes and Nobel */;
QUIT;
I would recommend looking into proc means with a class or BY statement. Use the OUTPUT statement to save the results to a table.
As a self-professed novice, there are several issues here that you will need to address eventually. I'm going to suggest a few places to start. Don't expect that they will get you all the way to a solution.
First, on the syntax, you will need to know how to save a SAS data set permanently. There are related topics, like the LIBNAME statement, and how to retrieve a permanently saved SAS data set. But you will need to know why the program sometimes makes SAS data sets temporary and sometimes makes them permanent.
Second, regarding the planning ... the structure for your data is suspect in a few ways. You need to thoroughly clean your data before starting to analyze it. The clue that this is necessary is your WHERE condition in the initial PROC SQL. What is actually in the data that represents Barnes & Noble? Does it always contain a lower-case "b"? Are there variations in spelling (either the spacing or the capitalization)? Thas has to be fixed before any type of analysis begins. Also ... demographics rarely change from purchase to purchase. Keeping the demographic variables on each purchase expands the size of your data, slows down processing, and opens the possibility for unintended differences from one purchase to the next. A better approach would maintain two data sets. The fields that never change for a userid would be kept in one data set, and the fields that change purchase by purchase would be kept in a second data set. As needed, the data sets could be combined, matching on userid. (Combining data sets might represent another topic to be learned.)
As I mentioned, more will be needed to obtain the analysis results you requested. But (my opinion only) this would be the right place to start.
Good luck.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.