Hello friends!
First of all, I'm a newbie in the SAS world and I'm starting my journey recently, so sorry if I don't explain very well what I need, an thanks for all support.
I have a table with various customer purchase data, divided by month. What I need is to see if the customer_Id of the cost_set20 table appears in the subsequent tables (nov20, dec20, jan21, [...] aug21)
And then whenever it appears I need to mark in another table the client_id, the last month it appeared, and the number of times it appeared in the query that was made.
The truth is that I have no idea how to do this, I understand the logical sequence, but I wouldn't know how to assemble a query that would do this for me.I have a test coming soon for a job in the area and I wanted to get there by understanding how to do this. Thank you all
It would be helpful if you can show us a small portion of one of these data sets, and the exact data set names of ALL of the relevant data sets. Also, I think the answer will depend on how many customers you have (approximately), the answer for millions of customers might be different than if you are talking about 25 customers.
Come on, I'll try to gather the data here...
For starters, my initial base (customers who bought in September 2021) has approximately 555,000 lines and the set of customers between October 2021 and August 2022 has 8 million lines divided between these months.
First we have my acrdbase_mpr_exp base table with the September purchases. it has some fields that we don't have in the others, but the customer identification field (customer_Id) we will have in all of them. Only the date part needs attention as it is in DDMMYYYY format.
The customer tables in the following months are:
WORK.ACRD_OCT21 WORK.ACRD_NOV21 WORK.ACRD_DEC21 WORK.ACRD_JAN22 WORK.ACRD_FEB22 WORK.ACRD_MAR22 WORK.ACRD_APR22 WORK.ACRD_MAY22 WORK.ACRD_JUN22 WORK.ACRD_JUL22 WORK.ACRD_AUG22
And they look like
I believe you just need to cross the customer Id of the September table with the same data in the others to identify if the same customers returned, but I would still need to check how many times this customer appears
Do you really need to add variables?
If what you really want to know how many times a customer appears in each month you don't need to do any comparisons or complex look up.
Might take a little while depending on the size of the data sets. The statement Set Work.accrd_: ; will use all the data sets that start with the "Accrd_" as a list. If you have other similar names that you do not want then you will need to list the sets by name.
The proc summary will group records by client_id and the month value of dt_checkout and the output data set will have a variable _freq_ with the count.
data combined; set work.accrd_: ; run; proc summary data=combined nway; class client_id dt_checkout; format dt_checkout yymon.; output out=want (drop=_type_); run;
In the long run you may find that naming your data sets with a year then month number, such as Accrd_202204 for Apr 22 is preferred in the long run. For one thing the numbers will sort properly. Second you can use lists such as Set Accrd_202202- Accrd202210 ; to get the February through October data, or Accrd_2022: to get all the sets for just the year 2022.
Caution: combining data is this manner will fail if you have like named variables with different data types and may generate notes about different lengths encountered if the same variable in different sets have different lengths.
To put into my own words what @ballardw is proposing, one long data set (which is obtained by set work.accrd_: ; ) is a much better choice to do this analysis — and to do most analyses — than trying to do lookups on many tables in some sort of loop.
Or to put this into other people's words, see Maxim 19.
There are multiple ways to do this;
1. For each table summarize per month how much each ID was used, basically calculate the summary statistic for each month. Name each column the month plus some identifier for the stat.
2. Left join these tables together using the First table as the LEFT table to keep only ID's from that table. The Summary stats become columns months.
3. Use an Array and a a data step to identify the last month.
Pros: Step by step easy to do each step
Cons: repetitive as you're doing the same thing multiple times for different tables. You could reduce this with macros but being a beginner you should start with manual code first.
A more SAS method:
1. Create a view that appends all tables together
2. Summarize the view into the summary stats per month
3. Filter for only ID's in the first table
4. Calculate the stats needed (last month) and number of query's
Pros: Code is simpler to run
Cons: With larger data sets this is harder to test if you're not already familiar with SAS coding.
FYI - if you present data as images we'd have to type that out to type out a solution. If you cannot present actual data in a data step, including fake data as a table is preferable. Generic questions without data will get generic answers without code typically.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.