Forgive me, I am a self-taught user so if I use language that is not the "norm," that's why. I'm using 9.2
Here's my issue. I recently needed to pull a specific set of records from a relational database. I needed phone numbers and addresses for customers who had made a complaint in late 2013 through the end of 2015. I first went into one specific table (complaints) and pulled complaint ID numbers based on the date the complaint was received (October 1, 2013 to December 31, 2015). These complaint numbers are unique identifiers that reside on most, but not all, tables in this database. Because I needed data from other tables residing in the same database, I then used the complaint ID numbers to pull data from these other tables so that I would end up with the same subset of customer complaints for each additional set of data I drew. Because there were more than 1,000 complaint IDs, I ran 999 of these IDs against the table first and created a file (person1) and then ran the remaining 222 IDs against the same table to produce a second file (person2). When I merged person1 and person2, I lost 41 records, i.e., 41 complaint IDs. I can see the IDs in my syntax. They are there but these same 41 complaints IDs and accompanying data do not show up in the resulting dataset.
Here's the syntax. Because including the 1,221 IDs would make this long and cumbersome, I didn't include them. I did reference where they resided within the syntax.
Step 1: This is the syntax to pull the complaint IDs:
data complaintid (keep=complaint_id);
set database.complaint_vw;
if date_reported >= '1oct2013:00:00:00'dt
& date_reported <='31dec2015:00:00:00'dt ;
if complaint_type ne 99;
if inactive ne 'Y';
run;
Step 2: Using the complaint IDs from Step 1 (I exported the data into Excel then copied and pasted the complaint IDs into my syntax), I start pulling records from another table.
data person1 (keep = case_number complaint_id person_id date_reported);
set database.complaint;
where complaint_id = 168
or complaint_id = 174
or complaint_id = 176
or complaint_id = 177 etc. etc., 999 in all ;
run;
data person2 (keep = case_number complaint_id person_id date_reported);
set database.complaint;
where complaint_id = 678
or complaint_id = 889
or complaint_id = 989 etc. etc., 222 in all ;
run;
proc sort data = person1;
by person_id;
run;
proc sort data = person2;
by person_id;
run;
Now, right here is where I lose my 41 records. They simply do not make it through this merge. They are in person1 before the merge.
data personfinal;
merge person1 person2;
by person_id;
run;
Any ideas? Thank you in advance.
I would say that about here is the start of any issue:
Step 2: Using the complaint IDs from Step 1 (I exported the data into Excel then copied and pasted the complaint IDs into my syntax), I start pulling records from another table.
I suspect that a missing id or two is the most likely because of the sheer number of things you had to type.
At this point what you are apparently looking to do is to JOIN tables on a criteria. Typically this is the work of Proc SQL after you identify the base records.
Proc sql;
create table Ids as
select distinct complaint_id
from complaintid;
quit; /*Note QUIT is used with Sql not run*/
will generate a data set with the unique values of complaint id. If you need unique combinations of complaint_id and other variables (such as type of complaint) then
select distinct complaint_id, typecomplaint /* note the comma, Sql uses LOTS of commas*/
Then you would join this information with another table
Proc Sql;
create table somethingcombined as
select b.*
from Ids left join someotherdataset as b on
Ids.complaint_id = b.complaint_id;
quit;
But the question is why did you have to split the ids into two groups?
In the future if you find yourself doing something like this:
where complaint_id = 168
or complaint_id = 174
or complaint_id = 176
you should be aware of the IN operator to look at list of values
Where complaint_id in (168 174 176 ...);
I would say that about here is the start of any issue:
Step 2: Using the complaint IDs from Step 1 (I exported the data into Excel then copied and pasted the complaint IDs into my syntax), I start pulling records from another table.
I suspect that a missing id or two is the most likely because of the sheer number of things you had to type.
At this point what you are apparently looking to do is to JOIN tables on a criteria. Typically this is the work of Proc SQL after you identify the base records.
Proc sql;
create table Ids as
select distinct complaint_id
from complaintid;
quit; /*Note QUIT is used with Sql not run*/
will generate a data set with the unique values of complaint id. If you need unique combinations of complaint_id and other variables (such as type of complaint) then
select distinct complaint_id, typecomplaint /* note the comma, Sql uses LOTS of commas*/
Then you would join this information with another table
Proc Sql;
create table somethingcombined as
select b.*
from Ids left join someotherdataset as b on
Ids.complaint_id = b.complaint_id;
quit;
But the question is why did you have to split the ids into two groups?
In the future if you find yourself doing something like this:
where complaint_id = 168
or complaint_id = 174
or complaint_id = 176
you should be aware of the IN operator to look at list of values
Where complaint_id in (168 174 176 ...);
Thank you! I will defintely teach myself proc SQL and will use the IN operator moving forward.
Despite there being 1,221 IDs, I didn't miss any in my copy/paste. The IDs are all in the syntax, just not in the dataset.
The reasoin I split this into two was because when I tried to use all 1,221 in one pass, SAS gave me an error saying it could only accept 1,000. I figured there was a solution to that but I didn't have time to look into it so went with the MERGE idea (which, obviously, did not work).
Hi Noelle,
I'd like to make two suggestions. Feel free to use them or not:
Using either would help clean up the code. They would also help make it more clear what's going on and why the results aren't how you expect.
I hope this helps!
Thank you! I will definitely use the IN operator moving forward and will learn OUTPUT. I have done stuff like this in the past, just not this many, where I pulled data on a specific ID and those long, long lists are a pain to scroll through. Tips like this help me a lot.
Thanks again!
The problem is that you were merging on person_id. If you have one person with two complaints, they will only have one record in the personfinal dataset. You have 41 (or fewer) people with multiple compliants.
If you want one record per complaint, then use a SET statement rather than a MERGE in the last DATA step.
By the way, this is an ideal application for PROC SQL. You can do the entire combination in one step. There are some good books for learning PROC SQL.
Interestingly, there were many people with multiple complaints. It might be 41 or fewer but I had this happen - one person had six complaints within my designated timeframe but only five of the six complaints showed up after the merge.
In the meantime, I will try using SET but it really sounds like proc SQL is what I need to learn.
Thank you so much!
Here's another reason to learn SQL. Consider all the work you did pulling the data into Excel, formatting the WHERE condition ... compared with:
proc sql;
create table person1 as select case_number, complaint_id, person_id, date_reported from database.complaint
where complaint_id in (select distinct complaint_id from COMPLAINT_ID);
quit;
My SQL isn't great, so it's possible you will need to debug this. But SQL makes it easy to extract from one data set based on extracted data from another data set. (Just to differentiate, COMPLAINT_ID in uppercase is the data set name, while complaint_id in lowercase is the variable name. SAS won't care about upper vs. lower case in this example.)
Good luck.
Thank you! I'm on Amazon right now.
Here is an example showing how some complaint IDs from dataset PERSON1 were lost (more precisely: were overwritten) due to the inappropriate MERGE by person ID (as pointed out by @Doc_Duke):
data p1;
input complaint_id person_id;
cards;
12 1
34 1
45 2
56 2
60 3
;
data p2;
input complaint_id person_id;
cards;
72 1
80 1
84 2
97 3
;
data p_final;
merge p1 p2;
by person_id; /* strongly discouraged! */
run;
proc print data=p_final noobs;
run;The number of distinct complaint IDs drops from 9 to 5. Maybe the following log messages (from the MERGE step) could have warned you:
INFO: The variable complaint_id on data set WORK.P1 will be overwritten by data set WORK.P2. NOTE: MERGE statement has more than one data set with repeats of BY values.
Only "maybe", because the INFO message is printed only if system option MSGLEVEL is set to I (or i), which I recommend:
options msglevel=I;And the NOTE is triggered only by cases where duplicate key values (here: person_id values) occur in both datasets in the same BY group (i.e. with the same person_id) as is the case for person_id 1 in the above example. The other two person IDs demonstrate that this is not the only situation where complaint IDs are overwritten.
Thank you so very much for the explanation of what probably happened. The solutions will help me moving forward but I was still wondering why merge didn't work in the first place so thank you for this.
Also, I'm going to set that message level to I and see what happens.
Thanks again!
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
