About takpdpb7

ballardw · ‎11-12-2020

If a BY variable (or group) has more than one record in one data set, then the matches from the other set will get duplicated. If both data sets have more than one record with the same BY variable values, the result can be somewhat unpredictable and is generally not what you want. If this is the case then you may need to provide example data and what you expect for the result, as the techniques are likely to require more than a simple merge.
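One way to find out which situation applies is to count records per BY group in each set before merging. The following is only a sketch: the data sets ONE and TWO, the key variable ID, and all values are made up for illustration.

```sas
/* Hypothetical data sets ONE and TWO sharing the key variable ID */
data one;
  input id $ x;
  datalines;
A 1
A 2
B 3
;
run;

data two;
  input id $ y;
  datalines;
A 10
A 20
B 30
;
run;

proc sql;
  /* Key values that occur more than once in ONE */
  create table dup_one as
    select id, count(*) as n_records
    from one
    group by id
    having count(*) > 1;

  /* Key values that occur more than once in TWO */
  create table dup_two as
    select id, count(*) as n_records
    from two
    group by id
    having count(*) > 1;

  /* Keys repeated in BOTH sets: the many-to-many cases */
  create table many_to_many as
    select a.id, a.n_records as n_one, b.n_records as n_two
    from dup_one as a inner join dup_two as b
      on a.id = b.id;
quit;
```

If MANY_TO_MANY comes back empty, a plain MERGE with BY should behave as a one-to-one or one-to-many match. If it has rows, those are the BY groups where a simple merge is unlikely to give the desired result.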

ballardw · ‎08-06-2020

Any procedure or data step code that uses a BY statement requires all of the data involved to be sorted BY those variables. ALL the sets. So sort WORK.ALLBIRTHS.

Multiple lengths of character variables are common when data come from different sources. The easiest fix, in any data step giving you that message, is to place a LENGTH statement before any SET, MERGE, UPDATE or MODIFY statement that uses the sets, to specify the length to use. The length should be at least as long as the longest defined length for the variables used. Proc Contents can tell you that if you don't know how to check.

For future reference, when asking questions about error or warning messages, copy from the LOG the entire procedure or data step code that generates the message, along with all notes, warnings and errors. Then on the forum paste the code and messages into a code box opened with the </> icon to preserve formatting. This is important because many errors come with diagnostic characters that show where the issue occurred, but the main message windows on this forum will reformat text and the diagnostics won't appear as they should. Plus the code boxes set things apart.
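A minimal sketch of both fixes together, assuming two made-up data sets (SHORT and LONG) in which the BY variable NAME has different defined lengths. None of these names come from the original thread.

```sas
/* Hypothetical sets: NAME is $8 in SHORT and $20 in LONG */
data short;
  length name $8;
  name = 'Smith';
  a = 1;
run;

data long;
  length name $20;
  name = 'Smith';
  b = 2;
run;

/* Check the defined lengths if you are not sure of them */
proc contents data=short; run;
proc contents data=long;  run;

/* Every set used with BY must be sorted by the BY variables */
proc sort data=short; by name; run;
proc sort data=long;  by name; run;

data combined;
  length name $20;   /* placed BEFORE the MERGE; at least the longest length */
  merge short long;
  by name;
run;
```

The LENGTH statement has to come first because the first definition the data step encounters for a variable sets its length in the output.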

mklangley · ‎07-27-2020

data have;
  sas_date = '01JAN2020'd;
  dob = sas_date;
  dob2 = sas_date;
  format dob date9. dob2 mmddyys10.;
run;

Reeza · ‎06-08-2020

Ideally you would connect SAS directly to your DB so you avoid any data type issues when importing/exporting the data.
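As a rough sketch of what a direct connection can look like, the step below uses a LIBNAME statement with the ODBC engine. It assumes SAS/ACCESS Interface to ODBC is licensed and that a data source is already configured on the machine. MYDSN, MYSCHEMA and BIRTHS_TABLE are hypothetical placeholders, and your site may also require credentials (for example USER= and PASSWORD=) or a different engine entirely.

```sas
/* Assumption: SAS/ACCESS Interface to ODBC is available and a data  */
/* source named MYDSN exists. All three names are placeholders.      */
libname mydb odbc datasrc=MYDSN schema=MYSCHEMA;

/* Read the database table like any other SAS data set */
data work.births;
  set mydb.births_table;
run;

/* Release the connection when finished */
libname mydb clear;
```

Reading through a libref lets the engine map database column types to SAS types, which avoids the guessing that happens when the data pass through Excel or text exports.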

ballardw · ‎05-20-2020

The first problem I see is in this block of code:

*Match based on First Name, Last name and DOB;
data Names;
  merge BAB2 allbirths;
  by MOTHER_GNAME MOTHER_LNAME MDOB;
  if CaseID="" then delete;
  if SFN_NUM=" " then delete;
run;
proc sort data=BAB2; by MOTHER_GNAME MOTHER_LNAME MDOB; run;
proc sort data=allbirths; by MOTHER_GNAME MOTHER_LNAME MDOB; run;
*Match based on First AND Last name;
data Names;
  merge BAB2 allbirths;
  by MOTHER_GNAME MOTHER_LNAME MDOB;
  if CaseID="" then delete;
  if SFN_NUM=" " then delete;
run;

The second data step overwrites the first version of Names, and nothing shown in between preserves or uses it. So you do not have a match on all three fields.

I have done similar processes, and as soon as you identify a match the matched records need to be removed from BOTH data sets to avoid re-matching already matched persons. Since all of your merges involve the exact same data sets, when Mary Jones 1/1/1990 matches in the BAB2 and Allbirths sets, BOTH of those records are still available. So they BOTH match Mary Smith 1/1/1990 in your First name and DOB match. And then you stack the result sets, multiplying the matches.

Here is an example of removing the matched values and creating reduced sets in a merge.

*Match based on First Name, Last name and DOB;
data matchNames1 bab2only allbirthonly;
  merge BAB2 (in=inbabs) allbirths (in=inall);
  by MOTHER_GNAME MOTHER_LNAME MDOB;
  if inbabs and inall then output MatchNames1;
  else if inbabs then output bab2only;
  else output allbirthonly;
run;

The IN= data set option creates a temporary variable that is 1 (true) when the current record from that set contributes to the merge and 0 otherwise. So when all of the IN= variables are true, all of the data sources contributed and you have a full match. You can create multiple data sets in a single pass. The explicit OUTPUT tells SAS when to write to which set name.

You likely need a Keep or Drop associated with one or both of Bab2only and Allbirthonly so their structure stays the same (only the variables they had to begin with). I would suggest a similar step at each merge, creating a different Match, Bab2only and Allbirthonly set at every step.

Something else to consider is how Proc Import works with Excel files. It will only examine 20 rows of data by default before assigning variable type (character or numeric) and length. That can lead to truncation of names: if the first 20 last names are all 20 characters or fewer and then you have a long name like Cartwright-Chickering (21 characters), at least the "g" at the end would get truncated. It would be better to open the file in your spreadsheet and then save it as CSV to import. Then use the option GUESSINGROWS=MAX; so the entire file is read before the length of variables is set.

With your current example, where you are setting names to 13 characters, I am also wondering if that length is appropriate. I have some pretty small files, only a few hundred people, and 25 characters for each of first and last name currently leaves me with 2 unused characters (i.e., I have names of 23 characters). My code was intended to allow some slack so that I wasn't likely to have issues when I read later data, and I picked 25 when the longest name I had in the data was 17 characters. I'm glad I picked something on the order of 25, as otherwise I would have a slew of sets with potential issues of mismatched name lengths.

But I am extremely leery of your First Name and DOB match as a blind faith match unless both of your sets are VERY small. And you may have to use a manual step or two at the end to find things where spelling errors have crept in ("Marry" instead of "Mary") or nicknames were used ("Bobby" instead of "Roberta"). Or you can look for other tools that do probabilistic matching. The CDC website has a tool named LINKPLUS that you can download that will take two text files and do the matches. One nice thing is that you don't have to change variable "names", as you can tell it to match "FirstName" in one set to "NameFirst" in the second set. The result file will match through a hierarchy of variables you specify (including address fields if you have them, even if just postal code or city name) and give a probability of matches between values.
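To illustrate the idea of removing matched records before the next pass, here is a small self-contained sketch. The variable names follow the post, but the data values, the output set names (match1, bab2_left, all_left and so on) and the assumption that CaseID belongs only to BAB2 and SFN_NUM only to ALLBIRTHS are mine, so adjust the KEEP= lists to your real variables.

```sas
/* Tiny made-up data; real sets would have more variables */
data bab2;
  length MOTHER_GNAME MOTHER_LNAME $25 CaseID $8;
  input MOTHER_GNAME $ MOTHER_LNAME $ MDOB :date9. CaseID $;
  format MDOB date9.;
  datalines;
Mary Jones 01JAN1990 C1
Mary Smith 01JAN1990 C2
;
run;

data allbirths;
  length MOTHER_GNAME MOTHER_LNAME $25 SFN_NUM $8;
  input MOTHER_GNAME $ MOTHER_LNAME $ MDOB :date9. SFN_NUM $;
  format MDOB date9.;
  datalines;
Mary Jones 01JAN1990 S1
Mary Smyth 01JAN1990 S2
;
run;

/* Pass 1: first name, last name and DOB */
proc sort data=bab2;      by MOTHER_GNAME MOTHER_LNAME MDOB; run;
proc sort data=allbirths; by MOTHER_GNAME MOTHER_LNAME MDOB; run;

data match1
     bab2_left (keep=MOTHER_GNAME MOTHER_LNAME MDOB CaseID)
     all_left  (keep=MOTHER_GNAME MOTHER_LNAME MDOB SFN_NUM);
  merge bab2 (in=inbabs) allbirths (in=inall);
  by MOTHER_GNAME MOTHER_LNAME MDOB;
  if inbabs and inall then output match1;
  else if inbabs then output bab2_left;
  else output all_left;
run;

/* Pass 2: first name and DOB, using ONLY the pass 1 leftovers */
proc sort data=bab2_left; by MOTHER_GNAME MDOB; run;
proc sort data=all_left;  by MOTHER_GNAME MDOB; run;

data match2
     bab2_left2 (keep=MOTHER_GNAME MOTHER_LNAME MDOB CaseID)
     all_left2  (keep=MOTHER_GNAME MOTHER_LNAME MDOB SFN_NUM);
  /* Rename the last name from one side so both spellings survive */
  merge bab2_left (in=inbabs)
        all_left  (in=inall rename=(MOTHER_LNAME=BIRTH_LNAME));
  by MOTHER_GNAME MDOB;
  if inbabs and inall then output match2;
  else if inbabs then output bab2_left2;
  else do;
    MOTHER_LNAME = BIRTH_LNAME;  /* restore the original variable name */
    output all_left2;
  end;
run;
```

Because Mary Jones is removed from both sets in pass 1, she cannot pair with the Smith/Smyth records in pass 2. Keeping both MOTHER_LNAME and BIRTH_LNAME in match2 makes it easier to review the looser matches by eye.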

Date Last Visited	‎11-12-2020 07:15 PM

Re: Unsure about which "BY" variables to use?

Unsure about which "BY" variables to use?

Merging Excel datafile and 2 SAS datafiles?

How to convert datew. format to mmddyyw. format?

Re: Change the date format (weird data)

Change the date format (weird data)

Merging datasets

Re: Merging Excel datafile and 2 SAS datafiles?

Re: How to convert datew. format to mmddyyw. format?

Re: Change the date format (weird data)

Re: Merging datasets