About ErinLM

ErinLM · ‎10-28-2023

Thank you for your response. I posted this over 3 weeks ago, so I have already found a solution (provided below).

data test;
  input patID testID_1 date_1 $ result_1 $ testID_2 date_2 $ result_2 $
        testID_3 date_3 $ result_3 $ testID_4 date_4 $ result_4 $;
  datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P 5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;
run;

/*Build one test/date/result string per test; n keeps the original row order*/
data test;
  set test;
  test_str_1=catx('_', OF testID_1--result_1);
  test_str_2=catx('_', OF testID_2--result_2);
  test_str_3=catx('_', OF testID_3--result_3);
  test_str_4=catx('_', OF testID_4--result_4);
  n+1;
run;

/*Restructure Data*/
DATA redo;
  SET test;
  ARRAY test_str{*} test_str: ;
  DO i=1 TO dim(test_str);
    compare=test_str{i};
    OUTPUT;
  END;
  KEEP patID compare n; /*n is needed for the ORDER BY below*/
RUN;

/*Identify and Count Duplicates*/
PROC SQL;
  DELETE FROM redo WHERE compare = '';
  CREATE TABLE dups (drop = n) AS
    SELECT *, count(*) AS count
    FROM redo
    GROUP BY patID, compare
    ORDER BY n;
QUIT;

PROC SORT DATA=dups OUT=dropdup NODUPKEY;
  WHERE count > 1;
  BY patID compare count;
RUN;

PROC TRANSPOSE DATA=dropdup OUT=final (drop=_name_);
  VAR compare;
  BY patID count;
RUN;

ErinLM · ‎10-04-2023

Hello, I'm hoping someone can point me in the right direction. I have a dataset containing up to 10 test types (in no particular order) and the corresponding test dates and test results. Unfortunately, something went wrong when our participants were uploading the data, and many tests were uploaded multiple times. I need to identify the records with duplicate entries (where the same test/date/result was entered more than once), and it would be good if I could also get a count of the duplicates for each record.

As an example, I have:

patID  testID_1  date_1    result_1  testID_2  date_2    result_2  testID_3  date_3    result_3  testID_4  date_4    result_4
1243   1         01/01/20  P         2         01/12/20  N         1         01/01/20  P         1         01/01/20  P
6495   4         05/23/20  N         1         03/04/20  P         1         03/13/20  N         6         10/12/20  P
4712   3         07/23/20  P         3         07/23/20  P         1         06/15/20  P         6         05/17/20  N
3021   2         06/09/20  N         1         06/12/20  N         4         12/01/20  P         1         06/12/20  N
8537   5         04/01/20  P         5         04/01/20  P         1         07/18/20  N         3         06/05/20  P

I want:

patID  testID_1  date_1    result_1  testID_2  date_2    result_2  testID_3  date_3    result_3  testID_4  date_4    result_4  dup_cnt
1243   1         01/01/20  P         2         01/12/20  N         1         01/01/20  P         1         01/01/20  P         3
6495   4         05/23/20  N         1         03/04/20  P         1         03/13/20  N         6         10/12/20  P         0
4712   3         07/23/20  P         3         07/23/20  P         1         06/15/20  P         6         05/17/20  N         2
3021   2         06/09/20  N         1         06/12/20  N         4         12/01/20  P         1         06/12/20  N         2
8537   5         04/01/20  P         5         04/01/20  P         1         07/18/20  N         3         06/05/20  P         2

I've created concatenated strings for each test (e.g., 1_01/01/20_P or 1_01/12/20_N) for comparison; however, that's still 10 variables that have to be compared in combination. I started to try proc compare, but I would have to list out 45 combinations (I think), and I'm still not sure how to get the count for each record. Are there better, more efficient approaches? There are lots of posts on looking for duplicates across rows, but I'm struggling to find information on identifying duplicates across columns within a row. Can anyone point me in the right direction?

Thank you,
-EM

Code for the example "have" table above, if anyone wants it:

data test;
  input patID testID_1 date_1 $ result_1 $ testID_2 date_2 $ result_2 $
        testID_3 date_3 $ result_3 $ testID_4 date_4 $ result_4 $;
  datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P 5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;
run;

data test_comp;
  set test;
  test_str_1=catx('_', OF testID_1--result_1);
  test_str_2=catx('_', OF testID_2--result_2);
  test_str_3=catx('_', OF testID_3--result_3);
  test_str_4=catx('_', OF testID_4--result_4);
run;
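For readers with the same problem, here is a minimal sketch of one way to get dup_cnt in a single DATA step. It assumes the test_comp dataset and its test_str_1 to test_str_4 variables from the code above; the output name want and the loop variables i and j are only illustrative. It counts every entry in the row that has at least one identical entry elsewhere in the same row, which appears to match the counts in the "want" table.

```sas
data want;
  set test_comp;
  array s{*} test_str_:;          /* all concatenated test strings */
  dup_cnt = 0;
  do i = 1 to dim(s);
    if missing(s{i}) then continue;   /* skip empty test slots */
    do j = 1 to dim(s);
      /* entry i counts once if any other entry in the row matches it */
      if j ne i and s{j} = s{i} then do;
        dup_cnt = dup_cnt + 1;
        leave;
      end;
    end;
  end;
  drop i j;
run;
```

Because the array is defined with the test_str_: prefix, the same step should carry over to 10 tests without listing the 45 pairwise comparisons by hand.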

ErinLM · ‎08-19-2022

Wow! I appreciate the detail, but that is a very odd interpretation of what I said. The comparison is still the original date to the 2 reference dates. The difference is that the variable pulled is an alphanumeric code. In short, instead of returning a number for the week, I'm just trying to pull the assigned code for any record with a date between the reference dates.

DateRef_min  DateRef_max  Week  Assign
12/29/2019   1/4/2020     1     1ac345
1/5/2020     1/11/2020    2     3jw743
1/12/2020    1/18/2020    3     8sz938

Thank you for your input.

ErinLM · ‎08-19-2022

This works perfectly! I apologize, my data is sensitive so I cannot provide it, hence the dummy data. I knew I had seen some proc sql code that did this, but my googling wasn't bringing up any examples referencing 2 different files. I forgot about the option to do a. and b. Thank you so much!

ErinLM · ‎08-19-2022

Thank you. I have tried to use the week function, but it doesn't work for what I need. The data I provided is just a quick dummy dataset. For my purposes, January 1 could fall within the last week of the previous year, or December 29th, 30th, or 31st could fall within the first week of the current year. The weeks are set up on a different set of rules, so I can't use this function for every date; I would have to find every exception and correct it (across 20 years). I have a file of records with dates, and a separate file that has a date range and a third parameter. I need to assign the 3rd parameter when the original date falls within the range on the reference file. I have another instance where I need to do this same process but where the 3rd parameter contains an alphanumeric code, so even if the week function worked, I would still need an alternative approach for the alphanumeric variable. I appreciate your help!

ErinLM · ‎08-19-2022

Hello! I have 2 separate files, and I need to pull a value from the second file based on conditions between both files.

File 1

ID  Date
1   1/2/2020
2   1/8/2020
3   2/24/2020
4   3/16/2020

File 2

DateRef_min  DateRef_max  Week
12/29/2019   1/4/2020     1
1/5/2020     1/11/2020    2
1/12/2020    1/18/2020    3
1/19/2020    1/25/2020    4
1/26/2020    2/1/2020     5
2/2/2020     2/8/2020     6
2/9/2020     2/15/2020    7
2/16/2020    2/22/2020    8
2/23/2020    2/29/2020    9
3/1/2020     3/7/2020     10
3/8/2020     3/14/2020    11
3/15/2020    3/21/2020    12

FILE WANT

ID  Date       Week
1   1/2/2020   1
2   1/8/2020   2
3   2/24/2020  9
4   3/16/2020  12

What I'm trying to do is: IF date >= DateRef_min & date <= DateRef_max THEN week = week, but I'm not sure how to code this across 2 different files. Can someone help me understand the best approach? Thank you!
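A minimal sketch of the kind of PROC SQL range join this describes, using a. and b. aliases for the two files. The dataset names file1, file2, and want are assumptions, and the sketch assumes the dates are stored as numeric SAS dates rather than character strings.

```sas
data file1;
  input ID Date :mmddyy10.;
  format Date mmddyy10.;
  datalines;
1 1/2/2020
2 1/8/2020
3 2/24/2020
4 3/16/2020
;
run;

data file2;
  input DateRef_min :mmddyy10. DateRef_max :mmddyy10. Week;
  format DateRef_min DateRef_max mmddyy10.;
  datalines;
12/29/2019 1/4/2020 1
1/5/2020 1/11/2020 2
1/12/2020 1/18/2020 3
1/19/2020 1/25/2020 4
1/26/2020 2/1/2020 5
2/2/2020 2/8/2020 6
2/9/2020 2/15/2020 7
2/16/2020 2/22/2020 8
2/23/2020 2/29/2020 9
3/1/2020 3/7/2020 10
3/8/2020 3/14/2020 11
3/15/2020 3/21/2020 12
;
run;

proc sql;
  create table want as
  select a.ID, a.Date, b.Week
  from file1 as a
  left join file2 as b
    /* keep the reference row whose range contains the record's date */
    on a.Date between b.DateRef_min and b.DateRef_max
  order by a.ID;
quit;
```

The left join keeps records whose date falls outside every range (Week is then missing). The same join should work unchanged if the value pulled from the reference file is a character code instead of a week number.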

ErinLM · ‎07-20-2022

Thank you so much. I actually tried this multiple times and kept getting an error. Now it works, of course. I appreciate the help!

ErinLM · ‎07-20-2022

Hello, I want to do something very simple, but it is either so simple that I can't figure it out or SAS really overcomplicates it. I have a dataset that includes multiple variables including a count and a population. I want to add the rate (per 100,000) to the dataset. So I have...

STATE  VAR1  VAR2  VAR3  CASES  POP
AL     …     …     …     100    50000
GA     …     …     …     50     100000

and I want...

STATE  VAR1  VAR2  VAR3  CASES  POP     RATE
AL     …     …     …     100    50000   0.2
GA     …     …     …     50     100000  0.05

I can't find a code example that does this. It shouldn't be hard, but nothing works. A reference or sample code would be much appreciated. Thank you in advance.
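A minimal sketch of the calculation in a DATA step, assuming the input dataset is called have (a hypothetical name) and leaving out VAR1 to VAR3 for brevity. Note that a true per-100,000 rate for these rows is 200 and 50; the 0.2 and 0.05 shown in the table correspond to a multiplier of 100, so adjust the constant to whichever scale is intended.

```sas
data have;
  input STATE $ CASES POP;
  datalines;
AL 100 50000
GA 50 100000
;
run;

data want;
  set have;
  /* guard against a zero or missing population */
  if POP > 0 then RATE = CASES / POP * 100000;   /* AL: 200, GA: 50 */
run;
```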

ErinLM · ‎04-01-2022

Thank you so much! I'm checking it now, but it looks like it worked. I've never seen the "declare hash" code that you used, but it seems to have worked brilliantly! I really appreciate your time, and thanks for teaching me something new!

ErinLM · ‎04-01-2022

Sorry, that was a typo. I've corrected the original post.

ErinLM · ‎04-01-2022

Hello SAS Community, I am trying to do something that I think should be easy, but I have not stumbled upon the right search terms to find an example. I have 2 datasets. The first is my set of records. The second is a reference dataset. For example:

Dataset 1:

RECORD  STATE  COUNTYFIPS
001     AL     005
002     AL     005
003     AL     005
004     AL     133
005     AK     020
006     AK     020
007     AK     020

DATASET 2: a reference dataset containing the STATE, FIPS, and the rural/urban code for all 50 states.

STATE  COUNTYFIPS  RU_CODE
AL     001         3
AL     003         4
AL     005         6
. . .
AL     133         4
AK     013         6
AK     016         6
AK     020         3
. . .

DATASET NEED:

RECORD  STATE  COUNTYFIPS  RU_CODE
001     AL     005         6
002     AL     005         6

I know I have to use a condition where the STATE and FIPS match between the 2 datasets, but how do I pull over just the RU_CODE that matches for each record when I have over 17K records across 50 states? My dataset has dozens of records that are in the same state and county. Can someone point me in the right direction?
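A minimal sketch of a hash-object lookup for this kind of match, in the spirit of the "declare hash" answer mentioned in the reply. The dataset names records, ref, and need are assumptions, and COUNTYFIPS is read as character so the leading zeros survive; the actual accepted answer may have differed.

```sas
data records;
  input RECORD $ STATE $ COUNTYFIPS $;
  datalines;
001 AL 005
002 AL 005
004 AL 133
005 AK 020
;
run;

data ref;
  input STATE $ COUNTYFIPS $ RU_CODE;
  datalines;
AL 005 6
AL 133 4
AK 020 3
;
run;

data need;
  if _n_ = 1 then do;
    if 0 then set ref;                 /* defines RU_CODE without reading any rows */
    declare hash h(dataset: 'ref');    /* load the reference table once */
    h.defineKey('STATE', 'COUNTYFIPS');
    h.defineData('RU_CODE');
    h.defineDone();
  end;
  set records;
  /* find() returns 0 on a match and fills RU_CODE; otherwise clear it */
  if h.find() ne 0 then call missing(RU_CODE);
run;
```

Many records sharing a state and county are not a problem here: each record does its own lookup against the single reference row for that key.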

ErinLM · ‎03-17-2022

This isn't a finished, cleaned, stock program for routine processing. I suspect this was done because the programmer was testing the code/variables in pieces. The raw dataset is 20K records by 300 or so variables. As each variable or set of variables was cleaned and tested, they created a new data step to clean and test the next set. There are about a dozen data steps like this, followed by proc freqs/means/prints, working through a handful of variables at a time. Overall, the program is about 2K lines of code. There is value in revisiting the blocks of code, as I'm still cleaning and analyzing variables. There are still dozens of variables to go. Of course, when I find a mistake in one of the data steps and need to compare between the previous and current dataset, I have to reset the data from the beginning and start over (rare). I'm mostly just trying to determine if I need to take the time to restructure the code. I understand why it was done in blocks. It's far easier to read, troubleshoot, and debug small sets of code/variables instead of running one giant data step and then having to go back and forth between the data step and the output code. But I've had other programmers tell me that block coding like this is poor practice and that I should restructure it or break it out into separate programs. I don't know. I'm just trying to make sure I don't continue something that is poor practice or could cause a problem; however, if there is a good reason for it, I don't want to spend time fussing with it.
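One possible way to keep the overwrite-in-place blocks while still being able to compare the previous and current dataset is to snapshot the data before a block and run PROC COMPARE afterwards. This is only a sketch: work.before_step is a hypothetical name, and work.data_copy is assumed to already exist from the earlier copy step.

```sas
%let copy = work.data_copy;

/* Snapshot taken before the block under test */
data work.before_step;
  set &copy.;          /* the trailing period ends the macro variable name */
run;

/* The block itself still overwrites the working copy in place */
data &copy.;
  set &copy.;
  /* cleaning code for this set of variables goes here */
run;

/* Report what the block changed */
proc compare base=work.before_step compare=&copy.;
run;
```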

ErinLM · ‎03-17-2022

Quick question: I am working with code someone else wrote. The programs have multiple data steps, and the previous coder used the following coding scheme. My question is, why did this person use "DATA &copy; SET &copy;", essentially setting the new dataset the same as the old each time? Standard convention, as I was taught, is to use a new name for each data step. So I don't know if this is just a more advanced way to code or if there are potential benefits, drawbacks, etc. to this methodology. I can think of some problems that could occur.... The file is for cleaning data, so it is broken up into dozens of data steps with procs for testing in between, if that helps.

* Create copy of dataset ;
DATA data_copy;
  SET IN.data;
RUN;

* Set new SAS dataset name as macro variable ;
%LET copy = work.data_copy;

DATA &copy;
  SET &copy;
  <code here>;
RUN;

DATA &copy;
  SET &copy;
  <code here>;
RUN;

Date Last Visited	‎11-01-2023 01:06 PM

Re: Counting duplicates across columns, but within rows

Counting duplicates across columns, but within rows

Re: Conditional coding across files

Re: Conditional coding across files

Re: Conditional coding across files

Conditional coding across files

Re: How to calculate a simple rate

How to calculate a simple rate

Re: transfer value based on conditions between 2 datasets

Re: transfer value based on conditions between 2 datasets

transfer value based on conditions between 2 datasets

Re: Data step naming convention

Data step naming convention