About fwashburn

fwashburn · ‎07-17-2018

Astounding - thanks for taking a look. A UID can repeat up to 9 times, and each time it repeats, it can have up to 171 variables. Among the 171 other variables, every one is in a character format (though some appear as numeric), because there is a potential for leading zeroes. I may be misunderstanding your statement on the order not mattering, but it definitely matters which values match with the UID. My end goal is to create a comprehensive dataset with one record for one UID, and all unique nonmissing otherids and anotherids listed out. I don't think that there will be more than a few hundred total, and I expect that on average, there will be maybe 5 otherids and 10 anotherids per UID; it's just getting them into the proper structure that's stumping me.

fwashburn · ‎07-17-2018

BallardW: Thank you for your framing of the problem - I know this is a crazy conundrum and I wish it weren't so complicated; it'd sure make my life a lot easier! Each UID repeats up to 9 times in the 5 million+ record dataset, with the majority of UIDs occurring 1 to 5 times. The 171 non-UID variables comprise 2 sets of other unique identifiers (90 potential values for otherid; 81 potential values for anotherid) that are attached to a UID. A UID may appear twice and have 80 otherid's that are the same in both records, but 10 otherid's that are different. Basically I want my end dataset to show one record for one UID, with all unique otherid and anotherid values listed out to the right. But I'm not certain that this is possible without some very verbose code.

fwashburn · ‎07-17-2018

Tom, thank you so much for this information. I've seen documentation on the UPDATE statement; however, it seems that I would need to write out many lines of code just to make sure that none of my 171 non-UID variables get overwritten in the process (based on what I read here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202975.htm). While I'm not opposed to writing out a lot of code, it does introduce plenty of room for user error. Do you know if there is a more user-friendly approach?

fwashburn · ‎07-17-2018

Hi all! I'm working in SAS 9.4 with a very large dataset (5 million plus records, 172 columns) that has bits and pieces of information scattered throughout. The unique ID (I'll call UID) is repeated on multiple rows. I want to collapse all instances of a UID into one row, keeping any nonmissing, unique values, without overwriting any other nonmissing, unique values. After hours of searching for solutions, I found this: http://support.sas.com/kb/32/288.html - but alas, it's only for Enterprise Guide. This note describes exactly what I'm trying to do though. Here's a very simplified example of my messy dataset:

data have;
  input uid $ otherid1 $ otherid2 $ anotherid1 $ anotherid2 $;
  datalines;
xyz$3tyu 00012345 00123456 03456 78901
xyz$3tyu 00012345 01204789 34512 78901
;
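One possible way to get there without typing out all 171 variable names is to go long, de-duplicate, and go wide again. The sketch below is only that, a sketch: it assumes the ID columns all start with the prefixes otherid and anotherid (as in the sample above), that $8 is wide enough for the values, and that the helper names long, numbered, type, value, seq, and varname (all made up here) do not clash with anything in the real data.

```sas
/* Step 1: reshape to long form, one row per nonmissing ID value.    */
/* Assumption: ID columns share the prefixes otherid and anotherid.  */
data long;
    set have;
    array o[*] otherid:;
    array a[*] anotherid:;
    length type $9 value $8;   /* $8 fits the sample; widen for real data */
    type = 'otherid';
    do i = 1 to dim(o);
        if not missing(o[i]) then do;
            value = o[i];
            output;
        end;
    end;
    type = 'anotherid';
    do i = 1 to dim(a);
        if not missing(a[i]) then do;
            value = a[i];
            output;
        end;
    end;
    keep uid type value;
run;

/* Step 2: keep each distinct value once per UID and ID type. */
proc sort data=long nodupkey;
    by uid type value;
run;

/* Step 3: number the surviving values within each UID and type */
/* to build the target column names (otherid1, otherid2, ...).  */
data numbered;
    set long;
    by uid type;
    length varname $32;
    if first.type then seq = 0;
    seq + 1;
    varname = cats(type, seq);
    drop seq;
run;

/* Step 4: back to one row per UID. */
proc transpose data=numbered out=want(drop=_name_);
    by uid;
    id varname;
    var value;
run;
```

With the two sample rows above, this should leave a single xyz$3tyu record carrying three distinct otherid values and three distinct anotherid values. Note that the values get renumbered after sorting, so a value that started in otherid2 will not necessarily end up in otherid2.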

fwashburn · ‎10-17-2017

Hi Reeza! Sorry for being unclear. What I want as output is something like this:

Obs dovisit person_id sex nvisit fvisit avisit nvisit_2 fvisit_2 avisit_2 nvisit_3 fvisit_3 avisit_3 nvisit_4 fvisit_4 avisit_4 nvisit_4 fvisit_5 avisit_5 nvisit_5 fvisit_6 avisit_6 nvisit_6
1 17198 212 F T T T T F T
2 17542 265 M F F T
3 17176 365 M T F F T F F
4 17518 444 M T F F T F F T F F F F T T F F T T T T

(I realized that I should've had the same dates for same IDs in my sample code, as the data I'm actually working with has multiple variables for the same ID AND same date.)

How would I use PROC TRANSPOSE with a BY statement here, more specifically? I'm not sure where to start. If I just use the following:

proc transpose data=visits;
  by person_id;
run;

It gives me the same result as the original code I posted.
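For what it's worth, one common pattern for this kind of layout is a double transpose: number the rows within each ID, transpose to long form, build the suffixed names, and transpose back to wide. The following is a rough sketch against the visits sample data from the original post, not a tested solution for the real file. It assumes sex is constant within a person_id, and the names sorted, numbered, long, named, wide, n, and newname are invented here for illustration. It does not remove exact duplicate rows and does not carry dovisit into the output.

```sas
/* Number each visit within a person, in date order. */
proc sort data=visits out=sorted;
    by person_id dovisit;
run;

data numbered;
    set sorted;
    by person_id;
    if first.person_id then n = 0;
    n + 1;
run;

/* First transpose: one row per person, visit number, and variable. */
/* Assumption: sex never changes within a person_id.                */
proc transpose data=numbered out=long;
    by person_id sex n;
    var nvisit fvisit avisit;
run;

/* Build target names: nvisit, nvisit_2, nvisit_3, ... */
data named;
    set long;
    length newname $32;
    if n = 1 then newname = _name_;
    else newname = cats(_name_, '_', n);
run;

/* Second transpose: back to one row per person. */
proc transpose data=named out=wide(drop=_name_);
    by person_id sex;
    id newname;
    var col1;
run;
```

Because the suffix comes from the row counter rather than from hard-coded rename pairs, the same code handles one visit or seven without modification.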

fwashburn · ‎10-17-2017

Howdy folks! I have a large dataset with IDs that have anywhere from one to seven observations, and I'm trying to collapse them into one per ID. It's from an Excel file, but I'll try to mimic some of the data here for the sake of the following code:

data visits;
  input dovisit date9. person_id sex :$1. nvisit :$1. fvisit :$1. avisit :$1.;
  datalines;
18dec2007 444 M T F F
18dec2007 444 M T F F
20dec2008 444 M F F T
23apr2009 444 M T T T
31mar2010 444 M F F F
10jan2007 365 M T F F
10jan2007 365 M T F F
11jan2008 265 M F F T
01feb2007 212 F T T T
01feb2007 212 F T F T
;
run;

/*create a data set of the duplicates using DUPOUT= option */
proc sort data=visits dupout=visits_dup nodupkey;
  by person_id;
run;

/* Create a macro variable with the variable names that are to */
/* be merged. The variables considered BY variables are excluded */
/* from going into the macro variable using the NOT IN operator. */
/* The resulting macro variable is in the format varname=varname_2 */
proc sql noprint;
  select trim(name) || '=' || trim(name) || '_2'
    into :varlist separated by ' '
    from DICTIONARY.COLUMNS
    WHERE LIBNAME EQ "WORK" and MEMNAME EQ "VISITS"
      and upcase(name) not in ('PERSON_ID' 'DOVISIT' 'SEX');
quit;

/*Merge the two data sets using the macro variable to rename the */
/*common variables in the second (duplicates) data set. */
data merged;
  merge visits visits_dup (rename=(&varlist));
  by person_id;
run;

proc print;
run;

Now what I want to modify this code to do is to have columns nvisit, fvisit, avisit, nvisit_2, fvisit_2, avisit_2, nvisit_3, fvisit_3, avisit_3, and so on, all the way to _7, but when I try to modify the code, specifically this line:

  select trim(name) || '=' || trim(name) || '_2'

...nothing I do seems to stick. The only change I've been able to make so far without getting an error:

  select trim(name) || '=' || trim(name) || '_2' || '_3'

turns my nvisit_2 into an nvisit_2_3 and so on, instead of actually creating the separate column nvisit_3.

I'm sure I'm making a syntax error but I'm not sure how to fix it. Thank you so much for taking a look! (PS - This code is almost identical to what I found at Collapse observations in BY-Group so values from duplicate observations have new names; all I did here was add some more datalines because the code was originally only written to collapse two observations.)

fwashburn · ‎09-11-2017

Unfortunately when I convert the file to a CSV format, it screws with a lot of the other data, so I haven't been able to do that successfully. Otherwise, that would be a great call!

fwashburn · ‎09-11-2017

You are correct; this is a huge unstructured data problem. I've been wracking my brain trying to figure out if there's any code that would force the values into uniformity but thinking that it may not be possible. I'll try out your code and report back; thanks for your response!

fwashburn · ‎09-11-2017

Howdy folks! This has been puzzling me for quite some time now... I'm working with some very dirty data that came to me in the form of 41 Excel spreadsheets. One of the issues with this data is that some of the date values look like this: That 9/26/201310/7/13 looked like this in the original data file: So what I'm trying to figure out is - is there a way to remove everything in a variable value EXCEPT for the most recent date? That way all my single date values will remain the same, but all my multiple date values will only keep the most current and valid date. Thanks so much for your time! I've tried searching for solutions to this but I think I'm just not using the right keywords.

fwashburn · ‎09-07-2017

You're wonderful! Thank you! I'll slow down and read more carefully next time.

fwashburn · ‎09-07-2017

Hi Reeza, thanks so much for your input! Here's what I see in the COMPRESS documentation:

a or A adds alphabetic characters to the list of characters.
c or C adds control characters to the list of characters.
d or D adds digits to the list of characters.
f or F adds the underscore character and English letters to the list of characters.
g or G adds graphic characters to the list of characters.
h or H adds a horizontal tab to the list of characters.
i or I ignores the case of the characters to be kept or removed.
k or K keeps the characters in the list instead of removing them.
l or L adds lowercase letters to the list of characters.
n or N adds digits, the underscore character, and English letters to the list of characters.
o or O processes the second and third arguments once rather than every time the COMPRESS function is called. Using the O modifier in the DATA step (excluding WHERE clauses), or in the SQL procedure, can make COMPRESS run much faster when you call it in a loop where the second and third arguments do not change.
p or P adds punctuation marks to the list of characters.
s or S adds space characters (blank, horizontal tab, vertical tab, carriage return, line feed, and form feed) to the list of characters.
t or T trims trailing blanks from the first and second arguments.
u or U adds uppercase letters to the list of characters.
w or W adds printable characters to the list of characters.
x or X adds hexadecimal characters to the list of characters.

So would I do something like this?

DATA MERGED.MERGECLN;
  SET MERGED.MERGECLN;
  Q_HASSSACARD_HFH = COMPRESS(Q_HASSSACARD_HFH,H);
RUN;

Or is there more to it? I'm a total newbie to the COMPRESS function.

fwashburn · ‎09-07-2017

Howdy folks! Long time lurker; first time poster - let me know if I gave you enough info below on this issue. I've read 41 Excel files into SAS (some xls, some xlsx), reformatted them, concatenated them into one SAS datafile, and am now trying to recode some of the variables. Alas, it seems that some variables were read in as multi-line data (that is, someone used Alt+Enter in Excel when entering data). So, "YES (RECEIPT)" and "YES (RECEIPT)" look exactly the same, but when I run this code:

DATA MERGED.MERGETEST;
  SET MERGED.MERGECLN;
  IF Q_HASSSACARD_HFH = 'YES (RECEIPT)' THEN RE_Q_HASSSACARD_HFH = 'YES';
RUN;

It only reformats some of the "YES (RECEIPT)" values, and leaves two of them untouched. I went back to the original Excel files and confirmed that these 2 leftover values were in fact "YES (ALT+ENTER) (RECEIPT)" values. I've tried:

STRIPping the variable

DATA MERGED.MERGECLN;
  SET MERGED.MERGECLN;
  Q_HASSSACARD_HFH = STRIP(Q_HASSSACARD_HFH);
RUN;

and COMPRESSing the variable

DATA MERGED.MERGECLN;
  SET MERGED.MERGECLN;
  Q_HASSSACARD_HFH = COMPRESS(Q_HASSSACARD_HFH);
RUN;

...to no avail. I still have those leftover "YES (RECEIPT)" variable values. What am I missing here? Thank you so much for taking a look.
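A likely explanation is that STRIP and the one-argument form of COMPRESS only deal with blanks, while Alt+Enter in Excel typically stores a line feed ('0A'x), sometimes alongside a carriage return ('0D'x). The sketch below assumes that is what is hiding in the two leftover values; it swaps those characters for blanks with TRANSLATE and collapses repeated blanks with COMPBL before comparing. The helper variable _clean is made up for this example.

```sas
DATA MERGED.MERGETEST;
  SET MERGED.MERGECLN;
  LENGTH RE_Q_HASSSACARD_HFH $3;
  /* Assumption: the embedded break is a line feed and/or carriage return. */
  /* TRANSLATE(source, to, from) maps CR and LF to blanks;                  */
  /* COMPBL then collapses runs of blanks to a single blank.                */
  _clean = COMPBL(TRANSLATE(Q_HASSSACARD_HFH, '  ', '0D0A'x));
  IF STRIP(_clean) = 'YES (RECEIPT)' THEN RE_Q_HASSSACARD_HFH = 'YES';
  DROP _clean;
RUN;
```

Replacing the break with a blank, rather than deleting it, keeps "YES" and "(RECEIPT)" from running together when there was no space on either side of the line break. Printing the value with the $HEX. format is one way to confirm which characters are actually present before relying on this.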

Online Status	Offline
Date Last Visited	‎02-13-2020 08:22 PM

Collapse multiple rows of data into a single row within a group

Re: Collapse multiple rows of data into a single row within a group

Re: Collapse multiple rows of data into a single row within a group

Re: Collapse multiple rows of data into a single row within a group

Collapse multiple rows of data into a single row within a group

Re: Collapse Up to Seven Observations by ID

Collapse Up to Seven Observations by ID

Re: How to remove extraneous dates from variables with multiple dates ...

Re: How to remove extraneous dates from variables with multiple dates ...

How to remove extraneous dates from variables with multiple dates list...

Re: Removing Multi-Line Breaks (ALT+ENTER) in Variables from Imported ...

Re: Removing Multi-Line Breaks (ALT+ENTER) in Variables from Imported ...

Removing Multi-Line Breaks (ALT+ENTER) in Variables from Imported .xls...