About UniversitySas

UniversitySas · ‎11-14-2018

So my long data looks like this:

data source;
  INFILE DATALINES dlm='#';
  length ID $ 3 VAR $ 32 CLASSIFICATION $ 32;
  input ID VAR CLASSIFICATION;
  DATALINES;
201 # PepsiCo # Name
201 # 111 main street # AddressLine1
201 # Philadelphia # City
201 # PA # State
201 # 21491 # Zipcode
201 # None # Relationship
201 # Charity # Status
201 # Help # Purpose
201 # 9001 # DollarAmount
201 # CocaCola # Name
201 # 245 Cork street # AddressLine1
201 # Floor 43 # AddressLine2
201 # Philadelphia # City
201 # PA # State
201 # 21492 # Zipcode
201 # None # Relationship
201 # Charity # Status
201 # Build Factory # Purpose
201 # 4100 # CurrentContribution
201 # 13101 # TotalAmount
331 # Adidas # Name
331 # 115 walnut avenue # AddressLine1
331 # New York # City
331 # NY # State
331 # 255191 # Zipcode
331 # Charity # Status
331 # Help # Purpose
331 # 143 # FutureContribution
331 #143 # TotalAmount
334 # Nike # Name
334 # 123 Stevens Road # ForeignAddressLine1
334 # apartment 4D # ForeignAddressLine2
334 # Denpasar # City
334 # Bali # State
334 # 2512 # Zipcode
334 # N/A # Relation
334 # Help # Purpose
334 # 2000 # CurrentContribution
334 # 2000 # TotalAmount
;
RUN;

The output I am after is this:

The basic structure of the data is this: IDs appear multiple times, and they have names associated with them. The last time the ID appears, there is an additional entry called "total" which sums up the entire amount for "contribution" for that ID. Then the new ID starts.

Every "Name" has:

An "Address line 1". Sometimes it contains the string "Foreign", in which case I want the column that says "Foreign" to say "Y", and "N" otherwise.

There is sometimes also an "Address line 2", but NOT always. Regardless, I want to force an "Address line 2" into the data.

There is always a "city", "state" and "postcode".

Sometimes "Relation" is not there, so I want to force it there. It should always come 1 row after "postcode", and 1 row before "status".

Sometimes "status" is not there. It should always come two rows after "postcode" in the long data file. Needs to be forced too.

Sometimes "Purpose" is not there. It always comes three rows after "postcode" in the long data file. Needs to be forced too.

"Contribution Amount" is always there, but it contains either the string "current" or "future". If it says "Future" I want the "Future" column to say "Y", if not then "N".

For rows that are sometimes not there and have to be forced in, I want the corresponding value to be a ".".

I actually made a similar post earlier, but the data at the time was IMPOSSIBLE to work with. By some stroke of luck, the raw data I received is significantly improved in that it actually has the row names (even though they're not always listed). So I don't need to guess the structure of the data anymore. Any help is seriously appreciated. Thanks!
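The two Y/N flags described above can be derived from the CLASSIFICATION text before any transposing. The following is only a minimal sketch, assuming the SOURCE data set from the post; the output name FLAGGED and the variables Foreign and Future are hypothetical names chosen for illustration. It does not force in the missing rows.

```sas
/* Sketch only: derive the Foreign and Future flags from CLASSIFICATION.       */
/* Assumes the SOURCE data set from the post; FLAGGED, Foreign and Future are  */
/* illustrative names, not part of the original data.                          */
data flagged;
  set source;
  length Foreign Future $ 1;

  /* FIND returns 0 when the substring is absent; the 'i' modifier ignores case */
  if find(CLASSIFICATION, 'AddressLine', 'i') > 0 then
    Foreign = ifc(find(CLASSIFICATION, 'Foreign', 'i') > 0, 'Y', 'N');

  if find(CLASSIFICATION, 'Contribution', 'i') > 0 then
    Future = ifc(find(CLASSIFICATION, 'Future', 'i') > 0, 'Y', 'N');
run;
```

Rows that are neither address lines nor contribution rows keep a blank flag, so the flags can be carried along when the data is later reshaped to wide form.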

UniversitySas · ‎11-14-2018

Worked perfectly, thank you so much

UniversitySas · ‎11-14-2018

Hi Chris, thanks so much for your reply. I am not currently at my work station but will be sure to update this post once I can try your code out. I'm not sure if you remember me, but you have helped me in many of my posts. Do you have any resources or suggestions on how to improve specifically at data cleaning in SAS? I can deal with packages like VBA Excel, Stata and MATLAB (the latter two for analysis only) just fine, but SAS seems so incredibly unintuitive and complicated for the most trivial tasks (e.g., removing substrings). The help files/guides also read as jargon most of the time. Sorry if this is considered off-topic.

UniversitySas · ‎11-13-2018

Hello, I have the following data:

data source_unclean;
  INFILE DATALINES ;
  input ID VAR & $19. CLASSIFICATION $ ;
  DATALINES;
201 PepsiCo  returndata__irs990pf__supplementaryinformation__grantorcontriapprovedforfuture__recipientbusinessname__businessnameline1
201 111 main street  returndata__irs990pf__supplementaryinformation__grantorcontriapprovedforfuture__recipientusaddress__addressline1
201 Charity  returndata__irs990pf__supplementaryinformation__grantorcontriapprovedforfuture__recipientusaddress__classification
331 CocaCola  returndata__irs990pf__supplementaryinformationgrp__grantorcontriapprovedforfuture__recipientbusinessname__businessnameline1
331 1823 unicorn street  returndata__irs990pf__supplementaryinformationgrp__grantorcontriapprovedforfuture__recipientusaddress__addressline1
331 Charity  returndata__irs990pf__supplementaryinformationgrp__grantorcontriapprovedforfuture__recipientusaddress__classification
RUN;

I want to clean this so it appears only as:

data source_clean;
  INFILE DATALINES ;
  input ID VAR & $19. CLASSIFICATION $ ;
  DATALINES;
201 PepsiCo  grantorcontriapprovedforfuture__recipientbusinessname__businessnameline1
201 111 main street  grantorcontriapprovedforfuture__recipientusaddress__addressline1
201 Charity  grantorcontriapprovedforfuture__recipientusaddress__classification
331 CocaCola  grantorcontriapprovedforfuture__recipientbusinessname__businessnameline1
331 1823 unicorn street  grantorcontriapprovedforfuture__recipientusaddress__addressline1
331 Charity  grantorcontriapprovedforfuture__recipientusaddress__classification
RUN;

So that the block of text "returndata__irs990pf__supplementaryinformation__" or "returndata__irs990pf__supplementaryinformationgrp__" is omitted. How would I go about doing this? Any help is appreciated!

Edit: I'm not sure what I am doing incorrectly for the sample code above, but it does not display the input correctly. There should only be three columns, and I want to go from this: to this:
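One way to drop the leading block is a regular-expression substitution with PRXCHANGE. This is a minimal sketch rather than the answer given in the thread: the data set name DEMO and the variable CLEAN are hypothetical, and the pattern assumes the prefix always starts the value and differs only by the optional "grp".

```sas
/* Sketch only: strip the "returndata__irs990pf__supplementaryinformation[grp]__" */
/* prefix. DEMO and CLEAN are illustrative names.                                */
data demo;
  length CLASSIFICATION CLEAN $ 200;
  CLASSIFICATION = 'returndata__irs990pf__supplementaryinformationgrp__grantorcontriapprovedforfuture__recipientusaddress__addressline1';

  /* (grp)? makes the "grp" suffix optional; the 1 means replace at most once */
  CLEAN = prxchange('s/^returndata__irs990pf__supplementaryinformation(grp)?__//',
                    1, strip(CLASSIFICATION));
  put CLEAN=;
run;
```

In a real data set the same PRXCHANGE call would be applied to CLASSIFICATION after a SET statement, provided the variable is long enough to hold the full string in the first place.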

UniversitySas · ‎10-19-2018

That's correct. My input code in the original post should have the column "Varname" instead of "VAR". Sorry for the mixup!

UniversitySas · ‎10-17-2018

Sorry - don't know why it didn't copy properly before - edited it now. Edit: Never mind, got it: DATA string; SET source; WHERE varname contains 'street'; RUN;

UniversitySas · ‎10-17-2018

Hello, I am trying to keep only the records that contain the string "street" or "corn", regardless of position, upper/lowercase or anything else. How would I go about doing this? For example:

data source;
  INFILE DATALINES ;
  input ID VAR & $19. CLASSIFICATION $ ;
  DATALINES;
201 PepsiCo  XYZ
201 111 main street  XYZ
201 Charity  XYZ
331 CocaCola  XYZ
331 1823 unicorn street  XYZ
331 Charity  XYZ
331 Nike  XYZ
331 123 brock avenue  XYZ
;
RUN;

For instance, I tried:

DATA substrings;
  SET source;
  IF indexw(var,'street');
RUN;

but it creates an empty data set. Any help is much appreciated, cheers!
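A case-insensitive substring test can be written with the FIND function and its 'i' modifier. This is a minimal sketch, assuming the SOURCE data set above with the text in VAR. Note that INDEXW looks for whole words, so it would not match "corn" inside "unicorn" in any case.

```sas
/* Sketch only: keep rows whose VAR contains "street" or "corn" anywhere, */
/* ignoring case. Assumes the SOURCE data set from the post.              */
data substrings;
  set source;
  /* subsetting IF: FIND returns the match position, or 0 if there is none */
  if find(VAR, 'street', 'i') > 0 or find(VAR, 'corn', 'i') > 0;
run;
```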

UniversitySas · ‎10-16-2018

This worked perfectly - thanks so much. Also, I'm curious about your previous code: input ID UNCLEANVARIABLE2 $25. @1 @@; What is the purpose of the "@1" and "@@" here? I tried to read up on the definitions, but they didn't really make sense. Thanks

UniversitySas · ‎09-29-2018

Thanks for that, I seem to be getting an error though. Would you mind telling me what I'm doing wrong? I've used this code:

data EXAMPLE1;
  SET test_code;
  CATEGORY_NO+1;
  if CATEGORY_NO=10 then CATEGORY_NO=1;
  select(CATEGORY_NO);
    %* Grantor ;
    when(1 ) ;
    %* Street ;
    when(2 ) if countw(UNCLEANVARIABLE)=3 then do;
      input ID UNCLEANVARIABLE2 $25. @1 @@;
      if UNCLEANVARIABLE2 in('Floor','Road') then do;
        CLEANVARIABLE=catx(' ',UNCLEANVARIABLE,UNCLEANVARIABLE2);
        input;
      end;
    end;
    %* City ;
    when(3 ) ;
    %* State ;
    when(4 ) ;
    %* Postcode ;
    when(5 ) ;
    %* Relationship;
    when(6 ) if upcase(UNCLEANVARIABLE) ne 'NONE' then do;
      CLEANVARIABLE='N/A';
      output;
      CATEGORY_NO+1;
      CLEANVARIABLE=UNCLEANVARIABLE;
    end;
    otherwise;
  end;
  CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
  output;
RUN;

Where SET test_code is just the full data set of the sample I posted here. The error I am getting is:

ERROR: No DATALINES or INFILE statement.

Also, how come in your code you have the variables grantor, street, city etc., but they do not appear in the final output in a separate column? If I wanted to add the additional Status, Purpose, and Contribution, would that just be by appending the above code with:

    %* Status;
    when(7 ) ;
    %* Relationship ;
    when(8 ) ;
    %* Contribution;
    when(9 ) ;
  CLEANVARIABLE=coalescec(CLEANVARIABLE,Var);
  output;

Thanks!

UniversitySas · ‎09-27-2018

In total I have 21 IDs to process, and of the 21, around 16 have constant errors in them. In terms of total observations, there are over 4 million. The majority of the errors are the missing "relationship", I'd say about 80% of them. Of the remaining 20%, 19% are missing "status", and the final 1% are miscellaneous spillovers such as "Floor" or "Street" being read into the "city". That's still 40,000 records to deal with, but I'm sure once the initial errors are dealt with, it'll be easier to identify some sort of pattern to deal with it.

UniversitySas · ‎09-27-2018

Hi Chris, thanks so much for your response. Unfortunately I am indeed stuck with the data, and have accepted that a lot of manual scrubbing is going to be needed. I am still pretty new to SAS and coding, so I really appreciate your response. Could I just clarify how I would amend the code you've written so that I wouldn't need to input the raw data myself, and could just use an already imported data set? In addition, I would like to have labelled rows, like the following:

LENGTH Var_Name $ 12;
IF x=1 THEN Var_Name = "Grantor";
ELSE IF x=2 THEN Var_Name = "Street";
ELSE IF x=3 THEN Var_Name = "City";
ELSE IF x=4 THEN Var_Name = "State";
ELSE IF x=5 THEN Var_Name = "Postcode";
ELSE IF x=6 THEN Var_Name = "Relationship";
ELSE IF x=7 THEN Var_Name = "Status";
ELSE IF x=8 THEN Var_Name = "Purpose";
ELSE IF x=0 THEN Var_Name = "Contribution";

So that I can then transpose this data into a wide form. Thanks again!
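The labelling and the later transpose could be sketched as below. Everything here rests on assumptions not stated in the thread: CLEANED is a hypothetical, already repaired data set with exactly nine rows per record and the variables ID and CLEANVARIABLE, x is taken to be the row position modulo 9, and record_no is an illustrative counter added to keep records with the same ID apart.

```sas
/* Sketch only. CLEANED, LABELLED, WIDE, x and record_no are illustrative names; */
/* the input is assumed to hold exactly nine rows per record, in order.          */
data labelled;
  set cleaned;
  length Var_Name $ 12;          /* long enough for "Relationship" */
  x = mod(_n_, 9);               /* 1..8, then 0 for the ninth row */
  if x = 1 then record_no + 1;   /* new record starts at each Grantor row */
  select (x);
    when (1) Var_Name = 'Grantor';
    when (2) Var_Name = 'Street';
    when (3) Var_Name = 'City';
    when (4) Var_Name = 'State';
    when (5) Var_Name = 'Postcode';
    when (6) Var_Name = 'Relationship';
    when (7) Var_Name = 'Status';
    when (8) Var_Name = 'Purpose';
    when (0) Var_Name = 'Contribution';
    otherwise;
  end;
run;

/* One output row per record, one column per label */
proc transpose data=labelled out=wide(drop=_name_);
  by record_no ID;
  id Var_Name;
  var CLEANVARIABLE;
run;
```

Declaring the length of Var_Name up front matters: without it, the first assignment would fix the length at that of "Grantor" and the longer labels would be truncated.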

UniversitySas · ‎09-26-2018

Hello, Say I have the following dataset:

data example1;
  INFILE DATALINES ;
  input ID UncleanVariable $25. ;
  DATALINES;
1 Cyclone Limited
1 123 Center Street
1 Orlando
1 FL
1 12245
1 None
1 101(a)
1 Fund equipment
1 10000
1 Lagoon Corp
1 3814 Wakefield Ave
1 Oakland
1 CA
1 19406
1 KL21
1 Subsidise staff
1 200
2 Imagine Sports
2 4556 Sun Valley
2 Road
2 Raleigh
2 NC
2 21020
2 None
2 Airfares
2 14000
;
RUN;

Here, there are a few errors per ID. Each row should technically be defined as follows:

Row 1 = Grantor
Row 2 = Street
Row 3 = City
Row 4 = State
Row 5 = Postcode
Row 6 = Relationship
Row 7 = Status
Row 8 = Purpose
Row 9 = Contribution Amount

The output I am looking for is:

data solution1;
  INFILE DATALINES dsd;
  input ID CleanedVariable ~ $30. Category $25. ;
  DATALINES;
1,Cyclone Limited,Grantor
1,123 Center Street,Street
1,Orlando,City
1,FL,State
1,12245,Postcode
1,Parent company,Relationship
1,101(a),Status
1,Fund equipment,Purpose
1,10000, Contribution Amount
1,Lagoon Corp,Grantor
1,3814 Wakefield Ave,Street
1,Oakland,City
1,CA,State
1,19406,Postcode
1,N/A,Relationship
1,KL21,Status
1,Subsidise staff,Purpose
1,200,Contribution Amount
2,Imagine Sports,Grantor
2,4556 Sun Valley Road,Street
2,Raleigh,City
2,NC,State
2,21020,Postcode
2,Subsidiary,Relationship
2,Missing,Status
2,Airfares,Purpose
2,14000,Contribution Amount
;
RUN;

There are a few problems I want to address. I'm not sure if there is a one-size-fits-all solution, so it is okay if there isn't. Let me first visualise the problem with a few screenshots. This is the original data:

Error 1 - The value for "relationship" is missing in the input data, so the "status" row is read prematurely. Is there a way to adjust this so that every sixth row is either "none" or "None", and if not, insert the value "N/A" between rows 5 and 6 in the original set? My criterion is that the value should ALWAYS be "none" or "None", and if not, "N/A" is input.

Error 2 - Here the address row has spilled over to the city row. Except for manually correcting this, is there a way to fix this spillover in a big data set? The pattern I've seen is that usually it's words like "floor" that spill over from the address. Or if there are more than two spaces in the address line, it will spill over. So we have 4556[1st space]Sun[2nd space]Valley[3rd space]Road.

Error 3 - Arises because of error 2.

Error 4 - Assuming errors 1-3 are all addressed, there is a new error, very similar to error 1. Here the value for "Status" is missing, and should be replaced with "N/A" or "Missing" to indicate there was no value for this. The only criterion I can think of is that there should never be any spaces contained in the value of this row, but it can contain brackets () and alphanumeric values.

Ideally, my cleaned data should look like this:

So ideally, I would like to correct all the errors, but in order of importance: Errors 1, 4, 2. Thanks in advance for any help.

Edit: In the last screenshot, "solution1", it should read "Street" not "Street Address" - my mistake.
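The rule for Error 1 (if the sixth row of a record is not "None", insert "N/A" and treat the current row as the seventh) can be sketched with a retained position counter. This is a minimal sketch under a strong assumption: it only handles Error 1, so the address spillover of Error 2 must already be fixed, otherwise the count is off for that record. The names ERROR1_FIXED, pos and CleanedVariable are illustrative.

```sas
/* Sketch only: handles Error 1 alone. Assumes EXAMPLE1 from the post and that */
/* every record otherwise has its rows in the expected order.                   */
data error1_fixed;
  set example1;
  length CleanedVariable $ 25;

  pos + 1;                      /* sum statement: pos is retained across rows */
  if pos > 9 then pos = 1;      /* nine rows per record, then start over      */

  /* Row 6 must be "None"/"none"; otherwise Relationship is missing */
  if pos = 6 and upcase(strip(UncleanVariable)) ne 'NONE' then do;
    CleanedVariable = 'N/A';
    output;                     /* inserted Relationship row, pos = 6 */
    pos + 1;                    /* the current row is really Status   */
  end;

  CleanedVariable = UncleanVariable;
  output;
run;
```

Because pos is written to the output, it can afterwards be mapped to the row names (Grantor, Street, and so on) listed in the post.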

UniversitySas · ‎09-23-2018

So let's say I have the following fields:

DATA HAVE;
  INFILE DATALINES ;
  input ID NAME $ STREET $ CITY $ STATE $ POSTCODE $ RELATIONSHIP $ STATUS $ PURPOSE $ DONATION $;
  DATALINES;
201 AAA Market Philadelphia PA 4109 Parent Open Counselling 10000
201 ABC Chestnut Arlington TX 1093 None Open General 1500
201 BCD Walnut Walnut Sidney NY 3201 None Open General 1999
201 EFG Cross Kansas TX 1091 Parent Close Sports 1491
202 EFG Cluedo Street Phoenix AZ 2012 Close General 1900
;
RUN;

Which gives me the following output:

You can see there are three problems here:

1) The street "Walnut" has been imported twice, incorrectly shifting the values in the columns by one extra position.

2) The street "Cluedo Street" has been imported into two columns instead of just one, causing a similar problem to the one mentioned above.

3) There is an omission for "Relationship" in the final row. Where the incorrectly imported data should read "None", it is missing altogether, and reads "Close", so even in the absence of the first two errors, the "Relationship" column here would read "Close" instead of "None".

Let's suppose there are thousands of issues similar, but not identical, to the ones above, in a data set with millions of observations. They will be similar in the sense that it's usually a random or repeated omission for some finite number of fields, OR duplicate values have been entered, or values have spanned more columns than they should have. Assuming the exact same column names as above, is it plausible that one could create some kind of criteria or program that could reasonably adjust most of these issues? Or do I simply need to request a cleaner data set?

UniversitySas · ‎09-21-2018

Is the next set of data going to be missing the row 2 address line or the row 1 name line and just have the row 3? I'm hoping this isn't the case! The data I'm working with isn't as predictable as I was expecting, unfortunately. Although, I think with Datasp's response I should be able to come up with some creative solution to handle any other unforeseen irregularities (fingers crossed). Thanks again for all your help guys, I really appreciate it! I'll try out the latest solution.

UniversitySas · ‎09-21-2018

Thanks for this - apologies about the mix-up. The data I am working with is actually huge, it wasn't until I implemented the first solution that I realised this issue persisted in the data. I'll give this a try; and at the same time educate myself about loops! They look a fair bit trickier in SAS than other programs I've worked with haha. Cheers.

Online Status	Offline
Date Last Visited	‎10-16-2020 10:11 PM
