07-19-2018 12:47 PM - edited 07-20-2018 03:32 PM
Hi, this is just a continuation from the last few questions I asked since the topic evolved/changed to this. I was wondering if there was a way to read data right to left?
07-19-2018 12:55 PM
If you read each line in as a very long text string, then
gives you the country name. The -1 in the scan function tells SAS to read the first word from the right end of the text string. So if you use a -2, it gives you the STATE and ZIP.
07-19-2018 01:14 PM
This is awesome, however how would I read the line as a very long text string? Do I have to give it some command?
data want; infile "\\server\path\myfile.csv" truncover; input textstr $200.; run;
07-19-2018 01:21 PM
Why taking extra pain reading the data from right to left. Just read it as it is and then re-order the variables how you like. You can use dictionary tables to find the order of the variables in the dataset and can re-order then by simply using RETAIN.
options validvarname=v7; /* Converts variable as SAS naming convention */ proc import datafile='/sample/raw_data.csv' out=raw_data dbms=CSV replace; run; proc sql; select name into:name separated by " " from dictionary.columns where libname="WORK" and memname="RAW_DATA" order by varnum desc; /* Order how variables exist in dataset */ quit; data raw_data; retain &name; set raw_data; run;
07-19-2018 01:24 PM
That's a good point.
Why read right to left, when reading as a csv file ought to get you proper separation of the fields without going right to left or left to right.
07-19-2018 01:33 PM
07-19-2018 03:22 PM - edited 07-19-2018 03:25 PM
Bad data always exists everywhere. How you handle is what it matters. If your data is not properly delimited by source then it might cause issues downstream. Suggest your source system to change from there end to send data with proper delimiters.
You only can just give a suggestion and end of the day you still need to struggle with this kind of bad data. What percent of your data is having issue, if less then read the delimited data as ease and then deal with the records having issue. Looking at your data make sense why your trying to read it from right to left. You can extract State/Zip and Country and only need to struggle with City. You have a good news! SASHELP.ZIPCODE dataset can get you city using the zipcode you extracted and then remove the City from street.
infile datalines dlm=',' dsd missover;
input Street :$50. City :$20. State_ZIP :$20. Country :$10.;
288 York Street,New Haven,CT 06511,USA
301-399 South Boulevard Drive,Bainbridge,GA 39819,USA
150-151 Tremont Street,Boston,MA 02111,USA
2395 Ingleside Avenue,Macon,GA 31204,USA
1007 Merchant Street,Ambridge,PA 15003,USA
859 Washington Avenue,Miami Beach,FL 33139,USA
974 Great Plain Avenue Needham MA 02492,USA, ,
139 Lynnfield Street,Peabody,MA 01960,USA
180 Nassau Street Princeton,NJ 08542,USA,
563 Carlsbad Village Drive,Carlsbad,CA 92008,USA
run; data test ; set raw_data; array vars _All_; do over vars;
/* If any of the variable is missing */ if missing(vars) then str=catx(" ", of _character_); end; if not missing(str) then do; /* Extract Country and Remove from Street if present */ Country=SCAN(str,-1); Street=TRANSTRN(Street,strip(SCAN(str,-1)),""); /* Extract State_ZIP and Remove from Street if present */ State_ZIP=CATX(" ",SCAN(str,-3),SCAN(str,-2)); Street=TRANSTRN(Street,strip(CATX(" ",SCAN(str,-3),SCAN(str,-2))),""); /* Extract Zipcode from the string */ Zip_Code=INPUT(SCAN(str,-2),5.); end; format Zip_Code z5.; run; /* Get the City for the records where data is corrupted only */ proc sql; create table update_City_Missing as select a.*,b.CITY as City_ from test a left join sashelp.zipcode b on (a.Zip_Code=b.zip) ; quit; /* Find and replace City */ data Want(drop=str Zip_Code City_); set update_City_Missing; if Not missing(City_) and find(Street,strip(City_),'i') then do; City=City_; Street=TRANSTRN(Street,strip(City_),""); end; run;
Hope this helps!
07-19-2018 02:09 PM
Please stop creating new threads on this topic.
You are getting pieces mixed up, referencing partial code from multiple answers related to parts of the problem,
NOT providing actual raw text of the input file (or at least not indicating if what you show is the only structure of the file), none of the code that you have used to read the file to begin with
and I think in a least a few places using only part of the code suggested for single bits of the problem.
INPUT is not going to help because the options to control reading things on an input statement get extremely complicated when used with delimited data as then the user has to write an entire parse of garbage.
Several bits have been provided to use the data that you have read to pull the pieces you say you need. But other than a very generic "I've written this so far but it seems to be incorrect" you do not indicate what manner it is incorrect.
Data test; input street city state_zip country; datalines; 288 York Street,New Haven,CT 06511,USA 301-399 South Boulevard Drive,Bainbridge,GA 39819,USA 150-151 Tremont Street,Boston,MA 02111,USA ; country=scan(textstr,-1,','); state_zip=scan(txtstr, -2, ','); city=scan(txtstr,-1,','); run;
Which besides attempting read character values in numeric variables
the multiple error lines similar:
20 country=scan(textstr,-1,','); ------- 180 ERROR 180-322: Statement is not valid or it is used out of proper order.
because the DATALINES must be the last part of the code. no programming statements are allowed after the end of the datalines.
In the past you have implied that proc import read the data. Was that actually true? Then use that resulting data set.
And use program code to select the values you say you need.
07-19-2018 02:18 PM
07-19-2018 03:30 PM
@Ramsha There's several problems caused by creating new threads, besides irritating others:
1. Only a partial view of the problem. Although it does make sense to break a bigger problem down to smaller steps but unfortunately you've left out key details each time.
2. Continuity, solutions posted in other threads are ignored and we're not sure which we should respond to anymore, especially when you don't mark the previous ones solved. Someone searching may get all of these as results and no longer be able to follow the thread or solution either.
3. Work shown is implied/stated to be yours, when its actually a solution someone else proposed which can be rude. Don't know if you've done this, but it happens.
All of the above make it difficult to help you and the probability of getting an answer decreases.
Back to your original question, use SCAN() to read the last three items but there's no guarantee of correctness beyond that. If you're only interested in those fields you'll be fine. If you need to read the full data in correctly, this won't be the full solution. I'm going to bet (and I'd put money on it) that you'll find other issues you need to code around as well. However this can get you started. For your current example you could go one step further and try and get the City as well, but that won't work if the city is two names which will happen.
See the Miami Beach line for example and I added that as well to show the issue if you try that.
data have; length address $60.; informat address $60.; input address $60.; cards; 288 York Street,New Haven,CT 06511,USA 301-399 South Boulevard Drive,Bainbridge,GA 39819,USA 150-151 Tremont Street,Boston,MA 02111,USA 2395 Ingleside Avenue,Macon,GA 31204,USA 1007 Merchant Street,Ambridge,PA 15003,USA 859 Washington Avenue,Miami Beach,FL 33139,USA 974 Great Plain Avenue Needham MA 02492,USA, , 974 Great Plain Avenue Miami Beach MA 02492,USA, , 139 Lynnfield Street,Peabody,MA 01960,USA 180 Nassau Street Princeton,NJ 08542,USA, 563 Carlsbad Village Drive,Carlsbad,CA 92008,USA ;;;; run; proc print data=have;run; data want; set have; Country = scan(address, -1, ", "); ZIP = scan(address, -2, ", "); State = scan(address, -3, ", "); *find the last location of the state acronym, search from end otherwise you may find incorrect vlaues; x= index(address, trim(state)); line1=substr(address, 1, x-1); run;