Read data right to left with varying length?

Reply
Contributor
Posts: 32

Read data right to left with varying length?

[ Edited ]

Hi, this is just a continuation from the last few questions I asked since the topic evolved/changed to this. I was wondering if there was a way to read data right to left?

Respected Advisor
Posts: 3,294

Re: Read data right to left with varying length?

If you read each line in as a very long text string, then

 

country=scan(textstr,-1,',');

 

gives you the country name. The -1 in the scan function tells SAS to read the first word from the right end of the text string. So if you use a -2, it gives you the STATE and ZIP.

--
Paige Miller
Contributor
Posts: 32

Re: Read data right to left with varying length?

Posted in reply to PaigeMiller

This is awesome, however how would I read the line as a very long text string? Do I have to give it some command?

Respected Advisor
Posts: 3,294

Re: Read data right to left with varying length?


@Ramsha wrote:

This is awesome, however how would I read the line as a very long text string? Do I have to give it some command?


 

 

data want;
    infile "\\server\path\myfile.csv" truncover;
    input textstr $200.;
run;

 

--
Paige Miller
Contributor
Posts: 32

Re: Read data right to left with varying length?

[ Edited ]
Posted in reply to PaigeMiller
 
PROC Star
Posts: 635

Re: Read data right to left with varying length?

Why taking extra pain reading the data from right to left. Just read it as it is and then re-order the variables how you like. You can use dictionary tables to find the order of the variables in the dataset and can re-order then by simply using RETAIN.

options validvarname=v7; /* Converts variable as SAS naming convention */
proc import datafile='/sample/raw_data.csv'
			out=raw_data
			dbms=CSV
			replace;
run;

proc sql;
select name into:name separated by " "
from dictionary.columns
where libname="WORK" and memname="RAW_DATA"
order by varnum desc; /* Order how variables exist in dataset */
quit;

data raw_data;
retain &name;
set raw_data;
run;
Thanks,
Suryakiran
Respected Advisor
Posts: 3,294

Re: Read data right to left with varying length?

Posted in reply to SuryaKiran

That's a good point.

 

Why read right to left, when reading as a csv file ought to get you proper separation of the fields without going right to left or left to right.

--
Paige Miller
Contributor
Posts: 32

Re: Read data right to left with varying length?

[ Edited ]
Posted in reply to PaigeMiller


City=scan(txtstr,-1,',');

run;

proc print data=test;
run;

Super User
Posts: 24,027

Re: Read data right to left with varying length?

Include the records you're having issues with, specifically the 4 from the bottom in the previous post. This data can be read in many ways correctly, but you're still not solving your actual issue.
Contributor
Posts: 32

Re: Read data right to left with varying length?

[ Edited ]
 
PROC Star
Posts: 635

Re: Read data right to left with varying length?

[ Edited ]

Bad data always exists everywhere. How you handle is what it matters. If your data is not properly delimited by source then it might cause issues downstream. Suggest your source system to change from there end to send data with proper delimiters. 

 

You only can just give a suggestion and end of the day you still need to struggle with this kind of bad data. What percent of your data is having issue, if less then read the delimited data as ease and then deal with the records having issue. Looking at your data make sense why your trying to read it from right to left. You can extract State/Zip and Country and only need to struggle with City. You have a good news! SASHELP.ZIPCODE dataset can get you city using the zipcode you extracted and then remove the City from street.

 

data raw_data;
infile datalines dlm=',' dsd missover;
input Street :$50. City :$20. State_ZIP :$20. Country :$10.;
datalines;
288 York Street,New Haven,CT 06511,USA
301-399 South Boulevard Drive,Bainbridge,GA 39819,USA
150-151 Tremont Street,Boston,MA 02111,USA
2395 Ingleside Avenue,Macon,GA 31204,USA
1007 Merchant Street,Ambridge,PA 15003,USA
859 Washington Avenue,Miami Beach,FL 33139,USA
974 Great Plain Avenue Needham MA 02492,USA, ,
139 Lynnfield Street,Peabody,MA 01960,USA
180 Nassau Street Princeton,NJ 08542,USA,
563 Carlsbad Village Drive,Carlsbad,CA 92008,USA
;
run; data test ; set raw_data; array vars _All_; do over vars;
/* If any of the variable is missing */ if missing(vars) then str=catx(" ", of _character_); end; if not missing(str) then do; /* Extract Country and Remove from Street if present */ Country=SCAN(str,-1); Street=TRANSTRN(Street,strip(SCAN(str,-1)),""); /* Extract State_ZIP and Remove from Street if present */ State_ZIP=CATX(" ",SCAN(str,-3),SCAN(str,-2)); Street=TRANSTRN(Street,strip(CATX(" ",SCAN(str,-3),SCAN(str,-2))),""); /* Extract Zipcode from the string */ Zip_Code=INPUT(SCAN(str,-2),5.); end; format Zip_Code z5.; run; /* Get the City for the records where data is corrupted only */ proc sql; create table update_City_Missing as select a.*,b.CITY as City_ from test a left join sashelp.zipcode b on (a.Zip_Code=b.zip) ; quit; /* Find and replace City */ data Want(drop=str Zip_Code City_); set update_City_Missing; if Not missing(City_) and find(Street,strip(City_),'i') then do; City=City_; Street=TRANSTRN(Street,strip(City_),""); end; run;

 

Hope this helps!

Thanks,
Suryakiran
Super User
Posts: 13,950

Re: Read data right to left with varying length?

Please stop creating new threads on this topic.

 

 

You are getting pieces mixed up, referencing partial code from multiple answers related to parts of the problem,

NOT providing actual raw text of the input file (or at least not indicating if what you show is the only structure of the file), none of the code that you have used to read the file to begin with

and I think in a least a few places using only part of the code suggested for single bits of the problem.

 

INPUT is not going to help because the options to control reading things on an input statement get extremely complicated when used with delimited data as then the user has to write an entire parse of garbage.

 

Several bits have been provided to use the data that you have read to pull the pieces you say you need. But other than a very generic "I've written this so far but it seems to be incorrect" you do not indicate what manner it is incorrect.
Example with:

Data test;

input street city state_zip country;
datalines;

288 York Street,New Haven,CT 06511,USA
301-399 South Boulevard Drive,Bainbridge,GA 39819,USA
150-151 Tremont Street,Boston,MA 02111,USA
;
country=scan(textstr,-1,',');
state_zip=scan(txtstr, -2, ',');
city=scan(txtstr,-1,',');
run;

Which besides attempting read character values in numeric variables

 

the multiple error lines similar:

20   country=scan(textstr,-1,',');
     -------
     180

ERROR 180-322: Statement is not valid or it is used out of proper order.


  

because the DATALINES must be the last part of the code. no programming statements are allowed after the end of the datalines.

 

 

In the past you have implied that  proc import read the data. Was that actually true? Then use that resulting data set.

And use program code to select the values you say you need.

Contributor
Posts: 32

Re: Read data right to left with varying length?

Hey man- chill, its all good- I'm new to this and just learning. I make new threads because the main idea of the topics change and I know when people google, they just want a reference.
Super User
Posts: 24,027

Re: Read data right to left with varying length?

@Ramsha There's several problems caused by creating new threads, besides irritating others:

1. Only a partial view of the problem. Although it does make sense to break a bigger problem down to smaller steps but unfortunately you've left out key details each time. 

2. Continuity, solutions posted in other threads are ignored and we're not sure which we should respond to anymore, especially when you don't mark the previous ones solved. Someone searching may get all of these as results and no longer be able to follow the thread or solution either. 

3. Work shown is implied/stated to be yours, when its actually a solution someone else proposed which can be rude. Don't know if you've done this, but it happens. 

 

All of the above make it difficult to help you and the probability of getting an answer decreases. 

 

Back to your original question, use SCAN() to read the last three items but there's no guarantee of correctness beyond that. If you're only interested in those fields you'll be fine. If you need to read the full data in correctly, this won't be the full solution. I'm going to bet (and I'd put money on it) that you'll find other issues you need to code around as well. However this can get you started. For your current example you could go one step further and try and get the City as well, but that won't work if the city is two names which will happen. 

See the Miami Beach line for example and I added that as well to show the issue if you try that. 

 

 

Good Luck. 

 

data have;
length address $60.;
informat address $60.;
input address $60.;
cards;
288 York Street,New Haven,CT 06511,USA
301-399 South Boulevard Drive,Bainbridge,GA 39819,USA
150-151 Tremont Street,Boston,MA 02111,USA
2395 Ingleside Avenue,Macon,GA 31204,USA
1007 Merchant Street,Ambridge,PA 15003,USA
859 Washington Avenue,Miami Beach,FL 33139,USA
974 Great Plain Avenue Needham MA 02492,USA, ,
974 Great Plain Avenue Miami Beach MA 02492,USA, ,
139 Lynnfield Street,Peabody,MA 01960,USA
180 Nassau Street Princeton,NJ 08542,USA,
563 Carlsbad Village Drive,Carlsbad,CA 92008,USA
;;;;
run;

proc print data=have;run;

data want;
set have;
Country = scan(address, -1, ", ");
ZIP = scan(address, -2, ", ");
State = scan(address, -3, ", ");


*find the last location of the state acronym,
search from end otherwise you may find incorrect vlaues;

x= index(address, trim(state));
line1=substr(address, 1, x-1);

run;
Ask a Question
Discussion stats
  • 13 replies
  • 133 views
  • 1 like
  • 5 in conversation