12-19-2012 04:31 PM
Please find attached the csv file with the data I am trying to clean. My data contains the brand of a camera, its model #, its price (price1 or price2-- whichever column is populated; both will never be populated at the same time and refers to the same camera) and discamt: how much the camera is discounted. Then I use an If statement to set the value of the variable to 0 if it is missing so that I can add them together later. Since I am new to SAS, if there is a more efficient way to do this pls let me know. For some reason, when I run this program, it generates "invalid data for discamt" in the log but the resulting dataset is fine so I have ignored this.
data bbnov21zoom check_bbnov21zoom;
length brand $ 80 model $ 25;
infile "G:\irobot2\irobot-eval\data\nov21data\bb_nov21\zoombbnv21.csv" firstobs=2 dlm="|" dsd;
input brand model price1 price2 discamt;
if price1="" then price1=0;
if price2="" then price2=0;
if discamt="" then discamt=0;
if index(model, 'B0000') then output check_bbnov21zoom;
else output bbnov21zoom;
Now my question is: how do I parse this dataset using the PRX Perl regex function within SAS two times:
1) once to extract the brandname of the camera which is a substring from the variable "brand"? For eg, I just need the manufacturer's name which is Sony or Canon or Panasonic, etc. I have written this partial code below but I don't know where to insert it within the datasteps or how to make it work
if _n_=1 then
if pos>0 then
keep brand regprice model manuf;
2) From the "model" variable I need to extract the model# only which is the combination of letters followed by digits, for eg, A4000 (in A4000ISBLU) or VR-310 (in VR-310 BLACK). The code below which is copied from another program does this correctly but I don't know where to insert this code and how to make it work within this program.
Your help is so greatly appreciated. Thanks and hope to hear from you soon.
12-19-2012 05:25 PM
If your data continues to have the brand as the first item in the name and is always one word you don't need to parse;
BrandName = scan(Brand,1);
12-20-2012 09:41 AM
Thank you, I did not know that, it is so much easier! I still need help with parsing for the model# though. Anyone out there who can help me please?
12-20-2012 10:36 AM
Hello. I think this is getting close to what you want...but looking at the model variable, I'm not sure how you parse that properly.
data bbnov21zoomT check_bbnov21zoomT; infile "G:\irobot2\irobot-eval\data\nov21data\bb_nov21\zoombbnv21.csv" firstobs=2 dlm="|" dsd TRUNCOVER; length netprice regprice discamt 8. newbrand $50. newmodel $50.; input brand :$100. model :$50. price1 :8.2 price2 :8.2 _discamt :$10. ; if price1 = . then price1=0; if price2 = . then price2=0; if _discamt in ('','"') then discamt=0; else discamt = input(_discamt,8.); netprice= price1+price2; regprice=netprice+discamt; newbrand = scan(brand,1); newmodel = scan(model,1,' '); drop _discamt; if index(model, 'B0000') then output check_bbnov21zoomT; else output bbnov21zoomT; run;
By the way: the invalid discamt error was because there was a double-quote character at the end of one of the lines, being read into a numeric field.
12-20-2012 01:48 PM
Thank you very much TimArm. I now understand why that invalid discamt was appearing. Thanks also for the code, I will try it and also learn from it.