Hi, I'm a new SAS user and I'm trying to see if it can help me with my internship by organizing this data set. I need it organized into fields like:
code (4889871), packing style (4/40 CT), brand (Chili's), product description (BUN BRIOCHE SLIDER 2 INCH), QTY (2).
You should post the example text into a text box, opened on the forum with the </> icon that appears above the main message window. Open the box and then paste.
T QTY UNIT ITEM # BRAND CUBE WEIGHT DESCRIPTION lbs. PACK SIZE UNIT PRICE AMOUNT Cust PO#: #: Ship to#:------------------------------------------------------------- ------- -- ---------------------------------- --------------------------------- --------- F 2 CS 4889871 CHILI'S 3.40 21.00 BUN BRIOCHE SLIDER 2 INCH 4 / 40 CT 44.010 88.02 UNITED STATES ------------------------------------------------------------- ------- -- ---------------------------------- --------------------------------- --------- F 40 CS 7147063 CHILI'S 74.84 633.60 CAKE CHOC MOLTEN 6 / 6 CT 65.800 2,632.00 UNITED STATES ------------------------------------------------------------- ------- -- ---------------------------------- --------------------------------- --------- F 2 CS 4352926 SARGNTO 2.40 50.00 CHEESE CURD BRD 6 / 4# 86.710 173.42 UNITED STATES ------------------------------------------------------------- ------- -- ---------------------------------- --------------------------------- --------- F 12 CS 1954528 ISLDOAS 7.80 336.00 DRINK MIX PINA COLADA FRZN 12 / 32 OZ 90.940 1,091.28 UNITED STATES ------------------------------------------------------------- ------- -- ---------------------------------- --------------------------------- --------- F 120 CS 7190382 LAMB 150.00 3,600.00POTATO FRY STR 3/8 STLTH SKONUNITED STATES ------------------------------------------------------------- 6 ------- / -- 5 LB ---------------------------------- 37.410 --------------------------------- 4,489.20 --------- F 2 CS 4316244 BIPC 1.37 60.00 SAUCE ALFREDO VEG BASE 6 / 5 LB 67.380 134.76 UNITED STATES ------------------------------------------------------------- ------- -- ---------------------------------- --------------------------------- --------- F 40 CS 2055184 SIMPLOT 53.90 1,650.00Strawberry-SlicedCHECK/CONFRM------------------------------------------------------------- 6 ------- / -- 6.5 LB ---------------------------------- 100.370 --------------------------------- 4,014.80 ---------- Cust PO#:SAMPLE Order #: Ship to#:------------------------------------------------------------- ------- -- ---------------------------------- --------------------------------- ---------
This way we can easily see the text under discussion.
My first instinct with data such as this is to go back to the source and see if it is available in another file format. Parsing text like this requires a very good description of what will appear in the file, and your example has some consistency issues.
A delimited file, where a special character such as a |, comma, or tab separates each field, would be easier to work with. So would a fixed-column file, where each field starts in the same column on every line.
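If you can get a delimited export, the read becomes trivial. A minimal sketch, assuming a pipe-delimited file named orders.txt (the variable names, order, and lengths are guesses, not your real layout):
/* Sketch only: assumes a pipe-delimited export named orders.txt */
data orders;
   infile "orders.txt" dlm='|' dsd truncover;
   length itemno $12 brand $20 description $80 packsize $20;
   /* amounts may contain commas, so read with the COMMA informat */
   input qty itemno brand description packsize unitprice amount :comma12.;
run;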
Consider this bit from your example:
F 12 CS 1954528 ISLDOAS 7.80 336.00 DRINK MIX PINA COLADA FRZN 12 / 32 OZ 90.940 1,091.28 UNITED STATES ------------------------------------------------------------- ------- -- ---------------------------------- --------------------------------- --------- F 120 CS 7190382 LAMB 150.00 3,600.00POTATO FRY STR 3/8 STLTH SKONUNITED STATES ------------------------------------------------------------- 6 ------- / -- 5 LB ---------------------------------- 37.410 --------------------------------- 4,489.20 ---------
I suspect that the line with "POTATO FRY" is partially corrupted, as there is no space between the 3,600.00 and POTATO; plus, is the "units" value really STLTH? About the only thing that lets us tell where the units appear is the presence of the / character, but this line doesn't appear to match that pattern.
Making a character-delimited file at the source would be preferable. I also suspect that something may already have been done to the text to wrap lines, and that is going to be a source of difficulty.
BTW, this is the kind of stuff I would normally be charging $75/hour or more to deal with poorly structured files.
UNITED STATES appearing on the same line also differs from the pattern of the majority of the lines.
@ballardw wrote:
BTW, this is the kind of stuff I would normally be charging $75/hour or more to deal with poorly structured files.
UNITED STATES appearing on the same line also differs from the pattern of the majority of the lines.
From the consulting rates I see these days, that's incredibly low. $150-300/hour for data programmers is the norm now.
When I was at SAS, I dealt with a lot of s* files like this generated by system logging. So, I agree with the comment about this is more of a consulting thing. Let me give you some pointers.
Use an ATTRIB statement to define all of your variables at the top of the DATA step, as in the sketch below.
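For example, something along these lines (the names and lengths below are guesses from the report headings, not a definitive layout):
/* Sketch: declare everything up front so types and lengths are fixed */
data want;
   attrib t           length=$1  label='Record type'
          qty         length=8
          unit        length=$10
          itemno      length=$12
          brand       length=$20
          cube        length=8
          weight      length=8
          description length=$80
          packsize    length=$20
          unitprice   length=8
          amount      length=8   format=comma12.2
   ;
   /* INPUT logic would follow here */
run;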
Focus on input pointer controls. The exact method is up to you to work out as you explore the file. You will be working with the double trailing @@ and the single trailing @, along with column-pointer adjustments such as @(numeric-expression). Here is a very old collection of INPUT statement examples showing ways to change the read location. It can get a lot more complex, including holding the record and parsing it; the SAS documentation on INPUT is voluminous. Look at the standalone (null) INPUT statement too. Grab the line, parse it, move to the next field, but hold the SAS record until you are done with it:
input @15 name $10.;
input @a name $10.;
input @(b*3) name $10.;
input @1 day 1. @5 weekday $10. @1 @weekday sales 8.2;
input @23 length 4. +5 width 4.;
input name $10. #2 id 3-4;
input name age / id 3-4;
input name $ 1-8 age 11-12;
input name $ age;
input name $char8. +2 income comma6.;
input name= $ age=;
input idno name $18. team $ 25-30 startwght endwght;
input;
input @;
input region $ jansales;
input region $ 1-7 jansales 12-16;
input region $7. +4 jansales 5.;
input region $7. @12 jansales 5.;
input @31 age 3. #3 id 3-4 #2 @6 name $20.;
input name $ 1-10 #2 age 13-14 #4;
input a / b #1 @52 c #2;
input a @(a-3) b;
input x ?? 10-12;
input x ? 10-12;
input x{*} 2.;
input type $1. @;
input course $ professor $;
input name $ age @@;
input x @;
input @x City $9.;
input @city Location : 2. @;
input @(trim(region{location})) Population : 4.;
input @26 book $ @1 company;
input x 1-10 +m y 2.;
input x 1-10 +(-1) y 2.;
input (x y z) (2.,+1);
input item : $10. amount comma5.;
input item $10. +1 amount comma5.;
input item : $10. amount;
input Name : $9. Score1-Score3 Team ~ $25. Div $;
input (score1-score5) (4. 4. 4. 4. 4.);
input (score1-score5) (4.);
input (name score1-score5) ($10. 5*4.);
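And a tiny sketch of that last idea, holding the record while you decide how to parse it (the fileref RAWFILE and the layout are made up, just to show the pattern):
/* Sketch: read a bit, hold the line with a trailing @, then branch */
data demo;
   infile rawfile truncover;
   input firstword $8. @;               /* read the first 8 columns and hold the line */
   if firstword =: 'HDR' then
      input @10 orderno $10.;           /* keep parsing the same held line            */
   else
      input @1 qty 5. @10 item $12.;    /* or re-read it from column 1                */
run;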
Looks like you are trying to read in a printed report. The example lines make it look like this might be impossible without some human intervention. But perhaps the file was corrupted by the process you used to make the sample? If the original file actually had the fields in fixed positions on the line (padded with spaces when the field values used fewer characters), then it should be simple.
In general, the method for this is to read enough of the line to determine what type of line it is, and then read the rest of the line based on that determination. In your case it looks like you have a shipping address followed by a series of detail lines, so when you find the shipping address line, read in those fields and RETAIN those variables so the values are copied onto each detail line.
First let's convert your sample lines into a file so we have something we can read.
options parmcards=text;
filename text temp;
parmcards4;
T QTY UNIT ITEM # BRAND CUBE WEIGHT DESCRIPTION lbs. PACK SIZE UNIT PRICE AMOUNT
Cust PO#: #: Ship to#:-------------------------------------------------------------≠ -------≠ --≠ ----------------------------------≠ ---------------------------------≠ ---------≠
F 2 CS 4889871 CHILI'S 3.40 21.00 BUN BRIOCHE SLIDER 2 INCH 4 / 40 CT 44.010 88.02
UNITED STATES -------------------------------------------------------------≠ -------≠ --≠ ----------------------------------≠ ---------------------------------≠ ---------≠
F 40 CS 7147063 CHILI'S 74.84 633.60 CAKE CHOC MOLTEN 6 / 6 CT 65.800 2,632.00
UNITED STATES -------------------------------------------------------------≠ -------≠ --≠ ----------------------------------≠ ---------------------------------≠ ---------≠
F 2 CS 4352926 SARGNTO 2.40 50.00 CHEESE CURD BRD 6 / 4# 86.710 173.42
UNITED STATES -------------------------------------------------------------≠ -------≠ --≠ ----------------------------------≠ ---------------------------------≠ ---------≠
F 12 CS 1954528 ISLDOAS 7.80 336.00 DRINK MIX PINA COLADA FRZN 12 / 32 OZ 90.940 1,091.28
UNITED STATES -------------------------------------------------------------≠ -------≠ --≠ ----------------------------------≠ ---------------------------------≠ ---------≠
F 120 CS 7190382 LAMB 150.00 3,600.00POTATO FRY STR 3/8 STLTH SKONUNITED STATES -------------------------------------------------------------≠ 6 -------≠ / --≠ 5 LB ----------------------------------≠ 37.410 ---------------------------------≠ 4,489.20 ---------≠
F 2 CS 4316244 BIPC 1.37 60.00 SAUCE ALFREDO VEG BASE 6 / 5 LB 67.380 134.76
UNITED STATES -------------------------------------------------------------≠ -------≠ --≠ ----------------------------------≠ ---------------------------------≠ ---------≠
F 40 CS 2055184 SIMPLOT 53.90 1,650.00Strawberry-SlicedCHECK/CONFRM-------------------------------------------------------------≠ 6 -------≠ / --≠ 6.5 LB ----------------------------------≠ 100.370 ---------------------------------≠ 4,014.80 ----------
Cust PO#:SAMPLE Order #: Ship to#:-------------------------------------------------------------≠ -------≠ --≠ ----------------------------------≠ ---------------------------------≠ ---------≠
;;;;
Now let's run a data step that reads from that TEXT fileref.
First define the variables, then start reading the lines. You can use a trailing @ to keep the column pointer fixed on the same line between separate INPUT statements. You can use @ 'quoted string' to try to find a particular position on the line. (Or perhaps use FIND() on the _INFILE_ automatic variable and then use @ numvar to jump to a position on the line.)
You can try working around the messiness by using the & modifier to read the next field up to two adjacent spaces instead of just one. You can also save the raw _INFILE_ value so you have something for humans to check when trying to fix any issues.
So something like this:
data want;
  infile text truncover firstobs=2;
  length dummy $8 ;
  length T $1 QTY 8 UNIT $10 ITEMNO $12 BRAND $20 CUBE WEIGHT 8
         DESCRIPTION $80 LBS $20 PACKSIZE $20 UNITPRICE AMOUNT 8
  ;
  length PONO $10 SHIPTO $100 ;
  retain pono shipto;
  * Peek at the first word (held with trailing @) to decide the line type;
  input dummy $ @;
  if dummy='Cust' then do;
    * Header line: pick up the PO number and ship-to, carried forward by RETAIN;
    input pono @'Ship to#:' shipto $100.;
    delete;
  end;
  else if dummy='UNITED' then delete;
  * Detail line: the & modifier lets values contain single embedded blanks;
  input @1 t qty unit itemno brand cube weight description & lbs & @;
  input @ '/' packsize & unitprice amount :comma. ;
  * Keep the raw line so a human can check problem records;
  line=_infile_;
run;
It works pretty well for the first few lines.
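If the @'quoted string' search turns out to be unreliable, the FIND() on _INFILE_ idea mentioned above could be sketched like this (untested against the real file; the search string and offset are assumptions):
/* Sketch: locate a label with FIND(), then jump to a computed column */
data want2;
   infile text truncover firstobs=2;
   input @;                               /* load the line and hold it     */
   pos = find(_infile_,'Ship to#:');      /* locate the literal label      */
   if pos then input @(pos+9) shipto $100.; /* jump just past the label    */
   line = _infile_;                       /* keep the raw line for review  */
run;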