BookmarkSubscribeRSS Feed
deleted_user
Not applicable
Hi Guys,

I'm really struggling with a rather intriguing problem. I have a file of XML data that I need to read from SAS. The XML statements occur on two separate lines and only those statements with a valid value needs to be read and the others need to be deleted before writing the statements to a new file.

Here's an example:

My file consists of the following data:


Cost of Goods Sold

ss:Type="Number">2285.4549999999999


ss:Type="Number">2087.239


ss:Type="Number">2082.4839999999999


ss:Type="Number">2537.913


ss:Type="Number">4086.61


ss:Type="Number">5616.8729999999996


ss:Type="Number">6391.4539999999997


ss:Type="Number">7060.6840000000002


ss:Type="String">


ss:Type="String">


ss:Type="String">


ss:Type="String">


ss:Type="String">


ss:Type="String">


ss:Type="String">


ss:Type="String">


ss:Type="String">


ss:Type="String">


ss:Type="String">


ss:Type="String">


ss:Type="String">


ss:Type="String">



From this file I need to read in the valid statements beginning at line . So the valid lines will be:


Cost of Goods Sold

ss:Type="Number">2285.4549999999999


ss:Type="Number">2087.239


ss:Type="Number">2082.4839999999999


ss:Type="Number">2537.913


and the rest of the lines after ss:Type="Number">2537.913
needs to be deleted or just these lines need to be written to a new file and end it with tag.
When the program encounters line #2 ss:Type="Number">2537.913 with a number between the ss:Type="Number"> and then the line before that beginning is invalid hence delete this line and the line above that statement.

I was able to successfully delete all occurrences of ss:Type="Number"> but I'm still trying to figure out how to delete the statements above each occurances of ss:Type="Number">.

I would really appreciate your help on this matter.

Many Thanks,
Jerry
17 REPLIES 17
Cynthia_sas
SAS Super FREQ
Hi:
I'm having trouble figuring out what you want to do. Some possibilities that occur to me are these:
1) You have a Spreadsheet Markup Language XML file created by Excel and you are trying to read that XML into a SAS data set
---If this is what you are trying to do (get from Excel to a SAS dataset and if your data are in this form of XML, I would recommend resaving the file in Excel as an Excel 97-2003 .XLS binary file and then read the file using the SAS Libname engine for Excel or with PROC IMPORT.

2) You are trying to -alter- some Spreadsheet Markup Language XML so that the form you have (shown at the top of your post) turns into the shortened XML that you show at the bottom of your post.
---If this is what you are trying to do (transform the XML from one form into another), then you might try a DATA step program with some string manipulation with character functions. I'm not sure whether the XML Mapper would help you here or not. Since the starting XML looks fairly regular, you could treat everything within <Cell> and </Cell> as one huge character variable and/or everything within one <Data> and </Data> as one huge character variable. But it sort of depends on what you're trying to do in the long run.


3) Or you are trying to do something else???

Did the data originally come from a SAS dataset? Are you trying to create a file for Excel??? Or create a file for Excel using SAS to clean up some XML???

cynthia
deleted_user
Not applicable
Hi Cynthia,

Thanks a million for your response.

The excel file is generated as a result of a webquery. For ex: a client wants to get historical data for a company such as IBM for years 2000 - 2004. Once the client selects the required parameters via the web the SAS is called which will then extract the required data from the corresponding datasets and outputs the result in the form of xls spreadsheet.

The problem is, since the whole process is web based, SAS is run from an unix environment which does not support DDE. Moreover, the excel spreadsheet is a pre-formatted template with formulas in each cell. And selected rows have a continuous border based on the style ID.

The excel template is formatted to 26 columns and 1396 rows. So if the user selects data for only 4 years then when the template is populated with the data the rest of the empty columns that have the borders should be hidden.

I found out that within the and if you remove the the cell tag:


ss:Type="String">
**** NO DATA INBETWEEN

the borders will be removed.

So my goal is to have sas search the entire xml tags and find and remove all instances of where there is no data between ss:Type="String">

Since I'm running the process from shell I read in the original template in XML form.

I hope this clears up your doubts.

Regards,
Jerry
Cynthia_sas
SAS Super FREQ
Hi:
Here's what I don't understand about what you're trying to do. You said:

Once the client selects the required parameters via the web the SAS is called which will then extract the required data from the corresponding datasets and SAS outputs the result in the form of xls spreadsheet.


I added the word "SAS" before the word "outputs", because that's how I read the sentence. What I don't understand is that IF SAS is creating the XLS spreadsheet -- how is it making the spreadsheet -- using ODS HTML, using PROC EXPORT, using CSV, using TAGSETS.EXCELXP?????? If you are using SAS to create the XML that you now want to turn around and edit, why not just change the SAS step to only contain the rows or turn off the borders???

Generally speaking SAS does not create an XLT file or populate an Excel template, it creates an ASCII text file that Excel can open (if you use ODS CSV, ODS HTML or ODS TAGSETS.EXCELXP). When SAS uses ODS, it does not ADD or REMOVE rows from a table or a spreadsheet -- it recreates the file -- so you should be able to recreate the file. If you use PROC EXPORT, SAS might add a sheet to a workbook and then you might use an Excel template with the data in that workbook.

This is why I recommend that you contact Tech Support I think that altering the XML file might be above and beyond what you really need to do. And without seeing all of the code and the involved files, it's hard to make that determination or offer a meaningful suggestion here. I wouldn't recommend editing the XML until after you've investigated other options, especially if SAS is creating the XML file in the first place (the one you now want to edit).

cynthia
deleted_user
Not applicable
Hi Cynthia,

I am given a pre-formatted excel sheet of 26 rows and 1396 columns where all cells from columns C - X have a financial formula in them. This template is not generated by SAS.

So my goal is plug this template into a SAS process where the data extracted from a Data Step will be put into specific cells of the given template. Thus eliminating the need for the user to copy the results of the sas query and manually paste it into the excel template.

I converted this pre-formatted excel spread sheet into excel xml form as template.xml and have sas read in the template file.

Regards
Jerry
CurtisMack
Fluorite | Level 6
Jerry,

Does this code answer your question?

filename in "c:\temp\junk.xml";
filename out "c:\temp\junk2.xml";

data _null_;
infile in lrecl=120;
file out lrecl=4000;
length CellHolder $4000;
retain CellHolder;

input;

if _infile_ =: ' CellHolder = _infile_;
end;
if CellHolder = "" then do;
put _infile_;
end;
else do;
CellHolder = cats(CellHolder,_infile_);
if substr(CellHolder,length(CellHolder)-6) = '
' then
do;
if index(CellHolder,">") = 0 then put CellHolder;
CellHolder = "";
end;
end;
run;
deleted_user
Not applicable
Hi Curtis,

Thanks a bunch for your reply. Yes thats exactly what I was looking for. But here's something interesting that I found. I tried to get the data into a dataset just to see what I get and only those tags were stored that had data in them.

But the most interesting part is when I tried the same data step as you suggested and tried to write to a file. I get the same # of lines as output as in the read file. Meaning unlike in the data step no rows were removed.

I know its gotta be something simple.

Many Thanks
Jerry
CurtisMack
Fluorite | Level 6
Jerry,

I am glad it helped, but a bit confused with your reply. Let me know if there was a question you needed answered, our if you were just sharing your experience. I tested the code I sent you and the junk2.xml only had the rows I think you wanted. If you change that _null_ to a dataset name, you will get results, but not what you were looking for. The logic is based arround writting the correct results to the output file, but the dataset will be populated by the automatic output that occurs before the run statement. If you want a dataset containing these results, you will need to put some explicit output statements along with the put statements, and put the results into a common variable. Here is an example that generates the file and the dataset.


data JunkData(keep = outdata);
infile in lrecl=120;
file out lrecl=4000;
length CellHolder outdata $4000;
retain CellHolder;

input;

if _infile_ =: ' CellHolder = _infile_;
end;
if CellHolder = "" then do;
outdata = _infile_;
output;
put _infile_;
end;
else do;
CellHolder = cats(CellHolder,_infile_);
if substr(CellHolder,length(CellHolder)-6) = '
' then
do;
if index(CellHolder,">") = 0 then do;
outdata = CellHolder;
output;
put CellHolder;
end;
CellHolder = "";
end;
end;
run;
deleted_user
Not applicable
Hi Curtis,

Thanks a bunch. The program is perfect when the raw datafile has now trailing blanks. but lets say if the rawdata reads:


    Cost of Goods     Sold
   
        quot;,R[608]C)">
       
   
  ss:Type="Number">2285.4549999999999


.
.
.
.
.
.
.


I need to remove the trailing blank before letting sas read in the tags. Therefore in such an instance when I run the code the entire contents of the rawdata file is being put into the outfile.

Regards,
Jerry After the tag there is a blank space for each occurrence of


Message was edited by: jerry001
CurtisMack
Fluorite | Level 6
Jerry,

A simple STRIP function will solve most of what you are talking about:
-------------------------------------------------------

data JunkData(keep = outdata);
infile in lrecl=120;
file out lrecl=4000;
length CellHolder outdata InStr $4000;
retain CellHolder;

input;
InStr= strip(_infile_);

if strip(InStr) =: ' CellHolder = InStr;
end;
if CellHolder = "" then do;
outdata = InStr;
output;
put InStr;
end;
else do;
CellHolder = cats(CellHolder,InStr);
if substr(CellHolder,length(CellHolder)-6) = '
' then
do;
if index(CellHolder,">") = 0 then do;
outdata = CellHolder;
output;
put CellHolder;
end;
CellHolder = "";
end;
end;
run;
--------------------------------------------------------------

but I never meant to imply that writting your own XML parser was going to be easy. I have done it in the past, but it was not simple, and i had to assume a lot about how the XML would be structured in that particular process. I only offered my solution because your example was so simple. If for example you want to drop that line that reads "", you will need to add more logic to capture that condition and my code doesn't currently handle that last line of your new example.

Good Luck,
Curtis
deleted_user
Not applicable
Hi Curtis,

I admit.. this XML stuff is killing me.

It wasn't until after I posted my reply did I realize that I could handle the spacing issue by applying a left trim. For the most part I have modified the code to get the desired result but I'm still faced with a major problem.

After running the program in the outfile i get the following:


Cost of Goods Sold


ss:Type="Number">2285.4549999999999


ss:Type="Number">2087.239


ss:Type="Number">2082.4839999999999


ss:Type="Number">2537.913


ss:Type="Number">4086.61




















I'm not sure on how to get rid of the multiple repetitions of:



Below is my code:


data _null_;
infile indata lrecl=1024;
file outdata lrecl=4000;
length CellHolder tag1 tag2 $4000;
retain CellHolder tag1;

putlog tag2;
input;

if _infile_ =: ' CellHolder = _infile_;
end;
if CellHolder = "" then do;
put _infile_;
end;
else do;
*CellHolder = cats(CellHolder,_infile_);
CellHolder = _infile_;
if index(_infile_, '/>') > 0 then do; tag1 = _infile_; put tag1; end;
if index(_infile_, 'ss:Formula=') > 0 then do; tag2 = _infile_; put tag2; tag2 = ""; end;
*put tag2;
if substr(CellHolder,length(CellHolder)-6) = '
' then
do;
if index(CellHolder,">
") = 0 then put CellHolder;
CellHolder = "";
end;
end;
run;

I'm indeed so very grateful for your help.

Regards,
Jerry
deleted_user
Not applicable
Hi Curtis,

I admit.. this XML stuff is killing me.

It wasn't until after I posted my reply did I realize that I could handle the spacing issue by applying a left trim. For the most part I have modified the code to get the desired result but I'm still faced with a major problem.

After running the program in the outfile i get the following:


Cost of Goods Sold


ss:Type="Number">2285.4549999999999


ss:Type="Number">2087.239


ss:Type="Number">2082.4839999999999


ss:Type="Number">2537.913


ss:Type="Number">4086.61




















I'm not sure on how to get rid of the multiple repetitions of:




I'm indeed so very grateful for your help.

Regards,
Jerry Message was edited by: jerry001
CurtisMack
Fluorite | Level 6
My original code had a flaw that worked fine in the original sample data, but not in this new sample. The logic to concatinate the cell string needed to be in its own if block. I think this will solve that last problem.
------------------------------------------

data _null_;
infile in lrecl=120;
file out lrecl=4000;
length CellHolder $4000;
retain CellHolder;

input;
if CellHolder ne "" then do;
CellHolder = catx(" ",CellHolder,_infile_);
end;
else if strip(_infile_) =: ' CellHolder = _infile_;
end;

if CellHolder = "" then do;
put _infile_;
end;
else do;
if substr(CellHolder,length(CellHolder)-6) = '
' then
do;
if index(CellHolder,">") = 0 then do;
put CellHolder;
end;
CellHolder = "";
end;
end;
run;
-----------------------------------------------------------

Curtis
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
One comment - you have no END=EOF condition for your INFILE, so on the last input record, is there going to be a pending PUT that needs to be executed because your DATA step will not iterate again? Hopefully you are running some examples with either LIST; or PUTLOG _ALL_ / _INFILE_; coded in your program.

Scott Barry
SBBWorks, Inc.
CurtisMack
Fluorite | Level 6
As I said earlier, this example assumes a lot about the structure of the XML. If the tags are at the begining of a record, and they are properly closed with a tag which falls at the end of a record, this code does work correctly. Values are only retained if there is an open tag. All other lines are simply written out immediatly. I did test it on the examples Jerry supplied, but admittedly was not attempting to write a complete XML parser.

Curtis

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 17 replies
  • 2741 views
  • 0 likes
  • 4 in conversation