Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Need to Separate XML In Variable Into Separate Lines and Variables

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 87
Accepted Solution

Need to Separate XML In Variable Into Separate Lines and Variables

Hi all,

I have a dataset of system notes which recently started storing the notes in XML format so a long with a whole lot of other information there is one variable called notes that now contains XML. All the other variables are just standard character or numeric variables. An example of the notes var is:

<notes displayAsDescending="yes" showtimezone="no"><note><date domain="SVR_DATETIME">20140926T113458</date><timeZone>Pacific/Auckland</timeZone><author username="">Jane Doe</author><content>Note One Content Here</content></note><note><date domain="SVR_DATETIME">20141030T133030</date><timeZone>Pacific/Auckland</timeZone><author username="">Joe Bloggs</author><content>Note Two Content In Here</content></note><note><date domain="SVR_DATETIME">20141117T135213</date><timeZone>New Zealand Standard Time</timeZone><author username="">Joe Public</author><content>Note Three yada yada</content></note><note><date domain="SVR_DATETIME">20141117T152544</date><timeZone>New Zealand Standard Time</timeZone><author username="">Sally Citizen</author><content>Note4 Could be really long</content></note></notes>

The amount of separate <note>'s within each record varies so it could be one or 20 or more. What I need to do is for each <note> output a separate line with each of the nodes ie date, timezone, author and content that are in the xml. So far I have tried two approaches.

First I tried a PERL approach but I couldn't get it to handle the variable lengths ie the code below is intended to separate out each <note> from which I'd do further processing but it returns all:

data notes;

    length fnd $10100 ;

set xml ;

   ExpressionID = prxparse('/(<note>.+<\/note>)/i');

   start = 1;

   stop = length(notes);

   call prxnext(ExpressionID, start, stop, notes, position, length);

     prior_position = 1;

      do while (position > 0);

    fnd = prxposn(ExpressionID,1,notes);

  prior_position = position + length(fnd)+1;

         put fnd= ;

         call prxnext(ExpressionID, start, stop, notes, position, length);

         output ;

      end;

run;

It will output direct matches though ie if I just got it to pull out <note> it would correctly get all 4 in the example above but the second I ask it to get everything between <note> and </note> it gives me the whole record.

Secondly I saved the output as an xml file and successfully mapped it using XML mapper but I am unsure if the map can be applied to a variable within a dataset? All the info I found here and on the net all relate to importing in a whole xml file and I need some of the other variables in the dataset as well. I also dont really want to export it to xml and then import it in if I can help it.

Can anyone please assist? Many thanks in advance

Steve


Accepted Solutions
Solution
‎02-12-2015 01:02 PM
SAS Super FREQ
Posts: 682

Re: Need to Separate XML In Variable Into Separate Lines and Variables

Find below your code with some additional statements to get the values between tags and get the tag name, it might be a starting point.

Writing the XML to a file and then use the XML libname engine with a MAP file is maybe a lot more flexible and less code maintenance. You could add a key defintion to the XML string, so that you later can join the original data with the data read from the XML file.

/*
* this regular expression should match any XML tag
* minimum length of tag is 2
/(<.[^(><.)]+>)/
*/

data have;
  length notes $ 32767;
  notes =
'<notes><aa>aacaa</aa><note><date domain="SVR_DATETIME">20140926T113458</date><timeZone>Pacific/Auckland</timeZone><author username="">Jane Doe</author><content>Note One Content Here</content></note><note><date domain="SVR_DATETIME">20141030T133030</date><timeZone>Pacific/Auckland</timeZone><author username="">Joe Bloggs</author><content>Note Two Content In Here</content></note><note><date domain="SVR_DATETIME">20141117T135213</date><timeZone>New Zealand Standard Time</timeZone><author username="">Joe Public</author><content>Note Three yada yada</content></note><note><date domain="SVR_DATETIME">20141117T152544</date><timeZone>New Zealand Standard Time</timeZone><author username="">Sally Citizen</author><content>Note4 Could be really long</content></note></notes>';

 
* notes = '<notes displayAsDescending="yes" showtimezone="no"><note><date domain="SVR_DATETIME">20140926T113458</date><timeZone>Pacific/Auckland</timeZone><author username="">Jane Doe</author><content>Note One Content Here</content></note><note><date domain="SVR_DATETIME">20141030T133030</date><timeZone>Pacific/Auckland</timeZone><author username="">Joe Bloggs</author><content>Note Two Content In Here</content></note><note><date domain="SVR_DATETIME">20141117T135213</date><timeZone>New Zealand Standard Time</timeZone><author username="">Joe Public</author><content>Note Three yada yada</content></note><note><date domain="SVR_DATETIME">20141117T152544</date><timeZone>New Zealand Standard Time</timeZone><author username="">Sally Citizen</author><content>Note4 Could be really long</content></note></notes>';
  regxID = prxparse(
'/(<.[^(><.)]+>)/');
  start = 1;
 
stop = length(notes);
  call prxnext(regxID, start, stop, notes, position, length);
  prior_position = 1;

 
do while (position > 0);
    tag = prxposn(regxID, 1, notes);
    prevPos = prior_position;
    prior_position = position + length;
    endTag = ( tag =:
"</" );

   
if endTag = 1 then do;
      valueLen = position - prevPos;

     
if valueLen > 0 then do;
       
length tagValue $ 128;
        tagValue = substr(notes, prevPos, valueLen);
        tagName = scan(tag,
2, "/>");
      end;
   
end;

   
output;
   
call missing( valueLen, tagValue, tagName );
    call prxnext(regxID, start, stop, notes, position, length);
  end;
run;

View solution in original post


All Replies
Solution
‎02-12-2015 01:02 PM
SAS Super FREQ
Posts: 682

Re: Need to Separate XML In Variable Into Separate Lines and Variables

Find below your code with some additional statements to get the values between tags and get the tag name, it might be a starting point.

Writing the XML to a file and then use the XML libname engine with a MAP file is maybe a lot more flexible and less code maintenance. You could add a key defintion to the XML string, so that you later can join the original data with the data read from the XML file.

/*
* this regular expression should match any XML tag
* minimum length of tag is 2
/(<.[^(><.)]+>)/
*/

data have;
  length notes $ 32767;
  notes =
'<notes><aa>aacaa</aa><note><date domain="SVR_DATETIME">20140926T113458</date><timeZone>Pacific/Auckland</timeZone><author username="">Jane Doe</author><content>Note One Content Here</content></note><note><date domain="SVR_DATETIME">20141030T133030</date><timeZone>Pacific/Auckland</timeZone><author username="">Joe Bloggs</author><content>Note Two Content In Here</content></note><note><date domain="SVR_DATETIME">20141117T135213</date><timeZone>New Zealand Standard Time</timeZone><author username="">Joe Public</author><content>Note Three yada yada</content></note><note><date domain="SVR_DATETIME">20141117T152544</date><timeZone>New Zealand Standard Time</timeZone><author username="">Sally Citizen</author><content>Note4 Could be really long</content></note></notes>';

 
* notes = '<notes displayAsDescending="yes" showtimezone="no"><note><date domain="SVR_DATETIME">20140926T113458</date><timeZone>Pacific/Auckland</timeZone><author username="">Jane Doe</author><content>Note One Content Here</content></note><note><date domain="SVR_DATETIME">20141030T133030</date><timeZone>Pacific/Auckland</timeZone><author username="">Joe Bloggs</author><content>Note Two Content In Here</content></note><note><date domain="SVR_DATETIME">20141117T135213</date><timeZone>New Zealand Standard Time</timeZone><author username="">Joe Public</author><content>Note Three yada yada</content></note><note><date domain="SVR_DATETIME">20141117T152544</date><timeZone>New Zealand Standard Time</timeZone><author username="">Sally Citizen</author><content>Note4 Could be really long</content></note></notes>';
  regxID = prxparse(
'/(<.[^(><.)]+>)/');
  start = 1;
 
stop = length(notes);
  call prxnext(regxID, start, stop, notes, position, length);
  prior_position = 1;

 
do while (position > 0);
    tag = prxposn(regxID, 1, notes);
    prevPos = prior_position;
    prior_position = position + length;
    endTag = ( tag =:
"</" );

   
if endTag = 1 then do;
      valueLen = position - prevPos;

     
if valueLen > 0 then do;
       
length tagValue $ 128;
        tagValue = substr(notes, prevPos, valueLen);
        tagName = scan(tag,
2, "/>");
      end;
   
end;

   
output;
   
call missing( valueLen, tagValue, tagName );
    call prxnext(regxID, start, stop, notes, position, length);
  end;
run;
Frequent Contributor
Posts: 87

Re: Need to Separate XML In Variable Into Separate Lines and Variables

Hi Bruno,

Thank you so much for this, really appreciated. I agree with the write to xml as when I did this I got perfect results but not sure I can get away with this as it's going to end up on our web portal and not sure the admins will allow it. Might punt up both options but as long as the xml doesn't change I can easily get exactly what I want with your method.

cheers

Steve

Super User
Posts: 9,662

Re: Need to Separate XML In Variable Into Separate Lines and Variables

If these fields always have value , You could try:

input @'SVR_DATETIME'   date    @'timeZone'   zone  .............

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 381 views
  • 0 likes
  • 3 in conversation