BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SteveNZ
Obsidian | Level 7

Hi all,

I have a dataset of system notes which recently started storing the notes in XML format so a long with a whole lot of other information there is one variable called notes that now contains XML. All the other variables are just standard character or numeric variables. An example of the notes var is:

<notes displayAsDescending="yes" showtimezone="no"><note><date domain="SVR_DATETIME">20140926T113458</date><timeZone>Pacific/Auckland</timeZone><author username="">Jane Doe</author><content>Note One Content Here</content></note><note><date domain="SVR_DATETIME">20141030T133030</date><timeZone>Pacific/Auckland</timeZone><author username="">Joe Bloggs</author><content>Note Two Content In Here</content></note><note><date domain="SVR_DATETIME">20141117T135213</date><timeZone>New Zealand Standard Time</timeZone><author username="">Joe Public</author><content>Note Three yada yada</content></note><note><date domain="SVR_DATETIME">20141117T152544</date><timeZone>New Zealand Standard Time</timeZone><author username="">Sally Citizen</author><content>Note4 Could be really long</content></note></notes>

The amount of separate <note>'s within each record varies so it could be one or 20 or more. What I need to do is for each <note> output a separate line with each of the nodes ie date, timezone, author and content that are in the xml. So far I have tried two approaches.

First I tried a PERL approach but I couldn't get it to handle the variable lengths ie the code below is intended to separate out each <note> from which I'd do further processing but it returns all:

data notes;

    length fnd $10100 ;

set xml ;

   ExpressionID = prxparse('/(<note>.+<\/note>)/i');

   start = 1;

   stop = length(notes);

   call prxnext(ExpressionID, start, stop, notes, position, length);

     prior_position = 1;

      do while (position > 0);

    fnd = prxposn(ExpressionID,1,notes);

  prior_position = position + length(fnd)+1;

         put fnd= ;

         call prxnext(ExpressionID, start, stop, notes, position, length);

         output ;

      end;

run;

It will output direct matches though ie if I just got it to pull out <note> it would correctly get all 4 in the example above but the second I ask it to get everything between <note> and </note> it gives me the whole record.

Secondly I saved the output as an xml file and successfully mapped it using XML mapper but I am unsure if the map can be applied to a variable within a dataset? All the info I found here and on the net all relate to importing in a whole xml file and I need some of the other variables in the dataset as well. I also dont really want to export it to xml and then import it in if I can help it.

Can anyone please assist? Many thanks in advance

Steve

1 ACCEPTED SOLUTION

Accepted Solutions
BrunoMueller
SAS Super FREQ

Find below your code with some additional statements to get the values between tags and get the tag name, it might be a starting point.

Writing the XML to a file and then use the XML libname engine with a MAP file is maybe a lot more flexible and less code maintenance. You could add a key defintion to the XML string, so that you later can join the original data with the data read from the XML file.

/*
* this regular expression should match any XML tag
* minimum length of tag is 2
/(<.[^(><.)]+>)/
*/

data have;
  length notes $ 32767;
  notes =
'<notes><aa>aacaa</aa><note><date domain="SVR_DATETIME">20140926T113458</date><timeZone>Pacific/Auckland</timeZone><author username="">Jane Doe</author><content>Note One Content Here</content></note><note><date domain="SVR_DATETIME">20141030T133030</date><timeZone>Pacific/Auckland</timeZone><author username="">Joe Bloggs</author><content>Note Two Content In Here</content></note><note><date domain="SVR_DATETIME">20141117T135213</date><timeZone>New Zealand Standard Time</timeZone><author username="">Joe Public</author><content>Note Three yada yada</content></note><note><date domain="SVR_DATETIME">20141117T152544</date><timeZone>New Zealand Standard Time</timeZone><author username="">Sally Citizen</author><content>Note4 Could be really long</content></note></notes>';

 
* notes = '<notes displayAsDescending="yes" showtimezone="no"><note><date domain="SVR_DATETIME">20140926T113458</date><timeZone>Pacific/Auckland</timeZone><author username="">Jane Doe</author><content>Note One Content Here</content></note><note><date domain="SVR_DATETIME">20141030T133030</date><timeZone>Pacific/Auckland</timeZone><author username="">Joe Bloggs</author><content>Note Two Content In Here</content></note><note><date domain="SVR_DATETIME">20141117T135213</date><timeZone>New Zealand Standard Time</timeZone><author username="">Joe Public</author><content>Note Three yada yada</content></note><note><date domain="SVR_DATETIME">20141117T152544</date><timeZone>New Zealand Standard Time</timeZone><author username="">Sally Citizen</author><content>Note4 Could be really long</content></note></notes>';
  regxID = prxparse(
'/(<.[^(><.)]+>)/');
  start = 1;
 
stop = length(notes);
  call prxnext(regxID, start, stop, notes, position, length);
  prior_position = 1;

 
do while (position > 0);
    tag = prxposn(regxID, 1, notes);
    prevPos = prior_position;
    prior_position = position + length;
    endTag = ( tag =:
"</" );

   
if endTag = 1 then do;
      valueLen = position - prevPos;

     
if valueLen > 0 then do;
       
length tagValue $ 128;
        tagValue = substr(notes, prevPos, valueLen);
        tagName = scan(tag,
2, "/>");
      end;
   
end;

   
output;
   
call missing( valueLen, tagValue, tagName );
    call prxnext(regxID, start, stop, notes, position, length);
  end;
run;

View solution in original post

3 REPLIES 3
BrunoMueller
SAS Super FREQ

Find below your code with some additional statements to get the values between tags and get the tag name, it might be a starting point.

Writing the XML to a file and then use the XML libname engine with a MAP file is maybe a lot more flexible and less code maintenance. You could add a key defintion to the XML string, so that you later can join the original data with the data read from the XML file.

/*
* this regular expression should match any XML tag
* minimum length of tag is 2
/(<.[^(><.)]+>)/
*/

data have;
  length notes $ 32767;
  notes =
'<notes><aa>aacaa</aa><note><date domain="SVR_DATETIME">20140926T113458</date><timeZone>Pacific/Auckland</timeZone><author username="">Jane Doe</author><content>Note One Content Here</content></note><note><date domain="SVR_DATETIME">20141030T133030</date><timeZone>Pacific/Auckland</timeZone><author username="">Joe Bloggs</author><content>Note Two Content In Here</content></note><note><date domain="SVR_DATETIME">20141117T135213</date><timeZone>New Zealand Standard Time</timeZone><author username="">Joe Public</author><content>Note Three yada yada</content></note><note><date domain="SVR_DATETIME">20141117T152544</date><timeZone>New Zealand Standard Time</timeZone><author username="">Sally Citizen</author><content>Note4 Could be really long</content></note></notes>';

 
* notes = '<notes displayAsDescending="yes" showtimezone="no"><note><date domain="SVR_DATETIME">20140926T113458</date><timeZone>Pacific/Auckland</timeZone><author username="">Jane Doe</author><content>Note One Content Here</content></note><note><date domain="SVR_DATETIME">20141030T133030</date><timeZone>Pacific/Auckland</timeZone><author username="">Joe Bloggs</author><content>Note Two Content In Here</content></note><note><date domain="SVR_DATETIME">20141117T135213</date><timeZone>New Zealand Standard Time</timeZone><author username="">Joe Public</author><content>Note Three yada yada</content></note><note><date domain="SVR_DATETIME">20141117T152544</date><timeZone>New Zealand Standard Time</timeZone><author username="">Sally Citizen</author><content>Note4 Could be really long</content></note></notes>';
  regxID = prxparse(
'/(<.[^(><.)]+>)/');
  start = 1;
 
stop = length(notes);
  call prxnext(regxID, start, stop, notes, position, length);
  prior_position = 1;

 
do while (position > 0);
    tag = prxposn(regxID, 1, notes);
    prevPos = prior_position;
    prior_position = position + length;
    endTag = ( tag =:
"</" );

   
if endTag = 1 then do;
      valueLen = position - prevPos;

     
if valueLen > 0 then do;
       
length tagValue $ 128;
        tagValue = substr(notes, prevPos, valueLen);
        tagName = scan(tag,
2, "/>");
      end;
   
end;

   
output;
   
call missing( valueLen, tagValue, tagName );
    call prxnext(regxID, start, stop, notes, position, length);
  end;
run;
SteveNZ
Obsidian | Level 7

Hi Bruno,

Thank you so much for this, really appreciated. I agree with the write to xml as when I did this I got perfect results but not sure I can get away with this as it's going to end up on our web portal and not sure the admins will allow it. Might punt up both options but as long as the xml doesn't change I can easily get exactly what I want with your method.

cheers

Steve

Ksharp
Super User

If these fields always have value , You could try:

input @'SVR_DATETIME'   date    @'timeZone'   zone  .............

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to choose a machine learning algorithm

Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 1341 views
  • 0 likes
  • 3 in conversation