Import Text File with Variable Names in Each Row

Reply
New Contributor
Posts: 4

Import Text File with Variable Names in Each Row

Hi all -

Below is one record from a text based file that I need to import:

eventId=2305843081473767143 externalId=1350711152103278711 mrt=1381416790650 customerURI=/All Customers/SC MSS/sc-pc categorySignificance=/Informational categoryBehavior=/Execute/Query categoryDeviceGroup=/IDS/Network categoryOutcome=/Success categoryObject=/Host/Application/Service modelConfidence=0 severity=0 relevance=10 assetCriticality=0 priority=2 art=1381416786785 cat=evStatus rt=1381416786775 dhost=PCMCDCN101 dproc=monitor locality=1 ahost=CHOSTLS911.cgimss.com agt=12.11.157.109 av=5.1.7.6151.0 atz=UTC aid=3ODXDfkABABCAAdhI6X7n3Q\=\= at=superagent_ng dvchost=PCMCDCN101 dvc=12.18.65.23 deviceZoneID=M8NU8ASQBABDaJkLKSuC+mA\=\= deviceZoneURI=/All Zones/CGI MSS/NG MSA/Client Zone dtz=GMT devicePrcli26ssName=monitor eventAnnotationStageUpdateTime=1381416790667 eventAnnotationModificationTime=1381416790667 eventAnnotationAuditTrail=1,1379373843698,root,Queued,,,,\n eventAnnotationVersion=1 eventAnnotationEventId=2305843081473767143 eventAnnotationFlags=0 eventAnnotationEndTime=1381416786775 eventAnnotationManagerReceiptTime=1381416790650 _cefVer=0.1 ad.arcSightEventPath=3sgUYPB4BABCCaGcORHjfMQ\=\=

You'll see the variable names are part of the row, the end result required is separate variables in a SAS dataset such as:

eventId=2305843081473767143

externalId=1350711152103278711

mrt=1381416790650

...

The delimiter is essentially the '=' sign, meaning each time it follows the variable name but precedes the value for the variable.

An interesting twist is the next row of data could have different variable names embedded.

I've tried infile, proc import and various combinations of find, scan, tranwrd, etc.  Also tried a space delimiter but some of the values have spaces so that doesn't work.  Just can't seem to figure-out a strategy to pursue.  Any help or guidance would be appreciated.

Thanks,

Bart

Respected Advisor
Posts: 4,173

Re: Import Text File with Variable Names in Each Row

Below some code which works for the sample data you've provided. The code is based on the sample code found under SAS(R) 9.4 Functions and CALL Routines: Reference

You have in your data key/value pairs. You could use Proc Transpose in case you want to convert the key strings into SAS variables. You probably would need some pre-processing of the key strings in order to create valid SAS variable names (max. 32 characters, alphanumeric plus '_', not starting with a digit).

And: The delimiter between a key and a value is not simply a '=' as you're also having such equal signs as part of your values (eg: GcORHjfMQ\=\=). For the data you've posted the RegEx pattern that seems to work for finding the key string is:

1. beginning of the string or a blank

2. one or many characters which are: alphanumeric, or dot or underscore

3. an equal sign

data want(drop=_Smiley Happy;
  infile datalines;
  input;
  length key $100 value $300;
  _ExpressionID = prxparse('/(^| )[[:alnum:]\._]+(?==)/');
  _start = 1;
  _stop = lengthn(_infile_);

  /* Use PRXNEXT to find the first instance of the pattern, */
  /* then use DO WHILE to find all further instances.       */
  /* PRXNEXT changes the _start parameter so that searching  */
  /* begins again after the last match.                     */
  call prxnext(_ExpressionID, _start, _stop, _infile_, _position, _length);

  do while (_position > 0);
    key = left(substr(_infile_, _position, _length));
    _pos_value=_position+_length+1;
    call prxnext(_ExpressionID, _start, _stop, _infile_, _position, _length);
    value=substrn(_infile_, _pos_value, ifn(_position>0,_position-_pos_value,lengthn(_infile_)));
    output;
  end;

  datalines;
eventId=2305843081473767143 externalId=1350711152103278711 mrt=1381416790650 customerURI=/All Customers/SC MSS/sc-pc categorySignificance=/Informational categoryBehavior=/Execute/Query categoryDeviceGroup=/IDS/Network categoryOutcome=/Success categoryObject=/Host/Application/Service modelConfidence=0 severity=0 relevance=10 assetCriticality=0 priority=2 art=1381416786785 cat=evStatus rt=1381416786775 dhost=PCMCDCN101 dproc=monitor locality=1 ahost=CHOSTLS911.cgimss.com agt=12.11.157.109 av=5.1.7.6151.0 atz=UTC aid=3ODXDfkABABCAAdhI6X7n3Q\=\= at=superagent_ng dvchost=PCMCDCN101 dvc=12.18.65.23 deviceZoneID=M8NU8ASQBABDaJkLKSuC+mA\=\= deviceZoneURI=/All Zones/CGI MSS/NG MSA/Client Zone dtz=GMT devicePrcli26ssName=monitor eventAnnotationStageUpdateTime=1381416790667 eventAnnotationModificationTime=1381416790667 eventAnnotationAuditTrail=1,1379373843698,root,Queued,,,,\n eventAnnotationVersion=1 eventAnnotationEventId=2305843081473767143 eventAnnotationFlags=0 eventAnnotationEndTime=1381416786775 eventAnnotationManagerReceiptTime=1381416790650 _cefVer=0.1 ad.arcSightEventPath=3sgUYPB4BABCCaGcORHjfMQ\=\=
;
run;

Respected Advisor
Posts: 3,799

Re: Import Text File with Variable Names in Each Row

Your data line mostly conforms to the SAS named input style.  In the example one of the names is too long and the values that contain equal signs like aid=3ODXDfkABABCAAdhI6X7n3Q\=\= need to have extra blanks before and after the value like this aid=  3ODXDfkABABCAAdhI6X7n3Q\=\=  .  And of course there is the question are those equal signs really part of the value or something else.  And to use named input you need to know the names so you can define them as variables.  SAS will help you identify the names but you would need to know how they should be defined, or you could just make them all CHAR 128 or so and sort it later.  You could try to fixup the input using the _INFILE_ variable to conform to true named input but you may not have enough information.

19         filename FT15F001 temp lrecl=2048;
20         data test;
21            infile FT15F001;
22            length eventId externalId mrt customerURI aid 'ad.arcSightEventPath'n $128;
23            input (_all_)(=);
24            parmcards4;
26         ;;;;

NOTE:
The infile FT15F001 is:
      (system-specific pathname),
      (system-specific file attributes)

NOTE: NAME,
'categorySignificance' is not defined.
NOTE: NAME,
'categoryBehavior' is not defined.
NOTE: NAME,
'categoryDeviceGroup' is not defined.
NOTE: NAME,
'categoryOutcome' is not defined.
NOTE: NAME,
'categoryObject' is not defined.
NOTE: NAME,
'modelConfidence' is not defined.
NOTE: NAME,
'severity' is not defined.
NOTE: NAME,
'relevance' is not defined.
NOTE: NAME,
'assetCriticality' is not defined.
NOTE: NAME,
'priority' is not defined.
NOTE: NAME,
'art' is not defined.
NOTE: NAME,
'cat' is not defined.
NOTE: NAME,
'rt' is not defined.
NOTE: NAME,
'dhost' is not defined.
NOTE: NAME,
'dproc' is not defined.
NOTE: NAME,
'locality' is not defined.
NOTE: NAME,
'ahost' is not defined.
NOTE: NAME,
'agt' is not defined.
NOTE: NAME,
'av' is not defined.
NOTE: NAME,
'atz' is not defined.
NOTE: NAME,
'at' is not defined.

NOTE: NAME,
'dvchost' is not defined.
NOTE: NAME,
'dvc' is not defined.
NOTE: NAME,
'deviceZoneID' is not defined.
NOTE: NAME,
'8NU8ASQBABDaJkLKSuC+mA\' is not defined.
NOTE: NAME,
'= deviceZoneURI' is not defined.
NOTE: NAME,
'dtz' is not defined.
NOTE: NAME,
'devicePrcli26ssName' is not defined.
NOTE: NAME,
'eventAnnotationStageUpdateTime' is not defined.
NOTE: NAME,
'eventAnnotationModificationTime' is not defined.
NOTE: NAME,
'eventAnnotationAuditTrail' is not defined.
NOTE: NAME,
'eventAnnotationVersion' is not defined.
NOTE: NAME,
'eventAnnotationEventId' is not defined.
NOTE: NAME,
'eventAnnotationFlags' is not defined.
NOTE: NAME,
'eventAnnotationEndTime' is not defined.
NOTE: NAME,
'eventAnnotationManagerReceiptTime' is longer than 32 characters.
NOTE: NAME,
'_cefVer' is not defined.
NOTE: NAME,
'sgUYPB4BABCCaGcORHjfMQ\' is not defined.
NOTE: EQUAL SIGN not
found.
RULE:     ----+----
1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0                     
1         eventId=2305843081473767143 externalId=1350711152103278711 mrt=1381416790650 customerURI=/All Custom
    
101  ers/SC MSS/sc-pc categorySignificance=/Informational categoryBehavior=/Execute/Query categoryDeviceG
    
201  roup=/IDS/Network categoryOutcome=/Success categoryObject=/Host/Application/Service modelConfidence=
    
301  0 severity=0 relevance=10 assetCriticality=0 priority=2 art=1381416786785 cat=evStatus rt=1381416786
    
401  775 dhost=PCMCDCN101 dproc=monitor locality=1 ahost=CHOSTLS911.cgimss.com agt=12.11.157.109 av=5.1.7
    
501  .6151.0 atz=UTC aid=  3ODXDfkABABCAAdhI6X7n3Q\=\=  at=superagent_ng dvchost=PCMCDCN101 dvc=12.18.65.
    
601  23 deviceZoneID=M8NU8ASQBABDaJkLKSuC+mA\=\= deviceZoneURI=/All Zones/CGI MSS/NG MSA/Client Zone dtz=
    
701  GMT devicePrcli26ssName=monitor eventAnnotationStageUpdateTime=1381416790667 eventAnnotationModifica
    
801  tionTime=1381416790667 eventAnnotationAuditTrail=1,1379373843698,root,Queued,,,,\n eventAnnotationVe
    
901  rsion=1 eventAnnotationEventId=2305843081473767143 eventAnnotationFlags=0 eventAnnotationEndTime=138
   
1001  1416786775 eventAnnotationManagerReceiptTime=1381416790650 _cefVer=0.1 ad.arcSightEventPath=3sgUYPB4
   
1101  BABCCaGcORHjfMQ\=\= 1119
eventId=
2305843081473767143 externalId=1350711152103278711 mrt=1381416790650 customerURI=/All Customers/SC MSS/sc-pc
aid=
3ODXDfkABABCAAdhI6X7n3Q\=\= 'ad.arcSightEventPath'n=3 _ERROR_=1 _N_=1
NOTE:
1 record was read from the infile (system-specific pathname).
      The minimum record length was
1119.
      The maximum record length was
1119.
NOTE: The data set WORK.TEST has
1 observations and 6 variables.
NOTE: DATA statement used (Total process time):
      real time          
0.01 seconds
      cpu time           
0.01 seconds
     
Respected Advisor
Posts: 4,173

Re: Import Text File with Variable Names in Each Row

Posted in reply to data_null__

If we wouldn't get stuff like "NOTE: NAME, '= deviceZoneURI' is not defined." then I would be very tempted to use your approach, parse the resulting log and then generate an input statement based on the log messages.

Respected Advisor
Posts: 3,799

Re: Import Text File with Variable Names in Each Row

I did mention the values that contain equal sign like deviceZoneID=M8NU8ASQBABDaJkLKSuC+mA\=\= do not conform to the rules of named input.  The value needs to have two spaces before and after as deviceZoneID=  M8NU8ASQBABDaJkLKSuC+mA\=\= .  Asumming that \=\= are really suppose to be part of that value.

You could still do your parse thing and use

'= deviceZoneURI' is not defined.

to figure out where to insert spaces in the previous value.

Of couse we have only seen one line of the data.

Super Contributor
Posts: 339

Re: Import Text File with Variable Names in Each Row

Do you have any knowledge about when new records start? Like do they always begin with eventid= for example.

You can manipulate your file to set it up with a single variable=value per line and then read that however you want using PRX routines. Here's an example

data _null_;

x="eventId=2305843081473767143 externalId=1350711152103278711 mrt=1381416790650 customerURI=/All Customers/SC MSS/sc-pc categorySignificance=/Informational categoryBehavior=/Execute/Query categoryDeviceGroup=/IDS/Network categoryOutcome=/Success categoryObject=/Host/Application/Service modelConfidence=0 severity=0 relevance=10 assetCriticality=0 priority=2 art=1381416786785 cat=evStatus rt=1381416786775 dhost=PCMCDCN101 dproc=monitor locality=1 ahost=CHOSTLS911.cgimss.com agt=12.11.157.109 av=5.1.7.6151.0 atz=UTC aid=3ODXDfkABABCAAdhI6X7n3Q\=\= at=superagent_ng dvchost=PCMCDCN101 dvc=12.18.65.23 deviceZoneID=M8NU8ASQBABDaJkLKSuC+mA\=\= deviceZoneURI=/All Zones/CGI MSS/NG MSA/Client Zone dtz=GMT devicePrcli26ssName=monitor eventAnnotationStageUpdateTime=1381416790667 eventAnnotationModificationTime=1381416790667 eventAnnotationAuditTrail=1,1379373843698,root,Queued,,,,\n eventAnnotationVersion=1 eventAnnotationEventId=2305843081473767143 eventAnnotationFlags=0 eventAnnotationEndTime=1381416786775 eventAnnotationManagerReceiptTime=1381416790650 _cefVer=0.1 ad.arcSightEventPath=3sgUYPB4BABCCaGcORHjfMQ\=\=";

prxs=prxparse('m/((^|\s)([^=\s]*[^=\\\s])(=)(.*?))(?=(\Z|([^=\s]*[^=\\\s]=[^\Z\s\\]+)))/');

start=1; stop=length(x);

call prxnext(prxs, start, stop, x, pos, len);

do while(pos>0);

  string=substr(x, pos, len);

  put pos= len= string=;

  start=pos+len-1;

  call prxnext(prxs, start, stop, x, pos, len);

end;

run;


*edit apparently prxnext routine "captures" non-capturing groups when it changes start value over iteration so I had to manually recalculate start at each iteration as otherwise the non-capturing group detailed below that acts as the complex delimiter wouldn't be part of next capture so it essentially was only capturing every other variable=value group instead of each one. This is kind of a disappointing behaviour but at least it's somewhat easy to fix.

You'll have to excuse my poor formatting, I'm in a hurry.

As for a brief PRX logic walkthrough:

the first big () group is the capturing group, following this is the non-capturing group that is used to check for the next variable name followed by a non-escaped = (the next variable name w non escape = becomes the actual complex seperating group since equal alone is far from sufficient).

capturing group logic: ((^|\s)([^=\s]*[^=\\\s])(=)(.*?))

1. must start at either the start of the line or with a blank space

2. capture however many non-equal or non-space characters as possible

3. capture exactly one non-equal, non-space AND non-backslash character

4. capture exactly one equal character

5. capture whatever, as few as possible following that equal sign

non-capturing group logic that more or less dictates where 5. above stops: (?=(\Z|([^=\s]*[^=\\\s]=[^\Z\s\\]+)))

1.1 capture an end of line or end of string character OR

1.2.1 capture 2 through 4 above (capture a new variable name followed with an equal sign)

1.2.2 follow that group with at least one non-space, non-backslash and non-end of line/end of string delimiter (that was to handle the last line that had a buggy capture of 5. with capturing nothing at all because it was otherwise possible to capture partial)

Hope this helps, I can help with debugging of routine for larger data later

Vince

SAS Super FREQ
Posts: 708

Re: Import Text File with Variable Names in Each Row

Hi BMillson

Try using the pointer control @'character-string' to position at the proper location and then use modified list input. A first simple test indicates that it works, but you need to check with all your data. Please note the sequence of the variables in the INPUT statement needs to match the sequence in the data.

data want;
  infile datalines  dlm=" ";
 
input
    @
"eventId=" eventID : $32.
    @
"aid=" aid : $32.
    @
"ad.arcSightEventPath=" ad_arcSightEventPath : $32.
  ;
  datalines;
eventId=2305843081473767143 aid=3ODXDfkABABCAAdhI6X7n3Q\=\= ad.arcSightEventPath=3sgUYPB4BABCCaGcORHjfMQ\=\=
;

proc print;
run;
SAS Super FREQ
Posts: 708

Re: Import Text File with Variable Names in Each Row

Posted in reply to Bruno_SAS

Just realized that some data values contain blanks, so it will not work.

Ask a Question
Discussion stats
  • 7 replies
  • 687 views
  • 4 likes
  • 5 in conversation