regex - re pulls data from one tag but not the other


I am trying to get a program to work that parses HTML tags; it's a TREC collection. I don't program often, except for databases, and I keep getting stuck on syntax. Here's my current code:

    import re

    # the following code: the p regex worked in Python
    def parsetrec (atext):
      atext=open(atext, "r")
      filepath= "testla.txt"
      docid= []
      doctxt=[]
      p = re.compile ('<docno>(.*?)</docno>', re.IGNORECASE)
      m= re.compile ('<p>(.*?)</p>', re.IGNORECASE)
      for aline in atext:
        values=str(aline)
        if p.findall(values):
          docid.append(p.findall(values))
          if m.findall(values):
            docid.append(p.findall(values))
      print docid
      atext.close()

    parsetrec ('la010189.txt')

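The p regex pulled the docno as it is supposed to. The m regex, though, does not pull any data and prints an empty list. I'm pretty sure there are white spaces and new lines involved. I tried re.M and it did not pull data from the other lines. Ideally I would like to get to the point where I store the results in a dictionary {docno, count}. The count would be determined by summing every word in the p tags and in a list []. I would appreciate any suggestions or advice.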

You can try removing the line breaks from the file if you think they are impacting your regex results. Also, make sure you don't have nested <p> tags, because your regex may not match as expected. For example:
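As a rough sketch of why the line breaks matter: the sample string below is made up, and it assumes each <P> block spans several lines, as the question describes. re.M only changes how "^" and "$" behave, whereas re.DOTALL is the flag that lets "." match a newline, so the pattern has to see the whole text at once rather than one line at a time.

```python
import re

# Hypothetical sample mimicking a multi-line <P> block; not taken from the real file.
sample = "<DOCNO> LA010189-0001 </DOCNO>\n<P>\nfirst line\nsecond line\n</P>\n"

p_tag = re.compile(r"<p>(.*?)</p>", re.IGNORECASE)
p_tag_dotall = re.compile(r"<p>(.*?)</p>", re.IGNORECASE | re.DOTALL)

# Line by line: no single line holds both <P> and </P>, so nothing matches.
per_line = [hit for line in sample.splitlines() for hit in p_tag.findall(line)]

# Whole text with DOTALL: "." also matches "\n", so the block is captured.
whole_text = p_tag_dotall.findall(sample)
```

Here per_line stays empty while whole_text holds the paragraph body, which would be one alternative to stripping the line breaks from the file.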

    <p>
      <p>
        <p>here's data</p>
        , more data.
      </p>
      , more data.
    </p>

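will capture only this section, because the "?" makes the match non-greedy, so it stops at the first </p> it finds (and it matches at all only when "." is allowed to cross line breaks, e.g. with re.DOTALL):

      <p>
        <p>here's data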
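A quick way to check what the pattern really captures on nested tags is to run it on the example itself. This is only a sketch: the nested string is the example above retyped with line breaks, and re.DOTALL is added as an assumption so that "." can cross them.

```python
import re

# The nested example from above, written out with explicit line breaks.
nested = (
    "<p>\n"
    "  <p>\n"
    "    <p>here's data</p>\n"
    "    , more data.\n"
    "  </p>\n"
    "  , more data.\n"
    "</p>"
)

pattern = re.compile(r"<p>(.*?)</p>", re.IGNORECASE | re.DOTALL)

# The lazy group starts after the outermost <p> and ends at the first </p>.
captured = pattern.findall(nested)
```

There is only one match, and it ends at the innermost closing tag, so the text after that tag is never captured.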

Also, there is a typo:

    if p.findall(values):
      docid.append(p.findall(values))
      if m.findall(values):
        docid.append(p.findall(values))

Shouldn't that be:

    docid.append(m.findall(values))

on the last line?
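For the {docno, count} dictionary you mention, here is a minimal sketch rather than a definitive implementation. It assumes every document is wrapped in <DOC> ... </DOC> and that <P> tags are not nested; the function name count_words_per_doc, the sample text, and its docno values are all made up for illustration.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
DOC_RE = re.compile(r"<doc>(.*?)</doc>", FLAGS)      # assumed document wrapper
DOCNO_RE = re.compile(r"<docno>(.*?)</docno>", FLAGS)
P_RE = re.compile(r"<p>(.*?)</p>", FLAGS)


def count_words_per_doc(text):
    """Map each docno to the number of whitespace-separated words in its <p> tags."""
    counts = {}
    for doc in DOC_RE.findall(text):
        docno = DOCNO_RE.search(doc)
        if docno is None:
            continue  # skip a document that has no docno
        words = sum(len(para.split()) for para in P_RE.findall(doc))
        counts[docno.group(1).strip()] = words
    return counts


# Hypothetical two-document sample, not taken from the real collection.
sample = (
    "<DOC>\n<DOCNO> LA010189-0001 </DOCNO>\n<TEXT>\n"
    "<P>\none two three\n</P>\n<P>\nfour five\n</P>\n</TEXT>\n</DOC>\n"
    "<DOC>\n<DOCNO> LA010189-0002 </DOCNO>\n<TEXT>\n"
    "<P>\nsix\n</P>\n</TEXT>\n</DOC>\n"
)
counts = count_words_per_doc(sample)
```

For a real file you would read the whole contents first, e.g. with open('la010189.txt').read(), and pass that string in instead of looping over lines.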

