I am trying to get a program to work that parses HTML tags; it's a TREC collection. I don't program often, except for databases, and I keep getting stuck on syntax. Here's my current code:
    import re

    # The following code (the p regex) worked in Python
    def parsetrec(atext):
        atext = open(atext, "r")
        filepath = "testla.txt"
        docid = []
        doctxt = []
        p = re.compile('<docno>(.*?)</docno>', re.IGNORECASE)
        m = re.compile('<p>(.*?)</p>', re.IGNORECASE)
        for aline in atext:
            values = str(aline)
            if p.findall(values):
                docid.append(p.findall(values))
                if m.findall(values):
                    docid.append(p.findall(values))
        print docid
        atext.close()

    parsetrec('la010189.txt')

The p regex pulls the docno as it is supposed to. The m regex, though, does not pull any data and prints an empty list. I'm pretty sure there are white spaces and new lines inside the tags. I tried re.M and it did not pull data from the other lines. Ideally I would like to get to the point where I store the results in a dictionary {docno: count}, where the count is determined by summing every word that is in the p tags and also in a list []. I'd appreciate any suggestions or advice.
You can try removing the line breaks from the file if you think they are impacting your regex results. Also, make sure you don't have nested <p> tags, because your regex may not match as expected. For example:
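    <p>
      <p>
        <p>here's data</p>
        And more data.
      </p>
      And more data.
    </p>

Because of the "?", the match is non-greedy and stops at the first </p>. If "." is allowed to cross line breaks (re.DOTALL), the group captures only this section and leaves the rest unmatched:

      <p>
        <p>here's data

Also, there is a typo: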
    if p.findall(values):
        docid.append(p.findall(values))
        if m.findall(values):
            docid.append(p.findall(values))

Shouldn't the last line be:

    docid.append(m.findall(values))
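For the multi-line case, one option is to read the whole file into a single string and compile with re.DOTALL, which lets "." match newlines. re.M only changes how "^" and "$" behave, which is probably why it made no difference. Below is a minimal sketch of that in Python 3 that also builds the {docno: count} dictionary from the question. It assumes each document is wrapped in <DOC>...</DOC> and that <P> tags are not nested. The sample text, the IDs in it, and the names count_words and wanted are made up for illustration.

```python
import re

# Assumed layout: each document sits in <DOC>...</DOC>, with one <DOCNO>
# and any number of <P> blocks that may span several lines.
DOC_RE = re.compile(r"<doc>(.*?)</doc>", re.IGNORECASE | re.DOTALL)
DOCNO_RE = re.compile(r"<docno>(.*?)</docno>", re.IGNORECASE | re.DOTALL)
P_RE = re.compile(r"<p>(.*?)</p>", re.IGNORECASE | re.DOTALL)
WORD_RE = re.compile(r"[a-z0-9']+")


def count_words(text, wanted):
    """Return {docno: number of words inside <p> blocks that are in wanted}."""
    wanted = {w.lower() for w in wanted}
    counts = {}
    for doc in DOC_RE.findall(text):
        docno = DOCNO_RE.search(doc)
        if docno is None:
            continue  # skip documents without an identifier
        total = 0
        for para in P_RE.findall(doc):
            words = WORD_RE.findall(para.lower())
            total += sum(1 for w in words if w in wanted)
        counts[docno.group(1).strip()] = total
    return counts


# Hypothetical sample data, not taken from the real collection.
sample = """<DOC>
<DOCNO> LA010189-0001 </DOCNO>
<P>
The quick brown fox
jumps over the lazy dog.
</P>
<P>
The end.
</P>
</DOC>
<DOC>
<DOCNO> LA010189-0002 </DOCNO>
<P>
No matching words here.
</P>
</DOC>
"""

# Line-by-line matching finds nothing: no single line holds both tags.
per_line = [m for line in sample.splitlines() for m in P_RE.findall(line)]

result = count_words(sample, ["the", "fox"])
print(per_line)
print(result)
```

On the real file you would pass the result of open('la010189.txt').read() instead of sample. If the collection does turn out to contain nested tags, a regex like this will cut matches short, and a proper parser would be the safer choice.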