I'm trying to get a program to work that parses HTML-style tags; it's a TREC collection. I don't program often, except for databases, and I'm getting stuck on syntax. Here's my current code:
    import re

    #following code-re p worked in python
    def parsetrec (atext):
        atext=open(atext, "r")
        filepath= "testla.txt"
        docid= []
        doctxt=[]
        p = re.compile ('<docno>(.*?)</docno>', re.IGNORECASE)
        m= re.compile ('<p>(.*?)</p>', re.IGNORECASE)
        for aline in atext:
            values=str(aline)
            if p.findall(values):
                docid.append(p.findall(values))
            if m.findall(values):
                docid.append(p.findall(values))
        print docid
        atext.close()

    parsetrec ('la010189.txt')
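The p regex pulled the docno as it's supposed to. The m regex, though, does not pull any data and prints an empty list. I'm pretty sure it's because there are white spaces and new lines. I tried re.M, and it still did not pull data from the other lines. Ideally I'd like to get to the point of storing this in a dictionary {docno: count}. The count would be determined by summing every word that is in the p tags and also in a list []. I'd appreciate any suggestions or advice.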
You can try removing the line breaks from the file if you think they are impacting your regex results. Also, make sure you don't have nested <p> tags, because your regex may not match as expected. For example:
<p> <p> <p>here's data</p> and more data. </p> and more data. </p>
will match only this section, because the "?" makes the match lazy and it stops at the first </p>:
<p> <p> <p>here's data</p>
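To see exactly where a lazy match stops on nested tags, a small check like the one below can help. It is only a sketch in Python 3: the pattern is the one from the question (with the flag spelled re.IGNORECASE), and the sample string and variable names are made up for illustration.

```python
import re

# Same pattern as in the question; "?" after "*" makes the quantifier lazy.
m = re.compile(r'<p>(.*?)</p>', re.IGNORECASE)

# Made-up nested sample, kept on one line so "." never has to cross a newline.
nested = "<p> <p> <p>here's data</p> and more data. </p> and more data. </p>"

first = m.search(nested)
full_match = first.group(0)  # everything the pattern consumed
captured = first.group(1)    # only the part inside the parentheses

print(repr(full_match))
print(repr(captured))
```

The match begins at the first <p> and ends at the first </p> it can reach, so the outer closing tags and the text between them are left out.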
Also, is this a typo:
    if p.findall(values):
        docid.append(p.findall(values))
    if m.findall(values):
        docid.append(p.findall(values))
Should it be:
    docid.append(m.findall(values))
on the last line?
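Putting these points together, here is one possible sketch of the {docno: count} idea, in Python 3. It rests on several assumptions: that each document is wrapped in <DOC>...</DOC>, that the <P> tags are not nested, and that the whole file fits in memory. The function name count_words, the sample text, the docno values, and the word list are all made up for illustration. Because a <P> block can span several lines, the sketch matches against the whole text with re.DOTALL (so that "." also matches newlines) instead of going line by line; re.M only changes how ^ and $ behave, which would explain why it made no difference.

```python
import re

# re.DOTALL lets "." match newlines, so a tag pair may span several lines.
FLAGS = re.IGNORECASE | re.DOTALL
DOC_RE = re.compile(r'<doc>(.*?)</doc>', FLAGS)        # assumed document wrapper
DOCNO_RE = re.compile(r'<docno>(.*?)</docno>', FLAGS)
P_RE = re.compile(r'<p>(.*?)</p>', FLAGS)


def count_words(text, wanted):
    """Return {docno: number of words inside <p> tags that are in wanted}."""
    wanted = {w.lower() for w in wanted}
    counts = {}
    for doc in DOC_RE.findall(text):
        docno_match = DOCNO_RE.search(doc)
        if docno_match is None:
            continue  # no docno found, so there is nothing to key the count on
        docno = docno_match.group(1).strip()
        total = 0
        for paragraph in P_RE.findall(doc):
            # Crude tokenizer: runs of letters, digits and apostrophes.
            words = re.findall(r"[a-z0-9']+", paragraph.lower())
            total += sum(1 for w in words if w in wanted)
        counts[docno] = total
    return counts


# Made-up sample in the assumed layout; not taken from the real collection.
sample = """<DOC>
<DOCNO> LA010189-0001 </DOCNO>
<TEXT>
<P>
The quick brown fox.
</P>
<P>
The lazy dog and the fox.
</P>
</TEXT>
</DOC>
<DOC>
<DOCNO> LA010189-0002 </DOCNO>
<TEXT>
<P>
No matching words here.
</P>
</TEXT>
</DOC>"""

result = count_words(sample, ["fox", "dog"])
print(result)
```

For the real file, the text would come from something like open('la010189.txt').read() in place of the sample string. If the <P> tags do turn out to be nested, a regex like this will miscount and an actual parser would probably be the safer choice.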