How to omit duplicates in pyparsing?

Question

Well, I finally got my grammar to capture all my test cases, but I have a duplicate (case 3) and a false positive (case 6, "PATTERN 5"). Here are my test cases and my desired result.

I'm still pretty new to Python (although I can teach my children! Scary!), so I'm sure there are obvious ways to solve this problem; I'm not even sure this is a pyparsing problem. Here is what my output looks like:

    ['01/01/01','S01-12345','20/111-22-1001',['GLEASON', ['5', '+', '4'], '=', '9']]
    ['02/02/02','S02-1234','20/111-22-1002',['GLEASON', 'SCORE', ':', ['3', '+', '3'], '=', '6']]
    ['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'GRADE', ['4', '+', '3'], '=', '7']]
    ['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'SCORE', ':', '7', '=', ['4', '+', '3']]]
    ['04/17/04','S04-123','30/111-22-1004',['GLEASON', 'SCORE', ':', ['3', '+', '4', '-', '7']]]
    ['05/28/05','S05-1234','20/111-22-1005',['GLEASON', 'SCORE', '7', '[', ['3', '+', '4'], ']']]
    ['06/18/06','S06-10686','20/111-22-1006',['GLEASON', ['4', '+', '3']]]
    ['06/18/06','S06-10686','20/111-22-1006',['GLEASON', 'PATTERN', '5']]
    ['07/22/07','S07-2749','20/111-22-1007',['GLEASON', 'SCORE', '6', '(', ['3', '+', '3'], ')']]

Here is the grammar:

    num = Word(nums)
    arith_expr = operatorPrecedence(num,
        [
        (oneOf('-'), 1, opAssoc.RIGHT),
        (oneOf('* /'), 2, opAssoc.LEFT),
        (oneOf('+ -'), 2, opAssoc.LEFT),
        ])
    accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
    accessionNumber = Combine("S" + num + "-" + num)("accNum")
    patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
    score = (Optional(oneOf('( [')) +
             arith_expr('lhs') +
             Optional(oneOf(') ]')) +
             Optional(oneOf('= -')) +
             Optional(oneOf('( [')) +
             Optional(arith_expr('rhs')) +
             Optional(oneOf(') ]')))
    gleason = Group("GLEASON" + Optional("SCORE") + Optional("GRADE") + Optional("PATTERN") + Optional(":") + score)
    patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
    partMatch = patientData("patientData") | gleason("gleason")

and the output function:

    lastPatientData = None
    for match in partMatch.searchString(TEXT):
        if match.patientData:
            lastPatientData = match
        elif match.gleason:
            if lastPatientData is None:
                print "bad!"
                continue
            # getParts()
            FOUT.write( "['{0.accDate}','{0.accNum}','{0.patientNum}',{1}]\n".format(lastPatientData.patientData, match.gleason))

As you can see, the output is not as good as it looks: I'm just writing to a file and faking some syntax. I have struggled with how to get at the intermediate pyparsing results so that I can work with them. Should I just write this out and run a second script that finds the duplicates?

Update based on Paul McGuire's answer. The function below gets me one row per record, but now I lose part of the score. Each Gleason score, conceptually, has the form primary + secondary = total. This is headed for a database, so pri, sec, tot are separate PostgreSQL columns, or, for the parser output, comma-separated values.

    accumPatientData = None
    for match in partMatch.searchString(TEXT):
        if match.patientData:
            if accumPatientData is not None:
                #this is a new patient data, print out the accumulated
                #Gleason scores for the previous one
                writeOut(accumPatientData)
            accumPatientData = (match.patientData, [])
        elif match.gleason:
            accumPatientData[1].append(match.gleason)
    if accumPatientData is not None:
        writeOut(accumPatientData)

So now the result is as follows:

    01/01/01,S01-12345,20/111-22-1001,9
    02/02/02,S02-1234,20/111-22-1002,6
    03/02/03,S03-1234,31/111-22-1003,7,4+3
    04/17/04,S04-123,30/111-22-1004,
    05/28/05,S05-1234,20/111-22-1005,3+4
    06/18/06,S06-10686,20/111-22-1006,,
    07/22/07,S07-2749,20/111-22-1007,3+3

I would like to go back in there and grab some of the lost elements, rearrange them, work out which ones are missing, and fill them in. Something like this pseudo code:

    def diceGleason(glrhs,gllhs)
        if glrhs.len() == 0:
            pri = gllhs[0]
            sec = gllhs[2]
            tot = pri + sec
            return [pri, sec, tot]
        elif glrhs.len() == 1:
            pri = gllhs[0]
            sec = gllhs[2]
            tot = glrhs
            return [pri, sec, tot]
        else:
            pri = glrhs[0]
            sec = glrhs[2]
            tot = gllhs
            return [pri, sec, tot]

Update 2: Ok, Paul is awesome, but I'm dumb. Having tried exactly what he said, I tried several ways to acquire pri, sec, and tot, but I failed. I get an error message:

    Traceback (most recent call last):
      File "Stage1.py", line 81, in <module>
        writeOut(accumPatientData)
      File "Stage1.py", line 47, in writeOut
        FOUT.write( "{0.accDate},{0.accNum},{0.patientNum},{1.pri},{1.sec},{1.tot}\n".format( pd, gleasonList))
    AttributeError: 'list' object has no attribute 'pri'

These AttributeErrors are what I keep getting. Clearly I do not understand what is being passed around (Paul, I have the book, I swear it is open in front of me, and I still do not understand). Here is my script. Is something in the wrong place? Am I accessing the results wrong?

python parsing pyparsing

Niels Aug 27 '13 at 20:40

1 answer

Paulmcg · Accepted Answer · 2013-08-28T02:30:58+0000

I have not made any changes to your parser, but made a few changes to your code after parsing.

You don't actually get "duplicates." The problem is that you print out the current patient data every time you see a Gleason score, and some of your patient data records include several Gleason score entries. If I understand what you are trying to do, here is the pseudo code I would follow:

    accumulator = None
    foreach match in (patientDataExpr | gleasonScoreExpr).searchString(source):
        if it's a patientDataExpr:
            if accumulator is not None:
                # we are starting a new patient data record, print out the previous one
                print out accumulated data
            initialize new accumulator with current match and empty list for gleason data
        else if it's a gleasonScoreExpr:
            add this expression into the current accumulator
    # done with the for loop, do one last printout of the accumulated data
    if accumulator is not None:
        print out accumulated data

This can be easily converted to Python:

    def printOut(patientDataTuple):
        pd,gleasonList = patientDataTuple
        print( "['{0.accDate}','{0.accNum}','{0.patientNum}',{1}]".format(
            pd, ','.join(''.join(gl.rhs) for gl in gleasonList)))

    accumPatientData = None
    for match in partMatch.searchString(TEXT):
        if match.patientData:
            if accumPatientData is not None:
                # this is a new patient data, print out the accumulated
                # Gleason scores for the previous one
                printOut(accumPatientData)
            # start accumulating for a new patient data entry
            accumPatientData = (match.patientData, [])
        elif match.gleason:
            accumPatientData[1].append(match.gleason)
        #~ print match.dump()

    if accumPatientData is not None:
        printOut(accumPatientData)

I don't think I am dumping the Gleason data correctly, but I think you can adjust it from here.
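The accumulate-then-flush pattern above does not depend on pyparsing itself. Here is a minimal sketch of it, assuming the matches have already been reduced to hypothetical `(kind, value)` tuples; the names `group_scores`, `"patient"` and `"gleason"` are illustrative and not part of the original script:

```python
def group_scores(matches):
    """Group each run of "gleason" items under the preceding "patient" item."""
    records = []
    current = None  # (patient, [scores]) currently being accumulated
    for kind, value in matches:
        if kind == "patient":
            if current is not None:
                records.append(current)  # flush the previous patient
            current = (value, [])
        elif kind == "gleason" and current is not None:
            # a score seen before any patient record is skipped here
            current[1].append(value)
    if current is not None:
        records.append(current)  # flush the last patient
    return records


# Stand-in data shaped like cases 2 and 3 from the question
matches = [
    ("patient", "S02-1234"),
    ("gleason", "3+3=6"),
    ("patient", "S03-1234"),
    ("gleason", "4+3=7"),
    ("gleason", "7=4+3"),
]
for patient, scores in group_scores(matches):
    print(patient, ",".join(scores))
```

Each patient comes out exactly once, with every score that followed it collected in one list, which is why case 3 no longer appears as a "duplicate" row.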

EDIT:

You can attach diceGleason as a parse action to gleason and get the following:

    def diceGleasonParseAction(tokens):
        def diceGleason(glrhs,gllhs):
            if len(glrhs) == 0:
                pri = gllhs[0]
                sec = gllhs[2]
                #~ tot = pri + sec
                tot = str(int(pri)+int(sec))
                return [pri, sec, tot]
            elif len(glrhs) == 1:
                pri = gllhs[0]
                sec = gllhs[2]
                tot = glrhs
                return [pri, sec, tot]
            else:
                pri = glrhs[0]
                sec = glrhs[2]
                tot = gllhs
                return [pri, sec, tot]

        pri,sec,tot = diceGleason(tokens.gleason.rhs, tokens.gleason.lhs)

        # assign results names for later use
        tokens.gleason['pri'] = pri
        tokens.gleason['sec'] = sec
        tokens.gleason['tot'] = tot

    gleason.setParseAction(diceGleasonParseAction)

You had only one typo, where you summed pri and sec to get tot: these are all strings, so you added "3" and "4" and got "34". Converting to ints before adding was all that was needed. Otherwise, I kept diceGleason verbatim inside diceGleasonParseAction, to isolate your logic for deriving pri, sec and tot from the mechanics of decorating the parsed tokens with new results names. Since the parse action does not return anything new, the tokens are updated in place and then carried along for later use in your output method.
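The string-versus-integer point can be checked in isolation. The sketch below also shows one way an output routine might read the `pri`, `sec` and `tot` names from each score rather than from the list as a whole. `SimpleNamespace` is only a stand-in for the parsed Gleason tokens, and `format_row` is a hypothetical helper, not the asker's `writeOut`:

```python
from types import SimpleNamespace

pri, sec = "3", "4"
concatenated = pri + sec          # "+" on two strings concatenates them
total = str(int(pri) + int(sec))  # add as numbers, then convert back to a string
print(concatenated, total)


def format_row(acc_num, gleason_list):
    # Format each score on its own: the list itself has no .pri attribute,
    # only the individual entries do.
    scores = ",".join("{0.pri},{0.sec},{0.tot}".format(gl) for gl in gleason_list)
    return "{0},{1}".format(acc_num, scores)


gl = SimpleNamespace(pri=pri, sec=sec, tot=total)  # stand-in for one parsed score
print(format_row("S04-123", [gl]))
```

Passing the whole list to a `{1.pri}` field is what triggers `AttributeError: 'list' object has no attribute 'pri'`; iterating over the entries avoids that lookup on the list.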
