Well, I finally got my grammar to capture all my test cases, but I have a duplicate (case 3) and a false positive (case 6, "PATTERN 5"). Here are my test cases and my desired result.
I'm still pretty new to Python (although I can teach my children! Scary!), so I'm sure there are obvious ways to solve this problem. I'm not even sure this is a pyparsing problem. This is what my output looks like:
    ['01/01/01','S01-12345','20/111-22-1001',['GLEASON', ['5', '+', '4'], '=', '9']]
    ['02/02/02','S02-1234','20/111-22-1002',['GLEASON', 'SCORE', ':', ['3', '+', '3'], '=', '6']]
    ['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'GRADE', ['4', '+', '3'], '=', '7']]
    ['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'SCORE', ':', '7', '=', ['4', '+', '3']]]
    ['04/17/04','S04-123','30/111-22-1004',['GLEASON', 'SCORE', ':', ['3', '+', '4', '-', '7']]]
    ['05/28/05','S05-1234','20/111-22-1005',['GLEASON', 'SCORE', '7', '[', ['3', '+', '4'], ']']]
    ['06/18/06','S06-10686','20/111-22-1006',['GLEASON', ['4', '+', '3']]]
    ['06/18/06','S06-10686','20/111-22-1006',['GLEASON', 'PATTERN', '5']]
    ['07/22/07','S07-2749','20/111-22-1007',['GLEASON', 'SCORE', '6', '(', ['3', '+', '3'], ')']]
Here is the grammar:

    from pyparsing import (Word, nums, oneOf, opAssoc, operatorPrecedence,
                           Combine, Optional, Group)

    num = Word(nums)
    arith_expr = operatorPrecedence(num, [
        (oneOf('-'), 1, opAssoc.RIGHT),
        (oneOf('* /'), 2, opAssoc.LEFT),
        (oneOf('+ -'), 2, opAssoc.LEFT),
    ])
    accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
    accessionNumber = Combine("S" + num + "-" + num)("accNum")
    patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
    score = (Optional(oneOf('( [')) +
             arith_expr('lhs') +
             Optional(oneOf(') ]')) +
             Optional(oneOf('= -')) +
             Optional(oneOf('( [')) +
             Optional(arith_expr('rhs')) +
             Optional(oneOf(') ]')))
    gleason = Group("GLEASON" + Optional("SCORE") + Optional("GRADE") +
                    Optional("PATTERN") + Optional(":") + score)
    patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
    partMatch = patientData("patientData") | gleason("gleason")
And the output function:

    lastPatientData = None
    for match in partMatch.searchString(TEXT):
        if match.patientData:
            lastPatientData = match
        elif match.gleason:
            if lastPatientData is None:
                print "bad!"
                continue
As you can see, the output isn't as good as it looks: I'm just writing to a file and faking some of the syntax. I've been struggling with how to get at the intermediate pyparsing results so that I can work with them. Should I just write this out and run a second script that finds the duplicates?
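To make the "second script" idea concrete, here is a minimal sketch that groups rows by their three identifying fields, so the duplicate for case 3 ends up as one record holding two Gleason matches. The rows are copied from the output above; `group_by_record` is a hypothetical helper, not part of pyparsing, and it assumes every row has exactly the four-element shape shown.

```python
from collections import OrderedDict

# Rows copied from the output above: date, accession number,
# patient number, then the nested Gleason token list.
rows = [
    ['03/02/03', 'S03-1234', '31/111-22-1003',
     ['GLEASON', 'GRADE', ['4', '+', '3'], '=', '7']],
    ['03/02/03', 'S03-1234', '31/111-22-1003',
     ['GLEASON', 'SCORE', ':', '7', '=', ['4', '+', '3']]],
    ['04/17/04', 'S04-123', '30/111-22-1004',
     ['GLEASON', 'SCORE', ':', ['3', '+', '4', '-', '7']]],
]

def group_by_record(rows):
    """Collect every Gleason token list under its (date, accNum, patientNum) key."""
    grouped = OrderedDict()  # keeps records in first-seen order
    for acc_date, acc_num, patient_num, gleason in rows:
        key = (acc_date, acc_num, patient_num)
        grouped.setdefault(key, []).append(gleason)
    return grouped

grouped = group_by_record(rows)
for key, gleasons in grouped.items():
    print("{0}: {1} Gleason match(es)".format(",".join(key), len(gleasons)))
```

This only collapses duplicates; deciding which of the grouped matches to keep (or how to merge them) is still a separate step.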
Update based on Paul McGuire's answer. The output of this function gets me one row per record, but now I lose part of the score. (Each Gleason score, conceptually, has the form primary + secondary = total. This is headed for a database, so pri, sec, and tot are separate PostgreSQL columns or, for the parser's output, comma-separated values.)
    accumPatientData = None
    for match in partMatch.searchString(TEXT):
        if match.patientData:
            if accumPatientData is not None:
So now the result is as follows:
    01/01/01,S01-12345,20/111-22-1001,9
    02/02/02,S02-1234,20/111-22-1002,6
    03/02/03,S03-1234,31/111-22-1003,7,4+3
    04/17/04,S04-123,30/111-22-1004,
    05/28/05,S05-1234,20/111-22-1005,3+4
    06/18/06,S06-10686,20/111-22-1006,,
    07/22/07,S07-2749,20/111-22-1007,3+3
I would like to go back in and grab some of those lost elements, rearrange them, work out the ones that are missing, and put them all back. Something like this pseudocode:
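    def diceGleason(glrhs, gllhs):
        # Assumes both arguments are flat lists of string tokens,
        # e.g. ['4', '+', '3'] or ['7']; glrhs may be empty.
        if len(glrhs) == 0:
            # Only "pri + sec" was given, so compute the total.
            pri = gllhs[0]
            sec = gllhs[2]
            tot = str(int(pri) + int(sec))
        elif len(glrhs) == 1:
            # "pri + sec = tot"
            pri = gllhs[0]
            sec = gllhs[2]
            tot = glrhs[0]
        else:
            # "tot = pri + sec"
            pri = glrhs[0]
            sec = glrhs[2]
            tot = gllhs[0]
        return [pri, sec, tot]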
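The same idea can be tried directly against the nested token lists from my first output, without touching the grammar. This is only a sketch under assumptions: `flatten` and `split_gleason` are hypothetical helpers, the first `+` is taken to sit between primary and secondary, and any other number in the match is trusted as the stated total (it is not checked against the sum).

```python
def flatten(tokens):
    """Yield the leaf strings of an arbitrarily nested token list."""
    for tok in tokens:
        if isinstance(tok, list):
            for leaf in flatten(tok):
                yield leaf
        else:
            yield tok

def split_gleason(tokens):
    """Return (pri, sec, tot) as ints, or None if there is no 'a + b' pair."""
    flat = list(flatten(tokens))
    if '+' not in flat:
        return None  # e.g. ['GLEASON', 'PATTERN', '5'], the false positive
    plus = flat.index('+')
    if plus == 0 or plus + 1 >= len(flat):
        return None  # '+' with nothing on one side
    left, right = flat[plus - 1], flat[plus + 1]
    if not (left.isdigit() and right.isdigit()):
        return None
    pri, sec = int(left), int(right)
    # Any other number in the match is taken to be the stated total.
    others = [int(t) for i, t in enumerate(flat)
              if t.isdigit() and i not in (plus - 1, plus + 1)]
    tot = others[0] if others else pri + sec
    return pri, sec, tot

print(split_gleason(['GLEASON', 'SCORE', ':', '7', '=', ['4', '+', '3']]))
print(split_gleason(['GLEASON', 'PATTERN', '5']))
```

Returning `None` for a match with no `+` gives the caller a way to drop the "PATTERN 5" row instead of writing empty columns.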
Update 2: OK, Paul is awesome, but I'm dumb. After trying exactly what he said, I tried several ways of getting at pri, sec, and tot, and failed every time. I get this error message:
    Traceback (most recent call last):
      File "Stage1.py", line 81, in <module>
        writeOut(accumPatientData)
      File "Stage1.py", line 47, in writeOut
        FOUT.write( "{0.accDate},{0.accNum},{0.patientNum},{1.pri},{1.sec},{1.tot}\n".format( pd, gleasonList))
    AttributeError: 'list' object has no attribute 'pri'
These AttributeErrors are what I keep getting. Clearly I don't understand what is being passed around in between (Paul, I have the book, I swear it is open in front of me, and I still don't get it). Here is my script. Is something in the wrong place? Am I calling the results wrong?
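For reference, the error itself can be reproduced without pyparsing. In a format string, `{1.pri}` means "look up the attribute `pri` on the second argument", so a plain Python list fails while any object that exposes `pri`, `sec`, and `tot` works. A minimal sketch, where `Patient`, `Gleason`, and `try_format` are hypothetical stand-ins and not the objects my script actually builds:

```python
from collections import namedtuple

# Hypothetical stand-ins for the parsed results (not pyparsing objects).
Patient = namedtuple("Patient", "accDate accNum patientNum")
Gleason = namedtuple("Gleason", "pri sec tot")

ROW = "{0.accDate},{0.accNum},{0.patientNum},{1.pri},{1.sec},{1.tot}\n"

pd = Patient("01/01/01", "S01-12345", "20/111-22-1001")

def try_format(gleason):
    """Format one output row, reporting an AttributeError instead of raising it."""
    try:
        return ROW.format(pd, gleason)
    except AttributeError as exc:
        return "AttributeError: {0}".format(exc)

as_list = ["5", "4", "9"]           # a plain list has no .pri attribute
as_record = Gleason("5", "4", "9")  # named fields, so {1.pri} resolves

print(try_format(as_list))
print(try_format(as_record))
```

So the question for my script becomes what type `gleasonList` really is at line 47: a list of matches, or a single object carrying those three names.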