Reasons for garbage collection failure

Question

I’ve been struggling for some time with a crash in a C# application that also uses a fair number of C++/CLI modules, which are mostly wrappers around native libraries for accessing device drivers. The crash is not always easy to reproduce, but I was able to collect half a dozen crash dumps, which show that the program always crashes with an access violation during a garbage collection. Here are the native call stack and the last event:

    0:000> k
    ChildEBP RetAddr
    0012d754 79f95a8f mscorwks!WKS::gc_heap::find_first_object+0x62
    0012d7dc 79f933bb mscorwks!WKS::gc_heap::mark_through_cards_for_segments+0x493
    0012d814 79f92cbf mscorwks!WKS::gc_heap::mark_phase+0xc3
    0012d838 79f93245 mscorwks!WKS::gc_heap::gc1+0x62
    0012d84c 79f92f5a mscorwks!WKS::gc_heap::garbage_collect+0x253
    0012d878 79f94e26 mscorwks!WKS::GCHeap::GarbageCollectGeneration+0x1a9
    0012d904 79f926ce mscorwks!WKS::gc_heap::try_allocate_more_space+0x15b
    0012d918 79f92769 mscorwks!WKS::gc_heap::allocate_more_space+0x11
    0012d938 79e73291 mscorwks!WKS::GCHeap::Alloc+0x3b
    0:000> .lastevent
    Last event: 7e8.88: Access violation - code c0000005 (first/second chance not available)
      debugger time: Mon Sep 26 11:34:53.646 2011 (UTC + 2:00)

So let me ask my question first and give more details below. My question is: besides managed heap corruption, is there any other reason for a crash during garbage collection?

To elaborate a bit: the reason I'm asking is that I am having a very hard time finding the code that corrupts the managed heap, and I cannot find a pattern in the memory that is (presumably) being overwritten.

I have already tried commenting out all the “dangerous” C++/CLI code (especially the parts that use pinning pointers), but that didn't help. While trying to find a pattern in the overwritten memory, I looked at the disassembled code at the point of the crash:

    0:000> u .-a .+a
    mscorwks!WKS::gc_heap::find_first_object+0x54:
    79f935b9 89450c          mov     dword ptr [ebp+0Ch],eax
    79f935bc 8bd0            mov     edx,eax
    79f935be 8b02            mov     eax,dword ptr [edx]
    79f935c0 83e0fe          and     eax,0FFFFFFFEh
    79f935c3 f70000000080    test    dword ptr [eax],80000000h   <<<<CRASH
    79f935c9 0f84b1000000    je      mscorwks!WKS::gc_heap::find_first_object+0x73
    0:000> r
    eax=00000000 ebx=01c81000 ecx=01c80454 edx=01c82fe0 esi=012f0000 edi=000027e1
    eip=79f935c3 esp=0012d738 ebp=0012d754 iopl=0 nv up ei pl zr na pe nc
    cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
    mscorwks!WKS::gc_heap::find_first_object+0x62:
    79f935c3 f70000000080    test    dword ptr [eax],80000000h ds:0023:00000000=????????

The crash occurs when the code tries to dereference the EAX register, which is null. From what I can see, EAX was loaded from the memory pointed to by EDX, so I looked at the address stored there:

    0:000> dd @edx-10
    01c82fd0  06542778 00000000 00000000 01c82494
    01c82fe0  00000000 00000000 00000000 00000000
    01c82ff0  01b641d0 00000000 00000000 01c82380
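As a cross-check of the register dump, the following minimal C++ sketch replays the two instructions that precede the faulting `test`, using the values from the `dd` output above. The names `DumpWindow`, `read_dword`, `faulting_read_address` and `first_dump` are hypothetical helpers introduced only for this illustration; nothing here is CLR code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A window of memory as printed by `dd`: consecutive 32-bit values from `base`.
struct DumpWindow {
    std::uint32_t base;
    std::vector<std::uint32_t> dwords;
};

// Returns the dword stored at `address`, which must lie inside the window
// and be 4-byte aligned relative to the window base.
std::uint32_t read_dword(const DumpWindow& w, std::uint32_t address) {
    assert(address >= w.base && (address - w.base) % 4 == 0);
    const std::size_t index = (address - w.base) / 4;
    assert(index < w.dwords.size());
    return w.dwords[index];
}

// Replays the two instructions before the faulting one and returns the
// address that `test dword ptr [eax],80000000h` would then read from.
std::uint32_t faulting_read_address(const DumpWindow& w, std::uint32_t edx) {
    std::uint32_t eax = read_dword(w, edx);  // mov eax,dword ptr [edx]
    eax &= 0xFFFFFFFEu;                      // and eax,0FFFFFFFEh (clears bit 0)
    return eax;
}

// Values copied from the first dump in the question (dd @edx-10, edx=01c82fe0).
DumpWindow first_dump() {
    return DumpWindow{0x01c82fd0u,
                      {0x06542778u, 0, 0, 0x01c82494u,
                       0, 0, 0, 0,
                       0x01b641d0u, 0, 0, 0x01c82380u}};
}
```

Calling `faulting_read_address(first_dump(), 0x01c82fe0u)` yields address 0, which matches `eax=00000000` and the `ds:0023:00000000=????????` operand in the register dump. In the CLR object layout the first pointer-sized field of an object is normally its MethodTable pointer, so a zero there would be consistent with a zeroed or overwritten object header; whether EDX really points at an object header at this instruction is an assumption, not something the dump proves.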

EDIT: I now see that my analysis was wrong; I had misunderstood the x86 addressing modes.

So I see that, starting at address 01c82fe0 (the value stored in EDX), the next 16 bytes are zero. But looking at another, similar crash dump, I see the following:

    0:000> dd @edx-10
    018defd4  00000000 00000000 00000000 00000000
    018defe4  00000000 00000000 018b468c 01742354
    018deff4  00e0907f 00000000 00000000 00000000

So here the 16 bytes before the address pointed to by EDX are zero, as are the 8 bytes starting at it. The same kind of thing shows up in the other dumps I have, and I don't see a pattern: it does not look as if some piece of code simply overwrites this memory area.
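The comparison between dumps can be made mechanical. The sketch below, in standard C++, measures the run of zero dwords immediately before and starting at the EDX address in each `dd` window shown above. `ZeroRun`, `zero_run_around`, `kDump1` and `kDump2` are hypothetical names for this illustration, and the counts have dword (4-byte) granularity and are limited by the size of the window.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ZeroRun {
    std::size_t bytes_before;  // zero bytes immediately below the EDX address
    std::size_t bytes_from;    // zero bytes starting at the EDX address
};

// `dwords` is a dd window; `edx_index` is the position of the dword at EDX.
ZeroRun zero_run_around(const std::vector<std::uint32_t>& dwords,
                        std::size_t edx_index) {
    assert(edx_index < dwords.size());
    std::size_t before = 0;
    for (std::size_t i = edx_index; i > 0 && dwords[i - 1] == 0; --i) {
        before += 4;  // walk downwards while the preceding dword is zero
    }
    std::size_t from = 0;
    for (std::size_t i = edx_index; i < dwords.size() && dwords[i] == 0; ++i) {
        from += 4;    // walk upwards while the current dword is zero
    }
    return ZeroRun{before, from};
}

// Both windows were taken with `dd @edx-10`, so EDX is at index 4 in each.
const std::vector<std::uint32_t> kDump1 = {
    0x06542778u, 0, 0, 0x01c82494u,
    0, 0, 0, 0,
    0x01b641d0u, 0, 0, 0x01c82380u};

const std::vector<std::uint32_t> kDump2 = {
    0, 0, 0, 0,
    0, 0, 0x018b468cu, 0x01742354u,
    0x00e0907fu, 0, 0, 0};
```

For the first dump `zero_run_around(kDump1, 4)` gives 0 bytes before and 16 bytes from EDX; for the second, `zero_run_around(kDump2, 4)` gives 16 bytes before (the whole visible window below EDX) and 8 bytes from EDX. Running the same measurement over all available dumps is one way to check whether the zeroed region has a consistent size or offset.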

Coming back to the question: I would like to know whether there is any explanation for the crash other than some piece of code overwriting memory that it should not. Any advice on how to proceed would also be welcome; I am really lost here.

(Could pinned handles be causing problems? We have a lot of them, and I find it odd that I always see exactly 137 pinned handles, no more and no less, with !gchandles at the point of the crash. That seems a strange coincidence to me.)

EDIT: I forgot to mention that we are using .NET Framework 3.5. I have seen reports of similar crashes in .NET 4 when background GC is active (somewhere it is mentioned that this is a bug in .NET), but I do not think that is relevant here, since AFAIK there is no background GC in .NET 3.5.

+6

debugging c# windbg

floyd73 Sep 28 '11 at 8:47

3 answers

Not sure if this helps, but as a general rule, don't rely on destructors or on the GC to handle unmanaged memory. Instead, use the Dispose pattern and move all of the cleanup code from the destructor into the finalizer:

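    ref class MyClass
    {
        UnsafeObject data;

    public:
        MyClass()
        {
            data = CreateUnsafeDataObject();
        }

        // Finalizer: releases the unmanaged data.
        !MyClass()
        {
            DeleteUnsafeDataObject(data);
        }

        // Destructor: compiled to IDisposable::Dispose(); forwards to the finalizer.
        ~MyClass()
        {
            this->!MyClass();
        }
    };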

This implements the IDisposable pattern for the object. Call Dispose to release the unmanaged data; in the worst case, you will at least have a better chance of figuring out exactly what is happening.

+2

Geirgrusom Sep 28 '11 at 9:10

You probably have an exception in one of your finalizers. I believe you need to check them one by one, because there is no room for errors on the finalization queue. If you don't have unmanaged code, it's best not to have a finalizer at all; just call Dispose manually.

0

Sergei B. Sep 28 '11 at 9:23

floyd73 · Accepted Answer · 2011-10-05T09:26:16+0000

So, unfortunately, my question was a little misleading: I was looking for alternative explanations other than managed heap corruption, and that is exactly what the problem turned out to be in the end (caused by an unsafe copy of an unpinned managed structure). The problem is now solved, and I am posting my findings here as a separate answer; I hope that is OK.
