Collecting data re NMI Parity Error / Hard Lock with t60p + v5200
I have been having reproducible hard locks while playing World of Warcraft
(and BF2142, though not as reproducibly).
From reading around the forums it seems other people are having similar problems.
I thought I'd start this post so we can collect data about this problem, and
hopefully get it escalated through IBM/Lenovo tech support, or ATI's tech support, so it actually gets dealt with.
I've been unable to contact a tech support person who knows anything solid about this problem.
While I know this is a CAD machine, I find it unacceptable that a bug reproducibly causes hard crashes
during heavy 3D load on a $5k (in my case) laptop, and that the issue has been left unsolved.
I have a T60p, 2007-94U: 2.16GHz, 2x 1GB IBM-shipped memory, V5200.
The only notable change from the stock 2007-94U is
that I swapped out the Intel 802.11 card for an IBM Atheros one.
I'm running Windows XP Pro SP2, with everything patched and current: current BIOS, current ThinkPad drivers for everything, and so on.
I have tried various versions of the drivers for the V5200, all of which have had the same result.
1: IBM's official drivers
2: Omega drivers
3: Catalyst Mobility 6.9, hacked INF to force load
4: Catalyst non-mobility 6.9, hacked INF to force load
5: ATI's FireGL drivers
The problem exists for all of them as best as I can tell.
All diagnostics have come up with no problems, memtest86 shows no problems, and I have noticed no instability
along these lines in any case other than under high 3D load.
I generally run the machine in an advanced mini-dock, with a dell 24" on DVI as the primary monitor
and the laptop display as the secondary.
Resolutions are 1920x1200x32, 1600x1200x32.
I can reproduce the crash by logging into WoW somewhere busy (like Orgrimmar)
and waiting. It generally happens anywhere from 30 seconds to 10 minutes after login.
Almost all the time there is slight, thin white horizontal static across BOTH the monitors before a crash.
The static is very thin and slight, maybe 2 or 3 discernible rows, and it looks like it's 'chasing'
up the screen sometimes, like an old TV's vhold being off.
I have noticed a few failure modes:
1: What appears to be a VPU reset. This is the rarest. The screen will blank out and come back, and WoW will continue to run.
2: Screen blanks, or the image freezes on screen, and sound loops. This requires holding down the power button to restart the machine. Windows seems unaware that a crash occurred (no warnings, etc. on the next reboot). This is the most common.
3: Blue NMI: Parity Error screen
At first I thought that this was the fault of the docking station, since it seems to happen less often if the machine is
not docked. Eventually I was able to reproduce the same problem while not in a docking station. I was also able to reproduce it
driving a monitor via the VGA port while not docked, and with just the laptop screen alone, in mirrored or external-monitor-only
configurations as well.
Using tpfancontrol, the GPU temperature is reported at a peak of around 108C, with an idle of around 75-80C.
I'm fairly certain that tpfancontrol reports this temperature incorrectly, either in that it's not Celsius, or that it may
be reported in half-degree-Celsius units or something, because I'm fairly certain that 108C is far outside of the
thermal limits of the V5200. I am also ruling out thermal issues because the specific pre-warning symptoms
at crash differ from general overheating failures.
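For reference, here is what the tpfancontrol numbers above would work out to if the half-degree guess were right. This is only arithmetic on the readings quoted in this post. The half-degree scale is an unverified assumption, and the helper name is made up for illustration; it is not anything tpfancontrol provides.

```python
# Hypothetical helper: treat a raw reading as half-degree Celsius units.
# ASSUMPTION: the half-degree scale is a guess, not confirmed tpfancontrol behavior.
def half_degree_to_celsius(raw):
    return raw / 2.0

# Readings quoted above (idle range and peak)
readings = {"idle low": 75, "idle high": 80, "peak": 108}

for label, raw in readings.items():
    converted = half_degree_to_celsius(raw)
    print(f"{label}: raw {raw} -> {converted:.1f} C if half-degree units")
```

Under that assumption the peak would be 54C and idle 37.5-40C. Whether those values are plausible for this GPU is exactly what would need checking against another temperature source.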
I have removed and reseated the heatsink to no effect. Thermal pads for the chipset and GPU are intact. Everyone should note
that you CANNOT replace the thermal pads with thermal grease, the gap between the GPU and heatsink is designed to be filled
with the thermal pad. Thermal grease properly applied to the GPU will NOT COME IN CONTACT with the heatsink. This may be
part/build specific, but I'm pretty sure this is across the board for all t60p's.
The problem DOES NOT seem to occur if I remove one of my 1GB sticks. I have tried various orderings of the sticks, and replaced
one with a replacement from IBM, to no avail. So far the only thing that consistently works is removing one stick.
A friend is coming over with a pile of compatible memory today, and I'm going to try various configurations to try and
solidify a few theories.
1) 1gb + 512mb (attempt to force the memory out of dual-channel mode)
2) 512mb + 512mb (see if it's reproducible in dual-channel mode with <2gb)
3) 1gb + 1gb (of friend's memory, to double-check it's not a general memory problem)
I want to try and test 1gb + 2gb, however 2gb sticks are out of my reach currently.
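One way to keep the results of these memory swaps organized, so they can be posted back here in a comparable form, is a small log like the sketch below. The class and function names are hypothetical, and the "matched pair" flag only records that both sticks are the same size; it assumes, without verifying, that a matched pair is what puts the machine in dual-channel mode.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryTest:
    sticks_mb: tuple   # sizes of the installed sticks, in MB
    purpose: str       # what this configuration is meant to show
    crashed: list = field(default_factory=list)  # one bool per test run

    @property
    def matched_pair(self):
        # ASSUMPTION: two equal-size sticks stand in for "dual-channel mode"
        return len(self.sticks_mb) == 2 and self.sticks_mb[0] == self.sticks_mb[1]

    def record(self, crashed):
        self.crashed.append(bool(crashed))

    def crash_rate(self):
        # None means the configuration has not been tested yet
        return sum(self.crashed) / len(self.crashed) if self.crashed else None

# The three configurations listed above
plan = [
    MemoryTest((1024, 512), "force memory out of dual-channel mode"),
    MemoryTest((512, 512), "dual-channel with less than 2GB total"),
    MemoryTest((1024, 1024), "friend's memory, rule out a bad stick"),
]

def summarize(plan):
    lines = []
    for t in plan:
        rate = t.crash_rate()
        result = "not run" if rate is None else f"{rate:.0%} of runs crashed"
        mode = "matched pair" if t.matched_pair else "mismatched"
        sizes = "+".join(str(s) for s in t.sticks_mb)
        lines.append(f"{sizes} MB ({mode}): {result}")
    return lines
```

Calling `plan[0].record(True)` after each WoW session that locks up (and `record(False)` after a clean one), then printing `summarize(plan)`, gives a per-configuration crash rate rather than a single yes/no impression.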
The theory I'm leaning towards is that there is an incompatibility between ATI's HyperMemory implementation and the ICH7m chipset
running in dual-channel mode. Unfortunately there is no way to disable HyperMemory (verified with someone familiar with ATI
hardware) to directly test this, so I'm trying to alter system memory parameters to find a stable situation.
If I can narrow this down to a dual-channel-specific problem, I may try to talk IBM into sending me a single
2GB SODIMM to replace my current memory. While I have no idea if they'll bite on that, it seems like it would be the
most direct solution if my theory is correct. Given the time and energy I've put into solving what seems
to be their problem, the time on the phone with their techs, and the generally high cost of the machine, I think there should
be some way that I can meet my desired specs without it reproducibly failing.
So I'd like to get information from anyone else who is having similar problems, and details of their configurations, and any
observations that match or contradict mine on this issue. Especially useful would be any methods besides WoW to reliably
reproduce this problem.
If people have ibm case numbers regarding this situation, we may be able to round up enough of them to convince ibm/lenovo
that there really is an issue besides 'bad memory' or other arbitrary problems.
Thanks very much for (hopefully) reading my tl;dr post, I hope that we can find some kind of solution to this problem.
If anyone would prefer to discuss this personally, feel free to PM me.
My dad has a T43 that had this problem. The solution: after the shop replaced the motherboard, etc., they reinstalled the OS and immediately the RAM showed up as faulty. I see you are going to try the RAM today anyway... good luck, I hope that is the problem.