Difference between revisions of "Repairing MCU1 Black Screen, Glitching, Rebooting"

From Unofficial Tesla Tech
 

Latest revision as of 15:54, 20 June 2020

Introduction

The Media Control Unit (MCU), the car's 'center of command', has an eMMC which is low quality and has started failing in cars 2012-2016. Three years ago I was the first to work out the procedure to replace this with a high-quality Swissbit eMMC and posted the only articles on how to put the original firmware for the car on this new chip. These articles enable anyone with basic mechanical skills and a well-developed knowledge of Linux to perform this replacement. A number of paid services are now doing this for $1,750 a pop -- Tesla will replace the MCU for $2,500, but the unit they install has the same Hynix chip and will invariably fail some day, and they will never give you root.

The MCU is the large center screen and associated electronics behind it. Assuming you're pre-2017 the CID card is what runs the MCU's OS. It's a daughtercard on the MCU mainboard, which is made by nVidia specifically for Tesla.

The nVidia Tegra 3 is the main processor for the MCU and it's on the CID daughtercard (the chip under the heatsink). Inside the T3 is a boot coprocessor in addition to the main processor. On reset this boot coprocessor initializes to the first address in the boot flash, which is the Spansion flash on the obverse side of the CID.

You'll find that this flash has much more capacity (512MB) than you'd normally expect for an embedded device, because this code is for much more than a skeleton boot. The coprocessor runs it, and reads the main OS from the Hynix eMMC flash chip on the front of the CID. This Tegra (Ubongo) Linux filesystem is written into RAM by the boot coprocessor.

Once that's complete the coprocessor chains to the main processor which boots the OS in RAM, and then mounts /home and /var on partitions in the eMMC. It makes sense that they'd want /home and /var non-volatile, for car-specific details, logs, etc. I strongly suspect that this boot coprocessor and Spansion chip then becomes the Gateway subsystem.

Now, code-signing is used, and it starts very early in the boot coprocessor's initialization, in what is effectively its BIOS. What this means is that if you manually downgrade the OS in the eMMC by writing with dd, you still have the boot code for the later version in the Spansion chip, which does not match. The T3 then fails to boot because of the code-signing failure, and you get a black screen. So the active firmware in the eMMC must match the version in the boot flash: do not unilaterally upgrade or downgrade the active partition's firmware. Ask me how I know...

Tesla has extensive OS logging going on, and over time this will inevitably wear out the (low-quality) Hynix chip. SD/MMC chips not specifically named 'endurance' are not able to take too much writing. This means that when blocks of an eMMC partition (/var) wear out, the T3 will fail to boot, and you get a black screen, glitches, or frequent reboots.

Approach

My solution was to have a phone repair shop unsolder my eMMC, put it in a special AllSocket chip carrier for 153 pad ball-grid array chips, dump the image and put this firmware image down on a high quality industrial-grade chip with far more capacity. Then with the AllSocket plugged into my laptop's SD/MMC slot, I rooted it myself. I put a Netgear Power-over-Ethernet switch between the Tesla IC ('Instrument Cluster') and MCU, which has an ethernet connection between them, and connected an Arduino Yun to the switch, for wifi and remote control fun. At some point I'll add lipstick dashcams all around and probably change the Yun for a more powerful nano.

If you have a black screen the good news is the damage may be limited to partition 3 (/var). It is imperative that the firmware release you run in your replacement eMMC matches the version of your Spansion boot chip, due to code-signing. You have that, in partition 1 or 2 or both. So do the following steps carefully. With tech, a miss is as good as a mile.

Procedure

Chip Carrier

EMMC BGA153-SD-Adapter.jpg

Order an Allsocket chip-carrier like this one.


Scaesare.png Update from scaesare

I purchased this instead, and it worked like a charm:

DS3000-USB3.0-B153-01.jpg

It also has some other features such as:

  • Write protect switch
  • Power and activity/status LEDs
  • Power switch (handy for re-detecting hardware for troublesome recovery)

It also doesn't require buying a quality USB-to-SD adapter if your computer doesn't already have one, helping offset the additional cost.

Ah, good alternative. It won't do though if you need to tap off the CID without unsoldering the eMMC. For that you need the first AllSocket above, with the capacitor mod.

New eMMC Chip

You'll need a new eMMC. SD/MMC is an interface standard for the industry so that in theory any eMMC is a drop-in replacement for one of the same formfactor. I've looked everywhere for an 'endurance'-type eMMC chip but haven't found one yet, so the next best thing is a high-quality chip. Whereas the stock Hynix chip is a low-grade Hynix H26M42002GMR 8GB, I've replaced mine with a SwissBit SFEM032GB1EA1TO 32GB. Moar storage is never a problem (with quality components). I bought mine at Mouser Electronics. The new chip will be balled from the factory so you can pop it in the AllSocket for programming (paying close attention to pin 1).

Unsolder the Old

Take a picture of the Hynix chip before they remove it so you can identify pin 1, which isn't marked on this sorry chip. It helps them if you remove the shroud and heatsink from the T3. No big deal. Have the Hynix eMMC chip removed and make sure they reball it. You need a phone repair shop that actually does rework (doesn't send it out), with an infrared soldering workstation. They'll know exactly what to do, but don't tell them it's for your car... they may get frightened. 'It's for your stereo' or something. Hopefully they have a stainless steel mask (153 pad .5mm pitch), no-clean ('NC') liquid flux, and solder balls, for reballing the pads, and hopefully they place pieces of metal over other components to protect them as they un/solder. If they don't know what you're talking about, find someplace else. This guy unsoldered/soldered mine and he's an expert. (Craigslist)

GadgetGenie.png

Read and Back Up

Put the Hynix chip in the AllSocket carrier, paying close attention to (your picture of) pin 1, and put that into your computer. You need to be running Linux; I recommend CentOS but any distribution will do.

# dmesg

... and the last few lines will show what the chip came up as, usually /dev/mmcblk0 and associated partitions p1-p4.

# cd /home/{youruser}/dl/

... or wherever you want to put the firmware images. You first want to pull the full image, bit-for-bit as there are several blank blocks at the beginning and it is very important to preserve the on-disk structure.

DO NOT MOUNT the partitions at this point.

# dd if=/dev/mmcblk0 of=mmcblk0.img bs=4M

... substitute your input device if different. 'bs=' block-size just speeds up the transfer and makes no reference to structure. This is your Golden Image. Save it, save a backup of it, save a backup of a backup of it, treasure it.
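One way to gain some confidence in the Golden Image is to hash the source and the image right after the copy. The following is only a sketch: the function name dump_and_verify and the scratch file standing in for /dev/mmcblk0 are illustrative assumptions, and it relies on GNU dd and sha256sum.

```shell
#!/usr/bin/env bash
set -euo pipefail

# dump_and_verify SRC DST: copy SRC to DST bit-for-bit, then confirm
# that both hash to the same SHA-256 value.
dump_and_verify() {
    local src=$1 dst=$2 src_sum dst_sum
    dd if="$src" of="$dst" bs=4M conv=fsync status=none
    src_sum=$(sha256sum < "$src" | cut -d' ' -f1)
    dst_sum=$(sha256sum < "$dst" | cut -d' ' -f1)
    if [ "$src_sum" = "$dst_sum" ]; then
        echo "OK $dst_sum"
    else
        echo "MISMATCH: source=$src_sum image=$dst_sum" >&2
        return 1
    fi
}

# Dry run on a scratch file standing in for /dev/mmcblk0 (hypothetical demo data).
workdir=$(mktemp -d)
head -c 1048576 /dev/urandom > "$workdir/fake_mmcblk0"
dump_and_verify "$workdir/fake_mmcblk0" "$workdir/mmcblk0.img"
```

On the real chip you would call it as dump_and_verify /dev/mmcblk0 mmcblk0.img. A mismatch does not tell you which read was the bad one, only that the chip did not return the same data twice.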

Now for safety and study, also preserve each partition:

# dd if=/dev/mmcblk0p1 of=mmcblk0p1.img bs=4M

... note carefully how this command differs from the above. Do this also with p2 through p4.
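Dumping p1 through p4 can be wrapped in a small loop. This is a sketch under assumptions: dump_partitions is a made-up helper name, and the demo uses scratch files named like the partition nodes rather than a real chip.

```shell
#!/usr/bin/env bash
set -euo pipefail

# dump_partitions DEV_PREFIX OUT_DIR: image partitions 1-4 of a device whose
# partition nodes are named <DEV_PREFIX>p1 .. <DEV_PREFIX>p4.
dump_partitions() {
    local prefix=$1 out_dir=$2 n
    for n in 1 2 3 4; do
        dd if="${prefix}p${n}" \
           of="${out_dir}/$(basename "$prefix")p${n}.img" \
           bs=4M status=none
    done
}

# Dry run: four scratch files play the role of /dev/mmcblk0p1..p4.
demo_dir=$(mktemp -d)
for n in 1 2 3 4; do
    head -c 65536 /dev/urandom > "$demo_dir/mmcblk0p$n"
done
mkdir "$demo_dir/out"
dump_partitions "$demo_dir/mmcblk0" "$demo_dir/out"
ls "$demo_dir/out"
```

With the chip in the carrier the call would be dump_partitions /dev/mmcblk0 /home/{youruser}/dl, giving mmcblk0p1.img through mmcblk0p4.img next to the full image.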

Read Errors Aplenty

Unhappy tidings when you try to dd off the original chip and get

dd: error reading ‘/dev/mmcblk0’: Input/output error

In my investigations of the chip I've found that dumping its partitions multiple times gives inconsistent images, not just on partition 3, but also on the other three partitions. I compare images either with diff or with sha256sum. These inconsistent reads imply that it is not the memory cells which are failing, but the support circuitry in the chip that is failing! (yaay Hynix...)
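The repeated-read comparison can be automated by hashing the same partition several times and counting how many different hashes come back. A minimal sketch, with assumed helper names (read_hash, count_distinct_reads) and an assumed DIRECT switch; on a real device set DIRECT=1 so that dd bypasses the kernel cache, otherwise later passes may be answered from RAM and look consistent when the chip is not.

```shell
#!/usr/bin/env bash
set -euo pipefail

# read_hash SRC: print the SHA-256 of one full read of SRC.
# DIRECT=1 asks dd for direct I/O (intended for real block devices).
read_hash() {
    if [ "${DIRECT:-0}" = 1 ]; then
        dd if="$1" bs=4M iflag=direct status=none | sha256sum | cut -d' ' -f1
    else
        sha256sum < "$1" | cut -d' ' -f1
    fi
}

# count_distinct_reads SRC PASSES: read SRC PASSES times and print how many
# different hashes turned up. 1 means every pass agreed.
count_distinct_reads() {
    local src=$1 passes=$2 i
    for ((i = 0; i < passes; i++)); do
        read_hash "$src"
    done | sort -u | wc -l
}

# Dry run on a scratch file (hypothetical stand-in for /dev/mmcblk0p3).
scratch=$(mktemp)
head -c 262144 /dev/urandom > "$scratch"
echo "distinct hashes: $(count_distinct_reads "$scratch" 3)"
```

On the chip that would be something like DIRECT=1 count_distinct_reads /dev/mmcblk0p3 3. Anything above 1 means the reads are not repeatable, which is the symptom described above.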

It is very common when ICs fail that the problem is heat-related -- as the chips get warmer they are more and more likely to fail. A common method of isolating faults on a circuit board is to use cold spray on its chips to see when the problem disappears. (Frys Electronics, et al) In this case you'd cold spray the chip carrier with the Hynix chip for about 10 seconds and try to read it. As it's being read, spray every 5 seconds or so to keep it cold.

This should make it possible to get a good read from it.

Fallback: ddrescue

There's a wonderful app for that -- ddrescue.

# ddrescue -d -r3 /dev/mmcblk0 mmcblk0-orig.img mmcblk0-orig.logfile

  • -d tells ddrescue to use direct disk access and ignore the kernel’s cache.
  • -r3 tells ddrescue to retry bad sectors 3 times before giving up. Note: On a failing drive you may want to eliminate this option the first time so as to not waste time hammering on bad sectors and risking drive failure. You can always use the logfile to go back and retry the bad sectors after you get an image from the first sweep.
  • /dev/mmcblk0 is the drive we are rescuing… the whole disk. Naturally, if you just wanted a specific partition, you would use something like /dev/mmcblk0p1 instead.
  • mmcblk0-orig.img is the name of the destination image file.
  • mmcblk0-orig.logfile is the name of the logfile. Always use a logfile. This allows you to resume an interrupted image at the point you left off, or to retry bad sectors after an initial pass. Without a logfile, you will have to start over again!

GNU ddrescue 1.22
Press Ctrl-C to interrupt
Initial status (read from mapfile)
rescued: 7818 MB, tried: 0 B, bad-sector: 0 B, bad areas: 0

ipos: 0 B, non-trimmed: 0 B, current rate: 0 B/s
opos: 0 B, non-scraped: 0 B, average rate: 0 B/s
non-tried: 0 B, bad-sector: 0 B, error rate: 0 B/s
rescued: 7818 MB, bad areas: 0, run time: 0s
pct rescued: 100.00%, read errors: 0, remaining time: n/a
time since last successful read: n/a
Finished

I like to ddrescue the image twice or more (renaming the image and logfile) and then diff them to make sure they're binary-identical. Another option is to run sha256sum on each of them to get their hash. Sometimes you can pull a partition six times and none match... yaay Hynix.

Notice above "read errors: 0"? That is a more reliable image than pulls with 1 or 2 read errors. If you can't get matching images, use the one that got 0 read errors, or the fewest.
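When two pulls do not match, it can help to know how far apart they are before deciding which one to keep. The sketch below counts differing byte positions with cmp -l; the helper name diff_bytes and the zero-filled demo files are assumptions for illustration only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# diff_bytes A B: print the number of byte positions at which two
# equal-sized images differ (0 means binary-identical).
# cmp exits non-zero when files differ, hence the '|| true'.
diff_bytes() {
    { cmp -l "$1" "$2" || true; } | wc -l
}

# Dry run: two scratch "pulls", the second with one byte changed.
tmp=$(mktemp -d)
head -c 4096 /dev/zero > "$tmp/pull1.img"
cp "$tmp/pull1.img" "$tmp/pull2.img"
printf 'X' | dd of="$tmp/pull2.img" bs=1 seek=100 conv=notrunc status=none

echo "pull1 vs pull1: $(diff_bytes "$tmp/pull1.img" "$tmp/pull1.img") bytes differ"
echo "pull1 vs pull2: $(diff_bytes "$tmp/pull1.img" "$tmp/pull2.img") bytes differ"
```

A handful of differing bytes between pulls points at a few flaky spots, while thousands suggest the whole read was unstable and is worth repeating with the chip kept cold.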

Fallback: e2fsck

Now; on the Hynix chip it is possible that your filesystem is damaged (due to worn-out memory cells or support circuitry), so let's check it:

# e2fsck /dev/mmcblk0p1

... (for ext* filesystems we used to just run fsck, and no, that's not F*CK!, an old Linux joke, but not anymore), and do the same for each partition through 4. If it finds damage, have it try to repair with

# e2fsck -y /dev/mmcblk0p3

This should only be done if you cannot pull an image first. If you continue to have a problem with a partition, use ddrescue. You don't want to mount and just copy the files because that would not preserve ownership, rights, and so on. Although last-ditch,

Fallback: rsync

# rsync --archive --progress {source} {destination}

works very well as long as you restore with same.

Write the New

If you've gotten a clean full image from above, or have repaired part 3 and then gotten a clean full image, remove the Allsocket and the Hynix from within, and replace with your new eMMC.

# dd if=mmcblk0.img of=/dev/mmcblk0 bs=4M

... of course substituting your own source and destination. You see, to Linux everything looks like a file, even devices, so you can treat them that way.
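After writing, it is worth checking that the new chip really holds the image. Because the 32GB chip is larger than the 8GB image, only the first image-sized stretch of the device can be compared. A sketch, assuming GNU stat and cmp; write_and_verify is a hypothetical helper and the demo uses scratch files instead of a device.

```shell
#!/usr/bin/env bash
set -euo pipefail

# write_and_verify IMG DEV: write IMG to DEV, flush it out, then compare only
# the first <size of IMG> bytes of DEV against IMG.
write_and_verify() {
    local img=$1 dev=$2 size
    size=$(stat -c %s "$img")
    # conv=notrunc only matters for the file-based dry run below; it keeps the
    # stand-in "device" at its full size.
    dd if="$img" of="$dev" bs=4M conv=notrunc,fsync status=none
    cmp -n "$size" "$img" "$dev"
}

# Dry run: a 1 MiB image written onto a 4 MiB zero-filled "chip".
lab=$(mktemp -d)
head -c 1048576 /dev/urandom > "$lab/mmcblk0.img"
head -c 4194304 /dev/zero > "$lab/fake_chip"
write_and_verify "$lab/mmcblk0.img" "$lab/fake_chip" && echo "verified"
```

On real hardware a read-back done immediately after the write may be served from the kernel's cache rather than from the chip, so pulling and re-seating the carrier before running the cmp step by itself gives a more honest answer.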

Re-Size

If all has gone well, what we've done is lay down a bit-for-bit image of your old flash onto the new one. This means that the image is limited to 8GB, whereas your new chip has a capacity of 32GB. Fortunately Tesla made /home as the 4th partition, so let's expand that to use more of the chip, giving us access to that space for future fun.

# gparted /dev/mmcblk0

... make -sure- that the device selected in the upper-right is the mmcblk0, or else you are about to ruin your boot disk. Select the resize tool and drag part 4's partition to take up the rest of the chip. Check-mark. Close gparted and consider your next move.

Verygreen.jpg Update from verygreen

It will work better if you do NOT extend home to fill the entire chip but leave some "unpartitioned" space instead. The unused space works as a buffer, and that helps reduce wear as the eMMC becomes full (since the filesystem does not signal to the chip when space is freed, the chip does not know which bits are unused and needlessly copies them around). By having more space than is actually used, write amplification is reduced as you approach full storage. With a 32G chip you can easily leave 16G unpartitioned for a great benefit, but even 8G would do wonders. For more info search on "ssd write amplification".

Will do, thanks vg.
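The arithmetic behind verygreen's suggestion can be sketched like this. Everything here is an assumption for illustration: part_end_mib is a made-up helper, 30000 MiB is a placeholder for the chip's real size (on the real device, blockdev --getsize64 /dev/mmcblk0 divided by 1048576 gives it), and the commands are only printed, not run. resize2fs assumes partition 4 is an ext* filesystem, as the e2fsck step implies.

```shell
#!/usr/bin/env bash
set -euo pipefail

# part_end_mib TOTAL_MIB RESERVE_MIB: where partition 4 should end (in MiB)
# so that RESERVE_MIB at the end of the chip stays unpartitioned.
part_end_mib() {
    local total=$1 reserve=$2
    if [ "$reserve" -ge "$total" ]; then
        echo "reserve must be smaller than the chip" >&2
        return 1
    fi
    echo $(( total - reserve ))
}

# Placeholder chip size of 30000 MiB, leaving 16 GiB (16384 MiB) unpartitioned.
end=$(part_end_mib 30000 16384)

# Print the commands for review instead of running them.
echo "parted /dev/mmcblk0 resizepart 4 ${end}MiB"
echo "resize2fs /dev/mmcblk0p4"
```

The helper does not check that the computed end lies beyond the current end of partition 4, so compare it against the partition table before running anything. In gparted the same result comes from dragging partition 4 only part of the way.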

Root It

Now would be a good time to root the chip and take other measures.

For now you will be able to have the shop resolder the chip and it should boot. Whole chip rework process cost me $125. Hold off on this if you want to root though.

If you feel this process is not something you'd like to take on, there are several people using my methods as a service, for $1,500, or $1,750 with root.

Epilogue

I'm not going to put in one of those useless 'I am not responsible...' disclaimers, but needless to say this all involves risk. Don't even think about it if you don't have engineering and solid Linux skills, just pay somebody. I recommend appleguru, ce2078, or TonyT, who will give -you- root and they will not monitor you, as opposed to those commercial rooters.

There is also a hazard here if you wait to do this until you have a black screen. Ah'm just tellin' you, the eMMC wearing out is inevitable due to heavy logging. It is possible by this point already that partition 3 (/var) is damaged and unreadable. It is best to do this before you have a problem, although let's face it, that's not human nature. I don't know that you will recover if your chip is worn out.

Add a Switch, NanoComputer

Modifying the firmware, you can open the ethernet diag port, get root, and many other things, depending on your firmware version. Optionally you can put a switch between the IC and MCU for initial work -- I put in a Netgear GS105PE, a PoE switch so I also needed to add a 12v-to-PoE converter. (eBay) Make sure it's a 12v==> PoE and not t'other way around, and make sure it can handle at least twice the wattage that the switch requires. Do not try to power the 'right' ethernet lines, because PoE no longer works that way. Don't ask me how I know...

Once you install a script to open the diag port, it's best to rejoin the IC directly to the MCU (for updates) and attach the diag port to the switch. This gives access to not just the CID but also to the gateway! Happy days!

Also to the switch I added an Arduino Yun. And I did the rubber-ducky antenna mod on the Yun so it associates with my home Unifi Pro AP, and this is how I access the car from the house.

I'll soon be installing WireGuard in the CID so it will establish a tunnel from whatever IP the LTE happens to get, into my home WireGuard server, which is running in a virtual machine in my LAN. IMHO OpenVPN is so lame that I've never bothered to learn it. I had been using IPSec, but Linus was so impressed with WireGuard that it has gone into the mainline kernel (v5.6), so I've switched everything to that, including using Azire for outgoing VPN service. (ATM I'm apparently coming to you from Spain)

Question: Remounting /var Elsewhere

TB205gti.jpg Question from TB205gti

Can we mitigate the /var wear by remounting a USB to /var? (Yes, I have root; the USB in the armrest mounts to /dev/sda.) Then the wear will be on the USB stick, and it is also easier to remove the stick and analyse logs on a Linux machine if needed. Moving files back and forth is much easier too. Triggering the auto-remount should be straightforward once the USB connects.

Yes, you should rsync --archive (to preserve ownership & rights), or better yet dd partition 3 to your USB or SD (/dev/mmcblk1) if you can get to the eMMC with the system shut down. (dd'ing /dev/mmcblk0p3 with the system running could produce unexpected results at the destination)

Then there is the catch: modifying fstab to mount the new /var location on boot. Before boot this file is in parts 1 and 2/cid-slash-lib/init/, and after boot it is of course /etc/fstab. If you modify fstab in the chip-carrier (1 and 2/cid-slash-lib/init/) your mod will disappear on the next firmware update. Best to remount after boot using a custom script in the repertoire you've built of useful custom functions.

Bomb.png Safety Glasses Warning

If you remove the USB though without remounting, the CID will crash and possibly leave the filesystem in a discombobulated state. Then you'll be using one of my solutions above to recover. (IF you have a Golden image)

You can buy extended-wear SD cards, so you could remove the mapping SD card from the CID, dd your mapping partition over to a better one, and then make your /var partition as well (and fstab changes). FWIW I have Transcend High Endurance micro SD's in all my security cameras. Put in as large-a-one as you want, for more fun.

I've never been comfortable with turning off logging as a solution, as this means you are turning off -all- OS logging. (System logging is started at init with Upstart, and can be disabled this way) Sometimes you need the logs for troubleshooting or for forensic purposes.

Carl A. Cook