When analyzing binaries, it is important to be able to put what is observed into context. For example, how can CPU instructions be differentiated from data in a binary with a non-standard format? This requires some background knowledge of computer systems in general. I would argue that before any attempt at reverse engineering firmware is made, at least basic familiarity with the following concepts is required:
- Computer architecture / computer system organization
- CPU design and function (e.g. registers, the instruction pointer, memory access)
- memory and the memory hierarchy
- instruction sets, assembly, opcodes, addressing modes, syntax, mnemonics
- information representation (binary, hex, endianness)
- Operating system concepts
- Virtual memory
- usermode vs kernelmode, the kernel, the kernel interface (system calls)
- process layout in memory – stack, heap, data, instructions
- executable formats
- application binary interfaces
- program entry points
- source code to object code transformation
- compilation, assembly, linking
- C/C++ programming
- Assembly programming
- source-to-assembly construct correlation (e.g. recognition of loop, switch constructs in assembly)
- disassembly vs decompilation
My advice is the following:
- read as much as you can: technical specifications, assembly/disassembly, answers to firmware RE questions, research papers, tutorials, blogs, textbooks, manual pages
- emulate/copy the methodologies employed and approaches taken by pros
- gain experience as quickly as possible: look at and experiment with many different types of files (executables, image files, compressed files, firmware, etc.), program in assembly to get a feel for it, disassemble many executables
Firmware RE Resources
“Intro to Embedded Reverse Engineering for PC reversers” by Igor Skochinsky provides an overview of what is involved in reversing firmware, and in “Embedded Devices Security: Firmware Reverse Engineering” Jonas Zaddach and Andrei Costin outline a general methodology for reversing firmware beginning on slide 31.
Look at answers given by pros:
These may be useful or interesting:
Embedded systems often use MIPS or ARM processors, and by extension MIPS or ARM instruction sets. This means that being familiar with MIPS and ARM assembly will be very helpful when analyzing firmware for these systems.
Analyzing the binary
Part 1: Identification of the target device’s architecture
We cannot rely on hearsay to obtain the information required to analyze the firmware. Validity of information about the firmware must be proven by using empirical evidence. It is not enough to have a binary blob from a second-hand source and a processor name from a different question.
1. Identify the target device
Fortunately in this case it is easy to at least get the device name: SMOK X Cube II. When the vendor’s firmware and tools support page is examined it turns out that there is a real device with that name. The .hex file is bundled with an upgrade tool from Taiwanese semiconductor manufacturer Nuvoton called “NuMicro ISP Programming Tool”:
~/firmware/e-cig/XCUBE II upgrading tool $ file *
config.ini: ASCII text, with CRLF line terminators
NuMicro ISP Programming Tool.exe: PE32 executable (GUI) Intel 80386, for MS Windows
NuMicro ISP Programming Tool User's Guide.pdf: PDF document, version 1.5
XCUBE II-VIVI-52 (160616)V.1.098(checksum=0x28F9).hex: ASCII text, with CRLF line terminators
This hex file is straight from the manufacturer of the device processor rather than from a second-hand source. It is also a newer version – v1.098 rather than v1.07. I decided to analyze the older firmware version (v1.07) since this is the version of the binary in the question.
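Since the .hex file is reported as ASCII text and is later opened in radare2 through the ihex:// URI, it is presumably in Intel HEX format. As a rough sketch, such records can be converted into a flat binary with a few lines of Python. The function names and the sample records below are made up for illustration and are not taken from the actual firmware.

```python
def parse_ihex(text):
    """Parse Intel HEX text and return a dict mapping address -> byte value."""
    memory = {}
    base = 0  # upper address bits, set by record types 02 and 04
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.startswith(":"):
            raise ValueError("not an Intel HEX record: %r" % line)
        raw = bytes.fromhex(line[1:])
        # All bytes of a record, including the checksum, must sum to 0 mod 256.
        if sum(raw) & 0xFF != 0:
            raise ValueError("bad checksum: %r" % line)
        count = raw[0]
        offset = int.from_bytes(raw[1:3], "big")
        record_type = raw[3]
        data = raw[4:4 + count]
        if record_type == 0x00:    # data record
            for i, value in enumerate(data):
                memory[base + offset + i] = value
        elif record_type == 0x01:  # end-of-file record
            break
        elif record_type == 0x02:  # extended segment address
            base = int.from_bytes(data, "big") << 4
        elif record_type == 0x04:  # extended linear address
            base = int.from_bytes(data, "big") << 16
        # Record types 03 and 05 (start addresses) carry no image data.
    return memory


def to_flat_binary(memory, fill=0xFF):
    """Lay the parsed bytes out contiguously, padding gaps with `fill`."""
    if not memory:
        return b""
    start, end = min(memory), max(memory)
    return bytes(memory.get(addr, fill) for addr in range(start, end + 1))


# Hypothetical three-record file: set base 0, four data bytes, end of file.
sample = ":020000040000FA\r\n:0400000001211B209F\r\n:00000001FF\r\n"
image = to_flat_binary(parse_ihex(sample))
print(image.hex())
```

Padding gaps with 0xFF is an assumption based on the erased state of typical flash memory. A flat binary produced this way is what tools such as hexdump and strings operate on.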
2. Identify the processor
There are some interesting things in the pictures used to describe the upgrade process: the name NuMicro and the acronym ISP in the tool name, the term DataFlash, a reference to something called APROM, and most importantly, the part number: NUC220LE3AN. What “part” is this a number for? A Nuvoton-developed microcontroller based on ARM’s Cortex-M0 processor.
3. Identify the instruction set architecture
Nuvoton is kind enough to freely share technical documentation for the NuMicro NUC220 series, including the datasheet and the technical reference manual, in addition to various software tools and training materials (click on the “Resources” tab at the top of the NUC220LE3AN product page).
From the datasheet, Section 1: “General Description”, page 7 (emphasis mine):
The NuMicro NUC200 Series 32-bit microcontrollers is embedded with the newest ARM® Cortex™-M0 core with a cost equivalent to traditional 8-bit MCU for industrial control and applications requiring rich communication interfaces. The NuMicro NUC200 Series includes NUC200 and NUC220 product lines.
Is this enough information to conclude that the code in the firmware binary consists of 32-bit ARM instructions? No, it is not. Let us look closely at the functional description of the processor (Chapter 6: Functional Description, section 1: ARM Cortex-M0 Core, page 48):
Let us take special note of the following information:
The processor can execute Thumb code and is compatible with other Cortex®-M profile processor.
ARMv6-M Thumb® instruction set
Note that the processor is an ARM Cortex-M0 core and not an ARM Cortex-M0+ core. The two are distinct cores, although both implement the ARMv6-M architecture.
From ARM’s Cortex-M0 technical reference manual:
The processor implements the ARMv6-M Thumb instruction set, including a number of 32-bit instructions that use Thumb-2 technology. The ARMv6-M instruction set comprises:
- all of the 16-bit Thumb instructions from ARMv7-M excluding CBZ, CBNZ and IT
- the 32-bit Thumb instructions BL, DMB, DSB, ISB, MRS and MSR.
What is “Thumb code” and the “Thumb instruction set”?
From “Introduction to ARM thumb” by Joe Lemieux (emphasis mine):
The Thumb instruction set consists of 16-bit instructions that act as a compact shorthand for a subset of the 32-bit instructions of the standard ARM. Every Thumb instruction could instead be executed via the equivalent 32-bit ARM instruction. However, not all ARM instructions are available in the Thumb subset; for example, there’s no way to access status or coprocessor registers. Also, some functions that can be accomplished in a single ARM instruction can only be simulated with a sequence of Thumb instructions.
At this point, you may ask why have two instruction sets in the same CPU? But really the ARM contains only one instruction set: the 32-bit set. When it’s operating in the Thumb state, the processor simply expands the smaller shorthand instructions fetched from memory into their 32-bit equivalents.
The difference between two equivalent instructions lies in how the instructions are fetched and interpreted prior to execution, not in how they function. Since the expansion from 16-bit to 32-bit instruction is accomplished via dedicated hardware within the chip, it doesn’t slow execution even a bit. But the narrower 16-bit instructions do offer memory advantages.
The Thumb instruction set provides most of the functionality required in a typical application. Arithmetic and logical operations, load/store data movements, and conditional and unconditional branches are supported. Based upon the available instruction set, any code written in C could be executed successfully in Thumb state. However, device drivers and exception handlers must often be written at least partly in ARM state.
Here is a good explanation from SO: ARM, Thumb and Thumb 2 instructions confusion
From the ARMv6-M Architecture Reference Manual, Chapter A5: The Thumb Instruction Set Encoding, section 1: Thumb instruction set encoding, page 82:
The NuMicro NUC200 Series only supports little-endian data format.
To summarize: the code in the firmware binary will consist of little-endian Thumb instructions as defined by ARMv6-M. Most of these are 16-bit encodings, plus the handful of 32-bit Thumb instructions listed above (BL, DMB, DSB, ISB, MRS and MSR). The classic 32-bit ARM instruction set is not available on the Cortex-M0.
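A practical consequence of mixing 16-bit and 32-bit encodings is that the width of each instruction has to be decided from its first halfword. According to the ARMv6-M Architecture Reference Manual, a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the first half of a 32-bit instruction. The sketch below applies that rule. The function name is hypothetical, and the sample bytes are the first ten bytes at offset 0x2ed0 of the hexdump shown later in this post, which are assumed here to be code.

```python
import struct


def thumb_instruction_sizes(code):
    """Yield (offset, size_in_bytes) for a buffer assumed to hold only Thumb code."""
    offset = 0
    while offset + 2 <= len(code):
        # Thumb halfwords are stored little-endian on this device.
        (halfword,) = struct.unpack_from("<H", code, offset)
        top5 = halfword >> 11
        size = 4 if top5 in (0b11101, 0b11110, 0b11111) else 2
        yield offset, size
        offset += size


sample = bytes.fromhex("01211b20fdf76efe2146")
for offset, size in thumb_instruction_sizes(sample):
    print(f"+{offset:#04x}: {size * 8}-bit instruction")
```

The walk only makes sense when it starts on a real instruction boundary and the buffer contains no data. Embedded strings and constants will be misread as instructions, which is exactly the problem discussed further down.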
4. Identify the device’s memory layout
Access to the technical reference manual allows us to determine what APROM and ISP are. From Chapter 6: Functional Description, section 4.4.1: Flash Memory Organization, page 191:
The NuMicro NUC200 Series flash memory consists of program memory (APROM), Data Flash, ISP loader program memory (LDROM), and user configuration. Program memory is main memory for user applications and called APROM. User can write their application to APROM and set system to boot from APROM.
ISP loader program memory is designed for a loader to implement In-System-Programming function. LDROM is independent to APROM and system can also be set to boot from LDROM. Therefore, user can user LDROM to avoid system boot fail when code of APROM was corrupted.
And from Chapter 6: Functional Description, section 4.4.5: In-System-Programming (ISP), page 199:
ISP provides the ability to update system firmware on board. Various peripheral interfaces let ISP loader in LDROM to receive new program code easily. The most common method to perform ISP is via UART along with the ISP loader in LDROM. General speaking, PC transfers the new APROM code through serial port. Then ISP loader receives it and re-programs into APROM through ISP commands.
According to the information in the config.ini file bundled with the NuMicro ISP Programming Tool, flash memory size of the APROM segment is 128 KB:
$ cat config.ini | grep NUC200LE3AN -B2 -A3
[0x00020000]
NAME_STRING = NUC200LE3AN
RAM_SIZE = 16
FLASH_SIZE = 128
Here is a diagram of the flash memory address map:
The region from 0x0000_0000 to 0x0001_FFFF inclusive spans 0x2_0000 = 131072 bytes, which is 128 KB, and this is the region to which the binary from the hex file will be flashed by the upgrade tool. Above that there is a block of memory from 0x0002_0000 to 0x0010_0000 which is labeled “Reserved for Further Used”. The size of this “Reserved” space is 0x0010_0000 – 0x0002_0000 = 0xE0000, or 917504 bytes (896 KB). This is almost 1 megabyte of reserved space. The 128 KB reserved for APROM makes up 12.5% of the address space between 0x0000_0000 and 0x0010_0000, but in the diagram it is drawn larger than the 896 KB “Reserved” block. This is very strange. There is also no documentation of this reserved block anywhere in the technical reference manual that I could find. If one had physical access to the device, perhaps the contents of flash memory could be dumped and analyzed to find out what lies in this region.
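The sizes discussed above are easy to double-check with a few lines of Python. The constants are simply the addresses read off the flash memory map. The variable names are my own, and the last APROM address is treated as inclusive, so one is added when computing the size.

```python
# Address ranges as read off the flash memory map.
APROM_START = 0x0000_0000
APROM_LAST = 0x0001_FFFF      # last valid APROM address (inclusive)
RESERVED_START = 0x0002_0000
RESERVED_END = 0x0010_0000    # first address past the reserved block

aprom_size = APROM_LAST - APROM_START + 1
reserved_size = RESERVED_END - RESERVED_START
aprom_share = aprom_size / (RESERVED_END - APROM_START)

print(f"APROM:    {aprom_size} bytes ({aprom_size // 1024} KB)")
print(f"Reserved: {reserved_size} bytes ({reserved_size // 1024} KB, {hex(reserved_size)})")
print(f"APROM share of the first megabyte: {aprom_share:.1%}")
```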
Since the firmware binary is written to space in flash memory reserved for user applications, it seems unlikely that the firmware binary contains kernel code, bootloader code or a filesystem. This is different from router firmware, which tends to at the very least contain kernel code.
Part 2: Direct analysis of the binary
Quick recap of what we know at this point:
- The device name – SMOK X Cube II
- The processor – a Nuvoton NuMicro NUC220LE3AN microcontroller, based on an ARM Cortex-M0 core
- The instruction set architecture – little-endian ARMv6-M Thumb (mostly 16-bit instructions)
- The location in flash memory to which the firmware will be written – the 128KB APROM region for user applications (in other words, not the kernel)
- Nuvoton, the maker of the NuMicro family, is a Taiwan-based company. We will see why this is potentially relevant shortly.
- The entropy plot generated by binwalk included in the question reveals that there are no encrypted or compressed regions in the firmware
- Based on information included in the question, there exist ASCII strings embedded in the file that appear to be related to the functionality of the device
- firmware binaries do not have a standard format like executable binaries do
- Data may be intermingled with code/instructions within the binary. If this is the case, it is possible that data such as strings will be disassembled as instructions, resulting in an incorrect representation of the firmware’s code
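The entropy check mentioned in the recap can be approximated without any special tooling. The following is a minimal sketch of per-block Shannon entropy. It is not binwalk's implementation, and the function names and the 1024-byte block size are assumptions chosen for illustration. Values close to 8 bits per byte over long stretches would hint at compression or encryption, while code and text usually score noticeably lower.

```python
import math
from collections import Counter


def shannon_entropy(block):
    """Return the Shannon entropy of a bytes object in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def block_entropies(data, block_size=1024):
    """Return a list of (offset, entropy) pairs, one per block of the input."""
    return [
        (offset, shannon_entropy(data[offset:offset + block_size]))
        for offset in range(0, len(data), block_size)
    ]


# Synthetic demonstration data, not firmware: uniform bytes, then padding.
demo = bytes(range(256)) * 4 + b"\xff" * 1024
for offset, entropy in block_entropies(demo):
    print(f"{offset:#08x}: {entropy:.2f} bits/byte")
```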
The output of strings can be used as a quick heuristic for determining whether the firmware is encrypted or compressed. If there are no strings in the output, it is a good indicator that the entire file is obfuscated somehow.
hexdump with the -C argument can be used to provide some context for the strings i.e. where in the binary they are relative to code and relative to each other. In other words, are the strings packed together in a single block, or are they scattered throughout the binary? The answer can provide clues about the layout of the firmware.
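The same question (are the strings packed together or scattered?) can also be answered programmatically by listing every printable run together with its offset. This is a simplified stand-in for the strings utility, not a reimplementation of it. The function name and the minimum length of 4 are assumptions, and the sample bytes are copied from offset 0x2f00 of the dump shown below.

```python
import re


def ascii_strings(data, min_len=4):
    """Return (offset, text) for each run of at least min_len printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]


# 20 bytes taken from offset 0x2f00 of the firmware image.
sample = bytes.fromhex("68e288e0574154540a0000004d4f44450a000000")
for offset, text in ascii_strings(sample):
    print(f"{0x2f00 + offset:#010x}  {text!r}")
```

Plotting or simply eyeballing the offsets shows whether the strings form one table or several small clusters separated by other bytes.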
Using hexdump, we see that the ASCII strings are intermingled with what might be code:
00002ed0  01 21 1b 20 fd f7 6e fe  21 46 38 6a 09 f0 16 fd  |.!. ..n.!F8j....|
00002ee0  64 21 09 f0 13 fd 08 46  0a 21 09 f0 0f fd 10 30  |d!.....F.!.....0|
00002ef0  14 21 48 43 42 19 01 21  25 20 fd f7 5b fe 73 e0  |.!HCB..!% ..[.s.|
00002f00  68 e2 88 e0 57 41 54 54  0a 00 00 00 4d 4f 44 45  |h...WATT....MODE|
00002f10  0a 00 00 00 7c db 00 00  88 db 00 00 54 45 4d 50  |....|.......TEMP|
00002f20  0a 00 00 00 4d 45 4d 4f  52 59 0a 00 20 4d 4f 44  |....MEMORY.. MOD|
00002f30  45 20 0a 00 ac 01 00 20  53 54 52 45 4e 47 54 48  |E ..... STRENGTH|
00002f40  0a 00 00 00 3c 0b 00 20  20 4d 49 4e 20 0a 00 00  |....<..  MIN ...|
00002f50  53 4f 46 54 0a 00 00 00  4e 4f 52 4d 0a 00 00 00  |SOFT....NORM....|
00002f60  48 41 52 44 0a 00 00 00  20 4d 41 58 20 0a 00 00  |HARD.... MAX ...|
00002f70  ea cf 00 00 42 4c 55 45  54 4f 4f 54 48 0a 00 00  |....BLUETOOTH...|
00002f80  20 20 20 4f 4e 20 20 20  20 0a 00 00 20 20 20 4f  |   ON    ...   O|
00002f90  46 46 20 20 20 0a 00 00  ea d0 00 00 20 20 20 4c  |FF   .......   L|
00002fa0  45 44 20 20 20 0a 00 00  6a d1 00 00 53 54 45 41  |ED   ...j...STEA|
00002fb0  4c 54 48 0a 00 00 00 00  20 4f 46 46 20 20 0a 00  |LTH..... OFF  ..|
00002fc0  20 20 4f 4e 20 20 0a 00  20 20 54 4f 44 41 59 20  |  ON  ..  TODAY |
00002fd0  20 0a 00 00 80 96 98 00  f6 e1 00 00 83 e5 00 00  | ...............|
00002fe0  a0 86 01 00 10 27 00 00  21 46 38 6a 09 f0 8e fc  |.....'..!F8j....|
00002ff0  0a 21 09 f0 8b fc 10 31  14 20 41 43 4a 19 01 21  |.!.....1. ACJ..!|
another group of ASCII strings elsewhere in the binary:
00004f70  84 e0 04 f0 40 fe 00 28  13 d0 00 20 03 f0 ec ff  |....@..(... ....|
00004f80  1e 49 80 31 08 69 88 61  35 4a 90 42 00 d3 8c 61  |.I.1.i.a5J.B...a|
00004f90  88 69 08 62 33 48 06 23  04 22 00 90 19 46 00 20  |.i.b3H.#."...F. |
00004fa0  62 e0 6b e0 20 43 48 45  43 4b 20 20 0a 00 00 00  |b.k. CHECK  ....|
00004fb0  41 54 4f 4d 49 5a 45 52  0a 00 00 00 f6 e0 00 00  |ATOMIZER........|
00004fc0  28 03 00 20 ac 01 00 20  7a e0 00 00 20 20 43 48  |(.. ... z...  CH|
00004fd0  45 43 4b 20 20 0a 00 00  10 4b 00 00 ba e0 00 00  |ECK  ....K......|
00004fe0  44 4f 4e 27 54 0a 00 00  41 42 55 53 45 0a 00 00  |DON'T...ABUSE...|
00004ff0  50 52 4f 54 45 43 54 53  21 0a 00 00 3c 0b 00 20  |PROTECTS!...<.. |
00005000  20 57 41 54 54 20 0a 00  2c 2f 00 00 60 ea 00 00  | WATT ..,/..`...|
00005010  36 e1 00 00 2d 53 48 4f  52 54 2d 20 0a 00 00 00  |6...-SHORT- ....|
00005020  b2 eb 00 00 88 13 00 00  20 53 48 4f 52 54 20 20  |........ SHORT  |
00005030  0a 00 00 00 81 0b 00 00  49 53 20 4e 45 57 0a 00  |........IS NEW..|
00005040  43 4f 49 4c 3f 20 0a 00  59 0a 00 00 4e 0a 00 00  |COIL? ..Y...N...|
00005050  7c db 00 00 88 db 00 00  dc 05 00 00 a0 db 00 00  ||...............|
00005060  0f 27 00 00 94 db 00 00  fb f7 e0 fd 28 46 fd f7  |.'..........(F..|
00005070  a1 f8 fb f7 f0 fe 07 20  fd f7 08 fb af 20 fb f7  |....... ..... ..|
00005080  2f ff 00 20 fb f7 30 ff  38 bd ff 49 08 60 70 47  |/.. ..0.8..I.`pG|
00005090  fe 49 88 72 70 47 fd 48  80 7a 70 47 10 b5 13 24  |.I.rpG.H.zpG...$|
more ASCII strings elsewhere:
00005490  44 2f 00 00 34 0c 00 20  a0 db 00 00 88 db 00 00  |D/..4.. ........|
000054a0  94 db 00 00 7c db 00 00  ea d5 00 00 36 0a 00 00  |....|.......6...|
000054b0  2e 0a 00 00 50 4f 57 45  52 0a 00 00 20 4f 46 46  |....POWER... OFF|
000054c0  20 0a 00 00 20 20 4f 4e  20 0a 00 00 e7 03 00 00  | ...  ON .......|
000054d0  0f 27 00 00 9f 86 01 00  33 08 00 00 5f db 00 00  |.'......3..._...|
000054e0  fb f7 a4 fb fd 49 20 68  07 f0 10 fa 7d 27 08 46  |.....I h....}'.F|
000054f0  ff 00 39 46 07 f0 0a fa  f9 4e 00 01 80 19 01 22  |..9F.....N....."|
There are several more such clusters of ASCII strings in different parts of the file. Some of the ASCII strings are mentioned in the product manual:
However, many of the ASCII strings in the binary are not mentioned in the manual, such as these:
00009d00  21 b0 f0 bd 00 01 00 50  00 ff 01 00 b4 ed 00 00  |!......P........|
00009d10  43 12 67 00 45 52 52 4f  52 3a 20 20 20 0a 00 00  |C.g.ERROR:   ...|
00009d20  4e 4f 20 53 45 43 52 45  54 0a 00 00 2d 4b 45 59  |NO SECRET...-KEY|
00009d30  21 20 20 20 20 0a 00 00  ef 48 00 68 c0 07 c0 0f  |!    ....H.h....|
Visualization of the binary also shows that byte sequences that fall within the ASCII range are scattered throughout the binary (blue is ASCII):
2. Taking the locale the firmware was developed in into consideration
The firmware, the upgrade tool and the microcontroller are all developed by Nuvoton, a Taiwanese company. Perhaps there are sequences of traditional Chinese characters in the binary as well.
strings searches for ASCII character sequences, and the -C option for hexdump prints bytes within the ASCII range as ASCII characters. But what if there are Unicode-encoded strings in the binary in addition to ASCII-encoded strings? Radare2 can be used to search for strings in the hex file directly, rather than relying on the output of a different tool (hexdump is pretty flexible, but it is faster to use radare2). The izz command searches for strings throughout the entire binary:
$ r2 ihex://SMOK_X_CUBE_II_firmware_v1.07.hex
 -- I am Pentium of Borg. Division is futile. You will be approximated.
[0x00000000]> izz
Do you want to print 1444 lines? (y/N) <--- enter "y", obviously
This has some potentially interesting results:
vaddr=0x0000aa95 paddr=0x0000aa95 ordinal=1093 sz=28 len=13 section=unknown type=wide string=h(胐恇ԇӕ栠だi(胐⁇ԇ vaddr=0x0000aab5 paddr=0x0000aab5 ordinal=1094 sz=54 len=26 section=unknown type=wide string=i(胐ⱇ潩ᄆHhШ⣐ࡉ⡀ѡ⣠ũड蠅⡃灡h(胐 vaddr=0x0000aaef paddr=0x0000aaef ordinal=1095 sz=10 len=4 section=unknown type=wide string=Hh̨⣐ vaddr=0x0000ab07 paddr=0x0000ab07 ordinal=1096 sz=62 len=30 section=unknown type=wide string=h(胐ᄆ탕HhШ棐칩ࡉ桀ѡ棠ũड蠅桃灡i(胐༂웕Hh̨棐 vaddr=0x0000ab53 paddr=0x0000ab53 ordinal=1097 sz=70 len=34 section=unknown type=wide string=i(胐삵汍쁨ԇǐ栠だh(胐ꁇԇ˕栠끠h(胐恇ԇӕ栠だi(胐⁇ԇ vaddr=0x0000ab9d paddr=0x0000ab9d ordinal=1098 sz=58 len=28 section=unknown type=wide string=i(胐ⱇ潩ᄆ꧕HhШ⣐ꡩࡉ⡀ѡ⣠ũड蠅⡃灡h(胐ꈂཌ vaddr=0x0000abd7 paddr=0x0000abd7 ordinal=1099 sz=10 len=4 section=unknown type=wide string=Hh̨⣐ vaddr=0x0000abef paddr=0x0000abef ordinal=1100 sz=62 len=30 section=unknown type=wide string=h(胐ᄆ雕HhШ棐鑩ࡉ桀ѡ棠ũड蠅桃灡i(胐༂賕Hh̨棐 vaddr=0x0000ac3b paddr=0x0000ac3b ordinal=1101 sz=22 len=10 section=unknown type=wide string=i(胐袽腈ཨሢᄅ腃
I cannot read these characters, so I do not know what language they are from. Maybe it is just gibberish.
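One way to probe the gibberish hypothesis is to check what ordinary code bytes look like when they are decoded as UTF-16LE. The sketch below counts how many 16-bit units of a buffer land in the two main CJK ideograph blocks (U+3400 to U+4DBF and U+4E00 to U+9FFF). The function name is hypothetical, and the sample is the eight bytes at offset 0x2fe8 of the earlier dump, which follow a string table and are assumed to be code. A high ratio on bytes that are not text suggests, but does not prove, that such "wide strings" are a side effect of the encoding rather than real Chinese text.

```python
def cjk_fraction(data):
    """Fraction of little-endian 16-bit units that fall in the main CJK blocks."""
    units = [
        int.from_bytes(data[i:i + 2], "little")
        for i in range(0, len(data) - 1, 2)
    ]
    if not units:
        return 0.0
    cjk = sum(1 for u in units if 0x3400 <= u <= 0x4DBF or 0x4E00 <= u <= 0x9FFF)
    return cjk / len(units)


# Eight bytes from offset 0x2fe8 of the firmware image, assumed to be code.
sample = bytes.fromhex("2146386a09f08efc")
print([hex(int.from_bytes(sample[i:i + 2], "little")) for i in range(0, 8, 2)])
print(f"CJK-looking units: {cjk_fraction(sample):.0%}")
```

A native reader, or a comparison against a dictionary of common words, would still be needed to rule out genuine text.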
3. Using a hex editor
A hex editor with a GUI can be used to quickly search for patterns in the data. For example, the byte 0A looks like it is used as a terminating character for ASCII strings:
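The terminator observation can be checked over the whole file instead of by eye. A minimal sketch: find every printable run and tally the byte that immediately follows it. The function name and the minimum run length are assumptions, and the sample is the 20 bytes at offset 0x2f04 of the earlier dump.

```python
import re
from collections import Counter


def terminator_counts(data, min_len=4):
    """Tally the byte that follows each run of printable ASCII (None at end of data)."""
    counts = Counter()
    for match in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        end = match.end()
        counts[data[end] if end < len(data) else None] += 1
    return counts


# 20 bytes from offset 0x2f04 of the firmware image.
sample = bytes.fromhex("574154540a0000004d4f44450a0000007cdb0000")
for value, count in terminator_counts(sample).most_common():
    label = "end of data" if value is None else f"{value:#04x}"
    print(f"{label}: {count}")
```

If 0x0a dominates the tally for the full image, that supports the idea that the firmware stores its display strings with a trailing newline. In the dumps above each 0x0a is in turn followed by at least one 0x00 byte.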
So how should the binary be disassembled using r2? Are there any special arguments or commands for 16-bit ARM Thumb instructions plus some 32-bit Thumb2 instructions?
-b16 is assumed for Thumb, not because of the instruction size or the register size. It's an exception to make things simpler, because it's just a mode of the CPU.
-b16 sets Thumb2 mode in the Capstone disassembler (as well as in GNU). Thumb2 contains 2-byte and 4-byte instruction lengths. Thumb was only 2. But Thumb and Thumb2 are binary compatible, so it makes sense to use Thumb2 here, unless the CPU doesn't support it.
From what I understand of UAL, it is just a syntax, and this syntax should be ready in Capstone.
Capstone knows nothing about code or data. It just disassembles.
In order to properly disassemble the file, it is critical that the correct architecture is specified:
-b bits force asm.bits (16, 32, 64)
For this firmware binary, -b 16 should be used:
$ r2 -a arm -b 16 ihex://SMOK_X_CUBE_II_firmware_v1.07.hex
For reference, here is disassembly beginning at the same offset, 0x1e8, with proper 16-bit alignment:
Obviously this is totally different.
It is important to emphasize that the entire binary will be disassembled as executable code, including data such as the ASCII and Unicode byte sequences. This must be taken into consideration when analyzing the disassembled output.
To analyze the disassembled code, one must be familiar with ARM assembly.
- The ISP upgrade tool is a MS Windows PE32 executable binary. This can be reverse engineered to determine how the flashing process takes place.
- Physical access to the microcontroller could be useful. The entire contents of flash memory could be dumped and analyzed. This would also enable one to see exactly how everything is laid out in flash memory
- if known good blocks of code can be isolated, it may be possible to decompile them
Hopefully the approach used here proves useful for your future firmware RE endeavors. Analyzing firmware poses its own set of challenges because of the close relationship between it and the hardware it is designed to be embedded in. Since the design and architecture of the device determine the layout and content of firmware, firmware sometimes cannot be reversed without access to the device, or at the very least knowing the instruction set architecture of the device.