Author Topic: Generating SBOM from a binary file.  (Read 1953 times)
Shot_Engineering
Newbie
*

Karma: +0/-0
Offline Offline

Posts: 3


« on: January 02, 2024, 02:47:08 AM »

Hi everyone. I'm starting a new project whose goal is to generate an SBOM from a binary file, and I was wondering if anyone has any info on how to get started. I know it might not be an easy task, especially since I'm new to reverse engineering, but I just need to figure out where to begin. How can I analyze a binary file, such as an ECU binary, and extract its dependencies? Also, does anyone have a sample file that has dependencies? Any help would be much appreciated. Thanks
Logged
d3irb
Full Member
***

Karma: +134/-1
Offline Offline

Posts: 195


« Reply #1 on: January 02, 2024, 09:09:04 AM »

This doesn't make sense, and is very difficult. If this is some academic project you may want to select a different topic.

First off, what kind of "dependencies" are you looking for? Are you trying to find what version of the Simulink function blocks a binary had its source code generated from? What compiler was used? What RTOS is employed? Which middleware vendor provided random blocks in the model generator? ECU binaries are flat, static binaries which are compiled with optimization, so information is fundamentally destroyed and their structure will be altered freely by the compiler (inlining, optimization, etc). There's not some linker table you can use to find library dependencies like in desktop software, and there's not some set of libraries that are "linked against" to produce the final product.
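To illustrate the "no linker table" point: a desktop executable announces its container format (ELF, PE) in its first bytes, and that container is what holds the import and dependency tables. A raw ECU flash image typically has no such header. Below is a minimal Python sketch, assuming the image is already loaded into memory. The name `container_format` is made up for this example, and it only checks magic bytes rather than validating the whole header.

```python
import struct


def container_format(data: bytes) -> str:
    """Classify an image as 'elf', 'pe', or 'flat' from its magic bytes only."""
    # ELF files start with 0x7F 'E' 'L' 'F'.
    if data[:4] == b"\x7fELF":
        return "elf"
    # PE files start with 'MZ'; the 32-bit little-endian value at offset 0x3C
    # points to the 'PE\0\0' signature.
    if data[:2] == b"MZ" and len(data) >= 0x40:
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
            return "pe"
    # Anything else is treated as a raw image: there is no import table or
    # list of needed libraries to read a dependency list from.
    return "flat"
```

If this returns "flat", there is no container to parse, so any dependency information has to be inferred from the code and data themselves.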

The ECU build process does not look like a traditional software project; code is usually generated from models using a code-generation tool like Simulink, ASCET, etc. This means that instead of finding calls to library functions like you might in a traditional statically linked binary, you will find that the actual source code used to generate the binary is full of repeated blocks provided by a middleware vendor. One way to think about this is that "linking" as you'd think about it traditionally has already happened before the C compiler toolchain is ever invoked - the model generator has "linked" various code blocks (from middleware vendors, the standard libraries, or custom) together in order to generate the C source code in the first place.

The only practical way I can see to attempt this is function hashing. If you had access to the dependencies you were trying to detect, you'd use something like BinDiff or IDA's FLIRT/Lumina. You'd build a fuzzy-hashed database of the functions in each dependency, then compare them against the binary. But this is unlikely to produce exact matches.
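The database-and-compare idea can be sketched roughly as follows. This is a toy illustration only and not how BinDiff, FLIRT or Lumina work internally: it assumes function bodies have already been extracted as raw bytes, and it uses byte n-gram sets with Jaccard similarity as a stand-in for a real fuzzy hash. All names here (`ngrams`, `build_db`, `best_match`) and the 0.6 threshold are made up for the example.

```python
from typing import Dict, Optional, Set, Tuple


def ngrams(code: bytes, n: int = 4) -> Set[bytes]:
    """Return the set of all n-byte windows in a function body."""
    return {code[i:i + n] for i in range(len(code) - n + 1)}


def jaccard(a: Set[bytes], b: Set[bytes]) -> float:
    """Similarity between two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_db(functions: Dict[str, bytes], n: int = 4) -> Dict[str, Set[bytes]]:
    """Precompute n-gram sets for every known dependency function."""
    return {name: ngrams(code, n) for name, code in functions.items()}


def best_match(db: Dict[str, Set[bytes]], candidate: bytes, n: int = 4,
               threshold: float = 0.6) -> Tuple[Optional[str], float]:
    """Return (name, score) of the closest known function, or (None, score)
    when the best score is below the threshold."""
    cand = ngrams(candidate, n)
    best_name, best_score = None, 0.0
    for name, grams in db.items():
        score = jaccard(cand, grams)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score
```

A real implementation would normalize addresses and register operands before hashing. Even then, a different compiler version or optimization level can change the bytes enough for the score to fall below any useful threshold.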
Logged
prj
Hero Member
*****

Karma: +1072/-480
Offline Offline

Posts: 6035


« Reply #2 on: January 02, 2024, 11:48:23 AM »

Quote from: d3irb
"The only way I can see that would be practical to attempt this would be to use function hashing. If you had access to the dependencies you were trying to detect, you'd use something like BinDiff or IDA's FLIRT/Lumina. You'd build a fuzzy hashed database of functions in the dependency, then compare them with the binary. But, this isn't likely to work very exactly."

Doesn't work. The moment the compiler version changes and starts optimizing differently it all goes to hell.
Had some PhD guys try this approach by using neural networks to match the Ghidra pseudocode. It went nowhere. I could match much more stuff with the approach I came up with myself, which works much better because it makes a ton of assumptions it knows to be true and does not decompile whole functions, only certain instructions.

The Hex-Rays and Ghidra disassembly in text form changes massively with different compiler optimization levels :(
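The post does not describe the approach itself, so the following is not it. It is only a generic sketch of what "matching certain instructions rather than whole functions" can look like: searching an image for a short byte pattern in which the variable bytes (addresses, immediates) are masked out. The function name `find_masked` and the example bytes are hypothetical.

```python
from typing import List


def find_masked(image: bytes, pattern: bytes, mask: bytes) -> List[int]:
    """Return every offset where image matches pattern under mask.

    Each mask byte selects which bits must match: 0xFF means the byte must
    match exactly, 0x00 means the byte is a wildcard.
    """
    if len(pattern) != len(mask):
        raise ValueError("pattern and mask must have the same length")
    hits = []
    for start in range(len(image) - len(pattern) + 1):
        window = image[start:start + len(pattern)]
        # Compare only the bits selected by the mask at each position.
        if all((w & m) == (p & m) for w, p, m in zip(window, pattern, mask)):
            hits.append(start)
    return hits


# Example: match 0xAA, any byte, 0xBB.
image = bytes.fromhex("00112233aa55bb6600")
print(find_masked(image, bytes.fromhex("aa00bb"), bytes.fromhex("ff00ff")))
```

Which bytes are safe to wildcard depends on the target instruction set and on what is known about the code being searched for. That is presumably where the "ton of assumptions" comes in.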
« Last Edit: January 02, 2024, 11:50:02 AM by prj » Logged
Shot_Engineering
Newbie
*

Karma: +0/-0
Offline Offline

Posts: 3


« Reply #3 on: January 30, 2024, 08:15:30 PM »

Hello there. Thanks for the reply, and sorry for taking so long to get back to you; I got busy with other things.
Can I ask what exactly you mean by the approach you came up with yourself? What approach did you take?
Logged
Shot_Engineering
Newbie
*

Karma: +0/-0
Offline Offline

Posts: 3


« Reply #4 on: January 30, 2024, 08:23:43 PM »

Hello buddy. Thank you so much for the reply! Also, sorry for taking ages to reply; I kind of got busy with other stuff.
I also saw your reply on Reddit!
I'm not sure I understood everything you mentioned, but I looked into BinDiff and IDA's FLIRT/Lumina, and they're really interesting tools that should be very helpful with stripped files. From what I understood, they can match and identify functions, giving back some of the symbols and names that were lost when the file was stripped. But I'm not exactly sure what comes next, to be honest. Given a non-stripped file, I'm not sure how to trace the imported functions, whether in the assembly or in the pseudocode generated by decompilers. I also found out about a new tool in the latest Ghidra release, BSim, and I think I'll check that one out next.
Logged