Tuesday, January 27, 2015

Incorporating Disk Forensics with Memory Forensics - Bulk Extractor

In this post we will take our first look at a tool that is primarily used for disk forensics and show how it can be useful during memory forensics analysis as well. In the coming weeks we will have several follow on posts highlighting other tools and techniques.


As you are likely aware, much of the information that is recoverable from disk and the network is also recoverable from memory - assuming you get a valid sample within a proper time frame.  If these conditions are met, then many forensics tools not usually associated with memory forensics can be incorporated into the memory analysis process. The artifacts recovered by these tools are often ones that memory tools do not focus on and are ill-suited to handle. Through incorporation of such tools and techniques, investigators can fully leverage all information available in volatile memory.

In many real-world investigation scenarios, you may only ever get a memory sample - see Jared's OMFW 2014 presentation for an excellent example of this [6] - and even less likely will you get full network flow (PCAP) during the time frame of an attack. This is unfortunate as the network often has information vital to an investigation, such as which servers were used to attack the network, which systems attackers laterally moved to, which domain(s) malware was downloaded from, and what commands were sent by a C&C server to local nodes. Even without a PCAP, all hope is not lost though, as network data must traverse main memory in order to be sent and received by applications.** This leaves the opportunity for historical network information to be left in memory long after it was active.

This idea was explored in great detail by Simpson Garfinkel and his co-authors in their 2011 DFRWS paper 'Forensic carving of network packets and associated data structures' [5]. In the paper they discuss not only historical network data in volatile memory, but also how that information can be stored on disk through system swapping and hibernation. This approach has the advantage of potentially getting network data from even previous reboots of the system.

To demonstrate the validity of their approach and evaluate the research, new modules were added to the open source bulk_extractor [1, 2, 3] tool in order to automate extraction and processing.

Since the publication of this paper and the popularization of bulk_extractor, extracting network data from memory captures has become a technique used by many analysts and it has even appeared in leading college digital forensics courses [4].

** With the exception of hardware rootkits within NIC firmware. If you believe this type of malware is active on a system that you need to investigate then you should check out the KntDD acquisition tool [14]. It can safely acquire memory from select NICs as well as acquire memory from other hardware devices.

Bulk Extractor

bulk_extractor is a highly-optimized open source tool that can scan inputs (disk, memory captures, etc.) and automatically find a wide range of information useful to investigators, such as email addresses, URLs, domains, credit cards numbers, and more. The project's wiki documents all of its scanning features [7]. Due to its multi-threaded, C++ design, bulk_extractor can often process inputs and extract all features at the speed at which the input can be read, making it very useful for efficient analysis.

When bulk_extractor is finished processing a target it then produces a number of output files. The exact files produced depends on which scanners were activated. For most scanners, in particular the network-related ones, not only is there a file of the raw data recovered, but there is also a histogram produced that orders entries based on the number of times found. This can be interesting in a number of ways, such as looking for the most common email addresses to determine who a user being investigated communicated with frequently.

Carving Network Packets and Streams from Memory

One of bulk_extractor's most useful features to memory forensics, and the one this blog will focus on, is the ability to profile network data in a sample as well as produce a PCAP of all packets found.

The profiled data includes recovered IP addresses, which can be mapped to known-bad lists or researched through threat intelligence tools, and Ethernet frames, which can be used to determine which local systems were contacted. Both of these profiled data sets include a histogram file.

To get this set of information you can run bulk_extractor as:

bulk_extractor -E net -o <output directory> <memory image path>

This will only run the net scanner and quickly produce the results described above in the directory specified with -o. Depending on your time constraints and desired data recovered, you can also run with other options, such as:

bulk_extractor -e net -e email -o <output directory> <memory image path>

This will run both the net and email scanners, which adds several more output files of interesting forensics data, including information about all domains found as well as email addresses and URLs. This can obviously give deep insight into what network activity occurred on the system being analyzed - potential phishing email accounts, malware URLs and domains, attacker C&C infrastructure, and so on.

Analyzing the PCAP File

In many of our own investigations we have found the PCAP file produced by bulk_extractor to be invaluable. In the connection traces we have uncovered evidence of data exfiltration, including (partial) file contents, commands sent by attackers to malware on client systems, input and output of tools on remotely controlled cmd.exe sessions, re-directed HTTP traffic from web servers, and much more.

The number of packets in a created PCAP file is limited to what is available in memory and not overwritten. For systems with large amounts of RAM, this can be many megabytes of data. For systems with less RAM, the in-memory caches will obviously be smaller, which often leads to less remnant information.  As we will show with the following forensics challenges though, even systems with small amounts of RAM can still retain crucial, historical network data long after it was processed.

Forensics Challenges

To show the usefulness of extracting network data from memory samples, we ran bulk_extractor against several memory captures included with popular forensics challenges. We choose these challenges for several reasons. First, they include both Linux and Windows samples. This shows the usefulness of the technique across several operating systems. Second, these challenges include small memory captures while still containing very useful data. Finally, by choosing public memory samples, anyone can perform their own research and follow the steps shown in the blog.

DFRWS 2005 Challenge

The first challenge we will analyze is the DFRWS 2005 challenge [8]. This is one the most well-known challenges since it directly inspired what we now consider modern memory forensics.

In this challenge, investigators were given a full network capture as well as a memory capture both before and after the suspect's laptop crashed. After running bulk_extractor, with -E net chosen, against the first sample we obtain a PCAP file (packets.pcap in BE's output folder) that has 130 packets. This is in comparison to 6304 packets in the full network capture, meaning roughly 2% of the related packets were recovered from a 126MB memory capture (no, the size is not a typo).

While this number may seem incredibly small, analysis shows that the recovered network data has highly useful information. As seen in the following screenshot from the PCAP loaded into Wireshark, proof of the "Back Orifice" (BO2k) malware being used on the system is present. This also includes partial file names and commands used on the system:

As discussed in the challenge's solutions, these backdoor interactions were a key part of solving what the attacker did to the system. Assuming a PCAP wasn't given along with the challenge, which we will see is not always the case, then volatile memory would be all investigators had to rely on for network clues.

DFRWS 2008 Challenge

The second challenge we will discuss is the one from DFRWS 2008 [9]. This included a Linux memory memory sample along with an accompanying PCAP file. As with the previous challenge, we will compare the network data obtained from memory to the provided PCAP along with analyzing the data from memory. This memory capture was 284MB and 122 packets were recovered. This is in comparison to 10243 contained in the full network capture (~1%).

Of these 122 packets, there was quite a bit of interesting data. As documented in the winning solution to the challenge [10], an encrypted ZIP file was exfiltrated from the network. Portions of this exfiltration appear in the capture. Also, the challenge details a Perl script used to extfiltrate the data using HTTP cookies on particular domains. Both fragments of the Perl script as well as several cookie instances are contained within the memory capture. The following shows one of the malicious requests encoding data within a msn.com cookie.

Note that the real msn.com is not contacted and a server under attacker control in Malaysia ( is contacted instead. For more information on this Perl's scripts chunking and encoding of data, please read the winning submission.

Honeynet 2010 Challenge

Honeynet's 2010 Challenge [11] focused on a end-user infected with a banking trojan. In the challenge you were given only a memory sample of the victim system. The given memory sample was 512MB and bulk_extractor recovered 256 packets. From our analysis, we believe this is a fairly significant amount of packets that were sent or received during creation of the challenge, even though it is a small number.

As shown in the following Wireshark screenshot, one of the packets recovered includes the request that downloaded the ZBot malware:

This is the focal point of the investigation, and if you read the challenge's official solution [12], you will see that once the malicious domain and connection are determined, the rest of the investigation falls into place.  Through the use of bulk_extractor we can immediately find the request in-tact, whereas the solution required an intricate mixing of strings and Volatility to map the pieces together. It is also interesting to note that when trying to definitively piece together the network traffic the solution states:

"It’s not possible to state it for sure since no network dump is available."

As we have just learned, we do not need to be given a network capture in order to get verifiable and useful network data from memory.

Honeynet 2011 Challenge

The 2011 Honeynet Challenge was an investigation against a memory sample and disk image of a compromised server. No network capture was provided. The memory capture is 256MB and 347 packets were recovered. As discussed in the solutions' for this challenge, the initial foothold was gained on the server through a heap-based vulnerability in the Exim mail server. The following from the PCAP of recovered packets shows this exploit in action:

Besides the large buffer of AAAAA's used to fill the heap, you can also see the IP address ( of the attacking system as well as its destination port (25) against the local server. This is an immediate warning sign of an attack against the mail server.

Mapping Connections to Processes

As useful information is recovered from memory, investigators often want to determine which process or kernel component was responsible for sending or receiving the particular packets. Volatility provides two ways to do this. The first is through the use of the strings plugin. The strings plugin can map a physical offset in a memory capture (the offset in the capture file) to its owning process or kernel module. Unfortunately, the strings plugins is not very straightforward with a PCAP file as the offset in physical memory of the packet is not encoded within the PCAP's data**.

To get around this limitation, you can use the yarascan plugin to search for unique data within the packet(s) that you find interesting. To do this, pass the data as the -Y option to yarascan. Note that you can pass arbitrary strings, bytes, and hex values to yarascan and you are not just limited to searching readable text. As yarascan searches for your signature it will list any processes or kernel components that contain the data. This can then immediately point you to the area of code responsible for generating or receiving the packet, which can greatly focus your analysis.

For those who closely follow Volatility, you know that there are also the ethscan [15] and pktscan [16] plugins that can perform the mapping as well. Unfortunately, pktscan is no longer supported and ethscan can take substantially longer to produce results versus bulk_extractor. For these reasons and others we choose to incorporate bulk_extractor in our analysis instead of using only Volatility components. 

** Someone please let me know if I am wrong about this. I have not seen this documented anywhere for bulk_extractor and I am not sure its possible using the old PCAP format that BE writes the capture in. The newer PCAP-ng format allows for arbitrary comments though and a nice addition to BE would be writing to the new format with a comment of the physical offset of where the particular packet was found.


This post was written to highlight the power of carving network data from memory. When full packet capture is not available, or when no network data is available, the ability to extract network data from memory may be your only chance of recovering it. As shown, bulk_extractor is a highly capable tool for this task. Not only can it recover packets to a PCAP, but it can also be used to recover a wide range of other information as well as arbitrary strings or regular expressions that you configure it for.

We strongly encourage our readers who are currently not utilizing this capability to start to incorporate in into your memory forensics workflows. We believe you will see immediate usefulness of the data recovered and bulk_extractor makes recovery trivial.


[1] https://github.com/simsong/bulk_extractor
[2] http://simson.net/ref/2012/2012-02-02%20USMA%20bulk_extractor.pdf
[3] http://simson.net/ref/2013/2013-12-05_tcpflow-and-BE-update.pdf
[4] https://samsclass.info/121/proj/p3-Bulk.htm
[5] http://simson.net/clips/academic/2011.DFRWS.ipcarving.pdf
[6] http://www.slideshare.net/jared703/vol-ir-jgss114
[7] https://github.com/simsong/bulk_extractor/wiki
[8] http://www.dfrws.org/2005/challenge/
[9] http://www.dfrws.org/2008/challenge/
[10] http://sandbox.dfrws.org/2008/Cohen_Collet_Walters/Digital_Forensics_Research_Workshop_2.pdf
[11] http://www.honeynet.org/challenges/2010_3_banking_troubles
[12] http://www.honeynet.org/files/Forensic_Challenge_3_-_Banking_Troubles_Solution.pdf
[13] http://www.honeynet.org/challenges/2011_7_compromised_server
[14] http://www.gmgsystemsinc.com/knttools/
[15] http://jamaaldev.blogspot.com/2013/07/ethscan-volatility-memory-forensics.html
[16]  https://code.google.com/p/volatility/issues/detail?id=233

1 comment:

  1. Nice post, bulk_extractor is truly an awesome program. If you enable the wordlist functionality of bulk_extractor when you scan your memory dump, a file with all printable characters and the offset to where they are found will be generated. If the pattern you are looking for contains other, non-printable characters, I would highly recommend creating a findlist and enabling the lightgrep scanner. I suspect this will get you the desired result quicker than running yarascan.