Tuesday, February 18, 2020

Shared Code Analysis: Finding similar code in malware samples


What's covered?

  1. Preparing the samples
  2. Jaccard? Isn't that the captain of the US Space Force?
  3. Graph Similarities


In my last blog post I hunted down some APT28 malware samples and sorted them by malware family name. Now we're going to use those samples and find code similarities between them.


OK, I've loaded the samples into my Ubuntu VM. Now what?


1 - Preparing the samples


I'm a total noob at this data science business but so far I'm loving what I'm learning. That's why this is applicable here.


What can we compare in all the samples that could show similarities in the code?

Strings
During malware analysis, strings are used a lot for IOCs and to look for various keywords in a file to see if anything jumps out. Usually these are strings defined by the programmer. This could be a good option.
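
Here is a rough sketch of what pulling strings out of a binary could look like in pure Python, roughly what the strings command does for ASCII. The function name extract_strings, the 4-character minimum and the demo bytes are my own assumptions, not anything from a particular tool.

```python
import re


def extract_strings(data, min_len=4):
    """Return the set of printable-ASCII runs of at least min_len bytes."""
    # 0x20-0x7e covers the printable ASCII range
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {match.group().decode("ascii") for match in pattern.finditer(data)}


if __name__ == "__main__":
    # Made-up bytes standing in for a real sample
    blob = b"\x00\x01MZ\x90\x00http://example.test/gate.php\x00\xffkernel32.dll\x00ab\x00"
    print(sorted(extract_strings(blob)))
```

Returning a set means two samples can be compared directly with set operations later on.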

Assembly Code
We could take a more complex approach and use Capstone or radare2 to get the assembly code and compare that, but that could take a while to figure out.. and compilers do fun stuff, so very similar code could look very different depending on whether it was compiled with a given flag or not... so let's not go there for now.

Import Address Table
What's something all malware has that the author probably just doesn't think about, even if they go crazy with obfuscation, anti-analysis tricks and all that jazz? The imported DLLs and function calls. They usually remain the same because you know... programming. This is a pretty good option to compare. The Import Address Table is where we find a mapping of imports.
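
As a quick sketch of what "imported DLLs and function calls" could look like as comparable data, here is one way to turn each import into a "dll!function" string using pefile. The helper names import_key and import_features are hypothetical, and the choice to lowercase the DLL name is my own assumption (DLL names are not case-sensitive on Windows).

```python
def import_key(dll, name, ordinal):
    """Build a comparable feature such as 'kernel32.dll!CreateFileA'."""
    dll_text = dll.decode("ascii", "replace").lower()
    if name is None:
        # Imported by ordinal only: no function name is stored in the file
        return "{0}!#{1}".format(dll_text, ordinal)
    return "{0}!{1}".format(dll_text, name.decode("ascii", "replace"))


def import_features(path):
    """Collect import features from a PE file (needs the third-party pefile package)."""
    import pefile  # imported lazily so import_key() works without pefile installed

    pe = pefile.PE(path)
    features = set()
    # DIRECTORY_ENTRY_IMPORT is absent when the file has no import directory
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        for imp in entry.imports:
            features.add(import_key(entry.dll, imp.name, imp.ordinal))
    return features
```

Calling import_features("/path/to/malware") would return a set of these strings for one sample.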

Dynamic API Call
What about API calls? These have the advantage of giving a generic picture of the actions taken by the malware, and they usually happen in a particular order. To do this we will probably need to actually run the malware samples, and that requires some planning to execute. Could be a fun project for later. Cuckoo Sandbox might be a good option for that but... have you ever set up a full Cuckoo environment? It's time consuming and requires a good amount of computing resources.

How are we going to compare these samples?  

Strings would work pretty well, but after some research it looks like the Import Address Table is a more reliable option. I'm tempted to use both, just in case we run into malware samples without an IAT.

IAT? Wait, what's that?

I'd suggest reading this blog post: https://tech-zealots.com/malware-analysis/journey-towards-import-address-table-of-an-executable-file/
As soon as the Windows loader loads an executable it does certain things in the background. First, it reads the files of a PE structure and loads an executable image into the memory. The other thing it does is to scan the Import Address Table (IAT) of an executable to locate the DLLs and functions that the executable uses and loads all these DLLs and maps them into the process address space.
[......] 
Within any executable file, we would see an array of data structures which is one per imported DLL. Each of these structures gives the name of the imported DLL and points to an array of function pointers. Import Address Table (IAT) is an array of these function pointers where the address of the imported function is written by the Windows loader.

OK, so how are we going to extract and store this info?

After some Google-fu and reading, it looks like we can accomplish this pretty easily with pefile.

Once pefile is installed, naturally we have to test it with a little code.
Pseudo code?

import pefile
import os

file = <our pe file>
pe=pefile.PE(file)
pe.parse_data_directories()
print pe.DIRECTORY_ENTRY_IMPORT

!!! Nope, that doesn't quite work.. OK, back to the docs...


https://github.com/erocarrera/pefile/blob/wiki/UsageExamples.md#listing-the-imported-symbols

Found it. Looks like we can loop through pe.DIRECTORY_ENTRY_IMPORT and pull them out.


Not-so-pseudo code

import os
import pefile

file='/path/to/malware'
 
def getIAT(file):
    pe =  pefile.PE(file)
    # If the PE file was loaded using the fast_load=True argument, we will need to parse the data directories:
    pe.parse_data_directories()
    for entry in pe.DIRECTORY_ENTRY_IMPORT:
        print(entry.dll)
        for imp in entry.imports:
            iat = set(hex(imp.address))
            print(iat)


if __name__ == '__main__':
    getIAT(file)
 

oh.. bad code..


Add some better functionality


import os
import pefile
import argparse


def getIAT(file):

    pe =  pefile.PE(file)
    # If the PE file was loaded using the fast_load=True argument, we will need to parse the data directories:
    pe.parse_data_directories()
    for entry in pe.DIRECTORY_ENTRY_IMPORT:
        print(entry.dll)
        for imp in entry.imports:
            iat = set(hex(imp.address))
            print(iat)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="GetIAT of file."
    )
    parser.add_argument(
        "target_file",
        help="malware file"
    )
    args = parser.parse_args()
    file=args.target_file
    getIAT(file)


And it gets something.... 



Not quite what we're looking for.

Changing the code a little more

import os
import pefile
import argparse


def getIAT(file):

    pe =  pefile.PE(file)
    iat=set()
    # If the PE file was loaded using the fast_load=True argument, we will need to parse the data directories:
    pe.parse_data_directories()
    for entry in pe.DIRECTORY_ENTRY_IMPORT:
        print(entry.dll)
        for imp in entry.imports:
            iat.add(hex(imp.address))
            print(iat)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="GetIAT of file."
    )
    parser.add_argument(
        "target_file",
        help="malware file"
    )
    args = parser.parse_args()
    file=args.target_file
    getIAT(file)






OK... It's a start.

Now how are we going to compare this for each sample?

Well, we can take all the IAT data per sample and compare them, then divide how many entries are shared by the total number of distinct IAT entries across both samples. To do this we are going to use what's called the Jaccard index.

2 - Jaccard Index

What is the Jaccard Index?
According to deepai.org, the Jaccard index is:
The Jaccard Index, also known as the Jaccard similarity coefficient, is a statistic used in understanding the similarities between sample sets. The measurement emphasizes similarity between finite sample sets, and is formally defined as the size of the intersection divided by the size of the union of the sample sets.

Sounds like what we need to use.
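
To make the definition concrete, here is a tiny worked example. The two import sets are made up for illustration; they are not from real samples.

```python
# Hypothetical import sets for two samples
sample_a = {"CreateFileA", "WriteFile", "InternetOpenA"}
sample_b = {"CreateFileA", "WriteFile", "RegSetValueExA", "Sleep"}

shared = sample_a & sample_b    # intersection: CreateFileA, WriteFile
combined = sample_a | sample_b  # union: 5 distinct imports in total

# Jaccard index = size of intersection / size of union = 2 / 5
jaccard_index = len(shared) / len(combined)
print(jaccard_index)
```

Identical sets give 1.0 and sets with nothing in common give 0.0, so the value works as a "how similar" score.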


code:

def jaccard(set1,set2):
    """
    Calculate the Jaccard index (similarity) between the attribute sets of two samples.
    Shared attributes divided by all distinct attributes across both sets.
    """
    intersection = set1.intersection(set2)
    intersection_length = float(len(intersection))
    union = set1.union(set2)
    union_length = float(len(union))
    # Two empty sets have nothing in common to measure; avoid dividing by zero
    if union_length == 0:
        return 0.0
    return intersection_length / union_length

Now how do we visualize these connections in the data?


3 - Graphing similarities


Networkx to the rescue.

What does networkx do? It allows us to visualize and graph the connections in the data by creating nodes and connections between them. In other words, each sample is a node, and if two samples share similar DLL imports then we can connect them and say that they are similar.

APT28 Similarity Graph

Let's try to create a simple graph.




import networkx as nx 

G=nx.Graph() 

G.add_node("a") 
G.add_nodes_from(["b","c"]) 

G.add_edge(1,2)
edge = ("d", "e") 
G.add_edge(*edge) 
edge = ("a", "b") 
G.add_edge(*edge) 

print("Nodes of graph: ") 
print(G.nodes()) 
print("Edges of graph: ")
print(G.edges())

nx.draw(G)
















Putting it all together




OK, so what steps do we need to take now to get this whole thing working?


  1. Get user input of malware samples directory
  2. Get each file and check if it is a PE
  3. Extract the IAT from the PE
  4. Add the malware sample as a node on the graph  [ graph.add_node(path, label) ]
  5. Iterate over pairs of samples and calculate the Jaccard index
  6. If the Jaccard index is above the threshold, add a connection ("edge")  [ graph.add_edge(mal1, mal2, how similar) ]
  7. Write the graph to disk
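
The pairwise comparison part of these steps (5 and 6) can be sketched without pefile or networkx, just to show the logic. The function name similarity_edges and the demo sets are assumptions for illustration; the 0.8 default matches the threshold used later in the script.

```python
import itertools


def similarity_edges(features, threshold=0.8):
    """Return (sample1, sample2, jaccard_index) for every pair above threshold."""
    edges = []
    # combinations() visits each unordered pair of samples exactly once
    for first, second in itertools.combinations(sorted(features), 2):
        union = features[first] | features[second]
        if not union:
            continue  # both sets empty, nothing to compare
        score = len(features[first] & features[second]) / len(union)
        if score > threshold:
            edges.append((first, second, score))
    return edges


if __name__ == "__main__":
    # Made-up attribute sets standing in for extracted IAT data
    demo = {
        "sample1.bin": {"a", "b", "c", "d", "e"},
        "sample2.bin": {"a", "b", "c", "d", "e", "f"},
        "sample3.bin": {"x", "y"},
    }
    for edge in similarity_edges(demo):
        print(edge)
```

Each tuple returned here corresponds to one edge that would be added to the graph.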




Pseudo code:

Imports
   
def getIAT(path):
     """
        Extract IAT
     """

def jaccard(set1,set2):
    """
    Calculate the Jaccard index between the attribute sets of two malware samples.
    Uses what is similar and how many attributes there are to calculate jaccard
    """

def  check_pe(path):
   """
      Check if its a PE file
   """

__main__

Get arguments
directory=

for each file in directory
     check if PE
     if PE file
         Extract IAT from file
         Store data
         ADD file as node to graph

Iterate through pairs of malware files
     calculate jaccard for the 2 malware files
     If jaccard index is above the threshold add a connection ("edge") between nodes

write to disk
    


Let's code this up now.


#!/usr/bin/env python3

import argparse
import itertools
import os
import subprocess

import networkx
import pefile
from networkx.drawing.nx_pydot import write_dot


def getIAT(fullpath):
    """
    Extract the Import Address Table entries from the binary
    """
    pe = pefile.PE(fullpath)
    # If the PE file was loaded using the fast_load=True argument, we will need to parse the data directories:
    pe.parse_data_directories()
    iat_list = set()
    try:
        for entry in pe.DIRECTORY_ENTRY_IMPORT:
            for imp in entry.imports:
                iat_list.add(hex(imp.address))
    except AttributeError:
        # pefile only sets DIRECTORY_ENTRY_IMPORT when the file has imports
        print("ERROR! No imports in sample. Falling back to strings method..")
        iat_list = getstrings(fullpath)
    return iat_list


def getstrings(fullpath):
    """
    Extract strings from the binary
    really doesn't do much unless there's a large number of PE files without imports... packed warez?
    """
    # Pass the path as an argument list so odd file names cannot break the command
    result = subprocess.run(
        ["strings", fullpath],
        capture_output=True, text=True, errors="replace"
    )
    return set(result.stdout.split("\n"))


def pecheck(fullpath):
    """
    Checks for 'MZ' to see if binary is PE
    """
    # Read as bytes: malware is binary data, not text
    with open(fullpath, "rb") as f:
        return f.read(2) == b"MZ"


def jaccard(set1, set2):
    """
    Calculate the Jaccard index (similarity) between the attribute sets of two samples.
    Shared attributes divided by all distinct attributes across both sets.
    """
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    # Two empty sets have nothing in common to measure; avoid dividing by zero
    if not union:
        return 0.0
    return len(intersection) / len(union)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Find similarity between malware and graph it."
    )

    parser.add_argument(
        "target_directory",
        help="Directory containing malware"
    )

    parser.add_argument(
        "output_dot_file",
        help="Where to save the output graph DOT file"
    )

    parser.add_argument(
        "--jaccard_index_threshold", "-j", dest="threshold", type=float,
        default=0.8, help="Threshold above which to create an 'edge' between samples"
    )
    # Not wired up yet:
    # parser.add_argument("--method", "-m", dest="method")

    args = parser.parse_args()
    malware_paths = []  # stores malware file paths
    malware_attributes = dict()  # stores the malware Import Address Table
    graph = networkx.Graph()  # similarity graph

    for root, dirs, paths in os.walk(args.target_directory):
        # iterate through directory to find malware paths
        for path in paths:
            full_path = os.path.join(root, path)
            malware_paths.append(full_path)

    # check if PE file (a real list, since it is iterated more than once below)
    malware_paths = [path for path in malware_paths if pecheck(path)]

    # get the IAT for malware and store it
    for path in malware_paths:
        attributes = getIAT(path)
        print("Extracted {0} attributes from {1} ...".format(len(attributes), path))
        malware_attributes[path] = attributes

        # add each malware file to the graph
        graph.add_node(path, label=os.path.split(path)[-1][:10])

    # iterate through all pairs of malware
    for malware1, malware2 in itertools.combinations(malware_paths, 2):

        # calculate the jaccard index for the current malware samples
        jaccard_index = jaccard(malware_attributes[malware1], malware_attributes[malware2])

        # Check if the jaccard index is above the threshold.. if so add an edge
        if jaccard_index > args.threshold:
            print(malware1, malware2, jaccard_index)
            graph.add_edge(malware1, malware2, penwidth=1+(jaccard_index-args.threshold)*10)

    # Output the graph
    write_dot(graph, args.output_dot_file)


Full Code: https://github.com/es0/MalwareSimilarityGraph

Monday, February 3, 2020

Hunting for APT28 malware in a stockpile of samples

Recently I wanted to do some data analysis on APT28 malware samples I had. I have some samples sorted and organized, but I also have a pile of unsorted encrypted zip and rar files containing a bunch of other unrelated malware samples and warez.

The question is which APT samples are hiding in my stockpile of malware samples, and which of those samples are related to APT28.



So how do we get to the juicy samples inside the thousands of password protected files?

We brute force them, of course. Since they're malware samples, the password will more than likely be something like the following:

infected
password!
malware

or some similar variation.
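
To give an idea of what "some similar variation" means in practice, here is a small sketch that expands a few base words into candidates. This is nowhere near what real mangling rules do; the function name candidate_passwords and the particular suffixes are my own picks for illustration.

```python
def candidate_passwords(words):
    """Yield each base word plus a few common variations, without duplicates."""
    seen = set()
    for word in words:
        variants = [word, word.lower(), word.upper(), word.capitalize()]
        # A few suffixes people commonly tack on (an arbitrary illustrative choice)
        for suffix in ("!", "1", "123"):
            variants.append(word + suffix)
        for variant in variants:
            if variant not in seen:
                seen.add(variant)
                yield variant


if __name__ == "__main__":
    for candidate in candidate_passwords(["infected", "password!", "malware"]):
        print(candidate)
```

The output could be saved to a file and used as the wordlist for the scripts below.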


After some googling and testing of a small script, I had something that worked. It uses John the Ripper to generate the password candidates and unzip to try them against each zip file.

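#!/bin/bash
# Brute force the password of every zip file in a directory.
echo "Brute all the zip files in dir"
if [ $# -ne 2 ]
then
    echo "Usage: $0 <directory_with_zip_files> <wordlist>"
    exit 1
fi
# ${1%/} drops a trailing slash so the directory works with or without one
for f in "${1%/}"/*.zip
do
    # john only generates candidates here (wordlist + rules); unzip tests them
    while IFS= read -r i
    do
        echo -ne "\rtrying \"$i\" "
        if unzip -d zip-out -o -P "$i" "$f" >/dev/null 2>&1; then
            echo -e "\nArchive: $f  password is: \"$i\""
            # stop trying passwords for this archive once one works
            break
        fi
    done < <(john --wordlist="$2" --rules --stdout 2>/dev/null)
done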


Running it.

<  INSERT FORGOTTEN SCREENSHOT HERE   >

Modifying this script, I was able to get a sort of hacky brute force that seems to work with the rar files.

#!/bin/bash
echo "rar file brute";
if [ $# -ne 2 ]
then
echo "Usage $0 <directory_with_rar_files> <wordlist>";
exit;
fi
FILES="$1*.rar"
echo $FILES
for f in $FILES
do
#unrar x $f -pinfected rar-out/ >/dev/null 2>&1
while IFS= read -r line
do
echo "File: $f"
echo -ne "\rtrying \"$line\" "
unrar x $f -p$line rar-out/ >/dev/null 2>&1
STATUS=$?
if [ $STATUS -eq 0 ]; then
echo -e "\nArchive: $f  password is: \"$i\""
fi
done < $2
done


Yes, I realize it's not perfect (the password printed on success comes from the wrong variable, $i instead of $line), but it works and I'll fix it later.



Running it.





Brute forcing the zips was a lot cleaner.


Anyway, we now have two directories with a bunch of malware samples. I also ran the zip brute force inside the zip-out directory to get any samples still zipped up, and I got a few. :)



So now we have all the malware samples that were decrypted from the rars and zips.


How are we going to sort through 10,000+ malware samples?


With Yara and bash of course.

Using Yara's APT rules to sort through all the samples, we find some interesting malware.



yara -p 20 -g /YARA_RULES/rules/malware/APT_*.yar -r /MALWARE 



command breakdown:

-p 20          Use 20 threads
-g               print tags
<yara rules>
-r                recursive search
<malware directory>

Run with the -m flag to get metadata, which will be very helpful when sorting the malware families.


So we see there's a lot of info and a lot of various APT malware samples.  Now we need to sift out the APT28 samples.


This is where grep is our friend.

grep "APT28"

 yara -p 20 -g -m /YARA_RULES/rules/malware/APT_*.yar -r /MALWARE | grep "APT28" | sort | cut -d"/" -f1,2,3,4,5,6,7,8,9,10,16,17,18,19

You can ignore the cut command.  I just wanted to clean up the output.





Now let's sort the malware into its family groups.


Basically we want to sort the APT28 families into sample groups.
We use grep to pull out samples related to the family name, like
grep "CORESHELL"
grep "X-Agent"
etc..

 
Using a little command line kung-fu we can pull out the sample directories and then copy those samples into the malware family directories.


I wrote a small shell script to do this.

echo "YARA APT28 MALWARE FAMILY SORTER"
echo " Sorts CORESHELL, X-Agent, XTunnel, etc..."
list=(X-Agent CORESHELL XTunnel EVILTOSS BlackEnergy)
for i in "${list[@]}"
do
    # Sorted known APT28 files
    yara -p 20 -g -m /YARA_RULES/rules/malware/APT_*.yar -r /MALWARE-SAMPLES/APT28/ | grep "GRIZZLY-STEPPE" | grep "$i" | sort > APT_28-$i-Family_Samples.txt
    # Unsorted stockpile dir
    yara -p 20 -g -m /YARA_RULES/rules/malware/APT_*.yar -r /MALWARE | grep "GRIZZLY-STEPPE" | grep "$i" | sort >> APT_28-$i-Family_Samples.txt
    # Keep only the sample path, which follows the tags and metadata blocks
    cut -d"]" -f3 APT_28-$i-Family_Samples.txt > sample_dir.txt
    samples=sample_dir.txt
    while read -r sample
    do
        echo -e "\nFAMILY: $i"
        echo "$sample"
        cp "$sample" "APT28/Malware-Family/$i/"
    done < "$samples"
done
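
The grep and cut part of this could also be sketched in Python. This assumes each yara output line looks like "RULE [tags] [metadata] /path" when run with -g and -m, which is what the cut on "]" above relies on. The rule name, tags and metadata in the demo line are made up, and group_by_family is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed line layout with -g and -m: RULE [tags] [metadata] /absolute/path
LINE_RE = re.compile(r"^(?P<rule>\S+) \[(?P<tags>[^\]]*)\] \[(?P<meta>.*)\] (?P<path>/.*)$")


def group_by_family(lines, families, marker="GRIZZLY-STEPPE"):
    """Group sample paths by family name, mimicking grep marker | grep family."""
    groups = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or marker not in line:
            continue
        for family in families:
            if family in line:
                groups[family].append(match.group("path"))
    return dict(groups)


if __name__ == "__main__":
    # Made-up output lines for illustration only
    demo = [
        'APT28_XAgent [APT28,GRIZZLY-STEPPE] [description="X-Agent implant"] /MALWARE/unsorted/sample1.bin',
        'Some_Other_Rule [APT] [description="unrelated"] /MALWARE/unsorted/other.bin',
    ]
    print(group_by_family(demo, ["X-Agent", "CORESHELL"]))
```

From there each path in a group could be copied into its family directory with shutil.copy.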

I manually created directories... why? Because that's just how it happened.
Running the script resulted in the following




There you have it.  We successfully sorted through a pile of malware searching for samples from APT28 and separated out the samples into the malware families.

Next step is to use the malware sample set for some data science and machine learning fun.

Like what?

Well like doing a little shared code analysis on the samples.

But that's for another blog post. 

Thursday, October 17, 2019

Attacking SSH





Goals:

  • Discover hosts with ssh running on a network.
  • Brute force ssh credentials using Hydra and wrapper script
  • Intro to SSHOOTER for system management or post exploitation of SSH.



Scan for SSH running on the network and get the IP addresses.

nmap -p22 --open 192.168.1.1/24 | grep "scan report" | cut -d" " -f5







results:
192.168.1.103
192.168.1.148
192.168.1.150
192.168.1.157
192.168.1.172
192.168.1.182
192.168.1.162
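
If you'd rather do the grep and cut step in Python, a sketch could look like this. It assumes nmap's normal output, where each host line reads "Nmap scan report for IP" or "Nmap scan report for hostname (IP)". The function name hosts_from_nmap and the router.lan hostname are made up for the example.

```python
import re

# Matches both "... for 192.168.1.103" and "... for somehost (192.168.1.103)"
REPORT_RE = re.compile(
    r"^Nmap scan report for (?:\S+ \()?(\d{1,3}(?:\.\d{1,3}){3})\)?$"
)


def hosts_from_nmap(output):
    """Return the IP addresses found in nmap 'scan report' lines, in order."""
    hosts = []
    for line in output.splitlines():
        match = REPORT_RE.match(line.strip())
        if match:
            hosts.append(match.group(1))
    return hosts


if __name__ == "__main__":
    demo = (
        "Nmap scan report for 192.168.1.103\n"
        "Host is up (0.00050s latency).\n"
        "Nmap scan report for router.lan (192.168.1.1)\n"
    )
    print(hosts_from_nmap(demo))
```

Unlike cut -f5, this also picks the IP out of lines where nmap resolved a hostname.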


Put these into a file.
Now we have our list of targets. Let's use Hydra to brute force SSH credentials.
I'm going to use a shortened wordlist, but feel free to use lists from SecLists or other sources.

hydra -L wordlist/usernames/labsmall.txt -P wordlists/passwords/lab-small.txt -t 4 -M targets.txt ssh

then wait…






WIN!

We have creds.

A keen eye might note the current working directory of the above screenshot. I wrote a wrapper script to brute force SSH and format the output in a way that we can use later on.

Let's see this script in action now.













So that's how the ssh_bruter.sh script works.

So why the formatted output? I'm glad you asked.


USERNAME@IP:PORT PASSWORD


Let me introduce you to another little tool I wrote I like to call SSHOOTER.

It's kind of an SSH administration tool. I plan on adding more features in the future, but it helps with some simple tasks for now. It takes a creds.txt file with the formatted output from the ssh_bruter.sh script.
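
For illustration, reading one line of that USERNAME@IP:PORT PASSWORD format could look something like the sketch below. This is not SSHOOTER's actual code; the names Cred and parse_cred_line are made up, and the password in the demo line is just an example value.

```python
from collections import namedtuple

Cred = namedtuple("Cred", "username host port password")


def parse_cred_line(line):
    """Parse one 'USERNAME@IP:PORT PASSWORD' line into a Cred tuple."""
    # Split on the first space only, so passwords may contain spaces
    target, _, password = line.rstrip("\n").partition(" ")
    username, _, hostport = target.rpartition("@")
    host, _, port = hostport.rpartition(":")
    if not (username and host and port.isdigit()):
        raise ValueError("unexpected creds line: {0!r}".format(line))
    return Cred(username, host, int(port), password)


if __name__ == "__main__":
    print(parse_cred_line("msfadmin@192.168.1.103:22 msfadmin"))
```

Keeping the port as an integer makes it easy to hand straight to an SSH client library.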


Why?

I wanted a centralized and somewhat easy way to manage multiple systems that were running SSH. I got tired of having multiple terminals SSH'd into remote boxes, trying to execute a simple task on them all and get the output. You know, when you're in your pentest lab and need to check the IP on a few systems or restart a service.


So what can we do with SSHOOTER?

  • manage multiple remote systems with ssh enabled using username and password or key file.
  • Execute command on a host or multiple hosts
  • Upload/Download files
  • Establish shell on remote host
coming soon:
  • ssh tunneling
  • importing new hosts



Starting SSHOOTER







Main Menu:





List hosts:





As we can see, the creds logged in and we have gotten the hosts' runtimes. :)

Let's open a shell on host 0. It's a Metasploitable 2 VM, FYI.





Enter the host we want to open the shell on.




And there we have it. A shell on the remote machine using ssh. Pretty simple. Basically just runs `ssh msfadmin@192.168.1.103` for you.

Type exit to exit the shell.

Let's download '/etc/passwd' from 192.168.1.103.

Enter 4 in the main menu.
When prompted, enter the file to download from the host,
then the destination,
and the host to execute on.




Results:










Running command on multiple hosts

That's it for now.
Hope you enjoyed the blog post.


Feel free to play around with the scripts and see if they help you.  I've noticed hydra is pretty unstable sometimes and just hangs.


CODE:







Sunday, August 9, 2015

Exploit-Development 1 Notes and write-up (Strategic Security training.)

This blog post contains personal homework notes and step-by-step instructions on developing a buffer overflow exploit. The material is from Joe McCray's Strategic Security training.



Let's begin...

First we need a skeleton script for our exploit.


























This will send 2000 A's to the IP address specified as the first parameter of the program.










Next let's start up the target application using Immunity Debugger on the target system.

Now we run the program by clicking the "play" (run) button >


Next we need to run our skeleton script from the attacking host.









Doesn't look exciting from this view.

Checking out the target.

Boom! It's dead!
Notice EIP is 41414141, which is AAAA.


Now we need to find how much room we have for shellcode. :)

Let's use pattern_create.rb to generate a pattern that we can use.


Copy this into our skeleton script. I like saving it under a new name so I keep a skeleton script around.





Now we run this against our target.
NOTE: Remember to restart the application within Immunity.

And this is what we get.




















Let's calculate our shellcode spacing.

We get the value in EIP and ESP.




Now we can use pattern_offset.rb to find the offsets of these values.



So it looks like pattern_offset is having issues finding the offset of 0Aj1A.

Let's do it by hand.

Vim find





















OK, so we found it at 281. But that's from the beginning of the line, which includes 'buff =', and we don't want that. The actual string starts at offset 9, so 281-9=272.
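
The by-hand search can be double-checked with a few lines of Python. This sketch rebuilds the Aa0Aa1Aa2... style pattern (uppercase, lowercase, digit triples) and searches it for the ESP value. The function name pattern_create just mirrors the tool name, and the length of 2000 is an assumption based on the size of the skeleton buffer.

```python
import itertools
import string


def pattern_create(length):
    """Build an Aa0Aa1Aa2... cyclic pattern of the requested length."""
    triples = itertools.product(
        string.ascii_uppercase, string.ascii_lowercase, string.digits
    )
    chunks = []
    for upper, lower, digit in triples:
        chunks.append(upper + lower + digit)
        if len(chunks) * 3 >= length:
            break
    return "".join(chunks)[:length]


if __name__ == "__main__":
    pattern = pattern_create(2000)  # assumed length, same as the skeleton buffer
    # Offset of the value found in ESP
    print(pattern.find("0Aj1A"))
```

This prints 272, the same offset found with Vim above.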

Now we have the offset of ESP

and it's only 4 bytes away from EIP.

Let's translate this information into our Python exploit.




























Offset to EIP was 268 so we fill buffer with 268 A's

EIP is 4 bytes long so we fill that with B's

ESP is filled with C's

Let's test it out and we should see EIP have B's and ESP C's.
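
The buffer layout described above can be sketched like this. It only builds the A/B/C test buffer and does not send anything. The 2000 byte total is an assumption carried over from the skeleton script, and build_test_buffer is a made-up name.

```python
OFFSET_TO_EIP = 268   # bytes of padding before the saved EIP
TOTAL_LENGTH = 2000   # assumption: same overall size as the skeleton buffer


def build_test_buffer(offset=OFFSET_TO_EIP, total=TOTAL_LENGTH):
    """Return padding A's, 4 B's for EIP, then C's where ESP points."""
    padding = b"A" * offset
    eip = b"B" * 4
    # Whatever is left lands at offset + 4 = 272, which is where ESP points
    rest = b"C" * (total - len(padding) - len(eip))
    return padding + eip + rest


if __name__ == "__main__":
    buff = build_test_buffer()
    print(len(buff), buff[268:272], buff[272:276])
```

If the offsets are right, the debugger should show EIP as 42424242 (BBBB) and ESP pointing at a run of 43's (C's).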








































Cool, what we thought should happen did.

So we've confirmed our information and now we can actually exploit this.





First we need the location of our shellcode, which is where ESP points. Since we have control over EIP, we can just put that address directly in.

Let's generate some shellcode real quick; a bind_tcp should work for a PoC.
Insert the shellcode into our script.

New exploit script looks like this.



























Let's run it against our target now, and if all goes as planned we should be able to use netcat to connect in on port 4444 (default for msf payloads).


And we have a shell.





Notice our connection in netstat from 192.168.0.109