Adding a scoring system in peepdf

= Adding a scoring system in peepdf =
Google Summer Of Code-2015: PEEPDF
 * Public URL: https://www.mediawiki.org/wiki/Adding_a_scoring_system_in_peepdf
 * Google Code: https://code.google.com/p/peepdf/
 * Project Page: http://eternal-todo.com/tools/peepdf-pdf-analysis-tool

Name and contact information
 * Name: Rohit Dua
 * Email: 8ohit.dua AT gmail DOT com
 * IRC or IM networks/handle(s): rohit-dua
 * Location: New Delhi, India
 * Time-zone: UTC+5:30
 * Typical working hours: 12:00 pm to 5:00 pm and 8:00 pm to 2:00 am (IST) until August; 6:00 pm to 2:00 am after August
 * Nationality: Indian

Synopsis
Currently, it is possible to identify the suspicious elements in a PDF file because they are shown in a different color (yellow). While this helps experienced analysts or users with some knowledge of the PDF format and/or threat analysis, it can be difficult for less skilled users to interpret. This project aims to list the elements that help distinguish whether a PDF file is malicious, assign a score to each of those elements (perhaps out of 10), and sum the individual scores into an overall maliciousness score for the file.
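As a rough illustration of the scoring idea, the sketch below assumes each detected element carries a score from 0 to 10 and that the file score is the plain sum of those scores. The factor names and values are hypothetical placeholders, not values taken from peepdf.

```python
# Hypothetical per-factor scores (0-10); real values would be tuned later.
FACTOR_SCORES = {
    "javascript_present": 7,
    "hex_encoded_names": 5,
    "single_page": 2,
    "chars_after_last_eof": 3,
}


def file_score(detected_factors):
    """Sum the individual scores of the factors found in one file.

    Factors without a configured score contribute 0.
    """
    return sum(FACTOR_SCORES.get(name, 0) for name in detected_factors)


def max_score():
    """Highest score a file can reach with the current factor table."""
    return sum(FACTOR_SCORES.values())
```

Reporting `file_score(...)` next to `max_score()` would give less experienced users a single number to read instead of a set of highlighted elements.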

Factors to be added (if not already present) that may decide maliciousness:
 * Number of pages: PDFs with a single page are generally more likely to be malware
 * Broken trailer/xref: may indicate manual modification
 * Hex/oct in tags, e.g. /#4a#61#76#61#73#63#72#69#70#74 instead of /Javascript
 * Number of filters on a specific stream
 * Type of filter applied, e.g. JBIG2Decode is not expected in text streams
 * Presence of Javascript/XFA
 * Presence of invalid tokens between objects: these will be ignored by the PDF reader
 * Absence of terminator or length tag in streams: the PDF viewer will read this without error, but without a length tag the terminator could extend into other objects
 * Presence of random garbage before the header: this is also used for file type cloaking
 * Presence of unknown elements in the header except %PDF-xx
 * Presence of additional triggers
 * Presence of Launch/OpenAction with javascript
 * Absence of xref/startxref/trailer
 * Similar file structure/names as used in the popular exploit kits
 * Colour expressed with more than 3 bytes
 * Backtracking and analysing for PDF syntax obfuscation, e.g. when /Javascript is not called directly but indirectly, via a reference
 * Objects not referenced from Catalog
 * Encrypted file with default password
 * File not made with traditional known editors, e.g. creating a PDF by exporting from Office (MS/Open/Libre) adds specific metadata
 * Chars after last EOF
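A few of the factors listed above can be checked directly on the raw bytes of a file. The following is a minimal sketch of three such checks (hex-encoded names, garbage before the header, and characters after the last EOF marker). The function names are hypothetical and this is not how peepdf itself implements these checks; a real implementation would work on parsed objects rather than on regular expressions over the whole file.

```python
import re


def decode_pdf_name(name: bytes) -> bytes:
    """Decode #xx hex escapes in a PDF name, e.g. b'/#4a#61' -> b'/Ja'."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        name,
    )


def has_hex_encoded_names(data: bytes) -> bool:
    """True if any name (token starting with '/') contains a #xx escape.

    Naive: this can also match inside binary stream data.
    """
    return re.search(rb"/[^\s/<>\[\]()]*#[0-9A-Fa-f]{2}", data) is not None


def garbage_before_header(data: bytes) -> bool:
    """True if %PDF- is present but not at offset 0.

    A file with no header at all returns False here and would need
    a separate check.
    """
    return data.find(b"%PDF-") > 0


def chars_after_last_eof(data: bytes) -> int:
    """Count non-whitespace-trimmed bytes after the last %%EOF marker.

    Returns -1 when the file has no %%EOF marker.
    """
    marker = b"%%EOF"
    pos = data.rfind(marker)
    if pos == -1:
        return -1
    return len(data[pos + len(marker):].strip())
```

Each check returns a simple value so that it can feed a per-factor score without the caller needing to know how the detection was done.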

Score assignment to each factor
 * The test suite of PDF files will be obtained from online sources, e.g. http://contagiodump.blogspot.in/2013/03/16800-clean-and-11960-malicious-files.html
 * To improve scores, the test suite obtained above will be used
 * Along with the pre-defined factors, the test suite will be run against the system to identify new factors (such as regular expressions)
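One possible way to use a labelled test suite for tuning scores is sketched below. It assumes each file has already been reduced to the set of factor names detected in it, and it scores a factor by how much more often it appears in malicious files than in clean ones. The formula and the function name are assumptions made for illustration, not a method fixed by this proposal.

```python
def calibrate_scores(malicious, clean, max_score=10):
    """Derive a 0..max_score value per factor from labelled files.

    malicious, clean: lists of sets of factor names, one set per file.
    A factor seen only in malicious files gets max_score; a factor seen
    only in clean files gets 0.
    """
    if not malicious or not clean:
        raise ValueError("both corpora must contain at least one file")

    factors = set().union(*malicious, *clean)
    scores = {}
    for factor in factors:
        # Fraction of files in each corpus that show this factor.
        p_mal = sum(factor in f for f in malicious) / len(malicious)
        p_clean = sum(factor in f for f in clean) / len(clean)
        # p_mal + p_clean > 0 because the factor occurs in at least one file.
        scores[factor] = round(max_score * p_mal / (p_mal + p_clean))
    return scores
```

Re-running this kind of calibration whenever new samples are added would keep the scores tied to observed data rather than to fixed guesses.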

Optional Goals:

 * Link with peepdf web interface

Project schedule
* The above plan may go as expected, or time may be redistributed among the tasks.

Participation
During my working hours, I will always be logged in to IRC (channels: #gsoc-honeynet, #gsoc) and can always be reached by email. I'm a computer addict and have a hard time staying off of it. All source code I write will be published to my GitHub repo as well as to the official peepdf git repository (on the default or development branch). At each stage of development I would like to discuss implementation details with the mentors so that there are no delays or issues later on. If I have other doubts or need feedback, I will head over to the mailing list.

About you
My name is Rohit Dua, and I'm currently pursuing my B.Tech in Electronics and Communication at Jaypee Institute of Information Technology, Noida, India. My home town is New Delhi, India. I code in Python/JavaScript/C/C++. I'm passionate about computer security and automation, and coding gets me high! This is my second consecutive year in Google Summer of Code. Last year (2014) I contributed to the MediaWiki organization, developing an online tool + bot (Python + shell), http://tools.wmflabs.org/bub/, which uploads books from Google Books to the Internet Archive. When I first heard about open source at a Linux User Group meetup at my university, I went crazy about it: I had always thought there was no such thing as free bread, but there always was free knowledge. This is why I love open source :-) I feel I can grow and learn much faster through community bonding in the open source universe. I have been building a headless browser that randomizes its fingerprint: http://github.com/rohit-dua/selkie. I recently got to know about the Thug project (Honeynet), which is somewhat similar to the project I have been working on. Although I love the Thug project, I'm more interested in malware spreading and security. That's why I chose the peepdf project. Google Summer of Code will be my top priority and I will happily treat it as a full-time job.

Past open source experience
 * GitHub profile: rohit-dua
 * GSOC-2014 -> Mediawiki
 * OWCS-2014 (OWASP Winter Code Sprint) -> OWTF