
#####Q: Do you have any plan to create a GUI?
A: Yes, I have. However, I can't tell you when, since I work on this project in my free time and I don't have a lot of it.

#####Q: How are tweets organized?
A: The main idea behind Dump Scraper is to work on files saved on the filesystem. Every step (scrape, organize, classify) creates new files instead of moving the existing ones. In this way, if anything goes wrong or the algorithm is improved, you can work on the same files again: you simply have to delete the output of that step.
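As a minimal sketch of what "deleting the output" of a step amounts to (the `reset_step` helper is made up for illustration and is not part of Dump Scraper; it assumes the `data` layout described below):

```python
import shutil
from pathlib import Path

# Hypothetical helper: re-run a step (e.g. "organized" or "processed") by
# deleting its output directory; the previous step's files are never moved,
# so they are still there to be processed again.
def reset_step(data_dir: str, step: str) -> None:
    target = Path(data_dir) / step
    if target.exists():
        shutil.rmtree(target)       # throw away the old output
    target.mkdir(parents=True)      # start the step again from scratch

# For example, clear the "organized" output before re-running that step:
# reset_step("data", "organized")
```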

All dumps will be stored under the data directory:

```
data
  `- organized
    `- hash
      `- YYYY-MM-DD
        `- <tweet id>.txt
    `- plain
    `- trash
  `- processed
    `- hash
      `- YYYY-MM-DD
        `- <tweet id>.txt
    `- plain
  `- raw
    `- YYYY-MM-DD
      `- <tweet id>.txt
      `- features.csv
```
- `raw` stores all the dumps downloaded from PasteBin, creating one directory for each day
- `organized` contains the dumps divided into three categories: trash (files we don't care about), plain (files with plain-text passwords) and hash (files with hashed passwords)
- `processed` contains the final result: each line holds one hash or one plain password, depending on the original type
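As a hedged illustration of how the final output could be consumed (the `read_processed` helper and the example date are assumptions for illustration, not part of Dump Scraper), this sketch collects every line from the processed files of one day, relying on the layout above:

```python
from pathlib import Path

# Hypothetical example: gather the processed results for a given day, where
# "kind" is either "hash" or "plain" and each file holds one entry per line.
def read_processed(data_dir: str, kind: str, day: str) -> list[str]:
    results = []
    for dump in sorted((Path(data_dir) / "processed" / kind / day).glob("*.txt")):
        results.extend(dump.read_text(errors="ignore").splitlines())
    return results

hashes = read_processed("data", "hash", "2015-03-15")
print(len(hashes), "hashes extracted on 2015-03-15")
```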