I've been working on some password-cracking research on the side. I thought I'd come up with a cool new idea, but it turns out that someone else already thought of it.
It occurred to me last night that a big list of passwords could be abstracted into their equivalent masks, a frequency count of those masks could be generated, and the masks could then be exhausted in frequency order, most common first.
First, I extracted a frequency count of character set combinations (masks) from the RockYou breach's password list, truncating each password to its first eight characters, which yields a list of the form:
    100:hundredofthese
    95:95ofthese
    [...]
    2:justtwoofthese
    1:onlyoneofthese
    1:alsoonlyoneofthese
... as follows:
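    #!/bin/bash

    echo "- Getting frequency of character patterns from RockYou ..."
    # Map each character to a class letter: l(ower), u(pper), d(igit), s(ymbol);
    # anything left over becomes 'a'. Lowercase goes first so that the class
    # letters introduced by the later steps are not themselves re-mapped.
    time gunzip -cd rockyou.txt.gz \
        | tr '[:lower:]' 'l' \
        | tr '[:upper:]' 'u' \
        | tr '[:digit:]' 'd' \
        | tr "[\ !\"#$%&\'()*+,-./:;<=>?@\[\\\]^_\`{|}~]" 's' \
        | sed 's/[^luds]/a/g' \
        | strings \
        | cut -b1-8 \
        | freqcount \
        > rockyou.freq.8a
    wc -l rockyou.freq.8a
    head rockyou.freq.8a

    echo "- Generate masks."
    echo "- Ignoring all masks with more than three consecutive 'a' charset."
    # Filter on the raw class letters *before* adding the '?' prefixes;
    # once 'a' has become '?a', the pattern 'aaaa' can no longer match.
    time cat rockyou.freq.8a \
        | cut -d: -f2 \
        | grep -Ev 'aaaa' \
        | sed 's/l/?l/g;s/u/?u/g;s/d/?d/g;s/s/?s/g;s/a/?a/g' \
        > rockyou.masks.8
    wc -l rockyou.masks.8
    head rockyou.masks.8

    echo "- Done."
    #end of script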
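The script above pipes through `freqcount`, which is not a standard utility and is not defined in this post. As a rough stand-in, assuming it emits `count:value` lines with the most frequent value first (the format shown earlier), something like the following sketch could fill the gap. The function body is a guess at the behavior, not the original tool.

```shell
#!/bin/bash
# Hypothetical stand-in for the undefined `freqcount` helper.
# Reads lines on stdin, writes "count:line" sorted by descending count.
freqcount() {
    sort \
        | uniq -c \
        | sort -rn \
        | awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n ":" $0 }'
}
```

For example, `printf 'llll\ndddd\nllll\n' | freqcount` puts the `llll` line first with a count of 2.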
Next, I wrote a script to exhaust each one in order by frequency using hashcat:
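    #!/bin/bash

    # Read the masks one per line on file descriptor 3, so that anything
    # inside the loop that reads stdin cannot consume the mask list.
    # (The original backticks tried to *execute* rockyou.masks.8.)
    while IFS= read -r mymask <&3; do
        echo "- Running mask: $mymask ..."
        # Quote the mask: '?' is a glob character to the shell.
        cudaHashcat64.bin -a 3 -m 1500 \
            target-hashes.list \
            "$mymask"
        echo "$mymask: done - $(date)" >> "$0.log"
    done 3< rockyou.masks.8
    #end of script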
Then it occurred to me that if someone else had published this info, and had used real corpora of passwords as the input, then our frequency lists would probably look similar. So I did the following Google search:
"?l?l?l?d?d?d?d" "?l?l?l?l?l?d?d?d"
... and the first hit was the KoreLogic blog post.
Dangit! :-) But at least I'm catching up to the state of the art; the KoreLogic article was published in April 2014. :-)
I got the idea from some license-plate collecting I do on the side. I came up with it there as a way to capture high-level patterns in serials, so that people can search for a plate based on its serial. A plate with "BDT 606" on it would match any plate whose serial "mask" is "AAA 999" in my notation. (I then match more closely, but the mask is used for a high-level search first.)
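The plate notation can be sketched in a couple of lines. This assumes that every uppercase letter maps to `A` and every digit to `9`, with everything else (spaces, dashes) left alone; the post only gives the one "BDT 606" example, so the function name and the exact rules are illustrative.

```shell
#!/bin/bash
# Hypothetical helper: reduce a plate serial to its high-level "mask".
# Assumption: uppercase letters -> A, digits -> 9, all else unchanged.
plate_mask() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[A-Z]/A/g; s/[0-9]/9/g'
}
```

With these rules, `plate_mask "BDT 606"` prints `AAA 999`, so a coarse search only has to compare masks before any closer matching.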
I haven't watched the KoreLogic presentation yet, but I can definitely improve upon my own approach, because I'm being overly aggressive in turning the entire set of non-alphanumeric-but-printable characters into 's':
| tr "[\ !\"#$%&\'()*+,-./:;<=>?@\[\\\]^_\`{|}~]" 's' \
... when most folks use the simple ones (#$%@, etc.). I could create a custom charset for those using the notation as noted here ... and then put the remaining characters into a second custom charset.
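A rough sketch of that split, assuming the "simple" set is exactly `#$%@` (the post says "etc.", so the real set would likely be larger) and using `1` and `2` as placeholder class letters that become hashcat's `?1` and `?2` custom charsets. The function name is made up for illustration.

```shell
#!/bin/bash
# Hypothetical variant of the mask pipeline with two symbol classes.
# Assumption: '#$%@' are the common symbols; all other punctuation
# (plus space) falls into the second class.
mask_with_custom_sets() (
    export LC_ALL=C   # treat input as plain bytes
    tr '[:lower:]' 'l' \
        | tr '[:upper:]' 'u' \
        | tr '[:digit:]' 'd' \
        | tr '#$%@' '1111' \
        | tr '[:punct:] ' '2' \
        | sed 's/[^lud12]/a/g; s/[lud12a]/?&/g'
)
```

Digits are mapped to `d` before the `1` and `2` placeholders are introduced, so the placeholders cannot be mistaken for password digits. When running the resulting masks, the two sets would be passed to hashcat with its `-1` and `-2` custom charset options, for example `-1 '#$%@'` plus a `-2` value holding the remaining symbols.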
I then found PACK - the Password Analysis and Cracking Kit, which is a set of Python scripts to manage masks, including optimizing a set of masks based on a given timeframe (or, "I have 24 hours. Which masks should I use to maximize how many passwords I can crack?").
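To make the "24 hours" question concrete, here is a minimal greedy sketch. It is not PACK's actual algorithm: it simply ranks masks by observed hits per guess and keeps adding masks while they still fit the time budget. The function name, the `count:mask` letter-form input (as in the frequency file above), and the guesses-per-second figure are all assumptions; the charset sizes are those of hashcat's built-in `?l`, `?u`, `?d`, `?s` and `?a`.

```shell
#!/bin/bash
# Sketch only: greedy "most hits per guess first" mask selection.
#   $1 = time budget in seconds
#   $2 = assumed guesses per second (hypothetical; measure your own rig)
# stdin: count:mask lines, masks in letter form (l, u, d, s, a)
pick_masks() {
    LC_ALL=C awk -F: -v rate="$2" '
        BEGIN {
            # Sizes of the hashcat built-in charsets
            size["l"] = 26; size["u"] = 26; size["d"] = 10
            size["s"] = 33; size["a"] = 95
        }
        {
            keyspace = 1
            for (i = 1; i <= length($2); i++)
                keyspace *= size[substr($2, i, 1)]
            if (keyspace == 0) next   # unknown class letter: skip the line
            # columns: hits per guess, seconds to exhaust, mask
            printf "%.30f %.3f %s\n", $1 / keyspace, keyspace / rate, $2
        }' \
        | LC_ALL=C sort -rn \
        | awk -v budget="$1" '
            used + $2 <= budget { used += $2; print $3 }'
}
```

Usage would look like `pick_masks 86400 1000000 < rockyou.freq.8a`. A mask that does not fit is skipped rather than ending the run, so cheaper masks further down the ranking can still use the leftover time.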