I'm in my senior year of college, and my professor and I have decided on an independent study in which I will analyze and develop one or more methods for screening data sets during the 'selection' phase of data mining, where spam bots, dead accounts, and other irrelevant data sources are omitted from machine learning analysis.

The problem I'm running into is that I really want to use a real-world data set, so that I can discover something meaningful in the data that hasn't been discovered before.

My initial thought was to use the Ashley Madison hack dump from the summer of 2015, but I wanted to consider the ethical implications of using such sensitive data. Alternatively, I was thinking I could manipulate the data before analyzing it to provide some sense of anonymity to the victims of the hack (for instance, replacing all full emails with the first and last character of the name, as well as the @ and suffix).
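To make the masking idea concrete, here is a minimal sketch of what I have in mind. It assumes that "suffix" means the whole domain after the @, and `mask_email` is just a hypothetical helper name, not something from an existing library. It uses a fixed-length filler so the length of the name is not revealed either.

```python
def mask_email(email: str) -> str:
    """Keep only the first and last character of the local part, plus the domain."""
    # Split on the last "@" so the domain is whatever follows it.
    local, sep, domain = email.rpartition("@")
    if not sep or not local:
        # Not a well-formed address (no "@" or empty name); leave it untouched.
        return email
    # Fixed-length filler hides how long the original name was.
    return f"{local[0]}***{local[-1]}@{domain}"


print(mask_email("john.doe@example.com"))
```

A one-character name ends up with that character repeated on both sides of the filler, which is a simplification I would probably revisit.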

My question is NOT whether you think these practices are ethical, but whether anyone is aware of past professional work that could serve as a model for my current work.

For example, Facebook was caught manipulating the content of its users' news feeds in order to measure the emotional effect of that content on the users' subsequent posts.

You have no legal right to use this data in your research. Basically, what you are describing is gaining unauthorised access to sensitive personal data and using it without the permission of the data holder or the individuals concerned. The fact that some hackers posted the data on the internet makes this unauthorised access really easy, but it doesn't change what you are doing from a legal perspective.

Anonymising the data in some way doesn't change this. There are certain cases where anonymising data makes it legally usable for certain purposes--but not in this case, where you have no right to use the data in any way.

Using this data would put you at risk of prosecution and probably jail time. And, if you plan to do some research and try to actually publish it, I'd say this is a very real risk. Even if you avoid legal consequences, it's quite likely this will affect your ability to publish the work.

Caveat: I'm not a lawyer. Ask one if you want more information.