Hi I'm Matt Feld, Data Scientist and creator of Congresswebhistory.com, a tool that tracks what Congress, The White House, and the FCC are browsing on the internet-- I'm doing an AMA today at 4:00pm
UPDATE: I'll be going through this on an ongoing basis to make sure I answer as many questions as I can -- Feel free to post, and I'll get a response to you as soon as I can
UPDATE: Alright everyone, I need to get to bed, so I'm signing off for the night-- I'll be back on tomorrow to continue answering questions
Here is my proof: http://www.speaktogether.org/blog/my-reddit-username-is-mfworks
Here's the main page: https://www.congresswebhistory.com
Here's a little bit about the project: https://igg.me/at/browserhistory/x/16494183
And here is our other project for fighting the ISP law-- we've built a tool to opt your data out of collection by ISPs, and are building a legal fund to bankroll civil cases when those ISPs abuse your data:
https://indiegogo.com/projects/opt-out-your-browsing-data-so-isps-cannot-sell-it
Louise Matsakis also covers it really well here:
twitter: @mfeldspeak facebook: https://www.facebook.com/speaktogethernow
Some background about me: I'm a 24-year-old software developer in Research Triangle Park, NC. I founded Speak Together a year ago to build software to change the models citizens can use to reach out to their govt.
Last year I got involved in the fight to repeal NC HB2 (the notorious anti-transgender bathroom law that was passed here in North Carolina) and quickly became jaded by how difficult and inefficient it was to learn about the activities of the state legislature and communicate at all with my representatives.
I found a few friends who felt the same way I did, and we've been building software to try to make that process easier. One of them, violetnekos, is also on the AMA.
Ask Me Anything!
mfworks1325 karma
While the types of porn our representatives are watching are definitely entertaining, the real value from getting our plugin on porn sites (and we have it currently tracking on a few) is seeing if Congress or the White House is accessing porn while on the job at all.
Also, our goal isn't to "out" anyone for their sexual preferences-- however, it would be interesting to see if there are distinct trends between Congress vs. the White House vs. FCC vs. the public at large, the last comparison of which is something really only the sites that are currently using our plugin can answer.
mfworks1039 karma
Where we see the largest value for our tool (beyond its efficacy as political protest) is if we can get it on a large swath of online news sites. It would be interesting to see where swaths of representatives are getting their news: How many get everything from Breitbart/Infowars? How many from Mother Jones, or the Young Turks? We can also use referrer headers to track whether people are following specific sites, or if they just read what pops up on, for example, their Facebook feed.
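As a rough illustration of the referrer idea, here is a minimal Python sketch that sorts a visit into "direct", "social", or "other" based on the Referer header value. The function name, the category labels, and the list of social domains are all assumptions made for the example, not part of the actual plugin.

```python
from urllib.parse import urlparse

# Hypothetical list of domains treated as social feeds (an assumption for this sketch).
SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "reddit.com"}


def classify_referrer(referrer: str) -> str:
    """Return 'direct', 'social', or 'other' for a Referer header value."""
    if not referrer:
        # No referrer usually means a typed URL, a bookmark, or a stripped header.
        return "direct"
    host = (urlparse(referrer).hostname or "").lower()
    # Match the domain itself or any subdomain (e.g. m.facebook.com).
    for domain in SOCIAL_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return "social"
    return "other"
```

A visit that arrives with `https://m.facebook.com/` as its referrer would land in the "social" bucket, while one with no referrer at all would count as "direct".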
fifibuci22 karma
The website Pornhub does an annual breakdown of trends geographically, if you're interested ;)
mfworks8 karma
I got you: https://www.pornhub.com/insights/2016-year-in-review
Also they have an insights feed: https://www.pornhub.com/insights/
bananahead679 karma
Isn't this mostly just tracking what underpaid interns are doing while they're supposed to be running to Starbucks?
mfworks967 karma
Also, just because interns exist doesn't mean reps are immune from tracking. The irony here is that the ISP privacy law was based on the legal argument that ISPs are not utilities, and so are exempt from regulations that apply to utility companies.
If that's so, then Congress should be able to work around having to use the internet (and being tracked on it) in the same way they expect us to, and not have it impact their job.
If they can't, then it's a pretty clear indication that ISPs are providing a public utility, and should have to safeguard our data in the same way utilities do.
MyParentsAre_Cousins297 karma
I was an underpaid intern on the Hill, can you please delete my browser history?
Interns need some help, they are the ones opening up the Hustler magazine every month.
mfworks273 karma
Yes, we actually created a tool for just this purpose. Email me at [email protected] and I can get rid of all your tracked data for you.
In a broader sense, this kind of participation from non-reps in Washington is incredibly helpful. One of the easiest ways to improve the accuracy of our data is just to opt out all the interns that work there. If you are an intern, hit me up at [email protected] and we can auto-filter out any of your browsing history.
violetnekos42 karma
Will we track what interns are doing on their way to Starbucks? Yes, we will probably catch some of that. We have filters and can analyze the data to parse some of that out.
bananahead40 karma
Can you explain in more detail? How can you differentiate traffic from an FCC intern from that of a commissioner? Don't they have the exact same IP?
mfworks56 karma
Yes.
- The easiest comparison to make is to other third-party tracking tools (like Google Analytics). These tools work kind of like this:
mfworks62 karma
Step 1: You send a request to a web page that has one of these tracking scripts on it
Step 2: The server responds, and your browser then downloads the web page you are trying to view (including the plugin)
Step 3: The script then runs on your machine (within your web browser), and sends data about your browser (and cookies, and some other info) to another server that stores that info
mfworks54 karma
This is how, even though an entire family might be using the same wifi router (and same IP address), Google can still tell who is visiting what.
mfworks48 karma
In the same vein, our tracking tool has the ability to utilize unique info about each person's user agent (among other techniques) to help weed out the interns from the administration.
Beyond that, there are a suite of data analysis techniques that can further narrow the scope of who we are looking at.
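To make the shared-IP point concrete, here is a minimal Python sketch of how hits from one IP address could be split into separate browsers using the user agent and a cookie ID. The field names (`ip`, `user_agent`, `cookie_id`, `url`) and both helper functions are assumptions for illustration; this is not the project's actual code, and real fingerprinting uses many more signals.

```python
import hashlib
from collections import defaultdict


def browser_key(ip: str, user_agent: str, cookie_id: str = "") -> str:
    """Hash the fields that distinguish one browser from another."""
    raw = "|".join([ip, user_agent, cookie_id])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def group_hits(hits):
    """Group page views by browser key.

    Each hit is a dict with 'ip', 'user_agent', 'url', and an optional 'cookie_id'.
    Returns a dict mapping browser key -> list of visited URLs.
    """
    grouped = defaultdict(list)
    for hit in hits:
        key = browser_key(hit["ip"], hit["user_agent"], hit.get("cookie_id", ""))
        grouped[key].append(hit["url"])
    return dict(grouped)
```

Two people behind the same router with different browsers end up under different keys, which is enough to tell browsers apart but says nothing yet about who is sitting at each one.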
bananahead9 karma
Ah, maybe my question was unclear. I don't doubt that you can see that e.g. 1000 different people are behind one IP address, but how can you tell which are interns and which are members of Congress? Google analytics doesn't really know the answer to that either.
Cookies and browser fingerprinting can help you tie two different web requests to the same browser, but they don't tell you who is using the browser, right?
mfworks20 karma
Yes-- which brings me to point 2:
- In aggregate, members of Congress are going to have different patterns of internet use than interns. As our data grows, this disparity should become more clear.
There are numerous techniques to be able to capture this trend, but I'll mention a few here, as well as some links to good descriptions of them
K-means clustering:
Decision trees: these would work insofar as we choose some function of the data (or metadata) that correlates well with whether or not the user is a representative, and minimize the entropy of that function. The tricky part is figuring out a function that would be an effective differentiator.
K-means can operate using only unlabeled data, meaning that we don't need any browsing history that we know comes from either an intern or a congressperson. Decision trees, by contrast, need at least some labeled examples to split on. Which segues into part 3.
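For readers who haven't seen k-means, here is a minimal pure-Python sketch of the algorithm mentioned above. The choice of features (say, visits per day and share of visits during business hours) is purely hypothetical, and starting the centroids at the first k points is a simplification that keeps the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric tuples.

    Centroids start at the first k points so the result is deterministic.
    Returns (labels, centroids).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids
```

In practice the features would need to be scaled to comparable ranges first, and the clusters only suggest groups of similar browsing behavior; deciding which cluster is "interns" still takes outside knowledge.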
- If a representative (unlikely) or an intern(more likely) is willing to let us tag their history, we can use that data to greatly inform how we differentiate between interns and reps.
Ultimately, there is no surefire way to completely eliminate intern data. However, there are a number of techniques we can use to narrow the scope of our data and ensure that our analytics are as targeted as possible.
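The entropy idea behind decision trees can be sketched in a few lines of Python. The example assumes a handful of labeled rows (the kind of tagged history described above) and a hypothetical numeric feature; the function names and the threshold-style split are illustrative choices, not the project's implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature, threshold):
    """Entropy reduction from splitting rows on rows[i][feature] <= threshold."""
    left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
    right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
    if not left or not right:
        # A split that puts everything on one side tells us nothing.
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted
```

A tree builder would try many feature/threshold pairs and keep the one with the highest gain, which is the "effective differentiator" problem in miniature.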
slan45642 karma
What are the most interesting insights you've uncovered by tracking our government officials' browsing activities?
mfworks510 karma
Right now we are focused on testing the technical aspects of our tool and data management pipeline. We are trying to hit the sweet spot of a few different targets before we release it to the public:
1. The tracking code installation needs to be as simple as possible. We want to keep it to a piece of JavaScript code you can drop into your frontend, with no server-side dev.
2. It needs to be robust enough that it can't easily be disabled. This actually conflicts with our client-side only goals, because the more we allow for server-side implementation, the more it can circumvent methods to disable it.
3. It needs to leave as minimal a privacy footprint as possible. We want to avoid sucking in any non-government data, and that means putting a lot of code in the tool that makes it bulkier and more vulnerable to disabling.
mfworks619 karma
However, to answer your question: What qualifies as 'interesting' will vary depending upon who we are talking to. Obviously, the most eye-catching data we can grab is porn habits or other "embarrassing details," so to speak. However, from a political or legislative position, that data isn't all that informative.
I am most interested in putting the tool on news sites, as I mentioned above: We would be able to see where politicians are getting their news, and potentially how that informs the actions of our federal government as a whole.
On a more voyeuristic note, it would also be really cool to get it on social media sites like Facebook, and a ton of forums and blogs, to try to paint a picture of what kinds of things our representatives do in their free time. Do they like fishing? Do they play church league basketball? It would be cool to see how politicians manage their leisure time and who is in the personal circles of the most powerful people in the country.
slan45163 karma
So it sounds like the tool would require cooperation from web masters (and their devs) to implement at all, right? That sounds tricky...have you had much success in getting sites to implement the client side code?
mfworks149 karma
Great question -- There are a few hurdles to getting a site to implement a tool, but way more hurdles when getting a business to implement the tool on its site.
I mentioned in another answer that we're trying to hit a simplicity-robustness-privacy sweet spot, where we want to make a tool that's easy to install, hard to circumvent, and extremely selective in who it tracks.
Right now we've done a pretty good job with the first and the last parts-- our plugin is super simple to install (just drop the JS code on your site's page) and really private (we do IP address validation on the front end, and then run it through an AWS Lambda filter that automatically does a first pass and removes some public access data that came from their open wifi access points).
Before we release it as open source, however, we also want to:
- make it extremely resistant to tampering/disabling
- make it available across multiple languages (it's a client-side JS script right now, but we're also porting it to PHP, Ruby, Node (server side), and building a WordPress plugin)
- make it so releasing it open source won't make it extremely easy to bypass by Congresspeople and representatives
slan4554 karma
Man these are fascinating problems to account for in designing your solution. Thanks much for your time in answering my questions.
The last big hurdle I can imagine you needing to overcome, which I think you've touched on a little, is how to deal with paring down everyone's data into just the population you're seeking to collect on. I imagine first you'll need to gather all data for everyone, which sounds scary if exploitable.
mfworks60 karma
Yeah, we are running the data through a series of filters (which I won't go into too much detail on), but they basically go like this:
Only federal IPs (will link to them in a sec)
Some low-tech solutions that we found to easily filter out non-gov data
Some slightly-higher-tech solutions that identify data that definitely belongs to Congress/FCC/White House Administration
Some classifiers that will be able to further segment the data the more we receive.
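The first filter in that list (keep only federal IPs) can be sketched with Python's standard `ipaddress` module. The ranges below are reserved documentation blocks used as placeholders, not real government ranges, and the function and field names are assumptions for the example.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks (RFC 5737).
# These are NOT real government ranges; real ones would come from public WHOIS records.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(ip: str) -> bool:
    """True if the address falls inside one of the allowed networks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed addresses are treated as out of scope.
        return False
    return any(addr in net for net in ALLOWED_NETWORKS)


def first_pass(hits):
    """Drop every hit whose source IP is outside the allowed networks."""
    return [hit for hit in hits if is_allowed(hit["ip"])]
```

Running this check before anything is stored is one way to keep non-government traffic out of the data set entirely, with the later filters and classifiers working only on what survives.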
pjguy2000145 karma
Have you encountered any suspicious activity in their browsing history?
mfworks172 karma
I'm not really in a position to talk about the data we've collected so far; we're actually treading carefully around what/how we release the data we collect (1/?)
mfworks113 karma
First, to answer your question: what we have learned so far has certainly been informative (even if in some cases informative means nobody in Congress really cares about a certain subject enough to visit any sites related to it at all).
There are a few lines of reasoning behind this:
1. How can we balance releasing timely insights about our data with Congress' ability to use that to identify and work around our collection methods?
2. What is our exposure legally from different methods of releasing the data?
3. Are there any ways that releasing certain data can undermine our fundamental message (we are trying to shed light on the ISP Privacy Law that passed in April)? (2/?)
mfworks85 karma
- The more data we release earlier, obviously, the more Congress will be able to react to those releases and mitigate their exposure. This is similar to how investigative journalists frequently operate-- if they find a potentially interesting story, they don't publish immediately. They'll keep pulling threads while staying low key to try to get as complete a picture as possible.
If we find an interesting trend in the data, and then release it immediately, we could be tipping our hand to Congress early, and they can react and change their browsing patterns accordingly. So we're erring on the side of keeping our info quiet for the time being to try to gather as much unfiltered behavior as possible.
mfworks69 karma
"2." We've had a lot of help from really cool and awesome legal experts-- Ben Wizner of the ACLU and Anne Klinefelter at UNC, to name a few. What we've found is that our exposure in this project is minimal, and in some cases it's the sites that use the plugin that may be at risk.
To mitigate this, our legal team has built a privacy policy addendum that we include with the plugin that eliminates exposure for the sites that use it-- they attach that to their privacy policy and they are good.
mfworks84 karma
"3." The issue with having raw data, and with releasing raw data, is that once we hit a critical mass, it gets frighteningly easy to "de-anonymize" the data. That is-- use it in conjunction with public domain info to start to build browsing profiles, and attach those profiles to public figures.
Jessica Su, a computer scientist at Stanford, actually did this with some sample data, and showed how easy it is to do-- http://randomwalker.info/publications/browsing-history-deanonymization.pdf
We don't want to release this data and then have the internet collectively weaponize this data against our representatives. So we need to be really careful with how we use it.
Wertsir67 karma
Why do you post things in multiple posts like that? You are far from going over the character limit.
mfworks8 karma
At the time of writing I was getting ~10-15 new questions every time I refreshed the page. I didn't want to spend 20 minutes answering one question and neglect others, so I would post and refresh for each salient point so I could make sure I addressed a spectrum of inquiries.
goat_reaper81 karma
I have a lot of questions but I'll limit myself to two lol
What advice do you have for an aspiring data scientist? And what projects would provide proof to an employer that I can do the job? Thanks
mfworks138 karma
What advice do you have for an aspiring data scientist?
Most of the other data scientists I know are self-directed one way or another.
The first thing I did was get familiar enough with Python that I wasn't tripping over my own code when I wanted to start working with data. I went through Codecademy's Python course, then did some HackerRank data structures problems with Python.
After that I went through and read about a lot of the basic DS implementations and did a version of my own. I started with probability and statistics, then went on and wrote a basic Python implementation of nearest neighbor, multiple (and polynomial) regression, decision trees, a basic feed-forward neural net, k-means and hierarchical clustering, and a few others.
After that, I definitely recommend Kaggle. I don't do a lot of competitive coding, but Kaggle has a ton of open data sets so you can dive into a project about whatever data interests you. I did a couple projects on beer types and brewing distros, and now I'm working through the 2016 election data.
My biggest piece of advice would be not to sweat a lot of the more hyped stuff (neural nets, SVM, etc.) because 90% of the time you're going to be able to do really, really cool stuff really easily with way simpler DS solutions.
mfworks53 karma
And what projects would provide proof to an employer that I can do the job?
I kind of cheated: I already worked at my current employer as a Java midtier developer, and now I run a DS meetup here. Showing the people here my basic implementations has opened a ton of doors, and really helped with my confidence in my abilities. So I would say definitely go through that process, then just work consistently on projects you find interesting.
Interview-wise, put the projects on your resume and keep the narrative of how you got started in your back pocket. I've found showing your projects, then telling the story of self-directing to where you're at now is really compelling to potential employers, and can be a plus if you've shown consistent, meaningful work.
goat_reaper21 karma
Thank you! I've been working with Python actively for three months, and on and off before then. I'm self-studying, as college isn't financially an option right now, and working two jobs.
If you're willing to answer one more question, what languages or technologies do you recommend? (I know you mentioned Python and Java, but SQL and Azure are some I've been considering) Thanks again!
mfworks34 karma
Happy to answer as many questions as you have :)
The infamous language debate :)
Depending on what you are trying to accomplish, it makes sense to learn different languages. However, I would also suggest a larger philosophy of how you approach the languages you use.
First: If you're interested or working in Data Science (and it sounds like you are) I would highly recommend Python. It sits at this nexus of readability, ease of use, power, and elegance that I haven't seen in any other mainstream language (I would say JavaScript is a close second). Also, it continues to grow at a breakneck rate across a ton of industries, so it's also one of the most future-proof languages when it comes to employability :)
If you fit into one of these buckets:
- You love open source and want to contribute to the open source community
- You have a ton of ideas and you want to build projects around them
- You want to get into web development
I would recommend Node.js + JavaScript. Full disclosure, Node (and JavaScript) was the first non-statically typed language I really dug into, and after I got used to the idea of no compile-time checks for type safety, I loved how fast I could take an idea and turn it into an actual solution.
JS gets a bad rap, and to be fair, has a lot of weird quirks that are holdovers from its long and checkered history as a language, but once you get past them, it's absolutely intoxicating how easy and fun it is to develop CS solutions with.
If you are interested in JS, I would highly recommend "JavaScript: The Good Parts" by Douglas Crockford.
Douglas has been a somewhat controversial figure due to his extremely strong opinions on the 'right' way to code in JS (and his refusal to allow JSLint customizations led to a split in the lint dev community), but I haven't read or heard of a better intro to the JavaScript language than that book.
mfworks15 karma
I will continue to answer this question, but need to refresh the page real quick and make sure I'm answering other q's as well :)
shipworth56 karma
Are these people on "interesting" (cutting edge, hip, avant garde, risqué) sites a lot or are they checking AOL or CNN/FOX/MSNBC?
Are any of them (what ~%) researching policy positions in a meaningful way?
mfworks45 karma
Getting our tool on major news sites is a long process that we are currently working through.
However, govtrack.us has some very informative policy and bill information, and publicly releases some of the real-time data of what Congress is visiting on their site. Very cool way to see what bills Congress is interested in on a given day--
trai_dep30 karma
You're designing your code to be portable so that other groups can also use it to increase transparency in regard to their governments.
Have you had any conversations regarding how non-American implementations might differ? I'd imagine German, French, let alone Turkish, Chinese or Saudi Arabian implementations of this would provide different – uhh – insights.
Is what you're doing portable enough to work in more repressive nations? Not that the US is all that now (sadly), but I'd imagine Turkey would present very different challenges.
Are you planning on having some kind of starter pack for non-profits in foreign locales, both to run the analytics and also basics on how to reach out to their local journalists so news can get out for their findings?
PM me if you'd like some suggested articles on how journalists might protect themselves when they're in a hostile environment while they're doing their important work. The EFF, Privacy International and the Freedom of the Press Foundation are excellent sites with many resources, FWIW.
mfworks16 karma
Really, really cool question.
The code would be super easy to implement for any world government, and one of our hopes when we make it open source is that people can take it and modify it to those ends.
The key in our case (and govtrack's, and CongressEdit's) is that the WHOIS records of the IPs that belong to the House, Senate, FCC, and White House are in the public domain. I'm not sure if the same is true for other countries.
mfworks7 karma
In terms of starter packs, the analytics stack we put together would be really useful info to have. We were actually thinking of packaging the solution we created and doing a non-profit cloud hosted clickstream analytics service that sites could use instead of Google Analytics. The benefit would be that we wouldn't own any of the data they route through us, so they wouldn't be contributing to some unholy data mega-farm that knows everything about everyone, a la Google and Facebook.
It turns out, however, that Piwik already offers this really cheaply, so my advice would be to use our code with the Piwik cloud-hosted analytics service.
mfworks21 karma
Not really.
In fact, if you are relying on plugins or Chrome extensions to protect your privacy, incognito can hurt you because it disables those by default.
In terms of privacy, incognito…
1. Doesn’t keep a record of your site visits on your computer
2. Doesn’t use existing cookies. It will store new cookies from your incognito session for the duration of your session.
One of Speak Together's other projects is a multi-part service that will be able to protect people from the worst parts of this law.
Essentially it works like this:
We build a demand letter that opts you out from ISP data collection (this works because of FTC regs still in place that require ISPs to respect opt out requests of personal data)
The money we raise goes into a legal fund that we will use to bankroll civil cases against ISPs when they abuse your data/ignore opt out requests (and they almost definitely will abuse it; ISPs have a dubious history of following the law or respecting consumer rights as it is)
We have a lot of evidence that this works in other industries. Jon's previous engagement was running a company called CellBreaker, which did a similar thing with predatory cell phone contracts in the wireless data contract space. Even though the total number of contracts wasn't huge (high 3 figures or low 4 figures, I think), it was effective enough that the "contract free" marketing that came out was specifically using language from the demand letters he sent, and all the promotional pages for their campaign had "CellBreaker," "cell breaker," and other brand-specific keywords explicitly listed in their meta-info (the data sites put in their pages so search engines can rank them).
Right now we're raising 10k to see if the model also works with ISP opt outs, and if so, we're going to try to scale it and get everyone we can protected-- hopefully making this law worthless without having to get it repealed
The_Third_Three12 karma
As a recent graduate with a degree in mathematics and beginner's knowledge of R, SQL, and Python, what could a guy do to get into entry-level data analysis? I'm not having much luck in my job hunt.
mfworks11 karma
Hmm good question-- I just landed, give me a second to find a corner in this terminal somewhere and I'll respond to this
fifibuci3 karma
Sorry if I missed something obvious, but is your tool working with Reddit?
mfworks5 karma
I would loveee to have our tool on reddit. They aren't returning my emails :/
mfworks5 karma
If anything happens to me the data in its entirety will be released by an anonymous entity from an undisclosed location
trai_dep2 karma
Hey guys –
Great work and best of luck. Just wanted to let you know we're cross-posting this over in r/Privacy, so hopefully you'll get some questions from them as well!
Do you think that if Congress people find themselves under the same lens they're trying to turn on us, they'll re-consider their views on privacy? I'd like to believe so – self-interest does wonders for one's clarity of thought – but this hope is tempered by the rank hypocrisy that fuels much of Washington.
mfworks5 karma
I can't remember who specifically this was, but someone I follow on Quora said it really well:
We're probably not going to change the minds of any of the hard-core anti-privacy individuals (i.e. most of the people who voted for this law).
But in some ways, we're not really trying to. We are trying to educate around them. They may not change, but I'm sure there are plenty of people in their constituencies who do care and just haven't heard about what is going on. We are trying to reach them, because if they know about this law and are angry about it, they can reach out to their reps about it. And if their reps don't change, they can vote them out.
LadiesWhoPunch2 karma
What are some of the websites you have on board already? Is Reddit one of them?
mfworks6 karma
No Reddit right now. I'm not being super open about which sites we have so far, because there are still few enough of them (only 20) that Congress could just firewall access to all of them, and then we're hosed for this phase of our release (testing our plugin).
fifibuci1 karma
I have a question on a slightly different track.
Have you given much consideration to security and what you will do in the case of a hack or demand for information? Has there been anything of the sort as of yet (blink twice for ice cream)? What will you do if you find that you are on the receiving end of an NSL?
mfworks2 karma
It would be really strange if we got a National Security Letter, because all the data we have is just a less complete version of what the Federal IT department already has in their router logs.
Plus the NSA is already tracking all our browsing history anyway, so it's not like I have anything super valuable to them as is :P
lusciouslou1 karma
I'm a schmuck, and I honestly don't know the answer. Why should the fact that ISPs have access to my search history matter to me?
mfworks3 karma
One of the most common questions I get asked, this is actually a really, really interesting one to talk about.
Let me tell you about Richard Guthrie.
Richard was a 92-year-old veteran. He had lived in the same Iowa home for decades, fathering 8 children, surviving his wife, who had passed away a decade prior, and faithfully squirrelling away a life savings for his family. Then one day Richard was robbed, over the phone, by some men who pretended to be government employees. Richard lost his entire life’s savings.
How did this happen? InfoUSA, a "Consumer Data Clearing House," sold his personal info, along with that of over 3 million other elderly individuals, to known scam artists. Those artists swept through this list of senior citizens-- our parents and grandparents-- and rinsed them for over $100 million.
And that’s the problem. Data doesn’t just work for the good guys-- it can grant anyone enormous influence over our behavior-- which is why you would be wise to be selective about who you grant access to that data. But now you don’t even have that option: ISPs decide who gets access, not you, and their only criterion seems to be “who is the highest bidder?”
clampie0 karma
Because of your ideological and activist leftist bent, how can your data be trusted?
I-Dissent964 karma
What kind of porn are they watching?