Now things start to get really exciting: I extended my Python hack to collect information on all relationships for a set of Twitter users. More specifically, I record the following:
- Each tweeter becomes a node
- Each #hashtag becomes a node
- A “mention” becomes a directed relationship from the “mentioner” to the “mentioned”
- A #hashtag in a tweet from a user creates a directed relationship from that user to the #hashtag.
I started the collection by iterating over all the Twitter users I follow, and for each such user, collecting their 200 latest tweets and analyzing each tweet for mentions and #hashtags. Each mentioned user is in turn analyzed, and is inserted into the list of nodes to be analyzed, if it isn't there already.
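The collection logic above can be sketched as a breadth-first expansion over the mention graph. This is a minimal, self-contained illustration, not my actual script: `fetch_tweets` is a hypothetical stand-in for the Twitter API call (stubbed with canned data here), and the names are made up for the example.

```python
import re
from collections import deque

# Hypothetical canned data standing in for the Twitter API, so the
# sketch runs without credentials. In reality this would be an API
# call returning a user's 200 latest tweets.
SAMPLE_TWEETS = {
    "alice": ["Great talk by @bob on #privacy", "Reading up on #bigdata"],
    "bob":   ["Thanks @alice! More on #privacy soon"],
}

def fetch_tweets(user, count=200):
    return SAMPLE_TWEETS.get(user, [])[:count]

MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")

def build_graph(seed_users, max_nodes=1000):
    """Expand breadth-first: every newly mentioned user is queued
    for analysis the first time it is seen."""
    nodes = set()   # users and "#tag" strings
    edges = {}      # (source, target) -> count, directed
    queue = deque(seed_users)
    seen = set(seed_users)
    while queue and len(seen) < max_nodes:
        user = queue.popleft()
        nodes.add(user)
        for tweet in fetch_tweets(user):
            # A mention: directed edge from mentioner to mentioned.
            for mentioned in MENTION.findall(tweet):
                nodes.add(mentioned)
                edges[(user, mentioned)] = edges.get((user, mentioned), 0) + 1
                if mentioned not in seen:
                    seen.add(mentioned)
                    queue.append(mentioned)
            # A hashtag: directed edge from the user to the #hashtag node.
            for tag in HASHTAG.findall(tweet):
                node = "#" + tag
                nodes.add(node)
                edges[(user, node)] = edges.get((user, node), 0) + 1
    return nodes, edges

nodes, edges = build_graph(["alice"])
```

Starting from "alice" alone, the sketch discovers "bob" through a mention and then analyzes his tweets too, which is exactly how a couple of hundred seed users can snowball into tens of thousands of nodes.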
Through this "recursiveness", the number of nodes "explodes": starting with the 200+ users in my initial following list, I quickly end up with some 30000 nodes, and even more relationships. So, in my Gephi model, there are about 60000 entities.
This amount of data starts to become painful for my 4-core laptop to handle in Gephi: refining the layout now seems to take hours, as opposed to the seconds or minutes for the previous examples I've done. Still, I'm impressed that Gephi can lay out this amount of data at all – I know of many commercial visual modeling products that have problems with data sets 2-3 orders of magnitude smaller…
Another interesting observation from analyzing this data is that this kind of "big data" & "analytics" approach really gives new insights into the population under focus: by looking at, e.g., the relationships between a specific person and #hashtags, I can quickly extend my knowledge about that person's interests. I've done a few such probes into the data set, looking at the data & relationships of a few people I thought I knew fairly well, and there were clearly some surprises…
Another example is that it's very revealing to observe the volume of communication between any two nodes – this is exactly the kind of "metadata" that has been so much discussed in the #NSA and #Snowden case, around #privacy and #surveillance.
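Computing that communication volume from the collected mentions is a one-liner once you ignore direction. A small sketch, with made-up data for illustration:

```python
from collections import Counter

# Hypothetical mention records: (mentioner, mentioned), one per mention.
mentions = [
    ("alice", "bob"), ("bob", "alice"), ("alice", "bob"),
    ("carol", "alice"),
]

# Volume of communication between each pair of nodes, regardless of
# who mentioned whom: a frozenset collapses both directions into one key.
volume = Counter(frozenset(pair) for pair in mentions)

print(volume[frozenset({"alice", "bob"})])   # 3 mentions in either direction
print(volume[frozenset({"alice", "carol"})]) # 1
```

Who talks to whom, and how often – no message content needed.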
To summarize the big data analysis & privacy & surveillance perspective: IMO, even this tiny dataset (compared to NSA's, GCHQ's & FRA's data collection), which I collected in a matter of a few minutes, gives me plenty of new information about the underlying population, their interests and their relationships. And note that this data is collected from a source – Twitter – where people voluntarily publish information.
I hope this shows why the mass surveillance of closed sources, as practiced by NSA, GCHQ, FRA and their ilk, is such a bad idea, even if "you have nothing to hide"…
Imagine what kind of data I'd be able to dig up about you if I analyzed your Google search history… As @Mikko Hypponen says:
“Your search history knows more about you than your family or closest friends”