[eng] Online Social Networks are a key source of information about human interactions, owing
to their widespread use in contemporary society. In this work, a weighted, directed network was built
from Twitter reply data from three countries with varying population sizes over an
eight-year-long observation window. Once the network is built, community detection methods are
applied to find densely connected clusters of users. These communities are then studied from a
thematic point of view, treating hashtags as memes through which members of a community
share ideas and common interests. A statistical study is conducted on the variety and repetition rate
of hashtags inside communities, and the similarity between pairs of groups is quantified. The
objective is to test whether online communication through hashtags on Twitter follows two trends. First,
whether the number of unique hashtags grows with community size according to Heaps’ law, a well-known
law for written texts in which the number of unique words grows sublinearly with text length. Second,
whether hashtag use in such groups changes around a boundary of SD = 150 members, which has been
proposed as the limit of stable social relationships a human being is able to maintain, following the
ideas of Robin Dunbar. Below this threshold, communities should behave more like groups of close
acquaintances in real life, exhibiting a wide range of topics that are repeated less; larger groups should
instead aggregate users who follow a certain topic, thus exhibiting a higher repetition rate.
It also follows from the second idea that there should be more nonzero similarity values for pairs
of small groups, since covering a larger number of hashtags with less repetition should lead to some
overlap in the topics they cover, unlike pairs of big communities, which should show a
large number of zero similarity values because their hashtag distributions are peaked around certain topics.
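As an illustration of the first hypothesis, Heaps’ law can be written as V(n) = K n^β with 0 < β < 1, so the exponent β can be estimated with a straight-line fit on log-transformed data. The sketch below is only a minimal example under stated assumptions, not the procedure used in this work: the function names `fit_heaps` and `repetition_rate` are hypothetical, community size plays the role of text length, and the repetition rate is assumed to be the total number of hashtag uses divided by the number of distinct hashtags.

```python
import math


def fit_heaps(sizes, unique_counts):
    """Estimate (K, beta) in V(n) = K * n**beta by least squares on logs.

    sizes: community sizes (assumed analogue of text length), all > 0.
    unique_counts: number of distinct hashtags per community, all > 0.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in unique_counts]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct community sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                       # slope of the log-log line
    k = math.exp(mean_y - beta * mean_x)   # intercept, back-transformed
    return k, beta


def repetition_rate(hashtags):
    """Average uses per distinct hashtag (assumed definition)."""
    if not hashtags:
        raise ValueError("empty hashtag list")
    return len(hashtags) / len(set(hashtags))
```

A fitted β clearly below 1 would indicate sublinear growth; comparing `repetition_rate` for communities below and above 150 members is one simple way to probe the second hypothesis.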
First, the large volume of data gathered made the desired metrics computationally expensive
to obtain and forced us to sample a small number of communities for each of
the countries. Moreover, data from the most populated country had to be left out as a result of its
dimensionality. As for the hypotheses, the growth of unique hashtags as a function of group size was
confirmed, although the curve does not resemble Heaps’ law, since the growth is very
slow. The separation that follows from Dunbar’s results shows that the repetition
rate is indeed higher for more populated groups of users than for small groups. Finally, the
similarity measure implemented for this work does not yield conclusive results for
testing our hypothesis, mainly because of the community sampling that was conducted and
the limited preprocessing applied to hashtags. In future work, the implementation
should be improved to address lemmatization, intruder hashtags that do not
belong naturally to their respective networks, and computational efficiency, using Big Data tools
to reduce the cost of handling several million data entries. We also propose building a network
of communities whose link weights are proportional to their similarity, to which a
propagation model for hashtags can be applied to emulate the transmission of specific topics.
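The proposed network of communities can be sketched as follows. The similarity measure used in this work is not specified here, so this example swaps in the Jaccard index over hashtag sets purely as an assumption; the function names `jaccard` and `similarity_edges` and the input layout (a mapping from community label to its list of hashtags) are likewise hypothetical.

```python
from itertools import combinations


def jaccard(tags_a, tags_b):
    """Jaccard index of two hashtag collections (assumed similarity measure)."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(community_hashtags):
    """Return {(u, v): weight} for community pairs with nonzero similarity.

    community_hashtags: dict mapping a community label to its hashtags.
    Pairs with zero similarity are left out, so they carry no link.
    """
    edges = {}
    for u, v in combinations(sorted(community_hashtags), 2):
        weight = jaccard(community_hashtags[u], community_hashtags[v])
        if weight > 0:
            edges[(u, v)] = weight
    return edges
```

The resulting edge dictionary can be loaded into any graph library as a weighted, undirected network; a propagation model would then run on top of it.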