This website Footnote 2 was used as a platform to collect tweet IDs Footnote 3; it provides researchers with metadata from a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). [...] (i.e., the historical limit when requesting tweets based on a search query). The R package ‘rtweet’ and its complementary ‘lookup_status’ function were used to collect tweets in JSON format. The JSON file comprises a table with the tweets’ information, such as the creation date, the tweet text, and the source (i.e., type of Twitter client).
Data cleaning and preprocessing
The JSON Footnote 4 files were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) Tweets from users in the top 0.5 percentile of user activity were removed because we considered these users non-representative of the normal user population (e.g., users who created more than 2000 tweets within four weeks). (2) Tweets from users with early access to the 280-character limit were removed. (3) Tweets from users who were not represented in both the pre-CLC and post-CLC datasets were removed; this procedure ensured a consistent user sample over time (within-group design, N_users = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
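Criteria (1) and (3) can be illustrated with a minimal Python sketch. It is not the authors' R code. The function name `filter_users`, the `(user_id, text)` tuple layout, and the choice to rank activity over both periods combined are assumptions made for illustration; the text does not specify how activity was computed.

```python
from collections import Counter

def filter_users(pre, post, top_fraction=0.005):
    """Filter two lists of (user_id, text) tuples.

    Hypothetical helper: drops the most active users (top_fraction of all
    users, ranked by tweet count over both periods) and then keeps only
    users who appear in both the pre and the post list.
    """
    # Count tweets per user across both periods (an assumption).
    counts = Counter(user for user, _ in pre + post)

    # Rank users from most to least active and mark the top fraction.
    ranked = sorted(counts, key=counts.get, reverse=True)
    n_drop = int(len(ranked) * top_fraction)
    heavy_users = set(ranked[:n_drop])

    # Keep users present in both periods, minus the heavy users.
    pre_users = {user for user, _ in pre}
    post_users = {user for user, _ in post}
    keep = (pre_users & post_users) - heavy_users

    pre_kept = [row for row in pre if row[0] in keep]
    post_kept = [row for row in post if row[0] in keep]
    return pre_kept, post_kept
```

With the default `top_fraction=0.005`, no user is dropped for activity unless the sample contains at least 200 users, because the number of users to drop is rounded down.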
The tweet texts were converted into ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when located within the tweet. However, URLs do not add to the character count when they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users had to deal with, tweets with URLs (but not media URLs such as added pictures or videos) were excluded.
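The text-cleaning steps can be sketched in Python as follows. This is an illustrative sketch, not the authors' R code. The regular expressions and the names `has_url` and `clean_text` are assumptions. The sketch drops non-ASCII characters outright, which may differ from the conversion actually used, and it does not distinguish media URLs from other URLs, which requires tweet metadata.

```python
import re

# Assumed patterns: any http(s) URL, and @-prefixed screen-name references.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def has_url(text):
    """Return True if the tweet text contains a URL (candidate for exclusion)."""
    return URL_RE.search(text) is not None

def clean_text(text):
    """Remove screen-name references and line breaks, then reduce to ASCII."""
    text = MENTION_RE.sub(" ", text)
    text = text.replace("\r", " ").replace("\n", " ")
    # Drop characters that cannot be represented in ASCII (an assumption).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

In this sketch, tweets for which `has_url` returns True would be excluded before `clean_text` is applied to the remaining tweets.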
Token and bigram analysis
The R package Footnote 5 ‘quanteda’ was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation [...]). In addition, token-frequency matrices were computed with: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC [P(token post)], and T-scores. The T-test is similar to a standard T-statistic and computes the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in the analysis is presented as Eq. (1) and (2). N is the total number of tokens per dataset (i.e., pre and post-CLC). This equation is based on the method for linguistic analysis by Church et al. (1991; Tjong Kim Sang, 2011).
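A T-score of this kind can be sketched in Python under one stated assumption: the variance of a relative frequency P is approximated by P/N, a common reading of the Church et al. (1991) approach. The sketch is not necessarily identical to Eq. (1) and (2), and the function name `t_scores` is hypothetical.

```python
from collections import Counter
from math import sqrt

def t_scores(pre_tokens, post_tokens):
    """Compute a per-token T-score from two non-empty token lists.

    Assumed form: T = (P_post - P_pre) / sqrt(P_post / N_post + P_pre / N_pre),
    where P is a relative frequency and N the total token count per dataset.
    Positive scores mean the token is relatively more frequent post-CLC.
    """
    f_pre, f_post = Counter(pre_tokens), Counter(post_tokens)
    n_pre, n_post = len(pre_tokens), len(post_tokens)

    scores = {}
    for token in set(f_pre) | set(f_post):
        p_pre = f_pre[token] / n_pre      # relative frequency pre-CLC
        p_post = f_post[token] / n_post   # relative frequency post-CLC
        # Each token occurs in at least one dataset, so the denominator is > 0.
        scores[token] = (p_post - p_pre) / sqrt(p_post / n_post + p_pre / n_pre)
    return scores
```

Called on two token lists, the function returns a dictionary mapping each token to its score, so that tokens can be ranked from most characteristic of the pre-CLC data (most negative) to most characteristic of the post-CLC data (most positive).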
Part-of-speech (POS) analysis
The R package Footnote 6 ‘openNLP’ was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctives, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger uses a maximum entropy (maxent) probability model to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on the CoNLL-X Alpino Dutch Treebank data (Buchholz and [...]). The openNLP POS model has been reported with an accuracy score of 87.3% when used for English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the accuracy of the POS tagger. However, the same analyses were performed on both the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger should be consistent across both datasets. Therefore, we assume there are no systematic confounds.