Computational Communication Science Conference & LDA for analysing visual frames

Last month it finally happened: the Computational Communication Science conference took place in Hannover! We had worked towards this as a team for almost two years. More than 100 participants took part in the discussions, and method workshops were held in advance. Apart from some minor glitches, the conference was a complete success. As the main organizer, I wasn't able to enjoy the conference all that much myself, but the extremely positive feedback from both young and senior scholars makes me very happy in retrospect. Between organizational tasks, I also found time to attend Wouter van Atteveldt's workshop on topic modeling. The workshop was fantastic and helped me a lot with my project on the visual framing of politicians on Facebook. To identify the frames, I will now perform a latent Dirichlet allocation (LDA). At the moment I am still annotating the images through the Google Vision API. With over 350,000 photo posts, this unfortunately takes some time. But the syntax is surprisingly simple, because with RoogleVision there is already an R package for this purpose. My code for this task will be on GitHub shortly.
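In essence, one call per image does the job. Here is a minimal sketch of the RoogleVision setup, assuming authentication via googleAuthR (credentials and the file path are placeholders, not my actual values):

```r
# Minimal sketch: label one image via the Google Vision API using RoogleVision.
library(googleAuthR)
library(RoogleVision)

options("googleAuthR.client_id" = "xxx.apps.googleusercontent.com",   # placeholder
        "googleAuthR.client_secret" = "xxx",                          # placeholder
        "googleAuthR.scopes.selected" =
          "https://www.googleapis.com/auth/cloud-platform")
gar_auth()  # opens the browser once to authorize

# Returns a data frame of labels with confidence scores for one photo
labels <- getGoogleVisionResponse(imagePath = "photos/example_post.jpg",
                                  feature = "LABEL_DETECTION",
                                  numResults = 10)
```

The labels of all photos can then serve as the "documents" for the LDA.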

Politicians Facebook Posts: First descriptive results on parties and politicians

During the last weeks I was busy getting an overview of the data. Being kind of a graphics nerd, I wanted to create not only functional but also aesthetic outputs. While ggplot2 certainly has its quirks, it was fun to puzzle out charts that worked for me. Here come the first results.

The basis of the data collection was a list of the 2653 politicians who were either members of the 18th German Bundestag or running as candidates for the 19th election period starting in 2017. Since the parties are free to nominate as many candidates as they like, this distribution hardly resembles the voting share of any election. The politicians are divided among the parties as follows:

As stated in a previous post, naturally not all politicians maintain Facebook profiles, nor are their profiles necessarily public and available for download. The final sample of politicians is shown below:

Clearly, the politicians of some parties are much more involved in Facebook campaigning than others. For instance, nearly as many SPD politicians as CDU/CSU politicians have a public profile, although there were fewer names on the list. Further, the FDP politicians seem to be quite present on the platform, while politicians of the Grüne are not so well represented in the Facebook sample. To make this insight more comparable, the following chart presents the ratio of politicians with a Facebook profile per party.

The chart makes it clear that the FDP politicians maintain profiles above all others (65 percent). Looking at individual FDP profiles, one can see that the FDP social media team provided uniform visual material that was distributed widely among the candidates. Besides the likewise strong presence of the SPD on Facebook, it may come as a surprise that the AfD candidates are not so strongly represented. After all, the party's success was partly attributed to its successful presence in the social networks. At the level of individual politicians, however, no particular quantity can be found. Also surprising is the below-average presence of the Greens in the sample. Although the party appeals to a rather young target group and the Green parliamentarians are also younger than their colleagues from other parties, the Greens have the fewest Facebook profiles, both in absolute and in relative terms.

Well, although these findings reveal interesting background knowledge, the parties are not the main object of the current study, which is concerned with the strategic communication of individual politicians. On the politician level, I first analyzed popularity in terms of fan count. Unsurprisingly, current chancellor Angela Merkel leads the Top 10 by a wide margin with more than 2.5 million fans. Second is not her main contender Martin Schulz (SPD) but former opposition leader Gregor Gysi (Die Linke). Both politicians have close to 0.5 million fans and are followed by the current parliamentary party leader of Die Linke, Sahra Wagenknecht (0.4 million fans). The leading candidate of the FDP, Christian Lindner, cannot contend with these numbers; at least he has 0.2 million fans. On rank 6 follows Frauke Petry, the former party spokeswoman of the AfD who left the party right after the election. Because she was still a member for the main part of the enquiry period, I will keep her as an AfD member. Next, on rank 7, is the first leading politician of the Greens, Cem Özdemir (0.1 million fans). This completes all the factions now represented in the Bundestag. On ranks 8 to 10, more leading candidates from the AfD (Alice Weidel), CDU (Jens Spahn) and SPD (Sigmar Gabriel) follow. From 10th place in this ranking, the number of fans drops below the 100,000 mark.

Rank  Name  Party  Fans
1  Angela Merkel  CDU  2,522,139
2  Gregor Gysi  DIE LINKE  476,350
3  Martin Schulz  SPD  470,114
4  Sahra Wagenknecht  DIE LINKE  405,792
5  Christian Lindner  FDP  243,822
6  Frauke Petry  AfD/independent  214,936
7  Cem Özdemir  GRÜNE  142,475
8  Alice Weidel  AfD  108,334
9  Jens Spahn  CDU  105,997
10  Sigmar Gabriel  SPD  83,299

The next step will be to analyze the > 710,000 collected posts with regard to the buzz they created as well as their textual and visual aspects. I have already calculated some results, but I'm still searching for a good way to present them online. This is difficult because so many pictures are involved…

Politicians Facebook Posts: Lab report on data collection

Since the data collection for my project on the strategic communication of politicians on Facebook has been completed, it's about time I write an extensive lab report on how it went. I have experimented with web scraping in R and Python for a while now, but this was by far the most extensive data collection I have ever conducted. In total, I collected the Facebook posts of 1,398 political candidates over the last four years, covering the whole 18th election period of the German Bundestag. The total sample amounted to about 710,000 posts. Of those, ca. 390,000 were classified as photo posts, and in order to also consider the visual aspects of strategic framing, I collected those as well.

The starting point of my data collection was a list of the 2653 politicians who were running in the German federal election in 2017 (19th election period of the German parliament) as well as those parliamentarians who were members during the preceding election period but did not compete in the current election. This list was compiled using several online sources, such as the website of the German Bundestag, the website wenwaehlen.de (an initiative comparable to Wahl-O-Mat, where voters can check the opinions of individual candidates, not only parties), as well as Wikipedia. These sources supply the social media links for some of the politicians, but it turned out that many Facebook profiles were missed in automated data collection and some Facebook links seemed to be outdated. Thus, I corrected and supplemented the lists with the politicians' Facebook identifiers in a manual search (*phew!*). The search resulted in 2066 Facebook profiles of election candidates. Conversely, this means that no Facebook profile was available for 587 politicians (22.1 percent of all candidates). Although one could assume that a Facebook profile is a standard instrument in modern campaigning, even some very prominent politicians do not maintain one. For instance, Federal Minister of the Interior Thomas de Maizière (CDU) dropped out of the sample for this reason.

My list of election candidates contains other social media links as well, but I did not systematically check those, because they are not relevant for the specific research purpose of the current project. Anyway, for those who are interested in conducting a similar analysis of German parliamentarians in social media, the list can be downloaded here and will also be available on GitHub.

Next, I had to choose an appropriate storage solution for the data. Since all individual records are relatively homogeneous in their attributes and the proposed data records are relationally related to each other (politicians' Facebook profiles, their posts, as well as the visuals contained in these posts), a relational SQL database was the natural choice. Moreover, SQL databases can store not only textual but also visual data as BLOB objects, which was a further advantage. Thus, I installed MariaDB as well as phpMyAdmin on my server and was ready for data collection.
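In essence, the design boils down to three related tables. A minimal sketch of what this could look like (table and column names are simplified for illustration, not my exact schema):

```r
# Illustrative schema: one table per entity, linked by foreign keys.
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "fb_politicians", host = "localhost",
                 user = "user", password = "***")  # placeholder credentials

dbSendQuery(con, "CREATE TABLE politicians (
  politician_id INT PRIMARY KEY,
  name          VARCHAR(255),
  party         VARCHAR(64),
  fb_id         VARCHAR(32))")

dbSendQuery(con, "CREATE TABLE posts (
  post_id       VARCHAR(64) PRIMARY KEY,
  politician_id INT,
  created_time  DATETIME,
  message       TEXT,
  FOREIGN KEY (politician_id) REFERENCES politicians(politician_id))")

dbSendQuery(con, "CREATE TABLE visuals (
  post_id VARCHAR(64),
  image   MEDIUMBLOB,  -- the picture itself, stored as a blob
  FOREIGN KEY (post_id) REFERENCES posts(post_id))")
```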

I chose to conduct the scraping as well as the analysis in R rather than Python. Besides a general preference for the R language, which has a low entry barrier and is very flexible, my main reason for relying on R is that it is by far more compatible with my colleagues in communication science than Python is. Moreover, this project should serve as a proof of what the R language is capable of with regard to openness: I wanted my own showcase of how R can be used for the entire research process, from data collection to analysis to publishing. And last but not least, there is already a package to access the Facebook API via R: Rfacebook. Although it doesn't solve every problem (more about this later), the package considerably facilitated my data collection (for this project I used the latest stable version 0.6.15).

The first step in the data collection was to store the .csv file of politicians in the database. Doesn't sound too difficult, does it? Well, it nearly freaked me out! The challenge was to get the encoding right. I don't know if this is a problem that only Windows users will encounter. I finally found a workaround, which I document here to remember it on future occasions (a condensed code sketch follows below the list):

  • Save the .csv file from Excel as “separated by separators” (“mit Trennzeichen getrennt”).
  • Open the .csv in the simple editor provided by Windows and save it again with encoding = “UTF-8”.
  • When importing this .csv file in R, set the encoding attribute to „UTF-8_bin“ in read.csv2(). Weirdly, when I check the data frame with the View() function in R after this procedure, it seems all messed up. More importantly, though, the import into the SQL database works correctly.
  • Put the data frame into the database using the RMySQL::dbWriteTable function.
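Condensed into code, the workaround looks roughly like this (file name, credentials and table name are placeholders; the exact encoding argument may need tweaking, as described above):

```r
# Sketch of the import pipeline described in the list above.
library(RMySQL)

politicians <- read.csv2("politicians_utf8.csv", encoding = "UTF-8",
                         stringsAsFactors = FALSE)

con <- dbConnect(MySQL(), dbname = "fb_politicians", host = "localhost",
                 user = "user", password = "***")
dbSendQuery(con, "SET NAMES utf8")  # make the connection itself speak UTF-8
dbWriteTable(con, "politicians", politicians,
             row.names = FALSE, overwrite = TRUE)
```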

The next step, and the beginning of the actual data collection, was to check whether the politicians' Facebook profiles I had collected manually were a) publicly available via the API and b) conceived as “user” or as “page”. Although this information is already listed in the table of politicians provided above, it might change over time, so it is worth redoing this check if the list of politicians is used in another context (a sketch of the check follows after the list).

  1. Regarding the public availability of Facebook profiles, site admins may set the visibility of their profile to public or private. Of course, only those profiles whose owners have chosen to make their content available can be accessed via the Facebook API. Nonetheless, some of these profiles are still accessible via manual search on the platform. Self-evidently, I respect user privacy. Nonetheless, I quarrel with this situation, since some of these profiles are obviously not personal or private in content but clearly aim at a broader public. Hence, I suspect that some of the politicians and/or their social media staff are not aware of the fact that their Facebook profile is not completely public (and thus cannot be found via search engines etc.), or they do not care. In total, 671 profiles were configured as private and thus dropped out of the data collection.
  2. Most politicians (n = 1315; 94.3 %) in the remaining sample created their personal Facebook representation as a Facebook “page”. This makes sense, since Facebook “pages” distinguish professional or business accounts from ordinary “user” profiles. Nonetheless, the sample still contains 80 non-private “user” profiles (5.7 %). This has consequences for the profiles' attributes, but not for the posts on these profiles, so it is only marginally relevant for the project: a user profile does not contain or reveal as many attributes in data collection via the API; e.g., information on affiliation, birthday, biography or category of the profile cannot be downloaded from user profiles. Since the collection of posts is not affected by this differentiation, it does not really matter, but it needs to be taken into account when politicians' profile information is to be downloaded (which I did, as seen in the next step).
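The check itself boils down to a single request against the Graph API. A minimal sketch (API version and token are placeholders; private or deleted profiles simply return an error):

```r
# Sketch: ask the Graph API whether an id belongs to a "user" or a "page".
library(httr)
library(jsonlite)

check_profile_type <- function(fb_id, token) {
  res <- GET(paste0("https://graph.facebook.com/v2.10/", fb_id),
             query = list(metadata = 1, access_token = token))
  if (status_code(res) != 200) return(NA)  # private/deleted -> not accessible
  fromJSON(content(res, as = "text", encoding = "UTF-8"))$metadata$type
}
```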

The third step of the data collection was to access the politicians' profiles. I wanted to collect them to gather some background information on the sample as well as to crosscheck whether I got the “right” profiles. Some politicians have very common names, and there are even duplicate names within the sample (like two “Michael Meisters”, one CDU, one AfD). I plan another report on the crosschecks of the data that I did. But for now, let's get back to the data collection. Accessing the profiles was the first challenge for the Rfacebook package. I actually didn't find a function that extracted exactly the info I wanted. Hence, I wrote a simple GET request which returned the specific fields I was interested in. The next challenge was again to store the newly collected data in the database and keep the encoding right. This was ensured by setting the encoding to „UTF-8_bin“ for every non-English text variable. In total, I collected the profiles of 1395 campaigning politicians.
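The request could look roughly like this (the field list is shortened for illustration and the token is a placeholder):

```r
# Sketch of the profile GET request against the Graph API.
library(httr)
library(jsonlite)

get_profile <- function(fb_id, token) {
  res <- GET(paste0("https://graph.facebook.com/v2.10/", fb_id),
             query = list(fields = "id,name,about,birthday,category,fan_count",
                          access_token = token))
  profile <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
  # declare all text fields as UTF-8 before they go into the database
  for (v in names(profile))
    if (is.character(profile[[v]])) Encoding(profile[[v]]) <- "UTF-8"
  profile
}
```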

Until now, the data was neither very big nor did the collection take very long. This changed in the next steps, the collection of the posts and the collection of the visuals, because my aim was to download all posts from all campaigning politicians with Facebook profiles during the whole 18th election period of the German Bundestag (four years). I decided to separate these steps from each other and to use two tables in the database to collect the posts and the visuals. The script to collect posts on Facebook is not very notable; again, I had to do several checks on the encoding before everything worked fine. Moreover, I decided to collect the download errors in a separate database table to gain control over them. The script was running for several days (or weeks?), which was a bit annoying, since I had to restart the script every two hours because the token to access the API was only valid for that long. I also found that some politicians had changed or deleted their Facebook profiles in the meantime, which forced me to update the sample all along. To be able to trace when I saved a certain post, I wrote the download time into the database.
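The core of the loop looks roughly like this (the dates approximate the 18th election period; log_error is a hypothetical helper writing to my error table):

```r
# Sketch of the post-collection loop using Rfacebook::getPage.
library(Rfacebook)
library(RMySQL)

for (id in politicians$fb_id) {
  posts <- tryCatch(
    getPage(id, token = fb_token, n = 10000,
            since = "2013/10/22", until = "2017/10/24"),
    error = function(e) { log_error(id, conditionMessage(e)); NULL })
  if (is.null(posts)) next
  posts$downloaded_at <- as.character(Sys.time())  # trace the download time
  dbWriteTable(con, "posts", posts, append = TRUE, row.names = FALSE)
}
```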

The deletion of profiles or single posts was also a problem in the final step, the collection of the visuals, and one that could not be resolved. For data safety and practical reasons, I decided to save the visuals in two ways: first as a blob object in the database and second as a .jpg on my local hard drive. I also decided to collect only visuals which were posted in “photo posts”; video material and other visuals were left out for practical as well as conceptual reasons. In total, 389,741 pictures were downloaded, which take up nearly 30 GB of disk space. Given this amount of data, I will probably have to rethink the scope of this project and reduce the sample to maybe only one year of posts. I know this project cannot be considered really big data, but for me this is quite an impressive number!
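For each photo post, the double storage boils down to a download plus an insert. A sketch (paths, table and column names are illustrative):

```r
# Sketch: save one photo both as a local .jpg and as a blob in the database.
save_visual <- function(post_id, picture_url, con) {
  path <- file.path("visuals", paste0(post_id, ".jpg"))
  download.file(picture_url, destfile = path, mode = "wb")  # binary mode matters on Windows
  img <- readBin(path, what = "raw", n = file.info(path)$size)
  # raw vectors print as hex characters, so they can go in as a MySQL hex literal
  dbSendQuery(con, sprintf(
    "INSERT INTO visuals (post_id, image) VALUES ('%s', x'%s')",
    post_id, paste(img, collapse = "")))
}
```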

All in all, I'm pretty pleased with how the data collection went. I learned a lot about R, the Facebook API, and SQL databases. The next task will be to describe and visualize characteristic features of the sample. Of course, I will proudly present some of the insights here soon. Before I close this post, which has become incredibly long, I would like to mention and remember the five most annoying things I encountered during data collection, since it is good practice to document not only the triumphs but also the failures. So here is the top five of what annoyed me during data collection:

  1. Bad encoding. It took me a while, but I found some working solutions, although they feel kind of wonky.
  2. Politicians changing their Facebook profiles or deleting their profiles, posts and/or visuals.
  3. Caching of the phpMyAdmin interface (due to the caching issues I was not able to log into my account for nearly a day; of course, I didn't know it was a caching issue at the time…).
  4. Renewing the Facebook token over and over again… and again…
  5. Excel's nasty habit of displaying and saving large integers in scientific format. Of course, the Facebook identifier can be seen as a large integer (it has 15 digits or so). Well, feeding the Facebook API with 1,2345E+14 and the like does not really work… (a one-line fix is sketched below).
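The simplest fix I know is to keep the identifiers out of numeric types altogether by forcing the column to character on import (column and file names are illustrative):

```r
# Read the Facebook ids as text so they never touch scientific notation.
politicians <- read.csv2("politicians_utf8.csv",
                         colClasses = c(fb_id = "character"))
```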

Fellow program Freies Wissen

The fellow program Freies Wissen (free knowledge), initiated by Stifterverband, Wikimedia and Volkswagenstiftung, is entering its second round, and I'm on board! The program supports 20 young scholars in making their own research as well as their teaching open and transparent. Further, the young scholars are encouraged to take a leading role in the open science movement and spread the word in their scientific discipline. The program includes expert talks, workshops, webinars, a mentoring program, as well as exchange with the other fellows. I'm excited to take part, and I'm looking forward to learning more about the ideas behind open science and getting to know some useful practical tools.

Also, as part of the program, I will challenge myself to design my current research project to be as open as possible. The project is concerned with the (visual) communication of politicians on Facebook, and computational methods will be applied (here is the link to the project's outline on Wikiversity, in German). The project does not broach the issue of open science itself, like some of the fellows' projects do. Rather, I want to apply and evaluate open science ideas and tools in my workflow and take a look at how they can be integrated into my work as a communication researcher. I want to evaluate what works for me and what doesn't. Ideally, I will create a “best practice” example I can refer to. As part of this, I plan to document my progress and my thoughts on open science here on the blog. So stay tuned 🙂

New role: Young scholars' representative of the DGPuK

Since today it is official: starting in October, Manuel Menke (University of Augsburg) and I will form the young scholars' representative team of the DGPuK for two years. Of the 190 people who registered for the electoral roll, 164 cast their vote, which corresponds to a turnout of 86.3 percent. According to the “official” final result, I received 81.1 % and Manuel 79.3 % of the votes (4.3 % abstentions). We sincerely thank everyone who voted for us, Hanan Badr and Philipp Henn for conducting the election, and of course especially our predecessors in office, Annekatrin Bock and Christian Strippel, for their outstanding work. We are looking forward to this new and challenging task!

Computational Communication Science – Towards A Strategic Roadmap

We are happy that the VolkswagenStiftung has finally accepted our proposal for a one-week conference event on computational communication science! In February 2018, the Department of Journalism and Communication Research at Hanover University of Music, Drama, and Media will host an event that brings together young scholars as well as experts from the field. Our aims are twofold: First, we want to qualify young scholars so that they can adopt computational methods in their research as well as their teaching. To this end, various training courses on computational research methods will be organized. Second, a workshop event aims to explore and elucidate the challenges that hinder communication scientists from applying the new methods in their work. Together, we will craft a strategic roadmap that shapes the future of computational communication science. More information can be found on our new website.

Goodbye Stuttgart, Hello Hannover!

Unbelievable what you can get done within five days: a job interview, my doctoral defense, and a move. My time in Hohenheim is now coming to an end, and I look back on almost six years of “apprenticeship and journeying” that were as varied as they were exciting. What I will miss most are the many colleagues and our shared lunch rounds, the professional exchange and the socializing. As it happens, on my last day in Hohenheim I also completed my academic degree there. I am very happy about a successful defense and an even more successful champagne reception, attended not only by colleagues, doctoral advisors, friends and of course my family, but also by some participants of the methods conference taking place at the same time.

Since Tuesday it has also been clear where I am headed professionally: back home, so to speak, to the IJK in Hannover. I am very curious to see what has changed there and what has not, and I am looking forward to the new challenges awaiting me there.

Methods project conference 2015

Yesterday was the last day of the lecture period, and to round it off, the final conference of the methods projects took place for the fourth time. At this conference, the participants of the four project courses present the first research results they have collected themselves. The program was once again very diverse and extremely interesting. Every year, I am thrilled anew by the exciting projects our Hohenheim bachelor students realize as early as their second semester. Truly impressive.

I also learned a lot again in the four projects I supervised. This year, I put my data collection course under the motto „FSK 18! Wir forschen zum Jugendmedienschutz“ (“FSK 18! We do research on youth media protection”) and conducted an experiment, a content analysis, a secondary data analysis, and a standardized survey.

In the first panel, the experiment group presented their study on the “forbidden fruit hypothesis” (e.g., Bushman 2006). The hypothesis assumes that age labels make the “forbidden” media content more attractive to adolescents who are still too young to be allowed to consume it. The study in the methods project examined, among 18- to 29-year-old students, whether this effect still lingers in young adulthood. Indeed, it is still present among the 18- to 21-year-old students, but no longer among the 22- to 29-year-olds. Besides age, gender and genre preferences are further influencing factors.

The content analysis group examined the consistency of the FSK rating justifications, which can be viewed on the website of the Freiwillige Selbstkontrolle der Filmwirtschaft. They were able to show that, with increasing age ratings, violent and sexual content is discussed in more detail. Moreover, in the justifications for movies rated “12 and up”, violence is discussed at considerably greater length than sexual content, while for movies rated “16 and up” both types of content are covered equally long on average. Apparently, young adolescents are confronted with violent content sooner than with sexual content. This naturally raises the question of whether this applies only to the rating justifications or also to the actual media content, which, however, could not be examined within the scope of the methods project.

In the open panel, one group presented a secondary data analysis on the success factors of movies. Success was operationalized as economic success in the form of box-office results. Based on data combined from several online sources, the students were able to show that movies rated FSK 0 are particularly successful economically in Germany, which is mainly attributable to animated movies. FSK 16/18 movies, by contrast, are the least successful. Contrary to expectations, the prominence of a movie's cast increases with higher age ratings, and movies released in summer are more successful than movies starting in winter.

The survey group examined which characteristics parents rely on when selecting media content for their children. They surveyed 126 parents of elementary school children in Stuttgart. While the group could not confirm its hypotheses (apparently, parents select video games and movies by the same criteria, and there is no difference between mothers and fathers either), the study did show that the FSK ratings are an important basis for parents' decisions. This partial result nicely underlines the relevance of the whole project seminar once again.

And here is a group picture with all participants:

Following the discussion-filled conference, we had a barbecue and celebrated the start of the semester break, which all of us have now well deserved 🙂

Talk at the ICA in Puerto Rico

This year, I am again represented with a talk at the ICA in San Juan, Puerto Rico. In the paper, together with my co-authors Michael Schenk and Anja Briehl, I address the questions of whether bloggers ascribe the characteristics of opinion leaders to themselves, which journalistic roles they adopt, and to what extent self-ascribed opinion leadership favors the adoption of journalistic roles. Indeed, the 403 topic bloggers we examined see themselves as opinion leaders, and in their role as bloggers they want, much like journalists, to “give advice and set trends” as well as “contribute to opinion formation”. There is in fact a correlation between self-ascription as an opinion leader (measured with Noelle-Neumann's personality strength scale) and endorsement of the two roles.

Correlated correlations

A while ago, for a chapter in the DFG project “Die Diffusion der Medieninnovation Web 2.0” (the diffusion of the media innovation Web 2.0), I worked on comparing several correlations that depend on each other. The aim was to use a significance test to determine whether the relationship between the use of social network sites and a person's social capital is stronger than the relationship between the use of other web applications and social capital. Since the correlation coefficient r must not be interpreted as a measure on an interval scale, the Fisher Z values have to be computed first, according to Bortz. Then you can compare distances, but you still do not know whether they differ significantly.

At first, I found this tool, which offers a significance test but does not fit my problem, because it compares completely independent correlation coefficients. My correlations, however, are not independent: one of the variables can be interpreted as dependent and the others as influencing it. After some searching, I came across the article by Meng, Rosenstein and Rubin (1992), who propose a solution and address not only the comparison of two but even of several correlations. A close reading of the article was well worth it, and since computing everything by hand would be somewhat tedious in the long run, I created an Excel sheet as an aid, which I provide here. Have fun with it 🙂
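For the two-correlation case, the computation also fits in a few lines of R. A sketch following the formula in Meng et al. (1992), where r1 and r2 are the two correlations with the dependent variable, rx is the correlation between the two predictors, and n is the sample size (the example numbers at the end are made up):

```r
# Meng, Rosenstein & Rubin (1992): test for two dependent correlations.
meng_z_test <- function(r1, r2, rx, n) {
  z1 <- atanh(r1); z2 <- atanh(r2)           # Fisher's Z transforms
  rbar2 <- (r1^2 + r2^2) / 2                 # mean squared correlation
  f <- min((1 - rx) / (2 * (1 - rbar2)), 1)  # f is capped at 1
  h <- (1 - f * rbar2) / (1 - rbar2)
  z <- (z1 - z2) * sqrt((n - 3) / (2 * (1 - rx) * h))
  c(z = z, p = 2 * pnorm(-abs(z)))           # two-sided p-value
}

# Example: is a SNS-social capital correlation of .40 significantly stronger
# than .25 for other web applications, if the two usage measures correlate
# at .30 and n = 300? (illustrative values, not the project's data)
meng_z_test(r1 = .40, r2 = .25, rx = .30, n = 300)
```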