# Working with the project report

As described in the getting started section, you can create a report for your project, which is both

  • an export of the raw and aggregated data
  • some best guesses on what you might be interested in, in the form of already included plots

This document describes what is included in the report and gives you some ideas on how to work with the exported data.

# Contents of the report

When you download a project report, you get a single ZIP file which you then have to extract on your computer. After extracting the ZIP archive, you'll see a new folder with the same name as the report file. Within that folder, you'll find the following structure (details on each item follow below):

  • plots: A folder with already plotted analyses.
  • sessions: A folder with one subfolder for each session you selected to be exported.
    • Within each session folder, a session.csv file containing the raw analyses results for the session.
    • Within each session folder, a stats.csv file containing already computed statistics for the session, based on the raw session data.
  • Optionally, a tags.csv file, present only in the folders of sessions that have been assigned one or more tags.
  • stats_combined.csv: A convenience file containing all the data from the individual sessions' stats.csv files in a single file.
  • stats_aggregated.csv: A file containing already aggregated statistics over all your sessions.
  • report.log: A log file that contains any error messages if something went wrong during the export. This file should usually be empty, but it is worth a quick check to make sure that everything went smoothly during report generation.

Depending on what you are interested in and how sophisticated an analysis you want to run yourself, different parts of the report directory will be relevant to you. If you are a number cruncher, you may want to access the raw data in each session's session.csv file directly. If you just want a quick overview, the pre-generated plots and the stats_aggregated.csv file are more suitable.

Let's see what each of the files/folders contains.

# The plots folder

In the plots folder, you find some pre-generated plots of things you might be interested in. How these plots are generated is determined by the phases and tags you have annotated in your project.

# Emotion box plots

One type of plot you will find is the so-called boxplot, which looks like this:

Automatically generated boxplot for different emotion categories


These plots either show the four emotion categories or the valence/arousal values. Since the plot shows the aggregation over the whole dataset (or a subset of it, depending on the file name), the plot elements visualize the distribution of the values in the dataset.

If you're not familiar with box plots, here's a quick summary:

  • The box marks the first quartile (the lower edge of the box), the median (the line within the box) and the third quartile (the upper edge of the box).
  • The end of the whiskers (the lines going downwards and upwards from the boxes) mark the minimum and maximum of the dataset, excluding any outliers.
  • Any points below or above the whisker ends mark outliers.

If that sounds too complicated, you might just want to look at the lines within the boxes (i.e., the median). This gives you an idea of what the "typical" value in that dimension is across the dataset.

Why are the boxes interesting? Because they give you an idea of how much the values in the dataset vary. A large box means that your participants had quite different reactions to the stimulus, while a small box usually means that the reactions across your participants were very similar.

In the plots folder, you can find many of these box plots (depending on your phase and tag annotations), which all show different combinations of certain measures on certain parts of the dataset. The respective combination is encoded in the file name. If you read through the next sections, it should become quite clear what the different file names signify.

# Valence/Arousal heatmaps

The second type of plots you can find in the plots folder are the so-called valence/arousal distribution plots. These plots are a kind of heatmap, showing the distribution of values in the two-dimensional valence/arousal space.

Automatically generated valence/arousal heatmap


In this case, each plot always shows the distribution of the whole dataset or a specific subset of the dataset, e.g., the distribution of a certain phase.

# The sessions folder

While the pre-generated plots are meant for a quick look into the dataset, the data within the sessions folder gives you all the flexibility you might need to dive deep into the analysis yourself.

In the sessions folder, there is one subfolder for each session you selected for the report. Each of these folders contains two or three files: session.csv, stats.csv and optionally tags.csv.

# session.csv

The session.csv is the rawest form of the export. It contains all the raw analysis results of the session, per second. It looks like this:

,phase,valence,arousal,sympathy-happiness,focus-anger,surprise,disappointment-sadness
0,Baseline,-0.249,0.119,0.000,0.465,0.001,0.229
1,Baseline,-0.200,0.092,0.000,0.415,0.001,0.217
2,Baseline,-0.247,0.051,0.000,0.428,0.001,0.231
3,Baseline,-0.269,0.054,0.000,0.558,0.001,0.223
...

with the following columns:

  • first unnamed column: The index of the rows, which corresponds to the number of seconds since the start of the session.
  • phase: The current phase name, if a phase was annotated at that point in time.
  • valence: The value for valence, from -1.0 to +1.0.
  • arousal: The value for arousal, from -1.0 to +1.0.
  • sympathy-happiness: The value for sympathy/happiness, from 0.0 to 1.0.
  • focus-anger: The value for focus/anger, from 0.0 to 1.0.
  • surprise: The value for surprise, from 0.0 to 1.0.
  • disappointment-sadness: The value for disappointment/sadness, from 0.0 to 1.0.

So this data can be used if you want to calculate any statistics and aggregations on your own.
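As a starting point, here is a minimal sketch of how you could load a session.csv with pandas and compute the mean valence per phase. The sample rows below are hypothetical values following the format shown above:

```python
import io

import pandas as pd

# A small sample in the session.csv format (values are hypothetical).
sample = """,phase,valence,arousal,sympathy-happiness,focus-anger,surprise,disappointment-sadness
0,Baseline,-0.249,0.119,0.000,0.465,0.001,0.229
1,Baseline,-0.200,0.092,0.000,0.415,0.001,0.217
2,tv_spot_1,0.150,0.051,0.000,0.428,0.001,0.231
3,tv_spot_1,0.250,0.054,0.000,0.558,0.001,0.223
"""

# The first, unnamed column is the per-second index.
df = pd.read_csv(io.StringIO(sample), index_col=0)

# Mean valence for each annotated phase.
mean_valence = df.groupby("phase")["valence"].mean()
print(mean_valence)
```

In a real report, you would pass the path of the session.csv file to pd.read_csv instead of the io.StringIO sample.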

# stats.csv

Oftentimes, you do not need the detailed data from the session.csv but might want to look at an already aggregated form. In this case, the stats.csv file might suit your needs. It basically looks like this:

,phase,measure,stat,value
0,__full_session,valence,mean,-0.282
1,baseline,valence,mean,-0.271
2,tv_spot_1,valence,mean,0.237
3,tv_spot_2,valence,mean,-0.302
...

with the following columns:

  • first unnamed column: The index of the rows; not really relevant here.
  • phase: The phase name; TAWNY automatically inserts an additional "phase" called __full_session which, as the name suggests, spans the whole session. So even if you have not annotated any phases, there will at least be the __full_session entry.
  • measure: The name of the emotion measure; one of valence, arousal, sympathy-happiness, focus-anger, surprise, disappointment-sadness.
  • stat: The name of the aggregation measure; one of mean, std, median, quantile-10, quantile-90, opm-mean, opm-min, opm-max.
  • value: The corresponding value.

You probably wonder what these rows actually mean. Imagine a session, i.e., a recording of one of your participants. You have annotated two phases - tv_spot_1 and tv_spot_2 - because you want to compare two advertisements. While you could look into the second-by-second values of the session.csv file, the stats.csv file already gives you pre-computed aggregations for your phases. What you are actually interested in might, for example, be the mean valence during tv_spot_1 compared to the mean valence during tv_spot_2. You can find this information very easily in the stats.csv file: just look for the rows where measure is set to valence and stat is set to mean.
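With pandas, that filtering is just a few lines. A sketch using the excerpt above (the values are hypothetical):

```python
import io

import pandas as pd

# Excerpt in the stats.csv format (values are hypothetical).
sample = """,phase,measure,stat,value
0,__full_session,valence,mean,-0.282
1,baseline,valence,mean,-0.271
2,tv_spot_1,valence,mean,0.237
3,tv_spot_2,valence,mean,-0.302
"""

stats = pd.read_csv(io.StringIO(sample), index_col=0)

# Keep only the mean-valence rows, indexed by phase name.
valence_means = (
    stats[(stats["measure"] == "valence") & (stats["stat"] == "mean")]
    .set_index("phase")["value"]
)

# Compare the two spots.
print(valence_means["tv_spot_1"] - valence_means["tv_spot_2"])
```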

# The different types of stats

The stat column in the exported CSV file contains the name of the aggregation measure used. For example, if you look at the above excerpt from stats.csv, the 4th line contains the following information:

2,tv_spot_1,valence,mean,0.237

It tells you that if you take all the valence values of the phase tv_spot_1 and calculate the mean of these values, you get 0.237. Instead of the mean, one can also use other measures to aggregate the values. The stats.csv contains the following measures (so you do not have to calculate them yourself):

  • mean: The arithmetic mean of the values. It usually is the first measure everyone looks at before digging deeper into the data.
  • std: The standard deviation of the set of values. This measure allows you to get a feeling for the distribution of the values (i.e., how different they are from each other).
  • median: The median is the "middle" value of the relevant values, i.e., 50% of the values lie below and 50% of the values lie above this value. Compared to the mean, the median is less sensitive to a few very large (or very small) values, i.e., it is less sensitive to such outliers.
  • quantile-10: Similar to the median, the 10% quantile separates the values into two subsets, but instead of creating a 50:50 split, it denotes the value that creates a 10:90 split. This means that 10% of all the values lie below this value, and 90% lie above it.
  • quantile-90: The 90% quantile is the counterpart of the 10% quantile, as it denotes the value that creates a 90:10 split. This means that 90% of all the values lie below this value, and 10% lie above it.
  • opm-mean, opm-min, opm-max: We use OPM as the abbreviation for what we call occurrences per minute. For measures which can be (more or less) considered a probability estimate, one can define a threshold above which the measured dimension is considered to be actually present. This idea applies to the basic emotion categories (i.e., sympathy / happiness, surprise, focus / anger and disappointment / sadness) but not to the dimensional emotion models (i.e., valence and arousal). Occurrences of values above the defined threshold are counted, and the total count is then normalized to one minute. For example: To calculate the occurrences per minute for the surprise category, one would go through all the surprise values of the considered time span, e.g., a certain phase. Whenever a value is above the threshold - which is 0.8 in this case - it is counted as one occurrence. Dividing the total number of occurrences by the number of minutes then yields the occurrences-per-minute value. Please note that instead of looking at all values at once, the TAWNY Platform calculates the OPM value for overlapping 10-second windows and then extrapolates the number of occurrences to one minute. Consequently, the intermediate result of the OPM calculation is still one OPM value per second. For this list of values we then calculate the mean (opm-mean), min (opm-min) and max (opm-max), which are shown in the report.
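To make the windowed OPM computation concrete, here is a simplified sketch in plain Python. This illustrates the idea described above, not the platform's exact implementation; the surprise values are made up, while the 10-second window and the 0.8 threshold follow the text:

```python
def opm_series(values, threshold=0.8, window=10):
    """One OPM value per second: count threshold crossings in each
    overlapping window and extrapolate the count to one minute."""
    opm = []
    for start in range(len(values) - window + 1):
        count = sum(1 for v in values[start:start + window] if v > threshold)
        opm.append(count * 60 / window)  # extrapolate to 60 seconds
    return opm

# Hypothetical per-second surprise values for a 15-second span.
surprise = [0.1, 0.9, 0.85, 0.2, 0.1, 0.95, 0.3, 0.1, 0.82, 0.1,
            0.1, 0.1, 0.9, 0.2, 0.1]

per_second = opm_series(surprise)

# The three values that end up in the report.
opm_mean = sum(per_second) / len(per_second)
opm_min, opm_max = min(per_second), max(per_second)
print(opm_mean, opm_min, opm_max)
```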

# tags.csv

The tags.csv file is only present when the respective session has actually been given one or more tags. If this is the case, the file simply contains one of the assigned tags per line:

male
age-group-1
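Since tags.csv has no header row, reading it is trivial; a minimal sketch in Python (using the sample content from above as a string):

```python
# tags.csv has no header row: it is simply one tag per line.
sample = "male\nage-group-1\n"

tags = sample.splitlines()
print(tags)
```

With a real file, you would use Path("tags.csv").read_text().splitlines() or an equivalent one-liner.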

# The stats_combined.csv file

It is important to note that the previously described stats.csv files (within each session folder) only ever contain the data of a single session. Most of the time, you will probably want to look at the data across all your sessions. For your convenience, the report directory contains the stats_combined.csv file, which is a concatenation of all the individual stats.csv files. There is no new information here, but one file can be easier to process than many individual stats.csv files.
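If you prefer to build such a concatenation yourself, for example to keep track of which session each row came from, here is a sketch with pandas. The directory layout follows the description above; the added session column is our own convenience and is not claimed to be part of stats_combined.csv. The demo builds a tiny fake report folder first so the snippet is self-contained:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a tiny fake report layout (two sessions) just for the demo;
# in a real report, point `root` at the extracted report folder.
root = Path(tempfile.mkdtemp())
for name, valence in [("session_a", -0.271), ("session_b", 0.102)]:
    folder = root / "sessions" / name
    folder.mkdir(parents=True)
    (folder / "stats.csv").write_text(
        ",phase,measure,stat,value\n"
        f"0,__full_session,valence,mean,{valence}\n"
    )

# Concatenate the per-session stats.csv files, remembering each session name.
frames = []
for stats_file in sorted((root / "sessions").glob("*/stats.csv")):
    frame = pd.read_csv(stats_file, index_col=0)
    frame["session"] = stats_file.parent.name  # the folder name identifies the session
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```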

# The stats_aggregated.csv file

Finally, the stats_aggregated.csv file provides an even higher level view onto the dataset. It looks like this:

,tag,phase,measure,stat,mean,median,std
0,,__full_session,valence,mean,-0.145,-0.147,0.079
1,,baseline,valence,mean,-0.149,-0.119,0.075
2,,tv_spot_1,valence,mean,0.202,0.244,0.174
3,,tv_spot_2,valence,mean,-0.174,-0.197,0.117
...

with the following columns:

  • first unnamed column: The index of the rows; not really relevant here.
  • tag: The tag name; Either empty, which means all sessions, or a specific tag, which means only the sessions that have been annotated with this tag.
  • phase: The phase name; TAWNY automatically inserts an additional "phase" called __full_session which, as the name suggests, spans the whole session. So even if you have not annotated any phases, there will at least be the __full_session entry.
  • measure: The name of the emotion measure; one of valence, arousal, sympathy-happiness, focus-anger, surprise, disappointment-sadness.
  • stat: The name of the aggregation measure; one of mean, std, median, quantile-10, quantile-90, opm-mean, opm-min, opm-max.
  • mean: The mean of the corresponding values across all sessions, tagged with the respective tag.
  • median: The median of the corresponding values across all sessions, tagged with the respective tag.
  • std: The standard deviation of the corresponding values across all sessions, tagged with the respective tag.

OK, so now we have a mean of many means - where does this come from? As said before, all these stats get calculated for each phase of each session. Say you have 10 participants watching your two TV spots, tv_spot_1 and tv_spot_2, and let's look at the valence values. You can calculate the mean valence for participant 1 watching tv_spot_1 as well as the mean valence for participant 1 watching tv_spot_2, giving you one value for tv_spot_1 and one for tv_spot_2. Now you repeat this for all your participants, so in the end you have 10 valence mean values for tv_spot_1 and 10 valence mean values for tv_spot_2. To arrive at a final insight, you take the mean of the 10 values for tv_spot_1 as well as the mean of the 10 values for tv_spot_2. This gives you two final numbers - one for tv_spot_1 and one for tv_spot_2 - which you can compare to draw your conclusions.

While this may sound complicated, it actually is very easy to do with the stats_aggregated.csv file. You simply filter for the rows which have measure set to valence, stat set to mean and phase set to each of the phases you are interested in. Now you just look at the value in the mean column. This is the number you are interested in.
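The same filtering works with pandas; a sketch using the excerpt above (values are hypothetical):

```python
import io

import pandas as pd

# Excerpt in the stats_aggregated.csv format (values are hypothetical).
sample = """,tag,phase,measure,stat,mean,median,std
0,,__full_session,valence,mean,-0.145,-0.147,0.079
1,,baseline,valence,mean,-0.149,-0.119,0.075
2,,tv_spot_1,valence,mean,0.202,0.244,0.174
3,,tv_spot_2,valence,mean,-0.174,-0.197,0.117
"""

agg = pd.read_csv(io.StringIO(sample), index_col=0)

# Rows where measure == valence and stat == mean, for the two spots;
# the "mean" column then holds the cross-session mean of the per-session means.
spots = agg[
    (agg["measure"] == "valence")
    & (agg["stat"] == "mean")
    & (agg["phase"].isin(["tv_spot_1", "tv_spot_2"]))
].set_index("phase")["mean"]
print(spots)
```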

# How do I get started with this?

The exported reports allow you to dig deep into your dataset - often far deeper than is actually necessary for your project. Most of the time, a good start is to look at some of the plots and then at the numbers in the stats_aggregated.csv file. You can import this file into many applications like Microsoft Excel, which then allows you to easily filter the rows according to the column values. We are in the process of creating more detailed articles on how to apply these emotion analytics techniques to different use cases. In the meantime, if you need any help, don't hesitate to register for a free onboarding tour with our emotion analytics experts: Get your guided onboarding here.