# Working with the project report

As described in the getting started section, you can create a report for your project, which contains both

  • an export of the raw data and
  • pre-aggregated statistics across your data set that allow you to quickly gain some first insights.

This document describes what is included in the report and gives you some ideas on how to work with the exported data.

# Contents of the report

When you download a project report, you get a single ZIP file which you then have to extract on your computer. After extracting the ZIP archive, you'll see a new folder with the same name as the report file. Within that folder, you'll find the following structure (details on each item follow below):

  • sessions.csv: The main file containing the raw analysis results (second by second) for all the exported sessions.
  • stats_combined.csv: A file containing pre-computed statistics for each session individually.
  • stats_aggregated.csv: A file containing statistics aggregated over all your sessions.
  • tags.csv: A file listing the tags that have been assigned to each session.
  • report.log: A log file that contains any error messages if something went wrong during the export. It should usually be empty, but it makes sense to check it quickly to be sure that everything went smoothly during report generation.

So depending on what you are interested in and how sophisticated an analysis you want to do yourself, different files of the report directory might be relevant for you. If you are a number cruncher, you might want to access the raw data in the sessions.csv file directly. If, however, you want a quick overview, the pre-computed statistics in the stats_aggregated.csv file are more suitable.

Let's see what each of the files contains.

# The sessions.csv file

The data within the sessions.csv file gives you all the flexibility you might need to dive deep into the analysis yourself. It contains all the raw analysis results of the sessions, second by second. It looks like this:

,time,sessionId,sessionName,participantId,processingSuccessful,usable,phase,valence,arousal,sympathy-happiness,focus-anger,surprise,indifference-sadness,bpm
0,0,N1O04LM8EkTcZ20j1NdC,participant_001.mp4,9876ab,True,True,baseline,-0.131,0.065,0.001,0.005,0.016,0.309,67.853
1,1,N1O04LM8EkTcZ20j1NdC,participant_001.mp4,9876ab,True,True,baseline,-0.101,0.044,0.001,0.002,0.009,0.335,68.050
2,2,N1O04LM8EkTcZ20j1NdC,participant_001.mp4,9876ab,True,True,baseline,-0.128,0.076,0.000,0.005,0.001,0.178,67.521
3,3,N1O04LM8EkTcZ20j1NdC,participant_001.mp4,9876ab,True,True,baseline,-0.250,0.080,0.002,0.019,0.330,0.059,67.669
...
1217,0,A98b73KLt5xU0WRn45sD,participant_021.mp4,3456ed,True,True,baseline,-0.074,0.047,0.002,0.003,0.032,0.473,58.708
1218,1,A98b73KLt5xU0WRn45sD,participant_021.mp4,3456ed,True,True,baseline,-0.158,0.124,0.000,0.000,0.573,0.033,58.853
1219,2,A98b73KLt5xU0WRn45sD,participant_021.mp4,3456ed,True,True,baseline,-0.261,0.125,0.000,0.001,0.094,0.119,58.250
1220,3,A98b73KLt5xU0WRn45sD,participant_021.mp4,3456ed,True,True,baseline,-0.249,0.130,0.000,0.000,0.662,0.006,58.305
...

with the following columns:

  • first unnamed column: The index of the rows, which is simply a consecutive sequence starting from 0.
  • time: The current second of the recording.
  • sessionId: The internal unique ID of the session.
  • sessionName: The session name as defined on the platform (can be changed in the session view).
  • participantId: The ID of the recorded participant. This is only set if the session was recorded through TAWNY's recording tool.
  • processingSuccessful: True if the recording could be analyzed without any technical errors, False otherwise.
  • usable: True if the session was marked as usable on the platform, False otherwise.
  • phase: The current phase name, if a phase was annotated at that point in time.
  • valence: The value for valence, from -1.0 to +1.0.
  • arousal: The value for arousal, from -1.0 to +1.0.
  • sympathy-happiness: The value for sympathy/happiness, from 0.0 to 1.0.
  • focus-anger: The value for focus/anger, from 0.0 to 1.0.
  • surprise: The value for surprise, from 0.0 to 1.0.
  • indifference-sadness: The value for indifference/sadness, from 0.0 to 1.0.
  • bpm: The estimated heart rate of the analyzed person in beats per minute.

So this data can be used if you want to calculate any statistics and aggregations on your own.
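For example, the raw data loads directly into pandas for your own aggregations. The snippet below uses a small, trimmed inline stand-in for sessions.csv (fewer columns and rows for brevity); in practice you would call pd.read_csv("sessions.csv", index_col=0):

```python
import io
import pandas as pd

# Trimmed stand-in for sessions.csv; in practice: pd.read_csv("sessions.csv", index_col=0)
csv = io.StringIO(
    ",time,sessionId,phase,valence,arousal\n"
    "0,0,N1O04LM8EkTcZ20j1NdC,baseline,-0.131,0.065\n"
    "1,1,N1O04LM8EkTcZ20j1NdC,baseline,-0.101,0.044\n"
    "2,2,N1O04LM8EkTcZ20j1NdC,tv_spot_1,0.250,0.080\n"
)
sessions = pd.read_csv(csv, index_col=0)

# Mean valence per session and phase, computed from the second-by-second raw data
mean_valence = sessions.groupby(["sessionId", "phase"])["valence"].mean()
print(mean_valence)
```

The same groupby pattern works for any of the emotion columns and any aggregation pandas supports (std, median, quantile, and so on).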

# The stats_combined.csv file

Oftentimes, you do not need the detailed data from the sessions.csv but might want to look at an already aggregated form. In this case, the stats_combined.csv file might suit your needs. It looks like this:

,sessionId,phase,measure,stat,value
0,N1O04LM8EkTcZ20j1NdC,__full_session,valence,mean,-0.282
1,N1O04LM8EkTcZ20j1NdC,baseline,valence,mean,-0.271
2,N1O04LM8EkTcZ20j1NdC,tv_spot_1,valence,mean,0.237
3,N1O04LM8EkTcZ20j1NdC,tv_spot_2,valence,mean,-0.302
...
63,A98b73KLt5xU0WRn45sD,__full_session,valence,mean,-0.152
64,A98b73KLt5xU0WRn45sD,baseline,valence,mean,-0.137
65,A98b73KLt5xU0WRn45sD,tv_spot_1,valence,mean,-0.089
66,A98b73KLt5xU0WRn45sD,tv_spot_2,valence,mean,-0.122
...

with the following columns:

  • first unnamed column: The index of the rows, which is simply a consecutive sequence starting from 0.
  • sessionId: The internal unique ID of the session.
  • phase: The phase name; TAWNY automatically inserts an additional "phase" called __full_session which, as the name suggests, spans the whole session. So even if you have not annotated any phases, there will at least be the __full_session entry.
  • measure: The name of the emotion measure; one of valence, arousal, sympathy-happiness, focus-anger, surprise, indifference-sadness.
  • stat: The name of the aggregation measure; one of mean, std, median, quantile-10, quantile-90, opm-mean, opm-min, opm-max.
  • value: The corresponding value.

You probably wonder what these rows actually mean. Imagine a session, i.e., a recording of one of your participants. You have annotated two phases - tv_spot_1 and tv_spot_2 - because you want to compare two advertisements. While you could look into the second-by-second values of the sessions.csv file, the stats_combined.csv file already gives you pre-computed aggregations for your phases. What you are actually interested in might, for example, be the mean valence during tv_spot_1 compared to the mean valence during tv_spot_2. You can find this information very easily in the stats_combined.csv file: just look for the rows where measure is set to valence and stat is set to mean.
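That filtering takes a few lines of pandas. The snippet below uses a small inline stand-in for the file; in practice you would use pd.read_csv("stats_combined.csv", index_col=0):

```python
import io
import pandas as pd

# Stand-in for stats_combined.csv; in practice: pd.read_csv("stats_combined.csv", index_col=0)
csv = io.StringIO(
    ",sessionId,phase,measure,stat,value\n"
    "0,N1O04LM8EkTcZ20j1NdC,__full_session,valence,mean,-0.282\n"
    "1,N1O04LM8EkTcZ20j1NdC,tv_spot_1,valence,mean,0.237\n"
    "2,N1O04LM8EkTcZ20j1NdC,tv_spot_2,valence,mean,-0.302\n"
)
stats = pd.read_csv(csv, index_col=0)

# Keep only the rows holding the mean valence of the two ad phases
mask = (
    (stats["measure"] == "valence")
    & (stats["stat"] == "mean")
    & stats["phase"].isin(["tv_spot_1", "tv_spot_2"])
)
print(stats.loc[mask, ["sessionId", "phase", "value"]])
```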

# The different types of stats

The stat column in the exported CSV file contains the name of the aggregation measure used. For example, if you look at the above excerpt from stats_combined.csv, the row with index 2 contains the following information:

2,N1O04LM8EkTcZ20j1NdC,tv_spot_1,valence,mean,0.237

It tells you that if you take

  • all the valence values
  • of the phase tv_spot_1
  • of the session with the ID N1O04LM8EkTcZ20j1NdC
  • and calculate the mean of these values,
  • you get 0.237.

Instead of the mean, one can also use other measures to aggregate the values. The stats_combined.csv contains the following measures (so you do not have to calculate them yourself):

  • mean: The arithmetic mean of the values. It usually is the first measure everyone looks at before digging deeper into the data.
  • std: The standard deviation of the set of values. This measure allows you to get a feeling for the distribution of the values (i.e., how different they are from each other).
  • median: The median is the "middle" value of the relevant values, i.e., 50% of the values lie below and 50% of the values lie above this value. Compared to the mean, the median is less sensitive to a few very large (or very small) values, i.e., it is less sensitive to these outliers.
  • quantile-10: Similar to the median, the value of the 10th quantile separates the values into two subsets, but instead of creating a 50:50 split, it denotes the value that creates a 10:90 split. This means that in this case, 10% of all the values lie below this value, and 90% lie above it.
  • quantile-90: The 90th quantile is the opposite of the 10th quantile, as it denotes the value that creates a 90:10 split. This means that in this case, 90% of all the values lie below this value, and 10% lie above it.
  • opm-mean, opm-min, opm-max: We use OPM as the abbreviation for what we call occurrences per minute. For measures which can be (more or less) considered a probability estimate, one can define a threshold above which one would consider the measured dimension to be actually present. This idea applies to the basic emotion categories (i.e., sympathy/happiness, surprise, focus/anger and indifference/sadness) but not to the dimensional emotion models (i.e., valence and arousal). Occurrences of values higher than the defined threshold are counted, and the total count is then normalized to one minute. For example: to calculate the occurrences per minute for the surprise category, one would go through all the surprise values of the considered time span, e.g., a certain phase. Whenever a value is above the threshold (0.8 in this case), we count this as one occurrence. If we then divide the total number of occurrences by the number of minutes, we get the occurrences-per-minute value. Please note that instead of looking at all values at once, the TAWNY Platform calculates the OPM value over overlapping 10-second windows and then extrapolates the number of occurrences to one minute. Consequently, the intermediate result of the OPM calculation is still one OPM value per second. From this list of values we then calculate the mean (opm-mean), min (opm-min) and max (opm-max) values shown in the report.
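The counting-and-normalizing idea behind OPM can be sketched in a few lines of Python. Note that this is a simplified illustration that evaluates the whole time span at once, not the platform's actual computation over overlapping 10-second windows:

```python
# Simplified sketch of occurrences per minute (OPM): count values above the
# threshold and normalize the count to one minute. The platform additionally
# works over overlapping 10-second windows, which this sketch skips.
def occurrences_per_minute(values, threshold=0.8):
    """values: one emotion value per second (e.g. the surprise column of a phase)."""
    occurrences = sum(1 for v in values if v > threshold)
    minutes = len(values) / 60.0
    return occurrences / minutes

# 120 seconds of surprise values, 3 of which exceed the 0.8 threshold:
surprise = [0.1] * 117 + [0.9, 0.85, 0.95]
print(occurrences_per_minute(surprise))  # 3 occurrences in 2 minutes -> 1.5
```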

# The tags.csv file

The tags.csv file lists the tags which have been assigned to each session. Assigning tags is a feature of the TAWNY platform (see the Getting Started section). The tags.csv file looks like this:

,sessionId,tag
0,N1O04LM8EkTcZ20j1NdC,male
1,N1O04LM8EkTcZ20j1NdC,age-group-1

with the following columns:

  • first unnamed column: The index of the rows, which is simply a consecutive sequence starting from 0.
  • sessionId: The internal unique ID of the session.
  • tag: The assigned tag. Please note that there is only one tag per line, so if a session has been assigned multiple tags, the session will have several entries, one per tag.
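If you want to compare groups of sessions, the tags can be joined onto the stats via the shared sessionId column. Here is a sketch with hypothetical inline data (in practice, read the actual tags.csv and stats_combined.csv files with pd.read_csv):

```python
import io
import pandas as pd

# Hypothetical stand-ins for tags.csv and stats_combined.csv
tags = pd.read_csv(io.StringIO(
    ",sessionId,tag\n"
    "0,N1O04LM8EkTcZ20j1NdC,male\n"
    "1,A98b73KLt5xU0WRn45sD,female\n"
), index_col=0)
stats = pd.read_csv(io.StringIO(
    ",sessionId,phase,measure,stat,value\n"
    "0,N1O04LM8EkTcZ20j1NdC,tv_spot_1,valence,mean,0.237\n"
    "1,A98b73KLt5xU0WRn45sD,tv_spot_1,valence,mean,-0.089\n"
), index_col=0)

# Attach each session's tags to its stats rows via the shared sessionId column,
# then compare the groups
merged = stats.merge(tags, on="sessionId")
by_tag = merged.groupby("tag")["value"].mean()
print(by_tag)
```

Because a session can carry several tags (one row per tag in tags.csv), a session's stats rows are duplicated once per tag after the merge, which is exactly what you want for per-tag group comparisons.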

# The stats_aggregated.csv file

Finally, the stats_aggregated.csv file provides an even higher level view onto the dataset. It looks like this:

,tag,phase,measure,stat,mean,median,std
0,,__full_session,valence,mean,-0.145,-0.147,0.079
1,,baseline,valence,mean,-0.149,-0.119,0.075
2,,tv_spot_1,valence,mean,0.202,0.244,0.174
3,,tv_spot_2,valence,mean,-0.174,-0.197,0.117
...

with the following columns:

  • first unnamed column: The index of the rows, which is simply a consecutive sequence starting from 0.
  • tag: The tag name; either empty, meaning the statistics cover all sessions, or a specific tag, meaning they cover only the sessions annotated with this tag.
  • phase: The phase name; TAWNY automatically inserts an additional "phase" called __full_session which, as the name suggests, spans the whole session. So even if you have not annotated any phases, there will at least be the __full_session entry.
  • measure: The name of the emotion measure; one of valence, arousal, sympathy-happiness, focus-anger, surprise, indifference-sadness.
  • stat: The name of the aggregation measure; one of mean, std, median, quantile-10, quantile-90, opm-mean, opm-min, opm-max.
  • mean: The mean of the corresponding values across all sessions tagged with the respective tag.
  • median: The median of the corresponding values across all sessions tagged with the respective tag.
  • std: The standard deviation of the corresponding values across all sessions tagged with the respective tag.

OK, now we have a mean of many means - where does this come from? As said before, all these stats get calculated for each session and each phase. So you might have 10 participants watching your two tv spots, tv_spot_1 and tv_spot_2. Now let's look at the valence values. You can calculate the mean valence for participant 1 watching tv_spot_1 as well as the mean valence for participant 1 watching tv_spot_2, which gives you one value for tv_spot_1 and one value for tv_spot_2. Now you repeat this for all your participants, so in the end you have 10 valence mean values for tv_spot_1 and 10 valence mean values for tv_spot_2. To end up with a final insight, you take the mean of the 10 values for tv_spot_1 as well as the mean of the 10 values for tv_spot_2. This gives you two final numbers - one for tv_spot_1 and one for tv_spot_2 - which you can compare to draw your conclusions.

While this may sound complicated, it actually is very easy to do with the stats_aggregated.csv file. You simply filter for the rows which have measure set to valence, stat set to mean and phase set to each of the phases you are interested in. Now you just look at the value in the mean column. This is the number you are interested in.
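The same filtering works in pandas if you prefer code over a spreadsheet. The snippet below uses a small inline stand-in for the file; in practice you would use pd.read_csv("stats_aggregated.csv", index_col=0):

```python
import io
import pandas as pd

# Stand-in for stats_aggregated.csv; in practice: pd.read_csv("stats_aggregated.csv", index_col=0)
agg = pd.read_csv(io.StringIO(
    ",tag,phase,measure,stat,mean,median,std\n"
    "0,,tv_spot_1,valence,mean,0.202,0.244,0.174\n"
    "1,,tv_spot_2,valence,mean,-0.174,-0.197,0.117\n"
    "2,,tv_spot_1,arousal,mean,0.101,0.099,0.050\n"
), index_col=0)

# Mean-of-means valence per phase; an empty tag column means "all sessions"
rows = agg[(agg["measure"] == "valence") & (agg["stat"] == "mean")]
print(rows[["phase", "mean"]])
```

The mean column of the two remaining rows holds the final numbers to compare: one for tv_spot_1 and one for tv_spot_2.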

# How do I get started with this?

The exported files definitely allow you to dig deep into your dataset, often far deeper than actually necessary for your project. Most of the time, a good start is to have a look at the numbers in the stats_aggregated.csv file. You can import this file into many applications like Microsoft Excel, which then allows you to easily filter the rows according to the column values. We are in the process of creating more detailed articles on how to apply the emotion analytics techniques for different use cases. In the meantime, if you need any help, don't hesitate to register for a free onboarding tour with our emotion analytics experts: Get your guided onboarding here.