# DataScience

## Contents

## Introduction

### General Objectives

This section deals with techniques for extracting meaningful information from a complex system, and specifically from air traffic management, by analysing sets of existing data. Complex systems are characterized by a large number of elements, interacting in a non-linear fashion; this results in a high importance of interactions, as they can lead to the emergence of behaviours at a macroscale that cannot be predicted by means of the analysis of each single element at a microscale. The ATM system dynamics are characterized by several features that are common to complex systems; among them, the presence of a large number of heterogeneous components, with different relationships happening on multiple temporal and spatial scales, and with the presence of adaptation and self-organization. Many of those characteristics could be modelled looking at the different datasets collected through the operation of the system. Tens of thousands of daily operations encapsulate important information on the actual system behaviour, the relationship between different phenomena and the cause-effect relationships that escape from pure theoretical analysis. Additionally, extracting knowledge from data becomes especially challenging due to the fact that some data is semi-structured or unstructured, some represent the same information in different periods without an obvious characteristic scale, some types of information are challenging to analyse (e.g. images, text, etc.), some datasets are imbalanced and require special classification techniques, some data is noisy and other datasets with uncertainty require ways to aggregate larger datasets into smaller ones (Zighed et al., 2009). It is expected that the knowledge discovery process would combine different techniques to unveil some of the complex system characteristics. From the pre-processing of the data to the visualization and interpretation of the results, different integrated techniques need to be developed in order to derive the knowledge from complex datasets available in ATM. Unveiling such relationships from raw data is the scope of complex data analysis. This includes three steps:

- Extracting the interaction and the relationship between two elements of the system, by means of analysing historical data (local analysis). In particular, this task frequently presents additional difficulty since the interactions themselves are not explicitly described by a set of data;
- Constructing system-wide representations, including all the microscale relations in a global picture; extracting meaningful knowledge from it (global analysis);
- Air transport evolves through different time scales. The analysis of the information needs to take into account those scales to understand if evolution is occurring due to short term fluctuations or long term trends. Separating the evolution of different elements throughout the different time scales might need specialized techniques.

Without being exhaustive in the presentation of the research lines, three research lines are presented as a way to structure potential research challenges on this topic: Local data analysis, global data analysis and evolutionary properties.

### Definitions

**Complex Systems** were defined in Section 0.2.1.

**Complex Data Analysis**: By Complex Data Analysis, we refer to any data-mining activity executed over data sets corresponding to a complex system, with the aim of extracting information about the structures of relationships between the elements of the system.

**Local Analysis**: With the term local analysis, we refer to the study of all those phenomena and characteristics that can be defined by analysing only some elements of the system.

**Global Analysis**: Global analysis refers to the study of those phenomena and properties of a system that can be understood only by analysing the system as a whole; therefore, macroscale is the typical scale in which emergent behaviours of a complex system may appear and may be studied.

**Relationship Between Elements**: A relationship between two (or more) elements is present when their dynamics are not independent, i.e. they share some common characteristics. Two main types of relationships can be defined:

- Correlation between systems A and B, when a given dynamic of system A corresponds to a similar (correlation) or different (generalized correlation) dynamic of system B.
- Causality, when the dynamics of a system (A) are forced by the dynamics of a second system (B), and therefore the former can be explained using knowledge about the latter.

While correlation is simple to detect in real data, causality cannot, in principle, be assessed. Yet correlation between two systems usually hints to the presence of a causality where a third system is affecting both of them.

### Scope

This document reviews some techniques that can be used to mine and analyse the relationships existing between the constituting elements of a complex system, by starting from the analysis of historical data sets. However, only structured alphanumeric digital data are within scope. For instance, techniques to analyse maps, audio files, video files, unstructured text (e.g. incident records) are not considered; even if they are very interesting, and can unveil relevant knowledge about the system, their analysis requires specific techniques that are out of the scope of the ComplexWorld position paper.

Techniques and specific software developments to pre-process data (e.g. data parsers) are needed to carry on research in this area, but documenting those are out of scope of this document.

Identification of the different data sources, understanding of specific potential relationships among these sources and identification of potential realities that might be analysed through them are within scope, although only some of them are documented inside the case studies section. These are intended as examples of the types of analyses that can be performed on complex systems, but the ComplexWorld position paper is not meant to be comprehensive in this context.

## Research Lines

### Local Data Analysis

#### Problem Statement

When dealing with complex systems, an important part of the information about them is codified (or embedded) within the relationships between the different elements of the system. Two levels of relationships can be defined:

- Physical connection between the elements. These connections are created by the normal operation of the elements, and are pre-defined (or hard-wired) in the system. For instance, an aircraft operating a flight between two airports has two physical connections with each one of them.
- Functional connections. These connections are indirectly created by the dynamics of the system, but were not explicitly built into the system. Following the previous example, when two airports are involved in a flight, their free capacities are linked - if the departure airport is saturated, and the aircraft cannot take off, the destination airport gains a free slot.

While physical connections are usually simple to identify, in the sense that information about them is usually available, functional connections are more interesting. Indeed, emergent behaviours appear in this last layer, and advanced data mining techniques are needed to define and control them. Functional relationships have some inherent characteristics that make their assessment difficult. First of all, they may be (and usually are) not linear: the dynamics of an element may have unexpected effects on a second element. Second, they are noisy, both for dynamical noise (i.e., the dynamics of an element is not the same under the same conditions) and for observational noise. Finally, extreme events may appear, for which little information is available. Finally, it should be noted that two main types of relationships may emerge:

- Correlations, i.e., situations in which the evolution of the dynamics of two elements share some common features. For instance, the capacity of two airports, in close proximity, may evolve coherently when some adverse meteorological phenomena appears nearby;
- Causality, i.e., when the dynamics of one element is driving the evolution of a second element. As an example, the delay at take-off at one airport may cause capacity problems in a second airport.

#### Literature Review

##### Correlations

**Linear dependence**
The most simple relationships that can appear between two features are linear dependences, i.e., situations in which one feature can be described as a linear function of a second one. The Pearson product-moment correlation coefficient (typically denoted by r of \sigma) is the most well-known measure for linear dependence between two variables X and Y. It is defined as:
\sigma = \frac\{ cov(X, Y) \}\{ S _X ^2 S _Y ^2 \}
where S _X ^2 and S _Y ^2 are the standard deviations of variables X and Y. \sigma ranges from ?1 to 1, where a value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases; a value of 0 implies that there is no linear correlation between the variables.
While this metric is widely used to detect relationships between elements of a data set, it has to be noted that it only captures linear relationships, which are not the main characteristics of complex systems. In what follows, other techniques are reviewed, designed to detect more general forms of dependences.

**Synchronisation likelihood**
The concept of synchronization likelihood (Stam and van Dijk, 2002) has been specifically developed for the analysis of chaotic systems, where linear relationships are not relevant, and more complicated forms of dependence should be detected. Given two dynamical systems, A and B, the synchronization likelihood tries to answer to the following question: when the system A follows a given dynamic, does system B always answer with the same dynamic? Note that there is an important difference with a simple correlation: the dynamics of system A and B may be completely different and uncorrelated (due to the non-linearity between them); the important point is that the same reaction in B corresponds to a given action in A.
The process for calculating the synchronization likelihood is described as follows. Suppose that the data set under analysis is composed of two variables, A and B evolving with time, thus generating two time series a and b. Firstly, series a is analysed in search of relations between the values of different time windows. In other words, the existence of time windows with similar dynamics is assessed, usually by means of a linear correlation. Afterwards, the same analysis is realized for b. As a last step, results for both time series are compared, and the existence of a dependence between them is assessed.

**Mutual Information and Conditional Mutual Information**
Mutual Information is an information metric that measures the information that two systems, X and Y, share, i.e. it measures how much knowing one of these variables reduces uncertainty about the other (Guiasu, 1977). It is calculated by assessing the entropy (usually Shannon entropy) of the conditional probability distribution of the two systems. Mathematically, this is expressed by:
I(X; Y) = H(X) + H(Y) - H(X,Y)
where H() represents any type of entropy.

A step forward is represented by the Conditional Mutual Information, which expresses the mutual information of two random variables conditioned on a third: I(X; Y|Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)

##### Causality

Classical statistical instruments, like, for instance, correlation analysis, are only able to assess the presence of some common (equivalent) dynamics between two or more systems. However, we know that correlation does not imply causality. Even if it is theoretically impossible to assess the presence of causality in a data set with full precision, several tools try to solve this problem. All of them define causality from the same basic assumptions: • Causes must precede their effects in time; • The dynamics of the causes should be included inside the effects' future, and should describe their evolution. Below, two main techniques for detecting causality are described: Granger Causality and the Permutation Entropy Causality.

**Granger Causality**
The Granger Causality (Granger, 1969; Hoover, 2001) is an extremely powerful tool for assessing the information exchange between different elements of the system, and understanding whether the dynamics of one is led by the others. Yet, it should be noticed that its implementation is extremely complicated, and its assessment has a high computational cost. The method described below, i.e., the Permutation Entropy Causality, partly solves both of these drawbacks, and is thus preferable for the analysis of large data sets.

The method for calculating the Granger Causality is here described. Let us suppose that we are analysing two time series, X and Y, and that we are specifically trying to assess the presence of a causality going from the latter to the former: in other words, is Y is driving the dynamics of X? The first step involves the use of a standard forecasting technique to the time series X, i.e., we try to forecast one of the values of the time series by analysing all the values available before that moment in time. In the second step the same process is performed, but including also data from the other time series Y. Finally, both forecasts are compared: if the one obtained by using information about Y is better than the other, a relation between both elements does exist.

There are two features of this method that should be underlined. First, Granger Causality is not symmetric: the analysis of Y given X may return different values than the analysis of X given Y. While this property is usually observed in correlation analyses, generally speaking two systems are not expected to drive each other mutually; this situation cannot be detected through Granger Causality. Second, the other techniques described in this document try to assess the presence of relationships in the simultaneous dynamics of different elements. Conversely, Granger Causality tries to identify the influence that one element may have on another, even if there is an unknown delay in the reaction.

**Permutation Entropy Causality**
Permutation Entropy is an information metric that quantifies the complexity of structures present in a time series, thus characterizing with precision the characteristics of the underlying system (Bandt and Pompe, 2002; Zanin et al., 2012). Specifically, it is based on the calculation of the frequency of appearance of permutation patterns, that is, permutations defined by the order of relations among values of a time series. Among the different applications developed in the last decade, some of them have focused on the problem of detecting causality relationships between the evolution of different elements of a system.

A test based on permutation patterns was proposed by Matilla-García and Ruiz Marín (Matilla-García and Marín, 2008). Instead of creating normal permutation patterns for a 1-D time series, they consider a two-dimensional time series, e.g. W = (X, Y). To each subset of W, a bi-dimensional permutation pattern is assigned, and the frequency of appearance of all 2-D patterns is assessed; if the probability distribution does not asymptotically follow a $\chi^2$ distribution, both time series X and Y are not independent. This, combined with an analysis based on Conditional Mutual Information, allows an assessment of the existence of a causal relation between X and Y (Bahraminasab et al., 2008).

##### Symbolic Data Analysis

In the classical approach to data mining, a system is usually composed of a set of features organized in a matrix, where each individual takes one single value for each one of the different variables describing the system. When calculating correlations and causalities as before, it is assumed that the system fulfils this characteristic. Yet, in many practical situations, this representation is not enough to represent the complexity of the system under study, its variability, and the degree of uncertainty associated with each measurement; or the system is simply too big, and its features cannot be directly stored in the memory of a computer, nor directly analyzed. In these cases, it is useful to manipulate more complex features, like, for instance, intervals, distributions, or even several values linked by a taxonomy and logical rules: collections of such complex features are called symbolic data. The family of techniques extending the traditional data analysis to handle these structured data is called Symbolic Data Analysis (Brito, 2007; Diday, 2002). Symbolic Data Analysis enables more sophisticated studies with respect to traditional Data Analysis: for instance, it is possible to study the relationship between the probability distributions of delays of several airports and classify them in groups, without the need of analyzing each single flight. Relationships may include the detection of clusters (Gowda et al., 1991; El-Sonbaty et al., 1998), regressions (Neto et al., 2008), or Principal Component Analysis (Lauro et al., 2000), among other.

#### Research Challenges

**Assessment of correlations and causalities.**
While the problem of detecting correlations and causalities has been extensively studied in recent decades, this has usually been done for pairs of time series, each one composed by a set of values. Yet, in the context of the air transport system, it is usual to work with multivariate time series.
The most interesting example is trajectories, i.e., sets of two (or three) elements evolving with time; furthermore, it should be noticed that these components do not evolve independently, but their dynamics are strictly linked by the physics of the flight.
Developing techniques to assess the level of correlation (or causality) between different trajectories is therefore a relevant research challenge, which may have important applications, for instance, in safety analysis - see Section 3.3.1.

### Global Data Analysis

#### Problem Statement

As described in Section 3.2.1, there are several interesting techniques to extract the relationships between elements of a complex system, which are hidden in historical data. Yet, this should be seen as the first step only. Emergent behaviours in complex systems usually arise from complete patterns of relations between elements. In other words, it is not enough to assess the presence of a single relationship, but the whole structure should be studied. Furthermore, there is a need to synthesize the properties of this structure in simple numbers (i.e., metrics, or features) that can be easily understood. Up to now we have considered that all elements composing the system are homogeneous, or represent similar objects. In an aeronautical context, they can be aircraft, airports, passengers, and so forth. Most of the real systems, however, are composed of heterogeneous elements interacting between them. While the former situation can be seen as a set of elements interacting on the top of a single layer, the latter requires an extension to multiple layers, each one of them interacting with the others. An example of this will be shown in Section 3.3.1.

#### Literature Review

Various scales (micro, meso and macro) were introduced in Section 0.2.2. We now explore some corresponding metrics.

##### Microscale Metrics

**Number of nodes and links (N and M)** - They are defined by the cardinalities of the set of nodes and links composing the network.
**Link density (d)** - The link density is defined as the proportion of links present in the network, with respect to all possible links. This is given by d= M/N(N-1).

**Entropy of the degree distribution (Hdd)** - As seen in Section 3.2.2.1, it is usually useful to assess whether links are uniformly distributed across nodes or, on the contrary, few nodes act as hubs of the system, centralizing most of the connections. Information about the presence of highly connected nodes can be obtained through the degree distribution associated to the network, that is, a function P(k) expressing the fraction of nodes with a degree of k. Given a probability distribution, the entropy measures the expected quantity of information codified in it, or the uncertainty that we have about one missing value of the variable under study (Shannon, 1948). In this context, this measure quantifies how links are distributed across nodes, and has been related to the robustness of networks, i.e. their resilience to attacks (Wang et al., 2006).

**Maximum degree (Kmax)** - An important feature that can be extracted from the degree distribution P(k) is the maximum degree of nodes, i.e., the degree of the most connected node inside the network.

##### Mesoscale Metrics

**Clustering coefficient (C)** - The clustering coefficient, also known as transitivity, measures the presence of triangles in the network (Newman, 2001). A high clustering coefficient has been historically associated with social networks, where it means that "the friends of my friends are also my friends". Mathematically, it is defined as the relationship between the number of triangles in the network, i.e., a set of three nodes with one link between each pair.
Motifs - In real world networks, it has been found that some sub-graphs, i.e., specific structures created by a small number of the original nodes, have a higher than expected probability of appearance. When such probability is statistically relevant, these sub-graphs are called motifs (Milo et al., 2002; Milo et al., 2004). In other words, motifs can be seen as Lego blocks, with which the complex network is constructed. The presence (or absence) of some motifs is tightly related with the function or process performed by the network. For instance, triangular motifs (i.e., representing a loop in the flow of information) are used to create short time memories in neural and protein interaction networks (Milo et al., 2002).

##### Macroscale Metrics

Average geodesic distance (l) - Considering a network as a transportation system, the number of links included in the path connecting two nodes is called the length of such a path. Of special interest is the shortest path, also called the geodesic, between two nodes, and defined as one of the paths connecting these two nodes with minimum length (notice that more than one geodesic path may exist). The average geodesic distance is defined as the mean of all possible geodesic distances. Efficiency (E) - A problem with the definition of the average geodesic distance is that it diverges when the network is not connected, i.e., there are nodes which are not reachable. Although this can be solved by only taking into account the group of strongly connected nodes, this also introduces distortions in the metric, and situations with two independent and separated communities of nodes are not correctly handled. To overcome this situation, the measure of efficiency was proposed, defined as the mean of the inverse of the distances (Latora et al., 2001). The efficiency assesses the performance of the network in sending information. Small-worldness (S) – this was described in Section 0.2.1. More formally, it is defined as (with both C and l normalized with respect to random networks) (Humphries et al., 2008). When S is greater than one, the network is considered to be ‘small-world’.

#### Research Challenges

**Adaptation of metrics by means of a macroscale data analysis.**
As we have seen in Section 3.2.2.2, plenty of metrics have been developed in recent years to characterize the properties of complex networks, from a macro-scale, meso-scale and micro-scale point of view. Yet, most of this work has been performed through a top-to-bottom approach: that is, most metrics have been developed from a theoretical perspective, without considering the specificity of real-world applications.
For instance, one can assess the connectivity of a network by means of its efficiency (see Section 3.2.2.2 for its definition). While this measure is meaningful for a communication system, its applicability to the air transport system is reduced. The number of links (i.e., flights) needed to go from one airport to another one is not enough to characterize the mobility of passengers, as many other elements (duration of flights, availability of connections, delays, different airlines and alliances) should be taken into account.
Therefore, it is necessary to adapt the existing metrics to each problem studied by means of a macroscale data analysis, in order to improve the meaningfulness of obtained results.

### Evolutionary Properties

#### Problem Statement

The characterization of the topology and of the dynamics developing on top of a network, at both macro and micro-scales, is critical for the understanding of any complex system, with examples spanning from biological to social systems (Boccaletti et al., 2006). In the specific case of air transport, different network structures, ultimately given by the connections built by the airlines and other airspaces users, lead to a number of important properties, both inherent to the network and with implications in the way mobility as a service is provided to the final users. However, those macro and micro structures are far from being static. Different factors caused the network to evolve, varying significantly both the inherent properties as well as the mobility provided in Europe, e.g. seasonal variability in the short term, up to legislation changes in time windows of decades. In other words, the air transport system is not stationary, in that it is a system in a persistent out-of-equilibrium condition, whose characteristics are continuously varying with time. Due to this, the primary analysis of the topology and dynamics of the system should be complemented with information about the evolution of the same. This would provide additional information about the rate and type of the new connections of the air transport network, and therefore it would represent a better technique to produce a forecast. There are several aspects that should be taken into account: • The kind of relational data to be analysed, i.e., it is necessary to identify the type of network representation most relevant for the forecasting problem under study; • The identification of the scales (both temporal and spatial) characteristic of the evolutionary process; • Derived from the previous point, identification of the time windows in which the system can be approximated as stationary; • The choice of the best metrics to represent such evolution. Additionally, when forecasting the evolution of the network, different scenarios should be considered, requiring different data layers to be fed into the models towards determining the growth of the network. A given topological element can favour the evolution of the network towards certain structures: understanding how to extract information through complex data mining techniques supporting network analysis is the subject of this research line. It must be noted that such analysis is not restricted to the forecasting of traffic in the network. Beyond traffic growth, the analysis of the structure of the network could also provide insights into other metrics that are very relevant to the air transportation network, especially those that are non-trivially dependent on the dynamics of the system. For instance, a definition of mobility is associated with the concept of connectivity, but a coherent definition based on data is needed in this context; or forecasting fuel consumption, which is linked not only to traffic growth, but also to the number of connections that passengers need to make to arrive at their destination. Regarding this last example, the capability to describe the evolution of different network structures based on data could provide insights on the potentials for the industry to reduce fuel consumption while maintaining mobility, or, in general, the sensitivity of mobility to fuel consumption through an analysis of the connections evolution.

#### Literature Review

One of the most important problems tackled in the literature associated with complex networks is their evolution. Specifically, an important effort has been devoted to the understanding of the mechanisms and forces driving such evolution, both from a theoretical (see, for instance, Barabási et al., 1999) and from an applied point of view (Gautreau et al., 2009). The underlying idea is that different "natural" networks evolve, maximizing their chances of survival and adapting their configuration to the environment. Human-engineered networks have not necessarily grown through evolutionary adaptation, however, research indicates that the design principles could be similar (Shin et al., 2009). The evolution of the different properties for the air transport network, in particular link density (design cost) and efficiency (network congestion) have not been deeply researched on long temporal scales. While there are different algorithms to grow network structures to prevent congestion with minimum design costs, there are no evolutionary optimization models that provide guidance on the best structures for the transport network in its evolution. The air transport network presents a structure in which the centrality of hubs is not stable. Aircraft technology, state regulations and evolution of market demand are found to be the main variables that affect the centrality of hubs (Bowen, 2002). While globalisation and liberalisation fuelled the expansion of the network and its densification, it is not clear at this point what the regime is of the evolution of the network structure. Over decades, new business models, infrastructures, shifts in market demand, aircraft technology, international regulations (Bowen, 2002) and externalities, like fuel prices or airport subsidies, have caused the network to evolve, varying significantly both the inherent properties as well as the mobility provided in Europe. Additionally, historical "accidents" arising from geographical, political and economic factors have played an important role in the network evolution (Guimera et al., 2008). Connectivity and interconnectivity metrics are also evolving through time. Some metrics provide insights on the connectivity, like the Shimbel index, but important drawbacks lead to the need for new "effective affordability" indexes (Bowen, 2002).

#### Research Challenges

**Optimizing the discretization of evolving data.**
Although most of the researchers use OAG as the main source of data, it is not clear what is the right time step to be chosen to compute the different indicators and its evolution. For instance, several authors pick few years (e.g. less than five) and group the data in two seasons to derive the appropriate metrics of the network. It is not clear if evolutions are to be considered within those time scales or analysing the evolution of the network requires longer time series.
Several factors are considered in the literature as important influences in the network evolution, like new business models, infrastructures, shifts in market demand, aircraft technology, international regulations and externalities like fuel price, but there are not coherent datasets compiled to be considered as potential inputs for network evolution research.
The influence of these factors in network structure evolution has been researched through correlations, but there is no coherent framework for this either.

**Connectivity and interconnectivity.**
Modern network theory provides an interesting foundation for the study of transportation systems in general. However, it should be taken into account that this theory was initially developed from an abstract perspective, i.e., without the aim of modelling real-world problems. Therefore, most indicators, metrics or indexes currently available in the literature ignore fundamental aspects, such as, for instance, the connections schedule. Clearly, such elements are an essential part of transport networks, and therefore additional research needs to be done in order to evolve network theory indicators to incorporate such aspects. Additionally, while connectivity has been extensively studied (even if only from a theoretical point of view), interconnectivity, understood as paths connecting two nodes through a third, has not received enough attention, especially taking into account schedules. Both connectivity and interconnectivity evolution could provide powerful insights in to foresighting techniques.

## Case Studies

### Operational Safety from a Complex Data Analysis Perspective

**Introduction**
Safety is one of the most important requirements for ATM. In spite of the importance of safety for ATM, there are two main challenges that should be tackled in the future.
Firstly, there is an important lack of safety metrics. While most reports usually count the number of unsafe events, and how they have evolved from the last analysis, it is clear that this is not enough to really understand the safety characteristics of the system. This is because incidents and accidents are extremely rare events, and a good metric should take into consideration safe situations that may evolve into unsafe conditions.
Second, plenty of information about the safety of the system should be encoded in historical operational data. For instance, patterns that may develop into unsafe situations; records of decisions taken by pilots and ATCOs; interactions between procedures and unexpected meteorological conditions; and so forth.
In