# Complex systems and data science for air transportation

Seddik Belkoura/UPM

## Contents

## PhD Candidate and Supervisors

PhD student: Seddik Belkoura, UPM

Supervisor: Jose Maria Perez, UPM

Co-supervisor: Massimiliano Zanin, Innaxis

## PhD title

Complex Systems and Data Science: Towards a new Perspective for the Understanding of Air Transport

## Introduction

Many business functions have entered in a new era of decision making thanks to the growth in the abilities to store, access and analyze data in the last two decades. Many companies are focusing now in managing information streaming from their different business activities and developing new information analysis techniques. The idea of extracting some knowledge from a set of data is not recent, but has existed, in its manual form, since the beginning of civilization. For instance, one may think in the analysis of the astronomical observations performed by Johannes Kepler in the 16th century. Yet, as said before, the increasing storage, accessibility and manipulation abilities (mainly due to the drastic growth of computing technology) has exponentially accelerated the need of data analysis across a still expanding number of fields.

Air Transportation (AT) is surely one of those fields, in that the quantity of data collected is tremendous, even if their quality and usefulness may be debatable. Indeed, the future Air Transportation Management (ATM) system will consist of a large number of (both human and automated) agents at different levels of the ATM process, which collaborate in a sophisticated and resilient manner in order to achieve an optimal performance with minimal chance for hazardous events. This clearly implies a growing number of generated data sets, containing more or less knowledge about the system. To study the intrinsic characteristics and the emergent behaviors associated to this network, this PhD proposes to exploit simultaneously the complex system theory and what can be generically called Data Science. The latter comprises techniques from, and sometimes overlaps with, different scientific disciplines, like computer science (data compression, computer programming...), statistics (multivariate testing, cross-validation, stochastic processes, sampling, ...), machine learning, data mining, operational research and business intelligence. The result of this study may be used, among others, to increase awareness of weaknesses and bottlenecks in the organization (resilience of the system, optimization of the control upon the delay propagation), and to develop more effective training methods for human operators as well as more effective automated systems to forecast safety-related events. Throughout this Thesis, several concepts and techniques drawn from data mining will be used. Yet, as mentioned before, data mining is a particular step, involving the analysis of data, of Data Science, the latter being a set of fundamental principles that support and guide the extraction of information and knowledge from data. Actually, a more accurate name for the principles used would be, in a more general concept, Knowledge Discovery in Databases (KDD), referred nowadays as Big Data Analytics or just Data Analytics. The name Knowledge Discovery in Databases (KDD) was introduced in the first KDD workshop in 1989, and since then it has become popular in the artificial intelligence and machine learning fields. KDD has been defined as the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. In other words, KDD is a process that transforms a set of raw data into other representations that might be more compact, more abstract, or more useful. This non- trivial transformation of the data implies both statistic knowledge and the use of databases visualization or machine learning and, contrary to the popular image, does not merely consist on the data mining process: it refers to the overall analysis, from the data preparation to the final decision making. The aspects of scalability and user-driven design had made evolve this concept towards Data Mining and Big Data Analytics.

In the ATM context, improving decision making through Data Science or enhance the understanding of the system through Big Data Analytics has a great potential for applicability, since the massive number of inter-connected parameters in the system enlarges the complexity of the network, and therefore, make it harder to have robust quantitative risk assessment or complete functional understanding.

## Problem Description and Motivation

In the last decades, many real-world systems have been recognized as complex, i.e. being composed of multiple elements mutually interacting in a non-linear way (Anderson, 1972). Far from being a mere theoretical classification, complex systems are characterized by a set of features that make difficult their management, thus creating the need for specific tools and techniques for their analysis. One paradigmatic example is the appearance of Self-Organized Criticality (SOC), that is, the fact that the system may naturally end in a condition in which small perturbations may trigger larger events, as for instance cascading failures ^{[1]}

If one is to characterize the complex nature of a system, e.g. the presence of SOC, two different approaches are possible. The first one involves the construction and analysis of synthetic models, trying to capture the fundamental aspects of the system, but avoiding unnecessary details. Probably the most famous example of such approach is the Ising Model ^{[2]}, initially developed to explain the existence of ferromagnetic materials, but since then applied to explain systems ranging from social science ^{[3]} to financial markets ^{[4]}, and neural dynamics ^{[5]}. Yet, there are situations in which the creation of simple models is not possible, both because the system under analysis is too complex, or because of a lack of understanding of its mechanics. In these cases, the alternative is the study of data extracted from the system, e.g. representing its dynamical evolution, with the aim of detecting marks associated with specific characteristics. Let us consider the example of the study of financial markets: long before the appearance of models describing investors’ behavior ^{[6]}, landmarks of SOC were detected through the study of price returns, specifically of their power-law probability distribution ^{[7]}.

Air Transport and Air Traffic Management have been claimed to be complex systems, in that their dynamics is determined by the interactions of a plethora of elements, e.g. aircraft, controllers, procedures, and organizations (ComplexWorld Wiki). Nevertheless, their study has been mainly focused on macro-scale characteristics, as for instance patterns created by connecting flights ^{[8]}, with less attention devoted to micro-scale dynamics. Specifically, no one has tried to systematically analyze the appearance of operational events, like safety-related situations, from the point of view of statistical physics and complex systems - one of the few examples can be found in ^{[9]}. Two are the main reasons behind this. On the one side, the complexity of the ATM system precludes the possibility of creating simple models; furthermore, on the other side, the analysis of real data is made difficult by they heterogeneity and incompleteness. About this latter point, for instance, it is known that information about aircraft trajectories over the European airspace is incomplete, as flights may be missing, and heterogeneous, with spatial and temporal resolutions varying across Europe ^{[10]}.

Starting from this general framework, this thesis aims at analyzing the macro- and micro-scale characteristics of the air transport system from a complex system approach, using a data-driven methodology. Therefore, two main hypothesis will be tested and used: firstly, that AT and ATM are complex systems, in which the interactions between elements are as important as the characteristics of the elements themselves; secondly, that historical data can be used to extract relevant features of the system. This will be accomplished through the reconstruction of the mechanisms behind the appearance of several emergent behaviors, and the design of strategies for improving its performance. Such a comprehensive approach has never been tried in air transport research, and this in spite of the multiple successful examples that can be found in the Literature: for instance, in finance (Mantegna and Stanley, 1995), neuroscience ^{[11]}, medicine and systems medicine ^{[12]} ^{[13]} or climatology ^{[14]}.

The fulfillment of the objectives of this thesis is expected to provide benefits in several key areas of air transport. Among them, one of the most important is safety, through the analysis the mechanisms behind the appearance of safety-related events, and through the creation of a new paradigm for their understanding, not based on fixed rules and procedures, but on the dynamical analysis of data generated by the system. Furthermore, benefits are also expected in the understanding of the resilience of the system, i.e. the system's capacity for blocking shock propagations; through the study of the appearance of non-stationary, non-ergodic and non-Poisson statistical processes, it should be possible to design optimized strategies for reducing delay propagations and improve resilience, as has been the case for epidemic spreading ^{[15]}.

## Aims, Objectives and Expected Outcomes

As previously described, this thesis aims at analyzing the macro- and micro-scale characteristics of the air transport system from a complex system perspective, using a data-driven methodology. Following the customary way of analyzing complex systems, such aim will be tackled at three different scales: dynamics, topology, and control and forecasting. Thus, the analysis will first start by studying the dynamical behavior of each one of the element of the system; next, the dynamics of multiple elements will be put in relation, thus analyzing the emerging structure of interactions; finally, this knowledge will be used to forecast the future behavior of the different elements and to design strategies for controlling the system, i.e. to improve its performance.

*Analysis of the complex dynamics of the system*

As a first step, it is necessary to study the dynamics of each single element of the system, as this will provide clues of the presence of complex behaviors. As previously introduced, there are several landmarks that are known to appear when studying time series, which are related with the presence of emergent behaviors: inverse power-laws probability distributions ^{[16]}, non- stationary and non-equilibrium dynamics ^{[17]}, and non-Poissonian statistical processes ^{[18]}. It is worth noticing the importance of such characteristics: for instance, it is difficult to forecast a non-stationary time series, and the detection of causality relations between pairs of them is an ill-posed problem.

*Analysis of the complex topology of the system*

Once the individual elements have been characterized, it is possible to go one step ahead by analyzing how they interact with each other. The reader should notice that this is essential for understanding the global behavior of the system on the one side, and the behavior of the single elements on the other. For instance, the propagation of delays across different airports is strongly determined by the patterns created by flights and passengers; also, safety events are seldom the consequence of a single cause, with multiple elements usually interacting towards a negative outcome ^{[19]}. As it is not simple to identify those connections and interactions, as they are not explicit defined in the system but are instead emergent behaviors, this thesis will rely on data analysis for their assessment.

*Predictive analytics: control and forecasting*

After understanding the elements composing a system and their mutual interactions, the researcher usually tries to improve the dynamics of the system as a whole, both by forecasting and controlling it. In this thesis, this problem will be addressed by means of predictive analytics tools.
Predictive analytics employs both a microscopic and telescopic view of data, allowing us to see and analyze the details of a system, and aimed at peering at its future ^{[20]}. From a technical perspective, predictive analytics includes a set of techniques designed to determine the probable outcome of a future event, or the likelihood of a specific situation, by automatically analyzing large amounts of data with different variables. For example, a predictive analytics tool for supporting air traffic management could consider predictors like traffic congestion and weather, to determine the possible occurrence of a safety-adverse event. Among the relevant techniques, we may mention clustering, decision trees, market basket analysis, regression modeling, neural nets, genetic algorithms, text mining, hypothesis testing, decision analytics, and more.

## Research Methodology

The research work that is going to be performed in this thesis will leverage on three main pillars: data gathering, data analysis, and validation.

First of all, different sources should be filtered and merged. In spite of the common belief, this is not a straightforward task in the air transport field. The interested researcher can have access to a large variety of data sourced, which in turn leads to a diversity of formats, update rate dispersion, and geographical and temporal diversities. Expecting the availability of a single, clean, structured data source against which proper data science activities could be run is not realistic, and this diversity should duly be treated. Also, efficient ways of storing the high quantity of information involved should be designed, in order to optimize any subsequent analysis; and data mining algorithms should be adapted to handle the volume of information, possibly with the help of distributed and on-the-cloud computation paradigms.

Once data have been properly formatted, its analysis will be performed by means of the techniques provided by the field of complexity science. Among others,Complex Networks have been successfully used in the detection of relevant structures inside ATM data ^{[21]}; yet, the study of time-evolving ^{[22]} or multi-layer ^{[23]} graphs has been still neglected. Beyond that, several other techniques, drawn from different fields, will be tested. For instance, the concept of causality has been largely studied in economics, and several techniques have been developed to detect causality between different elements of a system ^{[24]}; and the concept of synchronization, deeply studied in statistical physics, has been recently applied to ATM ^{[25]}.
Finally, it is of utmost importance to correctly validate any result obtained. It is recognized that any validation process should include analyses of both accuracy, reliability and usefulness. While the first one is usually straightforward, as just requires the analysis of the correctness of the obtained forecasts, the latter two will be tackled through different means. Specifically, the reliability will be ensured by the use of multiple and well-known public data sets, or of data sets already used in peer-reviewed publications. Similarly, the usefulness of the obtained results will be assessed by experts in the pertinent areas.

## On-going Projects

### Project 1: Dynamic Mode Decomposition

[60% finished]

On-going writing of a publication about the use of a mathematical decomposition to study Traffic patterns.
Abstract: In the continuous effort for ensuring increasing comprehension of air transportation, it is of utmost importance to understand where is located the information in this complex system. In the literature, network analysis have been extensively used in others fields to study the dynamical characteristics of a dynamic complex structure. Zanin ^{[26]} has examined the air transportation system as a complex network with a large number of nodes interacting in a non-linear way. In this PhD, the use of a specific mathematical tool is proposed for the study of such time-resolved dynamical complex network. As a concrete example, dynamic mode decomposition (DMD) is a data-decomposition technique that allows the extraction of relevant flow structures information from experimental data. It was extensively used in the field of fluid dynamic ^{[27]} for the study of turbulent flows of laminar water jets. The method locate the useful information to build the best linear map from one snapshot of the network to the other. The representation of a non-linear process by a linear sample-to-sample model may allow to de-trend the complex structure of all the periodic and linear relations within the network and allow a more thorough study of the remaining information. Hence, this two-parts project aims, in a first time to locate where the 'linear' information that describes the best the complex system is located, to use in a second time the non-linear and stationary part to manage a better forecasting power.

### Project 2: Delay Metric

[70% finished]

In parallel to the previous effort, a closer look have been taken into delays metrics. First intuition have been that delays where not normally distributed and that, this could jeopardize the relevance of the average mean, heavily used when tackling delay dynamics. Here is a brief description of what have been already done: in the continuous effort for ensuring increasing levels of efficiency, it is common to track the evolution through time of some performance indicators (PI). Those indicators are intrinsically linked to the studied feature’s distribution and it is of the utmost importance to chose the right PIs before monitoring the evolution of a system. One important feature in air transport and ATM is delay and a number of studies have used the average delay value to analyze the system’s dynamics. However, none has endeavored to investigate the entire (en-route) delay distribution to assess the relevance of the used indicators. In this first part of the project, various distributions have been suggested, aimed at demonstrating the incomplete knowledge about the delay system provided by the average value. Especially, results indicate that first order metrics such as average or standard deviation yield to erroneous or, at best, non- controlled conclusions about the evolution of the delay system. More generally, the found t-location-scale distribution suggests a sampling construction of the delays that naturally explains the heavy tail without resorting to extreme events occurrences.

This part has already been presented in The Fourth SESAR Innovation Days (November 24-27, 2014) at the Universidad Politecnica de Madrid, where we presented both a Poster about Delay metrics and a talk during the PhD session of ComplexWorld.

This project has evolved: Following the line drawn by previous work, we make a first step towards the understanding of en-route delays by presenting a computational methodology to assess events generating en-route delays based on the comparison between planned and realized trajectories. The basic idea is to synchronize planned and realized trajectories of an aircraft to have both its theoretical (planned) and real position at the same time of observations. Then, an absolute measure is considered looking at the distance to arrival for both trajectories making a direct comparison possible. We then consider a measure in that a growth or decrease of its value correspond to a positive or negative delay with an a priori rule that can be used to discern between both situations. We obtain an algorithm that counts the number of occurrences, i.e. of delay-generating events, in which the delay generated is superior to a given positive threshold (for a loss of time) or inferior to a negative threshold (for a recovery of delay equal to absolute value of the negative threshold). At this point, it is easy to re-construct the probability distribution of the amplitude of a delay caused by a single event for any flight, so we started thinking about useful applications. With adequate data, applications are numerous. Comes to mind the research of causal relations between specific ATM/ATC procedures and the generation of en-route delays; the development of a new route pricing for airlines taking into account the costs engendered by the delays probably generated within this route. Due to the limitations caused by the availability of data, we had to find another type of application. Specifically, the basic idea is to assess whether the system is able to compensate on average for the magnitude of a positive event, i.e. a delay-generating event, by a negative one of the same magnitude. In other words, we use the above described micro-scale delay analysis to define the resilience of the system in its en-route part, i.e. independently from delays managed or generated on the ground. The ATM system might be qualified as resilient if, for each event generating a positive delay, the system is theoretically able to create an event recovering that perturbation. The resiliency here does not concern each agent, or in other words, each flight, but the whole system. It does not mean that each flight is able to compensate for the generation of a positive delay, but that considering all flights, there has been a recovery of delay of the order of magnitude of the positive delay generated. This analysis is then performed across two dimensions. A first temporal analysis is performed assessing the resilience of system day by day. A second analysis tackles the problem spatially assessing the resilience of the diverse region air space considering the whole time range.

We are now closing a paper (that will be uploaded once published) and thinking about the futur steps of the project.

### Project 2: Network Topology

[70% finished]

A new idea enabled us to make a first step towards the understanding of the topology of air traffic network. The aim of such study is to warn the community about the risks of obtaining biased results while studying the air transport system with complex networks and suggesting a specific sampling process minimizing those errors. Indeed, it is increasingly common to study the air transport system through complex network theory, by defining static or dynamic structures to characterise how airports are connected. However, the construction of these structures is generally simplified by discarding a fraction of the network, e.g. minor airports or airlines, in order to reduce the computation cost or simply because of the limited availability of data. One open question is whether and how such sampling processes affect the topological stability of the representation, by biasing the observed structures and processes. To respond this problem, in this work we study how some observed topological metrics depend on the sampling strategy, i.e. on the nodes and connections used for the network representation. In this contribution, we study the behaviour of several static properties and some simple dynamical ones, for eight different air networks, including the European and USA air transport systems, using different source datasets. Results indicate that the most commonly used sampling process in the literature, i.e. the selection of a subset of the biggest or the most connected airports from the full system, is not a good strategy. Due to the sensitivity of the network topology to secondary airports, almost all airports should be included to maintain the representation error below 10%, suggesting that a sampling process based on only including highly connected airports should be avoided whenever possible. At the same time, considering a subset of airlines yields a better and more stable representation; this suggests the existence of a coherent and optimal sampling process, which effectively retains the structure of the whole system and stabilises the metrics, while using a small subset of the original nodes. We propose a candidate for such optimal sampling: an aposteriori process based on the minimisation of the error associated with the entropy of the degree distribution of the nodes of the network. Indeed, it appears that the degrees of the airports that can be discarded without perturbing significantly the value of some metrics (here, we used together the clustering coefficient and the efficiency) are very heterogeneous. This degree variation (from 1 to 70 connections) corroborates the first results: discarding only the little airports is a bad strategy. It seems, on the contrary, that it is important to keep the distribution of large and small airport as uniform as possible, thus the strategy of minimizing the Entropy of the degree distribution of the nodes of the network. Although we need to know the final value of the Entropy, this method can be a good proxy to use for deciding a sampling strategy.

This have been presented as poster in the SIDs 2015, and we are also closing a paper on this topic.

## References

- ↑ (Bak et al., Self-organized criticality: An explanation of the 1/f noise Phys. Rev. Lett. 59, 381).
- ↑ Brush, History of the Lenz-Ising Model, Rev. Mod. Phys. 39, 883
- ↑ Schelling, Thomas C. 1971. "Dynamic Models of Segregation." Journal of Mathematical Sociology 1:143-186.
- ↑ Zhou, D. Sornette, 2007. A case study of speculative financial bubbles in the South African stock market 2003-2006. Arxiv preprint physics/0701171
- ↑ Hopfield, J. J. (1982) Neural networks and physical systems with emergent collective computational properties. Proc. Nat. Acad. Sci. (USA) 79, 2554-2558.
- ↑ Sornette, Didier. "Critical market crashes." Physics Reports 378.1 (2003): 1-98.
- ↑ Mantegna, Rosario N., and H. Eugene Stanley. "Scaling behaviour in the dynamics of an economic index." Nature 376.6535 (1995): 46-49.
- ↑ Zanin, Massimiliano, and Fabrizio Lillo. "Modelling the air transport with complex networks: A short review." The European Physical Journal Special Topics 215.1 (2013): 5-21.)
- ↑ Cardillo, Alessio, Zanin et al. "Emergence of network features from multiplexity." Scientific reports 3 (2013).
- ↑ Zanin, Massimiliano, Francisco Del Pozo, and Stefano Boccaletti. "Computation emerges from adaptive synchronization of networking neurons." PloS one 6.11 (2011): e26467.
- ↑ Bullmore, Ed, and Olaf Sporns. "Complex brain networks: graph theoretical analysis of structural and functional systems." Nature Reviews Neuroscience 10.3 (2009): 186-198.
- ↑ Goldberger, Ary L. "Non-linear dynamics for clinicians: chaos theory, fractals, and complexity at the bedside." The Lancet 347.9011 (1996): 1312-1314.
- ↑ Liu et al., 2006
- ↑ Donges, Jonathan F., et al. "The backbone of the climate network." EPL (Europhysics Letters) 87.4 (2009): 48007.
- ↑ Pastor-Satorras, Romualdo, and Alessandro Vespignani. "Epidemic spreading in scale-free networks." Physical review letters 86.14 (2001): 3200.
- ↑ Bak, Per, Maya Paczuski, and Martin Shubik. "Price variations in a stock market with many agents." Physica A: Statistical Mechanics and its Applications 246.3 (1997): 430-453.
- ↑ Zubarev et al., 1996
- ↑ Snyder and Miller, 1991
- ↑ Netjasov, Fedja, and Milan Janic. "A review of research on risk and safety modelling in civil aviation." Journal of Air Transport Management 14.4 (2008): 213-220.
- ↑ Siegel, 2013
- ↑ Zanin, Massimiliano, and Fabrizio Lillo. "Modelling the air transport with complex networks: A short review." The European Physical Journal Special Topics 215.1 (2013): 5-21.
- ↑ Holme, Petter, and Jari Saramäki. "Temporal networks." Physics reports 519.3 (2012): 97-125.
- ↑ Boccaletti et al., 2014
- ↑ Granger, 1969
- ↑ Zanin, Massimiliano. "Synchronization Likelihood in aircraft trajectories." Proceedings of the Tenth USA/Europe Air Traffic Management Research and Development Seminar, Chicago, USA. 2013.
- ↑ Massimiliano Zanin (2014). Network analysis reveals patterns behind air safety events. Physica A: Statistical Mechanics and its Applications; 401:201–206. DOI:10.1016/j.physa.2014.01.032
- ↑ Peter J.Scmid (2011). Application of the dynamic mode decomposition to experimental data. Journal of Fluid Mechanics / Volume 656 / August 2010, pp 5-28

## Publications and Presentations

- File:Poster_SID14.pdf Poster for The Fourth SESAR Innovation Days (2014)

File:Poster_SID15 v002.pdf Poster for the Fifth SESAR Innovation Days (2015)

Paper presented at the 11th USA/Europe Air Traffic Management Seminar (Lisbon, 23 – 26 June 2015) in the topic session “Network and Strategic Flow”.