\documentclass[
a4paper, % A4 paper format
titlepage, % with title page
oneside, % twoside => two-sided
parskip % with spacing between paragraphs instead of indentation
]{scrartcl} % KOMA-Script base class, see texdoc scrguide
\usepackage[USenglish]{babel}
\usepackage[utf8]{inputenc}
%\usepackage{ngerman} % German language, new orthography, see texdoc germdoc (?)
\usepackage[T1]{fontenc} % font encoding with umlauts
\usepackage{textcomp,amsmath} % math symbols etc.
\usepackage{graphicx} % include graphics
\usepackage{hyperref} % hyperlinks for TOC, glossary, etc.
%\usepackage{beramono} % Font for code listings (?)
\usepackage{listings} % Code snippets
\usepackage{xcolor} % Custom colors in code listings
\usepackage{enumitem} % Custom list formatting
% Listing color definitions
\definecolor{numbercolor}{rgb}{0.5,0.5,0.5}
\definecolor{keywordcolor}{rgb}{0.7,0.4,0.0}
\definecolor{commentcolor}{rgb}{0,0.6,0}
\definecolor{stringcolor}{rgb}{0.7,0.0,0.2}
% Scala code snippet styling
\lstdefinestyle{scalaStyle}{
language=scala,
morekeywords={String, Long, UUID, Map, Nil, Date},
otherkeywords={!,<=,>=,=>,!=},
aboveskip=3mm,
belowskip=3mm,
showstringspaces=false,
columns=flexible,
basicstyle={\small\ttfamily},
numbers=left,
numbersep=1pt,
numberstyle=\small\ttfamily\color{numbercolor},
keywordstyle=\color{keywordcolor},
commentstyle=\color{commentcolor},
stringstyle=\color{stringcolor},
frame=none,
breaklines=true,
breakatwhitespace=true,
tabsize=2
}
% bibtex
\bibliographystyle{plaindin} % BibTeX style according to DIN 1505
\include{glossaries}
\titlehead{
\includegraphics{hpi_logo_cmyk_wb_sl2}
}
\subject{Bachelor Thesis}
\title{
Analysis and Simplification of\\Business Graphs
\\ \bigskip
\large{Analyse und Vereinfachung von Unternehmensgraphen}
}
\author{Milan Gruner\\\small{\href{mailto:milangruner@gmail.com}{\nolinkurl{milangruner@gmail.com}}}}
\date{21.07.17}
\publishers{
Information Systems Group\\
Hasso Plattner Institute\\
~\\
\textbf{Supervisors}\\
Prof. Dr. Felix Naumann\\
Michael Loster\\
Toni Grütze
}
\pagestyle{headings} % page style with section headings in the header
\begin{document}
\maketitle % generate title page
%\clearpage % new page
\section*{Abstract}
{ \large
Constructing a graph made up of millions of businesses may be hard, but actually making sense of it is a lot harder.
With huge amounts of data potentially being integrated into the data lake every day, automatic methods for finding interesting areas in the graph are needed.
This paper discusses different approaches that can be taken to extract useful knowledge from such a graph.
First, the bachelor project "Ingestion" and the technologies used
for implementing the algorithms in this paper will be introduced.
Then, the data structures that form the basis of all data contained in the
data lake will be detailed, followed by an introduction to the graph processing
framework that is used and a description of the algorithm that constructs a graph from the
previously discussed data structures.
Afterwards, an algorithm to extract company group structures from the graph
is discussed, as well as a method to find structural patterns. These are based on
connected component analysis and motif search and serve as examples of
how to extract specific features from the graph.
Following that, an automated process for simplifying the graph by applying PageRank
and thresholding algorithms is introduced, along with methods for customizing the set
of extracted companies and for increasing the degree of interconnectedness.\\
Finally, the discussed algorithms are evaluated, conclusions are drawn
and an outlook on further possible improvements is provided.
}
\clearpage
\section*{Zusammenfassung}
{ \large
Einen Graph aus Millionen von Unternehmen zu konstruieren mag einige Probleme
bereiten, aber tats\"achlich Erkenntnisse daraus zu extrahieren ist um einiges schwieriger.
W\"ahrend potentiell riesige Datenmengen jeden Tag in die Datenbasis integriert werden,
sind automatisierte Methoden um interessante Bereiche des Graphen vonn\"oten.
Diese Arbeit diskutiert verschiedene Ans\"atze, um hilfreiches Wissen aus einem solchen Graphen zu extrahieren.
Zuerst wird das Bachelorprojekt "Ingestion" und die Technologien zur Implementierung
der Algorithmen in dieser Arbeit vorgestellt. Danach werden die Datenstrukturen, die die Grundlage der
Datenbasis darstellen, im Detail beschrieben, gefolgt von einer Einf\"uhrung in das benutzte
Graphprozessierungs-Framework und einem Algorithmus zur Konstruktion einer Graph-Datenstruktur
aus den bereits diskutierten Datenstrukturen.
Im folgenden wird auf einen Algorithmus zur Extraktion von Konzernstrukturen aus dem Graph eingegangen,
sowie auf eine Methode, um strukturelle Muster zu finden. Diese funktionieren
mithilfe von Zusammenhangskomponenten-Analyse und Motif-Suche und sind als
Beispiele zur Extraktion auf Basis von strukturellen Eigenschaften dargelegt.
Nachfolgend wird ein automatisierter Prozess zur Graphvereinfachung durch die
Anwendung von PageRank- und Schwellwertbildungsalgorithmen vorgestellt, sowie
Methoden, um Einfluss auf die Auswahl der extrahierten Unternehmen zu nehmen und um die Verbundenheit
des resultierenden Graphen zu erh\"ohen.\\
Abschlie{\ss}end werden die diskutierten Algorithmen evaluiert
und es wird ein Ausblick auf zuk\"unftig m\"ogliche Verbesserungen gegeben.
}
\clearpage
\tableofcontents
\pagebreak
\section{The German Corporate Graph Project}
\subsection[Why a German Corporate Graph?]{Why a German Corporate Graph?\footnote{This and the next subsection were authored by Matthias Radscheit \cite{loeperradscheit}.}}
When it comes to economic decisions, uncertainty is a critical issue. Following the rational choice theory approach, every market player is constantly trying to maximize his utility and minimize his effort.
Uncertainty can be described as a lack of information about how a market - or, herein, the full German economic system - is constituted and about the future behavior of the market players. Presuming that every market player is acting on a rational basis, all information regarding his situation, resources, plans, and relations makes the results of his decisions more predictable. In this manner, we can state: The more relevant information a market player gathers about other players in the market or economy, the better the foundation of his decisions is. The broad range of that kind of information can lead to a significant competitive advantage. So it should be in a rational player's interest to collect as much relevant information as possible.
In a connected economy, a lot of those uncertainties lie in the relations between corporations\footnote{We define as \emph{corporation} any juristic entity that takes part in the German economy. This includes especially businesses but also other entities like public corporations.}. This became evident in the so-called \emph{Abgas-Skandal} or \emph{Dieselgate} of the Volkswagen AG in 2015, wherein a lot of external suppliers spun out of control, from a financial perspective \cite{automobilwoche,stuttzeit}. This happened although most of the suppliers did not take part in the scandal itself. Since there are a lot of other examples like the \emph{Lehman Brothers bankruptcy} or any other economic shock event, we can state that relations are a significant factor in the economic evaluation of corporations and their financial risks.
Because there are millions of corporations in the German economy\footnote{ The \emph{Federal Bureau of Statistics} notes 3,469,039 businesses in Germany in 2015 \cite{destatis1}. Following our definition of corporations, this number has to be seen as a lower bound for the total number of corporations in Germany.} and each corporation can potentially hold relations to hundreds or thousands of other corporations, collecting and overseeing all those relations becomes a complicated matter.
The \emph{German Corporate Graph Project} is one approach to solve this problem. The project's purpose is to extract business entities from multiple structured knowledge bases (e.g. Wikidata and DBpedia), merge them, enrich them with relations extracted from unstructured documents and finally display the graph so that it can be visually explored.
The project consists of a pipeline, which starts with the import and normalization of structured knowledge bases. The next step is Deduplication: the detection and fusion of occurrences of the same entity across multiple knowledge bases. These entities form a graph whose nodes are businesses and whose edges are the relations between them. This graph is then enriched during Information Extraction. In this step, relations between entities are extracted from unstructured documents using Named Entity Recognition, Entity Linking and Relation Extraction.
The results of all these steps can be viewed and curated in the so-called Curation Interface. This is a web interface which can be used to control the pipeline itself, to view statistical data generated by other pipeline steps, and to view and curate the entities and relations of the graph. The final graph can be visually explored by using the Corporate Landscape Explorer, which is also a web interface.
\subsection{One project - seven contributions}
This thesis is published as a part of a bachelor's project in 2016/2017 at Hasso-Plattner-Institute in Potsdam, Germany. The project's objective was to build the \emph{German Corporate Graph}, as described above, for Germany's corporate landscape. The project lasted ten months and was accompanied by Commerzbank AG, Germany. As part of the process, the project participants published several theses.
\medskip
See here a list of all published theses within the project's context:
\begin{itemize}
\item Pabst explores \emph{Efficient Blocking Strategies on Business Data} \cite{pabst}.
\item Löper and Radscheit evaluate duplicate detection in their thesis \emph{Evaluation of Duplicate Detection in the Domain of German Businesses} \cite{loeperradscheit}.
\item Schneider's thesis is entitled \emph{Evaluation of Business Relation Extraction Methods from Text} \cite{schneider}.
\item Janetzki investigates \emph{Feature Extraction for Business Entity Linking in Newspaper Articles} \cite{janetzki}.
\item Ehmüller explores the \emph{Evaluation of Entity Linking Models on Business Data} \cite{ehmueller}.
\item \emph{Analysis and Simplification of Business Graphs} is the title of this paper.
\item Strelow investigates \emph{Distributed Business Relations in Apache Cassandra} \cite{strelow}.
\end{itemize}
\subsection{The Ingestion and Curation projects}
This paper describes the basic approaches that were taken for analyzing the
graphs generated as part of the Ingestion bachelor project at HPI Potsdam.
It deals with the algorithms and approaches that were employed to extract
information from an automatically generated graph of tens of millions of nodes
that is large, sparse and mostly unconnected.
This graph contains businesses, places and persons and can be augmented freely by adding
additional data sources or manual data input. The data can be visualized and controlled from the
Curation interface, a project developed alongside the Ingestion pipeline.
Curation generates graphs on the fly that contain the statistics data written
as a side effect of the numerous Spark jobs in the Ingestion pipeline, giving
insight into the contents of the "data lake".
\subsection{Technologies used in this paper}
All algorithms in this paper were executed on the Apache Spark platform which
controls the parallelization and data flow of the functional steps described
in the code samples. The Scala programming language is used because Spark is
written in it and because it allows for a more terse representation of the
algorithms than the respective Java or Python implementations, shortening
the code listings and making them easier to understand. It is also the language
the rest of the Ingestion project is written in, resulting in better interoperability
and facilitating code exchange and reuse.
Most of this paper deals with graph-related problems, so the GraphFrames framework
(in Scala) will be used to describe the algorithms that were employed;
the code can also be read as functional pseudocode and adapted to other popular
graph processing frameworks (such as GraphX, Giraph etc.).
One of the purposes of the Ingestion and Curation projects was to apply
modern text mining, deduplication and parallel data analysis techniques
to the German business landscape. This means that some of the examples will be
in German (they will be translated where needed for understanding the technique).
\subsection{Usage Scenarios of Ingestion and Curation}
Classic risk analysis, which is still predominantly employed in today's banks, is mostly
a model-oriented process that relies on predictions; it can be augmented by data-driven approaches.
The techniques that Ingestion, Curation and this paper offer are universally
applicable for analyzing huge amounts of structured data, such as Wikidata, JSON or CSV files,
as well as unstructured data, such as newspaper articles or email messages.\\
The results of the algorithms contained in these projects can either be viewed as a graph or a relational model
using the conversion algorithms briefly discussed later.
Because this project consists of automatic processes
that control the flow, categorization and extraction
of all the data contained in the data lake, it may be effectively deployed in a bank
or similar institution as a data collection, curation and analysis software stack.\\
It is meant to be heavily extensible and allows the simple integration of new data sources % TODO really possible in the end?
as well as of algorithms for further data processing and enrichment.
The pipeline itself is very modular and is composed of many stand-alone
Spark jobs that transform the data in the Cassandra database.
%\pagebreak
\section{Data Structures for Business Entities}
\subsection{The subject data structure}
\label{sec:subject_structure}
The central data structure that exists inside the data lake is called $subject$.
It contains all the final businesses, industry sectors, countries or persons
that are the result of the computations in the Ingestion project.\\
This is what it looks like in Scala:
\begin{lstlisting}[style=scalaStyle,caption=Subject]
case class Subject(
    var id: UUID = UUID.randomUUID(),
    var master: UUID,
    var datasource: String,
    var name: Option[String] = None,
    var aliases: List[String] = Nil,
    var category: Option[String] = None,
    var properties: Map[String, List[String]] = Map(),
    var relations: Map[UUID, Map[String, String]] = Map(),
    var master_history: List[Version] = Nil,
    var name_history: List[Version] = Nil,
    var aliases_history: List[Version] = Nil,
    var category_history: List[Version] = Nil,
    var properties_history: Map[String, List[Version]] = Map(),
    var relations_history: Map[UUID, Map[String, List[Version]]] = Map()
)
\end{lstlisting}
The first three attributes ($id$, $master$, $datasource$) constitute the
primary key in the Cassandra databases.
The partitioning key that is used to decide on which Cassandra instance to store
a given $subject$ entry is the $master$ ID, which makes sure that a master node
and all its data source slave nodes are stored on the same machine.
$id$ and $datasource$ are used as clustering keys, which control the sorting
inside the data partitions produced by database queries. This also makes it
efficient to query specific IDs or data source entries for a given
master ID.
When loading Cassandra data into a Spark job, simple fields (e.g. Strings)
need to be wrapped in an Option because the field may contain no value
(otherwise an error would occur). This is not necessary for the fields
containing Maps or Lists, which simply contain an empty data structure
in that case.
The $properties$ field contains all the extracted data from the various
data source extraction and normalization steps in the Ingestion pipeline.
It is structured to make storing arbitrary data possible by allowing
a List of Strings to be stored for each property key. This way, a list of
email addresses can be stored just like a single postal code, and both can
be treated the same way.
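As an illustration, the $properties$ map of a single company might look like this (the property keys and values are invented for this example):
\begin{lstlisting}[style=scalaStyle,caption=Example properties map]
// illustrative example; the property keys and values are invented
val properties: Map[String, List[String]] = Map(
    "postal_code" -> List("14482"),
    "email" -> List("info@example.com", "sales@example.com")
)
\end{lstlisting}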
The $relations$ field is structured as a two-dimensional map. The first level
is addressed by the target $subject$ ID, while the second-level keys are the
specific relation types. The value that is returned when the field is queried
by the target ID and relation type is the relation attribute, which is mostly
used to store a confidence measure (a value from 0 to 1) or the count of the
relation type between the current $subject$ and the target.
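The following snippet illustrates this structure with invented values and shows how the confidence of a single relation can be read:
\begin{lstlisting}[style=scalaStyle,caption=Example relations map]
import java.util.UUID

// illustrative example; the target ID and the confidence value are invented
val target = UUID.randomUUID()
val relations: Map[UUID, Map[String, String]] = Map(
    target -> Map("owns" -> "0.85")
)
// read the confidence of the "owns" relation towards the target subject
val confidence: Option[Double] =
    relations.get(target).flatMap(_.get("owns")).map(_.toDouble)
\end{lstlisting}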
\subsection{A versioning scheme that stands the test of time}
Versioning is needed because the automated system includes various pipelines
that write to many different tables in Cassandra during their runtime,
while in the end all data is expected to flow back into the central
'data lake' that is the $subject$ table, described earlier in this chapter.
This means that fully automated processes may and will write millions of
data entries to the data lake on an hourly basis (or even more frequently),
so if any breaking changes are introduced by them, they need to be rolled back quickly.
Therefore, each version isn't saved as a difference like in version control systems
(e.g. git); instead, every single version has a copy of all the attributes,
a timestamp and the name and version of the program that modified it.
As our whole pipeline is made up of individual small Spark jobs that mostly
write into their own Cassandra tables, an error in the data can be quickly
located and traced to the corresponding source code.
Because our data domain includes events, attributes and relations that have
temporal validity parameters attached to them, we had to devise a versioning
scheme that is compatible with the column-family oriented storage of Cassandra.
As everything is distributed across the cluster, difference-based versioning
would be hard to maintain, so our team decided on a deep-copy
backup scheme for our versions.
The version data structures are stored for each attribute and each individual
key of the $subject$ entries, which makes restoring individual columns or single
data entries efficient and easily parallelizable.
The version data structure that was used is structured like this:
\begin{lstlisting}[style=scalaStyle,caption=Version]
case class Version(
    version: UUID = UUIDs.timeBased(),
    var program: String,
    var value: List[String] = Nil,
    var validity: Map[String, String] = Map(),
    var datasources: List[String] = Nil,
    var timestamp: Date = new Date()
)
\end{lstlisting}
Version entries are stored inside the collections of the $history$ fields of the $subject$
data structure and contain useful meta information for restoring old versions,
modeling temporal validity (attributes only being valid for a certain timespan)
or filtering out data from certain data sources. They also contain the name of the Spark
program that created the relevant changes and the time they were made.
Additionally, the $version$ table contains a list of all changes to the data lake,
making it traceable which programs were run at what time, which table they wrote
to and what data sources they processed. Using this, the Curation interface
can display the annotated version list, and run Spark jobs that restore the
whole data lake to a previous version or compute the changes that a version
made in the $subject$ table.
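As a minimal sketch of how these per-field histories can be used (the helper below is illustrative and not part of the Ingestion code base), the name of a $subject$ at a given point in time can be reconstructed from its $name\_history$:
\begin{lstlisting}[style=scalaStyle,caption=Sketch of restoring a field from its history]
import java.util.Date

// illustrative helper, not part of the Ingestion code base:
// pick the last name value that was written before the given point in time
def nameAt(subject: Subject, pointInTime: Date): Option[String] = {
    subject.name_history
        .filter(version => !version.timestamp.after(pointInTime))
        .sortBy(_.timestamp)
        .lastOption
        .flatMap(_.value.headOption)
}
\end{lstlisting}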
More detail on the design of the $subject$ data structure and the versioning
scheme is discussed in Strelow \cite{strelow}.
\pagebreak
\section{Using Spark and GraphFrames for Graph Analysis}
This chapter will briefly discuss the reasons for using the
Apache Spark\footnote{\url{https://spark.apache.org}}
and GraphFrames\footnote{\url{https://graphframes.github.io}}
frameworks alongside the
Apache Cassandra database\footnote{\url{https://cassandra.apache.org}}.
Then the transformations that are necessary to extract GraphFrames-compatible
data from the subject data structure will be detailed.
The following chapters will build on the techniques described in this one, so
that only the relevant graph processing sections need to be shown later on.
\subsection{Why Spark, Cassandra and GraphFrames?}
The subject table contains millions of entries. In order to efficiently extract
a graph and calculate the results of complex graph algorithms on data of this magnitude,
a parallel processing framework like Apache Spark is essential if runtimes
of multiple hours (or even days) are to be avoided.
Cassandra is a database that is up to the challenge of delivering the input for and storing the results of
these computations without the complicated serialization that storing them
in CSV or binary files (e.g. the Parquet format commonly used for storing Spark's data frames) would involve.
It is also well integrated into the Spark API and can handle complex nested
data structures such as the properties and relations attributes of the $subject$
data structure discussed in chapter \ref{sec:subject_structure}.
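As a brief illustration of this integration, the $subject$ table can be loaded into a Spark job using the Spark Cassandra Connector roughly as follows (the keyspace and table names are placeholders, $sc$ is the SparkContext):
\begin{lstlisting}[style=scalaStyle,caption=Sketch of loading subjects from Cassandra]
import com.datastax.spark.connector._

// keyspace and table names are placeholders
val subjects = sc.cassandraTable[Subject]("datalake", "subject")
\end{lstlisting}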
GraphFrames is a really useful tool for interacting with huge amounts of graph
data. The advantages in comparison to the GraphX framework (included in Spark) are that arbitrary data types
can be used for IDs, instead of just $long$ values, and that the whole data frames
are always usable, instead of just a single vertex attribute that is overwritten
by algorithms executed on the graph. This is necessary when the algorithms need to
evaluate certain attributes like the size, industry sector or geographic location of the companies.
GraphFrames also comes with handy implementations
of common graph algorithms like breadth-first search, PageRank and motif search
that will be heavily used in the following chapters.
\subsection{Constructing a GraphFrame from subject data}
\label{sec:graphframe_construction}
The core class of the GraphFrame API is the class GraphFrame itself, which
is essentially a combination of a vertex and a relation data frame as well as a
collection of operations which can be performed on them. It is a useful tool for
easily handling the graph data structure and interacting with it in a functional
manner whilst still being able to interact with the relational API of Spark's data frames.
The case class entries that are queried from Cassandra are transformed into vertex
and relation data using the Spark RDD API. The results are then translated into
data frames, which GraphFrames expects when creating a new graph instance.\\
The $extractGraph$ function from $GraphExtractor$ implements this process:
\begin{lstlisting}[style=scalaStyle,caption=extractGraph in $GraphExtractor$]
val subjectVertices = subjects
    .map(subject => (subject.id.toString, subject.name))
    .toDF("id", "name")
val subjectRelations = subjects.flatMap(subject => {
    subject.masterRelations.flatMap { case (id, relTypes) =>
        relTypes.map { case (relType, value) =>
            (subject.id.toString, id.toString, relType, value)
        }
    }
}).toDF("src", "dst", "relationship", "value")
GraphFrame(subjectVertices, subjectRelations)
\end{lstlisting}
If entries from the $properties$ map are needed (e.g. for calculating the custom weights in
chapter \ref{sec:custom_weights}), they need to be manually extracted
and added as additional columns, as shown in listing \ref{lst:custom_weights}.
They can then be used as vertex attributes.
\begin{lstlisting}[style=scalaStyle,caption=Custom weight extraction,label=lst:custom_weights]
val subjectVertices = subjects
    .map(subject => {
        // default to a weight of 1.0 when the graph_weight property is missing
        val weight = subject.properties.getOrElse("graph_weight", Nil)
            .headOption.getOrElse("1.0").toDouble
        (subject.id.toString, subject.name, weight)
    }).toDF("id", "name", "weight")
\end{lstlisting}
The $flatMap$ and $map$ statements flatten the nested structure of the $relations$ attribute,
transforming it into multiple tuples of source node, target node, relation type, and relation value.
The resulting columns are then named in the scheme that the $GraphFrame$ constructor expects.
The result is a GraphFrame instance containing the company and relation data of all $subject$
entries in the database that will be used to execute the algorithms in the following chapters.
\pagebreak
\section{Approaches for Subgraph Extraction}
\label{sec:subgraph_extraction}
This chapter deals with various graph processing algorithms developed using the
GraphFrames framework that extract useful subsets of the full graph in the data lake.
These are some examples of simple graph processing that lay the cornerstone
for the remaining graph evaluation techniques discussed in later chapters.
\subsection{Finding company groups in the business graph}
\label{sec:company_groups}
As a first experiment with real-world applications, company group extraction
was a good starting algorithm that could be implemented using the connected component
tagging implementation in the GraphFrames
API\footnote{\url{https://graphframes.github.io/api/scala/index.html\#org.graphframes.lib.ConnectedComponents}}.
Here, all businesses that are connected via partial or full ownership
relations are considered a company group.
Before the ConnectedComponents algorithm is run, all ownership-related company
relations are extracted from the $subject$ graph. This is done by whitelisting
all relations (in $ownershipRelations$) that concern this particular subgraph,
e.g. "owns" and "ownedBy".
Then, after executing the search, all entries are grouped by the ID of the
associated component and their names and UUIDs are extracted and saved to Cassandra
as an entry of the $graphs$ table. The following code snippet describes how
this process is expressed using GraphFrames:
\begin{lstlisting}[style=scalaStyle,caption=Company Group Extraction]
val ownershipEdges = graph.edges.filter(edge =>
    ownershipRelations.contains(edge.getAs[String]("relationship")))
val ownershipGraph = GraphFrame(graph.vertices, ownershipEdges)
ownershipGraph.connectedComponents.run()
    .select("id", "component", "name")
    .rdd
    .map(row => (
        row.getAs[Long]("component"),
        UUID.fromString(row.getAs[String]("id")),
        row.getAs[String]("name")))
    .groupBy(_._1) // component
    .map(_._2.map(t => (t._2, t._3)).toList) // extract id and name
    .filter(_.length >= minComponentSize)
    .map(idNameTuples => ResultGraph(
        outputGraphType,
        idNameTuples.map(_._1),
        idNameTuples.filter(_._2 != null).map(_._2)))
\end{lstlisting}
As a result we get a list of businesses like the following:
\begin{itemize}[noitemsep]
\item Oracle
\item Oracle America
\item Sun Microsystems
\item Hyperion Solutions
\item Micros Systems
\item Oracle Financial Services Software
\item PeopleSoft
\item Acme Packet
\item Art Technology Group
\end{itemize}
The corresponding IDs are also stored alongside the names and the graph type (here: "CompanyGroup"). This makes it easier to analyze
the results without joining all subject names and IDs to this dataset. It also facilitates displaying
a graph list in the Curation interface, because only the content of one table needs to be fetched for it.
\subsection{Using motif finding to find structural patterns}
\label{sec:motif_finding}
Motif finding is an approach to search for specific relation structures between
vertices in the graph. On a GraphFrame, it is expressed using strings in a
domain-specific format, not unlike the Cypher query language that is used for
querying the Neo4j database, and executed using the $find$ method of the
GraphFrame\footnote{\url{https://graphframes.github.io/api/scala/index.html\#org.graphframes.GraphFrame}}.
Vertices are selected using arbitrary names inside of parentheses, while edges are
expressed using square brackets, which have to be preceded by a dash and followed by
an arrow. Vertices and edges don't have to be named if they aren't needed for further processing.
A single definition consists of an edge between two nodes, and these constructions are
separated from each other by semicolons.
The resulting data frame that is returned by the $find$ method contains a column
for each of the assigned names, which contain $StructType$ instances that have the original
data of the corresponding vertices or relations inside of them
(in the same format as the data frames in chapter \ref{sec:graphframe_construction}).
As an example, the following expression selects chains of three vertices $x$, $y$, $z$:\\
{\small\ttfamily "(x)-[e1]->(y); (y)-[e2]->(z)"}
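A minimal sketch of running this query and working with the resulting columns could look as follows (the filtered relation type is only an example value):
\begin{lstlisting}[style=scalaStyle,caption=Sketch of a motif query execution]
// run the chain motif and keep only chains whose first edge is an
// ownership relation (the relation type is an example value)
val chains = graph
    .find("(x)-[e1]->(y); (y)-[e2]->(z)")
    .filter("e1.relationship = 'ownedBy'")
// access the vertex columns, which contain the original vertex data
val companyNames = chains.select("x.name", "y.name", "z.name")
\end{lstlisting}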
Because the queries are expressed using strings, they can be automatically constructed.
This is necessary because each query can only search for a fixed number of
vertices and relations, while most relevant graph structures can occur
at any size.\\
Listing \ref{lst:star_motif} details how to automatically construct queries for
stars of varying sizes:
\begin{lstlisting}[style=scalaStyle,caption=Star Motif Query Creation,label=lst:star_motif]
('a' to 'z')
    .take(starSize)
    .map(char => "("+char+")-[]->(center)")
    .mkString("; ")
\end{lstlisting}
This approach is very flexible and can be used to generate many types of patterns,
like chains or cycles of varying lengths. It is executed (with the associated $find$ call)
inside a counting loop (incrementing $starSize$).
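A minimal sketch of this counting loop, combining the query construction from listing \ref{lst:star_motif} with the associated $find$ call (the size bounds are assumed parameters), could look like this:
\begin{lstlisting}[style=scalaStyle,caption=Sketch of the star motif search loop]
// minStarSize and maxStarSize are assumed parameters of the job
val starResults = (minStarSize to maxStarSize).map { starSize =>
    val query = ('a' to 'z')
        .take(starSize)
        .map(char => "("+char+")-[]->(center)")
        .mkString("; ")
    (starSize, graph.find(query))
}
\end{lstlisting}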
After this motif search iteration has been run,
custom conditions can be evaluated on the resulting data frame to limit
the number of returned patterns. For instance, the results can be filtered to only keep patterns
containing companies with certain attribute values, only certain relation types, or at least
one relation of a specific type, resulting in more fine-grained control
over the extracted patterns.
In the end, the final vertices and relations are stored in the $graphs$ table in the same
fashion as the company groups in the previous section.
%\pagebreak
\section{Graph Simplification using PageRank and Thresholding}
\label{sec:graph_simplification}
The PageRank algorithm by Page et al. \cite{pagerank1999} was originally intended
to compute the likelihood that a user might arrive at a web page
in a network of interconnected documents, to objectively quantify their "importance" in this network.
It can, however, be adapted to most
graph structures and also finds an application when dealing with business graphs,
as the navigation between businesses over their outgoing relations is quite
similar to the navigation between web pages over their hyperlinks.
Therefore, the vertex output values ($pagerank$) of the algorithm can be interpreted as the
probability that a user might arrive at a certain business when freely
navigating the graph, starting at any connected node.
The relation output values ($weight$) can be seen as the likelihood that the user
travels along this relation to the next company.
\subsection{Using PageRank to find relevant companies}
The PageRank implementation found in the GraphFrames
library\footnote{\url{https://graphframes.github.io/api/scala/index.html\#org.graphframes.lib.PageRank}}
is used for parallel execution of this algorithm.
There are two different ways to call it on a given graph: it can either
be executed for a fixed number of iterations, or it can be run until convergence
is reached within a certain tolerance, which keeps the algorithm from running endlessly.
The former is a lot faster, because it does not repeat until each iteration produces
only a small change in every node, but the number of iterations
has to be manually adapted to the graph size and its degree of interconnectedness;
otherwise the results differ too much from the converged values,
introducing a high chance of errors. Here, a trade-off between execution
speed and resulting graph quality has to be made.
By running until convergence, the only parameter that has to be controlled is the tolerance threshold,
which is independent of the input graph because it only depends on the accuracy that is desired,
as the algorithm continues executing until all $pagerank$ value changes
are under this threshold.
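Both variants are available through the GraphFrames PageRank API; the following minimal sketch shows them side by side (the reset probability, iteration count and tolerance are illustrative values):
\begin{lstlisting}[style=scalaStyle,caption=Sketch of the two PageRank invocation variants]
// fixed number of iterations (faster, but the parameter is graph-dependent)
val fixedIterations = graph.pageRank
    .resetProbability(0.15)
    .maxIter(20)
    .run()
// run until convergence within the given tolerance
val untilConvergence = graph.pageRank
    .resetProbability(0.15)
    .tol(0.001)
    .run()
// both return a GraphFrame with a "pagerank" vertex column
// and a "weight" edge column
\end{lstlisting}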
After the PageRank algorithm is executed, the results still have to be interpreted,
meaning that the nodes have to be separated into "important" and "unimportant" groups.
This can be done using thresholding techniques which can be parallelized
over the connected components of the business graph.
First, the individual $pagerank$ values assigned to every node have to be normalized
over the whole connected component to limit them to the interval $[0, 1]$,
so that the following thresholding step can work with a constant maximum value.
This is done by finding the maximum of this
value in the component and dividing every node's value by it.
Then, a threshold value has to be picked for the component. This can either be done
using an absolute threshold that has to be determined for the whole graph, or
using an adaptive thresholding technique like Otsu's method \cite{otsu1979threshold},
which is normally used for grayscale image thresholding, but is equally applicable
here as it works on all one-dimensional histograms.
It is particularly useful in this case because it removes the reliance
on the degree of interconnectedness in the components and their size,
and finds a good threshold for all connected component sizes that separates
two value classes ("important" and "unimportant") from each other.
It works by creating a histogram of the absolute frequency of
the $pagerank$ values present in the component, grouped into bins.
Then, the values are iteratively separated into two classes while trying to
minimize the variance inside these classes.\\
In the end, a threshold is picked that can be used to decide whether or not to
include a node in the important group based on its $pagerank$ value.
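The following is a minimal sketch of Otsu's method on the normalized $pagerank$ values of one connected component (the helper is illustrative and not part of the Ingestion code base); it maximizes the between-class variance, which is equivalent to minimizing the variance inside the two classes:
\begin{lstlisting}[style=scalaStyle,caption=Sketch of Otsu's method on normalized PageRank values]
// illustrative helper, not part of the Ingestion code base;
// expects values normalized to [0, 1] and returns a threshold in [0, 1]
def otsuThreshold(values: Seq[Double], bins: Int = 256): Double = {
    val histogram = new Array[Long](bins)
    values.foreach { value =>
        val bin = math.min((value * (bins - 1)).toInt, bins - 1)
        histogram(bin) += 1
    }
    val total = values.size.toDouble
    val sumAll = histogram.zipWithIndex
        .map { case (count, bin) => count * bin.toDouble }.sum
    var sumBackground = 0.0
    var weightBackground = 0.0
    var bestVariance = -1.0
    var bestBin = 0
    for (bin <- 0 until bins) {
        weightBackground += histogram(bin)
        val weightForeground = total - weightBackground
        if (weightBackground > 0 && weightForeground > 0) {
            sumBackground += bin * histogram(bin).toDouble
            val meanBackground = sumBackground / weightBackground
            val meanForeground = (sumAll - sumBackground) / weightForeground
            // maximizing the between-class variance is equivalent to
            // minimizing the variance inside the two classes
            val betweenVariance = weightBackground * weightForeground *
                math.pow(meanBackground - meanForeground, 2)
            if (betweenVariance > bestVariance) {
                bestVariance = betweenVariance
                bestBin = bin
            }
        }
    }
    bestBin.toDouble / (bins - 1)
}
\end{lstlisting}
Nodes whose normalized $pagerank$ value lies above the returned threshold would then be assigned to the "important" group.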
The same can be done for the relations in the current connected component.
By only showing the edges that the user is likely to use, the graph can be greatly
simplified, making use of the weights the PageRank algorithm assigns to the
relations present in the graph. This approach is useful when the input graph
contains a great number of edges that are not desired in the output, but can,
depending on the chosen threshold, lead to a very segmented and unconnected result graph.
\pagebreak
\subsection{Adding custom weights to influence company extraction}
\label{sec:custom_weights}
In many cases, a fully automated extraction process can be undesirable because
it leads to a loss of control over which companies are chosen to be part of the
output. To introduce custom weights into the process, more data can be joined and multiplied
into the results of the PageRank algorithm before applying the thresholding process.
For each company that is to be modified, its UUID ($id$) and a custom floating point
value ($weight$) need to be present in the new data frame ($weightFrame$).
This data can be acquired by manually adding attributes (e.g. $graph\_weight$)
into the concerned companies' properties in the Curation interface, or by logging and counting
every manual change in the Curation interface and using these counts as an offset for the
PageRank values (in that case they would have to be added rather than multiplied in).
The weight values can also be calculated by a function that measures the existence
of relevant attributes in the $properties$ structure of the companies, or by
using another dataset like a blacklist of companies that should never be included.
Then, the vertex data frame containing the PageRank output and the new
data frame containing the custom weights can be joined and multiplied like this:
\begin{lstlisting}[style=scalaStyle,caption=Custom Weight Joining]
import org.apache.spark.sql.functions.col

val weightedVertices = rankGraph.vertices
    .join(weightFrame, List("id"))
    .select(col("id"), (col("pagerank") * col("weight")).alias("pagerank"))
\end{lstlisting}
After applying this step and the thresholding pass, companies that had a weight of $0.0$ assigned won't
show up in the resulting graph, while companies with a weight greater than $1.0$
are more likely to be part of it.
\subsection{Finding paths between relevant companies}
The companies extracted in the previous sections are a small subset of the actual
graph. This means that most of them won't share any relations, resulting in a
mostly unconnected graph. This of course isn't desirable, so the graph needs to
be enriched by additional edges.
These edges however won't exist in the original graph, so they have to be generated
by searching for paths between the extracted nodes. This can be done by using
GraphFrames' breadth-first search (BFS)
implementation\footnote{\url{https://graphframes.github.io/api/scala/index.html\#org.graphframes.lib.BFS}},
which is able to use arbitrary attribute selectors for the source and target nodes.
For this algorithm, either all vertices are selected or only ones with few relations,
meaning small degrees, which can be determined by joining $graph.degrees$ into the vertex data frame beforehand.
The search can be limited by a maximum path length, which should be set to three or four so that
overly complicated paths, which wouldn't make for useful relations in a simplified graph, are not included.
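As a minimal sketch (the selector expressions and the graph variable are placeholders), a call to the BFS implementation could look like this:
\begin{lstlisting}[style=scalaStyle,caption=Sketch of a breadth-first search between extracted companies]
// placeholder expressions; in practice the extracted companies are selected
// by their attributes, e.g. their id or degree
val paths = graph.bfs
    .fromExpr("id = 'source-uuid'")
    .toExpr("id = 'target-uuid'")
    .maxPathLength(4)
    .run()
\end{lstlisting}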
After executing the BFS algorithm, the output needs to be processed. For instance,
the resulting data frame can be filtered to only contain a single
relation type, or the weights of all contained edges can be summed up and thresholded
as previously discussed. Also, all paths of length zero and one have to
be filtered out because they don't provide any meaningful information.
Then, a description has to be chosen for the resulting relations.
A simple approach would be to concatenate the company names along the path
(without the source and target), or to use the relation types with ellipses (...) to
make it evident that the concerned relationship is not present in the source graph.
Afterwards, the $to$ and $from$ columns (which contain the source and target vertices)
as well as the constructed relation description are extracted and concatenated with the thresholded
relations from the PageRank output, and the final vertices and relations are written
to the database.
%\pagebreak
\section{Conclusion and Evaluation}
Using the subgraph extraction algorithms from chapter \ref{sec:subgraph_extraction},
interesting areas in the graph can be highlighted and extracted for manual inspection
or further analysis. The company group graphs are a helpful tool to understand the intricate
ownership network of the global business landscape and may also be used as a means to determine
the parent companies of customers, competitors or business partners.
With the motif finding algorithm, long chains of specific relation types like
trade relations may be found, which are important to know about in case one of
the involved businesses is unable to deliver its goods. It can also find star
structures or cycles in the graph, which are normally hard to detect by manual
search.
The graph simplification algorithm described in chapter \ref{sec:graph_simplification}
is useful for getting a better understanding of the structures present in the subject graph.
It can also be used to provide users with a basic navigation structure as the
graph gains more relations and its connectedness increases. This graph might
be preloaded in the Corporate Landscape Explorer developed during the previous
bachelor's project to show a base skeleton of companies that includes their most
important connections.
\subsection{Technical challenges}
The graph processing algorithms in this paper are to be seen as examples of
how the specific graph features could be extracted. They are in no way the most
efficient implementation (particularly because GraphFrames is quite slow in
processing the huge amounts of data contained in the data lake), but are already
quite useful for extracting different views of the subject graph that represent
specific features like contained company groups or patterns that provide a different
way to navigate the graph, instead of just traversing it using outgoing relations.
The inclusion of automatically extracted relations holds many challenges of its own, because
most algorithms in this paper are heavily dependent on the quality of the relations.
For instance, the company group extraction algorithm is extremely vulnerable
to the introduction of wrong ownership relations into the subject graph: because
it is based on finding connected components, a single wrong relation can cause two separate company
groups to be merged into one, producing a lot of wrong data.
\subsection{Outlook}
When more relation types become available in the data lake using the relationship extraction methods
contained in the text mining subproject in Ingestion, many more extraction algorithms
can be implemented, like the extraction of production chains of multiple companies
using supply and trade relations (utilizing motif search like in chapter \ref{sec:motif_finding})
or the extraction of competitor graphs (using connected components like in chapter \ref{sec:company_groups}).
Another issue is the performance of the discussed algorithms. To increase it,
they can be partially ported over to the GraphX framework integrated into Spark.
This is easily integrated into the current algorithms, because
GraphFrames offers conversion methods that translate its GraphFrame structure
into GraphX graphs and back. Because GraphX works on a much smaller data structure
that is tightly packed into a binary format, the data exchange over the
local network is faster and less data needs to be stored between stages.
Using this method, particularly slow sections of the algorithms can be sped up
considerably, depending on how much data is removed from the vertices before execution.
\clearpage
\sloppy
\bibliography{references}
\clearpage
\pagestyle{plain}
\section*{Statutory Declaration}
I declare that I have authored this thesis independently, that I have not used
any other than the declared resources, and that I have explicitly marked all
material which has been quoted either literally or by content from the used sources.
%Zitat Raschkowski
Potsdam, July 21, 2017
~\\
~\\
~\\
\begin{tabular}[t]{@{}l@{}}
\makebox[2.5in]{\dotfill}\\
\strut Milan Gruner \strut
\end{tabular}
\end{document}