Portfolio Project

SQL sees rows.
Graphs see risk.

A Mexican bank's loan book is a table. Each row is a borrower, an amount, a status. The rows don't know about each other. The borrowers do. They guarantee each other's loans, sit on each other's boards, own pieces of the same companies. This page walks through four risks that are invisible in the table and obvious in the graph.

Step 1: What the flat table reports

Dataset: 853 nodes, 726 edges, $200M MXN tracked, 4 structural findings. Live in Neo4J AuraDB.

This is the view a risk analyst opens every morning: 458 loans, a healthy headline NPL, five sectors. Everything an aggregate can tell you is here. What it can't tell you is who is connected to whom.

SQL View: What the analyst reports
458
Total loans
$200M
Exposure (MXN)
7.9%
NPL ratio
5
Sectors ("diversified")
loan_idclientbalancestatusscore
PRE-00313CLI-00010$87,200ACTIVO725.8
PRE-00127CLI-00011$132,511ACTIVO798.2
PRE-00254CLI-00012$132,511VENCIDO649.6
PRE-00087CLI-00020$50,922ACTIVO710.1
PRE-00308CLI-00021$9,200ACTIVO695.3
... 453 more rows

Looks like a healthy portfolio. One past-due loan in five shown. Nothing unusual.

but
Graph View: What the first 3 rows actually look like #313 ACTIVO score 725 Homero A. #127 ACTIVO GARANTIZA score 798 Manuel M. #254 VENCIDO GARANTIZA score 649 Cecilia G. 45 days past due

Homero has a score of 725 and looks fine. But he guarantees Manuel's loan. Manuel guarantees Cecilia's loan. Cecilia's loan is 45 days past due. Homero's score never saw this coming.

The 4 findings SQL can't see

Four questions a loan table cannot answer. Not because SQL is broken. Because the questions are about connections, and a flat table has no language for them.

01

The safe client who isn't safe

AIG parallel: guarantee chains on CDO tranches

Homero Aguayo has a bureau score of 725. Well above the threshold. His own loan is current, no past-due days. Every standard metric says he's fine.

The graph says: Homero guarantees Manuel's loan. Manuel guarantees Cecilia's loan. Cecilia's loan is 45 days past due.

If Cecilia can't pay, Manuel is liable. If Manuel can't cover it, Homero is liable. His credit score was answering a different question: one about Homero in isolation, not Homero in context.

In 2008, AIG's guarantee chains on CDO tranches worked exactly this way. The depth of exposure was unknowable until the underlying asset failed.

SQL: requires depth known at write time
hop count hardcoded when query is written
chain 3 hops long? this query doesn't exist
SELECT c.id_cliente, c.score_buro
FROM clientes c
JOIN garantias g1 ON c.id_cliente = g1.id_garante
JOIN prestamos p1 ON g1.id_prestamo = p1.id_prestamo
JOIN clientes c2 ON p1.id_titular = c2.id_cliente
JOIN garantias g2 ON c2.id_cliente = g2.id_garante
JOIN prestamos p2 ON g2.id_prestamo = p2.id_prestamo
WHERE p2.estatus = 'VENCIDO'
Cypher: depth is runtime
MATCH (a:ClienteIndividual {
  id_cliente: 'CLI-00010'
})-[:GARANTIZA]->(p1)
  <-[:TIENE_PRESTAMO]-(b)
  -[:GARANTIZA]->(p2)
  <-[:TIENE_PRESTAMO]-(c)
WHERE p2.estatus = 'VENCIDO'
RETURN
  a.score_buro AS guarantor_score,
  p2.saldo_vigente_mxn AS exposure,
  p2.dias_mora
Variable-depth traversal (*1..N) is native in Cypher. SQL requires one JOIN per hop, and the hop count must be decided when writing the query, before you know how deep the exposure actually goes.
$132,511 MXN hidden exposure two hops from a client scored 725. invisible in any loan table
02

Coverage that doesn't exist

AIG parallel: insured tranches AIG had also insured

Three borrowers guarantee each other's loans in a closed loop: A guarantees B's loan, B guarantees C's loan, C guarantees A's loan.

A risk system checking "does this loan have a guarantor?" returns YES for all three. They look fully covered.

But the coverage is fiction. If any one of them can't pay, the guarantee activates for someone already exposed to the same failure. Nobody outside the loop is backing anything.

This is a regulatory violation. CNBV Circular 3/2012 prohibits guarantee structures where coverage is fictitious due to circular dependency.

SQL: recursive CTE (breaks in practice)
postgres only. MySQL breaks here.
no cycle guard: stack overflow on real data
you still don't know what the exposure is
WITH RECURSIVE chain AS (
  SELECT g.id_garante,
         ARRAY[g.id_garante] AS visited
  FROM garantias g
  UNION ALL
  SELECT g2.id_garante,
         visited || g2.id_garante
  FROM garantias g2, chain
  WHERE NOT g2.id_garante = ANY(visited)
)
found the cycle. now what?
loan balances are in another table. start over.
Cypher: one pattern
MATCH (c1)-[:GARANTIZA]->(p1)
  <-[:TIENE_PRESTAMO]-(c2)
  -[:GARANTIZA]->(p2)
  <-[:TIENE_PRESTAMO]-(c3)
  -[:GARANTIZA]->(p3)
  <-[:TIENE_PRESTAMO]-(c1)
WHERE c1 <> c2
  AND c2 <> c3
  AND c1 <> c3
RETURN c1, c2, c3,
  p1.saldo_vigente_mxn,
  p2.saldo_vigente_mxn,
  p3.saldo_vigente_mxn
A cycle is a structural pattern in Cypher: the path ends at the same node it started. No visited-array tracking needed. SQL recursive CTEs approach this problem from the opposite direction and require explicit cycle-checking at each step.
CNBV Circular 3/2012 violation
$110,975 MXN reported as 100% guaranteed. actual external coverage: 0%
03

One company falls. Five people follow.

Bear Stearns parallel: most interconnected, not largest

HARI MASA SA DE CV is under restructuring. In the loan table, that's one row. In the graph, it's a hub.

Five shareholders have personal loans with the same bank. Two cross-guarantee each other's personal loans. If HARI MASA collapses, all five lose their equity stake simultaneously. The two cross-guarantors are doubly exposed.

SQL sees five independent borrowers. The graph sees one interconnected risk cluster where the company's failure is a single correlated event affecting all five at once.

Bear Stearns in 2008 wasn't the largest bank. It was the most interconnected. Size doesn't predict damage. Topology does.

SQL: 3 JOINs, static fan-out
table names baked in, not discoverable
"all connected entities" has no SQL equivalent
SELECT c.nombre, p.saldo_vigente_mxn
FROM empresas e
JOIN accionistas a
  ON e.id_empresa = a.id_empresa
JOIN clientes c
  ON a.id_cliente = c.id_cliente
LEFT JOIN prestamos p
  ON c.id_cliente = p.id_titular
LEFT JOIN garantias g
  ON c.id_cliente = g.id_garante
WHERE e.id_empresa = 'EMP-00031'
Cypher
MATCH (e:Empresa {
  id_empresa: 'EMP-00031'
})<-[:ES_ACCIONISTA_DE|ES_DIRECTOR_DE]-(c)
OPTIONAL MATCH (c)-[:TIENE_PRESTAMO]->(p)
OPTIONAL MATCH (c)-[:GARANTIZA]->(g)
RETURN
  c.nombre_completo,
  sum(p.saldo_vigente_mxn) AS personal,
  sum(g.saldo_vigente_mxn) AS guaranteed
OPTIONAL MATCH fan-out in Cypher naturally discovers all connected entities regardless of how many there are. SQL must specify relationship types and table names before execution. The traversal structure is static, not dynamic.
$4,954,280 MXN guarantee exposure correlated through one restructured company's 5 shareholders
04

Four companies that are really one risk

Lehman parallel: 7,000 entities, one control group

Emilio Maldonado appears in four separate loan files as a controlling shareholder. The portfolio report shows exposure spread across four companies in three sectors. It looks diversified.

The graph shows: it's one person. Score 583. Personal income $18,784 MXN/month. If he has a legal problem, a health crisis, or makes one bad strategic decision, all four companies lose their key decision-maker at the same moment.

Entity-level diversification is an illusion when control is concentrated. SQL sees four borrowers. The graph sees one person.

Lehman Brothers had 7,000 legal entities across 40 countries. The portfolio looked diversified. When the parent failed, all 7,000 failed at once. This is why the G20 mandated the creation of GLEIF in 2011.

SQL: self-join to find shared director
finds the count. not the exposure.
two more JOINs for that. which you haven't written.
SELECT a1.id_cliente,
  COUNT(DISTINCT a1.id_empresa) AS n
FROM accionistas a1
JOIN accionistas a2
  ON a1.id_cliente = a2.id_cliente
  AND a1.id_empresa <> a2.id_empresa
GROUP BY a1.id_cliente
HAVING COUNT(DISTINCT a1.id_empresa) >= 2
still 2 JOINs short of knowing the actual risk
Cypher: pattern is the query
MATCH (c:ClienteIndividual)
  -[:ES_ACCIONISTA_DE|ES_DIRECTOR_DE]->(e)
WITH c, count(e) AS n
WHERE n >= 2
MATCH (c)-[:ES_ACCIONISTA_DE]->(e2)
  -[:TIENE_PRESTAMO]->(p)
RETURN
  c.nombre_completo, n,
  sum(p.saldo_vigente_mxn)
    AS true_concentration
ORDER BY true_concentration DESC
Shared-controller detection in SQL requires a self-join on the relationship table, then additional joins to aggregate exposure. In Cypher, "find persons connected to multiple companies" is a structural pattern: the relationship is first-class, not a derived fact from table joins. Mexican regulation requires this under CNBV Circular Única de Bancos, Article 73 (personas relacionadas).
$7,163,560 MXN true single-person concentration. SQL reports 4 separate borrowers across 3 sectors

The pipeline: 8 scripts, sequential

No orchestrator. Each script reads the output of the previous one. Real data sources for ownership topology, synthetic credit layer calibrated to CNBV benchmarks.

00
acquire_gleif.py
7,015 MX LEI entities + 287 parent-child edges
GLEIF API (live)
01
acquire_psc.py
600 UK PSC controllers + 3,935 ownership edges
BODS Datasette
02
extract_subgraph.py
NetworkX components → 300 persons, 366 companies
NetworkX
03
map_to_creditgraph.py
Faker MX + lognormal amounts + 4% NPL calibration
Synthetic
04
embed_scenarios.py
5 structural patterns planted: chains, cycles, hubs
Synthetic
05
embed_quality.py
127 quality issues: neg. incomes, dup. CURPs, orphan FKs
Intentional
06
generate_neo4j.py
Validated CSVs + Cypher UNWIND, 26 FK rejections
Validation
07
load_neo4j.py
Batch inserts → 853 nodes, 726 edges live in AuraDB
Neo4J AuraDB

What this project is, and what it isn't

These numbers aren't auditable facts. They come from synthetic distributions calibrated to CNBV benchmarks. Each finding is better read as: this kind of exposure exists in real portfolios, and here is how you'd find it.

What it demonstrates

  • Graph databases make relationship queries structurally different, not just faster: cycle detection, variable-depth traversal, and fan-out are native operations
  • The same structural patterns (circular guarantees, hidden concentration) appear in real Mexican banking regulation under CNBV Circular Única de Bancos, Article 73
  • Ownership topology from real public sources (UK PSC + GLEIF), not randomly generated graphs
  • Data quality issues catalogued from operational experience: negative incomes, duplicate CURPs, orphaned foreign keys
  • The method transfers: swap synthetic data for a real loan book, keep the graph schema and Cypher queries, and it runs unchanged

What it doesn't claim

  • Specific MXN amounts are not meaningful: they come from synthetic lognormal distributions
  • LightGBM scores are not validated against real defaults (scoring module is planned, not built)
  • The 7.9% NPL rate does not represent a real Mexican portfolio: it is calibrated to CNBV public benchmarks
  • This is not investment advice or regulatory analysis

The difference between this prototype and production is data source and scale, not logic. The UK PSC topology was chosen precisely because real ownership structures (shared directors, nested ownership) cannot be replicated by random generators. You need real corporate control relationships to make graph queries meaningful.

Explore the portfolio graph

A small slice of the live graph: 29 of the 853 nodes, chosen because each one participates in at least one of the four patterns below. Drag a node, hover for details, click a finding to see just its subgraph.

Individual client
Company
Active loan
Defaulted loan
Restructured
Drag nodes • Scroll to zoom • Click finding buttons above