I'm trying to analyze some data I have but there is a lot of inconsistencies in my data.
I have a SQL table that I'm trying to analyze.
The table is a table of universities with the following structure: name:string, city:string, state:string, country:string
Name is always present however city, state, country can be missing.
My main issue is that there are a ton of typos and different declination of a university name.
For example here are the declination of Standford Unversity I find when I do SELECT "universities".* FROM "perm_universities" WHERE (name like '%stanford%'):
stanford university - stanford - ca - united states of america
the leland stanford junior university - stanford - ca - united states of america
leland stanford jr. university - stanford - ca - united states of america
stanford university graduate school of business - stanford - ca - united states of america
the leland stanford junior university (stanford university) - stanford - ca - united states of america
leland stanford junior university - stanford - ca - united states of america
stanford university - stanford - -
leland stanford jr. university, graduate school of business - stanford - ca - united states of america
stanford law school - stanford - ca - united states of america
stanford - stanford - ca - united states of america
stanford university, graduate school of business - stanford - ca - united states of america
stanford graduate school of business - stanford - ca - united states of america
stanford univerity - stanford - ca - united states of america
stanford university (the leland stanford junior university) - stanford - ca - united states of america
the leland stanford jr. university - palo alto - ca - united states of america
leland stanford junior university, school of law - stanford - ca / n/a - united states of america
stanford universit - stanford - ca - united states of america
the leland stanford university - stanford - ca - united states of america
leland standford stanford junior university - stanford - ca - united states of america
stanford university - cambridge - ma - united states of america
the leland stanford junior university 'stanford university' - stanford - ca - united states of america
stanford university school of law - stanford - ca - united states of america
stanford univresity - stanford - ca - united states of america
the leland stanford jr. university (stanford university) - stanford - ca - united states of america
leeland stanford junior university - stanford - ca - united states of america
leland stanford junion university - - ca - united states of america
leland stanford junior university (stanford university) - stanford - ca - united states of america
the leland stanford junior university - stanford - -
stanford university - graduate school of business - stanford - ca - united states of america
graduate school of business, stanford university - stanford - ca - united states of america
stanford universoty - stanford - ca - united states of america
leland stanford junior university - stanford - -
stanford univeristy - palo alto - ca - united states of america
leland stanford university - palo alto - ca - united states of america
stanford university - stanford - ca / n/a - united states of america
the leland stanford junior university, stanford university - stanford - ca - united states of america
the leland stanford junior university graduate school of business - stanford - ca - united states of america
stanford universtiy - stanford - ca - united states of america
stanford univerisity - stanford - ca - united states of america
stanford university - stanford - ct - united states of america
stanford law scool - stanford - ca - united states of america
mba: stanford university - stanford - ca - united states of america
They are all the same university, but some have typos, some have different names, some have no cities, some have the wrong cities, ... The data isn't great.
So I'm trying to fix it. How can I consolidate this data?