Is your database GDPR-proof? How can we use Python and NLP tools to check it?

Of course, everybody tries to anonymize sensitive information, but it is not an easy task, and sometimes we make mistakes.

The aim of this post is to give you a tool that highlights which columns of which tables might contain sensitive data.

First I will show you what you can use if your database contains Hungarian text. It is a harder task than working with English text, because the spaCy and NLTK packages don't fully support Hungarian. There are some solutions that try to integrate Hungarian into spaCy, but they are not very convenient.

I will talk about the following:

  • Which types of NER are worth checking?
  • How can you do it?

Which types of NER are worth checking?

First, check dictionary-based NER: first names and company types, for example, can be matched against simple word lists. Next, check rule-based NER for phone numbers, email addresses, etc.
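
Such rule-based entities can be caught with regular expressions. A minimal sketch (the pattern below is an illustrative assumption, not a complete email grammar):

import re

email_pattern = r'[\w.+-]+@[\w-]+\.[\w.-]+'  # illustrative, not RFC-complete
print(re.findall(email_pattern, 'send it to jane.doe@example.com, please'))
# ['jane.doe@example.com']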

Then you can try the NLTK and spaCy Python packages, but it won’t be easy. I recommend the curated awesome-hungarian-nlp list of György Orosz, who is one of the most experienced Hungarian NLP experts:

https://awesomeopensource.com/project/oroszgy/awesome-hungarian-nlp
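
For English text, spaCy’s built-in NER is a quick way to spot names and organizations. A minimal sketch, assuming the en_core_web_sm model is installed:

import spacy

nlp = spacy.load('en_core_web_sm')  # python -m spacy download en_core_web_sm
doc = nlp('John Smith sent an email to Acme Corp from Budapest.')
print([(ent.text, ent.label_) for ent in doc.ents])
# something like [('John Smith', 'PERSON'), ('Acme Corp', 'ORG'), ('Budapest', 'GPE')]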

If you accept my advice, start with the cherry-picking approach sketched above.

How can you implement it in Python?

1. Let’s create a list of string-type columns.

We will examine only these columns.
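
A minimal setup sketch; db_1.db, the pois table, and the path variable come from this post’s example, while the select statement that fills table is my assumption:

import re
import sqlite3
import pandas as pd

path = './'  # adjust to your environment
conn = sqlite3.connect(path + 'db_1.db')
table = pd.read_sql('select * from pois limit 1000', con=conn)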

string_type_columns = [col for col in table.columns if table[col].dtype == 'O']

2. Let’s read the NER dictionaries, which you can download here:

with open('firstnames.txt', encoding='latin1') as f:
    lines = f.readlines()
firstnames = [x.replace('\n', '') for x in lines]
print(firstnames[10:20])

with open('company_types.txt', encoding='latin1') as f:
    lines = f.readlines()
company_types = [x.replace('\n', '') for x in lines]
print(company_types[10:20])

3. Let’s check if there are any findings in the database.

My purpose is not to find every row that contains sensitive data, but to highlight tables that aren’t GDPR-proof. For this reason, I always query only a sample of each table.

def return_sample_df_by_column(db_name, table_name, column_name):
    # Take at most 100 distinct non-null values from a 1000-row sample
    conn = sqlite3.connect(db_name)
    select_str = f"""select distinct(t1.{column_name}) as col
                     from (select {column_name} from {table_name}
                           where {column_name} is not null limit 1000) t1
                     limit 100"""
    sample_data = pd.read_sql(select_str, con=conn)
    sample_data['col'] = sample_data.col.apply(lambda x: x.lower())
    return sample_data

Create a function that checks whether a cell contains an NER element. If the entity list holds a single item, it is treated as a regular expression; otherwise it is used as a word dictionary.

def get_diff(cell_value, entity_list):
    # A single-element entity list is treated as a regexp pattern
    if len(entity_list) == 1:
        diff = re.findall(entity_list[0], cell_value.replace(' ', ''))
    else:
        # Dictionary mode: keep the words present in both the cell and the entity list
        diff = list(set(cell_value.split()) & set(entity_list))
    if len(diff) == 0:
        return None
    return " ".join(diff)
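
A quick illustration of both modes (the sample strings are made up):

# Dictionary mode: the entity list has several elements
print(get_diff('kovács jános utca 12', ['jános', 'péter']))  # -> 'jános'
# Regexp mode: the entity list has a single element
print(get_diff('06 30 123 4567', [r'06\d{9}']))  # -> '06301234567'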

Let’s check if there are any matches against the first-name dictionary:
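
A sketch of this step, assuming a hypothetical text column called name in the pois table:

sample_data = return_sample_df_by_column(path + 'db_1.db', 'pois', 'name')  # 'name' is a hypothetical column
sample_data['findings'] = sample_data['col'].apply(lambda x: get_diff(x, firstnames))
print(sample_data[sample_data.findings.notnull()])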

Let’s check if there are any matches against the company-types dictionary:
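
The same step with the other dictionary (a sketch):

sample_data['findings'] = sample_data['col'].apply(lambda x: get_diff(x, company_types))
print(sample_data[sample_data.findings.notnull()])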

Let’s check if there are phone numbers in the cells:
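
Here get_diff switches to regexp mode because the entity list has a single element. The pattern below is an illustrative guess at Hungarian phone formats, not a complete one:

phone_pattern = [r'(?:\+36|06)\d{8,9}']  # hypothetical: +36 or 06 prefix followed by 8-9 digits
sample_data['findings'] = sample_data['col'].apply(lambda x: get_diff(x, phone_pattern))
print(sample_data[sample_data.findings.notnull()])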

OK, let’s put everything together and iterate over the string-type columns.

def get_findings_df(sample_data, entity_name, entity_list,
                    db_name, table_name, column_name):
    sample_data['findings'] = sample_data['col'].apply(lambda x: get_diff(x, entity_list))
    findings = sample_data[sample_data.findings.notnull()].copy()
    findings['entity'] = entity_name
    findings['db_name'] = db_name  # use the parameter instead of a hardcoded file name
    findings['table_name'] = table_name
    findings['column_name'] = column_name
    return findings

def return_findings_by_column(sample_data, entity_dict, db_name, table_name, column_name):
    # Collect the findings for every entity type into one dataframe
    # (pd.concat replaces the deprecated DataFrame.append)
    dfs = [get_findings_df(sample_data, key, entity_list, db_name, table_name, column_name)
           for key, entity_list in entity_dict.items()]
    return pd.concat(dfs).reset_index(drop=True)
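
The entity_dict used below maps entity names to the entity lists we built; a sketch of what I assume it looks like (the phone pattern is the illustrative one from above):

entity_dict = {'firstname': firstnames,
               'company_type': company_types,
               'phone_number': [r'(?:\+36|06)\d{8,9}']}  # single-element list -> regexp mode in get_diff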

Now run your GDPR-proof tester:

db_df = pd.DataFrame()
db_name = path + 'db_1.db'
table_name = 'pois'
print(string_type_columns)
for column_name in string_type_columns:
    print(column_name)
    sample_data = return_sample_df_by_column(db_name, table_name, column_name)
    findings = return_findings_by_column(sample_data, entity_dict, db_name, table_name, column_name)
    db_df = pd.concat([db_df, findings])
db_df = db_df.reset_index(drop=True)

Keep the table format shown above: if you want to load the results into a BI tool, for example, this format is handy for visualization.