[자연어처리] a little spell-checker using string edit distance

edit distance 를 사용한 간단한 spell-checker를 만들어 보겠다.

단어를 입력한 후 edit distance(거리)를 비교해서 작은 수의 단어를 찾아 대체해준다.

먼저 기본적으로 String Edit Distance란 두개의 단어가 주어졌을 때 단어 사잉의 "distance(거리)"를 구해서 비교하는 것이다.

쉬운 아이디어 이다.

예를 들어

s1 = apple, s2 = applas 가 있다면 두 단어의 차이는 1. substitute(대체) s2의 a->e, 2. delete(지우기) s2의 s

하면 s1과 s2가 같아진다.

두 단어를 바꾸거나 교체하는 연산은: delete, insert, substitute, copy 가 있다.

연산에서 드는 cost값을 넣을 수 있는데,

delete - (cost 1)

insert - (cost 1)

substitute - (cost 1)

copy - (cost 0) (매칭이 되는 단어이기 때문에 cost 0)

이렇게 볼 수 있다.

이것을 'Levenshtein distance'라고 한다.

그럼 과정을 보겠다.

예를 들어 python cookbook의 문장이 있다.

s - p y t h o n c o o k b o o k

| | | | |

t - p y h e n c o k b a a k

cost 0 0 1 1 2 2 2 2 3 3 3 4 5 5

2개의 삽입과 3개의 단어 교체가 있기 때문에 cost 5가 나왔다.

이렇게 찾는 것이 최소값인데, 만약

s - p y t h o n c o o k b o o k

| | | | | | | | | | | |

t - p y h e n c o k b a a k

cost 0 0 1 2 3 4 5 6 7 8 9 10 11 12

전체적인 문장을 보지 않고 insert, substitute를 한다면 cost는 12까지 나올 수 있다.

사람의 눈으로는 python cookbook의 거리를 최소로 찾는데 가능하다. 하지만 비교 문장이 길어진다면 어떠한 규칙이 있지 않는 이상 비교가 힘들다

예) s - If there is anyone out there, who still doubt that america is ...

t - if those who doubt that anyone ...

cost가 엄청크게 나올 수 있다.

그래서 사용하는 것이 Dynamic Program Table for String Edit Distance 이다.

Dynamic Program Table은 말 그대로 앞에 했던 결과를 축적해가면서 테이블을 만드는 것이다.

PARK와 SPAKE의 거리를 측적하는 것이다.

아래로 가는 연산은 delete(cost 1), 오른쪽으로 가는 것은 insert(cost 1), 대각선은 substitute/copy (cost 1 or 0)

목표는 cost가 가장 적은 값을 구해가는 것이다.

위의 표를 자세히 천천히 잘 살펴보면 이해할 수 있을 것이다.

우리가 원하는 것은 최소의 distance 값이다. 곧 cost가 가장 작은 것을 찾는 것이다.

위의 이론을 가지고 python으로 프로그래밍해보겠다.

import argparse
def stredit (s1, s2):
len1 = len(s1) #vertical
len2 = len(s2) #horizontal
#distance table 생성
table = [None]*(len2+1)
for i in range(len2+1):
table[i] = [0]*(len1+1)
for i in range(1, len2+1):
table[i][0] = i
for i in range(1, len1+1):
table[0][i] = i
for i in range(1, len2+1):
for j in range(1, len1+1):
if s1[j-1] == s2[i-1]:
d = 0
else:
d = 1
table[i][j] = min(table[i-1][j-1] + d,
table[i-1][j] + 1,
table[i][j-1]+1)
# substitute or copy
# delete
# insert
return table[len2][len1]
ap = argparse.ArgumentParser()
ap.add_argument("-s", "--string", required = True,
help = "write to find the word")
args = vars(ap.parse_args())
#input word
str1=args["string"]
# make dictionary
dic = {0:'apple', 1:'banana', 2:'watermelon', 3:'orange'}
# 각 dic과 비교해서 얻을 distance를 넣을 result list생성
result = [None] * len(dic)
for i in range(0, len(dic)):
result[i] = stredit(str1, dic[i])
# str1에 distance가 최소인 단어 저장
str1 = dic[result.index(min(result))]
print(str1)

콘솔에 돌린 결과

'IT > Python' 카테고리의 다른 글

color histogram 연습 (0)	2015.06.02
[자연어처리] 간단하게 만든 긍정, 부정, 중립 분류 using naive bayes classifier (8)	2014.12.19
파일을 만들기 전에 파일 존재 확인하는 코드 (0)	2014.12.19
python beautifulsoap - table parsing (0)	2014.12.19
python txt파일 읽기 에러 'cp949' (4)	2014.12.19

Creation & Freedom

[자연어처리] a little spell-checker using string edit distance

'IT > Python' 카테고리의 다른 글

티스토리툴바

[자연어처리] a little spell-checker using string edit distance

'IT > Python' 카테고리의 다른 글

관련글

티스토리툴바