I decided to shift my focus on this project temporarily to fit the needs of another. What I'll be working on for the next couple weeks is data collection and sorting. There are a couple of things I want to accomplish:
- Censorship that works. All I can think of when coding chatbots and interactive programs is Microsoft's Tay AI. So naturally, censorship of racist content, profanity, and NSFW images is at the top of my list.
- Duplicate prevention. Fewer duplicates save storage space and improve data quality. This is easy to do with strings and numbers, harder with images.
- Readable representation of data. The collected data needs to be presented in a form that humans can read, and formatted so it can easily be used from other programming languages and functions.
Censoring
For the censorship of words, I assume the easiest method would be to create a word bank and scan every piece of data before storing it, replacing or removing offensive words or phrases.
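A minimal sketch of that word-bank approach (the bank contents and the replacement string here are placeholders, not a real list):

```python
import re

# Illustrative word bank; a real one would be far larger and curated.
WORD_BANK = {"badword", "slur"}

def censor(text, word_bank=WORD_BANK, replacement="****"):
    """Replace any banked word (case-insensitive, whole words only)
    before the text is stored."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in word_bank) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(replacement, text)
```

Using `\b` word boundaries avoids mangling innocent words that merely contain a banked substring.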
For images, I see it being more of a challenge. Text in images would first need to be converted to a string (neural net/TensorFlow?), then run against the same word bank. For images overall, an option I came across is a Python program that uses pre-learned examples to assign each image an "NSFW score" representing the likelihood of it being NSFW. If that program lives up to its GitHub README, running every image through it against a score threshold should do the trick. Otherwise, building a program to do that myself might take a while.
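The thresholding step itself would be tiny; the sketch below assumes a hypothetical `nsfw_score` function standing in for whatever the pre-trained classifier actually exposes:

```python
NSFW_THRESHOLD = 0.8  # would need tuning against a labelled sample

def nsfw_score(image_bytes):
    # Hypothetical stand-in for the pre-trained classifier's scoring call;
    # a real pipeline would replace this with the actual model.
    return 0.0

def keep_image(image_bytes, threshold=NSFW_THRESHOLD):
    """Store an image only if its NSFW score falls below the threshold."""
    return nsfw_score(image_bytes) < threshold
```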
Duplicates
Scanning a set of strings for duplicate data is a cinch. Looking at the image-scanning Python program I mentioned above gave me the idea of implementing a "duplicate score" of sorts to determine how similar strings are, for greater accuracy on more complex data.
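Both ideas fit in a few lines of standard-library Python; here `difflib.SequenceMatcher` is one possible way to get a similarity ratio, not necessarily the one I'll settle on:

```python
from difflib import SequenceMatcher

def dedupe_exact(items):
    """Drop exact duplicate strings while preserving order."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

def duplicate_score(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()
```

Near-duplicates could then be flagged by comparing `duplicate_score` against a cutoff instead of requiring exact equality.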
Images, however, are a different story. After a bit of reading, I came across a couple of articles on image hashing. Image hashing is the process of creating a hash value based on an image's contents (different images produce different hash values). Comparing one image's hash value to another's can give a somewhat accurate result when checking for duplicates.
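One common variant is the "average hash": downscale the image to a tiny grayscale grid, then set one bit per pixel depending on whether it is brighter than the mean. The sketch below assumes the decoding and downscaling have already happened (a real pipeline would use an imaging library for that step) and works on a plain 2-D list of grayscale values:

```python
def average_hash(pixels):
    """pixels: 2-D list of grayscale values (e.g. an 8x8 downscaled image).
    Each bit of the hash is 1 if that pixel is above the mean brightness."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(h1, h2):
    """Number of differing bits; a small distance suggests near-duplicates."""
    return bin(h1 ^ h2).count("1")
```

Because perceptually similar images produce similar bit patterns, a small Hamming distance between hashes can catch resized or slightly edited copies that an exact byte comparison would miss.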
Image hashing is a pretty cool topic with a lot to talk about; I'll write an article about it in the near future. When I do, I'll post a link to it here.
Readability
The two options for readability I came up with are:
- A simple text report.
- An HTML webpage.
The obvious solution, and the one I'll probably use at first, is the simple text report. While fine-tuning the data collection, I see no use in going over the top with the report. In the finished product, however, I'd like an auto-generated webpage to nicely display all the collected data.
As for usability in other languages, I'll likely store the data in SQL databases.
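As a small proof of concept, Python's built-in sqlite3 module can stand in for whatever SQL backend I end up using; the table and column names here are made up for illustration, and a `UNIQUE` constraint even gives duplicate prevention for free:

```python
import sqlite3

def store_items(db_path, items):
    """Create a simple table (if needed) and insert collected strings,
    skipping duplicates via the UNIQUE constraint."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS collected (content TEXT UNIQUE)"
        )
        conn.executemany(
            "INSERT OR IGNORE INTO collected (content) VALUES (?)",
            [(item,) for item in items],
        )
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM collected").fetchone()[0]
    finally:
        conn.close()
```

Since SQL is language-agnostic, anything from Java to JavaScript could query the same database later.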
As always, any comments, suggestions, or questions are greatly appreciated!
Leave a comment, or message me on twitter 🚀