User talk:Equinox/code/ExtractBookWords

From Wiktionary, the free dictionary

Hello @Equinox,

I used this, changing the line

               if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))

to

               if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z') || (ch == 'æ') || (ch == 'ø') || (ch == 'å') || (ch == 'Æ') || (ch == 'Ø') || (ch == 'Å'))

yet the program still throws out those extra letters. It works fine apart from that. Perhaps you can tell me what I've done wrong?__Gamren (talk) 14:00, 5 October 2016 (UTC)

My immediate thought is that the program uses File.ReadAllText(string) and not File.ReadAllText(string, Encoding): if your file contains complex characters, you might need to specify the encoding, such as UTF-8 or Windows-Latin-1. That's just a guess. Since "throws out" in your comment might mean either "discards" or "outputs" (isn't English dumb?), I don't really understand what problem you are having. Equinox 20:00, 5 October 2016 (UTC)
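A minimal sketch of the encoding point, not taken from the ExtractBookWords program itself. It assumes the input is a single-byte ISO-8859-1 (Latin-1) file, which is only a guess about the file in question; the file name `sample.txt` and the class name are hypothetical. It writes a short Danish test string, reads it back with and without an explicit encoding, and then filters letters with `char.IsLetter`, which may be a simpler alternative to listing æ, ø and å by hand.

```csharp
using System;
using System.IO;
using System.Text;

class EncodingSketch
{
    static void Main()
    {
        // Hypothetical file name, for illustration only.
        const string path = "sample.txt";

        // Assumption: the text is stored as ISO-8859-1, where æ, ø and å
        // each occupy a single byte above 0x7F.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // "blåbærgrød", written with escapes so the source file's own
        // encoding does not matter.
        File.WriteAllText(path, "bl\u00E5b\u00E6rgr\u00F8d 123", latin1);

        // Explicit, matching encoding: the letters are decoded correctly.
        string decoded = File.ReadAllText(path, latin1);

        // No encoding given: the text is treated as UTF-8 unless a byte
        // order mark says otherwise, so the lone high bytes cannot be
        // decoded and are replaced with U+FFFD.
        string guessed = File.ReadAllText(path);

        // char.IsLetter accepts any Unicode letter, including æ, ø and å,
        // so no per-letter comparisons are needed once decoding is right.
        var letters = new StringBuilder();
        foreach (char ch in decoded)
        {
            if (char.IsLetter(ch))
                letters.Append(ch);
        }

        Console.WriteLine("Letters kept: " + letters.Length);
        Console.WriteLine("Replacement characters in default read: "
            + guessed.Contains("\uFFFD"));
    }
}
```

If the real file turns out to use a different code page, the same pattern should apply with that encoding passed to `File.ReadAllText` instead.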
Thank you! UTF8 didn't work for some reason, so I used
           string[] s = File.ReadAllText(INPUT_FILE, Encoding.UTF7)
which worked. I am a complete newbie to C#, or C in general. By "throws out" I meant "discards".__Gamren (talk) 06:26, 6 October 2016 (UTC)
By the way, how would you recommend easily getting books in text format, apart from Gutenberg?__Gamren (talk) 09:20, 6 October 2016 (UTC)
I don't really know any other (legal!) sources. Equinox 16:11, 7 October 2016 (UTC)
If copyright is the issue, how about if one changed the sequence of words, e.g. through alphabetization? Surely that would not constitute infringement?__Gamren (talk) 09:40, 8 October 2016 (UTC)
You're using the copyrighted work to create another work, so I think that counts as a "derivative work" or something. IANAL. Equinox 09:51, 8 October 2016 (UTC)
I mean, I don't think that generating a word list from a typical novel etc. is a problem, but I thought you were asking where to get hold of computerised copies of books that are still in copyright. You'd have to go to illegal torrents etc. (or maybe hack Amazon Kindle's DRM!). Equinox 09:53, 8 October 2016 (UTC)