0a. Install Chinese Text Analyser (CTA): https://www.chinesetextanalyser.com/
0b. If you’re not going to use CTA’s Lua scripting feature (to get more than one paragraph/sentence per word), familiarize yourself with this Excel macro; if you do know Lua scripting, using that will be quicker and easier. If you write scripts, please share them. If you only want to see the first paragraph/sentence associated with each word, you can ignore most of what follows.
1a. Get the text(s) you want in plain-text format. I recommend gathering several texts at once so you only have to do the work once. I’ve found books at http://www.qisuu.com/, for example; if you find other sources, please share. Searching Baidu or Google for 免费下载 (“free download”) also often works.
1b. In my dictionaries I like to use the entire paragraph, and I like to see the source and the paragraph number. So I paste the text into Excel and add (if I am making one dictionary from several books) a column for the book’s initials and a column for the paragraph number, then concatenate the columns.
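If you’d rather script this step than do it in Excel, here is a minimal standalone-Lua sketch (it does not touch the CTA API). The file names and the initials are placeholder assumptions, and it assumes one paragraph per line in the input.

-- tag_paragraphs.lua: prefix each paragraph with [initials:number] (step 1b).
-- "book.txt", "tagged.txt" and the initials "WL" are invented placeholders.
local initials = "WL"                      -- e.g. for one of your books
local out = io.open("tagged.txt", "w")
local n = 0
for line in io.lines("book.txt") do
  if line:match("%S") then                 -- skip blank separator lines
    n = n + 1
    out:write(string.format("[%s:%d] %s\n", initials, n, line))
  end
end
out:close()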
2… Import into Chinese Text Analyser to segment. The default segmenter uses a list of 100k+ words; I downloaded some extra dictionaries and use the words from those as well, so my words.u8 file has 600k+ words. It has more idioms, for example, but also more rubbish. Out of the box, CTA can export the frequency of each word in the text and the first sentence or paragraph it appears in, along with some other data. If you know Lua scripting you can get it to do more (https://www.chinesetextanalyser.com/docs/windows/lua-api.html) and you can skip much of what I am doing in Excel.
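CTA computes the counts for you, but to make “frequency of the word in the text” concrete: given a segmented export with words separated by spaces (as in step 3b), counting is a few lines of standalone Lua. A script written against CTA’s own Lua API would look different; see the docs linked above. The file name here is an assumption.

-- count_words.lua: word frequencies over a space-separated segmented text.
local counts = {}
for line in io.lines("segmented.txt") do
  for word in line:gmatch("%S+") do
    counts[word] = (counts[word] or 0) + 1
  end
end
for word, n in pairs(counts) do
  io.write(word, "\t", n, "\n")
end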
3a. Export the list of words. Paste into Excel or Notepad++ to delete duplicates. Filter out the most frequent words, say those with a frequency higher than 1k (see note 3).
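The dedupe-and-filter step is also a few lines of standalone Lua. The “word, tab, count” input layout is a guess on my part; adjust the pattern to whatever columns your CTA export actually contains.

-- filter_words.lua: drop duplicates and words with frequency > 1000.
local seen, kept = {}, {}
for line in io.lines("cta_export.txt") do   -- assumed: word<TAB>count per line
  local word, count = line:match("^([^\t]+)\t(%d+)")
  if word and not seen[word] and tonumber(count) <= 1000 then
    seen[word] = true
    kept[#kept + 1] = word
  end
end
local out = io.open("wordlist.txt", "w")
out:write(table.concat(kept, "\n"), "\n")
out:close()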
3b. Export the segmented text.
4a. Import the list of words into Pleco “Import Cards.” Pleco will add the correct pinyin (and dictionary definitions).
4b. Export the cards again.
You want the Pleco pinyin so that your user dictionary entry is displayed together with the other dictionaries’ entries under the same pinyin, so that you can cycle through it together with the other dictionaries when reviewing, and so that it is not a separate entry in the word list (on the left).
5… Associate the list of words with the paragraphs using the Excel macro (or a Lua script). If you want to use the Excel macro, first wrap every word in the list in markers, for example @@@单词@@@, and in Word or Notepad change every space in the segmented text to @@@. This makes the matching more reliable: @@@单词@@@ can only match 单词 as a whole segmented word, not as part of a longer word.
So in the macro-enabled workbook, put the paragraphs with the extra @@@s in Sheet1 and the @@@wordlist@@@ in Sheet2; both lists have headers. Run the macro (a Lua sketch of the same logic follows the example below). For 武林外传 (almost 1 million characters) I think running it took me almost an hour; Excel was working but unresponsive.
My paragraphs looked like this, for example, for 武林外传 (this one is associated with the word 原原本本; episode 76, paragraph 305; in this run I used x3 rather than @@@ as the marker):
x3原原本本x3 [x376x3:x3305x3] x3老x3x3白x3:(x3再次x3x3恳求x3)x3湘x3x3玉x3x3我x3x3真x3x3没x3x3想x3x3骗x3x3你x3,x3这x3x3钱x3x3我x3x3有x3x3急用x3。x3我x3x3现在x3x3没x3x3时间x3x3跟x3x3你x3x3解释x3,x3但是x3x3我x3x3向x3x3你x3x3保证x3x3我x3x3不是x3x3用x3x3在x3x3歪道x3x3上x3。x3你x3x3先x3x3把x3x3钱x3x3借给x3x3我x3,x3回头x3x3有时x3x3间x3,x3我x3x3原原本本x3x3一个x3x3字x3x3都x3x3不x3x3落的x3,x3给x3x3你x3x3解释x3x3清楚x3,x3行x3x3吗x3?
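For script users, here is a rough standalone-Lua equivalent of the Excel macro. It is a sketch under stated assumptions: the file names are invented, the input has one paragraph per line and one @@@word@@@ per line, and I cap matches at five paragraphs per word to avoid the blow-up described in note 3 for words like 了 and 的.

-- associate.lua: for each @@@word@@@, collect the paragraphs containing it.
-- A plain (non-pattern) find is used, so a word matches only as a whole
-- segmented word, never as part of a longer one.
local paras = {}
for line in io.lines("segmented.txt") do    -- spaces already turned into @@@
  paras[#paras + 1] = line
end
local out = io.open("associations.txt", "w")
for word in io.lines("wordlist.txt") do     -- one @@@word@@@ per line
  local hits = {}
  for _, p in ipairs(paras) do
    if #hits == 5 then break end            -- cap; the macro keeps everything
    if p:find(word, 1, true) then           -- true = plain find, no patterns
      hits[#hits + 1] = p
    end
  end
  out:write(word, "\t", table.concat(hits, "\t"), "\n")
end
out:close()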
6a. Paste the words, the pinyin, the frequency count(s), and as many of the associated paragraphs as you want into a new Excel sheet.
6b. Run SUBSTITUTE over the paragraph columns (still with the @@@s in them) to set the headword off with spaces and delete the remaining markers, changing the paragraph to this, for example (I don’t remember how many substitution steps this took me):
[76:305] 老白:(再次恳求)湘玉我真没想骗你,这钱我有急用。我现在没时间跟你解释,但是我向你保证我不是用在歪道上。你先把钱借给我,回头有时间,我 原原本本 一个字都不落的,给你解释清楚,行吗?
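The same substitutions in standalone Lua, for reference (with @@@ as the marker; swap in x3 if that is what you used). The escaping line is only needed because gsub treats its first argument as a pattern; Chinese headwords rarely contain pattern magic characters, but it costs nothing to be safe.

-- cleanup.lua: step 6b's substitutions. Set the headword off with spaces,
-- then delete every remaining @@@ marker.
local function clean(paragraph, word)
  local esc = word:gsub("[%^%$%(%)%%%.%[%]%*%+%-%?]", "%%%0")
  local s = paragraph:gsub("@@@" .. esc .. "@@@", " " .. word .. " ")
  return (s:gsub("@@@", ""))                -- parentheses drop gsub's count
end
-- clean(markedParagraph, "原原本本") yields the [76:305] line above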
Then concatenate everything. You need to add a line-break character wherever you want a line break within a definition; Pleco uses the private-use character U+EAB1, which is invisible in most fonts.
The result you want is:
风华正茂
feng1hua2 zheng4 mao4
外传: 2 | CTA: 1 | Subs: 0.21 | Int: 0.94 | News:
[76:286] 秀才:(众人拉住掌柜的)等会儿,我呢新写了首诗,送给 风华正茂 的您。
[76:320] 掌柜的:秀才你送给我的那首 风华正茂 的诗呢?
That is:
单词[TAB]dan1ci2[TAB](frequency counts, optionally)[LINE-BREAK]First paragraph[LINE-BREAK]Second paragraph[LINE-BREAK] …
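To make that format concrete, here is a sketch that assembles one import line in standalone Lua. The only Pleco-specific fact used is the U+EAB1 line break; the function name and input shapes are my own invention.

-- build_entry.lua: headword TAB pinyin TAB definition, with U+EAB1
-- (UTF-8 bytes \238\170\177) as the in-definition line break.
local BR = "\238\170\177"                   -- U+EAB1, Pleco's line break
local function entry(word, pinyin, counts, paragraphs)
  return word .. "\t" .. pinyin .. "\t"
      .. counts .. BR .. table.concat(paragraphs, BR)
end
print(entry("风华正茂", "feng1hua2 zheng4 mao4",
  "外传: 2 | CTA: 1 | Subs: 0.21 | Int: 0.94 | News:",
  { "[76:286] 秀才:(众人拉住掌柜的)等会儿,我呢新写了首诗,送给 风华正茂 的您。",
    "[76:320] 掌柜的:秀才你送给我的那首 风华正茂 的诗呢?" }))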
7… Save as a .txt file (UTF-8) and import it as a Pleco user dictionary.
Note 1: I’ve never actually done this for more than one book yet. Now that I think about it, if a word is somewhat frequent, perhaps you don’t want to see only the paragraphs from the first book?
Note 2: I also add some frequency counts from a few studies. You could also gather your own corpus and take frequency counts from that.
Note 3: I think the Excel macro is slow on longer texts mainly because it associates every paragraph that contains a word with that word. For words like 了 and 的 this gives you thousands of columns; my Excel file for 武林外传 has 13,000 columns…
Note 4: This is from memory, I may have skipped something or put it in the wrong order.