0a. Install Chinese Text Analyser (CTA): https://www.chinesetextanalyser.com/
0b. If you’re not going to use CTA’s Lua scripting feature (to get more than one paragraph/sentence per word), familiarize yourself with this Excel macro; if you do know Lua scripting, using that will be quicker and easier. If you write scripts, please share them. If you only want to see the first paragraph/sentence associated with each word, you can ignore most of what follows.
1a. Get the text(s) you want in plain-text format. I recommend gathering several texts at once so you only have to do the work once. I’ve found books at http://www.qisuu.com/, for example; if you find other sources, please share. Searching Baidu or Google for 免费下载 (“free download”) also often works.
1b. In my dictionaries I like to use the entire paragraph, and I like to see the source and the paragraph number. So I paste the text into Excel and add (if I am making one dictionary from several books) a column for the book’s initials and a column for the paragraph number, then concatenate the columns.
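If you’d rather script this step than do it in Excel, here is a minimal standalone-Lua sketch (it does not touch the CTA API). The file names and the initials are placeholder assumptions, and it assumes one paragraph per line in the input.

-- tag_paragraphs.lua: prefix each paragraph with [initials:number] (step 1b).
-- "book.txt", "tagged.txt" and the initials "WL" are invented placeholders.
local initials = "WL"                      -- e.g. for one of your books
local out = io.open("tagged.txt", "w")
local n = 0
for line in io.lines("book.txt") do
  if line:match("%S") then                 -- skip blank separator lines
    n = n + 1
    out:write(string.format("[%s:%d] %s\n", initials, n, line))
  end
end
out:close()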
2… Import into Chinese Text Analyser to segment. The default segmenter uses a list of 100k+ words; I downloaded some extra dictionaries and use the words from those as well, so my words.u8 file has 600k+ words. It has more idioms, for example, but also more rubbish. Out of the box, CTA can export the frequency of each word in the text and the first sentence or paragraph it appears in, along with some other data. If you know Lua scripting you can get it to do more (https://www.chinesetextanalyser.com/docs/windows/lua-api.html) and you can skip much of what I am doing in Excel.
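CTA computes the counts for you, but to make “frequency of the word in the text” concrete: given a segmented export with words separated by spaces (as in step 3b), counting is a few lines of standalone Lua. A script written against CTA’s own Lua API would look different; see the docs linked above. The file name here is an assumption.

-- count_words.lua: word frequencies over a space-separated segmented text.
local counts = {}
for line in io.lines("segmented.txt") do
  for word in line:gmatch("%S+") do
    counts[word] = (counts[word] or 0) + 1
  end
end
for word, n in pairs(counts) do
  io.write(word, "\t", n, "\n")
end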
3a. Export the list of words. Paste into Excel or Notepad++ to delete duplicates. Filter out the most frequent words, say those with a frequency higher than 1k (see note 3).
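The dedupe-and-filter step is also a few lines of standalone Lua. The “word, tab, count” input layout is a guess on my part; adjust the pattern to whatever columns your CTA export actually contains.

-- filter_words.lua: drop duplicates and words with frequency > 1000.
local seen, kept = {}, {}
for line in io.lines("cta_export.txt") do   -- assumed: word<TAB>count per line
  local word, count = line:match("^([^\t]+)\t(%d+)")
  if word and not seen[word] and tonumber(count) <= 1000 then
    seen[word] = true
    kept[#kept + 1] = word
  end
end
local out = io.open("wordlist.txt", "w")
out:write(table.concat(kept, "\n"), "\n")
out:close()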
3b. Export the segmented text.
4a. Import the list of words into Pleco “Import Cards.” Pleco will add the correct pinyin (and dictionary definitions).
4b. Export the cards again.
You want the Pleco pinyin so that your user dictionary entry is displayed together with the other dictionaries’ entries under the same pinyin, so that you can cycle through it together with the other dictionaries when reviewing, and so that it is not a separate entry in the word list (on the left).
5… Associate the list of words with the paragraphs using the Excel macro (or a Lua script). If you want to use the Excel macro, first wrap every word in the list in markers, for example @@@单词@@@, and in Word or Notepad change every space in the segmented text to @@@. This makes the matching more reliable: @@@单词@@@ can only match 单词 as a whole segmented word, not as part of a longer word.
So in the macro-enabled workbook, put the paragraphs with the extra @@@s in Sheet1 and the @@@wordlist@@@ in Sheet2; both lists have headers. Run the macro (a Lua sketch of the same logic follows the example below). For 武林外传 (almost 1 million characters) I think running it took me almost an hour; Excel was working but unresponsive.
My paragraphs looked like this, for example, for 武林外传 (this one is associated with the word 原原本本; episode 76, paragraph 305; in this run I used x3 rather than @@@ as the marker):
x3原原本本x3 [x376x3:x3305x3] x3老x3x3白x3:(x3再次x3x3恳求x3)x3湘x3x3玉x3x3我x3x3真x3x3没x3x3想x3x3骗x3x3你x3,x3这x3x3钱x3x3我x3x3有x3x3急用x3。x3我x3x3现在x3x3没x3x3时间x3x3跟x3x3你x3x3解释x3,x3但是x3x3我x3x3向x3x3你x3x3保证x3x3我x3x3不是x3x3用x3x3在x3x3歪道x3x3上x3。x3你x3x3先x3x3把x3x3钱x3x3借给x3x3我x3,x3回头x3x3有时x3x3间x3,x3我x3x3原原本本x3x3一个x3x3字x3x3都x3x3不x3x3落的x3,x3给x3x3你x3x3解释x3x3清楚x3,x3行x3x3吗x3?
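For script users, here is a rough standalone-Lua equivalent of the Excel macro. It is a sketch under stated assumptions: the file names are invented, the input has one paragraph per line and one @@@word@@@ per line, and I cap matches at five paragraphs per word to avoid the blow-up described in note 3 for words like 了 and 的.

-- associate.lua: for each @@@word@@@, collect the paragraphs containing it.
-- A plain (non-pattern) find is used, so a word matches only as a whole
-- segmented word, never as part of a longer one.
local paras = {}
for line in io.lines("segmented.txt") do    -- spaces already turned into @@@
  paras[#paras + 1] = line
end
local out = io.open("associations.txt", "w")
for word in io.lines("wordlist.txt") do     -- one @@@word@@@ per line
  local hits = {}
  for _, p in ipairs(paras) do
    if #hits == 5 then break end            -- cap; the macro keeps everything
    if p:find(word, 1, true) then           -- true = plain find, no patterns
      hits[#hits + 1] = p
    end
  end
  out:write(word, "\t", table.concat(hits, "\t"), "\n")
end
out:close()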
6a. Paste the words, the pinyin, the frequency count(s), and as many of the associated paragraphs as you want into a new Excel sheet.
6b. Run SUBSTITUTE over the paragraph columns (still with the @@@s in them) to set the headword off with spaces and delete the remaining markers, changing the paragraph to this, for example (I don’t remember how many substitution steps this took me):
[76:305] 老白:(再次恳求)湘玉我真没想骗你,这钱我有急用。我现在没时间跟你解释,但是我向你保证我不是用在歪道上。你先把钱借给我,回头有时间,我 原原本本 一个字都不落的,给你解释清楚,行吗?
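The same substitutions in standalone Lua, for reference (with @@@ as the marker; swap in x3 if that is what you used). The escaping line is only needed because gsub treats its first argument as a pattern; Chinese headwords rarely contain pattern magic characters, but it costs nothing to be safe.

-- cleanup.lua: step 6b's substitutions. Set the headword off with spaces,
-- then delete every remaining @@@ marker.
local function clean(paragraph, word)
  local esc = word:gsub("[%^%$%(%)%%%.%[%]%*%+%-%?]", "%%%0")
  local s = paragraph:gsub("@@@" .. esc .. "@@@", " " .. word .. " ")
  return (s:gsub("@@@", ""))                -- parentheses drop gsub's count
end
-- clean(markedParagraph, "原原本本") yields the [76:305] line above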
Then concatenate everything. You need to add a line-break character wherever you want a line break within a definition; Pleco uses the private-use character U+EAB1, which is invisible in most fonts.
The result you want is:
风华正茂
feng1hua2 zheng4 mao4
外传: 2 | CTA: 1 | Subs: 0.21 | Int: 0.94 | News:
[76:286] 秀才:(众人拉住掌柜的)等会儿,我呢新写了首诗,送给 风华正茂 的您。
[76:320] 掌柜的:秀才你送给我的那首 风华正茂 的诗呢?
That is:
单词[TAB]dan1ci2[TAB](frequency counts, optionally)[LINE-BREAK]First paragraph[LINE-BREAK]Second paragraph[LINE-BREAK] …
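To make that format concrete, here is a sketch that assembles one import line in standalone Lua. The only Pleco-specific fact used is the U+EAB1 line break; the function name and input shapes are my own invention.

-- build_entry.lua: headword TAB pinyin TAB definition, with U+EAB1
-- (UTF-8 bytes \238\170\177) as the in-definition line break.
local BR = "\238\170\177"                   -- U+EAB1, Pleco's line break
local function entry(word, pinyin, counts, paragraphs)
  return word .. "\t" .. pinyin .. "\t"
      .. counts .. BR .. table.concat(paragraphs, BR)
end
print(entry("风华正茂", "feng1hua2 zheng4 mao4",
  "外传: 2 | CTA: 1 | Subs: 0.21 | Int: 0.94 | News:",
  { "[76:286] 秀才:(众人拉住掌柜的)等会儿,我呢新写了首诗,送给 风华正茂 的您。",
    "[76:320] 掌柜的:秀才你送给我的那首 风华正茂 的诗呢?" }))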
7… Save as a .txt file (UTF-8) and import it as a Pleco user dictionary.
Note 1: I’ve never actually done this for more than one book yet. Now that I think about it, if a word is somewhat frequent, perhaps you don’t want to see only the paragraphs from the first book?
Note 2: I also add some frequency counts from a few studies. You could also gather your own corpus and take frequency counts from that.
Note 3: I think the Excel macro is slow on longer texts mainly because it associates every paragraph that contains a word with that word. For words like 了 and 的 this gives you thousands of columns; my Excel file for 武林外传 has 13,000 columns…
Note 4: This is from memory, I may have skipped something or put it in the wrong order.