Rev 6879
Rev | Author | Line No. | Line |
---|---|---|---|
6793 | bpr | 1 | WIMS' search engine and als |
6797 | czzmrn | 2 | =========================== |
6405 | czzmrn | 3 | |
6797 | czzmrn | 4 | WIMS' search engine works in two stages: |
6405 | czzmrn | 5 | |
7690 | bpr | 6 | 1) update of the index files when server data changes (e.g. a module is added), |
6802 | reyssat | 7 | typically once a day. |
8 | 2) use of index files at each user's request to find some activities |
||
9 | |||
10 | |||
7690 | bpr | 11 | Here are some details : |
6802 | reyssat | 12 | |
7690 | bpr | 13 | 1) update of index files |
6802 | reyssat | 14 | =========================== |
15 | A series of scripts creates a set of auxiliary files (generally |
||
6797 | czzmrn | 16 | stored in ~/public_html/bases/sys/, see description further down) and |
17 | a list of "keywords" (stored in ~/public_html/bases/site/). |
||
6405 | czzmrn | 18 | |
6797 | czzmrn | 19 | The scripts must be run in the order given here, as some files |
20 | created in earlier stages are used in subsequent stages. In general |
||
21 | the whole process is run by the script ~/bin/mkindex. |
||
22 | |||
7690 | bpr | 23 | * Firstly a series of 3 perl scripts (mkdomain, mkwgrp, modindclass), |
24 | that ~/bin/mkindex calls via ~/public_html/bases/sys/mkindex.sh : |
||
6802 | reyssat | 25 | |
26 | - the program ~/public_html/bases/sys/mkdomain.pl creates the lists |
||
27 | of domains from the graph in domain/domain with its translations |
||
28 | (domain/domain.$lang) and in json format (english) to be used for |
||
6879 | bpr | 29 | completion in modtool properties; it also creates the domain/domaindic.xx |
30 | files, used as a dictionary in modind and in the search engine |
||
7690 | bpr | 31 | |
6797 | czzmrn | 32 | - the perl program ~/public_html/bases/sys/mkwgrp.pl reads the INDEX |
7690 | bpr | 33 | files of all the modules on the site and generates |
6797 | czzmrn | 34 | |
35 | - keywords (in .json format, to be used for completion in the search |
||
36 | engine) |
||
37 | - the files in wgrp |
||
38 | |||
39 | (using keywords and keywords_$lang from the INDEX files, according |
||
40 | to this rule: take keywords_$lang if it exists, otherwise keywords, |
||
41 | whether or not the module is a $lang-module). |
||
42 | |||
43 | Some files are created in keywords, such as keywords/algebra.fr.tmp, but |
||
44 | are not used for the moment. The keywords in these "keywords files" are |
||
45 | exactly those in the variable keywords_$lang if it exists, otherwise |
||
46 | those in the variable keywords (whether or not the module is a |
||
47 | $lang-module). |
||
6879 | bpr | 48 | It also adds the language version of the domains (see domain/domain.xx). |
6797 | czzmrn | 49 | |
6793 | bpr | 50 | - the program ~/public_html/bases/sys/modindclass.pl creates the lists |
6797 | czzmrn | 51 | of keywords coming from the example classes in |
6800 | reyssat | 52 | ~/public_html/bases/class as well as the files author, |
6797 | czzmrn | 53 | description, language, level, title (no ranking is done). |
6879 | bpr | 54 | |
55 | Be careful: to be used as a dictionary, a file must be sorted with the command |
||
56 | bin/dicsort (for example domaindic). |
||
57 | |||
7690 | bpr | 58 | * Secondly the binary program "modind" (compiled from ~/src/Misc/modind.c) reads |
6797 | czzmrn | 59 | |
7690 | bpr | 60 | -- the INDEX files of all the modules on the site |
6797 | czzmrn | 61 | -- the auxiliary files in ~/public_html/bases/sys/ (see description |
62 | below) |
||
6405 | czzmrn | 63 | |
6797 | czzmrn | 64 | and produces keyword lists stored in ~wims/public_html/bases/site : |
65 | they contain the words (or word groups) coming from the variable |
||
66 | keywords of the INDEX but also words of the title and description |
||
67 | (small words are deleted). |
||
6795 | bpr | 68 | |
6797 | czzmrn | 69 | "modind" creates as well a serial list of all the modules available |
70 | on the site, see ~/public_html/bases/site/serial, and calculates the |
||
71 | ranking of the site's modules. The modules are classified according |
||
72 | to their types: A=all (except sheet and classes), D=document, O=OEF, |
||
73 | X=exercise, T= tool, R=recreation, M= data module. |
||
6405 | czzmrn | 74 | |
6808 | czzmrn | 75 | To do that, "modind" uses some dictionaries in |
6879 | bpr | 76 | ~/public_html/bases/sys/ (such as suffix.xx, wgrp, domaindic.xx ...) |
6797 | czzmrn | 77 | |
78 | -- separately, "modind" also reads the files in |
||
6879 | bpr | 79 | ~/public_html/bases/sys/sheet and does the same kind of work. |
6797 | czzmrn | 80 | |
6802 | reyssat | 81 | |
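The keyword selection rule applied by mkwgrp.pl in stage 1 (take keywords_$lang if it exists, otherwise keywords) can be sketched as follows. This is only a Python illustration, not the actual Perl code: the function name select_keywords and the dictionary standing in for an INDEX file are assumptions made for the example, and an empty keywords_$lang is treated here as missing.

```python
def select_keywords(index, lang):
    """Return the keyword list for `lang`.

    `index` is a hypothetical dict holding the key=value lines of an INDEX
    file. keywords_<lang> is used when present and non-empty (assumption),
    otherwise the plain keywords variable.
    """
    raw = index.get("keywords_" + lang) or index.get("keywords", "")
    # Keywords are comma-separated; strip blanks and drop empty entries.
    return [k.strip() for k in raw.split(",") if k.strip()]


index = {
    "keywords": "vector, linear combination",
    "keywords_it": "vettore, combinazione lineare,bersaglio",
}
print(select_keywords(index, "it"))  # uses keywords_it
print(select_keywords(index, "fr"))  # falls back to keywords
```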
7690 | bpr | 82 | 2) use of index files |
6802 | reyssat | 83 | =========================== |
84 | The script ~/public_html/modules/home/search.proc (called by the |
||
6797 | czzmrn | 85 | "Search" form) reads the lists above, does the actual search in these |
86 | lists and displays the modules found. It also reads the files of |
||
87 | ~/public_html/bases/sys/class and ~/public_html/bases/sys/sheets |
||
88 | |||
6802 | reyssat | 89 | |
90 | |||
91 | More technical details about both stages |
||
92 | ======================================== |
||
93 | |||
6808 | czzmrn | 94 | In both stages files in this directory ~/public_html/bases/sys/ (see |
95 | comments below) are used to process the keywords present in the |
||
96 | modules' INDEX files. Each "search language" has its own series of |
||
97 | files. |
||
6797 | czzmrn | 98 | |
6808 | czzmrn | 99 | The contents of the files in ~/public_html/bases/sys/ and of the |
100 | modules' INDEX files should be checked by developers and translators, |
||
101 | to improve the behaviour of the search engine. |
||
6797 | czzmrn | 102 | |
6808 | czzmrn | 103 | The files in this directory ~/public_html/bases/sys/ are automatically |
104 | generated (on install) by the corresponding ".src" file in the "src" |
||
105 | subdirectory, if it exists. |
||
6797 | czzmrn | 106 | |
107 | If any of the files described below is omitted, then the corresponding |
||
6879 | bpr | 108 | feature in the corresponding language is disabled. |
6793 | bpr | 109 | |
6876 | bpr | 110 | In versions < 4.05c, if there was no file words.$lang, the file |
111 | suffix.$lang was not used (corrected in Misc/translator.c; to be checked |
||
7690 | bpr | 112 | in other situations). |
113 | The group words were badly treated when the words were already in |
||
6879 | bpr | 114 | the title, properties, etc. because of |
6876 | bpr | 115 | some option unknown_type=unk_delete in modind.c but it has other consequences |
116 | so it is not the situation. |
||
6798 | czzmrn | 117 | |
6797 | czzmrn | 118 | , will be done by the script in the stable release if we are OK) |
119 | |||
6792 | czzmrn | 120 | Syntax: the lines for most of these files are in the form |
6552 | bpr | 121 | |
6792 | czzmrn | 122 | == |
123 | givenword:substitute |
||
124 | == |
||
125 | |||
126 | ============================================================= |
||
127 | |||
128 | Files |
||
129 | ===== |
||
130 | |||
6879 | bpr | 131 | words.xx : correct misprints in the search words |
7690 | bpr | 132 | (used both by "mkindex" and "search.proc"). |
6792 | czzmrn | 133 | |
134 | E.g. if the file words.en contains the line |
||
135 | |||
136 | == |
||
137 | analytical:analytic |
||
138 | == |
||
139 | |||
140 | then the word "analytical" is considered a misprint and any occurrence |
||
141 | of the string "analytical" is replaced in the search by the string |
||
142 | "analytic" (for the language "en") |
||
143 | |||
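A minimal sketch of how such "givenword:substitute" rules could be applied to a search string. It is written in Python for illustration only; the helper names load_rules and apply_words are hypothetical, and matching whole whitespace-separated words is an assumption (the real engine may match differently).

```python
def load_rules(text):
    """Parse lines of the form givenword:substitute into a dict."""
    rules = {}
    for line in text.splitlines():
        given, sep, substitute = line.strip().partition(":")
        if sep and given:  # skip blank or malformed lines
            rules[given] = substitute
    return rules


def apply_words(rules, query):
    """Replace each word that has a rule; leave the other words unchanged."""
    return " ".join(rules.get(word, word) for word in query.split())


rules = load_rules("analytical:analytic\n")
print(apply_words(rules, "analytical geometry"))  # analytic geometry
```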
6797 | czzmrn | 144 | Note: words.fr was deleted because it caused the search engine not to |
145 | work properly. The site manager can reactivate the functionality by |
||
146 | adding the file again (?? how to get the "original" files from the |
||
147 | svn?). |
||
148 | |||
6792 | czzmrn | 149 | Note: the file words.en is used by the module tool/wcalc.en (see |
150 | ~/public_html/modules/tool/wcalc.en/dic ) |
||
151 | |||
152 | ===================== |
||
153 | |||
6879 | bpr | 154 | suffix.xx : process common suffixes in the search words |
7690 | bpr | 155 | (used both by "mkindex" and "search.proc"). |
6792 | czzmrn | 156 | |
157 | E.g. if the file suffix.en contains the line |
||
158 | |||
159 | == |
||
160 | ertem:meter |
||
161 | == |
||
162 | |||
163 | then any word ending in "metre" ("ertem" read backwards) is |
||
164 | replaced by the corresponding one ending in "meter" (kilometre --> |
||
165 | kilometer) |
||
166 | |||
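The reversed-suffix convention can be illustrated with a short Python sketch. The function name apply_suffix is hypothetical, and applying only the first matching rule is an assumption; the "sr:r" rule is the suffix.en directive quoted in the indexing example further down.

```python
def apply_suffix(rules, word):
    """Apply the first matching rule.

    `rules` maps a suffix written backwards (left-hand side of the line)
    to its replacement (right-hand side).
    """
    for reversed_suffix, replacement in rules.items():
        suffix = reversed_suffix[::-1]  # "ertem" -> "metre"
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


rules = {"ertem": "meter", "sr": "r"}
print(apply_suffix(rules, "kilometre"))  # kilometer
print(apply_suffix(rules, "vectors"))    # vector
```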
6797 | czzmrn | 167 | Note: suffix.fr was deleted because it caused the search engine/the |
168 | keyword completion not to work properly. The site manager can |
||
169 | reactivate the functionality by adding the file again. |
||
170 | |||
6792 | czzmrn | 171 | ===================== |
172 | |||
6879 | bpr | 173 | wgrp/wgrp.xx : groups of words |
6797 | czzmrn | 174 | (these files are automatically generated, and used by "mkindex") |
6792 | czzmrn | 175 | |
176 | E.g. if the file wgrp/wgrp.en contains the line |
||
177 | |||
178 | == |
||
179 | affine geometry:affine geometry, |
||
180 | == |
||
181 | |||
182 | then the search matches for the group of words "affine geometry" as a |
||
183 | whole: if the user searches for "affine geometry" the search |
||
184 | engine returns only the modules containing as keyword the exact string |
||
185 | "affine geometry" (if such line were not present the search engine |
||
186 | would return both the modules containing the word "affine" and the |
||
187 | modules containing the word "geometry"). |
||
188 | |||
189 | The "wgrp" files are now generated from the modules' keywords by the |
||
190 | script ~/public_html/bases/sys/mkwgrp.pl : whenever a module contains |
||
7690 | bpr | 191 | multi-word keywords, such keywords are added to the wgrp files. |
6792 | czzmrn | 192 | |
7690 | bpr | 193 | E.g. tool/algebra/smallgroup.fr/INDEX contains the keyword |
6792 | czzmrn | 194 | |
195 | keywords=group, finite group, order, subgroup, conjugacy class, center, normal subgroup, subgroup lattice |
||
196 | |||
197 | so for each multi-word entry between two commas the |
||
198 | corresponding group of words is created |
||
199 | |||
200 | finite group |
||
201 | conjugacy class |
||
202 | normal subgroup |
||
203 | subgroup lattice |
||
204 | |||
205 | (in the corresponding language file) |
||
206 | |||
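The extraction of word groups from a keywords line can be sketched in Python as below. This is not the code of mkwgrp.pl; the function name word_groups is hypothetical, and the sketch only picks the comma-separated entries made of more than one word.

```python
def word_groups(keywords_line):
    """Return the multi-word entries of a 'keywords=...' line, in order."""
    _, _, value = keywords_line.partition("=")
    entries = [e.strip() for e in value.split(",")]
    # A word group is an entry containing more than one word.
    return [e for e in entries if len(e.split()) > 1]


line = ("keywords=group, finite group, order, subgroup, conjugacy class, "
        "center, normal subgroup, subgroup lattice")
for group in word_groups(line):
    print(group)
```

Single-word entries such as "group" or "order" are left out, which matches the four groups listed above.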
207 | NOTE: there are problems when a string contains an apostrophe "'" |
||
208 | (e.g. "algorithme d'euclide") |
||
209 | |||
210 | ===================== |
||
211 | |||
6879 | bpr | 212 | domaindic.xx |
213 | |||
7690 | bpr | 214 | uses the files domain/domain.xx to replace a domain name written in the |
6879 | bpr | 215 | given language with its English (technical) name. |
216 | |||
217 | ===================== |
||
218 | |||
219 | indignore.xx : ignored words |
||
6792 | czzmrn | 220 | (used by "mkindex") |
221 | |||
7690 | bpr | 222 | All the words listed in the file are ignored by the search engine. |
6792 | czzmrn | 223 | |
224 | ===================== |
||
225 | |||
6879 | bpr | 226 | abuse.xx : swearwords to be ignored by the search engine |
6792 | czzmrn | 227 | (used by ??) |
228 | |||
229 | ===================== |
||
230 | |||
7690 | bpr | 231 | andor.xx : conjunctions ("and", "or") to be ignored by the |
6792 | czzmrn | 232 | search engine |
233 | |||
6797 | czzmrn | 234 | The file andor.xx is mentioned in src/insmath.c (processing logic |
235 | statements in math formulas) but this is for the moment used by no |
||
236 | modules (to be used, one must have insmath_logic=yes, which does not |
||
237 | exist in any public module as far as I know). |
||
6794 | bpr | 238 | |
6797 | czzmrn | 239 | |
6792 | czzmrn | 240 | ===================== |
241 | |||
242 | keywords.fr : ?? |
||
6794 | bpr | 243 | (used by ??) should be deleted |
6792 | czzmrn | 244 | |
245 | ======================================================= |
||
246 | |||
247 | |||
248 | Some indexing examples |
||
249 | ====================== |
||
250 | |||
6797 | czzmrn | 251 | U1/algebra/vecshoot.en |
6793 | bpr | 252 | |
6797 | czzmrn | 253 | As this is an exercise module it is indexed in the lists A.$lang (All) |
254 | and X.$lang (eXercise). |
||
6793 | bpr | 255 | |
6797 | czzmrn | 256 | This is a multilanguage module (main language "en", translation |
7690 | bpr | 257 | language "it"). |
6797 | czzmrn | 258 | |
259 | The index file contains the following (nonempty) lines |
||
260 | |||
261 | title=Vector shoot |
||
262 | description=click on a linear combination of 2D vectors. |
||
263 | language=en |
||
264 | category=exercise |
||
265 | domain=algebra, linear algebra |
||
266 | level=H4,H5,H6,U1,U2 |
||
267 | keywords=vector, linear combination |
||
268 | scoring=yes |
||
269 | copyright=© 1998- (<a href="COPYING">GNU GPL</a>) 2013 |
||
270 | author=XIAO,Gang |
||
271 | address=xiao@unice.fr |
||
272 | version=2.20 |
||
273 | wims_version=4.05a |
||
274 | translation_language=it |
||
275 | title_it=Colpisci i vettori |
||
276 | description_it=individuare una combinazione lineare di vettori 2D. |
||
277 | keywords_it=vettore, combinazione lineare,bersaglio |
||
278 | translator_it=Anna, Lucci |
||
279 | translator_address_it=anna.lucci@gmail.it |
||
280 | |||
281 | In stage 1 the module is given a serial number (depending on the |
||
282 | modules actually available on each site; on my site the serial number |
||
283 | is "1003"). As the distribution also includes the modules |
||
284 | U1/algebra/vecshoot.cn (1002) and U1/algebra/vecshoot.fr (1004) that |
||
285 | correspond to translation of this module into "cn" and "fr" |
||
286 | respectively, the A.cn/X.cn and A.fr/X.fr contain no reference to this |
||
287 | module (1003) but contain only reference to the corresponding |
||
288 | translated module (1002 resp. 1004). --> HELP there is no A.cn file!! |
||
289 | |||
290 | The file A.en contains the following lines related to this module. |
||
291 | |||
6879 | bpr | 292 | The suffix ?2 or ?4 is the ranking. |
7690 | bpr | 293 | It is a weight (see the name of the variable in modind.c) |
294 | giving more importance to the title words: 4 if the word appears |
||
6879 | bpr | 295 | in the module title, 2 otherwise. |
6797 | czzmrn | 296 | |
297 | 2d:1003?2 from description and description_it |
||
298 | algebra:1003?2 from domain |
||
299 | bersaglio:1003?2 from keywords_it |
||
300 | click:1003?2 from description |
||
301 | combination:1003?2 from description (_not_ from keywords) |
||
302 | combinazione:1003?2 from description_it |
||
303 | combinazione lineare:1003?2 from keywords + wgrp.en |
||
304 | gang:1003?2 from author |
||
305 | levelh4:1003?2 from level=h4 (and so on) |
||
7690 | bpr | 306 | levelh5:1003?2 |
6797 | czzmrn | 307 | levelh6:1003?2 |
308 | levelu1:1003?2 |
||
309 | levelu2:1003?2 |
||
310 | linear:1003?2 from description |
||
311 | linear algebra:1003?2 from keywords |
||
312 | linear combination:1003?2 from keywords |
||
313 | lineare:1003?2 from description_it |
||
314 | shoot:1003?4 from title |
||
7690 | bpr | 315 | vector:1003?4 from title + description |
316 | (vectors --> vector because of |
||
6797 | czzmrn | 317 | directive "sr:r" in suffix.en) |
318 | vettore:1003?2 from keywords_it |
||
319 | xiao:1003?2 from author |
||
320 | |||
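To make the line format concrete, here is a small Python sketch that splits an entry of the form keyword:serial?weight. The function name parse_entry is hypothetical, and the sketch handles only the single-module form shown in this example (real lines also list other modules, omitted here).

```python
def parse_entry(line):
    """Split 'keyword:serial?weight' into (keyword, serial, weight)."""
    # Split on the last colon so the keyword may contain spaces.
    keyword, _, rest = line.rpartition(":")
    serial, _, weight = rest.partition("?")
    return keyword, int(serial), int(weight)


entries = [parse_entry(l) for l in ("linear algebra:1003?2", "shoot:1003?4")]
# Title words (weight 4) come before the others (weight 2).
entries.sort(key=lambda e: -e[2])
print(entries)
```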
321 | The file A.it contains the following lines related to this module. |
||
322 | |||
323 | (NOTE: the only difference in keywords is that A.it contains the keyword |
||
324 | "vectors"; the other difference is in the list of modules, |
||
325 | a list that I omitted to keep this example clear) |
||
326 | |||
327 | 2d:1003?2 |
||
328 | algebra:1003?2 |
||
329 | bersaglio:1003?2 |
||
330 | click:1003?2 |
||
331 | combination:1003?2 |
||
332 | combinazione:1003?2 |
||
333 | combinazione lineare:1003?2 |
||
334 | gang:1003?2 |
||
335 | levelh4:1003?2 |
||
336 | levelh5:1003?2 |
||
337 | levelh6:1003?2 |
||
338 | levelu1:1003?2 |
||
339 | levelu2:1003?2 |
||
340 | linear:1003?2 |
||
341 | linear algebra:1003?2 |
||
342 | linear combination:1003?2 |
||
343 | lineare:1003?2 |
||
344 | shoot:1003?4 |
||
345 | vector:1003?4 |
||
7690 | bpr | 346 | vectors:1003?2 no corresponding line in A.en because |
6797 | czzmrn | 347 | of the directive in suffix.en |
348 | vettore:1003?2 |
||
349 | xiao:1003?2 |
||
350 | |||
351 | NOTE: title_it is missing from the index: you cannot find the module |
||
352 | by searching for its Italian title |
||
353 | |||
354 | The file A.$lang for languages different from the above contains lines |
||
355 | related to this module. |
||
356 | |||
357 | E.g. A.nl |
||
358 | |||
7690 | bpr | 359 | 2d: |
6797 | czzmrn | 360 | algebraisch: directive "algebra:algebraisch" in words.nl |
7690 | bpr | 361 | bersaglio: |
6797 | czzmrn | 362 | clicking: directive "click:clicking" in words.nl |
363 | combinaison: "combination:combinaison" in words.nl |
||
364 | combinazione: |
||
365 | combinazione lineare: |
||
366 | gang: |
||
367 | levelh4: |
||
368 | levelh5: |
||
369 | levelh6: |
||
370 | levelu1: |
||
371 | levelu2: |
||
372 | lineare: |
||
373 | linearly: "linear:linearly" in words.nl |
||
374 | niet: "on:niet" in words.nl |
||
375 | ofwel: "of:ofwel" |
||
376 | shooting: "shoot:shooting" |
||
377 | vector: |
||
378 | vettore: |
||
379 | xiao: |
||
380 | |||
381 | the wgrp groups "linear algebra" and "linear combination" are missing |
||
382 | because of the directive "linear:linearly" in words.nl which is |
||
383 | executed before wgrp (?? check). |
||
384 | |||
385 | note: ?? words.nl contains both the line algebra:algebraisch and |
||
386 | algebraisch:algebra ?? (and more similar pairs) |
||
387 | |||
388 | E.g. A.de |
||
389 | |||
390 | almost the same as A.en except for the lines "vectors" (suffix.en) and |
||
391 | "vector shoot" (WHY??). There is no "wgrp.de" file. |
||
392 | |||
393 | 2d: |
||
394 | algebra: |
||
395 | bersaglio: |
||
396 | click: |
||
397 | combination: |
||
398 | combinazione: |
||
399 | combinazione lineare: |
||
400 | gang: |
||
401 | levelh4: |
||
402 | levelh5: |
||
403 | levelh6: |
||
404 | levelu1: |
||
405 | levelu2: |
||
406 | linear: |
||
407 | linear algebra: |
||
408 | linear combination: |
||
409 | lineare: |
||
410 | shoot: |
||
411 | vector: |
||
412 | vector shoot: WHY??? |
||
413 | vectors: cfr. A.it |
||
414 | vettore: |
||
415 | xiao: |
||
416 | |||
417 | |||
418 | |||
6793 | bpr | 419 | ==================================== |
420 | |||
421 | In popup.fr, I also changed the way the keywords are used, for an analogous |
||
422 | reason (I have not done it in popup.$lang for $lang != fr). |
||
423 | |||
424 | The file suffix.fr was also used by wcalc.fr; for compatibility |
||
425 | with popup on the external web pages, I keep it (so I copied it |
||
426 | into the wcalc.fr module). |
||
6795 | bpr | 427 | |
6797 | czzmrn | 428 | Be careful (MC: I know, I hope it is better now with the example): "keywords" has two meanings here: |
6795 | bpr | 429 | - the perl script takes only the words in the variable keywords |
430 | (so only these are in the completion list) |
||
431 | - modind.c creates the files A.$lang etc., which are based on the words of keywords, |
||
432 | title, description. Not all of them are in the "completion list", |
||
433 | but they can be typed and found by the search engine. |
||
434 | |||
6804 | reyssat | 435 | |
7690 | bpr | 436 | |
6804 | reyssat | 437 | Technical things about modind.c (ER. just to avoid forgetting work in progress) |
438 | =============================== |
||
439 | |||
7690 | bpr | 440 | The tasks done are in order : |
6804 | reyssat | 441 | |
442 | - prep() : * replaces if possible the default language list (defined at top of file) |
||
443 | by the list of languages installed on the server. |
||
444 | * gets the list of all modules prepared by a previous script |
||
445 | * opens files bases/site2/author|description|language|... |
||
446 | |||
447 | - modules() : for each language{for each module{extract information}}. |
||
448 | |||
449 | - clean() : closes files bases/site2/author|description|language|... |
||
450 | |||
451 | - sprep(),sheets() : idem for sheets. |
||
452 | |||
453 | |||
454 | |||
7690 | bpr | 455 | Extracting information from one module for a given language (function onemodule) : |
6804 | reyssat | 456 | |
457 | - writes author, description, language, etc. information in each corresponding file |
||
458 | bases/site2/author|description|language|... |
||
459 | |||
7690 | bpr | 460 | - normalizes data (removes uppercase, accents, apostrophes, plurals) |
461 | according to dictionary domaindic, then maindic with suffix, to get normalized |
||
6879 | bpr | 462 | author, description, title, etc. |
6804 | reyssat | 463 | This is done in the loop for(i=0;i<trcnt;i++){...} |
464 | |||
7690 | bpr | 465 | - transforms the (normalized) title into words (change commas to spaces) |
6804 | reyssat | 466 | and for each word, appends it with weight 4 using function appenditem. |
467 | the variables are the word itself, the current language treated, the serial number of module, |
||
7690 | bpr | 468 | the weight=4, and the module language. |
6804 | reyssat | 469 | |
7690 | bpr | 470 | - puts all information other than the title (description, keywords, foreign titles, author...) |
6804 | reyssat | 471 | in a buffer, transforms it into words and appends them as above, except that weight=2. |
472 | |||
473 | - the 2 preceding points (treatment of title and other info) are repeated with the difference |
||
7690 | bpr | 474 | that the transformation into words is replaced by a translation : |
6804 | reyssat | 475 | the commas are kept, but some usual words are deleted. |
7690 | bpr | 476 | BUG ? : Another difference is that part of "other information than title" is missing, |
6804 | reyssat | 477 | for instance the foreign titles, require, author. |
478 | |||
7690 | bpr | 479 | ER : I don't know why the process is repeated : should look at appenditem |
6879 | bpr | 480 | to see where it is appended, maybe the second time is somewhere else. |
6804 | reyssat | 481 | |
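The weighting described above (weight 4 for title words, weight 2 for everything else) can be sketched in Python as follows. This is a simplified illustration and not a transcription of modind.c: the function name weigh_words is hypothetical, and normalization through the dictionaries, suffix handling and deletion of small words are all left out.

```python
def weigh_words(title, other):
    """Map each word to 4 if it occurs in the title, otherwise to 2."""
    weights = {}
    # Words from description, keywords, author, ... get weight 2.
    for word in other.replace(",", " ").lower().split():
        weights[word] = 2
    # Title words are handled last so that weight 4 wins over weight 2.
    for word in title.replace(",", " ").lower().split():
        weights[word] = 4
    return weights


w = weigh_words("Vector shoot", "vector, linear combination, XIAO,Gang")
print(w)
```

With this input, "vector" appears in both the title and the keywords and ends up with weight 4, as in the A.en listing of the vecshoot example.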
482 | |||
483 | =============================== |