WIMS' search engine and als
===========================

WIMS' search engine works in two stages:

1) update of the index files when server data changes (module added...),
   typically once a day.
2) use of the index files at each user's request to find some activities.


Here are some details:

1) update of index files
===========================
A series of scripts creates a set of auxiliary files (generally
stored in ~/public_html/bases/sys/, see description further down) and
a list of "keywords" (stored in ~/public_html/bases/site/).

The scripts must be run in the order given here, as some files
created in earlier stages are used in subsequent stages. In general
the whole process is run by the script ~/bin/mkindex.

* Firstly, a series of 3 perl scripts (mkdomain, mkwgrp, modindclass),
that ~/bin/mkindex.sh calls via ~/public_html/bases/sys/mkindex.sh:

- the program ~/public_html/bases/sys/mkdomain.pl creates the lists
  of domains from the graph in domain/domain with its translations
  (domain/domain.$lang) and in json format (English) to be used for
  completion in modtool properties

- the perl program ~/public_html/bases/sys/mkwgrp.pl reads the INDEX
  files of all the modules on the site and generates

  - keywords (in .json format) to be used for completion in the search
    engine
  - the files in wgrp

  (using the keywords and keywords_$lang in the INDEX files, according
  to this rule: take keywords_$lang if it exists, otherwise keywords,
  whether or not the module is a $lang-module).
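
The selection rule above can be sketched in a few lines of Python. This is
only an illustration, not the code of mkwgrp.pl: the function name
select_keywords and the representation of an INDEX file as a dict are
assumptions, and an empty keywords_$lang is treated here as "does not exist".

```python
def select_keywords(index, lang):
    """Return the keywords used for language `lang`.

    `index` is a dict of INDEX variables (hypothetical representation).
    Rule from the text: keywords_$lang if it exists, otherwise keywords.
    """
    raw = index.get(f"keywords_{lang}") or index.get("keywords", "")
    # Keywords are separated by commas; surrounding blanks are dropped.
    return [k.strip() for k in raw.split(",") if k.strip()]


index = {
    "language": "en",
    "keywords": "vector, linear combination",
    "keywords_it": "vettore, combinazione lineare,bersaglio",
}
print(select_keywords(index, "it"))
print(select_keywords(index, "fr"))
```

With this sample, "it" uses keywords_it and "fr" falls back to keywords.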

  Some files are created in keywords as keywords/algebra.fr.tmp, but
  they are not used for the moment. The keywords in these "keywords
  files" are exactly those selected by the same rule: the variable
  keywords_$lang if it exists, otherwise the variable keywords
  (whether or not the module is a $lang-module).

- the program ~/public_html/bases/sys/modindclass.pl creates the lists
  of keywords coming from the example classes in
  ~/public_html/bases/class as well as the files author,
  description, language, level, title (no ranking is done).

* Secondly, the binary program "modind" (compiled from ~/src/Misc/modind.c) reads

  -- the INDEX files of all the modules on the site
  -- the auxiliary files in ~/public_html/bases/sys/ (see description
     below)

  and produces keyword lists stored in ~wims/public_html/bases/site:
  they contain the words (or groups of words) coming from the variable
  keywords of the INDEX file, but also the words of the title and
  description (small words are deleted).

  "modind" also creates a serial list of all the modules available
  on the site, see ~/public_html/bases/site/serial, and calculates the
  ranking of the site's modules. The modules are classified according
  to their types: A=all (except sheets and classes), D=document, O=OEF,
  X=exercise, T=tool, R=recreation, M=data module.

  To do that, "modind" uses some dictionaries in
  ~/public_html/bases/sys/ (such as suffix.$search_lang, wgrp, ...)

  -- separately, "modind" also reads the files in
  ~/public_html/bases/sys/sheet and does the same type of work


2) use of index files
===========================
The script ~/public_html/modules/home/search.proc (called by the
"Search" form) reads the lists above, does the actual search in those
lists and displays the modules found. It also reads the files of
~/public_html/bases/sys/class and ~/public_html/bases/sys/sheets



More technical details about both stages
========================================

In both stages, the files in the directory ~/public_html/bases/sys/ (see
comments below) are used to process the keywords present in the
modules' INDEX files.  Each "search language" has its own series of
files.

The contents of the files in ~/public_html/bases/sys/ and of the
modules' INDEX files should be checked by developers and translators,
to improve the behaviour of the search engine.

The files in the directory ~/public_html/bases/sys/ are automatically
generated (on install) from the corresponding ".src" file in the "src"
subdirectory, if it exists.

If any of the files described below is omitted, then the corresponding
feature in the corresponding language is disabled. E.g. the files
words.fr/words.fr.src and suffix.fr/suffix.fr.src will be/have been
deleted in order to make the search engine work correctly
(this will be done by the script in the stable release if we are OK).

  In versions < 4.05c, if there was no file words.$lang, the file
  suffix.$lang was not used (corrected in Misc/translator.c; to be
  checked in other situations).
  Groups of words were badly treated when the
  words were already in the title, properties, etc. because of
  some option unknown_type=unk_delete in modind.c, but it has other
  consequences, so it is not the situation.
  I think that I will put suffix.fr back (but one must now really
  check it: do we want "capital" and "capitale" to be the same, which is
  the case for the moment?).

Syntax: the lines of most of these files have the form

==
givenword:substitute
==

=============================================================

Files
=====

words.$search_lang : corrects misprints in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file words.en contains the line

==
analytical:analytic
==

then the word "analytical" is considered a misprint and any occurrence
of the string "analytical" is replaced in the search by the string
"analytic" (for the language "en").

Note: words.fr was deleted because it caused the search engine not to
work properly. The site manager can reactivate the functionality by
adding the file again (?? how to get the "original" files from the
svn?).

Note: the file words.en is used by the module tool/wcalc.en (see
~/public_html/modules/tool/wcalc.en/dic )
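
As an illustration of the "givenword:substitute" syntax used by
words.$search_lang, here is a minimal Python sketch. It is not the WIMS
implementation: the function names load_rules and correct_words are
hypothetical, and the sketch assumes that substitution applies to whole
words of a lowercased query.

```python
def load_rules(lines):
    """Parse lines of the form 'givenword:substitute' into a dict."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip empty or malformed lines
        given, substitute = line.split(":", 1)
        rules[given.strip()] = substitute.strip()
    return rules


def correct_words(query, rules):
    """Replace each word of the query that has an entry in the rules."""
    return " ".join(rules.get(word, word) for word in query.lower().split())


rules = load_rules(["analytical:analytic"])
print(correct_words("Analytical geometry", rules))  # analytic geometry
```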

=====================

suffix.$search_lang : processes common suffixes in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file suffix.en contains the line

==
ertem:meter
==

then any word ending in "metre" ("ertem" read backwards) is
replaced by the corresponding word ending in "meter" (kilometre -->
kilometer).
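
The reversed-suffix convention can be sketched as follows in Python. This
is an illustration only: the names load_suffix_rules and apply_suffix are
hypothetical, and the sketch assumes that the left part is the suffix
written backwards, that the right part is the replacement in normal
order, and that the first matching rule wins.

```python
def load_suffix_rules(lines):
    """Parse 'reversedsuffix:replacement' lines into (suffix, replacement)."""
    rules = []
    for line in lines:
        line = line.strip()
        if ":" not in line:
            continue
        reversed_suffix, replacement = line.split(":", 1)
        if reversed_suffix:
            # The suffix is stored backwards in the file ("ertem" is "metre").
            rules.append((reversed_suffix[::-1], replacement))
    return rules


def apply_suffix(word, rules):
    """Rewrite the ending of `word` using the first matching rule."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


rules = load_suffix_rules(["ertem:meter", "sr:r"])
print(apply_suffix("kilometre", rules))  # kilometer
print(apply_suffix("vectors", rules))    # vector
```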

Note: suffix.fr was deleted because it caused the search engine/the
keyword completion not to work properly. The site manager can
reactivate the functionality by adding the file again.

=====================

wgrp/wgrp.$search_lang : groups of words
(these files are automatically generated, and used by "mkindex")

E.g. if the file wgrp/wgrp.en contains the line

==
affine geometry:affine geometry,
==

then the search matches the group of words "affine geometry" as a
whole: if the user searches for "affine geometry", the search
engine returns only the modules containing as keyword the exact string
"affine geometry" (if this line were not present, the search engine
would return both the modules containing the word "affine" and the
modules containing the word "geometry").

The "wgrp" files are now generated from the modules' keywords by the
script ~/public_html/bases/sys/mkwgrp.pl: whenever a module contains
multiple-word keywords, these keywords are added to the wgrp files.

E.g. tool/algebra/smallgroup.fr/INDEX contains the line

keywords=group, finite group, order, subgroup, conjugacy class, center, normal subgroup, subgroup lattice

so for each multiple-word keyword between two commas the
corresponding group of words is created

finite group
conjugacy class
normal subgroup
subgroup lattice

(in the corresponding language file)
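
The extraction of word groups from a keywords line can be sketched in
Python as follows. This is not the code of mkwgrp.pl (which is a perl
script): the names word_groups and wgrp_lines are hypothetical, and the
output format simply mimics the "affine geometry:affine geometry," line
shown above.

```python
def word_groups(keywords_line):
    """Return the multiple-word keywords of a 'keywords=...' INDEX line."""
    _, _, value = keywords_line.partition("=")
    groups = []
    for keyword in value.split(","):
        words = keyword.split()
        if len(words) > 1:  # single words are not groups
            groups.append(" ".join(words))
    return groups


def wgrp_lines(groups):
    """Format groups like the wgrp example line 'group:group,'."""
    return [f"{group}:{group}," for group in groups]


line = ("keywords=group, finite group, order, subgroup, conjugacy class, "
        "center, normal subgroup, subgroup lattice")
for entry in wgrp_lines(word_groups(line)):
    print(entry)
```

The sketch does nothing special for keywords containing an apostrophe.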

NOTE: there are problems when a string contains the apostrophe "'"
(e.g. "algorithme d'euclide")

=====================

indignore.$search_lang : ignored words
(used by "mkindex")

All the words listed in the file are ignored by the search engine.

=====================

abuse.$search_lang : swearwords to be ignored by the search engine
(used by ??)

=====================

andor.$search_lang : conjunctions ("and", "or") to be ignored by the
search engine

The file andor.xx is mentioned in src/insmath.c (processing of logic
statements in math formulas) but for the moment it is used by no
module (to be used, a module must have insmath_logic=yes, which does not
exist in any public module as far as I know).


=====================

keywords.fr : ??
(used by ??) should be deleted

=======================================================


Some indexing examples
======================

U1/algebra/vecshoot.en

As this is an exercise module, it is indexed in the lists A.$lang (All)
and X.$lang (eXercise).

This is a multilanguage module (main language "en", translation
language "it").

The INDEX file contains the following (nonempty) lines

  title=Vector shoot
  description=click on a linear combination of 2D vectors.
  language=en
  category=exercise
  domain=algebra, linear algebra
  level=H4,H5,H6,U1,U2
  keywords=vector, linear combination
  scoring=yes
  copyright=&copy; 1998- (<a href="COPYING">GNU GPL</a>) 2013
  author=XIAO,Gang
  address=xiao@unice.fr
  version=2.20
  wims_version=4.05a
  translation_language=it
  title_it=Colpisci i vettori
  description_it=individuare una combinazione lineare di vettori 2D.
  keywords_it=vettore, combinazione lineare,bersaglio
  translator_it=Anna, Lucci
  translator_address_it=anna.lucci@gmail.it

In stage 1 the module is given a serial number (which depends on the
modules actually available on each site; on my site the serial number
is "1003"). As the distribution also includes the modules
U1/algebra/vecshoot.cn (1002) and U1/algebra/vecshoot.fr (1004), which
are the translations of this module into "cn" and "fr"
respectively, A.cn/X.cn and A.fr/X.fr contain no reference to this
module (1003) but only a reference to the corresponding
translated module (1002 resp. 1004). --> HELP there is no A.cn file!!

The file A.en contains the following lines related to this module.

?? (...?2 is the ranking, why do we sometimes have ....?4 )
(ER: it is a weight -- see the name of the variable in modind.c -- giving more importance to the title words: 4 if the word appears in the module title, 2 otherwise)

2d:1003?2                           from description and description_it
algebra:1003?2                      from domain
bersaglio:1003?2                    from keywords_it
click:1003?2                        from description
combination:1003?2                  from description (_not_ from keywords)
combinazione:1003?2                 from description_it
combinazione lineare:1003?2         from keywords + wgrp.en
gang:1003?2                         from author
levelh4:1003?2                      from level=h4 (and so on)
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2                       from description
linear algebra:1003?2               from keywords
linear combination:1003?2           from keywords
lineare:1003?2                      from description_it
shoot:1003?4                        from title
vector:1003?4                       from title + description
                                    (vectors --> vector because of
                                    directive "sr:r" in suffix.en)
vettore:1003?2                      from keywords_it
xiao:1003?2                         from author
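
To make the "word:serial?weight" notation concrete, here is a small
Python sketch that parses such entries and looks a word up. It is an
illustration only: the names parse_entry, build_index and lookup are
hypothetical, and the sketch handles only the single-module form shown
in this example (the real A.$lang lines list many modules, which are
omitted here).

```python
from collections import defaultdict


def parse_entry(line):
    """Split 'word:serial?weight' into (word, serial, weight)."""
    word, _, reference = line.rpartition(":")
    serial, _, weight = reference.partition("?")
    return word, int(serial), int(weight)


def build_index(lines):
    """Map each word (or group of words) to its (serial, weight) pairs."""
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:
            word, serial, weight = parse_entry(line)
            index[word].append((serial, weight))
    return index


def lookup(index, word):
    """Return the modules matching `word`, highest weight first."""
    return sorted(index.get(word, []), key=lambda entry: -entry[1])


index = build_index(["shoot:1003?4", "vector:1003?4",
                     "linear combination:1003?2"])
print(lookup(index, "shoot"))
print(lookup(index, "linear combination"))
```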

The file A.it contains the following lines related to this module.

(NOTE: apart from the extra keyword "vectors" in A.it, there is no
difference in the keywords; the only other difference is in the lists of
modules, which I omitted to keep this example clear)

2d:1003?2
algebra:1003?2
bersaglio:1003?2
click:1003?2
combination:1003?2
combinazione:1003?2
combinazione lineare:1003?2
gang:1003?2
levelh4:1003?2
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2
linear algebra:1003?2
linear combination:1003?2
lineare:1003?2
shoot:1003?4
vector:1003?4
vectors:1003?2                      no corresponding line in A.en because
                                    of the directive in suffix.en
vettore:1003?2
xiao:1003?2

NOTE: title_it is missing from the index: you cannot find the module
by searching for its Italian title

The files A.$lang for languages other than the above also contain lines
related to this module.

E.g. A.nl

2d:
algebraisch:            directive "algebra:algebraisch" in words.nl
bersaglio:
clicking:               directive "click:clicking" in words.nl
combinaison:            "combination:combinaison" in words.nl
combinazione:
combinazione lineare:
gang:
levelh4:
levelh5:
levelh6:
levelu1:
levelu2:
lineare:
linearly:               "linear:linearly" in words.nl
niet:                   "on:niet" in words.nl
ofwel:                  "of:ofwel"
shooting:               "shoot:shooting"
vector:
vettore:
xiao:

The wgrp groups "linear algebra" and "linear combination" are missing
because of the directive "linear:linearly" in words.nl, which is
executed before wgrp (?? check).

Note: ?? words.nl contains both the line algebra:algebraisch and
the line algebraisch:algebra ?? (and more similar pairs)

E.g. A.de

Almost the same as A.en except for the lines "vectors" (suffix.en) and
"vector shoot" (WHY??). There is no "wgrp.de" file.

2d:
algebra:
bersaglio:
click:
combination:
combinazione:
combinazione lineare:
gang:
levelh4:
levelh5:
levelh6:
levelu1:
levelu2:
linear:
linear algebra:
linear combination:
lineare:
shoot:
vector:
vector shoot:           WHY???
vectors:                cf. A.it
vettore:
xiao:



====================================

In popup.fr, I also changed the way the keywords are used, for an
analogous reason (I have not done it in popup.$lang for $lang != fr).

The file suffix.fr was also used by wcalc.fr; for compatibility
with popup on the external web pages, I keep it (so copy it
into the wcalc.fr modules).

Be careful (MC: I know, I hope it is better now with the example): "keywords" has two meanings here:
  - the perl script takes only the words in the variable keywords
  (so only these are in the completion list)
  - modind.c creates the files A.$lang etc., which are based on the words of keywords,
  title and description. Not all of these words are in the "completion list",
  but they can be typed and found by the search engine.



Technical things about modind.c (ER: just to avoid forgetting work in progress)
===============================

The tasks are done in this order:

- prep() : * replaces, if possible, the default language list (defined at the top of the file)
             by the list of languages installed on the server.
           * gets the list of all modules prepared by a previous script
           * opens the files bases/site2/author|description|language|...

- modules() : for each language{for each module{extract information}}.

- clean() : closes the files bases/site2/author|description|language|...

- sprep(),sheets() : idem for sheets.



Extracting information from one module for a given language (function onemodule):

- writes the author, description, language, etc. information in each corresponding file
  bases/site2/author|description|language|...

- normalizes the data (removes uppercase, accents, apostrophes, plurals)
  according to the dictionaries, to get normalized author, description, title, etc.
  This is done in the loop for(i=0;i<trcnt;i++){...}

- transforms the (normalized) title into words (changing commas to spaces)
  and appends each word with weight 4 using the function appenditem.
  The arguments are the word itself, the language currently treated, the serial number of the module,
  the weight (4), and the module language.

- puts all information other than the title (description, keywords, foreign titles, author...)
  in a buffer, transforms it into words and appends them as above, except that the weight is 2.

  BUG ? : in this process, i_keywords_fr is used twice, probably the first one should be i_keywords_en, to be checked.

- the 2 preceding points (treatment of the title and of the other information) are repeated, with the difference
  that the transformation into words is replaced by a translation:
  the commas are kept, but some usual words are deleted.
  BUG ? : Another difference is that part of the "information other than the title" is missing,
          for instance the foreign titles, require, author.
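
The weighting described in the list above (4 for title words, 2 for the
rest) can be sketched in Python as follows. This is not a transcription
of modind.c: the names to_words and index_module are hypothetical, the
normalization by dictionaries and the deletion of small words are left
out, and keeping the title weight when a word also appears elsewhere is
an assumption based on the A.en example (vector:1003?4).

```python
import re

TITLE_WEIGHT = 4  # words of the module title
OTHER_WEIGHT = 2  # words of description, keywords, author...


def to_words(text):
    """Lowercase the text and split it into words (commas act as spaces)."""
    return re.findall(r"\w+", text.lower())


def index_module(serial, title, other_fields):
    """Return 'word:serial?weight' entries for one module."""
    weights = {}
    for word in to_words(title):
        weights[word] = TITLE_WEIGHT
    for field in other_fields:
        for word in to_words(field):
            # A word already seen in the title keeps its weight of 4.
            weights.setdefault(word, OTHER_WEIGHT)
    return [f"{word}:{serial}?{weight}"
            for word, weight in sorted(weights.items())]


for entry in index_module(1003, "Vector shoot",
                          ["vector, linear combination"]):
    print(entry)
```

Groups of words such as "linear combination" are not produced by this
sketch; it only splits the fields into single words.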

ER: I don't know why the process is repeated: one should look at appenditem to see where the item is appended; maybe the second time it is appended somewhere else.


===============================