diff --git a/CMakeLists.txt b/CMakeLists.txt
index cd506c2230..1736be32a8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -336,6 +336,7 @@ target_link_libraries(${ABACUS_BIN_NAME}
     driver
     xc_
     hsolver
+    genelpa
     elecstate
     hamilt
     psi
diff --git a/README.md b/README.md
index d26135a5be..17de80ae4c 100644
--- a/README.md
+++ b/README.md
@@ -56,6 +56,7 @@ ABACUS provides the following features and functionalities:
 20. (subsidiary tool)Generator for second generation numerical orbital basis.
 21. Interface with DPGEN
 22. Interface with phonopy
+23. Implicit solvation model
 
 [back to top](#readme-top)
 
@@ -161,6 +162,7 @@ The following provides basic sample jobs in ABACUS. More can be found in the dir
 - [BSSE for molecular formation energy](docs/examples/BSSE.md)
 - [ABACUS-DPGEN interface](docs/examples/dpgen.md)
 - [ABACUS-phonopy interface](docs/examples/phonopy.md)
+- [Implicit solvation model](docs/examples/implicit-sol.md)
 
 [back to top](#readme-top)
 
diff --git a/docs/examples/implicit-sol.md b/docs/examples/implicit-sol.md
new file mode 100644
index 0000000000..1f677e2796
--- /dev/null
+++ b/docs/examples/implicit-sol.md
@@ -0,0 +1,63 @@
+# Implicit solvation model
+
+[back to main page](../../README.md)
+
+Solid-liquid interfaces are ubiquitous in nature and are frequently encountered in materials simulation. Accurate first-principles calculations of such systems should take the solvation effect into account.  
+The implicit solvation model is a well-developed method for treating solvation effects and has been widely used for both finite and periodic systems. This approach treats the solvent as a continuous medium instead of individual “explicit” solvent molecules: the solute is embedded in an implicit solvent, and the average over the solvent degrees of freedom becomes implicit in the properties of the solvent bath.
+
+## Input
+```
+INPUT_PARAMETERS
+imp_sol                 1
+eb_k                    80
+tau                     0.000010798
+sigma_k                 0.6
+nc_k                    0.00037
+```
+- imp_sol  
+
+    If set to 1, an implicit solvation correction is applied. If set to 0 (default), a vacuum calculation is performed.
+- eb_k  
+    
+    The relative permittivity of the bulk solvent, 80 for water. Used only if `imp_sol` == true.
+- tau 
+
+    The effective surface tension parameter, which describes the cavitation, the dispersion, and the repulsion interaction between the solute and the solvent that are not captured by the electrostatic terms.
+    We use the values of `tau`, `sigma_k`, `nc_k` that were obtained by a fit of the model to experimental solvation energies for molecules in water. tau = 0.525 $meV/Å^{2}$ = 1.0798e-05 $Ry/Bohr^{2}$.
+- sigma_k 
+    
+    We assume a diffuse cavity that is implicitly determined by the electronic structure of the solute. 
+    `sigma_k` is the parameter that describes the width of the diffuse cavity. The specific value is sigma_k = 0.6.
+- nc_k
+    
+    `nc_k` determines at what value of the electron density the dielectric cavity forms. 
+    The specific value is nc_k = 0.0025 $Å^{-3}$ = 0.00037 $Bohr^{-3}$.
+
+## Output
+In this example, we calculate the implicit solvation correction for H2O.
+The energy results are written to the “running_nscf.log” file in the OUT folder.
+```
+       Energy                       Rydberg                            eV
+   E_KohnSham                -34.3200995971                -466.948910448
+     E_Harris                -34.2973698556                -466.639656449
+       E_band                -7.66026117767                -104.223200184
+   E_one_elec                -56.9853883251                -775.325983964
+    E_Hartree                +30.0541108968                +408.907156521
+         E_xc                -8.32727420734                -113.298378028
+      E_Ewald               +0.961180728747                +13.0775347188
+      E_demet                            +0                            +0
+      E_descf                            +0                            +0
+     E_efield                            +0                            +0
+        E_exx                            +0                            +0
+     E_sol_el              -0.0250553663339               -0.340895747619
+    E_sol_cav             +0.00232667606131               +0.031656051834
+      E_Fermi               -0.499934383866                 -6.8019562467
+
+```
+- E_sol_el: Electrostatic contribution to the solvation energy.
+- E_sol_cav: Cavitation and dispersion contributions to the solvation energy.
+
+Both `E_sol_el` and `E_sol_cav` corrections are included in `E_KohnSham`.
+
+
+[back to top](#implicit-solvation-model)
\ No newline at end of file
diff --git a/docs/examples/phonopy.md b/docs/examples/phonopy.md
index ac8efdba7d..bf00b37608 100644
--- a/docs/examples/phonopy.md
+++ b/docs/examples/phonopy.md
@@ -3,7 +3,7 @@
 [back to main page](../../README.md)
 
 
-[Phonopy](https://github.com/phonopy/phonopy) is a powerful package to calculate phonon and related properties. It has provided interface with ABACUS. In the following, we take the FCC aluminum as an example:
+[Phonopy](https://github.com/phonopy/phonopy) is a powerful package for calculating phonons and related properties, and it provides an interface to ABACUS. (Note: please use the `develop` branch rather than the `master` branch until the ABACUS interface has been merged into Phonopy's `master` branch.) In the following, we take FCC aluminum as an example:
 
 
 1. Prepare a 'setting.conf' with following tags:
@@ -39,4 +39,4 @@ PRIMITIVE_AXES = 0 1/2 1/2  1/2 0 1/2  1/2 1/2 0
 BAND= 1 1 1  1/2 1/2 1  3/8 3/8 3/4  0 0 0   1/2 1/2 1/2
 BAND_POINTS = 21
 BAND_CONNECTION = .TRUE.
-```
\ No newline at end of file
+```
diff --git a/docs/features.md b/docs/features.md
index c108e78ed7..b74fdb5df7 100644
--- a/docs/features.md
+++ b/docs/features.md
@@ -50,9 +50,14 @@ ATOMIC_SPECIES
 Si 28.00 Si_ONCV_PBE-1.0.upf
 ```
 
-The user can download the pseudopotential files from our [website](http://abacus.ustc.edu.cn/pseudo.html).
+You can download the pseudopotential files from our [website](http://abacus.ustc.edu.cn/pseudo/list.htm).
 
-For more information of different types of pseudopotentials, please visit the Quantum espresso [website](http://www.quantum-espresso.org/pseudopotentials/).
+ABACUS also supports the pseudopotential files available from the following websites:
+1. [Quantum ESPRESSO](http://www.quantum-espresso.org/pseudopotentials/).
+2. [SG15-ONCV](http://quantum-simulation.org/potentials/sg15_oncv/upf/).
+3. [DOJO](http://www.pseudo-dojo.org/).
+
+If the LCAO basis is used, the numerical orbital files must match the pseudopotential files. The [official orbitals package](http://abacus.ustc.edu.cn/pseudo/list.htm) matches only the SG15-ONCV pseudopotentials.
 
 [back to top](#features)
 
diff --git a/docs/input-main.md b/docs/input-main.md
index 7545f7f6ba..3f93eff3ad 100644
--- a/docs/input-main.md
+++ b/docs/input-main.md
@@ -82,6 +82,10 @@
 
     [cal_cond](#cal_cond) | [cond_nche](#cond_nche) | [cond_dw](#cond_dw) | [cond_wcut](#cond_wcut) | [cond_wenlarge](#cond_wenlarge) | [cond_fwhm ](#cond_fwhm )
 
+- [Implicit solvation model](#implicit-solvation-model)
+
+    [imp_sol](#imp_sol) | [eb_k](#eb_k) | [tau](#tau) | [sigma_k](#sigma_k) | [nc_k](#nc_k) 
+
 [back to main page](../README.md)
 
 ## Structure of the file
@@ -1662,3 +1666,39 @@ Thermal conductivities: $\kappa = \lim_{\omega\to 0}\kappa(\omega)$
 - **Type**: Integer
 - **Description**: We use gaussian functions to approxiamte $\delta(E)\approx \frac{1}{\sqrt{2\pi}\Delta E}e^{-\frac{E^2}{2{\Delta E}^2}}$. FWHM for conductivities, $FWHM=2*\sqrt{2\ln2}\cdot \Delta E$. The unit is eV.
 - **Default**: 0.3
+
+### Implicit solvation model
+
+These variables control the use of the implicit solvation model. This approach treats the solvent as a continuous medium instead of individual “explicit” solvent molecules: the solute is embedded in an implicit solvent, and the average over the solvent degrees of freedom becomes implicit in the properties of the solvent bath.
+
+#### imp_sol
+
+- **Type**: Boolean
+- **Description**: If set to 1, an implicit solvation correction is considered.
+- **Default**: 0
+
+#### eb_k
+
+- **Type**: Real
+- **Description**: The relative permittivity of the bulk solvent, 80 for water. Used only if `imp_sol` == true.
+- **Default**: 80
+
+#### tau
+
+- **Type**: Real
+- **Description**: The effective surface tension parameter, which describes the cavitation, the dispersion, and the repulsion interaction between the solute and the solvent that are not captured by the electrostatic terms. The unit is $Ry/Bohr^{2}$.
+- **Default**: 1.0798e-05
+
+#### sigma_k
+
+- **Type**: Real
+- **Description**: We assume a diffuse cavity that is implicitly determined by the electronic structure of the solute.
+`sigma_k` is the parameter that describes the width of the diffuse cavity.
+- **Default**: 0.6
+
+#### nc_k
+
+- **Type**: Real
+- **Description**: It determines at what value of the electron density the dielectric cavity forms. 
+The unit is $Bohr^{-3}$.
+- **Default**: 0.00037
\ No newline at end of file
diff --git a/modules/FindELPA.cmake b/modules/FindELPA.cmake
index 03c4cda549..82cfb56d86 100644
--- a/modules/FindELPA.cmake
+++ b/modules/FindELPA.cmake
@@ -38,17 +38,4 @@ if(ELPA_FOUND)
 endif()
 
 set(CMAKE_REQUIRED_INCLUDES ${CMAKE_REQUIRED_INCLUDES} ${ELPA_INCLUDE_DIR})
-include(CheckCXXSourceCompiles)
-check_cxx_source_compiles("
-#include <elpa/elpa_version.h>
-#if ELPA_API_VERSION < 20210430
-#error ELPA version is too old.
-#endif
-int main(){}
-"
-ELPA_VERSION_SATISFIES
-)
-if(NOT ELPA_VERSION_SATISFIES)
-    message(FATAL_ERROR "ELPA version is too old. We support version 2017 or higher.")
-endif()
 mark_as_advanced(ELPA_INCLUDE_DIR ELPA_LIBRARY)
diff --git a/source/Makefile b/source/Makefile
index 5043526890..16c493ff75 100644
--- a/source/Makefile
+++ b/source/Makefile
@@ -20,6 +20,7 @@ VPATH=./src_global\
 :./module_xc\
 :./module_esolver\
 :./module_hsolver\
+:./module_hsolver/genelpa\
 :./module_elecstate\
 :./module_psi\
 :./module_hamilt\
diff --git a/source/Makefile.Objects b/source/Makefile.Objects
index 36799e6333..1c0eecb0ba 100644
--- a/source/Makefile.Objects
+++ b/source/Makefile.Objects
@@ -277,6 +277,11 @@ hsolver_lcao.o\
 hsolver_pw.o\
 hsolver_pw_sdft.o
 
+OBJ_GENELPA=elpa_new_complex.o\
+elpa_new_real.o\
+elpa_new.o\
+utils.o
+
 OBJ_ELECSTATES=elecstate.o\
 dm2d_to_grid.o\
 elecstate_lcao.o\
@@ -304,6 +309,7 @@ $(OBJ_HSOLVER)\
 $(OBJ_ELECSTATES)\
 $(OBJ_PSI)\
 ${OBJ_OPERATOR}\
+${OBJ_GENELPA}\
 charge.o \
 charge_mixing.o \
 charge_pulay.o \
diff --git a/source/input_conv.cpp b/source/input_conv.cpp
index 1ff0f0a621..76a0ddcd53 100644
--- a/source/input_conv.cpp
+++ b/source/input_conv.cpp
@@ -373,6 +373,7 @@ void Input_Conv::Convert(void)
 
     if (GlobalC::exx_global.info.hybrid_type != Exx_Global::Hybrid_Type::No)
     {
+        // EXX case: convert all EXX-related variables
         GlobalC::exx_global.info.hybrid_alpha = INPUT.exx_hybrid_alpha;
         XC_Functional::get_hybrid_alpha(INPUT.exx_hybrid_alpha);
         GlobalC::exx_global.info.hse_omega = INPUT.exx_hse_omega;
@@ -406,6 +407,9 @@ void Input_Conv::Convert(void)
         Exx_Abfs::Jle::Lmax = INPUT.exx_opt_orb_lmax;
         Exx_Abfs::Jle::Ecut_exx = INPUT.exx_opt_orb_ecut;
         Exx_Abfs::Jle::tolerence = INPUT.exx_opt_orb_tolerence;
+
+        // EXX does not support symmetry analysis, so force the symmetry flag to -1
+        ModuleSymmetry::Symmetry::symm_flag = -1;
     }
 #endif
 #endif
diff --git a/source/module_deepks/test/CMakeLists.txt b/source/module_deepks/test/CMakeLists.txt
index 0702aa30bc..963b5dae2c 100644
--- a/source/module_deepks/test/CMakeLists.txt
+++ b/source/module_deepks/test/CMakeLists.txt
@@ -8,7 +8,7 @@ target_link_libraries(
     test_deepks
     base cell symmetry md surchem xc_
     neighbor orb io relax gint lcao parallel mrrr pdiag pw ri driver esolver hsolver psi elecstate hamilt planewave
-    pthread
+    pthread genelpa
     deepks
     ${ABACUS_LINK_LIBRARIES}
 )
diff --git a/source/module_elecstate/elecstate_lcao.cpp b/source/module_elecstate/elecstate_lcao.cpp
index d97409ef58..12e1ceab86 100644
--- a/source/module_elecstate/elecstate_lcao.cpp
+++ b/source/module_elecstate/elecstate_lcao.cpp
@@ -137,7 +137,9 @@ void ElecStateLCAO::print_psi(const psi::Psi<double>& psi_in)
 
     // output but not do "2d-to-grid" conversion
     double** wfc_grid = nullptr;
+#ifdef __MPI
     this->lowf->wfc_2d_to_grid(ElecStateLCAO::out_wfc_lcao, psi_in.get_pointer(), wfc_grid, this->ekb, this->wg);
+#endif
     return;
 }
 void ElecStateLCAO::print_psi(const psi::Psi<std::complex<double>>& psi_in)
@@ -159,7 +161,7 @@ void ElecStateLCAO::print_psi(const psi::Psi<std::complex<double>>& psi_in)
     {
         for (int iw = 0; iw < GlobalV::NLOCAL; iw++)
         {
-            this->lowf->wfc_k_grid[ik][ib][iw] = psi(ib,iw);
+            this->lowf->wfc_k_grid[ik][ib][iw] = psi_in(ib, iw);
         }
     }
 #endif
diff --git a/source/module_esolver/esolver_ks_pw_tool.cpp b/source/module_esolver/esolver_ks_pw_tool.cpp
index 5d8790072c..08600bc573 100644
--- a/source/module_esolver/esolver_ks_pw_tool.cpp
+++ b/source/module_esolver/esolver_ks_pw_tool.cpp
@@ -1,6 +1,6 @@
 #include "esolver_ks_pw.h"
-#include "module_base/global_variable.h"
 #include "module_base/global_function.h"
+#include "module_base/global_variable.h"
 #include "src_pw/global.h"
 #include "src_pw/occupy.h"
 
@@ -20,24 +20,23 @@ namespace ModuleESolver
 // k = 1.380649e-23
 // e/k = 11604.518026 , 1 eV = 11604.5 K
 //------------------------------------------------------------------
-#define TWOSQRT2LN2 2.354820045030949 //FWHM = 2sqrt(2ln2) * \sigma
-#define FACTOR 1.839939223835727e7
-void ESolver_KS_PW::KG(const int nche_KG, const double fwhmin, const double wcut, 
-             const double dw_in, const int times)
+#define TWOSQRT2LN2 2.354820045030949 // FWHM = 2sqrt(2ln2) * \sigma
+#define FACTOR      1.839939223835727e7
+void ESolver_KS_PW::KG(const int nche_KG, const double fwhmin, const double wcut, const double dw_in, const int times)
 {
     //-----------------------------------------------------------
     //               KS conductivity
     //-----------------------------------------------------------
-    cout<<"Calculating conductivity..."<<endl;
+    cout << "Calculating conductivity..." << endl;
     char transn = 'N';
     char transc = 'C';
-    int nw = ceil(wcut/dw_in);
-    double dw =  dw_in / ModuleBase::Ry_to_eV; //converge unit in eV to Ry 
+    int nw = ceil(wcut / dw_in);
+    double dw = dw_in / ModuleBase::Ry_to_eV; // convert unit from eV to Ry
     double sigma = fwhmin / TWOSQRT2LN2 / ModuleBase::Ry_to_eV;
-    double dt = ModuleBase::PI/(dw*nw)/times ; //unit in a.u., 1 a.u. = 4.837771834548454e-17 s
-    int nt = ceil(sqrt(20)/sigma/dt);
-    cout<<"nw: "<<nw<<" ; dw: "<<dw*ModuleBase::Ry_to_eV<<" eV"<<endl;
-    cout<<"nt: "<<nt<<" ; dt: "<<dt<<" a.u.(ry^-1)"<<endl;
+    double dt = ModuleBase::PI / (dw * nw) / times; // unit in a.u., 1 a.u. = 4.837771834548454e-17 s
+    int nt = ceil(sqrt(20) / sigma / dt);
+    cout << "nw: " << nw << " ; dw: " << dw * ModuleBase::Ry_to_eV << " eV" << endl;
+    cout << "nt: " << nt << " ; dt: " << dt << " a.u.(ry^-1)" << endl;
     assert(nw >= 1);
     assert(nt >= 1);
     const int nk = GlobalC::kv.nks;
@@ -46,150 +45,180 @@ void ESolver_KS_PW::KG(const int nche_KG, const double fwhmin, const double wcut
     const double tpiba = GlobalC::ucell.tpiba;
     const int nbands = GlobalV::NBANDS;
     const double ef = GlobalC::en.ef;
-    
 
-    double * ct11 = new double[nt];
-    double * ct12 = new double[nt];
-    double * ct22 = new double[nt];
-    ModuleBase::GlobalFunc::ZEROS(ct11,nt);
-    ModuleBase::GlobalFunc::ZEROS(ct12,nt);
-    ModuleBase::GlobalFunc::ZEROS(ct22,nt);
+    double *ct11 = new double[nt];
+    double *ct12 = new double[nt];
+    double *ct22 = new double[nt];
+    ModuleBase::GlobalFunc::ZEROS(ct11, nt);
+    ModuleBase::GlobalFunc::ZEROS(ct12, nt);
+    ModuleBase::GlobalFunc::ZEROS(ct22, nt);
 
-    for (int ik = 0;ik < nk;++ik)
-	{
-      for(int id = 0 ; id < ndim ; ++id)
-      {
-        this->phami->updateHk(ik);
-        const int npw = GlobalC::kv.ngk[ik];
-    
-        complex<double> * pij = new complex<double> [nbands * nbands];
-        complex<double> * prevc= new complex<double> [npw * nbands];
-        complex<double> * levc = &(this->psi[0](ik,0,0));
-        double *ga = new double[npw];
-        for (int ig = 0;ig < npw;ig++)
-        {
-            ModuleBase::Vector3<double> v3 = GlobalC::wfcpw->getgpluskcar(ik,ig);
-            ga[ig] = v3[id] * tpiba;
-        }
-        //px|right>
-        for (int ib = 0; ib < nbands ; ++ib)
-	    {
-	    	for (int ig = 0; ig < npw; ++ig)
-	    	{
-	    		prevc[ib*npw+ig] = ga[ig] * levc[ib*npwx+ig];
-	    	}
-            
-	    }
-        zgemm_(&transc,&transn,&nbands,&nbands,&npw,&ModuleBase::ONE,levc,&npwx,prevc,&npw,&ModuleBase::ZERO,pij,&nbands);
-        MPI_Allreduce(MPI_IN_PLACE, pij ,2 * nbands * nbands, MPI_DOUBLE, MPI_SUM, POOL_WORLD);
-        int ntper = nt/GlobalV::NPROC_IN_POOL;
-        int itstart = ntper * GlobalV::RANK_IN_POOL;
-        if(nt%GlobalV::NPROC_IN_POOL > GlobalV::RANK_IN_POOL)
-        {
-            ntper++;
-            itstart += GlobalV::RANK_IN_POOL;
-        }
-        else
+    for (int ik = 0; ik < nk; ++ik)
+    {
+        for (int id = 0; id < ndim; ++id)
         {
-            itstart += nt%GlobalV::NPROC_IN_POOL;
-        }
-        
-          
-        for(int it = itstart ; it < itstart+ntper ; ++it)
-        // for(int it = 0 ; it < nt; ++it)
-        { 
-            double tmct11 = 0;
-            double tmct12 = 0;
-            double tmct22 = 0;
-            double *enb=&(this->pelec->ekb(ik,0));
-            for(int ib = 0 ; ib < nbands ; ++ib)
+            this->phami->updateHk(ik);
+            const int npw = GlobalC::kv.ngk[ik];
+
+            complex<double> *pij = new complex<double>[nbands * nbands];
+            complex<double> *prevc = new complex<double>[npw * nbands];
+            complex<double> *levc = &(this->psi[0](ik, 0, 0));
+            double *ga = new double[npw];
+            for (int ig = 0; ig < npw; ig++)
+            {
+                ModuleBase::Vector3<double> v3 = GlobalC::wfcpw->getgpluskcar(ik, ig);
+                ga[ig] = v3[id] * tpiba;
+            }
+            // px|right>
+            for (int ib = 0; ib < nbands; ++ib)
             {
-                double ei = enb[ib];
-                double fi = GlobalC::wf.wg(ik,ib);
-                for(int jb = ib + 1 ; jb < nbands ; ++jb)
+                for (int ig = 0; ig < npw; ++ig)
                 {
-                    double ej = enb[jb];
-                    double fj = GlobalC::wf.wg(ik,jb);
-                    double tmct =  sin((ej-ei)*(it)*dt)*(fi-fj)*norm(pij[ib*nbands+jb]);
-                    tmct11 += tmct;
-                    tmct12 += - tmct * ((ei+ej)/2 - ef);
-                    tmct22 += tmct * pow((ei+ej)/2 - ef,2);
+                    prevc[ib * npw + ig] = ga[ig] * levc[ib * npwx + ig];
                 }
             }
-            ct11[it] += tmct11/2.0;
-            ct12[it] += tmct12/2.0;
-            ct22[it] += tmct22/2.0;
+            zgemm_(&transc,
+                   &transn,
+                   &nbands,
+                   &nbands,
+                   &npw,
+                   &ModuleBase::ONE,
+                   levc,
+                   &npwx,
+                   prevc,
+                   &npw,
+                   &ModuleBase::ZERO,
+                   pij,
+                   &nbands);
+#ifdef __MPI
+            MPI_Allreduce(MPI_IN_PLACE, pij, 2 * nbands * nbands, MPI_DOUBLE, MPI_SUM, POOL_WORLD);
+#endif
+            int ntper = nt / GlobalV::NPROC_IN_POOL;
+            int itstart = ntper * GlobalV::RANK_IN_POOL;
+            if (nt % GlobalV::NPROC_IN_POOL > GlobalV::RANK_IN_POOL)
+            {
+                ntper++;
+                itstart += GlobalV::RANK_IN_POOL;
+            }
+            else
+            {
+                itstart += nt % GlobalV::NPROC_IN_POOL;
+            }
+
+            for (int it = itstart; it < itstart + ntper; ++it)
+            // for(int it = 0 ; it < nt; ++it)
+            {
+                double tmct11 = 0;
+                double tmct12 = 0;
+                double tmct22 = 0;
+                double *enb = &(this->pelec->ekb(ik, 0));
+                for (int ib = 0; ib < nbands; ++ib)
+                {
+                    double ei = enb[ib];
+                    double fi = GlobalC::wf.wg(ik, ib);
+                    for (int jb = ib + 1; jb < nbands; ++jb)
+                    {
+                        double ej = enb[jb];
+                        double fj = GlobalC::wf.wg(ik, jb);
+                        double tmct = sin((ej - ei) * (it)*dt) * (fi - fj) * norm(pij[ib * nbands + jb]);
+                        tmct11 += tmct;
+                        tmct12 += -tmct * ((ei + ej) / 2 - ef);
+                        tmct22 += tmct * pow((ei + ej) / 2 - ef, 2);
+                    }
+                }
+                ct11[it] += tmct11 / 2.0;
+                ct12[it] += tmct12 / 2.0;
+                ct22[it] += tmct22 / 2.0;
+            }
+            delete[] pij;
+            delete[] prevc;
+            delete[] ga;
         }
-        delete [] pij;
-        delete [] prevc;
-        delete [] ga;
-      }
     }
-    MPI_Allreduce(MPI_IN_PLACE,ct11,nt,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
-    MPI_Allreduce(MPI_IN_PLACE,ct12,nt,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
-    MPI_Allreduce(MPI_IN_PLACE,ct22,nt,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
-    
+#ifdef __MPI
+    MPI_Allreduce(MPI_IN_PLACE, ct11, nt, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    MPI_Allreduce(MPI_IN_PLACE, ct12, nt, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    MPI_Allreduce(MPI_IN_PLACE, ct22, nt, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+#endif
+
     //------------------------------------------------------------------
     //                    Output
     //------------------------------------------------------------------
-    if(GlobalV::MY_RANK == 0)
+    if (GlobalV::MY_RANK == 0)
     {
-        calcondw(nt,dt,fwhmin,wcut,dw_in,ct11,ct12,ct22);
+        calcondw(nt, dt, fwhmin, wcut, dw_in, ct11, ct12, ct22);
     }
     delete[] ct11;
     delete[] ct12;
     delete[] ct22;
 }
 
-void ESolver_KS_PW::calcondw(const int nt,const double dt,const double fwhmin,const double wcut,const double dw_in,double*ct11,double*ct12,double *ct22)
+void ESolver_KS_PW::calcondw(const int nt,
+                             const double dt,
+                             const double fwhmin,
+                             const double wcut,
+                             const double dw_in,
+                             double *ct11,
+                             double *ct12,
+                             double *ct22)
 {
     double factor = FACTOR;
     const int ndim = 3;
-    int nw = ceil(wcut/dw_in);
-    double dw =  dw_in / ModuleBase::Ry_to_eV; //converge unit in eV to Ry 
+    int nw = ceil(wcut / dw_in);
+    double dw = dw_in / ModuleBase::Ry_to_eV; // convert unit from eV to Ry
     double sigma = fwhmin / TWOSQRT2LN2 / ModuleBase::Ry_to_eV;
     ofstream ofscond("je-je.txt");
-    ofscond<<setw(8)<<"#t(a.u.)"<<setw(15)<<"c11(t)"<<setw(15)<<"c12(t)"<<setw(15)<<"c22(t)"<<setw(15)<<"decay"<<endl;
-	for(int it = 0; it < nt; ++it)
-	{
-		ofscond <<setw(8)<<(it)*dt<<setw(15)<<-2*ct11[it]<<setw(15)<<-2*ct12[it]<<setw(15)<<-2*ct22[it]<<setw(15)<<exp(-double(1)/2*sigma*sigma*pow((it)*dt,2))<<endl;
-	}
+    ofscond << setw(8) << "#t(a.u.)" << setw(15) << "c11(t)" << setw(15) << "c12(t)" << setw(15) << "c22(t)" << setw(15)
+            << "decay" << endl;
+    for (int it = 0; it < nt; ++it)
+    {
+        ofscond << setw(8) << (it)*dt << setw(15) << -2 * ct11[it] << setw(15) << -2 * ct12[it] << setw(15)
+                << -2 * ct22[it] << setw(15) << exp(-double(1) / 2 * sigma * sigma * pow((it)*dt, 2)) << endl;
+    }
     ofscond.close();
-    double * cw11 = new double [nw];
-    double * cw12 = new double [nw];
-    double * cw22 = new double [nw];
-    double * kappa = new double [nw];
-    ModuleBase::GlobalFunc::ZEROS(cw11,nw);
-    ModuleBase::GlobalFunc::ZEROS(cw12,nw);
-    ModuleBase::GlobalFunc::ZEROS(cw22,nw);
-    for(int iw = 0 ; iw < nw ; ++iw )
+    double *cw11 = new double[nw];
+    double *cw12 = new double[nw];
+    double *cw22 = new double[nw];
+    double *kappa = new double[nw];
+    ModuleBase::GlobalFunc::ZEROS(cw11, nw);
+    ModuleBase::GlobalFunc::ZEROS(cw12, nw);
+    ModuleBase::GlobalFunc::ZEROS(cw22, nw);
+    for (int iw = 0; iw < nw; ++iw)
     {
-        for(int it = 0 ; it < nt ; ++it)
+        for (int it = 0; it < nt; ++it)
         {
-            cw11[iw] += -2 * ct11[it] * sin( -(iw+0.5) * dw * it *dt) * exp(-double(1)/2*sigma*sigma*pow((it)*dt,2) ) / (iw+0.5) /dw * dt ;
-            cw12[iw] += -2 * ct12[it] * sin( -(iw+0.5) * dw * it *dt) * exp(-double(1)/2*sigma*sigma*pow((it)*dt,2) ) / (iw+0.5) /dw * dt ;
-            cw22[iw] += -2 * ct22[it] * sin( -(iw+0.5) * dw * it *dt) * exp(-double(1)/2*sigma*sigma*pow((it)*dt,2) ) / (iw+0.5) /dw * dt ;
+            cw11[iw] += -2 * ct11[it] * sin(-(iw + 0.5) * dw * it * dt)
+                        * exp(-double(1) / 2 * sigma * sigma * pow((it)*dt, 2)) / (iw + 0.5) / dw * dt;
+            cw12[iw] += -2 * ct12[it] * sin(-(iw + 0.5) * dw * it * dt)
+                        * exp(-double(1) / 2 * sigma * sigma * pow((it)*dt, 2)) / (iw + 0.5) / dw * dt;
+            cw22[iw] += -2 * ct22[it] * sin(-(iw + 0.5) * dw * it * dt)
+                        * exp(-double(1) / 2 * sigma * sigma * pow((it)*dt, 2)) / (iw + 0.5) / dw * dt;
         }
     }
     ofscond.open("Onsager.txt");
-    ofscond<<setw(8)<<"## w(eV) "<<setw(20)<<"sigma(Sm^-1)"<<setw(20)<<"kappa(W(mK)^-1)"<<setw(20)<<"L12/e(Am^-1)"<<setw(20)<<"L22/e^2(Wm^-1)"<<endl;
-    for(int iw = 0; iw < nw; ++iw)
-	{
-        cw11[iw] *= double(2)/ndim/GlobalC::ucell.omega* factor; //unit in Sm^-1
-        cw12[iw] *= double(2)/ndim/GlobalC::ucell.omega* factor * 2.17987092759e-18/1.6021766208e-19; //unit in Am^-1
-        cw22[iw] *= double(2)/ndim/GlobalC::ucell.omega* factor * pow(2.17987092759e-18/1.6021766208e-19,2); //unit in Wm^-1
-        kappa[iw] = (cw22[iw]-pow(cw12[iw],2)/cw11[iw])/Occupy::gaussian_parameter/ModuleBase::Ry_to_eV/11604.518026;
-	    ofscond <<setw(8)<<(iw+0.5)*dw * ModuleBase::Ry_to_eV <<setw(20)<<cw11[iw] <<setw(20)<<kappa[iw]<<setw(20)<<cw12[iw] <<setw(20)<<cw22[iw]<<endl;
-	}
-    cout<<setprecision(6)<<"DC electrical conductivity: "<<cw11[0] - (cw11[1] - cw11[0]) * 0.5<<" Sm^-1"<<endl;
-    cout<<setprecision(6)<<"Thermal conductivity: "<<kappa[0] - (kappa[1] - kappa[0]) * 0.5<<" Wm^-1"<<endl;;
+    ofscond << setw(8) << "## w(eV) " << setw(20) << "sigma(Sm^-1)" << setw(20) << "kappa(W(mK)^-1)" << setw(20)
+            << "L12/e(Am^-1)" << setw(20) << "L22/e^2(Wm^-1)" << endl;
+    for (int iw = 0; iw < nw; ++iw)
+    {
+        cw11[iw] *= double(2) / ndim / GlobalC::ucell.omega * factor; // unit in Sm^-1
+        cw12[iw]
+            *= double(2) / ndim / GlobalC::ucell.omega * factor * 2.17987092759e-18 / 1.6021766208e-19; // unit in Am^-1
+        cw22[iw] *= double(2) / ndim / GlobalC::ucell.omega * factor
+                    * pow(2.17987092759e-18 / 1.6021766208e-19, 2); // unit in Wm^-1
+        kappa[iw] = (cw22[iw] - pow(cw12[iw], 2) / cw11[iw]) / Occupy::gaussian_parameter / ModuleBase::Ry_to_eV
+                    / 11604.518026;
+        ofscond << setw(8) << (iw + 0.5) * dw * ModuleBase::Ry_to_eV << setw(20) << cw11[iw] << setw(20) << kappa[iw]
+                << setw(20) << cw12[iw] << setw(20) << cw22[iw] << endl;
+    }
+    cout << setprecision(6) << "DC electrical conductivity: " << cw11[0] - (cw11[1] - cw11[0]) * 0.5 << " Sm^-1"
+         << endl;
+    cout << setprecision(6) << "Thermal conductivity: " << kappa[0] - (kappa[1] - kappa[0]) * 0.5 << " Wm^-1" << endl;
+    ;
     ofscond.close();
-    
-    
+
     delete[] cw11;
     delete[] cw12;
     delete[] cw22;
     delete[] kappa;
-
 }
-}
\ No newline at end of file
+} // namespace ModuleESolver
\ No newline at end of file
diff --git a/source/module_hamilt/hamilt_lcao.cpp b/source/module_hamilt/hamilt_lcao.cpp
index ff119638d8..77d5b43081 100644
--- a/source/module_hamilt/hamilt_lcao.cpp
+++ b/source/module_hamilt/hamilt_lcao.cpp
@@ -134,7 +134,7 @@ template <> void HamiltLCAO<double>::updateHk(const int ik)
         }
         const int inc = 1;
         BlasConnector::copy(this->LM->Sloc.size(), this->LM->Sloc.data(), inc, this->smatrix_k, inc);
-        hsolver::DiagoElpa::is_already_decomposed = false;
+        hsolver::DiagoElpa::DecomposedState = 0;
     }
     ModuleBase::timer::tick("HamiltLCAO", "updateHk");
     return;
@@ -303,4 +303,4 @@ template <> void HamiltLCAO<std::complex<double>>::constructHamilt()
 #endif
 }
 
-} // namespace hamilt
\ No newline at end of file
+} // namespace hamilt
diff --git a/source/module_hamilt/operator.h b/source/module_hamilt/operator.h
index 044fc8489b..8610247375 100644
--- a/source/module_hamilt/operator.h
+++ b/source/module_hamilt/operator.h
@@ -93,6 +93,11 @@ class Operator
         //create a new hpsi and delete old hpsi later
         T* hpsi_pointer = std::get<2>(info);
         const T* psi_pointer = std::get<0>(info)->get_pointer();
+        if(this->hpsi != nullptr) 
+        {
+            delete this->hpsi;
+            this->hpsi = nullptr;
+        }
         if(!hpsi_pointer)
         {
             ModuleBase::WARNING_QUIT("Operator::hPsi", "hpsi_pointer can not be nullptr");
diff --git a/source/module_hsolver/CMakeLists.txt b/source/module_hsolver/CMakeLists.txt
index 702bec735d..898fc6cc08 100644
--- a/source/module_hsolver/CMakeLists.txt
+++ b/source/module_hsolver/CMakeLists.txt
@@ -12,6 +12,8 @@ add_library(
     diago_lapack.cpp
 )
 
+add_subdirectory(genelpa)
+
 IF (BUILD_TESTING)
   add_subdirectory(test)
 endif()
diff --git a/source/module_hsolver/diago_elpa.cpp b/source/module_hsolver/diago_elpa.cpp
index 4bb7403121..69ac995e3d 100644
--- a/source/module_hsolver/diago_elpa.cpp
+++ b/source/module_hsolver/diago_elpa.cpp
@@ -7,52 +7,16 @@
 extern "C"
 {
 #include "module_base/blacs_connector.h"
-#include "my_elpa.h"
 #include "module_base/scalapack_connector.h"
 }
+#include "genelpa/elpa_solver.h"
 
 typedef hamilt::MatrixBlock<double> matd;
 typedef hamilt::MatrixBlock<std::complex<double>> matcd;
 
 namespace hsolver
 {
-bool DiagoElpa::is_already_decomposed = false;
-#ifdef __MPI
-inline int set_elpahandle(elpa_t &handle,
-                          const int *desc,
-                          const int local_nrows,
-                          const int local_ncols,
-                          const int nbands)
-{
-    int error;
-    int nprows, npcols, myprow, mypcol;
-    Cblacs_gridinfo(desc[1], &nprows, &npcols, &myprow, &mypcol);
-    elpa_init(20210430);
-    handle = elpa_allocate(&error);
-    elpa_set_integer(handle, "na", desc[2], &error);
-    elpa_set_integer(handle, "nev", nbands, &error);
-
-    elpa_set_integer(handle, "local_nrows", local_nrows, &error);
-
-    elpa_set_integer(handle, "local_ncols", local_ncols, &error);
-
-    elpa_set_integer(handle, "nblk", desc[4], &error);
-
-    elpa_set_integer(handle, "mpi_comm_parent", MPI_Comm_c2f(MPI_COMM_WORLD), &error);
-
-    elpa_set_integer(handle, "process_row", myprow, &error);
-
-    elpa_set_integer(handle, "process_col", mypcol, &error);
-
-    elpa_set_integer(handle, "blacs_context", desc[1], &error);
-
-    elpa_set_integer(handle, "cannon_for_generalized", 0, &error);
-    /* Setup */
-    elpa_setup(handle); /* Set tunables */
-    return 0;
-}
-#endif
-
+int DiagoElpa::DecomposedState = 0;
 void DiagoElpa::diag(hamilt::Hamilt *phm_in, psi::Psi<std::complex<double>> &psi, double *eigenvalue_in)
 {
     ModuleBase::TITLE("DiagoElpa", "diag");
@@ -62,31 +26,15 @@ void DiagoElpa::diag(hamilt::Hamilt *phm_in, psi::Psi<std::complex<double>> &psi
 
     std::vector<double> eigen(GlobalV::NLOCAL, 0.0);
 
-    static elpa_t handle;
-    static bool has_set_elpa_handle = false;
-    if (!has_set_elpa_handle)
-    {
-        set_elpahandle(handle, h_mat.desc, h_mat.row, h_mat.col, GlobalV::NBANDS);
-        has_set_elpa_handle = true;
-    }
-
-    // compare to old code from pplab, there is no need to copy Sloc2 to another memory,
-    // just change Sloc2, which is a temporary matrix
-    // size_t nloc = h_mat.col * h_mat.row,
-    // BlasConnector::copy(nloc, s_mat, inc, Stmp, inc);
-
+    bool isReal=false;
+    const MPI_Comm COMM_DIAG=MPI_COMM_WORLD; // use all processes
+    ELPA_Solver es((const bool)isReal, COMM_DIAG, (const int)GlobalV::NBANDS, (const int)h_mat.row, (const int)h_mat.col, (const int*)h_mat.desc);
+    this->DecomposedState=0; // in k-point calculations, the decomposed s_mat cannot be reused
     ModuleBase::timer::tick("DiagoElpa", "elpa_solve");
-    int elpa_derror;
-    elpa_generalized_eigenvectors_dc(handle,
-                                     reinterpret_cast<double _Complex *>(h_mat.p),
-                                     reinterpret_cast<double _Complex *>(s_mat.p),
-                                     eigen.data(),
-                                     reinterpret_cast<double _Complex *>(psi.get_pointer()),
-                                     0,
-                                     &elpa_derror);
+    es.generalized_eigenvector(h_mat.p, s_mat.p, this->DecomposedState, eigen.data(), psi.get_pointer());
     ModuleBase::timer::tick("DiagoElpa", "elpa_solve");
+    es.exit();
 
-    // the eigenvalues.
     const int inc = 1;
     BlasConnector::copy(GlobalV::NBANDS, eigen.data(), inc, eigenvalue_in, inc);
 #else
@@ -103,43 +51,14 @@ void DiagoElpa::diag(hamilt::Hamilt *phm_in, psi::Psi<double> &psi, double *eige
 
     std::vector<double> eigen(GlobalV::NLOCAL, 0.0);
 
-    static elpa_t handle;
-    static bool has_set_elpa_handle = false;
-    if (!has_set_elpa_handle)
-    {
-        set_elpahandle(handle, h_mat.desc, h_mat.row, h_mat.col, GlobalV::NBANDS);
-        has_set_elpa_handle = true;
-    }
-
-    // compare to old code from pplab, there is no need to copy Sloc2 to another memory,
-    // just change Sloc2, which is a temporary matrix
-    // change this judgement to HamiltLCAO
-    /*int is_already_decomposed;
-    if(ifElpaHandle(GlobalC::CHR.get_new_e_iteration(), (GlobalV::CALCULATION=="nscf")))
-    {
-        ModuleBase::timer::tick("DiagoElpa","decompose_S");
-        BlasConnector::copy(pv->nloc, s_mat, inc, Stmp, inc);
-        is_already_decomposed=0;
-        ModuleBase::timer::tick("DiagoElpa","decompose_S");
-    }
-    else
-    {
-        is_already_decomposed=1;
-    }*/
-
+    bool isReal=true;
+    MPI_Comm COMM_DIAG=MPI_COMM_WORLD; // use all processes
+    //ELPA_Solver es(isReal, COMM_DIAG, GlobalV::NBANDS, h_mat.row, h_mat.col, h_mat.desc);
+    ELPA_Solver es((const bool)isReal, COMM_DIAG, (const int)GlobalV::NBANDS, (const int)h_mat.row, (const int)h_mat.col, (const int*)h_mat.desc);
     ModuleBase::timer::tick("DiagoElpa", "elpa_solve");
-    int elpa_error;
-    elpa_generalized_eigenvectors_d(handle,
-                                    h_mat.p,
-                                    s_mat.p,
-                                    eigen.data(),
-                                    psi.get_pointer(),
-                                    DiagoElpa::is_already_decomposed,
-                                    &elpa_error);
+    es.generalized_eigenvector(h_mat.p, s_mat.p, this->DecomposedState, eigen.data(), psi.get_pointer());
     ModuleBase::timer::tick("DiagoElpa", "elpa_solve");
-
-    //S matrix has been decomposed
-    DiagoElpa::is_already_decomposed = true;
+    es.exit();
 
     const int inc = 1;
     ModuleBase::GlobalFunc::OUT(GlobalV::ofs_running, "K-S equation was solved by genelpa2");
@@ -162,4 +81,4 @@ bool DiagoElpa::ifElpaHandle(const bool &newIteration, const bool &ifNSCF)
 }
 #endif
 
-} // namespace hsolver
\ No newline at end of file
+} // namespace hsolver
diff --git a/source/module_hsolver/diago_elpa.h b/source/module_hsolver/diago_elpa.h
index 379d33ee8e..0820bc6200 100644
--- a/source/module_hsolver/diago_elpa.h
+++ b/source/module_hsolver/diago_elpa.h
@@ -14,9 +14,9 @@ class DiagoElpa : public DiagH
     void diag(hamilt::Hamilt* phm_in, psi::Psi<double>& psi, double* eigenvalue_in) override;
 
     void diag(hamilt::Hamilt* phm_in, psi::Psi<std::complex<double>>& psi, double* eigenvalue_in) override;
-
-    static bool is_already_decomposed;
     
+    static int DecomposedState;
+
   private:
 #ifdef __MPI
     bool ifElpaHandle(const bool& newIteration, const bool& ifNSCF);
@@ -25,4 +25,4 @@ class DiagoElpa : public DiagH
 
 } // namespace hsolver
 
-#endif
\ No newline at end of file
+#endif
diff --git a/source/module_hsolver/genelpa/CMakeLists.txt b/source/module_hsolver/genelpa/CMakeLists.txt
new file mode 100644
index 0000000000..e962f742f7
--- /dev/null
+++ b/source/module_hsolver/genelpa/CMakeLists.txt
@@ -0,0 +1 @@
+add_library(genelpa OBJECT elpa_new.cpp elpa_new_real.cpp elpa_new_complex.cpp utils.cpp)
diff --git a/source/module_hsolver/genelpa/Cblacs.h b/source/module_hsolver/genelpa/Cblacs.h
new file mode 100644
index 0000000000..35a7ccfdfb
--- /dev/null
+++ b/source/module_hsolver/genelpa/Cblacs.h
@@ -0,0 +1,24 @@
+#pragma once
+// blacs
+    // Initialization
+#include "mpi.h"
+int Csys2blacs_handle(MPI_Comm SysCtxt);
+void Cblacs_pinfo(int *myid, int *nprocs);
+void Cblacs_get(int icontxt, int what, int *val);
+void Cblacs_gridinit(int* icontxt, char *layout, int nprow, int npcol);
+void Cblacs_gridmap(int* icontxt, int *usermap, int ldumap, int nprow, int npcol);
+    // Destruction
+void Cblacs_gridexit(int icontxt);
+    // Informational and Miscellaneous
+void Cblacs_gridinfo(int icontxt, int* nprow, int *npcol, int *myprow, int *mypcol);
+int Cblacs_pnum(int icontxt, int prow, int pcol);
+void Cblacs_pcoord(int icontxt, int pnum, int *prow, int *pcol);
+void Cblacs_barrier(int icontxt, char *scope);
+    // Point to  Point
+void Cdgesd2d(int icontxt, int m, int n, double *a, int lda, int rdest, int cdest);
+void Cdgerv2d(int icontxt, int m, int n, double *a, int lda, int rsrc, int csrc);
+void Czgesd2d(int icontxt, int m, int n, double _Complex *a, int lda, int rdest, int cdest);
+void Czgerv2d(int icontxt, int m, int n, double _Complex *a, int lda, int rsrc, int csrc);
+    // Combine
+//void Cdgamx2d(int icontxt, int scope, int top, int m, int n,
+//              double *a, int lda, int *ra, int *ca, int rcflag, int rdest, int cdest);
diff --git a/source/module_hsolver/genelpa/README b/source/module_hsolver/genelpa/README
new file mode 100644
index 0000000000..54005767af
--- /dev/null
+++ b/source/module_hsolver/genelpa/README
@@ -0,0 +1,4 @@
+GenELPA, v1.1.1, customized for ABACUS
+
+Project: <https://github.com/pplab/GenELPA>
+
diff --git a/source/module_hsolver/genelpa/blas.h b/source/module_hsolver/genelpa/blas.h
new file mode 100644
index 0000000000..90266a702d
--- /dev/null
+++ b/source/module_hsolver/genelpa/blas.h
@@ -0,0 +1,27 @@
+#pragma once
+//blas
+void dcopy_(const int *n, const double *x, const int *incx, double *y, const int *incy);
+void zcopy_(const int *n, const double _Complex *x, const int *incx, double _Complex *y, const int *incy);
+void dgemm_(const char *transa, const char *transb, const int *m, const int *n, const int *k,
+            const double *alpha, double *a, const int *lda, 
+                                 double *b, const int *ldb,
+            const double *beta,  double *c, const int *ldc);
+void dsymm_(char *side, char *uplo, int *m, int *n, 
+            const double *alpha, double *a,  int *lda,  
+                                 double *b, int *ldb, 
+            const double *beta,  double *c, int *ldc);
+void dtrsm_(char *side, char *uplo, char *transa, char *diag, int *m, int *n,
+            const double *alpha, double *a, int *lda, 
+                                 double *b, int *ldb);
+//void zcopy_(int *n, double _Complex *x, int *incx, double _Complex *y, int *incy);
+void zgemm_(const char *transa, const char *transb, const int *m, const int *n, const int *k,
+            const double _Complex *alpha, double _Complex *a, const int *lda, 
+                                          double _Complex *b, const int *ldb,
+            const double _Complex *beta,  double _Complex *c, const int *ldc);
+void zsymm_(char *side, char *uplo, int *m, int *n, 
+            const double _Complex *alpha, double _Complex *a,  int *lda,  
+                                          double _Complex *b, int *ldb, 
+            const double _Complex *beta,  double _Complex *c, int *ldc);
+void ztrsm_(char *side, char *uplo, char *transa, char *diag, int *m, int *n,
+            double _Complex *alpha, double _Complex *a, int *lda, 
+                                    double _Complex *b, int *ldb);
\ No newline at end of file
diff --git a/source/module_hsolver/genelpa/elpa_generic.hpp b/source/module_hsolver/genelpa/elpa_generic.hpp
new file mode 100644
index 0000000000..a6b3f82ff5
--- /dev/null
+++ b/source/module_hsolver/genelpa/elpa_generic.hpp
@@ -0,0 +1,408 @@
+// Replacement for `elpa_generic.h`, covering ELPA 2021.05.002 and earlier as well as 2021.11.002 and later.
+// If the installed `elpa_generic.h` contains the identifier `elpa_eigenvectors_all_host_arrays_dc`,
+// it belongs to the new API (2021.11.002 or later); otherwise it belongs to the old API.
+#pragma once
+#include "elpa_new.h"
+static inline void elpa_set(elpa_t e, const char *name, int value, int *error)
+{
+    elpa_set_integer(e, name, value, error);
+}
+
+static inline void elpa_set(elpa_t e, const char *name, double value, int *error)
+{
+    elpa_set_double(e, name, value, error);
+}
+
+static inline void elpa_get(elpa_t e, const char *name, int *value, int *error)
+{
+    elpa_get_integer(e, name, value, error);
+}
+
+static inline void elpa_get(elpa_t e, const char *name, double *value, int *error)
+{
+    elpa_get_double(e, name, value, error);
+}
+
+#if ELPA_API_VERSION <= 20210430 // ELPA 2021.05.002 and earlier versions
+
+static inline void elpa_eigenvectors(elpa_t handle, double *a, double *ev, double *q, int *error)
+{
+    elpa_eigenvectors_d(handle, a, ev, q, error);
+}
+
+static inline void elpa_eigenvectors(elpa_t handle, float *a, float *ev, float *q, int *error)
+{
+    elpa_eigenvectors_f(handle, a, ev, q, error);
+}
+
+static inline void elpa_eigenvectors(elpa_t handle, double complex *a, double *ev, double complex *q, int *error)
+{
+    elpa_eigenvectors_dc(handle, a, ev, q, error);
+}
+
+static inline void elpa_eigenvectors(elpa_t handle, float complex *a, float *ev, float complex *q, int *error)
+{
+    elpa_eigenvectors_fc(handle, a, ev, q, error);
+}
+
+static inline void elpa_skew_eigenvectors(elpa_t handle, double *a, double *ev, double *q, int *error)
+{
+    elpa_eigenvectors_d(handle, a, ev, q, error);
+}
+
+static inline void elpa_skew_eigenvectors(elpa_t handle, float *a, float *ev, float *q, int *error)
+{
+    elpa_eigenvectors_f(handle, a, ev, q, error);
+}
+
+static inline void elpa_generalized_eigenvectors(elpa_t handle,
+                                                 double *a,
+                                                 double *b,
+                                                 double *ev,
+                                                 double *q,
+                                                 int is_already_decomposed,
+                                                 int *error)
+{
+    elpa_generalized_eigenvectors_d(handle, a, b, ev, q, is_already_decomposed, error);
+}
+
+static inline void elpa_generalized_eigenvectors(elpa_t handle,
+                                                 float *a,
+                                                 float *b,
+                                                 float *ev,
+                                                 float *q,
+                                                 int is_already_decomposed,
+                                                 int *error)
+{
+    elpa_generalized_eigenvectors_f(handle, a, b, ev, q, is_already_decomposed, error);
+}
+
+static inline void elpa_generalized_eigenvectors(elpa_t handle,
+                                                 double complex *a,
+                                                 double complex *b,
+                                                 double *ev,
+                                                 double complex *q,
+                                                 int is_already_decomposed,
+                                                 int *error)
+{
+    elpa_generalized_eigenvectors_dc(handle, a, b, ev, q, is_already_decomposed, error);
+}
+
+static inline void elpa_generalized_eigenvectors(elpa_t handle,
+                                                 float complex *a,
+                                                 float complex *b,
+                                                 float *ev,
+                                                 float complex *q,
+                                                 int is_already_decomposed,
+                                                 int *error)
+{
+    elpa_generalized_eigenvectors_fc(handle, a, b, ev, q, is_already_decomposed, error);
+}
+
+static inline void elpa_eigenvalues(elpa_t handle, double *a, double *ev, int *error)
+{
+    elpa_eigenvalues_d(handle, a, ev, error);
+}
+
+static inline void elpa_eigenvalues(elpa_t handle, float *a, float *ev, int *error)
+{
+    elpa_eigenvalues_f(handle, a, ev, error);
+}
+
+static inline void elpa_eigenvalues(elpa_t handle, double complex *a, double *ev, int *error)
+{
+    elpa_eigenvalues_dc(handle, a, ev, error);
+}
+
+static inline void elpa_eigenvalues(elpa_t handle, float complex *a, float *ev, int *error)
+{
+    elpa_eigenvalues_fc(handle, a, ev, error);
+}
+
+static inline void elpa_skew_eigenvalues(elpa_t handle, double *a, double *ev, int *error)
+{
+    elpa_eigenvalues_d(handle, a, ev, error);
+}
+
+static inline void elpa_skew_eigenvalues(elpa_t handle, float *a, float *ev, int *error)
+{
+    elpa_eigenvalues_f(handle, a, ev, error);
+}
+
+static inline void elpa_cholesky(elpa_t handle, double *a, int *error)
+{
+    elpa_cholesky_d(handle, a, error);
+}
+
+static inline void elpa_cholesky(elpa_t handle, float *a, int *error)
+{
+    elpa_cholesky_f(handle, a, error);
+}
+#else // ELPA version >= 2021.11.002
+static inline void elpa_eigenvectors(elpa_t handle, double *a, double *ev, double *q, int *error)
+{
+    elpa_eigenvectors_all_host_arrays_d(handle, a, ev, q, error);
+}
+
+static inline void elpa_eigenvectors(elpa_t handle, float *a, float *ev, float *q, int *error)
+{
+    elpa_eigenvectors_all_host_arrays_f(handle, a, ev, q, error);
+}
+
+static inline void elpa_eigenvectors(elpa_t handle, double complex *a, double *ev, double complex *q, int *error)
+{
+    elpa_eigenvectors_all_host_arrays_dc(handle, a, ev, q, error);
+}
+
+static inline void elpa_eigenvectors(elpa_t handle, float complex *a, float *ev, float complex *q, int *error)
+{
+    elpa_eigenvectors_all_host_arrays_fc(handle, a, ev, q, error);
+}
+
+static inline void elpa_eigenvectors_double(elpa_t handle, double *a, double *ev, double *q, int *error)
+{
+    elpa_eigenvectors_device_pointer_d(handle, a, ev, q, error);
+}
+
+static inline void elpa_eigenvectors_float(elpa_t handle, float *a, float *ev, float *q, int *error)
+{
+    elpa_eigenvectors_device_pointer_f(handle, a, ev, q, error);
+}
+
+static inline void elpa_eigenvectors_double_complex(elpa_t handle,
+                                                    double complex *a,
+                                                    double *ev,
+                                                    double complex *q,
+                                                    int *error)
+{
+    elpa_eigenvectors_device_pointer_dc(handle, a, ev, q, error);
+}
+
+static inline void elpa_eigenvectors_float_complex(elpa_t handle,
+                                                   float complex *a,
+                                                   float *ev,
+                                                   float complex *q,
+                                                   int *error)
+{
+    elpa_eigenvectors_device_pointer_fc(handle, a, ev, q, error);
+}
+
+static inline void elpa_skew_eigenvectors(elpa_t handle, double *a, double *ev, double *q, int *error)
+{
+    elpa_eigenvectors_all_host_arrays_d(handle, a, ev, q, error);
+}
+
+static inline void elpa_skew_eigenvectors(elpa_t handle, float *a, float *ev, float *q, int *error)
+{
+    elpa_eigenvectors_all_host_arrays_f(handle, a, ev, q, error);
+}
+
+static inline void elpa_skew_eigenvectors_double(elpa_t handle, double *a, double *ev, double *q, int *error)
+{
+    elpa_eigenvectors_device_pointer_d(handle, a, ev, q, error);
+}
+
+static inline void elpa_skew_eigenvectors_float(elpa_t handle, float *a, float *ev, float *q, int *error)
+{
+    elpa_eigenvectors_device_pointer_f(handle, a, ev, q, error);
+}
+
+static inline void elpa_generalized_eigenvectors(elpa_t handle,
+                                                 double *a,
+                                                 double *b,
+                                                 double *ev,
+                                                 double *q,
+                                                 int is_already_decomposed,
+                                                 int *error)
+{
+    elpa_generalized_eigenvectors_d(handle, a, b, ev, q, is_already_decomposed, error);
+}
+
+static inline void elpa_generalized_eigenvectors(elpa_t handle,
+                                                 float *a,
+                                                 float *b,
+                                                 float *ev,
+                                                 float *q,
+                                                 int is_already_decomposed,
+                                                 int *error)
+{
+    elpa_generalized_eigenvectors_f(handle, a, b, ev, q, is_already_decomposed, error);
+}
+
+static inline void elpa_generalized_eigenvectors(elpa_t handle,
+                                                 double complex *a,
+                                                 double complex *b,
+                                                 double *ev,
+                                                 double complex *q,
+                                                 int is_already_decomposed,
+                                                 int *error)
+{
+    elpa_generalized_eigenvectors_dc(handle, a, b, ev, q, is_already_decomposed, error);
+}
+
+static inline void elpa_generalized_eigenvectors(elpa_t handle,
+                                                 float complex *a,
+                                                 float complex *b,
+                                                 float *ev,
+                                                 float complex *q,
+                                                 int is_already_decomposed,
+                                                 int *error)
+{
+    elpa_generalized_eigenvectors_fc(handle, a, b, ev, q, is_already_decomposed, error);
+}
+
+static inline void elpa_eigenvalues(elpa_t handle, double *a, double *ev, int *error)
+{
+    elpa_eigenvalues_all_host_arrays_d(handle, a, ev, error);
+}
+
+static inline void elpa_eigenvalues(elpa_t handle, float *a, float *ev, int *error)
+{
+    elpa_eigenvalues_all_host_arrays_f(handle, a, ev, error);
+}
+
+static inline void elpa_eigenvalues(elpa_t handle, double complex *a, double *ev, int *error)
+{
+    elpa_eigenvalues_all_host_arrays_dc(handle, a, ev, error);
+}
+
+static inline void elpa_eigenvalues(elpa_t handle, float complex *a, float *ev, int *error)
+{
+    elpa_eigenvalues_all_host_arrays_fc(handle, a, ev, error);
+}
+
+static inline void elpa_eigenvalues_double(elpa_t handle, double *a, double *ev, int *error)
+{
+    elpa_eigenvalues_device_pointer_d(handle, a, ev, error);
+}
+
+static inline void elpa_eigenvalues_float(elpa_t handle, float *a, float *ev, int *error)
+{
+    elpa_eigenvalues_device_pointer_f(handle, a, ev, error);
+}
+
+static inline void elpa_eigenvalues_double_complex(elpa_t handle, double complex *a, double *ev, int *error)
+{
+    elpa_eigenvalues_device_pointer_dc(handle, a, ev, error);
+}
+
+static inline void elpa_eigenvalues_float_complex(elpa_t handle, float complex *a, float *ev, int *error)
+{
+    elpa_eigenvalues_device_pointer_fc(handle, a, ev, error);
+}
+
+static inline void elpa_skew_eigenvalues(elpa_t handle, double *a, double *ev, int *error)
+{
+    elpa_eigenvalues_all_host_arrays_d(handle, a, ev, error);
+}
+
+static inline void elpa_skew_eigenvalues(elpa_t handle, float *a, float *ev, int *error)
+{
+    elpa_eigenvalues_all_host_arrays_f(handle, a, ev, error);
+}
+
+static inline void elpa_skew_eigenvalues_double(elpa_t handle, double *a, double *ev, int *error)
+{
+    elpa_eigenvalues_device_pointer_d(handle, a, ev, error);
+}
+
+static inline void elpa_skew_eigenvalues_float(elpa_t handle, float *a, float *ev, int *error)
+{
+    elpa_eigenvalues_device_pointer_f(handle, a, ev, error);
+}
+
+#endif // ELPA_API_VERSION <= 20210430
+
+static inline void elpa_cholesky(elpa_t handle, double complex *a, int *error)
+{
+    elpa_cholesky_dc(handle, a, error);
+}
+
+static inline void elpa_cholesky(elpa_t handle, float complex *a, int *error)
+{
+    elpa_cholesky_fc(handle, a, error);
+}
+
+static inline void elpa_hermitian_multiply(elpa_t handle,
+                                           char uplo_a,
+                                           char uplo_c,
+                                           int ncb,
+                                           double *a,
+                                           double *b,
+                                           int nrows_b,
+                                           int ncols_b,
+                                           double *c,
+                                           int nrows_c,
+                                           int ncols_c,
+                                           int *error)
+{
+    elpa_hermitian_multiply_d(handle, uplo_a, uplo_c, ncb, a, b, nrows_b, ncols_b, c, nrows_c, ncols_c, error);
+}
+
+static inline void elpa_hermitian_multiply(elpa_t handle,
+                                           char uplo_a,
+                                           char uplo_c,
+                                           int ncb,
+                                           float *a,
+                                           float *b,
+                                           int nrows_b,
+                                           int ncols_b,
+                                           float *c,
+                                           int nrows_c,
+                                           int ncols_c,
+                                           int *error)
+{
+    elpa_hermitian_multiply_df(handle, uplo_a, uplo_c, ncb, a, b, nrows_b, ncols_b, c, nrows_c, ncols_c, error);
+}
+
+static inline void elpa_hermitian_multiply(elpa_t handle,
+                                           char uplo_a,
+                                           char uplo_c,
+                                           int ncb,
+                                           double complex *a,
+                                           double complex *b,
+                                           int nrows_b,
+                                           int ncols_b,
+                                           double complex *c,
+                                           int nrows_c,
+                                           int ncols_c,
+                                           int *error)
+{
+    elpa_hermitian_multiply_dc(handle, uplo_a, uplo_c, ncb, a, b, nrows_b, ncols_b, c, nrows_c, ncols_c, error);
+}
+
+static inline void elpa_hermitian_multiply(elpa_t handle,
+                                           char uplo_a,
+                                           char uplo_c,
+                                           int ncb,
+                                           float complex *a,
+                                           float complex *b,
+                                           int nrows_b,
+                                           int ncols_b,
+                                           float complex *c,
+                                           int nrows_c,
+                                           int ncols_c,
+                                           int *error)
+{
+    elpa_hermitian_multiply_fc(handle, uplo_a, uplo_c, ncb, a, b, nrows_b, ncols_b, c, nrows_c, ncols_c, error);
+}
+
+static inline void elpa_invert_triangular(elpa_t handle, double *a, int *error)
+{
+    elpa_invert_trm_d(handle, a, error);
+}
+
+static inline void elpa_invert_triangular(elpa_t handle, float *a, int *error)
+{
+    elpa_invert_trm_f(handle, a, error);
+}
+
+static inline void elpa_invert_triangular(elpa_t handle, double complex *a, int *error)
+{
+    elpa_invert_trm_dc(handle, a, error);
+}
+
+static inline void elpa_invert_triangular(elpa_t handle, float complex *a, int *error)
+{
+    elpa_invert_trm_fc(handle, a, error);
+}
diff --git a/source/module_hsolver/genelpa/elpa_new.cpp b/source/module_hsolver/genelpa/elpa_new.cpp
new file mode 100644
index 0000000000..bc42d1de94
--- /dev/null
+++ b/source/module_hsolver/genelpa/elpa_new.cpp
@@ -0,0 +1,461 @@
+#include "elpa_new.h"
+
+#include "elpa_solver.h"
+#include "my_math.hpp"
+#include "utils.h"
+
+#include <cfloat>
+#include <complex>
+#include <cstring>
+#include <fstream>
+#include <iostream>
+#include <map>
+#include <mpi.h>
+#include <regex>
+#include <sstream>
+#include <vector>
+
+using namespace std;
+
+map<int, elpa_t> NEW_ELPA_HANDLE_POOL;
+
+ELPA_Solver::ELPA_Solver(const bool isReal,
+                         const MPI_Comm comm,
+                         const int nev,
+                         const int narows,
+                         const int nacols,
+                         const int* desc)
+{
+    this->isReal = isReal;
+    this->comm = comm;
+    this->nev = nev;
+    this->narows = narows;
+    this->nacols = nacols;
+    for (int i = 0; i < 9; ++i)
+        this->desc[i] = desc[i];
+    cblacs_ctxt = desc[1];
+    nFull = desc[2];
+    nblk = desc[4];
+    lda = desc[8];
+    // cout<<"parameters are passed\n";
+    MPI_Comm_rank(comm, &myid);
+    Cblacs_gridinfo(cblacs_ctxt, &nprows, &npcols, &myprow, &mypcol);
+    // cout<<"blacs grid is inited\n";
+    allocate_work();
+    // cout<<"work array is inited\n";
+    if (isReal)
+        kernel_id = read_real_kernel();
+    else
+        kernel_id = read_complex_kernel();
+    // cout<<"kernel id is inited as "<<kernel_id<<"\n";
+    int error;
+
+    static int total_handle = 0;
+
+    elpa_init(20210430);
+
+    handle_id = ++total_handle;
+    elpa_t handle;
+
+    handle = elpa_allocate(&error);
+    NEW_ELPA_HANDLE_POOL[handle_id] = handle;
+
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "na", nFull, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "nev", nev, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "local_nrows", narows, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "local_ncols", nacols, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "nblk", nblk, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "mpi_comm_parent", MPI_Comm_c2f(comm), &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "process_row", myprow, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "process_col", mypcol, &error);
+
+    error = elpa_setup(NEW_ELPA_HANDLE_POOL[handle_id]);
+    // cout<<"elpa handle is setup\n";
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "solver", ELPA_SOLVER_2STAGE, &error);
+    this->setQR(0);
+    this->setKernel(isReal, kernel_id);
+    // cout<<"elpa kernel is setup\n";
+    this->setLoglevel(0);
+    // cout<<"log level is setup\n";
+}
+
+ELPA_Solver::ELPA_Solver(const bool isReal,
+                         const MPI_Comm comm,
+                         const int nev,
+                         const int narows,
+                         const int nacols,
+                         const int* desc,
+                         const int* otherParameter)
+{
+    this->isReal = isReal;
+    this->comm = comm;
+    this->nev = nev;
+    this->narows = narows;
+    this->nacols = nacols;
+    for (int i = 0; i < 9; ++i)
+        this->desc[i] = desc[i];
+
+    kernel_id = otherParameter[0];
+    useQR = otherParameter[1];
+    loglevel = otherParameter[2];
+
+    cblacs_ctxt = desc[1];
+    nFull = desc[2];
+    nblk = desc[4];
+    lda = desc[8];
+    MPI_Comm_rank(comm, &myid);
+    Cblacs_gridinfo(cblacs_ctxt, &nprows, &npcols, &myprow, &mypcol);
+    allocate_work();
+
+    int error;
+    static map<int, elpa_t> NEW_ELPA_HANDLE_POOL;
+    static int total_handle;
+
+    elpa_init(20210430);
+
+    handle_id = ++total_handle;
+    elpa_t handle;
+    handle = elpa_allocate(&error);
+    NEW_ELPA_HANDLE_POOL[handle_id] = handle;
+
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "na", nFull, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "nev", nev, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "local_nrows", narows, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "local_ncols", nacols, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "nblk", nblk, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "mpi_comm_parent", MPI_Comm_c2f(comm), &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "process_row", myprow, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "process_col", mypcol, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "blacs_context", cblacs_ctxt, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "solver", ELPA_SOLVER_2STAGE, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "debug", wantDebug, &error);
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "qr", useQR, &error);
+    this->setQR(useQR);
+    this->setKernel(isReal, kernel_id);
+    this->setLoglevel(loglevel);
+}
+
+void ELPA_Solver::setLoglevel(int loglevel)
+{
+    int error;
+    this->loglevel = loglevel;
+    static bool isLogfileInited = false;
+
+    if (loglevel >= 2)
+    {
+        wantDebug = 1;
+        elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "verbose", 1, &error);
+        elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "debug", wantDebug, &error);
+        if (!isLogfileInited)
+        {
+            stringstream logfilename;
+            logfilename.str("");
+            logfilename << "GenELPA_" << myid << ".log";
+            logfile.open(logfilename.str());
+            logfile << "logfile inited\n";
+            isLogfileInited = true;
+        }
+    }
+    else
+    {
+        wantDebug = 0;
+    }
+}
+
+void ELPA_Solver::setKernel(bool isReal, int kernel)
+{
+    this->kernel_id = kernel;
+    int error;
+    if (isReal)
+        elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "real_kernel", kernel, &error);
+    else
+        elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "complex_kernel", kernel, &error);
+}
+
+void ELPA_Solver::setQR(int useQR)
+{
+    this->useQR = useQR;
+    int error;
+    elpa_set_integer(NEW_ELPA_HANDLE_POOL[handle_id], "qr", useQR, &error);
+}
+
+void ELPA_Solver::exit()
+{
+    // delete[] dwork;
+    // delete[] zwork;
+    if (logfile.is_open()) // the log file is opened for loglevel >= 2, see setLoglevel()
+        logfile.close();
+    int error;
+    elpa_deallocate(NEW_ELPA_HANDLE_POOL[handle_id], &error);
+}
+
+int ELPA_Solver::read_cpuflag()
+{
+    int cpuflag = 0;
+
+    ifstream f_cpuinfo("/proc/cpuinfo");
+    string cpuinfo_line;
+    regex cpuflag_ex("flags.*");
+    regex cpuflag_avx512(".*avx512.*");
+    regex cpuflag_avx2(".*avx2.*");
+    regex cpuflag_avx(".*avx.*");
+    regex cpuflag_sse(".*sse.*");
+    while (getline(f_cpuinfo, cpuinfo_line))
+    {
+        if (regex_match(cpuinfo_line, cpuflag_ex))
+        {
+            // cout<<cpuinfo_line<<endl;
+            if (regex_match(cpuinfo_line, cpuflag_avx512))
+            {
+                cpuflag = 4;
+            }
+            else if (regex_match(cpuinfo_line, cpuflag_avx2))
+            {
+                cpuflag = 3;
+            }
+            else if (regex_match(cpuinfo_line, cpuflag_avx))
+            {
+                cpuflag = 2;
+            }
+            else if (regex_match(cpuinfo_line, cpuflag_sse))
+            {
+                cpuflag = 1;
+            }
+            break;
+        }
+    }
+    f_cpuinfo.close();
+    return cpuflag;
+}
+
+int ELPA_Solver::read_real_kernel()
+{
+    int kernel_id;
+
+    if (const char* env = getenv("ELPA_DEFAULT_real_kernel"))
+    {
+        if (strcmp(env, "ELPA_2STAGE_REAL_GENERIC_SIMPLE") == 0)
+            kernel_id = ELPA_2STAGE_REAL_GENERIC_SIMPLE;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_BGP") == 0)
+            kernel_id = ELPA_2STAGE_REAL_BGP;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_BGQ") == 0)
+            kernel_id = ELPA_2STAGE_REAL_BGQ;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SSE_ASSEMBLY") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SSE_ASSEMBLY;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SSE_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SSE_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SSE_BLOCK4") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SSE_BLOCK4;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SSE_BLOCK6") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SSE_BLOCK6;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_AVX_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_REAL_AVX_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_AVX_BLOCK4") == 0)
+            kernel_id = ELPA_2STAGE_REAL_AVX_BLOCK4;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_AVX_BLOCK6") == 0)
+            kernel_id = ELPA_2STAGE_REAL_AVX_BLOCK6;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_AVX2_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_REAL_AVX2_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_AVX2_BLOCK4") == 0)
+            kernel_id = ELPA_2STAGE_REAL_AVX2_BLOCK4;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_AVX2_BLOCK6") == 0)
+            kernel_id = ELPA_2STAGE_REAL_AVX2_BLOCK6;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_AVX512_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_REAL_AVX512_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_AVX512_BLOCK4") == 0)
+            kernel_id = ELPA_2STAGE_REAL_AVX512_BLOCK4;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_AVX512_BLOCK6") == 0)
+            kernel_id = ELPA_2STAGE_REAL_AVX512_BLOCK6;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SPARC64_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SPARC64_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SPARC64_BLOCK4") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SPARC64_BLOCK4;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SPARC64_BLOCK6") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SPARC64_BLOCK6;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_NEON_ARCH64_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_REAL_NEON_ARCH64_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_NEON_ARCH64_BLOCK4") == 0)
+            kernel_id = ELPA_2STAGE_REAL_NEON_ARCH64_BLOCK4;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_NEON_ARCH64_BLOCK6") == 0)
+            kernel_id = ELPA_2STAGE_REAL_NEON_ARCH64_BLOCK6;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_VSX_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_REAL_VSX_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_VSX_BLOCK4") == 0)
+            kernel_id = ELPA_2STAGE_REAL_VSX_BLOCK4;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_VSX_BLOCK6") == 0)
+            kernel_id = ELPA_2STAGE_REAL_VSX_BLOCK6;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SVE128_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SVE128_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SVE128_BLOCK4") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SVE128_BLOCK4;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SVE128_BLOCK6") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SVE128_BLOCK6;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SVE256_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SVE256_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SVE256_BLOCK4") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SVE256_BLOCK4;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SVE256_BLOCK6") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SVE256_BLOCK6;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SVE512_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SVE512_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SVE512_BLOCK4") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SVE512_BLOCK4;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_SVE512_BLOCK6") == 0)
+            kernel_id = ELPA_2STAGE_REAL_SVE512_BLOCK6;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_GENERIC_SIMPLE_BLOCK4") == 0)
+            kernel_id = ELPA_2STAGE_REAL_GENERIC_SIMPLE_BLOCK4;
+        else if (strcmp(env, "ELPA_2STAGE_REAL_GENERIC_SIMPLE_BLOCK6") == 0)
+            kernel_id = ELPA_2STAGE_REAL_GENERIC_SIMPLE_BLOCK6;
+        else
+            kernel_id = ELPA_2STAGE_REAL_GENERIC;
+    }
+    else
+    {
+        int cpuflag = read_cpuflag();
+        switch (cpuflag)
+        {
+        case 4:
+            kernel_id = ELPA_2STAGE_REAL_AVX512_BLOCK4;
+            break;
+        case 3:
+            kernel_id = ELPA_2STAGE_REAL_AVX2_BLOCK2;
+            break;
+        case 2:
+            kernel_id = ELPA_2STAGE_REAL_AVX_BLOCK2;
+            break;
+        case 1:
+            kernel_id = ELPA_2STAGE_REAL_SSE_BLOCK2;
+            break;
+        default:
+            kernel_id = ELPA_2STAGE_REAL_GENERIC_SIMPLE_BLOCK6;
+            break;
+        }
+    }
+    return kernel_id;
+}
+
+int ELPA_Solver::read_complex_kernel()
+{
+    int kernel_id;
+    if (const char* env = getenv("ELPA_DEFAULT_complex_kernel"))
+    {
+        if (strcmp(env, "ELPA_2STAGE_COMPLEX_GENERIC_SIMPLE") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_GENERIC_SIMPLE;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_BGP") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_BGP;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_BGQ") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_BGQ;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_SSE_ASSEMBLY") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_SSE_ASSEMBLY;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_SSE_BLOCK1") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_SSE_BLOCK1;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_SSE_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_SSE_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_AVX_BLOCK1") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_AVX_BLOCK1;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_AVX_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_AVX_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_AVX2_BLOCK1") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_AVX2_BLOCK1;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_AVX2_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_AVX2_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_AVX512_BLOCK1") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_AVX512_BLOCK1;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_AVX512_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_AVX512_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_SVE128_BLOCK1") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_SVE128_BLOCK1;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_SVE128_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_SVE128_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_SVE256_BLOCK1") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_SVE256_BLOCK1;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_SVE256_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_SVE256_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_SVE512_BLOCK1") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_SVE512_BLOCK1;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_SVE512_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_SVE512_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_NEON_ARCH64_BLOCK1") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_NEON_ARCH64_BLOCK1;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_NEON_ARCH64_BLOCK2") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_NEON_ARCH64_BLOCK2;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_NVIDIA_GPU") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_NVIDIA_GPU;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_AMD_GPU") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_AMD_GPU;
+        else if (strcmp(env, "ELPA_2STAGE_COMPLEX_INTEL_GPU") == 0)
+            kernel_id = ELPA_2STAGE_COMPLEX_INTEL_GPU;
+        else
+            kernel_id = ELPA_2STAGE_COMPLEX_GENERIC;
+    }
+    else
+    {
+        int cpuflag = read_cpuflag();
+        switch (cpuflag)
+        {
+        case 4:
+            kernel_id = ELPA_2STAGE_COMPLEX_AVX512_BLOCK2;
+            break;
+        case 3:
+            kernel_id = ELPA_2STAGE_COMPLEX_AVX2_BLOCK2;
+            break;
+        case 2:
+            kernel_id = ELPA_2STAGE_COMPLEX_AVX_BLOCK2;
+            break;
+        case 1:
+            kernel_id = ELPA_2STAGE_COMPLEX_SSE_BLOCK2;
+            break;
+        default:
+            kernel_id = ELPA_2STAGE_COMPLEX_GENERIC_SIMPLE;
+            break;
+        }
+    }
+    return kernel_id;
+}
+
+int ELPA_Solver::allocate_work()
+{
+    unsigned long nloc = narows * nacols; // local size
+    unsigned long maxloc; // maximum local size
+    MPI_Allreduce(&nloc, &maxloc, 1, MPI_UNSIGNED_LONG, MPI_MAX, comm);
+    if (isReal)
+        dwork.resize(maxloc);
+    else
+        zwork.resize(maxloc);
+    return 0;
+}
+
+void ELPA_Solver::timer(int myid, const char function[], const char step[], double& t0)
+{
+    double t1;
+    if (t0 < 0) // t0 < 0 means this is the init call before the function
+    {
+        t0 = MPI_Wtime();
+        logfile << "DEBUG: Process " << myid << " Call " << function << endl;
+    }
+    else
+    {
+        t1 = MPI_Wtime();
+        logfile << "DEBUG: Process " << myid << " Step " << step << " " << function << " time: " << t1 - t0 << " s"
+                << endl;
+    }
+}
+
+void ELPA_Solver::outputParameters()
+{
+    logfile << "myid " << myid << ": comm id(in FORTRAN):" << MPI_Comm_c2f(comm) << endl;
+    logfile << "myid " << myid << ": nprows: " << nprows << " npcols: " << npcols << endl;
+    logfile << "myid " << myid << ": myprow: " << myprow << " mypcol: " << mypcol << endl;
+    logfile << "myid " << myid << ": nFull: " << nFull << " nev: " << nev << endl;
+    logfile << "myid " << myid << ": narows: " << narows << " nacols: " << nacols << endl;
+    logfile << "myid " << myid << ": blacs parameters setting" << endl;
+    logfile << "myid " << myid << ": blacs ctxt:" << cblacs_ctxt << endl;
+    logfile << "myid " << myid << ": desc: ";
+    for (int i = 0; i < 9; ++i)
+        logfile << desc[i] << " ";
+    logfile << endl;
+    logfile << "myid " << myid << ": nblk: " << nblk << " lda: " << lda << endl;
+    logfile << "myid " << myid << ": useQR: " << useQR << " kernel:" << kernel_id << endl;
+    ;
+    logfile << "myid " << myid << ": wantDebug: " << wantDebug << " loglevel: " << loglevel << endl;
+}
diff --git a/source/module_hsolver/genelpa/elpa_new.h b/source/module_hsolver/genelpa/elpa_new.h
new file mode 100644
index 0000000000..a7743d2fe1
--- /dev/null
+++ b/source/module_hsolver/genelpa/elpa_new.h
@@ -0,0 +1,29 @@
+#pragma once
+
+extern "C"
+{
+#include <elpa/elpa_version.h>
+#include <limits.h>
+
+    struct elpa_struct;
+    typedef struct elpa_struct *elpa_t;
+
+    struct elpa_autotune_struct;
+    typedef struct elpa_autotune_struct *elpa_autotune_t;
+
+#include <elpa/elpa_constants.h>
+#include <elpa/elpa_generated_c_api.h>
+// ELPA only provides a C interface header, which causes an inconsistency between
+// the complex types of C99 (e.g. double complex) and C++11 (std::complex).
+// Thus, we have to define a wrapper for complex around the C API
+// for compatibility.
+#define complex _Complex
+#include <elpa/elpa_generated.h>
+    // #include <elpa/elpa_generic.h>
+#undef complex
+    const char *elpa_strerr(int elpa_error);
+}
+
+#define complex _Complex
+#include "elpa_generic.hpp" // This is a wrapper for `elpa/elpa_generic.h`.
+#undef complex
\ No newline at end of file
diff --git a/source/module_hsolver/genelpa/elpa_new_complex.cpp b/source/module_hsolver/genelpa/elpa_new_complex.cpp
new file mode 100644
index 0000000000..b9cb9de375
--- /dev/null
+++ b/source/module_hsolver/genelpa/elpa_new_complex.cpp
@@ -0,0 +1,454 @@
+#include <complex>
+#include <map>
+#include <regex>
+#include <fstream>
+#include <cfloat>
+#include <cstring>
+#include <mpi.h>
+
+#include "elpa_new.h"
+#include "elpa_solver.h"
+
+#include "my_math.hpp"
+#include "utils.h"
+
+using namespace std;
+
+extern map<int, elpa_t> NEW_ELPA_HANDLE_POOL;
+
+int ELPA_Solver::eigenvector(complex<double>* A, double* EigenValue, complex<double>* EigenVector)
+{
+    int info;
+    int allinfo;
+    double t;
+
+    if((loglevel>0 && myid==0) || loglevel>1)
+    {
+        t=-1;
+        timer(myid, "elpa_eigenvectors_dc", "1", t);
+    }
+    elpa_eigenvectors(NEW_ELPA_HANDLE_POOL[handle_id],
+                      reinterpret_cast<double _Complex*>(A),
+                      EigenValue, reinterpret_cast<double _Complex*>(EigenVector),
+                      &info);
+    if((loglevel>0 && myid==0) || loglevel>1)
+    {
+        timer(myid, "elpa_eigenvectors_dc", "1", t);
+    }
+    MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+    return allinfo;
+}
+
+int ELPA_Solver::generalized_eigenvector(complex<double>* A, complex<double>* B, int& DecomposedState,
+                                         double* EigenValue, complex<double>* EigenVector)
+{
+    int info, allinfo;
+    double t;
+
+    if((loglevel>0 && myid==0) || loglevel>1)
+    {
+        t=-1;
+        timer(myid, "decomposeRightMatrix", "1", t);
+    }
+    if(DecomposedState==0) // B is not decomposed
+        allinfo=decomposeRightMatrix(B, EigenValue, EigenVector, DecomposedState);
+    else
+        allinfo=0;
+
+    if((loglevel>0 && myid==0) || loglevel>1)
+    {
+        timer(myid, "decomposeRightMatrix", "1", t);
+    }
+    if(allinfo != 0)
+        return allinfo;
+
+    // transform A to A~
+    if((loglevel>0 && myid==0) || loglevel>1)
+    {
+        t=-1;
+        timer(myid, "transform A to A~", "2", t);
+    }
+    if(DecomposedState == 1 || DecomposedState == 2)
+    {
+        // calculate A*U^-1, put to work
+        if(loglevel>1)
+        {
+            t=-1;
+            timer(myid, "A*U^-1", "2.1a", t);
+        }
+        Cpzgemm('C', 'N', nFull, 1.0, A, B, 0.0, zwork.data(), desc);
+        if(loglevel>1)
+        {
+            timer(myid, "A*U^-1", "2.1a", t);
+        }
+
+        // calculate U^-H*(A*U^-1), put to A
+        if(loglevel>1)
+        {
+            t=-1;
+            timer(myid, "U^-T*(A*U^-1)", "2.2a", t);
+        }
+        Cpzgemm('C', 'N', nFull, 1.0, B, zwork.data(), 0.0, A, desc);
+        if(loglevel>1)
+        {
+            timer(myid, "U^-T*(A*U^-1)", "2.2a", t);
+        }
+    }
+    else
+    {
+        // calculate B*A^H and put to work
+        if(loglevel>1)
+        {
+            t=-1;
+            timer(myid, "B*A^T", "2.1b", t);
+        }
+        Cpzgemm('N', 'C', nFull, 1.0, B, A, 0.0, zwork.data(), desc);
+        if(loglevel>1)
+        {
+            timer(myid, "B*A^T", "2.1b", t);
+        }
+        // calculate B*work^H and put to A -- the original A*x=v*B*x is transformed to A~*x'=v*x'
+        if(loglevel>1)
+        {
+            t=-1;
+            timer(myid, "B*(B*A^T)^T", "2.2b", t);
+        }
+        Cpzgemm('N', 'C', nFull, 1.0, B, zwork.data(), 0.0, A, desc);
+        if(loglevel>1)
+        {
+            timer(myid, "B*(B*A^T)^T", "2.2b", t);
+        }
+    }
+    if((loglevel>0 && myid==0) || loglevel>1)
+    {
+        timer(myid, "transform A to A~", "2", t);
+    }
+
+    // calculate the eigenvalue and eigenvector of A~
+    if((loglevel>0 && myid==0) || loglevel>1)
+    {
+        t=-1;
+        timer(myid, "elpa_eigenvectors", "3", t);
+    }
+    if(loglevel>2) saveMatrix("A_tilde.dat", nFull, A, desc, cblacs_ctxt);
+    //elpa_eigenvectors_all_host_arrays_dc(NEW_ELPA_HANDLE_POOL[handle_id], reinterpret_cast<double _Complex*>(A),
+    //                     EigenValue, reinterpret_cast<double _Complex*>(EigenVector), &info);
+    info=eigenvector(A, EigenValue, EigenVector);
+    if((loglevel>0 && myid==0) || loglevel>1)
+    {
+        timer(myid, "elpa_eigenvectors", "3", t);
+    }
+    MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+    if(loglevel>2) saveMatrix("EigenVector_tilde.dat", nFull, EigenVector, desc, cblacs_ctxt);
+
+    if((loglevel>0 && myid==0) || loglevel>1)
+    {
+        t=-1;
+        timer(myid, "composeEigenVector", "4", t);
+    }
+    // transform eigenvector c~ to original eigenvector c
+    allinfo=composeEigenVector(DecomposedState, B, EigenVector);
+    if((loglevel>0 && myid==0) || loglevel>1)
+    {
+        timer(myid, "composeEigenVector", "4", t);
+    }
+    return allinfo;
+}
+
+int ELPA_Solver::decomposeRightMatrix(complex<double>* B, double* EigenValue, complex<double>* EigenVector, int& DecomposedState)
+{
+    double _Complex* b = reinterpret_cast<double _Complex*>(B);
+    double _Complex* q = reinterpret_cast<double _Complex*>(EigenVector);
+
+    int info=0;
+    int allinfo=0;
+    double t;
+
+    // first try Cholesky decomposition
+    if(nFull<CHOLESKY_CRITICAL_SIZE) // use pzpotrf for small matrices
+    {
+        DecomposedState=1;
+        if(loglevel>1)
+        {
+            t=-1;
+            timer(myid, "pzpotrf_", "1", t);
+        }
+        info=Cpzpotrf('U', nFull, B, desc); // keep the status so the fallback below can be triggered
+        if(loglevel>1)
+        {
+            timer(myid, "pzpotrf_", "1", t);
+        }
+        MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+        if(allinfo != 0) // if pzpotrf fails, try elpa_cholesky_dc
+        {
+            DecomposedState=2;
+            if(loglevel>1)
+            {
+                t=-1;
+                timer(myid, "elpa_cholesky_dc", "2", t);
+            }
+            elpa_cholesky_dc(NEW_ELPA_HANDLE_POOL[handle_id], b, &info);
+            if(loglevel>1)
+            {
+                timer(myid, "elpa_cholesky_dc", "2", t);
+            }
+            MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+        }
+    } else
+    {
+        DecomposedState=2;
+        if(loglevel>1)
+        {
+            t=-1;
+            timer(myid, "elpa_cholesky_dc", "1", t);
+        }
+        elpa_cholesky_dc(NEW_ELPA_HANDLE_POOL[handle_id], b, &info);
+        if(loglevel>1)
+        {
+            timer(myid, "elpa_cholesky_dc", "1", t);
+        }
+        MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+        if(allinfo != 0)
+        {
+            DecomposedState=1;
+            if(loglevel>1)
+            {
+                t=-1;
+                timer(myid, "pzpotrf_", "2", t);
+            }
+            info=Cpzpotrf('U', nFull, B, desc); // keep the status for the MPI_Allreduce below
+            if(loglevel>1)
+            {
+                timer(myid, "pzpotrf_", "2", t);
+            }
+            MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+        }
+    }
+
+    if(allinfo==0) // calculate U^{-1}
+    {
+        if(loglevel>1)
+        {
+            t=-1;
+            timer(myid, "clear low triangle", "1", t);
+        }
+        for(int j=0; j<nacols; ++j)
+        {
+            int jGlobal=globalIndex(j, nblk, npcols, mypcol);
+            for(int i=0; i<narows; ++i)
+            {
+                int iGlobal=globalIndex(i, nblk, nprows, myprow);
+                if(iGlobal>jGlobal) B[i+j*narows]=0;
+            }
+        }
+        if(loglevel>1)
+        {
+            timer(myid, "clear low triangle", "1", t);
+        }
+        if(loglevel>2) saveMatrix("U.dat", nFull, B, desc, cblacs_ctxt);
+        if(loglevel>1)
+        {
+            t=-1;
+            timer(myid, "invert U", "1", t);
+        }
+        elpa_invert_trm_dc(NEW_ELPA_HANDLE_POOL[handle_id], b, &info);
+        if(loglevel>1)
+        {
+            timer(myid, "invert U", "1", t);
+        }
+        if(loglevel>2) saveMatrix("U_inv.dat", nFull, B, desc, cblacs_ctxt);
+    } else {
+        // if the Cholesky decomposition failed, fall back to diagonalization:
+        // calculate B^{-1/2}_{i,j}=\sum_k q_{i,k}*ev_k^{-1/2}*conj(q_{j,k}) and put it to B
+        DecomposedState=3;
+        if(loglevel>1)
+        {
+            t=-1;
+            timer(myid, "calculate eigenvalue and eigenvector of B", "1", t);
+        }
+        //elpa_eigenvectors_all_host_arrays_dc(NEW_ELPA_HANDLE_POOL[handle_id], b,
+        //                     EigenValue, q, &info);
+        info=eigenvector(B, EigenValue, EigenVector);
+        if(loglevel>1)
+        {
+            timer(myid, "calculate eigenvalue and eigenvector of B", "1", t);
+        }
+        MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+        // calculate q*ev^{-1/2} and put to work (eigenvalues <= DBL_MIN are zeroed out)
+        for(int i=0; i<nacols; ++i)
+        {
+            int eidx=globalIndex(i, nblk, npcols, mypcol);
+            //double ev_sqrt=1.0/sqrt(ev[eidx]);
+            double ev_sqrt=EigenValue[eidx]>DBL_MIN?1.0/sqrt(EigenValue[eidx]):0;
+            for(int j=0; j<narows; ++j)
+                zwork[i*lda+j]=EigenVector[i*lda+j]*ev_sqrt;
+        }
+
+        // calculate qevq=qev*q^H, put to B, which then holds B^{-1/2}
+        if(loglevel>1)
+        {
+            t=-1;
+            timer(myid, "qevq=qev*q^T", "2", t);
+        }
+        Cpzgemm('N', 'C', nFull, 1.0, zwork.data(), EigenVector, 0.0, B, desc);
+        if(loglevel>1)
+        {
+            timer(myid, "qevq=qev*q^T", "2", t);
+        }
+    }
+    return allinfo;
+}
+
+int ELPA_Solver::composeEigenVector(int DecomposedState, complex<double>* B, complex<double>* EigenVector)
+{
+    double t;
+    if(DecomposedState==1 || DecomposedState==2)
+    {
+        // transform the eigenvectors back to the original generalized problem: compute U^-1*q and put it to q
+        if(loglevel>1)
+        {
+            t=-1;
+            timer(myid, "Cpztrmm", "1", t);
+        }
+        Cpztrmm('L', 'U', 'N', 'N', nFull, nev, 1.0, B, EigenVector, desc);
+        if(loglevel>1)
+        {
+            timer(myid, "Cpztrmm", "1", t);
+        }
+    } else {
+        // transform the eigenvectors back to the original generalized problem: compute B^H*q and put it to q
+        if(loglevel>1)
+        {
+            t=-1;
+            timer(myid, "Cpzgemm", "1", t);
+        }
+        Cpzgemm('C', 'N', nFull, nev, nFull, 1.0, B, zwork.data(), 0.0, EigenVector, desc);
+        if(loglevel>1)
+        {
+            timer(myid, "Cpzgemm", "1", t);
+        }
+    }
+    return 0;
+}
+
+// calculate the error
+// $ \ket{ \delta \psi_i } = (H - \epsilon_i)\ket{\psi_i} $
+// $ \delta_i = \braket{ \delta \psi_i | \delta \psi_i } $
+//
+// V: eigenvector matrix
+// D: Diagonal matrix of eigenvalue
+// maxError: maximum absolute value of error
+// meanError: mean absolute value of error
+void ELPA_Solver::verify(complex<double>* A, double* EigenValue, complex<double>* EigenVector,
+                         double &maxError, double &meanError)
+{
+    complex<double>* V=EigenVector;
+    const int naloc=narows*nacols;
+    complex<double>* D=new complex<double>[naloc];
+    complex<double>* R=zwork.data();
+
+    for(int i=0; i<naloc; ++i)
+        D[i]=0;
+
+    for(int i=0; i<nFull; ++i)
+    {
+        int localRow, localCol;
+        int localProcRow, localProcCol;
+
+        localRow=localIndex(i, nblk, nprows, localProcRow);
+        if(myprow==localProcRow)
+        {
+            localCol=localIndex(i, nblk, npcols, localProcCol);
+            if(mypcol==localProcCol)
+            {
+                int idx = localRow + localCol*narows;
+                D[idx]=EigenValue[i];
+            }
+        }
+    }
+
+    // R=V*D
+    Cpzhemm('R', 'U', nFull, 1.0, D, V, 0.0, R, desc);
+    if(loglevel>2) saveMatrix("VD.dat", nFull, R, desc, cblacs_ctxt);
+    // R=A*V-V*D=A*V-R
+    Cpzhemm('L', 'U', nFull, 1.0, A, V, -1.0, R, desc);
+    if(loglevel>2) saveMatrix("AV-VD.dat", nFull, R, desc, cblacs_ctxt);
+    // calculate the maximum and mean value of sum_i{R(:,i)*R(:,i)}
+    double sumError=0;
+    maxError=0;
+    for(int i=1; i<=nev; ++i)
+    {
+        complex<double> E;
+        Cpzdotc(nFull, E, R, 1, i, 1,
+                         R, 1, i, 1, desc);
+        double abs_E=std::abs(E);
+        sumError+=abs_E;
+        maxError=std::max(maxError, abs_E);
+    }
+    meanError=sumError/nFull;
+    delete[] D;
+}
+
+// calculate the error
+// $ \ket{ \delta \psi_i } = (H - \epsilon_i S)\ket{\psi_i} $
+// $ \delta_i = \braket{ \delta \psi_i | \delta \psi_i } $
+//
+// V: eigenvector matrix
+// D: Diagonal matrix of eigenvalue
+// maxError: maximum absolute value of error
+// meanError: mean absolute value of error
+void ELPA_Solver::verify(complex<double>* A, complex<double>* B,
+                        double* EigenValue, complex<double>* EigenVector,
+                        double &maxError, double &meanError)
+{
+    complex<double>* V=EigenVector;
+    const int naloc=narows*nacols;
+    complex<double>* D=new complex<double>[naloc];
+    complex<double>* R=new complex<double>[naloc];
+
+    for(int i=0; i<naloc; ++i)
+        D[i]=0;
+
+    for(int i=0; i<nFull; ++i)
+    {
+        int localRow, localCol;
+        int localProcRow, localProcCol;
+
+        localRow=localIndex(i, nblk, nprows, localProcRow);
+        if(myprow==localProcRow)
+        {
+            localCol=localIndex(i, nblk, npcols, localProcCol);
+            if(mypcol==localProcCol)
+            {
+                int idx = localRow + localCol*narows;
+                D[idx]=EigenValue[i];
+            }
+        }
+    }
+
+    // zwork=B*V
+    Cpzhemm('L', 'U', nFull, 1.0, B, V, 0.0, zwork.data(), desc);
+    if(loglevel>2) saveMatrix("BV.dat", nFull, zwork.data(), desc, cblacs_ctxt);
+    // R=B*V*D=zwork*D
+    Cpzhemm('R', 'U', nFull, 1.0, D, zwork.data(), 0.0, R, desc);
+    if(loglevel>2) saveMatrix("BVD.dat", nFull, R, desc, cblacs_ctxt);
+    // R=A*V-B*V*D=A*V-R
+    Cpzhemm('L', 'U', nFull, 1.0, A, V, -1.0, R, desc);
+    if(loglevel>2) saveMatrix("AV-BVD.dat", nFull, R, desc, cblacs_ctxt);
+    // calculate the maximum and mean value of sum_i{R(:,i)*R(:,i)}
+    double sumError=0;
+    maxError=0;
+    for(int i=1; i<=nev; ++i)
+    {
+        complex<double> E;
+        Cpzdotc(nFull, E, R, 1, i, 1,
+                         R, 1, i, 1, desc);
+        double abs_E=std::abs(E);
+        sumError+=abs_E;
+        maxError=std::max(maxError, abs_E);
+    }
+    meanError=sumError/nFull;
+
+    delete[] D;
+    delete[] R;
+}
diff --git a/source/module_hsolver/genelpa/elpa_new_real.cpp b/source/module_hsolver/genelpa/elpa_new_real.cpp
new file mode 100644
index 0000000000..d6b606b007
--- /dev/null
+++ b/source/module_hsolver/genelpa/elpa_new_real.cpp
@@ -0,0 +1,458 @@
+#include "elpa_new.h"
+#include "elpa_solver.h"
+#include "my_math.hpp"
+#include "utils.h"
+
+#include <cfloat>
+#include <complex>
+#include <cstring>
+#include <fstream>
+#include <map>
+#include <mpi.h>
+#include <regex>
+
+using namespace std;
+extern map<int, elpa_t> NEW_ELPA_HANDLE_POOL;
+
+int ELPA_Solver::eigenvector(double* A, double* EigenValue, double* EigenVector)
+{
+    int info;
+    double t;
+
+    if (loglevel > 0 && myid == 0)
+    {
+        t = -1;
+        timer(myid, "elpa_eigenvectors_all_host_arrays_d", "1", t);
+    }
+    elpa_eigenvectors(NEW_ELPA_HANDLE_POOL[handle_id], A, EigenValue, EigenVector, &info);
+    if (loglevel > 0 && myid == 0)
+    {
+        timer(myid, "elpa_eigenvectors_all_host_arrays_d", "1", t);
+    }
+    return info;
+}
+
+int ELPA_Solver::generalized_eigenvector(double* A,
+                                         double* B,
+                                         int& DecomposedState,
+                                         double* EigenValue,
+                                         double* EigenVector)
+{
+    int info, allinfo;
+    double t;
+
+    if (loglevel > 0 && myid == 0)
+    {
+        t = -1;
+        timer(myid, "decomposeRightMatrix", "1", t);
+    }
+    if (DecomposedState == 0)
+        allinfo = decomposeRightMatrix(B, EigenValue, EigenVector, DecomposedState);
+    else
+        allinfo = 0;
+    if (loglevel > 0 && myid == 0)
+    {
+        timer(myid, "decomposeRightMatrix", "1", t);
+    }
+    if (allinfo != 0)
+        return allinfo;
+
+    // transform A to A~
+    if ((loglevel > 0 && myid == 0) || loglevel > 1)
+    {
+        t = -1;
+        timer(myid, "transform A to A~", "2", t);
+    }
+    if (DecomposedState == 1 || DecomposedState == 2)
+    {
+        // calculate A*U^-1, put to work
+        if (loglevel > 1)
+        {
+            t = -1;
+            timer(myid, "A*U^-1", "2", t);
+        }
+        Cpdgemm('T', 'N', nFull, 1.0, A, B, 0.0, dwork.data(), desc);
+        if (loglevel > 1)
+        {
+            timer(myid, "A*U^-1", "2", t);
+        }
+
+        // calculate U^-T*(A*U^-1), put to A
+        if (loglevel > 1)
+        {
+            t = -1;
+            timer(myid, "U^-T^(A*U^-1)", "3", t);
+        }
+        Cpdgemm('T', 'N', nFull, 1.0, B, dwork.data(), 0.0, A, desc);
+        if (loglevel > 1)
+        {
+            timer(myid, "U^-T^(A*U^-1)", "3", t);
+        }
+    }
+    else
+    {
+        // calculate B*A^T and put to work
+        if (loglevel > 1)
+        {
+            t = -1;
+            timer(myid, "B*A^T", "2", t);
+        }
+        Cpdgemm('N', 'T', nFull, 1.0, B, A, 0.0, dwork.data(), desc);
+        if (loglevel > 1)
+        {
+            timer(myid, "B*A^T", "2", t);
+        }
+        // calculate B*work^T = B*(B*A^T)^T and put to A -- the original A*x=v*B*x is transformed to A~*x'=v*x'
+        if (loglevel > 1)
+        {
+            t = -1;
+            timer(myid, "B*work^T = B*(B*A^T)^T", "3", t);
+        }
+        Cpdgemm('N', 'T', nFull, 1.0, B, dwork.data(), 0.0, A, desc);
+        if (loglevel > 1)
+        {
+            timer(myid, "B*work^T = B*(B*A^T)^T", "3", t);
+        }
+    }
+    if ((loglevel > 0 && myid == 0) || loglevel > 1)
+    {
+        timer(myid, "transform A to A~", "2", t);
+    }
+
+    if ((loglevel > 0 && myid == 0) || loglevel > 1)
+    {
+        t = -1;
+        timer(myid, "elpa_eigenvectors", "2", t);
+    }
+    if (loglevel > 2)
+        saveMatrix("A_tilde.dat", nFull, A, desc, cblacs_ctxt);
+    info = eigenvector(A, EigenValue, EigenVector);
+
+    if ((loglevel > 0 && myid == 0) || loglevel > 1)
+    {
+        timer(myid, "elpa_eigenvectors", "2", t);
+    }
+
+    MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+    if (loglevel > 2)
+        saveMatrix("EigenVector_tilde.dat", nFull, EigenVector, desc, cblacs_ctxt);
+
+    if (loglevel > 0 && myid == 0)
+    {
+        t = -1;
+        timer(myid, "composeEigenVector", "3", t);
+    }
+    allinfo = (allinfo != 0) ? allinfo : composeEigenVector(DecomposedState, B, EigenVector); // keep an eigenvector() failure code
+    if (loglevel > 0 && myid == 0)
+    {
+        timer(myid, "composeEigenVector", "3", t);
+    }
+    return allinfo;
+}
+
+// calculate the Cholesky factorization of matrix B
+// B = U^T * U
+// and calculate the inverse: U^{-1}
+// input:
+//      B: the right-hand-side matrix of the generalized eigenvalue equation
+// output:
+//      DecomposedState: the method used to decompose the right-hand-side matrix
+//                  1 or 2: Cholesky decomposition, B=U^T*U (1: pdpotrf, 2: elpa_cholesky_d)
+//                  3: diagonalization, used if Cholesky decomposition failed
+//      B: decomposed right-hand-side matrix
+//           when DecomposedState is 1 or 2, B is U^{-1}
+//           when DecomposedState is 3, B is B^{-1/2}
+int ELPA_Solver::decomposeRightMatrix(double* B, double* EigenValue, double* EigenVector, int& DecomposedState)
+{
+    int info = 0;
+    int allinfo = 0;
+    double t;
+
+    // first try cholesky decomposing
+    if (nFull < CHOLESKY_CRITICAL_SIZE)
+    {
+        DecomposedState = 1;
+        if (loglevel > 1)
+        {
+            t = -1;
+            timer(myid, "pdpotrf_", "1", t);
+        }
+        info = Cpdpotrf('U', nFull, B, desc);
+        if (loglevel > 1)
+        {
+            timer(myid, "pdpotrf_", "1", t);
+        }
+        MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+        if (allinfo != 0) // pdpotrf failed, try elpa_cholesky_d
+        {
+            DecomposedState = 2;
+            if (loglevel > 1)
+            {
+                t = -1;
+                timer(myid, "elpa_cholesky_d", "2", t);
+            }
+            elpa_cholesky_d(NEW_ELPA_HANDLE_POOL[handle_id], B, &info);
+            if (loglevel > 1)
+            {
+                timer(myid, "elpa_cholesky_d", "2", t);
+            }
+            MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+        }
+    }
+    else
+    {
+        DecomposedState = 2;
+        if (loglevel > 1)
+        {
+            t = -1;
+            timer(myid, "elpa_cholesky_d", "1", t);
+        }
+        elpa_cholesky_d(NEW_ELPA_HANDLE_POOL[handle_id], B, &info);
+        if (loglevel > 1)
+        {
+            timer(myid, "elpa_cholesky_d", "1", t);
+        }
+        MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+        if (allinfo != 0)
+        {
+            DecomposedState = 1;
+            if (loglevel > 1)
+            {
+                t = -1;
+                timer(myid, "pdpotrf_", "2", t);
+            }
+            info = Cpdpotrf('U', nFull, B, desc);
+            if (loglevel > 1)
+            {
+                timer(myid, "pdpotrf_", "2", t);
+            }
+            MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+        }
+    }
+
+    if (allinfo == 0) // calculate U^{-1}
+    {
+        // clear low triangle
+        if (loglevel > 1)
+        {
+            t = -1;
+            timer(myid, "clear low triangle", "1", t);
+        }
+        for (int j = 0; j < nacols; ++j)
+        {
+            int jGlobal = globalIndex(j, nblk, npcols, mypcol);
+            for (int i = 0; i < narows; ++i)
+            {
+                int iGlobal = globalIndex(i, nblk, nprows, myprow);
+                if (iGlobal > jGlobal)
+                    B[i + j * narows] = 0;
+            }
+        }
+        if (loglevel > 1)
+        {
+            timer(myid, "clear low triangle", "1", t);
+        }
+        if (loglevel > 2)
+            saveMatrix("U.dat", nFull, B, desc, cblacs_ctxt);
+        // calculate the inverse U^{-1}
+        if (loglevel > 1)
+        {
+            t = -1;
+            timer(myid, "invert U", "1", t);
+        }
+        elpa_invert_trm_d(NEW_ELPA_HANDLE_POOL[handle_id], B, &info);
+        if (loglevel > 1)
+        {
+            timer(myid, "invert U", "1", t);
+        }
+        if (loglevel > 2)
+            saveMatrix("U_inv.dat", nFull, B, desc, cblacs_ctxt);
+    }
+    else
+    {
+        // if Cholesky decomposition failed, fall back to diagonalization
+        // calculate B^{-1/2}_{i,j}=\sum_k q_{i,k}*ev_k^{-1/2}*q_{j,k} and store it in B, which then holds B^{-1/2}
+        DecomposedState = 3;
+        if (loglevel > 1)
+        {
+            t = -1;
+            timer(myid, "calculate eigenvalue and eigenvector of B", "1", t);
+        }
+        // elpa_eigenvectors_all_host_arrays_d(NEW_ELPA_HANDLE_POOL[handle_id], B, EigenValue, EigenVector, &info);
+        info = eigenvector(B, EigenValue, EigenVector);
+        if (loglevel > 1)
+        {
+            timer(myid, "calculate eigenvalue and eigenvector of B", "1", t);
+        }
+        MPI_Allreduce(&info, &allinfo, 1, MPI_INT, MPI_MAX, comm);
+
+        // calculate q*ev^{-1/2} and put to work
+        for (int i = 0; i < nacols; ++i)
+        {
+            int eidx = globalIndex(i, nblk, npcols, mypcol);
+            // double ev_sqrt=1.0/sqrt(ev[eidx]);
+            double ev_sqrt = EigenValue[eidx] > DBL_MIN ? 1.0 / sqrt(EigenValue[eidx]) : 0;
+            for (int j = 0; j < narows; ++j)
+                dwork[i * lda + j] = EigenVector[i * lda + j] * ev_sqrt;
+        }
+
+        // calculate work*q^T = q*ev^{-1/2}*q^T and store it in B, which then holds B^{-1/2}
+        if (loglevel > 1)
+        {
+            t = -1;
+            timer(myid, "qevq=qev*q^T", "2", t);
+        }
+        Cpdgemm('N', 'T', nFull, 1.0, dwork.data(), EigenVector, 0.0, B, desc);
+        if (loglevel > 1)
+        {
+            timer(myid, "qevq=qev*q^T", "2", t);
+        }
+    }
+    return allinfo;
+}
+
+int ELPA_Solver::composeEigenVector(int DecomposedState, double* B, double* EigenVector)
+{
+    double t;
+    if (DecomposedState == 1 || DecomposedState == 2)
+    {
+        // transform the eigenvectors back to the original generalized problem: q = U^-1*q
+        if (loglevel > 1)
+        {
+            t = -1;
+            timer(myid, "Cpdtrmm", "1", t);
+        }
+        Cpdtrmm('L', 'U', 'N', 'N', nFull, nev, 1.0, B, EigenVector, desc);
+        if (loglevel > 1)
+        {
+            timer(myid, "Cpdtrmm", "1", t);
+        }
+    }
+    else
+    {
+        // transform the eigenvectors back to the original generalized problem: multiply by B^T (B holds B^{-1/2}) and store in q
+        if (loglevel > 1)
+        {
+            t = -1;
+            timer(myid, "Cpdgemm", "1", t);
+        }
+        Cpdgemm('T', 'N', nFull, 1.0, B, dwork.data(), 0.0, EigenVector, desc);
+        if (loglevel > 1)
+        {
+            timer(myid, "Cpdgemm", "1", t);
+        }
+    }
+    return 0;
+}
+
+// calculate error of sum_i{R(:,i)*R(:,i)}, where R = A*V - V*D
+// V: eigenvector matrix
+// D: diagonal matrix of eigenvalues
+// maxError: maximum error
+// meanError: mean error
+void ELPA_Solver::verify(double* A, double* EigenValue, double* EigenVector, double& maxError, double& meanError)
+{
+    double* V = EigenVector;
+    const int naloc = narows * nacols;
+    double* D = new double[naloc];
+    double* R = dwork.data();
+
+    for (int i = 0; i < naloc; ++i)
+        D[i] = 0;
+
+    for (int i = 0; i < nFull; ++i)
+    {
+        int localRow, localCol;
+        int localProcRow, localProcCol;
+
+        localRow = localIndex(i, nblk, nprows, localProcRow);
+        if (myprow == localProcRow)
+        {
+            localCol = localIndex(i, nblk, npcols, localProcCol);
+            if (mypcol == localProcCol)
+            {
+                int idx = localRow + localCol * narows;
+                D[idx] = EigenValue[i];
+            }
+        }
+    }
+
+    // R=V*D
+    Cpdsymm('R', 'U', nFull, nev, 1.0, D, V, 0.0, R, desc);
+    // R=A*V-V*D=A*V-R
+    Cpdsymm('L', 'U', nFull, nev, 1.0, A, V, -1.0, R, desc);
+    // calculate the maximum and mean value of sum_i{R(:,i)*R(:,i)}
+    double sumError = 0;
+    maxError = 0;
+    for (int i = 1; i <= nev; ++i)
+    {
+        double E;
+        Cpddot(nFull, E, R, 1, i, 1, R, 1, i, 1, desc);
+        // printf("myid: %d, i: %d, E: %lf\n", myid, i, E);
+        sumError += E;
+        maxError = maxError > E ? maxError : E;
+    }
+    meanError = sumError / nFull;
+    // global mean and max Error
+    delete[] D;
+}
+
+// calculate remains of A*V - B*V*D
+// V: eigenvector matrix
+// D: diagonal matrix of eigenvalues
+// maxError: maximum absolute value of error
+// meanError: mean absolute value of error
+void ELPA_Solver::verify(double* A,
+                         double* B,
+                         double* EigenValue,
+                         double* EigenVector,
+                         double& maxError,
+                         double& meanError)
+{
+    double* V = EigenVector;
+    const int naloc = narows * nacols;
+    double* D = new double[naloc];
+    double* R = new double[naloc];
+
+    for (int i = 0; i < naloc; ++i)
+        D[i] = 0;
+
+    for (int i = 0; i < nFull; ++i)
+    {
+        int localRow, localCol;
+        int localProcRow, localProcCol;
+
+        localRow = localIndex(i, nblk, nprows, localProcRow);
+        if (myprow == localProcRow)
+        {
+            localCol = localIndex(i, nblk, npcols, localProcCol);
+            if (mypcol == localProcCol)
+            {
+                int idx = localRow + localCol * narows;
+                D[idx] = EigenValue[i];
+            }
+        }
+    }
+
+    // dwork=B*V
+    Cpdsymm('L', 'U', nFull, 1.0, B, V, 0.0, dwork.data(), desc);
+    // R=B*V*D=dwork*D
+    Cpdsymm('R', 'U', nFull, 1.0, D, dwork.data(), 0.0, R, desc);
+    // R=A*V-B*V*D=A*V-R
+    Cpdsymm('L', 'U', nFull, 1.0, A, V, -1.0, R, desc);
+    // calculate the maximum and mean value of sum_i{R(:,i)*R(:,i)}
+    double sumError = 0;
+    maxError = 0;
+    for (int i = 1; i <= nev; ++i)
+    {
+        double E;
+        Cpddot(nFull, E, R, 1, i, 1, R, 1, i, 1, desc);
+        // printf("myid: %d, i: %d, E: %lf\n", myid, i, E);
+        sumError += E;
+        maxError = maxError > E ? maxError : E;
+    }
+    meanError = sumError / nFull;
+
+    delete[] D;
+    delete[] R;
+}
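
Note on the real-valued path in `generalized_eigenvector` above: it reduces A x = v B x to a standard problem through B = U^T U, A~ = U^-T A U^-1, and recovers x = U^-1 x'. The block below is a minimal serial sketch of that reduction for a 2x2 case, written under stated assumptions. It is not the ELPA/ScaLAPACK code path, and `Mat2`, `GenEig2`, `solveGeneralized2x2` and `maxResidual` are hypothetical names introduced only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat2 = std::array<std::array<double, 2>, 2>;

// Eigenvalues (ascending) and eigenvectors; vector[i][k] is component i of eigenvector k.
struct GenEig2
{
    std::array<double, 2> value;
    Mat2 vector;
};

Mat2 mul(const Mat2& x, const Mat2& y)
{
    Mat2 z{};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k)
                z[i][j] += x[i][k] * y[k][j];
    return z;
}

Mat2 transpose(const Mat2& x)
{
    Mat2 t{};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            t[i][j] = x[j][i];
    return t;
}

// Solve A x = v B x for symmetric A and symmetric positive-definite B (both 2x2).
GenEig2 solveGeneralized2x2(const Mat2& A, const Mat2& B)
{
    // 1. Cholesky factor B = U^T U with U upper triangular.
    Mat2 U{};
    U[0][0] = std::sqrt(B[0][0]);
    U[0][1] = B[0][1] / U[0][0];
    U[1][1] = std::sqrt(B[1][1] - U[0][1] * U[0][1]);

    // 2. Invert the triangular factor.
    Mat2 Uinv{};
    Uinv[0][0] = 1.0 / U[0][0];
    Uinv[1][1] = 1.0 / U[1][1];
    Uinv[0][1] = -U[0][1] * Uinv[0][0] * Uinv[1][1];

    // 3. Reduce to a standard problem: At = U^-T A U^-1.
    const Mat2 At = mul(transpose(Uinv), mul(A, Uinv));

    // 4. Closed-form eigen-decomposition of the symmetric 2x2 matrix At.
    const double mean = 0.5 * (At[0][0] + At[1][1]);
    const double diff = 0.5 * (At[0][0] - At[1][1]);
    const double radius = std::hypot(diff, At[0][1]);
    const double theta = 0.5 * std::atan2(2.0 * At[0][1], At[0][0] - At[1][1]);
    GenEig2 r;
    r.value[0] = mean - radius;
    r.value[1] = mean + radius;
    Mat2 Q{};
    Q[0][0] = -std::sin(theta); // eigenvector of the smaller eigenvalue
    Q[1][0] = std::cos(theta);
    Q[0][1] = std::cos(theta); // eigenvector of the larger eigenvalue
    Q[1][1] = std::sin(theta);

    // 5. Back-transform the eigenvectors: x = U^-1 x'.
    r.vector = mul(Uinv, Q);
    return r;
}

// Largest absolute component of A x - v B x over both eigenpairs.
double maxResidual(const Mat2& A, const Mat2& B, const GenEig2& r)
{
    double worst = 0.0;
    for (int k = 0; k < 2; ++k)
        for (int i = 0; i < 2; ++i)
        {
            double res = 0.0;
            for (int j = 0; j < 2; ++j)
                res += (A[i][j] - r.value[k] * B[i][j]) * r.vector[j][k];
            worst = std::fmax(worst, std::fabs(res));
        }
    return worst;
}
```

Calling `solveGeneralized2x2(A, B)` and then `maxResidual(A, B, result)` gives a serial analogue of the two-matrix `verify` overload: the residual should be at rounding level when B is positive definite.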
diff --git a/source/module_hsolver/genelpa/elpa_solver.h b/source/module_hsolver/genelpa/elpa_solver.h
new file mode 100644
index 0000000000..13b8bc5ecc
--- /dev/null
+++ b/source/module_hsolver/genelpa/elpa_solver.h
@@ -0,0 +1,98 @@
+#pragma once
+#include "mpi.h"
+
+#include <complex>
+#include <fstream>
+#include <vector>
+
+class ELPA_Solver
+{
+  public:
+    ELPA_Solver(const bool isReal,
+                const MPI_Comm comm,
+                const int nev,
+                const int narows,
+                const int nacols,
+                const int* desc);
+    ELPA_Solver(const bool isReal,
+                const MPI_Comm comm,
+                const int nev,
+                const int narows,
+                const int nacols,
+                const int* desc,
+                const int* otherParameter);
+
+    int eigenvector(double* A, double* EigenValue, double* EigenVector);
+    int generalized_eigenvector(double* A, double* B, int& DecomposedState, double* EigenValue, double* EigenVector);
+    int eigenvector(std::complex<double>* A, double* EigenValue, std::complex<double>* EigenVector);
+    int generalized_eigenvector(std::complex<double>* A,
+                                std::complex<double>* B,
+                                int& DecomposedState,
+                                double* EigenValue,
+                                std::complex<double>* EigenVector);
+    void setLoglevel(int loglevel);
+    void setKernel(bool isReal, int Kernel);
+    void setQR(int useQR);
+    void outputParameters();
+    void verify(double* A, double* EigenValue, double* EigenVector, double& maxError, double& meanError);
+    void verify(double* A, double* B, double* EigenValue, double* EigenVector, double& maxError, double& meanError);
+    void verify(std::complex<double>* A,
+                double* EigenValue,
+                std::complex<double>* EigenVector,
+                double& maxError,
+                double& meanError);
+    void verify(std::complex<double>* A,
+                std::complex<double>* B,
+                double* EigenValue,
+                std::complex<double>* EigenVector,
+                double& maxError,
+                double& meanError);
+    void exit();
+
+  private:
+    const int CHOLESKY_CRITICAL_SIZE = 1000;
+    bool isReal;
+    MPI_Comm comm;
+    int nFull;
+    int nev;
+    int narows;
+    int nacols;
+    int desc[9];
+    int method;
+    int kernel_id;
+    int cblacs_ctxt;
+    int nblk;
+    int lda;
+    std::vector<double> dwork;
+    std::vector<std::complex<double>> zwork;
+    int myid;
+    int nprows;
+    int npcols;
+    int myprow;
+    int mypcol;
+    int useQR;
+    int wantDebug;
+    int loglevel;
+    std::ofstream logfile;
+    // for legacy interface
+    int comm_f;
+    int mpi_comm_rows;
+    int mpi_comm_cols;
+    // for new elpa handle
+    int handle_id;
+
+    // toolbox
+    int read_cpuflag();
+    int read_real_kernel();
+    int read_complex_kernel();
+    int allocate_work();
+    int decomposeRightMatrix(double* B, double* EigenValue, double* EigenVector, int& DecomposedState);
+    int decomposeRightMatrix(std::complex<double>* B,
+                             double* EigenValue,
+                             std::complex<double>* EigenVector,
+                             int& DecomposedState);
+    int composeEigenVector(int DecomposedState, double* B, double* EigenVector);
+    int composeEigenVector(int DecomposedState, std::complex<double>* B, std::complex<double>* EigenVector);
+    // debug tool
+    void timer(int myid, const char function[], const char step[], double& t0);
+};
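
Note on `DecomposedState` and `CHOLESKY_CRITICAL_SIZE` declared above: in the real-valued `decomposeRightMatrix`, matrices smaller than the threshold try `pdpotrf_` first and `elpa_cholesky_d` second, larger ones try them in the opposite order, and state 3 (diagonalizing B) is the last resort. The following is a minimal sketch of that selection order only. `Factorizer` and `chooseDecomposition` are hypothetical names, and the two callables stand in for the real factorization routines.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for a Cholesky back end; returns 0 on success, nonzero on failure.
using Factorizer = std::function<int()>;

// Returns the DecomposedState value that the selection order would produce:
//   1: the ScaLAPACK-style factorization succeeded
//   2: the ELPA-style factorization succeeded
//   3: both failed, so the caller falls back to diagonalizing B
int chooseDecomposition(int nFull,
                        int criticalSize,
                        const Factorizer& scalapackCholesky,
                        const Factorizer& elpaCholesky)
{
    const bool scalapackFirst = nFull < criticalSize;
    const Factorizer& first = scalapackFirst ? scalapackCholesky : elpaCholesky;
    const Factorizer& second = scalapackFirst ? elpaCholesky : scalapackCholesky;

    if (first() == 0)
        return scalapackFirst ? 1 : 2;
    if (second() == 0)
        return scalapackFirst ? 2 : 1;
    return 3;
}
```

The sketch only models which routine is attempted in which order; it does not model the matrix contents or the MPI reduction of the status codes.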
diff --git a/source/module_hsolver/genelpa/my_math.hpp b/source/module_hsolver/genelpa/my_math.hpp
new file mode 100644
index 0000000000..999389e852
--- /dev/null
+++ b/source/module_hsolver/genelpa/my_math.hpp
@@ -0,0 +1,384 @@
+#pragma once
+// simple wrappers for blas, pblas and scalapack
+// NOTE: some parameters of these functions are not supported
+extern "C"
+{
+#include "Cblacs.h"
+#include "blas.h"
+#include "pblas.h"
+#include "scalapack.h"
+
+}
+#include <complex>
+
+static inline void Cdcopy(const int n, double* a, double* b)
+{
+    int inc = 1;
+    dcopy_(&n, a, &inc, b, &inc);
+}
+
+static inline void Czcopy(const int n, std::complex<double>* a, std::complex<double>* b)
+{
+    double _Complex* aa = reinterpret_cast<double _Complex*>(a);
+    double _Complex* bb = reinterpret_cast<double _Complex*>(b);
+    int inc = 1;
+    zcopy_(&n, aa, &inc, bb, &inc);
+}
+
+static inline void Cpddot(int n,
+                          double& dot,
+                          double* a,
+                          int ia,
+                          int ja,
+                          int inca,
+                          double* b,
+                          int ib,
+                          int jb,
+                          int incb,
+                          int* desc)
+{
+    pddot_(&n, &dot, a, &ia, &ja, desc, &inca, b, &ib, &jb, desc, &incb);
+}
+
+static inline void Cpzdotc(int n,
+                           std::complex<double>& dotc,
+                           std::complex<double>* a,
+                           int ia,
+                           int ja,
+                           int inca,
+                           std::complex<double>* b,
+                           int ib,
+                           int jb,
+                           int incb,
+                           int* desc)
+{
+    double _Complex* aa = reinterpret_cast<double _Complex*>(a);
+    double _Complex* bb = reinterpret_cast<double _Complex*>(b);
+    double _Complex* dotc_c = reinterpret_cast<double _Complex*>(&dotc);
+    pzdotc_(&n, dotc_c, aa, &ia, &ja, desc, &inca, bb, &ib, &jb, desc, &incb);
+}
+
+static inline int Cpdpotrf(const char uplo, const int na, double* U, int* desc)
+{
+    int isrc = 1;
+    int info;
+    pdpotrf_(&uplo, &na, U, &isrc, &isrc, desc, &info);
+    return info;
+}
+
+static inline int Cpzpotrf(const char uplo, const int na, std::complex<double>* U, int* desc)
+{
+    int isrc = 1;
+    int info;
+    double _Complex* uu = reinterpret_cast<double _Complex*>(U);
+    pzpotrf_(&uplo, &na, uu, &isrc, &isrc, desc, &info);
+    return info;
+}
+
+static inline void Cpdtrmm(char side,
+                           char uplo,
+                           char trans,
+                           char diag,
+                           int m,
+                           int n,
+                           double alpha,
+                           double* a,
+                           double* b,
+                           int* desc)
+{
+    int isrc = 1;
+    pdtrmm_(&side, &uplo, &trans, &diag, &m, &n, &alpha, a, &isrc, &isrc, desc, b, &isrc, &isrc, desc);
+}
+
+static inline void Cpztrmm(char side,
+                           char uplo,
+                           char trans,
+                           char diag,
+                           int m,
+                           int n,
+                           std::complex<double> alpha,
+                           std::complex<double>* a,
+                           std::complex<double>* b,
+                           int* desc)
+{
+    int isrc = 1;
+    double _Complex* alpha_c = reinterpret_cast<double _Complex*>(&alpha);
+    double _Complex* aa = reinterpret_cast<double _Complex*>(a);
+    double _Complex* bb = reinterpret_cast<double _Complex*>(b);
+    pztrmm_(&side, &uplo, &trans, &diag, &m, &n, alpha_c, aa, &isrc, &isrc, desc, bb, &isrc, &isrc, desc);
+}
+
+static inline void Cpdgemm(char transa,
+                           char transb,
+                           int m,
+                           int n,
+                           int k,
+                           double alpha,
+                           double* a,
+                           double* b,
+                           double beta,
+                           double* c,
+                           int* desc)
+{
+    int isrc = 1;
+    pdgemm_(&transa,
+            &transb,
+            &m,
+            &n,
+            &k,
+            &alpha,
+            a,
+            &isrc,
+            &isrc,
+            desc,
+            b,
+            &isrc,
+            &isrc,
+            desc,
+            &beta,
+            c,
+            &isrc,
+            &isrc,
+            desc);
+}
+
+static inline void Cpdgemm(char transa,
+                           char transb,
+                           int m,
+                           double alpha,
+                           double* a,
+                           double* b,
+                           double beta,
+                           double* c,
+                           int* desc)
+{
+    int isrc = 1;
+    pdgemm_(&transa,
+            &transb,
+            &m,
+            &m,
+            &m,
+            &alpha,
+            a,
+            &isrc,
+            &isrc,
+            desc,
+            b,
+            &isrc,
+            &isrc,
+            desc,
+            &beta,
+            c,
+            &isrc,
+            &isrc,
+            desc);
+}
+
+static inline void Cpzgemm(char transa,
+                           char transb,
+                           int m,
+                           int n,
+                           int k,
+                           std::complex<double> alpha,
+                           std::complex<double>* a,
+                           std::complex<double>* b,
+                           std::complex<double> beta,
+                           std::complex<double>* c,
+                           int* desc)
+{
+    double _Complex* alpha_c = reinterpret_cast<double _Complex*>(&alpha);
+    double _Complex* beta_c = reinterpret_cast<double _Complex*>(&beta);
+    double _Complex* aa = reinterpret_cast<double _Complex*>(a);
+    double _Complex* bb = reinterpret_cast<double _Complex*>(b);
+    double _Complex* cc = reinterpret_cast<double _Complex*>(c);
+    int isrc = 1;
+    pzgemm_(&transa,
+            &transb,
+            &m,
+            &n,
+            &k,
+            alpha_c,
+            aa,
+            &isrc,
+            &isrc,
+            desc,
+            bb,
+            &isrc,
+            &isrc,
+            desc,
+            beta_c,
+            cc,
+            &isrc,
+            &isrc,
+            desc);
+}
+
+static inline void Cpzgemm(char transa,
+                           char transb,
+                           int m,
+                           std::complex<double> alpha,
+                           std::complex<double>* a,
+                           std::complex<double>* b,
+                           std::complex<double> beta,
+                           std::complex<double>* c,
+                           int* desc)
+{
+    double _Complex* alpha_c = reinterpret_cast<double _Complex*>(&alpha);
+    double _Complex* beta_c = reinterpret_cast<double _Complex*>(&beta);
+    double _Complex* aa = reinterpret_cast<double _Complex*>(a);
+    double _Complex* bb = reinterpret_cast<double _Complex*>(b);
+    double _Complex* cc = reinterpret_cast<double _Complex*>(c);
+    int isrc = 1;
+    pzgemm_(&transa,
+            &transb,
+            &m,
+            &m,
+            &m,
+            alpha_c,
+            aa,
+            &isrc,
+            &isrc,
+            desc,
+            bb,
+            &isrc,
+            &isrc,
+            desc,
+            beta_c,
+            cc,
+            &isrc,
+            &isrc,
+            desc);
+}
+
+static inline void Cpdsymm(char side,
+                           char uplo,
+                           int m,
+                           int n,
+                           double alpha,
+                           double* a,
+                           double* b,
+                           double beta,
+                           double* c,
+                           int* desc)
+{
+    int isrc = 1;
+    pdsymm_(&side, &uplo, &m, &n, &alpha, a, &isrc, &isrc, desc, b, &isrc, &isrc, desc, &beta, c, &isrc, &isrc, desc);
+}
+
+static inline void Cpdsymm(char side,
+                           char uplo,
+                           int na,
+                           double alpha,
+                           double* a,
+                           double* b,
+                           double beta,
+                           double* c,
+                           int* desc)
+{
+    int isrc = 1;
+    pdsymm_(&side, &uplo, &na, &na, &alpha, a, &isrc, &isrc, desc, b, &isrc, &isrc, desc, &beta, c, &isrc, &isrc, desc);
+}
+
+static inline void Cpzsymm(char side,
+                           char uplo,
+                           int na,
+                           std::complex<double> alpha,
+                           std::complex<double>* a,
+                           std::complex<double>* b,
+                           std::complex<double> beta,
+                           std::complex<double>* c,
+                           int* desc)
+{
+    double _Complex* alpha_c = reinterpret_cast<double _Complex*>(&alpha);
+    double _Complex* beta_c = reinterpret_cast<double _Complex*>(&beta);
+    double _Complex* aa = reinterpret_cast<double _Complex*>(a);
+    double _Complex* bb = reinterpret_cast<double _Complex*>(b);
+    double _Complex* cc = reinterpret_cast<double _Complex*>(c);
+    int isrc = 1;
+    pzsymm_(&side,
+            &uplo,
+            &na,
+            &na,
+            alpha_c,
+            aa,
+            &isrc,
+            &isrc,
+            desc,
+            bb,
+            &isrc,
+            &isrc,
+            desc,
+            beta_c,
+            cc,
+            &isrc,
+            &isrc,
+            desc);
+}
+
+static inline void Cpzhemm(char side,
+                           char uplo,
+                           int na,
+                           std::complex<double> alpha,
+                           std::complex<double>* a,
+                           std::complex<double>* b,
+                           std::complex<double> beta,
+                           std::complex<double>* c,
+                           int* desc)
+{
+    double _Complex* alpha_c = reinterpret_cast<double _Complex*>(&alpha);
+    double _Complex* beta_c = reinterpret_cast<double _Complex*>(&beta);
+    double _Complex* aa = reinterpret_cast<double _Complex*>(a);
+    double _Complex* bb = reinterpret_cast<double _Complex*>(b);
+    double _Complex* cc = reinterpret_cast<double _Complex*>(c);
+    int isrc = 1;
+    pzhemm_(&side,
+            &uplo,
+            &na,
+            &na,
+            alpha_c,
+            aa,
+            &isrc,
+            &isrc,
+            desc,
+            bb,
+            &isrc,
+            &isrc,
+            desc,
+            beta_c,
+            cc,
+            &isrc,
+            &isrc,
+            desc);
+}
+
+static inline void Cpdgemr2d(int M,
+                             int N,
+                             double* a,
+                             int ia,
+                             int ja,
+                             int* desca,
+                             double* b,
+                             int ib,
+                             int jb,
+                             int* descb,
+                             int blacs_ctxt)
+{
+    pdgemr2d_(&M, &N, a, &ia, &ja, desca, b, &ib, &jb, descb, &blacs_ctxt);
+}
+
+static inline void Cpzgemr2d(int M,
+                             int N,
+                             std::complex<double>* a,
+                             int ia,
+                             int ja,
+                             int* desca,
+                             std::complex<double>* b,
+                             int ib,
+                             int jb,
+                             int* descb,
+                             int blacs_ctxt)
+{
+    double _Complex* aa = reinterpret_cast<double _Complex*>(a);
+    double _Complex* bb = reinterpret_cast<double _Complex*>(b);
+    pzgemr2d_(&M, &N, aa, &ia, &ja, desca, bb, &ib, &jb, descb, &blacs_ctxt);
+}
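
Note on the `reinterpret_cast` calls in the complex wrappers above: they rely on `std::complex<double>` being stored as two adjacent doubles, real part first. `double _Complex` itself is a compiler extension in C++, so the sketch below shows the same layout assumption through a plain `double*` view, which the C++ standard does guarantee for `std::complex`. `sumInterleaved` is a hypothetical helper used only for illustration.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Sum a complex array by viewing it as interleaved doubles:
// raw[2 * i] is the real part and raw[2 * i + 1] the imaginary part of z[i].
std::complex<double> sumInterleaved(std::complex<double>* z, int n)
{
    double* raw = reinterpret_cast<double*>(z);
    double re = 0.0;
    double im = 0.0;
    for (int i = 0; i < n; ++i)
    {
        re += raw[2 * i];
        im += raw[2 * i + 1];
    }
    return {re, im};
}
```

This is why a buffer of `std::complex<double>` can be handed to Fortran routines that expect `COMPLEX*16` data without copying.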
diff --git a/source/module_hsolver/genelpa/pblas.h b/source/module_hsolver/genelpa/pblas.h
new file mode 100644
index 0000000000..51ac6f3671
--- /dev/null
+++ b/source/module_hsolver/genelpa/pblas.h
@@ -0,0 +1,40 @@
+#pragma once
+void pddot_(int* n, double* dot, double* x, int* ix, int* jx, int* descx, int* incx,
+            double* y, int* iy, int* jy, int* descy, int* incy);
+
+void pzdotc_(int* n, double _Complex* dot, double _Complex* x, int* ix, int* jx, int* descx, int* incx,
+             double _Complex* y, int* iy, int* jy, int* descy, int* incy);
+void pdsymv_(char* uplo, int* n,
+             double* alpha, double* a, int* ia, int* ja, int* desca,
+             double* x, int* ix, int* jx, int* descx, int* incx,
+             double* beta, double* y, int* iy, int* jy, int* descy, int* incy);
+void pdtran_(int* m, int* n,
+             double* alpha, double* a, int* ia, int* ja, int* desca,
+             double* beta, double* c, int* ic, int* jc, int* descc);
+
+void pdgemm_(char* transa, char* transb, int* m, int* n, int* k,
+             double* alpha, double* a, int* ia, int* ja, int* desca,
+             double* b, int* ib, int* jb, int* descb,
+             double* beta, double* c, int* ic, int* jc, int* descc);
+void pzgemm_(char* transa, char* transb, int* m, int* n, int* k,
+             double _Complex* alpha, double _Complex* a, int* ia, int* ja, int* desca,
+             double _Complex* b, int* ib, int* jb, int* descb,
+             double _Complex* beta, double _Complex* c, int* ic, int* jc, int* descc);
+void pdsymm_(char* side, char* uplo, int* m, int* n,
+             double* alpha, double* a, int* ia, int* ja, int* desca,
+             double* b, int* ib, int* jb, int* descb,
+             double* beta, double* c, int* ic, int* jc, int* descc);
+void pzsymm_(char* side, char* uplo, int* m, int* n,
+             double _Complex* alpha, double _Complex* a, int* ia, int* ja, int* desca,
+             double _Complex* b, int* ib, int* jb, int* descb,
+             double _Complex* beta, double _Complex* c, int* ic, int* jc, int* descc);
+void pzhemm_(char* side, char* uplo, int* m, int* n,
+             double _Complex* alpha, double _Complex* a, int* ia, int* ja, int* desca,
+             double _Complex* b, int* ib, int* jb, int* descb,
+             double _Complex* beta, double _Complex* c, int* ic, int* jc, int* descc);
+void pdtrmm_(char* side, char* uplo, char* transa, char* diag, int* m, int* n,
+             double* alpha, double* a, int* ia, int* ja, int* desca,
+             double* b, int* ib, int* jb, int* descb);
+void pztrmm_(char* side, char* uplo, char* transa, char* diag, int* m, int* n,
+             double _Complex* alpha, double _Complex* a, int* ia, int* ja, int* desca,
+             double _Complex* b, int* ib, int* jb, int* descb);
diff --git a/source/module_hsolver/genelpa/scalapack.h b/source/module_hsolver/genelpa/scalapack.h
new file mode 100644
index 0000000000..39e18358e3
--- /dev/null
+++ b/source/module_hsolver/genelpa/scalapack.h
@@ -0,0 +1,12 @@
+#pragma once
+// scalapack
+int numroc_(const int *N, const int *NB, const int *IPROC, const int *ISRCPROC, const int *NPROCS);
+void descinit_(int *DESC, const int *M, const int *N, const int *MB, const int *NB, const int *IRSRC, const int *ICSRC, const int *ICTXT, const int *LLD, int *INFO);
+void pdpotrf_(const char *UPLO, const int *N, double *A, const int *IA, const int *JA, const int *DESCA, int *INFO);
+void pzpotrf_(const char *UPLO, const int *N, double _Complex *A, const int *IA, const int *JA, const int *DESCA, int *INFO);
+void pdsyev_(const char *JOBZ, const char *UPLO, int *N, double *A, int *IA, int *JA, int *DESCA,
+             double *W, double *Z, int *IZ, int *JZ, int *DESCZ, double *WORK, int *LWORK, int *INFO);
+void pdgemr2d_(int *M, int *N, double *A, int *IA, int *JA, int *DESCA,
+               double *B, int *IB, int *JB, int *DESCB, int *ICTXT);
+void pzgemr2d_(int *M, int *N, double _Complex *A, int *IA, int *JA, int *DESCA,
+               double _Complex *B, int *IB, int *JB, int *DESCB, int *ICTXT);
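
Note on `numroc_` declared above: it returns how many rows or columns of a block-cyclically distributed dimension are stored on a given process, which is how `narows` and `nacols` are obtained in `initBlacsGrid`. The block below is a serial sketch of that counting rule plus the matching local-to-global index mapping, assuming the first block lives on process 0 for the mapping. `localCount` and `toGlobal` are hypothetical names; whether the patch's own `globalIndex` helper uses exactly this formula is an assumption.

```cpp
#include <cassert>

// Number of entries of an n-long dimension, split into blocks of nb, that are stored on
// process iproc out of nprocs when the first block lives on process isrcproc.
int localCount(int n, int nb, int iproc, int isrcproc, int nprocs)
{
    const int mydist = (nprocs + iproc - isrcproc) % nprocs; // distance from the source process
    const int nblocks = n / nb;                              // number of complete blocks
    int count = (nblocks / nprocs) * nb;                     // share every process receives
    const int extra = nblocks % nprocs;                      // leftover complete blocks
    if (mydist < extra)
        count += nb; // one more complete block
    else if (mydist == extra)
        count += n % nb; // the trailing partial block, if any
    return count;
}

// Global 0-based index of local 0-based index l on process iproc (source process 0).
int toGlobal(int l, int nb, int nprocs, int iproc)
{
    return ((l / nb) * nprocs + iproc) * nb + l % nb;
}
```

Summing `localCount` over all processes of one grid dimension gives back `n`, which is a quick sanity check for a chosen `nblk` and process grid.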
diff --git a/source/module_hsolver/genelpa/utils.cpp b/source/module_hsolver/genelpa/utils.cpp
new file mode 100644
index 0000000000..77db0a4d41
--- /dev/null
+++ b/source/module_hsolver/genelpa/utils.cpp
@@ -0,0 +1,352 @@
+#include "utils.h"
+
+#include "my_math.hpp"
+
+#include <cmath>
+#include <complex>
+#include <cstring>
+#include <fstream>
+#include <iostream>
+#include <mpi.h>
+#include <sstream>
+
+void initBlacsGrid(int loglevel,
+                   MPI_Comm comm,
+                   int nFull,
+                   int nblk,
+                   int& blacs_ctxt,
+                   int& narows,
+                   int& nacols,
+                   int desc[])
+{
+    std::stringstream outlog;
+    char BLACS_LAYOUT = 'C';
+    int ISRCPROC = 0; // BLACS process coordinates are 0-based: the first block lives on process row/column 0
+    int nprows, npcols;
+    int myprow, mypcol;
+    int nprocs, myid;
+    int info;
+    MPI_Comm_size(comm, &nprocs);
+    MPI_Comm_rank(comm, &myid);
+    // set blacs parameters
+    for (npcols = int(sqrt(double(nprocs))); npcols >= 2; --npcols)
+    {
+        if (nprocs % npcols == 0)
+            break;
+    }
+    nprows = nprocs / npcols;
+    if ((loglevel > 0 && myid == 0) || loglevel > 1)
+    {
+        outlog.str("");
+        outlog << "myid " << myid << ": nprows: " << nprows << " ; npcols: " << npcols << std::endl;
+        std::cout << outlog.str();
+    }
+
+    // int comm_f = MPI_Comm_c2f(comm);
+    blacs_ctxt = Csys2blacs_handle(comm);
+    Cblacs_gridinit(&blacs_ctxt, &BLACS_LAYOUT, nprows, npcols);
+    if ((loglevel > 0 && myid == 0) || loglevel > 1)
+    {
+        outlog.str("");
+        outlog << "myid " << myid << ": Cblacs_gridinit done, blacs_ctxt: " << blacs_ctxt << std::endl;
+        std::cout << outlog.str();
+    }
+    Cblacs_gridinfo(blacs_ctxt, &nprows, &npcols, &myprow, &mypcol);
+    if ((loglevel > 0 && myid == 0) || loglevel > 1)
+    {
+        int mypnum = Cblacs_pnum(blacs_ctxt, myprow, mypcol);
+        int prow, pcol;
+        Cblacs_pcoord(blacs_ctxt, myid, &prow, &pcol);
+        outlog.str("");
+        outlog << "myid " << myid << ": myprow: " << myprow << " ;mypcol: " << mypcol << std::endl;
+        outlog << "myid " << myid << ": mypnum: " << mypnum << std::endl;
+        outlog << "myid " << myid << ": prow: " << prow << " ;pcol: " << pcol << std::endl;
+        std::cout << outlog.str();
+    }
+
+    narows = numroc_(&nFull, &nblk, &myprow, &ISRCPROC, &nprows);
+    nacols = numroc_(&nFull, &nblk, &mypcol, &ISRCPROC, &npcols);
+    descinit_(desc, &nFull, &nFull, &nblk, &nblk, &ISRCPROC, &ISRCPROC, &blacs_ctxt, &narows, &info);
+
+    if ((loglevel > 0 && myid == 0) || loglevel > 1)
+    {
+        outlog.str("");
+        outlog << "myid " << myid << ": narows: " << narows << " nacols: " << nacols << std::endl;
+        outlog << "myid " << myid << ": blacs parameters setting" << std::endl;
+        outlog << "myid " << myid << ": desc is: ";
+        for (int i = 0; i < 9; ++i)
+            outlog << desc[i] << " ";
+        outlog << std::endl;
+        std::cout << outlog.str();
+    }
+}
+
+// load matrix from the file
+void loadMatrix(const char FileName[], int nFull, double* a, int* desca, int blacs_ctxt)
+{
+    int nprows, npcols, myprow, mypcol;
+    Cblacs_gridinfo(blacs_ctxt, &nprows, &npcols, &myprow, &mypcol);
+    int myid = Cblacs_pnum(blacs_ctxt, myprow, mypcol);
+
+    const int ROOT_PROC = 0;
+    std::ifstream matrixFile;
+    if (myid == ROOT_PROC)
+        matrixFile.open(FileName);
+
+    double* b; // buffer
+    const int MAX_BUFFER_SIZE = 1e9; // max buffer size is 1GB
+
+    int N = nFull;
+    int M
+        = std::max(1, std::min(nFull, (int)(MAX_BUFFER_SIZE / nFull / sizeof(double)))); // at least 1 row, max size 1GB
+    if (myid == ROOT_PROC)
+        b = new double[M * N];
+    else
+        b = new double[1];
+
+    // set descb, which has all elements in the only block in the root process
+    //  block size is M x N, so all elements are in the first process
+    int descb[9] = {1, blacs_ctxt, M, N, M, N, 0, 0, M};
+
+    int ja = 1, ib = 1, jb = 1;
+    for (int ia = 1; ia <= nFull; ia += M) // ia is 1-based, so the bound must include row nFull
+    {
+        int thisM = std::min(M, nFull - ia + 1); // nFull - ia + 1 is the number of rows still to be read from the file
+        // read from the file
+        if (myid == ROOT_PROC)
+        {
+            for (int i = 0; i < thisM; ++i)
+            {
+                for (int j = 0; j < N; ++j)
+                {
+                    matrixFile >> b[i + j * M];
+                }
+            }
+        }
+        // distribute this block of rows from the root process to all processes
+        Cpdgemr2d(thisM, N, b, ib, jb, descb, a, ia, ja, desca, blacs_ctxt);
+    }
+
+    if (myid == ROOT_PROC)
+        matrixFile.close();
+
+    delete[] b;
+}
+
+void saveLocalMatrix(const char filePrefix[], int narows, int nacols, double* a)
+{
+    using namespace std;
+    char FileName[80];
+    int myid;
+    ofstream matrixFile;
+    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
+
+    snprintf(FileName, sizeof(FileName), "%s_%3.3d.dat", filePrefix, myid); // bounded write: long prefixes are truncated
+    matrixFile.open(FileName);
+    matrixFile.flags(std::ios_base::scientific);
+    matrixFile.precision(17);
+    matrixFile.width(24);
+    for (int i = 0; i < narows; ++i)
+    {
+        for (int j = 0; j < nacols; ++j)
+        {
+            matrixFile << a[i + j * narows] << " ";
+        }
+        matrixFile << std::endl;
+    }
+    matrixFile.close();
+}
+
+// use pdgemr2d to collect the matrix from all processes onto the root process
+// and save it to a single complete matrix file
+void saveMatrix(const char FileName[], int nFull, double* a, int* desca, int blacs_ctxt)
+{
+    int nprows, npcols, myprow, mypcol;
+    Cblacs_gridinfo(blacs_ctxt, &nprows, &npcols, &myprow, &mypcol);
+    int myid = Cblacs_pnum(blacs_ctxt, myprow, mypcol);
+
+    const int ROOT_PROC = 0;
+    std::ofstream matrixFile;
+    if (myid == ROOT_PROC) // setup saved matrix format
+    {
+        matrixFile.open(FileName);
+        matrixFile.flags(std::ios_base::scientific);
+        matrixFile.precision(17);
+        matrixFile.width(24);
+    }
+
+    double* b; // buffer
+    const int MAX_BUFFER_SIZE = 1e9; // max buffer size is 1GB
+
+    int N = nFull;
+    int M
+        = std::max(1, std::min(nFull, (int)(MAX_BUFFER_SIZE / nFull / sizeof(double)))); // at least 1 row, max size 1GB
+    if (myid == ROOT_PROC)
+        b = new double[M * N];
+    else
+        b = new double[1];
+
+    // set descb, which has all elements in the only block in the root process
+    int descb[9] = {1, blacs_ctxt, M, N, M, N, 0, 0, M};
+
+    int ja = 1, ib = 1, jb = 1;
+    for (int ia = 1; ia <= nFull; ia += M) // ia is 1-based, so the bound must include row nFull
+    {
+        int thisM = std::min(M, nFull - ia + 1); // nFull - ia + 1 is the number of rows still to be saved
+        // gather data rows by rows from all processes
+        Cpdgemr2d(thisM, N, a, ia, ja, desca, b, ib, jb, descb, blacs_ctxt);
+        // write to the file
+        if (myid == ROOT_PROC)
+        {
+            for (int i = 0; i < thisM; ++i)
+            {
+                for (int j = 0; j < N; ++j)
+                {
+                    matrixFile << b[i + j * M] << " ";
+                }
+                matrixFile << std::endl;
+            }
+        }
+    }
+
+    if (myid == ROOT_PROC)
+        matrixFile.close();
+
+    delete[] b;
+}
+
+// load matrix from the file
+void loadMatrix(const char FileName[], int nFull, std::complex<double>* a, int* desca, int blacs_ctxt)
+{
+    int nprows, npcols, myprow, mypcol;
+    Cblacs_gridinfo(blacs_ctxt, &nprows, &npcols, &myprow, &mypcol);
+    int myid = Cblacs_pnum(blacs_ctxt, myprow, mypcol);
+
+    const int ROOT_PROC = 0;
+    std::ifstream matrixFile;
+    if (myid == ROOT_PROC)
+        matrixFile.open(FileName);
+
+    std::complex<double>* b; // buffer
+    const int MAX_BUFFER_SIZE = 1e9; // max buffer size is 1GB
+
+    int N = nFull;
+    int M = std::max(
+        1,
+        std::min(nFull, (int)(MAX_BUFFER_SIZE / nFull / (2 * sizeof(double))))); // at least 1 row, max size 1GB
+    if (myid == ROOT_PROC)
+        b = new std::complex<double>[M * N];
+    else
+        b = new std::complex<double>[1];
+
+    // set descb, which has all elements in the only block in the root process
+    //  block size is M x N, so all elements are in the first process
+    int descb[9] = {1, blacs_ctxt, M, N, M, N, 0, 0, M};
+
+    int ja = 1, ib = 1, jb = 1;
+    for (int ia = 1; ia <= nFull; ia += M) // ia is 1-based, so the bound must include row nFull
+    {
+        int thisM = std::min(M, nFull - ia + 1); // nFull - ia + 1 is the number of rows still to be read from the file
+        // read from the file
+        if (myid == ROOT_PROC)
+        {
+            for (int i = 0; i < thisM; ++i)
+            {
+                for (int j = 0; j < N; ++j)
+                {
+                    matrixFile >> b[i + j * M];
+                }
+            }
+        }
+        // distribute this block of rows from the root process to all processes
+        Cpzgemr2d(thisM, N, b, ib, jb, descb, a, ia, ja, desca, blacs_ctxt);
+    }
+
+    if (myid == ROOT_PROC)
+        matrixFile.close();
+
+    delete[] b;
+}
+
+void saveLocalMatrix(const char filePrefix[], int narows, int nacols, std::complex<double>* a)
+{
+    using namespace std;
+    char FileName[80];
+    int myid;
+    ofstream matrixFile;
+
+    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
+
+    snprintf(FileName, sizeof(FileName), "%s_%3.3d.dat", filePrefix, myid); // bounded write: long prefixes are truncated
+    matrixFile.open(FileName);
+    matrixFile.flags(std::ios_base::scientific);
+    matrixFile.precision(17);
+    matrixFile.width(24);
+    for (int i = 0; i < narows; ++i)
+    {
+        for (int j = 0; j < nacols; ++j)
+        {
+            matrixFile << a[i + j * narows] << " ";
+        }
+        matrixFile << std::endl;
+    }
+    matrixFile.close();
+}
+
+// use pzgemr2d to collect the matrix from all processes onto the root process
+// and save it to a single complete matrix file
+void saveMatrix(const char FileName[], int nFull, std::complex<double>* a, int* desca, int blacs_ctxt)
+{
+    int nprows, npcols, myprow, mypcol;
+    Cblacs_gridinfo(blacs_ctxt, &nprows, &npcols, &myprow, &mypcol);
+    int myid = Cblacs_pnum(blacs_ctxt, myprow, mypcol);
+
+    const int ROOT_PROC = 0;
+    std::ofstream matrixFile;
+    if (myid == ROOT_PROC) // setup saved matrix format
+    {
+        matrixFile.open(FileName);
+        matrixFile.flags(std::ios_base::scientific);
+        matrixFile.precision(17);
+        matrixFile.width(24);
+    }
+
+    std::complex<double>* b; // buffer
+    const int MAX_BUFFER_SIZE = 1e9; // max buffer size is 1GB
+
+    int N = nFull;
+    int M
+        = std::max(1, std::min(nFull, (int)(MAX_BUFFER_SIZE / nFull / (2 * sizeof(double))))); // at least 1 row, max size 1GB
+    if (myid == ROOT_PROC)
+        b = new std::complex<double>[M * N];
+    else
+        b = new std::complex<double>[1];
+
+    // set descb, which has all elements in the only block in the root process
+    int descb[9] = {1, blacs_ctxt, M, N, M, N, 0, 0, M};
+
+    int ja = 1, ib = 1, jb = 1;
+    for (int ia = 1; ia <= nFull; ia += M) // ia is 1-based, so the bound must include row nFull
+    {
+        int transM = std::min(M, nFull - ia + 1); // nFull - ia + 1 is the number of rows still to be saved
+        // gather data rows by rows from all processes
+        Cpzgemr2d(transM, N, a, ia, ja, desca, b, ib, jb, descb, blacs_ctxt);
+        // write to the file
+        if (myid == ROOT_PROC)
+        {
+            for (int i = 0; i < transM; ++i)
+            {
+                for (int j = 0; j < N; ++j)
+                {
+                    matrixFile << b[i + j * M] << " ";
+                }
+                matrixFile << std::endl;
+            }
+        }
+    }
+
+    if (myid == ROOT_PROC)
+        matrixFile.close();
+
+    delete[] b;
+}
diff --git a/source/module_hsolver/genelpa/utils.h b/source/module_hsolver/genelpa/utils.h
new file mode 100644
index 0000000000..412bcdaca3
--- /dev/null
+++ b/source/module_hsolver/genelpa/utils.h
@@ -0,0 +1,43 @@
+#pragma once
+#include <complex>
+#include <mpi.h>
+
+static inline int globalIndex(int localIndex, int nblk, int nprocs, int myproc)
+{
+    int iblock, gIndex;
+    iblock = localIndex / nblk;
+    gIndex = (iblock * nprocs + myproc) * nblk + localIndex % nblk;
+    return gIndex;
+}
+
+static inline int localIndex(int globalIndex, int nblk, int nprocs, int& localProc)
+{
+    localProc = int((globalIndex % (nblk * nprocs)) / nblk);
+    return int(globalIndex / (nblk * nprocs)) * nblk + globalIndex % nblk;
+}
+
+void initBlacsGrid(int loglevel,
+                   MPI_Comm comm,
+                   int nFull,
+                   int nblk,
+                   int& blacs_ctxt,
+                   int& narows,
+                   int& nacols,
+                   int desc[]);
+
+// load matrix from the file
+void loadMatrix(const char FileName[], int nFull, double* a, int* desca, int blacs_ctxt);
+
+void saveLocalMatrix(const char filePrefix[], int narows, int nacols, double* a);
+
+// use pdgemr2d to collect the matrix from all processes onto the root process
+// and save it to a single complete matrix file
+void saveMatrix(const char FileName[], int nFull, double* a, int* desca, int blacs_ctxt);
+
+void loadMatrix(const char FileName[], int nFull, std::complex<double>* a, int* desca, int blacs_ctxt);
+
+void saveLocalMatrix(const char filePrefix[], int narows, int nacols, std::complex<double>* a);
+
+// use pzgemr2d to collect the matrix from all processes onto the root process
+// and save it to a single complete matrix file
+void saveMatrix(const char FileName[], int nFull, std::complex<double>* a, int* desca, int blacs_ctxt);
diff --git a/source/module_hsolver/hsolver_pw.cpp b/source/module_hsolver/hsolver_pw.cpp
index 911d15ac7a..84bc9552e5 100644
--- a/source/module_hsolver/hsolver_pw.cpp
+++ b/source/module_hsolver/hsolver_pw.cpp
@@ -158,7 +158,7 @@ void HSolverPW::hamiltSolvePsiK(hamilt::Hamilt* hm, psi::Psi<std::complex<double
 
 void HSolverPW::update_precondition(std::vector<double> &h_diag, const int ik, const int npw)
 {
-    h_diag.resize(h_diag.size(), 1.0);
+    h_diag.assign(h_diag.size(), 1.0);
     int precondition_type = 2;
     const double tpiba2 = this->wfc_basis->tpiba2;
     
diff --git a/source/module_hsolver/test/CMakeLists.txt b/source/module_hsolver/test/CMakeLists.txt
index 710efdbd46..39000d9330 100644
--- a/source/module_hsolver/test/CMakeLists.txt
+++ b/source/module_hsolver/test/CMakeLists.txt
@@ -16,7 +16,7 @@ AddTest(
 )
 AddTest(
   TARGET HSolver_LCAO
-  LIBS ${math_libs} ELPA::ELPA base
+  LIBS ${math_libs} ELPA::ELPA base genelpa
   SOURCES diago_lcao_test.cpp ../diago_elpa.cpp ../diago_blas.cpp ../../src_parallel/parallel_global.cpp 
           ../../src_parallel/parallel_common.cpp ../../src_parallel/parallel_reduce.cpp
 )
diff --git a/source/module_surchem/H_correction_pw.cpp b/source/module_surchem/H_correction_pw.cpp
index 6ad2b036bc..1aaf2e6bbe 100644
--- a/source/module_surchem/H_correction_pw.cpp
+++ b/source/module_surchem/H_correction_pw.cpp
@@ -8,7 +8,7 @@
 #include <cmath>
 
 ModuleBase::matrix surchem::v_correction(const UnitCell &cell,
-                                         ModulePW::PW_Basis* rho_basis,
+                                         ModulePW::PW_Basis *rho_basis,
                                          const int &nspin,
                                          const double *const *const rho)
 {
@@ -51,7 +51,14 @@ ModuleBase::matrix surchem::v_correction(const UnitCell &cell,
     return v;
 }
 
-void surchem::add_comp_chg(const UnitCell &cell, ModulePW::PW_Basis* rho_basis, double q, double l, double center, complex<double> *NG, int dim)
+void surchem::add_comp_chg(const UnitCell &cell,
+                           ModulePW::PW_Basis *rho_basis,
+                           double q,
+                           double l,
+                           double center,
+                           complex<double> *NG,
+                           int dim,
+                           bool flag)
 {
     // x dim
     double tmp_q = 0.0;
@@ -62,18 +69,24 @@ void surchem::add_comp_chg(const UnitCell &cell, ModulePW::PW_Basis* rho_basis,
         ModuleBase::GlobalFunc::ZEROS(NG, rho_basis->npw);
         for (int ig = 0; ig < rho_basis->npw; ig++)
         {
-            if(ig==rho_basis->ig_gge0)
+            if (ig == rho_basis->ig_gge0)
+            {
+                if(flag)
+                {
+                    NG[ig] = complex<double>(tmp_q * l / L, 0.0);
+                }
                 continue;
+            }
             double GX = rho_basis->gcar[ig][0];
             double GY = rho_basis->gcar[ig][1];
             double GZ = rho_basis->gcar[ig][2];
             GX = GX * 2 * ModuleBase::PI;
             if (GY == 0 && GZ == 0 && GX != 0)
             {
-                NG[ig] = exp(ModuleBase::NEG_IMAG_UNIT * GX * center) * complex<double>(2.0 * tmp_q * sin(GX * l / 2.0) / (L * GX), 0.0);
+                NG[ig] = exp(ModuleBase::NEG_IMAG_UNIT * GX * center)
+                         * complex<double>(2.0 * tmp_q * sin(GX * l / 2.0) / (L * GX), 0.0);
             }
         }
-        // NG[0] = complex<double>(tmp_q * l / L, 0.0);
     }
     // y dim
     else if (dim == 1)
@@ -83,71 +96,123 @@ void surchem::add_comp_chg(const UnitCell &cell, ModulePW::PW_Basis* rho_basis,
         ModuleBase::GlobalFunc::ZEROS(NG, rho_basis->npw);
         for (int ig = 0; ig < rho_basis->npw; ig++)
         {
-            if(ig==rho_basis->ig_gge0)
+            if (ig == rho_basis->ig_gge0)
+            {
+                if(flag)
+                {
+                    NG[ig] = complex<double>(tmp_q * l / L, 0.0);
+                }
                 continue;
+            }
             double GX = rho_basis->gcar[ig][0];
             double GY = rho_basis->gcar[ig][1];
             double GZ = rho_basis->gcar[ig][2];
             GY = GY * 2 * ModuleBase::PI;
             if (GX == 0 && GZ == 0 && GY != 0)
             {
-                NG[ig] = exp(ModuleBase::NEG_IMAG_UNIT * GY * center) * complex<double>(2.0 * tmp_q * sin(GY * l / 2.0) / (L * GY), 0.0);
+                NG[ig] = exp(ModuleBase::NEG_IMAG_UNIT * GY * center)
+                         * complex<double>(2.0 * tmp_q * sin(GY * l / 2.0) / (L * GY), 0.0);
             }
         }
-        // NG[0] = complex<double>(tmp_q * l / L, 0.0);
     }
     // z dim
     else if (dim == 2)
     {
         double L = cell.a3[2];
-        // cout << "area" << cross(cell.a1, cell.a2).norm() << endl;
         tmp_q = q / (cross(cell.a1, cell.a2).norm() * l);
         ModuleBase::GlobalFunc::ZEROS(NG, rho_basis->npw);
         for (int ig = 0; ig < rho_basis->npw; ig++)
         {
-            if(ig==rho_basis->ig_gge0)
+            if (ig == rho_basis->ig_gge0)
+            {
+                if(flag)
+                {
+                    NG[ig] = complex<double>(tmp_q * l / L, 0.0);
+                }
                 continue;
+            }
             double GX = rho_basis->gcar[ig][0];
             double GY = rho_basis->gcar[ig][1];
             double GZ = rho_basis->gcar[ig][2];
             GZ = GZ * 2 * ModuleBase::PI;
             if (GX == 0 && GY == 0 && GZ != 0)
             {
-                NG[ig] = exp(ModuleBase::NEG_IMAG_UNIT * GZ * center) * complex<double>(2.0 * tmp_q * sin(GZ * l / 2.0) / (L * GZ), 0.0);
+                NG[ig] = exp(ModuleBase::NEG_IMAG_UNIT * GZ * center)
+                         * complex<double>(2.0 * tmp_q * sin(GZ * l / 2.0) / (L * GZ), 0.0);
             }
         }
-        // NG[0] = complex<double>(tmp_q * l / L, 0.0);
     }
 }
 
-ModuleBase::matrix surchem::v_compensating(const UnitCell &cell, ModulePW::PW_Basis *rho_basis)
+ModuleBase::matrix surchem::v_compensating(const UnitCell &cell,
+                                           ModulePW::PW_Basis *rho_basis,
+                                           const int &nspin,
+                                           const double *const *const rho)
 {
     ModuleBase::TITLE("surchem", "v_compensating");
     ModuleBase::timer::tick("surchem", "v_compensating");
 
+    // calculating v_comp also needs TOTN_real
+    double *Porter = new double[rho_basis->nrxx];
+    for (int i = 0; i < rho_basis->nrxx; i++)
+        Porter[i] = 0.0;
+    const int nspin0 = (nspin == 2) ? 2 : 1;
+    for (int is = 0; is < nspin0; is++)
+        for (int ir = 0; ir < rho_basis->nrxx; ir++)
+            Porter[ir] += rho[is][ir];
+
+    complex<double> *Porter_g = new complex<double>[rho_basis->npw];
+    ModuleBase::GlobalFunc::ZEROS(Porter_g, rho_basis->npw);
+
+    rho_basis->real2recip(Porter, Porter_g);
+
+    complex<double> *N = new complex<double>[rho_basis->npw];
+    complex<double> *TOTN = new complex<double>[rho_basis->npw];
+
+    cal_totn(cell, rho_basis, Porter_g, N, TOTN);
+
+    // save TOTN in real space
+    rho_basis->recip2real(TOTN, this->TOTN_real);
+
     complex<double> *comp_reci = new complex<double>[rho_basis->npw];
     complex<double> *phi_comp_G = new complex<double>[rho_basis->npw];
-    double *phi_comp_R = new double[rho_basis->nrxx];
 
     ModuleBase::GlobalFunc::ZEROS(comp_reci, rho_basis->npw);
     ModuleBase::GlobalFunc::ZEROS(phi_comp_G, rho_basis->npw);
     ModuleBase::GlobalFunc::ZEROS(phi_comp_R, rho_basis->nrxx);
-    // get comp chg in reci space
-    add_comp_chg(cell, rho_basis, comp_q, comp_l, comp_center, comp_reci, comp_dim);
-    double ecomp = 0.0;
+    // get compensating charge in reci space
+    add_comp_chg(cell, rho_basis, comp_q, comp_l, comp_center, comp_reci, comp_dim, true);
+    // save compensating charge in real space
+    rho_basis->recip2real(comp_reci, this->comp_real);
+
+    // test sum of comp_real -> 0
+    // for (int i = 0; i < rho_basis->nz;i++)
+    // {
+    //     cout << comp_real[i] << endl;
+    // }
+    // double sum = 0;
+    // for (int i = 0; i < rho_basis->nxyz; i++)
+    // {
+    //     sum += TOTN_real[i];
+    // }
+    // sum = sum * cell.omega / rho_basis->nxyz;
+    // cout << "sum:" << sum << endl;
+    // int pp;
+    // cin >> pp;
+
     for (int ig = 0; ig < rho_basis->npw; ig++)
     {
-        if (rho_basis->gg[ig] >= 1.0e-12) // LiuXh 20180410
+        if (ig == rho_basis->ig_gge0)
+        {
+            // cout << ig << endl;
+            continue;
+        }
+        else
         {
             const double fac = ModuleBase::e2 * ModuleBase::FOUR_PI / (cell.tpiba2 * rho_basis->gg[ig]);
-            ecomp += (conj(comp_reci[ig]) * comp_reci[ig]).real() * fac;
             phi_comp_G[ig] = fac * comp_reci[ig];
         }
     }
-    Parallel_Reduce::reduce_double_pool(ecomp);
-    ecomp *= 0.5 * cell.omega;
-    // std::cout << " ecomp=" << ecomp << std::endl;
-    comp_chg_energy = ecomp;
 
     rho_basis->recip2real(phi_comp_G, phi_comp_R);
 
@@ -166,95 +231,64 @@ ModuleBase::matrix surchem::v_compensating(const UnitCell &cell, ModulePW::PW_Ba
 
     delete[] comp_reci;
     delete[] phi_comp_G;
-    delete[] phi_comp_R;
+    delete[] Porter;
+    delete[] Porter_g;
+    delete[] N;
+    delete[] TOTN;
 
     ModuleBase::timer::tick("surchem", "v_compensating");
     return v_comp;
 }
 
-void test_print(double* tmp, ModulePW::PW_Basis *rho_basis)
-{
-    for (int i = 0; i < rho_basis->nz; i++)
-    {
-        cout << tmp[i] << endl;
-    }
-}
-
-void surchem::test_V_to_N(ModuleBase::matrix &v, 
-                const UnitCell &cell, 
-                ModulePW::PW_Basis *rho_basis, 
-                const double *const *const rho)
+void surchem::cal_comp_force(ModuleBase::matrix &force_comp, ModulePW::PW_Basis *rho_basis)
 {
-    double *phi_comp_R = new double[rho_basis->nrxx];
-    complex<double> *phi_comp_G = new complex<double>[rho_basis->npw];
-    complex<double> *comp_reci = new complex<double>[rho_basis->npw];
-    double *N_real = new double[rho_basis->nrxx];
-
-    ModuleBase::GlobalFunc::ZEROS(phi_comp_R, rho_basis->nrxx);
-    ModuleBase::GlobalFunc::ZEROS(phi_comp_G, rho_basis->npw);
-    ModuleBase::GlobalFunc::ZEROS(comp_reci, rho_basis->npw);
-    ModuleBase::GlobalFunc::ZEROS(N_real, rho_basis->nrxx);
-
-    for (int ir = 0; ir < rho_basis->nz; ir++)
-    {
-        cout << v(0, ir) << endl;
-    }
-
-    for (int ir = 0; ir < rho_basis->nrxx; ir++)
-    {
-        phi_comp_R[ir] = v(0, ir);
-    }
-
+    int iat = 0;
+    std::complex<double> *N = new std::complex<double>[rho_basis->npw];
+    std::complex<double> *phi_comp_G = new complex<double>[rho_basis->npw];
+    std::complex<double> *vloc_at = new std::complex<double>[rho_basis->npw];
     rho_basis->real2recip(phi_comp_R, phi_comp_G);
-    for (int ig = 0; ig < rho_basis->npw; ig++)
+
+    for (int it = 0; it < GlobalC::ucell.ntype; it++)
     {
-        if (rho_basis->gg[ig] >= 1.0e-12) // LiuXh 20180410
+        for (int ia = 0; ia < GlobalC::ucell.atoms[it].na; ia++)
         {
-            const double fac = ModuleBase::e2 * ModuleBase::FOUR_PI / (cell.tpiba2 * rho_basis->gg[ig]);
-            comp_reci[ig] = phi_comp_G[ig] / fac;
-        }
-    }
-    rho_basis->recip2real(comp_reci, N_real);
-
-    complex<double> *vloc_g = new complex<double>[rho_basis->npw];
-    complex<double> *ng = new complex<double>[rho_basis->npw];
-    ModuleBase::GlobalFunc::ZEROS(vloc_g, rho_basis->npw);
-    ModuleBase::GlobalFunc::ZEROS(ng, rho_basis->npw);
 
-    double* Porter = new double[rho_basis->nrxx];
-    for (int ir = 0; ir < rho_basis->nrxx; ir++)
-        Porter[ir] = rho[0][ir];
+            // cout << GlobalC::ucell.atoms[it].zv << endl;
+            for (int ig = 0; ig < rho_basis->npw; ig++)
+            {   
+                complex<double> phase = exp( ModuleBase::NEG_IMAG_UNIT *ModuleBase::TWO_PI * ( rho_basis->gcar[ig] * GlobalC::ucell.atoms[it].tau[ia]));
+                //vloc for each atom
+                vloc_at[ig] = GlobalC::ppcell.vloc(it, rho_basis->ig2igg[ig]) * phase;
+                if(rho_basis->ig_gge0 == ig)
+                {
+                    N[ig] = GlobalC::ucell.atoms[it].zv / GlobalC::ucell.omega;
+                }
+                else
+                {
+                    const double fac
+                        = ModuleBase::e2 * ModuleBase::FOUR_PI / (GlobalC::ucell.tpiba2 * rho_basis->gg[ig]);
+
+                    N[ig] = -vloc_at[ig] / fac;
+                }
+                
+                //force for each atom
+                force_comp(iat, 0) += rho_basis->gcar[ig][0] * imag(conj(phi_comp_G[ig]) * N[ig]);
+                force_comp(iat, 1) += rho_basis->gcar[ig][1] * imag(conj(phi_comp_G[ig]) * N[ig]);
+                force_comp(iat, 2) += rho_basis->gcar[ig][2] * imag(conj(phi_comp_G[ig]) * N[ig]);
+            }
+                
+            force_comp(iat, 0) *= (GlobalC::ucell.tpiba * GlobalC::ucell.omega);
+            force_comp(iat, 1) *= (GlobalC::ucell.tpiba * GlobalC::ucell.omega);
+            force_comp(iat, 2) *= (GlobalC::ucell.tpiba * GlobalC::ucell.omega);
 
-    rho_basis->real2recip(GlobalC::pot.vltot,vloc_g);// now n is vloc in Recispace
-    for (int ig = 0; ig < rho_basis->npw; ig++) {
-        if (rho_basis->gg[ig] >= 1.0e-12) // LiuXh 20180410
-        {
-            const double fac = ModuleBase::e2 * ModuleBase::FOUR_PI /
-                               (cell.tpiba2 * rho_basis->gg[ig]);
+            // cout << "Force1(Ry / Bohr)" << iat << ":"
+            //      << " " << force_comp(iat, 0) << " " << force_comp(iat, 1) << " " << force_comp(iat, 2) << endl;
 
-            ng[ig] = -vloc_g[ig] / fac;
+            ++iat;
         }
     }
-    double *nr = new double[rho_basis->nrxx];
-    rho_basis->recip2real(ng, nr);
-
-    double *diff = new double[rho_basis->nrxx];
-    double *diff2 = new double[rho_basis->nrxx];
-    for (int i = 0; i < rho_basis->nrxx; i++)
-    {
-        diff[i] = N_real[i] - nr[i];
-        diff2[i] = N_real[i] - Porter[i];
-    }
-
-    for (int i = 0; i < rho_basis->nrxx;i++)
-    {
-        diff[i] -= Porter[i];
-    }
-
-    delete[] phi_comp_R;
+    Parallel_Reduce::reduce_double_pool(force_comp.c, force_comp.nr * force_comp.nc);
+    delete[] vloc_at;
+    delete[] N;
     delete[] phi_comp_G;
-    delete[] comp_reci;
-    delete[] diff;
-    delete[] vloc_g;
-    delete[] Porter;
-}
+}
\ No newline at end of file
diff --git a/source/module_surchem/cal_totn.cpp b/source/module_surchem/cal_totn.cpp
index e2a183fab3..3cea7e6a0d 100644
--- a/source/module_surchem/cal_totn.cpp
+++ b/source/module_surchem/cal_totn.cpp
@@ -3,7 +3,7 @@
 void surchem::cal_totn(const UnitCell &cell, ModulePW::PW_Basis* rho_basis,
                        const complex<double> *Porter_g, complex<double> *N,
                        complex<double> *TOTN) {
-    // vloc to N8
+    // vloc to N
     complex<double> *vloc_g = new complex<double>[rho_basis->npw];
     ModuleBase::GlobalFunc::ZEROS(vloc_g, rho_basis->npw);
 
@@ -25,7 +25,6 @@ void surchem::cal_totn(const UnitCell &cell, ModulePW::PW_Basis* rho_basis,
         TOTN[ig] = N[ig] - Porter_g[ig];
     }
 
-    // delete[] comp_real;
     delete[] vloc_g;
     return;
 }
\ No newline at end of file
diff --git a/source/module_surchem/cal_vel.cpp b/source/module_surchem/cal_vel.cpp
index fe6b246cc1..88f246053d 100644
--- a/source/module_surchem/cal_vel.cpp
+++ b/source/module_surchem/cal_vel.cpp
@@ -57,7 +57,6 @@ ModuleBase::matrix surchem::cal_vel(const UnitCell &cell,
     ModuleBase::TITLE("surchem", "cal_vel");
     ModuleBase::timer::tick("surchem", "cal_vel");
 
-    // double *TOTN_real = new double[pwb.nrxx];
     rho_basis->recip2real(TOTN, TOTN_real);
 
     // -4pi * TOTN(G)
@@ -93,7 +92,6 @@ ModuleBase::matrix surchem::cal_vel(const UnitCell &cell,
 
     double *phi_tilda_R = new double[rho_basis->nrxx];
     double *phi_tilda_R0 = new double[rho_basis->nrxx];
-    // double *delta_phi_R = new double[pwb.nrxx];
 
     rho_basis->recip2real(Sol_phi, phi_tilda_R);
     rho_basis->recip2real(Sol_phi0, phi_tilda_R0);
@@ -144,11 +142,8 @@ ModuleBase::matrix surchem::cal_vel(const UnitCell &cell,
     delete[] epsilon;
     delete[] epsilon0;
     delete[] tmp_Vel;
-    // delete[] Vel2;
-    // delete[] TOTN_real;
     delete[] phi_tilda_R;
     delete[] phi_tilda_R0;
-    // delete[] delta_phi_R;
 
     ModuleBase::timer::tick("surchem", "cal_vel");
     return Vel;
diff --git a/source/module_surchem/corrected_energy.cpp b/source/module_surchem/corrected_energy.cpp
index 7f0aaabe6f..55894e8fa3 100644
--- a/source/module_surchem/corrected_energy.cpp
+++ b/source/module_surchem/corrected_energy.cpp
@@ -1,6 +1,6 @@
 #include "surchem.h"
 
-double surchem::cal_Ael(const UnitCell &cell, ModulePW::PW_Basis* rho_basis)
+double surchem::cal_Ael(const UnitCell &cell, ModulePW::PW_Basis *rho_basis)
 {
     double Ael = 0.0;
     for (int ir = 0; ir < rho_basis->nrxx; ir++)
@@ -8,17 +8,100 @@ double surchem::cal_Ael(const UnitCell &cell, ModulePW::PW_Basis* rho_basis)
         Ael -= TOTN_real[ir] * delta_phi[ir];
     }
     Parallel_Reduce::reduce_double_pool(Ael);
-    Ael = Ael * cell.omega / rho_basis->nxyz;  // unit Ry
-    //cout << "Ael: " << Ael << endl;
+    Ael = Ael * cell.omega / rho_basis->nxyz;
+    // cout << "Ael: " << Ael << endl;
     return Ael;
 }
 
-double surchem::cal_Acav(const UnitCell &cell,  ModulePW::PW_Basis* rho_basis)
+double surchem::cal_Acav(const UnitCell &cell, ModulePW::PW_Basis *rho_basis)
 {
     double Acav = 0.0;
     Acav = GlobalV::tau * qs;
-    Acav = Acav * cell.omega / rho_basis->nxyz;  // unit Ry
+    Acav = Acav * cell.omega / rho_basis->nxyz; // unit Ry
     Parallel_Reduce::reduce_double_pool(Acav);
-    //cout << "Acav: " << Acav << endl;
+    // cout << "Acav: " << Acav << endl;
     return Acav;
+}
+
+void surchem::cal_Acomp(const UnitCell &cell,
+                        ModulePW::PW_Basis *rho_basis,
+                        const double *const *const rho,
+                        vector<double> &res)
+{
+    double Acomp1 = 0.0; // self
+    double Acomp2 = 0.0; // electrons
+    double Acomp3 = 0.0; // nuclear
+
+    complex<double> *phi_comp_G = new complex<double>[rho_basis->npw];
+    complex<double> *comp_reci = new complex<double>[rho_basis->npw];
+    double *phi_comp_R = new double[rho_basis->nrxx];
+
+    ModuleBase::GlobalFunc::ZEROS(phi_comp_G, rho_basis->npw);
+    ModuleBase::GlobalFunc::ZEROS(comp_reci, rho_basis->npw);
+    ModuleBase::GlobalFunc::ZEROS(phi_comp_R, rho_basis->nrxx);
+
+    // part1: comp & comp
+    rho_basis->real2recip(comp_real, comp_reci);
+    for (int ig = 0; ig < rho_basis->npw; ig++)
+    {
+        if (rho_basis->gg[ig] >= 1.0e-12) // LiuXh 20180410
+        {
+            const double fac = ModuleBase::e2 * ModuleBase::FOUR_PI / (cell.tpiba2 * rho_basis->gg[ig]);
+            Acomp1 += (conj(comp_reci[ig]) * comp_reci[ig]).real() * fac;
+            phi_comp_G[ig] = fac * comp_reci[ig];
+        }
+    }
+    // 0.5 for double counting
+    Parallel_Reduce::reduce_double_pool(Acomp1);
+    Acomp1 *= 0.5 * cell.omega;
+
+    // electrons
+    double *n_elec_R = new double[rho_basis->nrxx];
+    for (int i = 0; i < rho_basis->nrxx; i++)
+        n_elec_R[i] = 0.0;
+    const int nspin0 = (GlobalV::NSPIN == 2) ? 2 : 1;
+    for (int is = 0; is < nspin0; is++)
+        for (int ir = 0; ir < rho_basis->nrxx; ir++)
+            n_elec_R[ir] += rho[is][ir];
+
+    // nuclear = TOTN_R + n_elec_R
+    double *n_nucl_R = new double[rho_basis->nrxx];
+    for (int ir = 0; ir < rho_basis->nrxx; ir++)
+    {
+        n_nucl_R[ir] = TOTN_real[ir] + n_elec_R[ir];
+    }
+
+    // part2: electrons
+    rho_basis->recip2real(phi_comp_G, phi_comp_R);
+    for (int ir = 0; ir < rho_basis->nrxx; ir++)
+    {
+        Acomp2 += n_elec_R[ir] * phi_comp_R[ir];
+    }
+    Parallel_Reduce::reduce_double_pool(Acomp2);
+    Acomp2 = Acomp2 * cell.omega / rho_basis->nxyz;
+
+    // part3: nuclear
+    for (int ir = 0; ir < rho_basis->nrxx; ir++)
+    {
+        Acomp3 += n_nucl_R[ir] * phi_comp_R[ir];
+    }
+    Parallel_Reduce::reduce_double_pool(Acomp3);
+    Acomp3 = Acomp3 * cell.omega / rho_basis->nxyz;
+
+    delete[] phi_comp_G;
+    delete[] phi_comp_R;
+    delete[] comp_reci;
+
+    delete[] n_elec_R;
+    delete[] n_nucl_R;
+
+    // cout << "Acomp1(self, Ry): " << Acomp1 << endl;
+    // cout << "Acomp2(electrons, Ry): " << Acomp2 << endl;
+    // cout << "Acomp3(nuclear, Ry): " << Acomp3 << endl;
+
+    res[0] = Acomp1;
+    res[1] = Acomp2;
+    res[2] = -Acomp3;
+
+    // return Acomp1 + Acomp2 - Acomp3;
 }
\ No newline at end of file
diff --git a/source/module_surchem/surchem.cpp b/source/module_surchem/surchem.cpp
index 83f199e5ac..77de8158a3 100644
--- a/source/module_surchem/surchem.cpp
+++ b/source/module_surchem/surchem.cpp
@@ -2,7 +2,7 @@
 
 namespace GlobalC
 {
-  surchem solvent_model;
+surchem solvent_model;
 }
 
 surchem::surchem()
@@ -10,10 +10,11 @@ surchem::surchem()
     TOTN_real = nullptr;
     delta_phi = nullptr;
     epspot = nullptr;
+    comp_real = nullptr;
+    phi_comp_R = nullptr;
     Vcav = ModuleBase::matrix();
     Vel = ModuleBase::matrix();
     qs = 0;
-    comp_chg_energy = 0;
 }
 
 void surchem::allocate(const int &nrxx, const int &nspin)
@@ -24,20 +25,32 @@ void surchem::allocate(const int &nrxx, const int &nspin)
     delete[] TOTN_real;
     delete[] delta_phi;
     delete[] epspot;
-    if(nrxx > 0)
+    delete[] comp_real;
+    delete[] phi_comp_R;
+    if (nrxx > 0)
     {
         TOTN_real = new double[nrxx];
         delta_phi = new double[nrxx];
         epspot = new double[nrxx];
+        comp_real = new double[nrxx];
+        phi_comp_R = new double[nrxx];
     }
     else
-        TOTN_real = delta_phi = epspot = nullptr;
+    {
+        TOTN_real = nullptr;
+        delta_phi = nullptr;
+        epspot = nullptr;
+        comp_real = nullptr;
+        phi_comp_R = nullptr;
+    }
     Vcav.create(nspin, nrxx);
     Vel.create(nspin, nrxx);
 
     ModuleBase::GlobalFunc::ZEROS(delta_phi, nrxx);
     ModuleBase::GlobalFunc::ZEROS(TOTN_real, nrxx);
     ModuleBase::GlobalFunc::ZEROS(epspot, nrxx);
+    ModuleBase::GlobalFunc::ZEROS(comp_real, nrxx);
+    ModuleBase::GlobalFunc::ZEROS(phi_comp_R, nrxx);
     return;
 }
 
@@ -45,4 +58,29 @@ surchem::~surchem()
 {
     delete[] TOTN_real;
     delete[] delta_phi;
+    delete[] epspot;
+    delete[] comp_real;
+    delete[] phi_comp_R;
+}
+
+void surchem::get_totn_reci(const UnitCell &cell, ModulePW::PW_Basis *rho_basis, complex<double> *totn_reci)
+{
+    double *tmp_totn_real = new double[rho_basis->nrxx];
+    double *tmp_comp_real = new double[rho_basis->nrxx];
+    complex<double> *comp_reci = new complex<double>[rho_basis->npw];
+    ModuleBase::GlobalFunc::ZEROS(tmp_totn_real, rho_basis->nrxx);
+    ModuleBase::GlobalFunc::ZEROS(tmp_comp_real, rho_basis->nrxx);
+    ModuleBase::GlobalFunc::ZEROS(comp_reci, rho_basis->npw);
+    add_comp_chg(cell, rho_basis, comp_q, comp_l, comp_center, comp_reci, comp_dim, false);
+    rho_basis->recip2real(comp_reci, tmp_comp_real);
+
+    for (int ir = 0; ir < rho_basis->nrxx; ir++)
+    {
+        tmp_totn_real[ir] = TOTN_real[ir] + tmp_comp_real[ir];
+    }
+
+    rho_basis->real2recip(tmp_totn_real, totn_reci);
+    delete[] tmp_totn_real;
+    delete[] tmp_comp_real;
+    delete[] comp_reci;
 }
\ No newline at end of file
diff --git a/source/module_surchem/surchem.h b/source/module_surchem/surchem.h
index 79a8ae85d4..93496fac7f 100644
--- a/source/module_surchem/surchem.h
+++ b/source/module_surchem/surchem.h
@@ -5,12 +5,12 @@
 #include "../module_base/global_variable.h"
 #include "../module_base/matrix.h"
 #include "../module_cell/unitcell.h"
+#include "../module_pw/pw_basis.h"
 #include "../src_parallel/parallel_reduce.h"
 #include "../src_pw/global.h"
 #include "../src_pw/structure_factor.h"
 #include "../src_pw/use_fft.h"
 #include "atom_in.h"
-#include "../module_pw/pw_basis.h"
 
 class surchem
 {
@@ -25,8 +25,9 @@ class surchem
     ModuleBase::matrix Vel;
     double qs;
 
-    // energy of compensating charge
-    double comp_chg_energy;
+    // compensating charge in real space (used by cal_Acomp)
+    double *comp_real;
+    double *phi_comp_R;
 
     // compensating charge params
     double comp_q;
@@ -38,12 +39,12 @@ class surchem
 
     void allocate(const int &nrxx, const int &nspin);
 
-    void cal_epsilon(ModulePW::PW_Basis* rho_basis, const double *PS_TOTN_real, double *epsilon, double *epsilon0);
+    void cal_epsilon(ModulePW::PW_Basis *rho_basis, const double *PS_TOTN_real, double *epsilon, double *epsilon0);
 
     void cal_pseudo(const UnitCell &cell,
-                           ModulePW::PW_Basis* rho_basis,
-                           const complex<double> *Porter_g,
-                           complex<double> *PS_TOTN);
+                    ModulePW::PW_Basis *rho_basis,
+                    const complex<double> *Porter_g,
+                    complex<double> *PS_TOTN);
 
     void add_comp_chg(const UnitCell &cell,
                       ModulePW::PW_Basis *rho_basis,
@@ -51,64 +52,85 @@ class surchem
                       double l,
                       double center,
                       complex<double> *NG,
-                      int dim);
+                      int dim,
+                      bool flag); // Set value of comp_reci[ig_gge0] when flag is true.
+
+    void cal_comp_force(ModuleBase::matrix &force_comp, ModulePW::PW_Basis *rho_basis);
 
-    void gauss_charge(const UnitCell &cell, ModulePW::PW_Basis* rho_basis, complex<double> *N);
+    void gauss_charge(const UnitCell &cell, ModulePW::PW_Basis *rho_basis, complex<double> *N);
 
     void cal_totn(const UnitCell &cell,
-                         ModulePW::PW_Basis* rho_basis,
-                         const complex<double> *Porter_g,
-                         complex<double> *N,
-                         complex<double> *TOTN);
-    void createcavity(const UnitCell &ucell, ModulePW::PW_Basis* rho_basis, const complex<double> *PS_TOTN, double *vwork);
+                  ModulePW::PW_Basis *rho_basis,
+                  const complex<double> *Porter_g,
+                  complex<double> *N,
+                  complex<double> *TOTN);
+    void createcavity(const UnitCell &ucell,
+                      ModulePW::PW_Basis *rho_basis,
+                      const complex<double> *PS_TOTN,
+                      double *vwork);
 
-    ModuleBase::matrix cal_vcav(const UnitCell &ucell, ModulePW::PW_Basis* rho_basis, complex<double> *PS_TOTN, int nspin);
+    ModuleBase::matrix cal_vcav(const UnitCell &ucell,
+                                ModulePW::PW_Basis *rho_basis,
+                                complex<double> *PS_TOTN,
+                                int nspin);
 
     ModuleBase::matrix cal_vel(const UnitCell &cell,
-                                     ModulePW::PW_Basis* rho_basis,
-                                      complex<double> *TOTN,
-                                      complex<double> *PS_TOTN,
-                                      int nspin);
-                            
+                               ModulePW::PW_Basis *rho_basis,
+                               complex<double> *TOTN,
+                               complex<double> *PS_TOTN,
+                               int nspin);
+
+    double cal_Ael(const UnitCell &cell, ModulePW::PW_Basis *rho_basis);
 
-    double cal_Ael(const UnitCell &cell, ModulePW::PW_Basis* rho_basis);
+    double cal_Acav(const UnitCell &cell, ModulePW::PW_Basis *rho_basis);
 
-    double cal_Acav(const UnitCell &cell, ModulePW::PW_Basis* rho_basis);
+    void cal_Acomp(const UnitCell &cell,
+                   ModulePW::PW_Basis *rho_basis,
+                   const double *const *const rho,
+                   vector<double> &res);
 
     void minimize_cg(const UnitCell &ucell,
-                            ModulePW::PW_Basis* rho_basis,
-                            double *d_eps,
-                            const complex<double> *tot_N,
-                            complex<double> *phi,
-                            int &ncgsol);
+                     ModulePW::PW_Basis *rho_basis,
+                     double *d_eps,
+                     const complex<double> *tot_N,
+                     complex<double> *phi,
+                     int &ncgsol);
 
     void Leps2(const UnitCell &ucell,
-                      ModulePW::PW_Basis* rho_basis,
-                      complex<double> *phi,
-                      double *epsilon, // epsilon from shapefunc, dim=nrxx
-                      complex<double> *gradphi_x, // dim=ngmc
-                      complex<double> *gradphi_y,
-                      complex<double> *gradphi_z,
-                      complex<double> *phi_work,
-                      complex<double> *lp);
+               ModulePW::PW_Basis *rho_basis,
+               complex<double> *phi,
+               double *epsilon, // epsilon from shapefunc, dim=nrxx
+               complex<double> *gradphi_x, // dim=ngmc
+               complex<double> *gradphi_y,
+               complex<double> *gradphi_z,
+               complex<double> *phi_work,
+               complex<double> *lp);
 
     ModuleBase::matrix v_correction(const UnitCell &cell,
-                                           ModulePW::PW_Basis* rho_basis,
-                                           const int &nspin,
-                                           const double *const *const rho);
-    
-    ModuleBase::matrix v_compensating(const UnitCell &cell, ModulePW::PW_Basis *pwb);
-
-    void test_V_to_N(ModuleBase::matrix &v, const UnitCell &cell, ModulePW::PW_Basis *rho_basis, const double *const *const rho);
-    
-    void cal_force_sol(const UnitCell &cell, ModulePW::PW_Basis* rho_basis , ModuleBase::matrix& forcesol);
- 
+                                    ModulePW::PW_Basis *rho_basis,
+                                    const int &nspin,
+                                    const double *const *const rho);
+
+    ModuleBase::matrix v_compensating(const UnitCell &cell,
+                                      ModulePW::PW_Basis *rho_basis,
+                                      const int &nspin,
+                                      const double *const *const rho);
+
+    void test_V_to_N(ModuleBase::matrix &v,
+                     const UnitCell &cell,
+                     ModulePW::PW_Basis *rho_basis,
+                     const double *const *const rho);
+
+    void cal_force_sol(const UnitCell &cell, ModulePW::PW_Basis *rho_basis, ModuleBase::matrix &forcesol);
+
+    void get_totn_reci(const UnitCell &cell, ModulePW::PW_Basis *rho_basis, complex<double> *totn_reci);
+
   private:
 };
 
 namespace GlobalC
 {
-  extern surchem solvent_model;
+extern surchem solvent_model;
 }
 
 #endif
diff --git a/source/src_io/to_wannier90.cpp b/source/src_io/to_wannier90.cpp
index 103a8a42b3..02c4bc9cc4 100644
--- a/source/src_io/to_wannier90.cpp
+++ b/source/src_io/to_wannier90.cpp
@@ -1,1876 +1,1923 @@
 #include "to_wannier90.h"
+
 #include "../src_pw/global.h"
 #ifdef __LCAO
 #include "../src_lcao/global_fp.h" // mohan add 2021-01-30, this module should be modified
 #endif
-#include "../module_base/math_integral.h" 
+#include "../module_base/math_integral.h"
+#include "../module_base/math_polyint.h"
 #include "../module_base/math_sphbes.h"
-#include "../module_base/math_polyint.h" 
-#include "../module_base/math_ylmreal.h" 
+#include "../module_base/math_ylmreal.h"
 
 toWannier90::toWannier90(int num_kpts, ModuleBase::Matrix3 recip_lattice)
 {
-	this->num_kpts = num_kpts;
-	this->recip_lattice = recip_lattice;
-	if(GlobalV::NSPIN==1 || GlobalV::NSPIN==4) this->cal_num_kpts = this->num_kpts;
-	else if(GlobalV::NSPIN==2) this->cal_num_kpts = this->num_kpts/2;
-
+    this->num_kpts = num_kpts;
+    this->recip_lattice = recip_lattice;
+    if (GlobalV::NSPIN == 1 || GlobalV::NSPIN == 4)
+        this->cal_num_kpts = this->num_kpts;
+    else if (GlobalV::NSPIN == 2)
+        this->cal_num_kpts = this->num_kpts / 2;
 }
 
-toWannier90::toWannier90(int num_kpts, ModuleBase::Matrix3 recip_lattice, std::complex<double>*** wfc_k_grid_in)
+toWannier90::toWannier90(int num_kpts, ModuleBase::Matrix3 recip_lattice, std::complex<double> ***wfc_k_grid_in)
 {
     this->wfc_k_grid = wfc_k_grid_in;
     this->num_kpts = num_kpts;
-	this->recip_lattice = recip_lattice;
-	if(GlobalV::NSPIN==1 || GlobalV::NSPIN==4) this->cal_num_kpts = this->num_kpts;
-	else if(GlobalV::NSPIN==2) this->cal_num_kpts = this->num_kpts/2;
-
+    this->recip_lattice = recip_lattice;
+    if (GlobalV::NSPIN == 1 || GlobalV::NSPIN == 4)
+        this->cal_num_kpts = this->num_kpts;
+    else if (GlobalV::NSPIN == 2)
+        this->cal_num_kpts = this->num_kpts / 2;
 }
 
 toWannier90::~toWannier90()
 {
-	if(num_exclude_bands > 0) delete[] exclude_bands;
-	if(GlobalV::BASIS_TYPE == "lcao") delete unk_inLcao;
+    if (num_exclude_bands > 0)
+        delete[] exclude_bands;
+    if (GlobalV::BASIS_TYPE == "lcao")
+        delete unk_inLcao;
 }
 
-
-void toWannier90::init_wannier(const psi::Psi<std::complex<double>>* psi)
-{	
-	this->read_nnkp();
-	
-	if(GlobalV::NSPIN == 2)
-	{
-		wannier_spin = INPUT.wannier_spin;
-		if(wannier_spin == "up") start_k_index = 0;
-		else if(wannier_spin == "down") start_k_index = num_kpts/2;
-		else
-		{
-			ModuleBase::WARNING_QUIT("toWannier90::init_wannier","Error wannier_spin set,is not \"up\" or \"down\" ");
-		}
-	}
-	
-	if(GlobalV::BASIS_TYPE == "pw")
-	{
-		writeUNK(*psi);
-		outEIG();
-		cal_Mmn(*psi);
-		cal_Amn(*psi);
-	}
+void toWannier90::init_wannier(const psi::Psi<std::complex<double>> *psi)
+{
+    this->read_nnkp();
+
+    if (GlobalV::NSPIN == 2)
+    {
+        wannier_spin = INPUT.wannier_spin;
+        if (wannier_spin == "up")
+            start_k_index = 0;
+        else if (wannier_spin == "down")
+            start_k_index = num_kpts / 2;
+        else
+        {
+            ModuleBase::WARNING_QUIT("toWannier90::init_wannier", "Error: wannier_spin must be \"up\" or \"down\"");
+        }
+    }
+
+    if (GlobalV::BASIS_TYPE == "pw")
+    {
+        writeUNK(*psi);
+        outEIG();
+        cal_Mmn(*psi);
+        cal_Amn(*psi);
+    }
 #ifdef __LCAO
-	else if(GlobalV::BASIS_TYPE == "lcao")
-	{
-		getUnkFromLcao();
-		cal_Amn(this->unk_inLcao[0]);
-		cal_Mmn(this->unk_inLcao[0]);
-		writeUNK(this->unk_inLcao[0]);
-		outEIG();
-	}
+    else if (GlobalV::BASIS_TYPE == "lcao")
+    {
+        getUnkFromLcao();
+        cal_Amn(this->unk_inLcao[0]);
+        cal_Mmn(this->unk_inLcao[0]);
+        writeUNK(this->unk_inLcao[0]);
+        outEIG();
+    }
 #endif
 
-	/*
-	if(GlobalV::MY_RANK==0)
-	{
-		if(GlobalV::BASIS_TYPE == "pw")
-		{
-			cal_Amn(GlobalC::wf.evc);
-			cal_Mmn(GlobalC::wf.evc);
-			writeUNK(GlobalC::wf.evc);
-			outEIG();
-		}
-		else if(GlobalV::BASIS_TYPE == "lcao")
-		{
-			getUnkFromLcao();
-			cal_Amn(this->unk_inLcao);
-			cal_Mmn(this->unk_inLcao);
-			writeUNK(this->unk_inLcao);
-			outEIG();
-		}
-	}
-	*/
-	
+    /*
+    if(GlobalV::MY_RANK==0)
+    {
+        if(GlobalV::BASIS_TYPE == "pw")
+        {
+            cal_Amn(GlobalC::wf.evc);
+            cal_Mmn(GlobalC::wf.evc);
+            writeUNK(GlobalC::wf.evc);
+            outEIG();
+        }
+        else if(GlobalV::BASIS_TYPE == "lcao")
+        {
+            getUnkFromLcao();
+            cal_Amn(this->unk_inLcao);
+            cal_Mmn(this->unk_inLcao);
+            writeUNK(this->unk_inLcao);
+            outEIG();
+        }
+    }
+    */
 }
 
 void toWannier90::read_nnkp()
 {
-	// read *.nnkp file
-	// ��� ����ʸ������ʸ��k�����꣬��̽���ͶӰ��ÿ��k��Ľ���k�㣬��Ҫ�ų����ܴ�ָ��
-	
-	wannier_file_name = INPUT.NNKP;
-	wannier_file_name = wannier_file_name.substr(0,wannier_file_name.length() - 5);
-
-	GlobalV::ofs_running << "reading the " << wannier_file_name << ".nnkp file." << std::endl;
-	
-	std::ifstream nnkp_read(INPUT.NNKP.c_str(), ios::in);
-	
-	if(!nnkp_read) ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error during readin parameters.");
-	
-	if( ModuleBase::GlobalFunc::SCAN_BEGIN(nnkp_read,"real_lattice") )
-	{
-		ModuleBase::Matrix3 real_lattice_nnkp;
-		nnkp_read >> real_lattice_nnkp.e11 >> real_lattice_nnkp.e12 >> real_lattice_nnkp.e13
-				  >> real_lattice_nnkp.e21 >> real_lattice_nnkp.e22 >> real_lattice_nnkp.e23
-				  >> real_lattice_nnkp.e31 >> real_lattice_nnkp.e32 >> real_lattice_nnkp.e33;
-				  
-		real_lattice_nnkp = real_lattice_nnkp / GlobalC::ucell.lat0_angstrom;
-		
-		if(abs(real_lattice_nnkp.e11 - GlobalC::ucell.latvec.e11) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error real_lattice in *.nnkp file");
-		if(abs(real_lattice_nnkp.e12 - GlobalC::ucell.latvec.e12) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error real_lattice in *.nnkp file");
-		if(abs(real_lattice_nnkp.e13 - GlobalC::ucell.latvec.e13) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error real_lattice in *.nnkp file");
-		if(abs(real_lattice_nnkp.e21 - GlobalC::ucell.latvec.e21) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error real_lattice in *.nnkp file");
-		if(abs(real_lattice_nnkp.e22 - GlobalC::ucell.latvec.e22) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error real_lattice in *.nnkp file");
-		if(abs(real_lattice_nnkp.e23 - GlobalC::ucell.latvec.e23) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error real_lattice in *.nnkp file");
-		if(abs(real_lattice_nnkp.e31 - GlobalC::ucell.latvec.e31) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error real_lattice in *.nnkp file");
-		if(abs(real_lattice_nnkp.e32 - GlobalC::ucell.latvec.e32) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error real_lattice in *.nnkp file");
-		if(abs(real_lattice_nnkp.e33 - GlobalC::ucell.latvec.e33) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error real_lattice in *.nnkp file");
-		
-	}
-	
-	if( ModuleBase::GlobalFunc::SCAN_BEGIN(nnkp_read,"recip_lattice") )
-	{
-		ModuleBase::Matrix3 recip_lattice_nnkp;
-		nnkp_read >> recip_lattice_nnkp.e11 >> recip_lattice_nnkp.e12 >> recip_lattice_nnkp.e13
-				  >> recip_lattice_nnkp.e21 >> recip_lattice_nnkp.e22 >> recip_lattice_nnkp.e23
-				  >> recip_lattice_nnkp.e31 >> recip_lattice_nnkp.e32 >> recip_lattice_nnkp.e33;
-		
-		const double tpiba_angstrom = ModuleBase::TWO_PI / GlobalC::ucell.lat0_angstrom;
-		recip_lattice_nnkp = recip_lattice_nnkp / tpiba_angstrom;
-		
-		if(abs(recip_lattice_nnkp.e11 - GlobalC::ucell.G.e11) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error recip_lattice in *.nnkp file");
-		if(abs(recip_lattice_nnkp.e12 - GlobalC::ucell.G.e12) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error recip_lattice in *.nnkp file");
-		if(abs(recip_lattice_nnkp.e13 - GlobalC::ucell.G.e13) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error recip_lattice in *.nnkp file");
-		if(abs(recip_lattice_nnkp.e21 - GlobalC::ucell.G.e21) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error recip_lattice in *.nnkp file");
-		if(abs(recip_lattice_nnkp.e22 - GlobalC::ucell.G.e22) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error recip_lattice in *.nnkp file");
-		if(abs(recip_lattice_nnkp.e23 - GlobalC::ucell.G.e23) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error recip_lattice in *.nnkp file");
-		if(abs(recip_lattice_nnkp.e31 - GlobalC::ucell.G.e31) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error recip_lattice in *.nnkp file");
-		if(abs(recip_lattice_nnkp.e32 - GlobalC::ucell.G.e32) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error recip_lattice in *.nnkp file");
-		if(abs(recip_lattice_nnkp.e33 - GlobalC::ucell.G.e33) > 1.0e-4) 
-			ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error recip_lattice in *.nnkp file");
-	}
-	
-	if( ModuleBase::GlobalFunc::SCAN_BEGIN(nnkp_read,"kpoints") )
-	{
-		int numkpt_nnkp;
-		ModuleBase::GlobalFunc::READ_VALUE(nnkp_read, numkpt_nnkp);
-		if( (GlobalV::NSPIN == 1 || GlobalV::NSPIN == 4) && numkpt_nnkp != GlobalC::kv.nkstot ) ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error kpoints in *.nnkp file");
-		else if(GlobalV::NSPIN == 2 && numkpt_nnkp != (GlobalC::kv.nkstot/2))	ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error kpoints in *.nnkp file");
-	
-		ModuleBase::Vector3<double> *kpoints_direct_nnkp = new ModuleBase::Vector3<double>[numkpt_nnkp];
-		for(int ik = 0; ik < numkpt_nnkp; ik++)
-		{
-			nnkp_read >> kpoints_direct_nnkp[ik].x >> kpoints_direct_nnkp[ik].y >> kpoints_direct_nnkp[ik].z;
-			if(abs(kpoints_direct_nnkp[ik].x - GlobalC::kv.kvec_d[ik].x) > 1.0e-4) 
-				ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error kpoints in *.nnkp file");
-			if(abs(kpoints_direct_nnkp[ik].y - GlobalC::kv.kvec_d[ik].y) > 1.0e-4) 
-				ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error kpoints in *.nnkp file");
-			if(abs(kpoints_direct_nnkp[ik].z - GlobalC::kv.kvec_d[ik].z) > 1.0e-4) 
-				ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","Error kpoints in *.nnkp file");
-		}
-				
-		delete[] kpoints_direct_nnkp;
-		
-		//�ж�gamma only
-		ModuleBase::Vector3<double> my_gamma_point(0.0,0.0,0.0);
-		//if( (GlobalC::kv.nkstot == 1) && (GlobalC::kv.kvec_d[0] == my_gamma_point) ) gamma_only_wannier = true;
-	} 
-	
-	if(GlobalV::NSPIN!=4)
-	{
-		if( ModuleBase::GlobalFunc::SCAN_BEGIN(nnkp_read,"projections") )
-		{
-			ModuleBase::GlobalFunc::READ_VALUE(nnkp_read, num_wannier);
-			// test
-			//GlobalV::ofs_running << "num_wannier = " << num_wannier << std::endl;
-			// test
-			if(num_wannier < 0)
-			{
-				ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","wannier number is lower than 0");
-			}
-			
-			R_centre = new ModuleBase::Vector3<double>[num_wannier];
-			L = new int[num_wannier];
-			m = new int[num_wannier];
-			rvalue = new int[num_wannier];
-			ModuleBase::Vector3<double>* z_axis = new ModuleBase::Vector3<double>[num_wannier];
-			ModuleBase::Vector3<double>* x_axis = new ModuleBase::Vector3<double>[num_wannier];
-			alfa = new double[num_wannier];
-			
-			
-			for(int count = 0; count < num_wannier; count++)
-			{
-				nnkp_read >> R_centre[count].x >> R_centre[count].y >> R_centre[count].z;
-				nnkp_read >> L[count] >> m[count];
-				ModuleBase::GlobalFunc::READ_VALUE(nnkp_read,rvalue[count]);
-				nnkp_read >> z_axis[count].x >> z_axis[count].y >> z_axis[count].z;
-				nnkp_read >> x_axis[count].x >> x_axis[count].y >> x_axis[count].z;
-				ModuleBase::GlobalFunc::READ_VALUE(nnkp_read,alfa[count]);			
-			}
-			
-		}
-	}
-	else
-	{
-		ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","noncolin spin is not done yet");
-	}
-
-	if( ModuleBase::GlobalFunc::SCAN_BEGIN(nnkp_read,"nnkpts") )
-	{
-		ModuleBase::GlobalFunc::READ_VALUE(nnkp_read, nntot);
-		nnlist.resize(GlobalC::kv.nkstot);
-		nncell.resize(GlobalC::kv.nkstot);
-		for(int ik = 0; ik < GlobalC::kv.nkstot; ik++)
-		{
-			nnlist[ik].resize(nntot);
-			nncell[ik].resize(nntot);
-		}
-		
-		int numkpt_nnkp;
-		if(GlobalV::NSPIN == 1 || GlobalV::NSPIN == 4) numkpt_nnkp = GlobalC::kv.nkstot;
-		else if(GlobalV::NSPIN == 2) numkpt_nnkp = GlobalC::kv.nkstot/2;
-		else throw std::runtime_error("numkpt_nnkp uninitialized in "+ModuleBase::GlobalFunc::TO_STRING(__FILE__)+" line "+ModuleBase::GlobalFunc::TO_STRING(__LINE__));
-		
-		for(int ik = 0; ik < numkpt_nnkp; ik++)
-		{
-			for(int ib = 0; ib < nntot; ib++)
-			{
-				int ik_nnkp;
-				nnkp_read >> ik_nnkp;
-				if(ik_nnkp != (ik+1)) ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","error nnkpts in *.nnkp file");
-				nnkp_read >> nnlist[ik][ib];
-				nnkp_read >> nncell[ik][ib].x >> nncell[ik][ib].y >> nncell[ik][ib].z;
-				nnlist[ik][ib]--; // this is c++ , begin from 0
-			}
-			
-		}
-	}
-	
-	if( ModuleBase::GlobalFunc::SCAN_BEGIN(nnkp_read,"exclude_bands") )
-	{
-		ModuleBase::GlobalFunc::READ_VALUE(nnkp_read, num_exclude_bands);
-		if(num_exclude_bands > 0) exclude_bands = new int[num_exclude_bands];
-		else if(num_exclude_bands < 0) ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","the exclude bands is wrong , please check *.nnkp file.");
-		
-		if(num_exclude_bands > 0)
-		{
-			for(int i = 0; i < num_exclude_bands; i++)
-			{
-				ModuleBase::GlobalFunc::READ_VALUE(nnkp_read, exclude_bands[i]);
-				exclude_bands[i]--; // this is c++ , begin from 0
-			}
-		}
-	}
-	
-	// test by jingan
-	//GlobalV::ofs_running << "num_exclude_bands = " << num_exclude_bands << std::endl;
-	//for(int i = 0; i < num_exclude_bands; i++)
-	//{
-	//	GlobalV::ofs_running << "exclude_bands : " << exclude_bands[i] << std::endl;
-	//}
-	// test by jingan
-	
-	nnkp_read.close();
-	
-	// ������̽�������
-	for(int i = 0; i < num_wannier; i++)
-	{
-		R_centre[i] = R_centre[i] * GlobalC::ucell.latvec;
-		m[i] = m[i] - 1; // ABACUS and wannier90 �ԴŽǶ���m�Ķ��岻һ����ABACUS�Ǵ�0��ʼ�ģ�wannier90�Ǵ�1��ʼ��
-	}
-	
-	// test by jingan
-	//GlobalV::ofs_running << "num_wannier is " << num_wannier << std::endl;
-	//for(int i = 0; i < num_wannier; i++)
-	//{
-	//	GlobalV::ofs_running << "num_wannier" << std::endl;
-	//	GlobalV::ofs_running << L[i] << " " << m[i] << " " << rvalue[i] << " " << alfa[i] << std::endl;
-	//}
-	// test by jingan
-	
-	// ����exclude_bands
-	tag_cal_band = new bool[GlobalV::NBANDS];
-	if(GlobalV::NBANDS <= num_exclude_bands) ModuleBase::WARNING_QUIT("toWannier90::read_nnkp","you set the band numer is not enough, please add bands number.");
-	if(num_exclude_bands == 0)
-	{
-		for(int ib = 0; ib < GlobalV::NBANDS; ib++) tag_cal_band[ib] = true;
-	}
-	else
-	{
-		for(int ib = 0; ib < GlobalV::NBANDS; ib++)
-		{
-			tag_cal_band[ib] = true;
-			for(int ibb = 0; ibb < num_exclude_bands; ibb++)
-			{
-				if(exclude_bands[ibb] == ib) 
-				{
-					tag_cal_band[ib] = false;
-					break;
-				}
-			}
-		}
-	}
-	
-	if(num_exclude_bands < 0) num_bands = GlobalV::NBANDS;
-	else num_bands = GlobalV::NBANDS - num_exclude_bands;
-	
-	
+    // read *.nnkp file
+
+    wannier_file_name = INPUT.NNKP;
+    wannier_file_name = wannier_file_name.substr(0, wannier_file_name.length() - 5);
+
+    GlobalV::ofs_running << "reading the " << wannier_file_name << ".nnkp file." << std::endl;
+
+    std::ifstream nnkp_read(INPUT.NNKP.c_str(), ios::in);
+
+    if (!nnkp_read)
+        ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error during reading parameters.");
+
+    if (ModuleBase::GlobalFunc::SCAN_BEGIN(nnkp_read, "real_lattice"))
+    {
+        ModuleBase::Matrix3 real_lattice_nnkp;
+        nnkp_read >> real_lattice_nnkp.e11 >> real_lattice_nnkp.e12 >> real_lattice_nnkp.e13 >> real_lattice_nnkp.e21
+            >> real_lattice_nnkp.e22 >> real_lattice_nnkp.e23 >> real_lattice_nnkp.e31 >> real_lattice_nnkp.e32
+            >> real_lattice_nnkp.e33;
+
+        real_lattice_nnkp = real_lattice_nnkp / GlobalC::ucell.lat0_angstrom;
+
+        if (abs(real_lattice_nnkp.e11 - GlobalC::ucell.latvec.e11) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error real_lattice in *.nnkp file");
+        if (abs(real_lattice_nnkp.e12 - GlobalC::ucell.latvec.e12) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error real_lattice in *.nnkp file");
+        if (abs(real_lattice_nnkp.e13 - GlobalC::ucell.latvec.e13) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error real_lattice in *.nnkp file");
+        if (abs(real_lattice_nnkp.e21 - GlobalC::ucell.latvec.e21) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error real_lattice in *.nnkp file");
+        if (abs(real_lattice_nnkp.e22 - GlobalC::ucell.latvec.e22) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error real_lattice in *.nnkp file");
+        if (abs(real_lattice_nnkp.e23 - GlobalC::ucell.latvec.e23) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error real_lattice in *.nnkp file");
+        if (abs(real_lattice_nnkp.e31 - GlobalC::ucell.latvec.e31) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error real_lattice in *.nnkp file");
+        if (abs(real_lattice_nnkp.e32 - GlobalC::ucell.latvec.e32) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error real_lattice in *.nnkp file");
+        if (abs(real_lattice_nnkp.e33 - GlobalC::ucell.latvec.e33) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error real_lattice in *.nnkp file");
+    }
+
+    if (ModuleBase::GlobalFunc::SCAN_BEGIN(nnkp_read, "recip_lattice"))
+    {
+        ModuleBase::Matrix3 recip_lattice_nnkp;
+        nnkp_read >> recip_lattice_nnkp.e11 >> recip_lattice_nnkp.e12 >> recip_lattice_nnkp.e13
+            >> recip_lattice_nnkp.e21 >> recip_lattice_nnkp.e22 >> recip_lattice_nnkp.e23 >> recip_lattice_nnkp.e31
+            >> recip_lattice_nnkp.e32 >> recip_lattice_nnkp.e33;
+
+        const double tpiba_angstrom = ModuleBase::TWO_PI / GlobalC::ucell.lat0_angstrom;
+        recip_lattice_nnkp = recip_lattice_nnkp / tpiba_angstrom;
+
+        if (abs(recip_lattice_nnkp.e11 - GlobalC::ucell.G.e11) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error recip_lattice in *.nnkp file");
+        if (abs(recip_lattice_nnkp.e12 - GlobalC::ucell.G.e12) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error recip_lattice in *.nnkp file");
+        if (abs(recip_lattice_nnkp.e13 - GlobalC::ucell.G.e13) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error recip_lattice in *.nnkp file");
+        if (abs(recip_lattice_nnkp.e21 - GlobalC::ucell.G.e21) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error recip_lattice in *.nnkp file");
+        if (abs(recip_lattice_nnkp.e22 - GlobalC::ucell.G.e22) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error recip_lattice in *.nnkp file");
+        if (abs(recip_lattice_nnkp.e23 - GlobalC::ucell.G.e23) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error recip_lattice in *.nnkp file");
+        if (abs(recip_lattice_nnkp.e31 - GlobalC::ucell.G.e31) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error recip_lattice in *.nnkp file");
+        if (abs(recip_lattice_nnkp.e32 - GlobalC::ucell.G.e32) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error recip_lattice in *.nnkp file");
+        if (abs(recip_lattice_nnkp.e33 - GlobalC::ucell.G.e33) > 1.0e-4)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error recip_lattice in *.nnkp file");
+    }
+
+    if (ModuleBase::GlobalFunc::SCAN_BEGIN(nnkp_read, "kpoints"))
+    {
+        int numkpt_nnkp;
+        ModuleBase::GlobalFunc::READ_VALUE(nnkp_read, numkpt_nnkp);
+        if ((GlobalV::NSPIN == 1 || GlobalV::NSPIN == 4) && numkpt_nnkp != GlobalC::kv.nkstot)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error kpoints in *.nnkp file");
+        else if (GlobalV::NSPIN == 2 && numkpt_nnkp != (GlobalC::kv.nkstot / 2))
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error kpoints in *.nnkp file");
+
+        ModuleBase::Vector3<double> *kpoints_direct_nnkp = new ModuleBase::Vector3<double>[numkpt_nnkp];
+        for (int ik = 0; ik < numkpt_nnkp; ik++)
+        {
+            nnkp_read >> kpoints_direct_nnkp[ik].x >> kpoints_direct_nnkp[ik].y >> kpoints_direct_nnkp[ik].z;
+            if (abs(kpoints_direct_nnkp[ik].x - GlobalC::kv.kvec_d[ik].x) > 1.0e-4)
+                ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error kpoints in *.nnkp file");
+            if (abs(kpoints_direct_nnkp[ik].y - GlobalC::kv.kvec_d[ik].y) > 1.0e-4)
+                ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error kpoints in *.nnkp file");
+            if (abs(kpoints_direct_nnkp[ik].z - GlobalC::kv.kvec_d[ik].z) > 1.0e-4)
+                ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "Error kpoints in *.nnkp file");
+        }
+
+        delete[] kpoints_direct_nnkp;
+
+        ModuleBase::Vector3<double> my_gamma_point(0.0, 0.0, 0.0);
+        // if( (GlobalC::kv.nkstot == 1) && (GlobalC::kv.kvec_d[0] == my_gamma_point) ) gamma_only_wannier = true;
+    }
+
+    if (GlobalV::NSPIN != 4)
+    {
+        if (ModuleBase::GlobalFunc::SCAN_BEGIN(nnkp_read, "projections"))
+        {
+            ModuleBase::GlobalFunc::READ_VALUE(nnkp_read, num_wannier);
+            // test
+            // GlobalV::ofs_running << "num_wannier = " << num_wannier << std::endl;
+            // test
+            if (num_wannier < 0)
+            {
+                ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "the number of Wannier functions is less than 0");
+            }
+
+            R_centre = new ModuleBase::Vector3<double>[num_wannier];
+            L = new int[num_wannier];
+            m = new int[num_wannier];
+            rvalue = new int[num_wannier];
+            ModuleBase::Vector3<double> *z_axis = new ModuleBase::Vector3<double>[num_wannier];
+            ModuleBase::Vector3<double> *x_axis = new ModuleBase::Vector3<double>[num_wannier];
+            alfa = new double[num_wannier];
+
+            for (int count = 0; count < num_wannier; count++)
+            {
+                nnkp_read >> R_centre[count].x >> R_centre[count].y >> R_centre[count].z;
+                nnkp_read >> L[count] >> m[count];
+                ModuleBase::GlobalFunc::READ_VALUE(nnkp_read, rvalue[count]);
+                nnkp_read >> z_axis[count].x >> z_axis[count].y >> z_axis[count].z;
+                nnkp_read >> x_axis[count].x >> x_axis[count].y >> x_axis[count].z;
+                ModuleBase::GlobalFunc::READ_VALUE(nnkp_read, alfa[count]);
+            }
+        }
+    }
+    else
+    {
+        ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "noncollinear spin is not supported yet");
+    }
+
+    if (ModuleBase::GlobalFunc::SCAN_BEGIN(nnkp_read, "nnkpts"))
+    {
+        ModuleBase::GlobalFunc::READ_VALUE(nnkp_read, nntot);
+        nnlist.resize(GlobalC::kv.nkstot);
+        nncell.resize(GlobalC::kv.nkstot);
+        for (int ik = 0; ik < GlobalC::kv.nkstot; ik++)
+        {
+            nnlist[ik].resize(nntot);
+            nncell[ik].resize(nntot);
+        }
+
+        int numkpt_nnkp;
+        if (GlobalV::NSPIN == 1 || GlobalV::NSPIN == 4)
+            numkpt_nnkp = GlobalC::kv.nkstot;
+        else if (GlobalV::NSPIN == 2)
+            numkpt_nnkp = GlobalC::kv.nkstot / 2;
+        else
+            throw std::runtime_error("numkpt_nnkp uninitialized in " + ModuleBase::GlobalFunc::TO_STRING(__FILE__)
+                                     + " line " + ModuleBase::GlobalFunc::TO_STRING(__LINE__));
+
+        for (int ik = 0; ik < numkpt_nnkp; ik++)
+        {
+            for (int ib = 0; ib < nntot; ib++)
+            {
+                int ik_nnkp;
+                nnkp_read >> ik_nnkp;
+                if (ik_nnkp != (ik + 1))
+                    ModuleBase::WARNING_QUIT("toWannier90::read_nnkp", "error nnkpts in *.nnkp file");
+                nnkp_read >> nnlist[ik][ib];
+                nnkp_read >> nncell[ik][ib].x >> nncell[ik][ib].y >> nncell[ik][ib].z;
+                nnlist[ik][ib]--; // convert the 1-based index from the *.nnkp file to a 0-based C++ index
+            }
+        }
+    }
+
+    if (ModuleBase::GlobalFunc::SCAN_BEGIN(nnkp_read, "exclude_bands"))
+    {
+        ModuleBase::GlobalFunc::READ_VALUE(nnkp_read, num_exclude_bands);
+        if (num_exclude_bands > 0)
+            exclude_bands = new int[num_exclude_bands];
+        else if (num_exclude_bands < 0)
+            ModuleBase::WARNING_QUIT("toWannier90::read_nnkp",
+                                     "the number of excluded bands is invalid, please check the *.nnkp file.");
+
+        if (num_exclude_bands > 0)
+        {
+            for (int i = 0; i < num_exclude_bands; i++)
+            {
+                ModuleBase::GlobalFunc::READ_VALUE(nnkp_read, exclude_bands[i]);
+                exclude_bands[i]--; // convert the 1-based index from the *.nnkp file to a 0-based C++ index
+            }
+        }
+    }
+
+    // test by jingan
+    // GlobalV::ofs_running << "num_exclude_bands = " << num_exclude_bands << std::endl;
+    // for(int i = 0; i < num_exclude_bands; i++)
+    //{
+    //	GlobalV::ofs_running << "exclude_bands : " << exclude_bands[i] << std::endl;
+    //}
+    // test by jingan
+
+    nnkp_read.close();
+
+    for (int i = 0; i < num_wannier; i++)
+    {
+        R_centre[i] = R_centre[i] * GlobalC::ucell.latvec;
+        m[i] = m[i] - 1;
+    }
+
+    // test by jingan
+    // GlobalV::ofs_running << "num_wannier is " << num_wannier << std::endl;
+    // for(int i = 0; i < num_wannier; i++)
+    //{
+    //	GlobalV::ofs_running << "num_wannier" << std::endl;
+    //	GlobalV::ofs_running << L[i] << " " << m[i] << " " << rvalue[i] << " " << alfa[i] << std::endl;
+    //}
+    // test by jingan
+
+    tag_cal_band = new bool[GlobalV::NBANDS];
+    if (GlobalV::NBANDS <= num_exclude_bands)
+        ModuleBase::WARNING_QUIT("toWannier90::read_nnkp",
+                                 "the number of bands is not enough, please increase the number of bands.");
+    if (num_exclude_bands == 0)
+    {
+        for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+            tag_cal_band[ib] = true;
+    }
+    else
+    {
+        for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+        {
+            tag_cal_band[ib] = true;
+            for (int ibb = 0; ibb < num_exclude_bands; ibb++)
+            {
+                if (exclude_bands[ibb] == ib)
+                {
+                    tag_cal_band[ib] = false;
+                    break;
+                }
+            }
+        }
+    }
+
+    if (num_exclude_bands < 0)
+        num_bands = GlobalV::NBANDS;
+    else
+        num_bands = GlobalV::NBANDS - num_exclude_bands;
 }
 
 void toWannier90::outEIG()
 {
-	if(GlobalV::MY_RANK == 0)
-	{
-		std::string fileaddress = GlobalV::global_out_dir + wannier_file_name + ".eig";
-		std::ofstream eig_file( fileaddress.c_str() );
-		for(int ik = start_k_index; ik < (cal_num_kpts+start_k_index); ik++)
-		{
-			int index_band = 0;
-			for(int ib = 0; ib < GlobalV::NBANDS; ib++)
-			{
-				if(!tag_cal_band[ib]) continue;
-				index_band++;
-				eig_file << std::setw(5) << index_band << std::setw(5) << ik+1-start_k_index
-						 << std::setw(18) << showpoint << fixed << std::setprecision(12) 
-						 << GlobalC::wf.ekb[ik][ib] * ModuleBase::Ry_to_eV << std::endl;
-			}
-		}
-		
-		eig_file.close();
-	}
+    if (GlobalV::MY_RANK == 0)
+    {
+        std::string fileaddress = GlobalV::global_out_dir + wannier_file_name + ".eig";
+        std::ofstream eig_file(fileaddress.c_str());
+        for (int ik = start_k_index; ik < (cal_num_kpts + start_k_index); ik++)
+        {
+            int index_band = 0;
+            for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+            {
+                if (!tag_cal_band[ib])
+                    continue;
+                index_band++;
+                eig_file << std::setw(5) << index_band << std::setw(5) << ik + 1 - start_k_index << std::setw(18)
+                         << showpoint << fixed << std::setprecision(12)
+                         << GlobalC::wf.ekb[ik][ib] * ModuleBase::Ry_to_eV << std::endl;
+            }
+        }
+
+        eig_file.close();
+    }
 }
 
-
-void toWannier90::writeUNK(const psi::Psi<std::complex<double>>& wfc_pw)
+void toWannier90::writeUNK(const psi::Psi<std::complex<double>> &wfc_pw)
 {
 
-	
-	// std::complex<double> *porter = new std::complex<double>[GlobalC::wfcpw->nrxx];
-	
-	// for(int ik = start_k_index; ik < (cal_num_kpts+start_k_index); ik++)
-	// {
-	// 	std::stringstream name;
-	// 	if(GlobalV::NSPIN==1 || GlobalV::NSPIN==4)
-	// 	{
-	// 		name << GlobalV::global_out_dir << "UNK" << std::setw(5) << setfill('0') << ik+1 << ".1" ;
-	// 	}
-	// 	else if(GlobalV::NSPIN==2)
-	// 	{
-	// 		if(wannier_spin=="up") name << GlobalV::global_out_dir << "UNK" << std::setw(5) << setfill('0') << ik+1-start_k_index << ".1" ;
-	// 		else if(wannier_spin=="down") name << GlobalV::global_out_dir << "UNK" << std::setw(5) << setfill('0') << ik+1-start_k_index << ".2" ;
-	// 	}
-		
-	// 	std::ofstream unkfile(name.str());
-		
-	// 	unkfile << std::setw(12) << GlobalC::rhopw->nx << std::setw(12) << GlobalC::rhopw->ny << std::setw(12) << GlobalC::rhopw->nz << std::setw(12) << ik+1 << std::setw(12) << num_bands << std::endl;
-		
-	// 	for(int ib = 0; ib < GlobalV::NBANDS; ib++)
-	// 	{
-	// 		if(!tag_cal_band[ib]) continue;
-	// 		//std::complex<double> *porter = GlobalC::UFFT.porter;
-	// 		//  u_k in real space
-	// 		ModuleBase::GlobalFunc::ZEROS(porter, GlobalC::rhopw->nrxx);
-	// 		for (int ig = 0; ig < GlobalC::kv.ngk[ik]; ig++)
-	// 		{
-	// 			porter[GlobalC::sf.ig2fftw[GlobalC::wf.igk(ik, ig)]] = wfc_pw[ik](ib, ig);
-	// 		}
-	// 		GlobalC::sf.FFT_wfc.FFT3D(porter, 1);
-			
-	// 		for(int k=0; k<GlobalC::rhopw->nz; k++)
-	// 		{
-	// 			for(int j=0; j<GlobalC::rhopw->ny; j++)
-	// 			{
-	// 				for(int i=0; i<GlobalC::rhopw->nx; i++)
-	// 				{
-	// 					if(!gamma_only_wannier)
-	// 					{
-	// 						unkfile << std::setw(20) << std::setprecision(9) << std::setiosflags(ios::scientific) << porter[i*GlobalC::rhopw->ny*GlobalC::rhopw->nz + j*GlobalC::rhopw->nz + k].real()
-	// 								<< std::setw(20) << std::setprecision(9) << std::setiosflags(ios::scientific) << porter[i*GlobalC::rhopw->ny*GlobalC::rhopw->nz + j*GlobalC::rhopw->nz + k].imag() 
-	// 								//jingan test
-	// 								//<< "       " << std::setw(12) << std::setprecision(9) << std::setiosflags(ios::scientific) << abs(porter[i*GlobalC::rhopw->ny*GlobalC::rhopw->nz + j*GlobalC::rhopw->nz + k])
-	// 								<< std::endl;
-	// 					}
-	// 					else
-	// 					{
-	// 						double zero = 0.0;
-	// 						unkfile << std::setw(20) << std::setprecision(9) << std::setiosflags(ios::scientific) << abs( porter[i*GlobalC::rhopw->ny*GlobalC::rhopw->nz + j*GlobalC::rhopw->nz + k] )
-	// 								<< std::setw(20) << std::setprecision(9) << std::setiosflags(ios::scientific) << zero
-	// 								//jingan test
-	// 								//<< "       " << std::setw(12) << std::setprecision(9) << std::setiosflags(ios::scientific) << abs(porter[i*GlobalC::rhopw->ny*GlobalC::rhopw->nz + j*GlobalC::rhopw->nz + k])
-	// 								<< std::endl;
-	// 					}
-	// 				}
-	// 			}
-	// 		}
-			
-			
-	// 	}
-
-		
-	// 	unkfile.close();
-		
-	// }
-	
-	// delete[] porter;
-
+/*
+    std::complex<double> *porter = new std::complex<double>[GlobalC::wfcpw->nrxx];
+
+    for(int ik = start_k_index; ik < (cal_num_kpts+start_k_index); ik++)
+    {
+        std::stringstream name;
+        if(GlobalV::NSPIN==1 || GlobalV::NSPIN==4)
+        {
+            name << GlobalV::global_out_dir << "UNK" << std::setw(5) << setfill('0') << ik+1 << ".1" ;
+        }
+        else if(GlobalV::NSPIN==2)
+        {
+            if(wannier_spin=="up") name << GlobalV::global_out_dir << "UNK" << std::setw(5) << setfill('0') <<
+   ik+1-start_k_index << ".1" ; else if(wannier_spin=="down") name << GlobalV::global_out_dir << "UNK" << std::setw(5)
+   << setfill('0') << ik+1-start_k_index << ".2" ;
+        }
+
+        std::ofstream unkfile(name.str());
+
+        unkfile << std::setw(12) << GlobalC::rhopw->nx << std::setw(12) << GlobalC::rhopw->ny << std::setw(12) <<
+   GlobalC::rhopw->nz << std::setw(12) << ik+1 << std::setw(12) << num_bands << std::endl;
+
+        for(int ib = 0; ib < GlobalV::NBANDS; ib++)
+        {
+            if(!tag_cal_band[ib]) continue;
+            //std::complex<double> *porter = GlobalC::UFFT.porter;
+            //  u_k in real space
+            ModuleBase::GlobalFunc::ZEROS(porter, GlobalC::rhopw->nrxx);
+            for (int ig = 0; ig < GlobalC::kv.ngk[ik]; ig++)
+            {
+                porter[GlobalC::sf.ig2fftw[GlobalC::wf.igk(ik, ig)]] = wfc_pw[ik](ib, ig);
+            }
+            GlobalC::sf.FFT_wfc.FFT3D(porter, 1);
+
+            for(int k=0; k<GlobalC::rhopw->nz; k++)
+            {
+                for(int j=0; j<GlobalC::rhopw->ny; j++)
+                {
+                    for(int i=0; i<GlobalC::rhopw->nx; i++)
+                    {
+                        if(!gamma_only_wannier)
+                        {
+                            unkfile << std::setw(20) << std::setprecision(9) << std::setiosflags(ios::scientific) <<
+   porter[i*GlobalC::rhopw->ny*GlobalC::rhopw->nz + j*GlobalC::rhopw->nz + k].real()
+                                    << std::setw(20) << std::setprecision(9) << std::setiosflags(ios::scientific) <<
+   porter[i*GlobalC::rhopw->ny*GlobalC::rhopw->nz + j*GlobalC::rhopw->nz + k].imag()
+                                    //jingan test
+                                    //<< "       " << std::setw(12) << std::setprecision(9) <<
+   std::setiosflags(ios::scientific) << abs(porter[i*GlobalC::rhopw->ny*GlobalC::rhopw->nz + j*GlobalC::rhopw->nz + k])
+                                    << std::endl;
+                        }
+                        else
+                        {
+                            double zero = 0.0;
+                            unkfile << std::setw(20) << std::setprecision(9) << std::setiosflags(ios::scientific) <<
+   abs( porter[i*GlobalC::rhopw->ny*GlobalC::rhopw->nz + j*GlobalC::rhopw->nz + k] )
+                                    << std::setw(20) << std::setprecision(9) << std::setiosflags(ios::scientific) <<
+   zero
+                                    //jingan test
+                                    //<< "       " << std::setw(12) << std::setprecision(9) <<
+   std::setiosflags(ios::scientific) << abs(porter[i*GlobalC::rhopw->ny*GlobalC::rhopw->nz + j*GlobalC::rhopw->nz + k])
+                                    << std::endl;
+                        }
+                    }
+                }
+            }
+
+
+        }
+
+
+        unkfile.close();
+
+    }
+
+    delete[] porter;
+*/
 #ifdef __MPI
-	// num_z: how many planes on processor 'ip'
-	int *num_z = new int[GlobalV::NPROC_IN_POOL];
-	ModuleBase::GlobalFunc::ZEROS(num_z, GlobalV::NPROC_IN_POOL);
-	for (int iz=0;iz<GlobalC::bigpw->nbz;iz++)
-	{
-		int ip = iz % GlobalV::NPROC_IN_POOL;
-		num_z[ip] += GlobalC::bigpw->bz;
-	}	
-
-	// start_z: start position of z in 
-	// processor ip.
-	int *start_z = new int[GlobalV::NPROC_IN_POOL];
-	ModuleBase::GlobalFunc::ZEROS(start_z, GlobalV::NPROC_IN_POOL);
-	for (int ip=1;ip<GlobalV::NPROC_IN_POOL;ip++)
-	{
-		start_z[ip] = start_z[ip-1]+num_z[ip-1];
-	}	
-
-	// which_ip: found iz belongs to which ip.
-	int *which_ip = new int[GlobalC::wfcpw->nz];
-	ModuleBase::GlobalFunc::ZEROS(which_ip, GlobalC::wfcpw->nz);
-	for(int iz=0; iz<GlobalC::wfcpw->nz; iz++)
-	{
-		for(int ip=0; ip<GlobalV::NPROC_IN_POOL; ip++)
-		{
-			if(iz>=start_z[GlobalV::NPROC_IN_POOL-1]) 
-			{
-				which_ip[iz] = GlobalV::NPROC_IN_POOL-1;
-				break;
-			}
-			else if(iz>=start_z[ip] && iz<start_z[ip+1])
-			{
-				which_ip[iz] = ip;
-				break;
-			}
-		}
-	}
-	
-	
-	// only do in the first pool.
-	std::complex<double> *porter = new std::complex<double>[GlobalC::wfcpw->nrxx];
-	int nxy = GlobalC::wfcpw->nx * GlobalC::wfcpw->ny;
-	std::complex<double> *zpiece = new std::complex<double>[nxy];
-	
-	if(GlobalV::MY_POOL==0)
-	{
-		for(int ik = start_k_index; ik < (cal_num_kpts+start_k_index); ik++)
-		{
-			std::ofstream unkfile;
-			
-			if(GlobalV::MY_RANK == 0)
-			{
-				std::stringstream name;
-				if(GlobalV::NSPIN==1 || GlobalV::NSPIN==4)
-				{
-					name << GlobalV::global_out_dir << "UNK" << std::setw(5) << setfill('0') << ik+1 << ".1" ;
-				}
-				else if(GlobalV::NSPIN==2)
-				{
-					if(wannier_spin=="up") name << GlobalV::global_out_dir << "UNK" << std::setw(5) << setfill('0') << ik+1-start_k_index << ".1" ;
-					else if(wannier_spin=="down") name << GlobalV::global_out_dir << "UNK" << std::setw(5) << setfill('0') << ik+1-start_k_index << ".2" ;
-				}
-				
-				unkfile.open(name.str(),ios::out);
-				
-				unkfile << std::setw(12) << GlobalC::wfcpw->nx << std::setw(12) << GlobalC::wfcpw->ny << std::setw(12) << GlobalC::wfcpw->nz << std::setw(12) << ik+1 << std::setw(12) << num_bands << std::endl;
-			}
-			
-			for(int ib = 0; ib < GlobalV::NBANDS; ib++)
-			{
-				if(!tag_cal_band[ib]) continue;
-				
-				GlobalC::wfcpw->recip2real(&wfc_pw(ik, ib, 0), porter, ik);
-
-				// save the rho one z by one z.
-				for(int iz=0; iz<GlobalC::wfcpw->nz; iz++)
-				{
-					// tag must be different for different iz.
-					ModuleBase::GlobalFunc::ZEROS(zpiece, nxy);
-					int tag = iz;
-					MPI_Status ierror;
-
-					// case 1: the first part of rho in processor 0.
-					if(which_ip[iz] == 0 && GlobalV::RANK_IN_POOL ==0)
-					{
-						for(int ir=0; ir<nxy; ir++)
-						{
-							zpiece[ir] = porter[ir*GlobalC::wfcpw->nplane+iz-GlobalC::wfcpw->startz_current];
-						}
-					}
-					// case 2: > first part rho: send the rho to 
-					// processor 0.
-					else if(which_ip[iz] == GlobalV::RANK_IN_POOL )
-					{
-						for(int ir=0; ir<nxy; ir++)
-						{
-							zpiece[ir] = porter[ir*GlobalC::wfcpw->nplane+iz-GlobalC::wfcpw->startz_current];
-						}
-						MPI_Send(zpiece, nxy, MPI_DOUBLE_COMPLEX, 0, tag, POOL_WORLD);
-					}
-
-					// case 2: > first part rho: processor 0 receive the rho
-					// from other processors
-					else if(GlobalV::RANK_IN_POOL==0)
-					{
-						MPI_Recv(zpiece, nxy, MPI_DOUBLE_COMPLEX, which_ip[iz], tag, POOL_WORLD, &ierror);
-					}
-
-					// write data	
-					if(GlobalV::MY_RANK==0)
-					{
-						for(int iy=0; iy<GlobalC::wfcpw->ny; iy++)
-						{
-							for(int ix=0; ix<GlobalC::wfcpw->nx; ix++)
-							{
-								unkfile << std::setw(20) << std::setprecision(9) << std::setiosflags(ios::scientific) << zpiece[ix*GlobalC::wfcpw->ny+iy].real()
-										<< std::setw(20) << std::setprecision(9) << std::setiosflags(ios::scientific) << zpiece[ix*GlobalC::wfcpw->ny+iy].imag() 
-										<< std::endl;
-							}
-						}
-					}
-				}// end iz
-				MPI_Barrier(POOL_WORLD);
-			}
-			
-			if(GlobalV::MY_RANK == 0)
-			{
-				unkfile.close();
-			}
-		
-		}
-	}
-	MPI_Barrier(MPI_COMM_WORLD);
-	
-	delete[] num_z;
-	delete[] start_z;
-	delete[] which_ip;
-	delete[] porter;
-	delete[] zpiece;
-
-#endif	
-	
-}
-
-
-
-
-
+    // num_z: how many planes on processor 'ip'
+    int *num_z = new int[GlobalV::NPROC_IN_POOL];
+    ModuleBase::GlobalFunc::ZEROS(num_z, GlobalV::NPROC_IN_POOL);
+    for (int iz = 0; iz < GlobalC::bigpw->nbz; iz++)
+    {
+        int ip = iz % GlobalV::NPROC_IN_POOL;
+        num_z[ip] += GlobalC::bigpw->bz;
+    }
+
+    // start_z: start position of z in
+    // processor ip.
+    int *start_z = new int[GlobalV::NPROC_IN_POOL];
+    ModuleBase::GlobalFunc::ZEROS(start_z, GlobalV::NPROC_IN_POOL);
+    for (int ip = 1; ip < GlobalV::NPROC_IN_POOL; ip++)
+    {
+        start_z[ip] = start_z[ip - 1] + num_z[ip - 1];
+    }
+
+    // which_ip: the processor ip to which plane iz belongs.
+    int *which_ip = new int[GlobalC::wfcpw->nz];
+    ModuleBase::GlobalFunc::ZEROS(which_ip, GlobalC::wfcpw->nz);
+    for (int iz = 0; iz < GlobalC::wfcpw->nz; iz++)
+    {
+        for (int ip = 0; ip < GlobalV::NPROC_IN_POOL; ip++)
+        {
+            if (iz >= start_z[GlobalV::NPROC_IN_POOL - 1])
+            {
+                which_ip[iz] = GlobalV::NPROC_IN_POOL - 1;
+                break;
+            }
+            else if (iz >= start_z[ip] && iz < start_z[ip + 1])
+            {
+                which_ip[iz] = ip;
+                break;
+            }
+        }
+    }
+
+    // only do in the first pool.
+    std::complex<double> *porter = new std::complex<double>[GlobalC::wfcpw->nrxx];
+    int nxy = GlobalC::wfcpw->nx * GlobalC::wfcpw->ny;
+    std::complex<double> *zpiece = new std::complex<double>[nxy];
+
+    if (GlobalV::MY_POOL == 0)
+    {
+        for (int ik = start_k_index; ik < (cal_num_kpts + start_k_index); ik++)
+        {
+            std::ofstream unkfile;
+
+            if (GlobalV::MY_RANK == 0)
+            {
+                std::stringstream name;
+                if (GlobalV::NSPIN == 1 || GlobalV::NSPIN == 4)
+                {
+                    name << GlobalV::global_out_dir << "UNK" << std::setw(5) << setfill('0') << ik + 1 << ".1";
+                }
+                else if (GlobalV::NSPIN == 2)
+                {
+                    if (wannier_spin == "up")
+                        name << GlobalV::global_out_dir << "UNK" << std::setw(5) << setfill('0')
+                             << ik + 1 - start_k_index << ".1";
+                    else if (wannier_spin == "down")
+                        name << GlobalV::global_out_dir << "UNK" << std::setw(5) << setfill('0')
+                             << ik + 1 - start_k_index << ".2";
+                }
+
+                unkfile.open(name.str(), ios::out);
+
+                unkfile << std::setw(12) << GlobalC::wfcpw->nx << std::setw(12) << GlobalC::wfcpw->ny << std::setw(12)
+                        << GlobalC::wfcpw->nz << std::setw(12) << ik + 1 << std::setw(12) << num_bands << std::endl;
+            }
+
+            for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+            {
+                if (!tag_cal_band[ib])
+                    continue;
+
+                GlobalC::wfcpw->recip2real(&wfc_pw(ik, ib, 0), porter, ik);
+
+                // save the rho one z by one z.
+                for (int iz = 0; iz < GlobalC::wfcpw->nz; iz++)
+                {
+                    // tag must be different for different iz.
+                    ModuleBase::GlobalFunc::ZEROS(zpiece, nxy);
+                    int tag = iz;
+                    MPI_Status ierror;
+
+                    // case 1: the first part of rho in processor 0.
+                    if (which_ip[iz] == 0 && GlobalV::RANK_IN_POOL == 0)
+                    {
+                        for (int ir = 0; ir < nxy; ir++)
+                        {
+                            zpiece[ir] = porter[ir * GlobalC::wfcpw->nplane + iz - GlobalC::wfcpw->startz_current];
+                        }
+                    }
+                    // case 2: > first part rho: send the rho to
+                    // processor 0.
+                    else if (which_ip[iz] == GlobalV::RANK_IN_POOL)
+                    {
+                        for (int ir = 0; ir < nxy; ir++)
+                        {
+                            zpiece[ir] = porter[ir * GlobalC::wfcpw->nplane + iz - GlobalC::wfcpw->startz_current];
+                        }
+                        MPI_Send(zpiece, nxy, MPI_DOUBLE_COMPLEX, 0, tag, POOL_WORLD);
+                    }
+
+                    // case 3: > first part rho: processor 0 receives the rho
+                    // from the other processors
+                    else if (GlobalV::RANK_IN_POOL == 0)
+                    {
+                        MPI_Recv(zpiece, nxy, MPI_DOUBLE_COMPLEX, which_ip[iz], tag, POOL_WORLD, &ierror);
+                    }
+
+                    // write data
+                    if (GlobalV::MY_RANK == 0)
+                    {
+                        for (int iy = 0; iy < GlobalC::wfcpw->ny; iy++)
+                        {
+                            for (int ix = 0; ix < GlobalC::wfcpw->nx; ix++)
+                            {
+                                unkfile << std::setw(20) << std::setprecision(9) << std::setiosflags(ios::scientific)
+                                        << zpiece[ix * GlobalC::wfcpw->ny + iy].real() << std::setw(20)
+                                        << std::setprecision(9) << std::setiosflags(ios::scientific)
+                                        << zpiece[ix * GlobalC::wfcpw->ny + iy].imag() << std::endl;
+                            }
+                        }
+                    }
+                } // end iz
+                MPI_Barrier(POOL_WORLD);
+            }
+
+            if (GlobalV::MY_RANK == 0)
+            {
+                unkfile.close();
+            }
+        }
+    }
+    MPI_Barrier(MPI_COMM_WORLD);
+
+    delete[] num_z;
+    delete[] start_z;
+    delete[] which_ip;
+    delete[] porter;
+    delete[] zpiece;
 
+#endif
+}
 
-void toWannier90::cal_Amn(const psi::Psi<std::complex<double>>& wfc_pw)
+void toWannier90::cal_Amn(const psi::Psi<std::complex<double>> &wfc_pw)
 {
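-	// Step 1: build the table (matrix) of the real spherical harmonics lm in the plane-wave basis at a given k-point
-	// Step 2: project the radial part of the trial orbitals onto the plane waves at a given k-point
-	// Step 3: obtain the projection of the trial orbitals in the plane-wave basis at a given k-point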
-	const int pwNumberMax = GlobalC::wf.npwx;
-	
-	std::ofstream Amn_file;
-	
-	if(GlobalV::MY_RANK == 0)
-	{
-		time_t  time_now = time(NULL);
-		std::string fileaddress = GlobalV::global_out_dir + wannier_file_name + ".amn";
-		Amn_file.open( fileaddress.c_str() , ios::out);
-		Amn_file << " Created on " << ctime(&time_now);
-		Amn_file << std::setw(12) << num_bands << std::setw(12) << cal_num_kpts << std::setw(12) << num_wannier << std::endl;
-	}
-	
-	ModuleBase::ComplexMatrix *trial_orbitals = new ModuleBase::ComplexMatrix[cal_num_kpts];
-	for(int ik = 0; ik < cal_num_kpts; ik++)
-	{
-		trial_orbitals[ik].create(num_wannier,pwNumberMax);
-		produce_trial_in_pw(ik,trial_orbitals[ik]);
-	}	
-	
-	// test by jingan
-	//GlobalV::ofs_running << __FILE__ << __LINE__ << "start_k_index = " << start_k_index << "  cal_num_kpts = " << cal_num_kpts << std::endl;
-	// test by jingan
-
-	for(int ik = start_k_index; ik < (cal_num_kpts+start_k_index); ik++)
-	{
-		for(int iw = 0; iw < num_wannier; iw++)
-		{
-			int index_band = 0;
-			for(int ib = 0; ib < GlobalV::NBANDS; ib++)
-			{
-				if(!tag_cal_band[ib]) continue;
-				index_band++;
-				std::complex<double> amn(0.0,0.0);
-				std::complex<double> amn_tem(0.0,0.0);
-				for(int ig = 0; ig < pwNumberMax; ig++)
-				{
-					int cal_ik = ik - start_k_index;
-					amn_tem = amn_tem + conj( wfc_pw(ik,ib,ig) ) * trial_orbitals[cal_ik](iw,ig);
-				}
+    const int pwNumberMax = GlobalC::wf.npwx;
+
+    std::ofstream Amn_file;
+
+    if (GlobalV::MY_RANK == 0)
+    {
+        time_t time_now = time(NULL);
+        std::string fileaddress = GlobalV::global_out_dir + wannier_file_name + ".amn";
+        Amn_file.open(fileaddress.c_str(), ios::out);
+        Amn_file << " Created on " << ctime(&time_now);
+        Amn_file << std::setw(12) << num_bands << std::setw(12) << cal_num_kpts << std::setw(12) << num_wannier
+                 << std::endl;
+    }
+
+    ModuleBase::ComplexMatrix *trial_orbitals = new ModuleBase::ComplexMatrix[cal_num_kpts];
+    for (int ik = 0; ik < cal_num_kpts; ik++)
+    {
+        trial_orbitals[ik].create(num_wannier, pwNumberMax);
+        produce_trial_in_pw(ik, trial_orbitals[ik]);
+    }
+
+    // test by jingan
+    // GlobalV::ofs_running << __FILE__ << __LINE__ << "start_k_index = " << start_k_index << "  cal_num_kpts = " <<
+    // cal_num_kpts << std::endl;
+    // test by jingan
+
+    for (int ik = start_k_index; ik < (cal_num_kpts + start_k_index); ik++)
+    {
+        for (int iw = 0; iw < num_wannier; iw++)
+        {
+            int index_band = 0;
+            for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+            {
+                if (!tag_cal_band[ib])
+                    continue;
+                index_band++;
+                std::complex<double> amn(0.0, 0.0);
+                std::complex<double> amn_tem(0.0, 0.0);
+                for (int ig = 0; ig < pwNumberMax; ig++)
+                {
+                    int cal_ik = ik - start_k_index;
+                    amn_tem = amn_tem + conj(wfc_pw(ik, ib, ig)) * trial_orbitals[cal_ik](iw, ig);
+                }
 #ifdef __MPI
-				MPI_Allreduce(&amn_tem , &amn , 1, MPI_DOUBLE_COMPLEX , MPI_SUM , POOL_WORLD);
+                MPI_Allreduce(&amn_tem, &amn, 1, MPI_DOUBLE_COMPLEX, MPI_SUM, POOL_WORLD);
 #else
-				amn=amn_tem;
+                amn = amn_tem;
 #endif
-				if(GlobalV::MY_RANK == 0)
-				{
-					Amn_file << std::setw(5) << index_band << std::setw(5) << iw+1 << std::setw(5) << ik+1-start_k_index 
-							 << std::setw(18) << showpoint << fixed << std::setprecision(12) << amn.real() 
-							 << std::setw(18) << showpoint << fixed << std::setprecision(12) << amn.imag()
-							 //jingan test
-							 //<< "   " << std::setw(18) << std::setprecision(13) << abs(amn)
-							 << std::endl;
-				}
-			}
-		}
-	}
-	
-
-	
-	if(GlobalV::MY_RANK == 0) Amn_file.close();
-	
-	delete[] trial_orbitals;
-	
+                if (GlobalV::MY_RANK == 0)
+                {
+                    Amn_file << std::setw(5) << index_band << std::setw(5) << iw + 1 << std::setw(5)
+                             << ik + 1 - start_k_index << std::setw(18) << showpoint << fixed << std::setprecision(12)
+                             << amn.real() << std::setw(18) << showpoint << fixed << std::setprecision(12)
+                             << amn.imag()
+                             // jingan test
+                             //<< "   " << std::setw(18) << std::setprecision(13) << abs(amn)
+                             << std::endl;
+                }
+            }
+        }
+    }
+
+    if (GlobalV::MY_RANK == 0)
+        Amn_file.close();
+
+    delete[] trial_orbitals;
 }
 
-
-
-void toWannier90::cal_Mmn(const psi::Psi<std::complex<double>>& wfc_pw)
-{	
-	// test by jingan
-	//GlobalV::ofs_running << __FILE__ << __LINE__ << " cal_num_kpts = " << cal_num_kpts << std::endl;
-	// test by jingan
-	
-	std::ofstream mmn_file;
-	
-	if(GlobalV::MY_RANK == 0)
-	{
-		std::string fileaddress = GlobalV::global_out_dir + wannier_file_name + ".mmn";
-		mmn_file.open( fileaddress.c_str() , ios::out);	
-		
-		time_t  time_now = time(NULL);
-		mmn_file << " Created on " << ctime(&time_now);
-		mmn_file << std::setw(12) << num_bands << std::setw(12) << cal_num_kpts << std::setw(12) << nntot << std::endl;
-	}
-	
-	/*
-	ModuleBase::ComplexMatrix Mmn(GlobalV::NBANDS,GlobalV::NBANDS);
-	if(gamma_only_wannier)
-	{
-		for(int ib = 0; ib < nntot; ib++)
-		{
-			ModuleBase::Vector3<double> phase_G = nncell[0][ib];
-			for(int m = 0; m < GlobalV::NBANDS; m++)
-			{
-				if(!tag_cal_band[m]) continue;
-				for(int n = 0; n <= m; n++)
-				{
-					if(!tag_cal_band[n]) continue;
-					std::complex<double> mmn_tem = gamma_only_cal(m,n,wfc_pw,phase_G);
-					Mmn(m,n) = mmn_tem;
-					if(m!=n) Mmn(n,m) = Mmn(m,n);				
-				}
-			}
-		}
-	}
-	*/
-	
-	for(int ik = 0; ik < cal_num_kpts; ik++)
-	{
-		for(int ib = 0; ib < nntot; ib++)
-		{
-			int ikb = nnlist[ik][ib];             // ik+b : the neighboring k-point of ik
-			
-			ModuleBase::Vector3<double> phase_G = nncell[ik][ib];
-			
-			if(GlobalV::MY_RANK == 0)
-			{
-				mmn_file << std::setw(5) << ik+1 << std::setw(5) << ikb+1 << std::setw(5) 
-						 << int(phase_G.x) << std::setw(5) << int(phase_G.y) << std::setw(5) << int(phase_G.z) 
-						 << std::endl;
-			}
-		
-			for(int m = 0; m < GlobalV::NBANDS; m++)
-			{
-				if(!tag_cal_band[m]) continue;
-				for(int n = 0; n < GlobalV::NBANDS; n++)
-				{
-					if(!tag_cal_band[n]) continue;
-					std::complex<double> mmn(0.0,0.0);
-				
-					if(!gamma_only_wannier)
-					{
-						int cal_ik = ik + start_k_index;
-						int cal_ikb = ikb + start_k_index;												
-						// test by jingan
-						//GlobalV::ofs_running << __FILE__ << __LINE__ << "cal_ik = " << cal_ik << "cal_ikb = " << cal_ikb << std::endl;
-						// test by jingan
-						//std::complex<double> *unk_L_r = new std::complex<double>[GlobalC::wfcpw->nrxx];
-						//ToRealSpace(cal_ik,n,wfc_pw,unk_L_r,phase_G);				
-						//mmn = unkdotb(unk_L_r,cal_ikb,m,wfc_pw);
-						mmn = unkdotkb(cal_ik,cal_ikb,n,m,phase_G,wfc_pw);
-						//delete[] unk_L_r;
-					}
-					else
-					{
-						//GlobalV::ofs_running << "gamma only test" << std::endl;
-						//mmn = Mmn(n,m);
-					}
-					
-					if(GlobalV::MY_RANK == 0)
-					{
-						mmn_file << std::setw(18) << std::setprecision(12) << showpoint << fixed << mmn.real() 
-								 << std::setw(18) << std::setprecision(12) << showpoint << fixed << mmn.imag()
-								 // jingan test
-								 //<< "    " << std::setw(12) << std::setprecision(9) << abs(mmn)
-								 << std::endl;				
-					}
-				}
-			}
-		}
-	
-	}
-	
-	if(GlobalV::MY_RANK == 0) mmn_file.close();
-	
+void toWannier90::cal_Mmn(const psi::Psi<std::complex<double>> &wfc_pw)
+{
+    // test by jingan
+    // GlobalV::ofs_running << __FILE__ << __LINE__ << " cal_num_kpts = " << cal_num_kpts << std::endl;
+    // test by jingan
+
+    std::ofstream mmn_file;
+
+    if (GlobalV::MY_RANK == 0)
+    {
+        std::string fileaddress = GlobalV::global_out_dir + wannier_file_name + ".mmn";
+        mmn_file.open(fileaddress.c_str(), ios::out);
+
+        time_t time_now = time(NULL);
+        mmn_file << " Created on " << ctime(&time_now);
+        mmn_file << std::setw(12) << num_bands << std::setw(12) << cal_num_kpts << std::setw(12) << nntot << std::endl;
+    }
+
+    /*
+    ModuleBase::ComplexMatrix Mmn(GlobalV::NBANDS,GlobalV::NBANDS);
+    if(gamma_only_wannier)
+    {
+        for(int ib = 0; ib < nntot; ib++)
+        {
+            ModuleBase::Vector3<double> phase_G = nncell[0][ib];
+            for(int m = 0; m < GlobalV::NBANDS; m++)
+            {
+                if(!tag_cal_band[m]) continue;
+                for(int n = 0; n <= m; n++)
+                {
+                    if(!tag_cal_band[n]) continue;
+                    std::complex<double> mmn_tem = gamma_only_cal(m,n,wfc_pw,phase_G);
+                    Mmn(m,n) = mmn_tem;
+                    if(m!=n) Mmn(n,m) = Mmn(m,n);
+                }
+            }
+        }
+    }
+    */
+
+    for (int ik = 0; ik < cal_num_kpts; ik++)
+    {
+        for (int ib = 0; ib < nntot; ib++)
+        {
+            int ikb = nnlist[ik][ib];
+
+            ModuleBase::Vector3<double> phase_G = nncell[ik][ib];
+
+            if (GlobalV::MY_RANK == 0)
+            {
+                mmn_file << std::setw(5) << ik + 1 << std::setw(5) << ikb + 1 << std::setw(5) << int(phase_G.x)
+                         << std::setw(5) << int(phase_G.y) << std::setw(5) << int(phase_G.z) << std::endl;
+            }
+
+            for (int m = 0; m < GlobalV::NBANDS; m++)
+            {
+                if (!tag_cal_band[m])
+                    continue;
+                for (int n = 0; n < GlobalV::NBANDS; n++)
+                {
+                    if (!tag_cal_band[n])
+                        continue;
+                    std::complex<double> mmn(0.0, 0.0);
+
+                    if (!gamma_only_wannier)
+                    {
+                        int cal_ik = ik + start_k_index;
+                        int cal_ikb = ikb + start_k_index;
+                        // test by jingan
+                        // GlobalV::ofs_running << __FILE__ << __LINE__ << "cal_ik = " << cal_ik << "cal_ikb = " <<
+                        // cal_ikb << std::endl;
+                        // test by jingan
+                        // std::complex<double> *unk_L_r = new std::complex<double>[GlobalC::wfcpw->nrxx];
+                        // ToRealSpace(cal_ik,n,wfc_pw,unk_L_r,phase_G);
+                        // mmn = unkdotb(unk_L_r,cal_ikb,m,wfc_pw);
+                        mmn = unkdotkb(cal_ik, cal_ikb, n, m, phase_G, wfc_pw);
+                        // delete[] unk_L_r;
+                    }
+                    else
+                    {
+                        // GlobalV::ofs_running << "gamma only test" << std::endl;
+                        // mmn = Mmn(n,m);
+                    }
+
+                    if (GlobalV::MY_RANK == 0)
+                    {
+                        mmn_file << std::setw(18) << std::setprecision(12) << showpoint << fixed << mmn.real()
+                                 << std::setw(18) << std::setprecision(12) << showpoint << fixed
+                                 << mmn.imag()
+                                 // jingan test
+                                 //<< "    " << std::setw(12) << std::setprecision(9) << abs(mmn)
+                                 << std::endl;
+                    }
+                }
+            }
+        }
+    }
+
+    if (GlobalV::MY_RANK == 0)
+        mmn_file.close();
 }
 
-
 void toWannier90::produce_trial_in_pw(const int &ik, ModuleBase::ComplexMatrix &trial_orbitals_k)
 {
-	// check whether the input parameters are correct
-	for(int i =0; i < num_wannier; i++)
-	{
-		if(L[i] < -5 || L[i] > 3) std::cout << "toWannier90::produce_trial_in_pw() your L angular momentum is wrong , please check !!! " << std::endl;
-	
-		if(L[i] >= 0) 
-		{
-			if(m[i] < 0 || m[i] > 2*L[i]) std::cout << "toWannier90::produce_trial_in_pw() your m momentum is wrong , please check !!! " << std::endl;
-		}
-		else
-		{
-			if(m[i] < 0 || m[i] > -L[i]) std::cout << "toWannier90::produce_trial_in_pw() your m momentum is wrong , please check !!! " << std::endl;
-		
-		}
-	}
-	
-	const int npw = GlobalC::kv.ngk[ik];
-	const int npwx = GlobalC::wf.npwx;
-	const int total_lm = 16;
-	ModuleBase::matrix ylm(total_lm,npw);               // spherical harmonics of all types
-	//matrix wannier_ylm(num_wannier,npw);    // spherical harmonics used by the trial orbitals
-	double bs2, bs3, bs6, bs12;
-	bs2 = 1.0/sqrt(2.0);
-	bs3 = 1.0/sqrt(3.0);
-	bs6 = 1.0/sqrt(6.0);
-	bs12 = 1.0/sqrt(12.0);
-	
-	ModuleBase::Vector3<double> *gk = new ModuleBase::Vector3<double>[npw];
-	for(int ig = 0; ig < npw; ig++)
-	{
-		gk[ig] = GlobalC::wf.get_1qvec_cartesian(ik, ig);  // k+G vector
-	}
-	
-	ModuleBase::YlmReal::Ylm_Real(total_lm, npw, gk, ylm);
-	
-	// test by jingan
-	//GlobalV::ofs_running << "the mathzone::ylm_real is successful!" << std::endl;
-	//GlobalV::ofs_running << "produce_trial_in_pw: num_wannier is " << num_wannier << std::endl;
-	// test by jingan
-	
-	
-	// 1. project the radial functions onto the plane-wave basis at a given k-point
-	const int mesh_r = 333; 		// number of grid points needed for the radial function
-	const double dx = 0.025; 		// fixed step, used to generate non-uniformly spaced dr for better accuracy
-	const double x_min = -6.0;  	// used to compute the starting point of dr and r
-	ModuleBase::matrix r(num_wannier,mesh_r);   // radial grid r for each alfa
-	ModuleBase::matrix dr(num_wannier,mesh_r);  // spacing of each r point for each alfa
-	ModuleBase::matrix psi(num_wannier,mesh_r); // radial function psi in real space
-	ModuleBase::matrix psir(num_wannier,mesh_r);// psi * r in real space
-	ModuleBase::matrix psik(num_wannier,npw);   // projection of the radial function onto reciprocal space at a given k-point
-	
-	// compute r, dr
-	for(int i = 0; i < num_wannier; i++)
-	{
-		double x = 0;
-		for(int ir = 0; ir < mesh_r; ir++)
-		{
-			x = x_min + ir * dx;
-			r(i,ir) = exp(x) / alfa[i];
-			dr(i,ir) = dx * r(i,ir);
-		}
-		
-	}
-	
-	// compute psi
-	for(int i = 0; i < num_wannier; i++)
-	{
-		double alfa32 = pow(alfa[i],3.0/2.0);
-		double alfa_new = alfa[i];
-		int wannier_index = i;
-		
-		if(rvalue[i] == 1)
-		{
-			for(int ir = 0; ir < mesh_r; ir++)
-			{
-				psi(wannier_index,ir) = 2.0 * alfa32 * exp( -alfa_new * r(wannier_index,ir) );
-			}
-		}
-	
-		if(rvalue[i] == 2)
-		{
-			for(int ir = 0; ir < mesh_r; ir++)
-			{
-				psi(wannier_index,ir) = 1.0/sqrt(8.0) * alfa32
-										* (2.0 - alfa_new * r(wannier_index,ir))
-										* exp( -alfa_new * r(wannier_index,ir) * 0.5 );
-			}
-		}
-
-		if(rvalue[i] == 3)
-		{
-			for(int ir = 0; ir < mesh_r; ir++)
-			{
-				psi(wannier_index,ir) = sqrt(4.0/27.0) * alfa32
-										* ( 1.0 - 2.0/3.0 * alfa_new * r(wannier_index,ir) + 2.0/27.0 * pow(alfa_new,2.0) * r(wannier_index,ir) * r(wannier_index,ir) )
-										* exp( -alfa_new * r(wannier_index,ir) * 1.0/3.0 );
-			}
-		}
-		
-	}
-
-	// compute psir
-	for(int i = 0; i < num_wannier; i++)
-	{
-		for(int ir = 0; ir < mesh_r; ir++)
-		{
-			psir(i,ir) = psi(i,ir) * r(i,ir);
-		}
-	}
-	
-	
-	// compute the trial orbitals
-	for(int wannier_index = 0; wannier_index < num_wannier; wannier_index++)
-	{
-		if(L[wannier_index] >= 0)
-		{
-			get_trial_orbitals_lm_k(wannier_index, L[wannier_index], m[wannier_index], ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-		}
-		else
-		{
-			if(L[wannier_index] == -1 && m[wannier_index] == 0)
-			{	
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> *tem_array = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs2 * tem_array[ig] + bs2 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array;
-				
-			}
-			else if(L[wannier_index] == -1 && m[wannier_index] == 1)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> *tem_array = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs2 * tem_array[ig] - bs2 * trial_orbitals_k(wannier_index,ig);
-				}	
-				delete[] tem_array;
-			}
-			else if(L[wannier_index] == -2 && m[wannier_index] == 0)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs3 * tem_array_1[ig] - bs6 * tem_array_2[ig] + bs2 * trial_orbitals_k(wannier_index,ig);
-				}	
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-			}
-			else if(L[wannier_index] == -2 && m[wannier_index] == 1)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs3 * tem_array_1[ig] - bs6 * tem_array_2[ig] - bs2 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-			}			
-			else if(L[wannier_index] == -2 && m[wannier_index] == 2)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs3 * tem_array[ig] + 2 * bs6 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array;
-			}			
-			else if(L[wannier_index] == -3 && m[wannier_index] == 0)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_3 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_3[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = 0.5*(tem_array_1[ig] + tem_array_2[ig] + tem_array_3[ig] + trial_orbitals_k(wannier_index,ig));
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-				delete[] tem_array_3;
-				
-			}			
-			else if(L[wannier_index] == -3 && m[wannier_index] == 1)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_3 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_3[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = 0.5*(tem_array_1[ig] + tem_array_2[ig] - tem_array_3[ig] - trial_orbitals_k(wannier_index,ig));
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-				delete[] tem_array_3;
-			}			
-			else if(L[wannier_index] == -3 && m[wannier_index] == 2)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_3 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_3[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = 0.5*(tem_array_1[ig] - tem_array_2[ig] + tem_array_3[ig] - trial_orbitals_k(wannier_index,ig));
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-				delete[] tem_array_3;
-			}			
-			else if(L[wannier_index] == -3 && m[wannier_index] == 3)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_3 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_3[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = 0.5*(tem_array_1[ig] - tem_array_2[ig] - tem_array_3[ig] + trial_orbitals_k(wannier_index,ig));
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-				delete[] tem_array_3;
-			}			
-			else if(L[wannier_index] == -4 && m[wannier_index] == 0)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs3 * tem_array_1[ig] - bs6 * tem_array_2[ig] + bs2 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-			}			
-			else if(L[wannier_index] == -4 && m[wannier_index] == 1)
-			{	
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs3 * tem_array_1[ig] - bs6 * tem_array_2[ig] - bs2 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-			}			
-			else if(L[wannier_index] == -4 && m[wannier_index] == 2)
-			{	
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs3 * tem_array_1[ig] - 2 * bs6 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array_1;
-			}			
-			else if(L[wannier_index] == -4 && m[wannier_index] == 3)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs2 * tem_array_1[ig] + bs2 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array_1;
-			}			
-			else if(L[wannier_index] == -4 && m[wannier_index] == 4)
-			{	
-				get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = -1.0 * bs2 * tem_array_1[ig] + bs2 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array_1;
-			}			
-			else if(L[wannier_index] == -5 && m[wannier_index] == 0)
-			{	
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_3 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_3[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 2, 3, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs6 * tem_array_1[ig] - bs2 * tem_array_2[ig] - bs12 * tem_array_3[ig] + 0.5 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-				delete[] tem_array_3;
-			}			
-			else if(L[wannier_index] == -5 && m[wannier_index] == 1)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_3 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_3[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 2, 3, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs6 * tem_array_1[ig] + bs2 * tem_array_2[ig] - bs12 * tem_array_3[ig] + 0.5 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-				delete[] tem_array_3;
-			}			
-			else if(L[wannier_index] == -5 && m[wannier_index] == 2)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_3 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_3[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 2, 3, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs6 * tem_array_1[ig] - bs2 * tem_array_2[ig] - bs12 * tem_array_3[ig] - 0.5 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-				delete[] tem_array_3;
-			}			
-			else if(L[wannier_index] == -5 && m[wannier_index] == 3)
-			{	
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_3 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_3[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 2, 3, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs6 * tem_array_1[ig] + bs2 * tem_array_2[ig] - bs12 * tem_array_3[ig] - 0.5 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-				delete[] tem_array_3;
-			}			
-			else if(L[wannier_index] == -5 && m[wannier_index] == 4)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs6 * tem_array_1[ig] - bs2 * tem_array_2[ig] + bs3 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-			}			
-			else if(L[wannier_index] == -5 && m[wannier_index] == 5)
-			{
-				get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_1 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_1[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				std::complex<double> * tem_array_2 = new std::complex<double>[npwx];
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					tem_array_2[ig] = trial_orbitals_k(wannier_index,ig);
-				}
-				get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr,r,psir,mesh_r,gk,npw,trial_orbitals_k);
-				for(int ig = 0; ig < npwx; ig++)
-				{
-					trial_orbitals_k(wannier_index,ig) = bs6 * tem_array_1[ig] + bs2 * tem_array_2[ig] + bs3 * trial_orbitals_k(wannier_index,ig);
-				}
-				delete[] tem_array_1;
-				delete[] tem_array_2;
-			}	
-		}
-	}
-
-	
-	
+
+    for (int i = 0; i < num_wannier; i++)
+    {
+        if (L[i] < -5 || L[i] > 3)
+            std::cout << "toWannier90::produce_trial_in_pw() your L angular momentum is wrong , please check !!! "
+                      << std::endl;
+
+        if (L[i] >= 0)
+        {
+            if (m[i] < 0 || m[i] > 2 * L[i])
+                std::cout << "toWannier90::produce_trial_in_pw() your m momentum is wrong , please check !!! "
+                          << std::endl;
+        }
+        else
+        {
+            if (m[i] < 0 || m[i] > -L[i])
+                std::cout << "toWannier90::produce_trial_in_pw() your m momentum is wrong , please check !!! "
+                          << std::endl;
+        }
+    }
+
+    const int npw = GlobalC::kv.ngk[ik];
+    const int npwx = GlobalC::wf.npwx;
+    const int total_lm = 16;
+    ModuleBase::matrix ylm(total_lm, npw);
+
+    double bs2, bs3, bs6, bs12;
+    bs2 = 1.0 / sqrt(2.0);
+    bs3 = 1.0 / sqrt(3.0);
+    bs6 = 1.0 / sqrt(6.0);
+    bs12 = 1.0 / sqrt(12.0);
+
+    ModuleBase::Vector3<double> *gk = new ModuleBase::Vector3<double>[npw];
+    for (int ig = 0; ig < npw; ig++)
+    {
+        gk[ig] = GlobalC::wf.get_1qvec_cartesian(ik, ig);
+    }
+
+    ModuleBase::YlmReal::Ylm_Real(total_lm, npw, gk, ylm);
+
+    // test by jingan
+    // GlobalV::ofs_running << "the mathzone::ylm_real is successful!" << std::endl;
+    // GlobalV::ofs_running << "produce_trial_in_pw: num_wannier is " << num_wannier << std::endl;
+    // test by jingan
+
+    const int mesh_r = 333;
+    const double dx = 0.025;
+    const double x_min = -6.0;
+    ModuleBase::matrix r(num_wannier, mesh_r);
+    ModuleBase::matrix dr(num_wannier, mesh_r);
+    ModuleBase::matrix psi(num_wannier, mesh_r);
+    ModuleBase::matrix psir(num_wannier, mesh_r);
+    ModuleBase::matrix psik(num_wannier, npw);
+
+    for (int i = 0; i < num_wannier; i++)
+    {
+        double x = 0;
+        for (int ir = 0; ir < mesh_r; ir++)
+        {
+            x = x_min + ir * dx;
+            r(i, ir) = exp(x) / alfa[i];
+            dr(i, ir) = dx * r(i, ir);
+        }
+    }
+
+    for (int i = 0; i < num_wannier; i++)
+    {
+        double alfa32 = pow(alfa[i], 3.0 / 2.0);
+        double alfa_new = alfa[i];
+        int wannier_index = i;
+
+        if (rvalue[i] == 1)
+        {
+            for (int ir = 0; ir < mesh_r; ir++)
+            {
+                psi(wannier_index, ir) = 2.0 * alfa32 * exp(-alfa_new * r(wannier_index, ir));
+            }
+        }
+
+        if (rvalue[i] == 2)
+        {
+            for (int ir = 0; ir < mesh_r; ir++)
+            {
+                psi(wannier_index, ir) = 1.0 / sqrt(8.0) * alfa32 * (2.0 - alfa_new * r(wannier_index, ir))
+                                         * exp(-alfa_new * r(wannier_index, ir) * 0.5);
+            }
+        }
+
+        if (rvalue[i] == 3)
+        {
+            for (int ir = 0; ir < mesh_r; ir++)
+            {
+                psi(wannier_index, ir)
+                    = sqrt(4.0 / 27.0) * alfa32
+                      * (1.0 - 2.0 / 3.0 * alfa_new * r(wannier_index, ir)
+                         + 2.0 / 27.0 * pow(alfa_new, 2.0) * r(wannier_index, ir) * r(wannier_index, ir))
+                      * exp(-alfa_new * r(wannier_index, ir) * 1.0 / 3.0);
+            }
+        }
+    }
+
+    // psir = r * R(r), the quantity handed to the radial integration.
+    for (int i = 0; i < num_wannier; i++)
+    {
+        for (int ir = 0; ir < mesh_r; ir++)
+        {
+            psir(i, ir) = psi(i, ir) * r(i, ir);
+        }
+    }
+
+    for (int wannier_index = 0; wannier_index < num_wannier; wannier_index++)
+    {
+        // L >= 0: the trial orbital is a single real spherical harmonic (L, m).
+        if (L[wannier_index] >= 0)
+        {
+            get_trial_orbitals_lm_k(wannier_index,
+                                    L[wannier_index],
+                                    m[wannier_index],
+                                    ylm,
+                                    dr,
+                                    r,
+                                    psir,
+                                    mesh_r,
+                                    gk,
+                                    npw,
+                                    trial_orbitals_k);
+        }
+        else
+        {
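+            // L < 0: hybrid orbitals, built as linear combinations of (L, m) projections.
+            // L = -1: sp hybrids from (0,0) and (1,1), with coefficients +-1/sqrt(2).
+            if (L[wannier_index] == -1 && m[wannier_index] == 0)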
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = bs2 * tem_array[ig] + bs2 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array;
+            }
+            else if (L[wannier_index] == -1 && m[wannier_index] == 1)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = bs2 * tem_array[ig] - bs2 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array;
+            }
+            // L = -2: sp2 hybrids from (0,0), (1,1) and (1,2).
+            else if (L[wannier_index] == -2 && m[wannier_index] == 0)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = bs3 * tem_array_1[ig] - bs6 * tem_array_2[ig] + bs2 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+            }
+            else if (L[wannier_index] == -2 && m[wannier_index] == 1)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = bs3 * tem_array_1[ig] - bs6 * tem_array_2[ig] - bs2 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+            }
+            else if (L[wannier_index] == -2 && m[wannier_index] == 2)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = bs3 * tem_array[ig] + 2 * bs6 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array;
+            }
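+            // L = -3: sp3 hybrids from (0,0), (1,1), (1,2) and (1,0), each with weight 1/2 and varying signs.
+            else if (L[wannier_index] == -3 && m[wannier_index] == 0)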
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_3 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_3[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = 0.5
+                          * (tem_array_1[ig] + tem_array_2[ig] + tem_array_3[ig] + trial_orbitals_k(wannier_index, ig));
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+                delete[] tem_array_3;
+            }
+            else if (L[wannier_index] == -3 && m[wannier_index] == 1)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_3 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_3[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = 0.5
+                          * (tem_array_1[ig] + tem_array_2[ig] - tem_array_3[ig] - trial_orbitals_k(wannier_index, ig));
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+                delete[] tem_array_3;
+            }
+            else if (L[wannier_index] == -3 && m[wannier_index] == 2)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_3 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_3[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = 0.5
+                          * (tem_array_1[ig] - tem_array_2[ig] + tem_array_3[ig] - trial_orbitals_k(wannier_index, ig));
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+                delete[] tem_array_3;
+            }
+            else if (L[wannier_index] == -3 && m[wannier_index] == 3)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_3 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_3[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = 0.5
+                          * (tem_array_1[ig] - tem_array_2[ig] - tem_array_3[ig] + trial_orbitals_k(wannier_index, ig));
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+                delete[] tem_array_3;
+            }
+            // L = -4: sp3d hybrids; m = 0..2 combine (0,0), (1,1) and (1,2), m = 3, 4 mix (1,0) and (2,0).
+            else if (L[wannier_index] == -4 && m[wannier_index] == 0)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = bs3 * tem_array_1[ig] - bs6 * tem_array_2[ig] + bs2 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+            }
+            else if (L[wannier_index] == -4 && m[wannier_index] == 1)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = bs3 * tem_array_1[ig] - bs6 * tem_array_2[ig] - bs2 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+            }
+            else if (L[wannier_index] == -4 && m[wannier_index] == 2)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = bs3 * tem_array_1[ig] + 2 * bs6 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+            }
+            else if (L[wannier_index] == -4 && m[wannier_index] == 3)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = bs2 * tem_array_1[ig] + bs2 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+            }
+            else if (L[wannier_index] == -4 && m[wannier_index] == 4)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = -1.0 * bs2 * tem_array_1[ig] + bs2 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+            }
+            // L = -5: sp3d2 hybrids from (0,0), the (1,m) projections, (2,0) and (2,3).
+            else if (L[wannier_index] == -5 && m[wannier_index] == 0)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_3 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_3[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 2, 3, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig) = bs6 * tem_array_1[ig] - bs2 * tem_array_2[ig]
+                                                          - bs12 * tem_array_3[ig]
+                                                          + 0.5 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+                delete[] tem_array_3;
+            }
+            else if (L[wannier_index] == -5 && m[wannier_index] == 1)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 1, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_3 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_3[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 2, 3, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig) = bs6 * tem_array_1[ig] + bs2 * tem_array_2[ig]
+                                                          - bs12 * tem_array_3[ig]
+                                                          + 0.5 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+                delete[] tem_array_3;
+            }
+            else if (L[wannier_index] == -5 && m[wannier_index] == 2)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_3 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_3[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 2, 3, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig) = bs6 * tem_array_1[ig] - bs2 * tem_array_2[ig]
+                                                          - bs12 * tem_array_3[ig]
+                                                          - 0.5 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+                delete[] tem_array_3;
+            }
+            else if (L[wannier_index] == -5 && m[wannier_index] == 3)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 2, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_3 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_3[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 2, 3, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig) = bs6 * tem_array_1[ig] + bs2 * tem_array_2[ig]
+                                                          - bs12 * tem_array_3[ig]
+                                                          - 0.5 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+                delete[] tem_array_3;
+            }
+            else if (L[wannier_index] == -5 && m[wannier_index] == 4)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = bs6 * tem_array_1[ig] - bs2 * tem_array_2[ig] + bs3 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+            }
+            else if (L[wannier_index] == -5 && m[wannier_index] == 5)
+            {
+                get_trial_orbitals_lm_k(wannier_index, 0, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_1 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_1[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 1, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                std::complex<double> *tem_array_2 = new std::complex<double>[npwx];
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    tem_array_2[ig] = trial_orbitals_k(wannier_index, ig);
+                }
+                get_trial_orbitals_lm_k(wannier_index, 2, 0, ylm, dr, r, psir, mesh_r, gk, npw, trial_orbitals_k);
+                for (int ig = 0; ig < npwx; ig++)
+                {
+                    trial_orbitals_k(wannier_index, ig)
+                        = bs6 * tem_array_1[ig] + bs2 * tem_array_2[ig] + bs3 * trial_orbitals_k(wannier_index, ig);
+                }
+                delete[] tem_array_1;
+                delete[] tem_array_2;
+            }
+        }
+    }
+
+    delete[] gk;
 }
 
-// Note: the L value here must be greater than or equal to 0
-void toWannier90::get_trial_orbitals_lm_k(const int wannier_index, const int orbital_L, const int orbital_m, ModuleBase::matrix &ylm, 
-										ModuleBase::matrix &dr, ModuleBase::matrix &r, ModuleBase::matrix &psir, const int mesh_r, 
-										ModuleBase::Vector3<double> *gk, const int npw, ModuleBase::ComplexMatrix &trial_orbitals_k)
+void toWannier90::get_trial_orbitals_lm_k(const int wannier_index,
+                                          const int orbital_L,
+                                          const int orbital_m,
+                                          ModuleBase::matrix &ylm,
+                                          ModuleBase::matrix &dr,
+                                          ModuleBase::matrix &r,
+                                          ModuleBase::matrix &psir,
+                                          const int mesh_r,
+                                          ModuleBase::Vector3<double> *gk,
+                                          const int npw,
+                                          ModuleBase::ComplexMatrix &trial_orbitals_k)
 {
-	// Compute the projection of the radial function at a given k point
-	double *psik = new double[npw];
-	double *psir_tem = new double[mesh_r];
-	double *r_tem = new double[mesh_r];
-	double *dr_tem = new double[mesh_r];
-	double *psik_tem = new double[GlobalV::NQX];    // projection of the radial function on the fixed q grid (temporary array)
-	ModuleBase::GlobalFunc::ZEROS(psir_tem,mesh_r);
-	ModuleBase::GlobalFunc::ZEROS(r_tem,mesh_r);
-	ModuleBase::GlobalFunc::ZEROS(dr_tem,mesh_r);
-	
-	for(int ir = 0; ir < mesh_r; ir++)
-	{
-		psir_tem[ir] = psir(wannier_index,ir);
-		r_tem[ir] = r(wannier_index,ir);
-		dr_tem[ir] = dr(wannier_index,ir);
-	}
-	
-	toWannier90::integral(mesh_r,psir_tem,r_tem,dr_tem,orbital_L,psik_tem);
-	
-	// Interpolate from the GlobalV::NQX grid points to obtain the values at the npw G points
-	for(int ig = 0; ig < npw; ig++)
-	{
-		psik[ig] = ModuleBase::PolyInt::Polynomial_Interpolation(psik_tem, GlobalV::NQX, GlobalV::DQ, gk[ig].norm() * GlobalC::ucell.tpiba);
-	}
-	
-	
-	// 2. Phase factor in the plane-wave basis arising from the choice of origin (the orbital centre)
-	std::complex<double> *sk = new std::complex<double>[npw];
-	for(int ig = 0; ig < npw; ig++)
-	{
-		const double arg = ( gk[ig] * R_centre[wannier_index] ) * ModuleBase::TWO_PI;
-		sk[ig] = std::complex <double> ( cos(arg),  -sin(arg) );
-	}
-	
-	// 3. Compute wannier_ylm
-	double *wannier_ylm = new double[npw];
-	for(int ig = 0; ig < npw; ig++)
-	{
-		int index = orbital_L * orbital_L + orbital_m;
-		if(index == 2 || index == 3 || index == 5 || index == 6 || index == 14 || index == 15)
-		{
-			wannier_ylm[ig] = -1 * ylm(index,ig);
-		}
-		else
-		{
-			wannier_ylm[ig] = ylm(index,ig);
-		}
-	}
-	
-	// 4. Compute the projection of the trial orbital onto the plane-wave basis at a given k point
-	std::complex<double> lphase = pow(ModuleBase::NEG_IMAG_UNIT, orbital_L);
-	for(int ig = 0; ig < GlobalC::wf.npwx; ig++)
-	{
-		if(ig < npw)
-		{
-			trial_orbitals_k(wannier_index,ig) = lphase * sk[ig] * wannier_ylm[ig] * psik[ig];
-		}
-		else trial_orbitals_k(wannier_index,ig) = std::complex<double>(0.0,0.0);
-	}
-	
-	
-	// 5. Normalize
-	std::complex<double> anorm(0.0,0.0);
-	for(int ig = 0; ig < GlobalC::wf.npwx; ig++)
-	{
-		anorm = anorm + conj(trial_orbitals_k(wannier_index,ig)) * trial_orbitals_k(wannier_index,ig);
-	}
-	
-	std::complex<double> anorm_tem(0.0,0.0);
+
+    // Work arrays for the radial function of this orbital and its transform.
+    double *psik = new double[npw];
+    double *psir_tem = new double[mesh_r];
+    double *r_tem = new double[mesh_r];
+    double *dr_tem = new double[mesh_r];
+    double *psik_tem = new double[GlobalV::NQX];
+    ModuleBase::GlobalFunc::ZEROS(psir_tem, mesh_r);
+    ModuleBase::GlobalFunc::ZEROS(r_tem, mesh_r);
+    ModuleBase::GlobalFunc::ZEROS(dr_tem, mesh_r);
+
+    for (int ir = 0; ir < mesh_r; ir++)
+    {
+        psir_tem[ir] = psir(wannier_index, ir);
+        r_tem[ir] = r(wannier_index, ir);
+        dr_tem[ir] = dr(wannier_index, ir);
+    }
+
+    toWannier90::integral(mesh_r, psir_tem, r_tem, dr_tem, orbital_L, psik_tem);
+
+    for (int ig = 0; ig < npw; ig++)
+    {
+        psik[ig] = ModuleBase::PolyInt::Polynomial_Interpolation(psik_tem,
+                                                                 GlobalV::NQX,
+                                                                 GlobalV::DQ,
+                                                                 gk[ig].norm() * GlobalC::ucell.tpiba);
+    }
+
+    std::complex<double> *sk = new std::complex<double>[npw];
+    for (int ig = 0; ig < npw; ig++)
+    {
+        const double arg = (gk[ig] * R_centre[wannier_index]) * ModuleBase::TWO_PI;
+        sk[ig] = std::complex<double>(cos(arg), -sin(arg));
+    }
+
+    double *wannier_ylm = new double[npw];
+    for (int ig = 0; ig < npw; ig++)
+    {
+        int index = orbital_L * orbital_L + orbital_m;
+        if (index == 2 || index == 3 || index == 5 || index == 6 || index == 14 || index == 15)
+        {
+            wannier_ylm[ig] = -1 * ylm(index, ig);
+        }
+        else
+        {
+            wannier_ylm[ig] = ylm(index, ig);
+        }
+    }
+
+    std::complex<double> lphase = pow(ModuleBase::NEG_IMAG_UNIT, orbital_L);
+    for (int ig = 0; ig < GlobalC::wf.npwx; ig++)
+    {
+        if (ig < npw)
+        {
+            trial_orbitals_k(wannier_index, ig) = lphase * sk[ig] * wannier_ylm[ig] * psik[ig];
+        }
+        else
+            trial_orbitals_k(wannier_index, ig) = std::complex<double>(0.0, 0.0);
+    }
+
+    std::complex<double> anorm(0.0, 0.0);
+    for (int ig = 0; ig < GlobalC::wf.npwx; ig++)
+    {
+        anorm = anorm + conj(trial_orbitals_k(wannier_index, ig)) * trial_orbitals_k(wannier_index, ig);
+    }
+
+    std::complex<double> anorm_tem(0.0, 0.0);
 #ifdef __MPI
-	MPI_Allreduce(&anorm , &anorm_tem , 1, MPI_DOUBLE_COMPLEX , MPI_SUM , POOL_WORLD);
+    MPI_Allreduce(&anorm, &anorm_tem, 1, MPI_DOUBLE_COMPLEX, MPI_SUM, POOL_WORLD);
 #else
-	anorm_tem=anorm;
+    anorm_tem = anorm;
 #endif
-	
-	for(int ig = 0; ig < GlobalC::wf.npwx; ig++)
-	{
-		trial_orbitals_k(wannier_index,ig) = trial_orbitals_k(wannier_index,ig) / sqrt(anorm_tem);
-	}
-	
-	delete[] psik;
-	delete[] psir_tem;
-	delete[] r_tem;
-	delete[] dr_tem;
-	delete[] psik_tem;
-	delete[] sk;
-	delete[] wannier_ylm;
-	
-	return;
-	
-}
 
+    for (int ig = 0; ig < GlobalC::wf.npwx; ig++)
+    {
+        trial_orbitals_k(wannier_index, ig) = trial_orbitals_k(wannier_index, ig) / sqrt(anorm_tem);
+    }
+
+    delete[] psik;
+    delete[] psir_tem;
+    delete[] r_tem;
+    delete[] dr_tem;
+    delete[] psik_tem;
+    delete[] sk;
+    delete[] wannier_ylm;
 
-void toWannier90::integral(const int meshr, const double *psir, const double *r, const double *rab, const int &l, double* table)
+    return;
+}
+
+void toWannier90::integral(const int meshr,
+                           const double *psir,
+                           const double *r,
+                           const double *rab,
+                           const int &l,
+                           double *table)
 {
-	const double pref = ModuleBase::FOUR_PI / sqrt(GlobalC::ucell.omega);
-	
-	double *inner_part = new double[meshr];
-	for(int ir=0; ir<meshr; ir++)
-	{
-		inner_part[ir] = psir[ir] * psir[ir];
-	}
-	
-	double unit = 0.0;
-	ModuleBase::Integral::Simpson_Integral(meshr, inner_part, rab, unit);
-	delete[] inner_part;
-
-	double *aux = new double[meshr];
-	double *vchi = new double[meshr];
-	for (int iq=0; iq<GlobalV::NQX; iq++)
-	{
-		const double q = GlobalV::DQ * iq;
-		ModuleBase::Sphbes::Spherical_Bessel(meshr, r, q, l, aux);
-		for (int ir = 0;ir < meshr;ir++)
-		{
-			vchi[ir] = psir[ir] * aux[ir] * r[ir];
-		}
-		
-		double vqint = 0.0;
-		ModuleBase::Integral::Simpson_Integral(meshr, vchi, rab, vqint);
-
-		table[iq] =  vqint * pref;
-	}
-	delete[] aux;
-	delete[] vchi;
-	return;
+    const double pref = ModuleBase::FOUR_PI / sqrt(GlobalC::ucell.omega);
+
+    double *inner_part = new double[meshr];
+    for (int ir = 0; ir < meshr; ir++)
+    {
+        inner_part[ir] = psir[ir] * psir[ir];
+    }
+
+    double unit = 0.0;
+    ModuleBase::Integral::Simpson_Integral(meshr, inner_part, rab, unit);
+    delete[] inner_part;
+
+    double *aux = new double[meshr];
+    double *vchi = new double[meshr];
+    for (int iq = 0; iq < GlobalV::NQX; iq++)
+    {
+        const double q = GlobalV::DQ * iq;
+        ModuleBase::Sphbes::Spherical_Bessel(meshr, r, q, l, aux);
+        for (int ir = 0; ir < meshr; ir++)
+        {
+            vchi[ir] = psir[ir] * aux[ir] * r[ir];
+        }
+
+        double vqint = 0.0;
+        ModuleBase::Integral::Simpson_Integral(meshr, vchi, rab, vqint);
+
+        table[iq] = vqint * pref;
+    }
+    delete[] aux;
+    delete[] vchi;
+    return;
 }
 
+/*
+void toWannier90::ToRealSpace(const int &ik,
+                              const int &ib,
+                              const ModuleBase::ComplexMatrix *evc,
+                              std::complex<double> *psir,
+                              const ModuleBase::Vector3<double> G)
+{
+    // (1) set value
+    std::complex<double> *phase = GlobalC::UFFT.porter;
+    ModuleBase::GlobalFunc::ZEROS(psir, GlobalC::wfcpw->nrxx);
+    ModuleBase::GlobalFunc::ZEROS(phase, GlobalC::wfcpw->nrxx);
+
+    for (int ig = 0; ig < GlobalC::kv.ngk[ik]; ig++)
+    {
+        psir[GlobalC::wfcpw->ng2fftw[GlobalC::wf.igk(ik, ig)]] = evc[ik](ib, ig);
+    }
+
+    // get the phase value in realspace
+    for (int ig = 0; ig < GlobalC::wfcpw->ngmw; ig++)
+    {
+        if (GlobalC::wfcpw->ndirect[ig] == G)
+        {
+            phase[GlobalC::wfcpw->ng2fftw[ig]] = std::complex<double>(1.0, 0.0);
+            break;
+        }
+    }
+    // (2) fft and get value
+    GlobalC::wfcpw->nFT_wfc.FFT3D(psir, 1);
+    GlobalC::wfcpw->nFT_wfc.FFT3D(phase, 1);
+
+    for (int ir = 0; ir < GlobalC::wfcpw->nrxx; ir++)
+    {
+        psir[ir] = psir[ir] * phase[ir];
+    }
+    return;
+}
 
-// void toWannier90::ToRealSpace(const int &ik, const int &ib, const ModuleBase::ComplexMatrix *evc, std::complex<double> *psir, const ModuleBase::Vector3<double> G)
-// {
-// 	// (1) set value
-// 	std::complex<double> *phase = GlobalC::UFFT.porter;
-//     ModuleBase::GlobalFunc::ZEROS( psir, GlobalC::wfcpw->nrxx );
-// 	ModuleBase::GlobalFunc::ZEROS( phase, GlobalC::wfcpw->nrxx);
-
-
-//     for (int ig = 0; ig < GlobalC::kv.ngk[ik]; ig++)
-//     {
-//         psir[ GlobalC::wfcpw->ng2fftw[ GlobalC::wf.igk(ik,ig) ] ] = evc[ik](ib, ig);
-//     }
-	
-// 	// get the phase value in realspace
-// 	for (int ig = 0; ig < GlobalC::wfcpw->ngmw; ig++)
-// 	{
-// 		if (GlobalC::wfcpw->ndirect[ig] == G)
-// 		{
-// 			phase[ GlobalC::wfcpw->ng2fftw[ig] ] = std::complex<double>(1.0,0.0);
-// 			break;
-// 		}
-// 	}
-// 	// (2) fft and get value
-//     GlobalC::wfcpw->nFT_wfc.FFT3D(psir, 1);
-// 	GlobalC::wfcpw->nFT_wfc.FFT3D(phase, 1);
-	
-
-	
-// 	for (int ir = 0; ir < GlobalC::wfcpw->nrxx; ir++)
-// 	{
-// 		psir[ir] = psir[ir] * phase[ir];
-// 	}
-//     return;
-// }
-
-// std::complex<double> toWannier90::unkdotb(const std::complex<double> *psir, const int ikb, const int bandindex, const ModuleBase::ComplexMatrix *wfc_pw)
-// {
-// 	std::complex<double> result(0.0,0.0);
-// 	int knumber = GlobalC::kv.ngk[ikb];
-// 	std::complex<double> *porter = GlobalC::UFFT.porter;
-// 	ModuleBase::GlobalFunc::ZEROS( porter, GlobalC::wfcpw->nrxx);
-// 	for (int ir = 0; ir < GlobalC::wfcpw->nrxx; ir++)
-// 	{
-// 		porter[ir] = psir[ir];
-// 	}
-// 	GlobalC::wfcpw->nFT_wfc.FFT3D( porter, -1);
-	
-	
-// 	for (int ig = 0; ig < knumber; ig++)
-// 	{
-// 		result = result + conj( porter[ GlobalC::wfcpw->ng2fftw[GlobalC::wf.igk(ikb, ig)] ] ) * wfc_pw[ikb](bandindex,ig);	
-		
-// 	}
-// 	return result;
-// }
-
-std::complex<double> toWannier90::unkdotkb(const int &ik, const int &ikb, const int &iband_L, const int &iband_R, const ModuleBase::Vector3<double> G, const psi::Psi<std::complex<double>>& wfc_pw)
+std::complex<double> toWannier90::unkdotb(const std::complex<double> *psir,
+                                          const int ikb,
+                                          const int bandindex,
+                                          const ModuleBase::ComplexMatrix *wfc_pw)
 {
-	// (1) set value
-	std::complex<double> result(0.0,0.0);
-	std::complex<double> *psir = new std::complex<double>[GlobalC::wfcpw->nmaxgr];
-	std::complex<double> *phase = new std::complex<double>[GlobalC::rhopw->nmaxgr];
-	
-	// get the phase value in realspace
-	for (int ig = 0; ig < GlobalC::rhopw->npw; ig++)
-	{
-		if (GlobalC::rhopw->gdirect[ig] == G) //It should be used carefully. We cannot judge if two double are equal.
-		{
-			phase[ig] = std::complex<double>(1.0,0.0);
-			break;
-		}
-	}
-	
-	// (2) fft and get value
-	GlobalC::rhopw->recip2real(phase, phase);
-	GlobalC::wfcpw->recip2real(&wfc_pw(ik,iband_L,0), psir, ik);
-		
-	for (int ir = 0; ir < GlobalC::wfcpw->nrxx; ir++)
-	{
-		psir[ir] *= phase[ir];
-	}
-
-	GlobalC::wfcpw->real2recip(psir, psir, ik);
-	
-	std::complex<double> result_tem(0.0,0.0);
-	
-	for (int ig = 0; ig < GlobalC::kv.ngk[ikb]; ig++)
-	{
-		result_tem = result_tem + conj( psir[ig]) * wfc_pw(ikb, iband_R,ig);	
-		
-	}
+    std::complex<double> result(0.0, 0.0);
+    int knumber = GlobalC::kv.ngk[ikb];
+    std::complex<double> *porter = GlobalC::UFFT.porter;
+    ModuleBase::GlobalFunc::ZEROS(porter, GlobalC::wfcpw->nrxx);
+    for (int ir = 0; ir < GlobalC::wfcpw->nrxx; ir++)
+    {
+        porter[ir] = psir[ir];
+    }
+    GlobalC::wfcpw->nFT_wfc.FFT3D(porter, -1);
+
+    for (int ig = 0; ig < knumber; ig++)
+    {
+        result = result + conj(porter[GlobalC::wfcpw->ng2fftw[GlobalC::wf.igk(ikb, ig)]]) * wfc_pw[ikb](bandindex, ig);
+    }
+    return result;
+}
+*/
+std::complex<double> toWannier90::unkdotkb(const int &ik,
+                                           const int &ikb,
+                                           const int &iband_L,
+                                           const int &iband_R,
+                                           const ModuleBase::Vector3<double> G,
+                                           const psi::Psi<std::complex<double>> &wfc_pw)
+{
+    // (1) set value
+    std::complex<double> result(0.0, 0.0);
+    std::complex<double> *psir = new std::complex<double>[GlobalC::wfcpw->nmaxgr];
+    std::complex<double> *phase = new std::complex<double>[GlobalC::rhopw->nmaxgr];
+
+    // get the phase value in realspace
+    for (int ig = 0; ig < GlobalC::rhopw->npw; ig++)
+    {
+        if (GlobalC::rhopw->gdirect[ig] == G) // Use with care: exact equality of doubles is unreliable.
+        {
+            phase[ig] = std::complex<double>(1.0, 0.0);
+            break;
+        }
+    }
+
+    // (2) fft and get value
+    GlobalC::rhopw->recip2real(phase, phase);
+    GlobalC::wfcpw->recip2real(&wfc_pw(ik, iband_L, 0), psir, ik);
+
+    for (int ir = 0; ir < GlobalC::wfcpw->nrxx; ir++)
+    {
+        psir[ir] *= phase[ir];
+    }
+
+    GlobalC::wfcpw->real2recip(psir, psir, ik);
+
+    std::complex<double> result_tem(0.0, 0.0);
+
+    for (int ig = 0; ig < GlobalC::kv.ngk[ikb]; ig++)
+    {
+        result_tem = result_tem + conj(psir[ig]) * wfc_pw(ikb, iband_R, ig);
+    }
 #ifdef __MPI
-	MPI_Allreduce(&result_tem , &result , 1, MPI_DOUBLE_COMPLEX , MPI_SUM , POOL_WORLD);
+    MPI_Allreduce(&result_tem, &result, 1, MPI_DOUBLE_COMPLEX, MPI_SUM, POOL_WORLD);
 #else
-	result=result_tem;
+    result = result_tem;
 #endif
-	delete[] psir;	
-	delete[] phase;
-	return result;	
-	
+    delete[] psir;
+    delete[] phase;
+    return result;
+}
+
+/*
+std::complex<double> toWannier90::gamma_only_cal(const int &ib_L,
+                                                   const int &ib_R,
+                                                   const ModuleBase::ComplexMatrix *wfc_pw,
+                                                   const ModuleBase::Vector3<double> G)
+{
+    std::complex<double> *phase = new std::complex<double>[GlobalC::wfcpw->nrxx];
+    std::complex<double> *psir = new std::complex<double>[GlobalC::wfcpw->nrxx];
+    std::complex<double> *psir_2 = new std::complex<double>[GlobalC::wfcpw->nrxx];
+    ModuleBase::GlobalFunc::ZEROS(phase, GlobalC::wfcpw->nrxx);
+    ModuleBase::GlobalFunc::ZEROS(psir, GlobalC::wfcpw->nrxx);
+    ModuleBase::GlobalFunc::ZEROS(psir_2, GlobalC::wfcpw->nrxx);
+
+    for (int ig = 0; ig < GlobalC::kv.ngk[0]; ig++)
+    {
+        // psir[ GlobalC::wfcpw->ng2fftw[ GlobalC::wf.igk(0,ig) ] ] = wfc_pw[0](ib_L, ig);
+        psir[GlobalC::wfcpw->ng2fftw[GlobalC::wf.igk(0, ig)]] = std::complex<double>(abs(wfc_pw[0](ib_L, ig)), 0.0);
+    }
+
+    // get the phase value in realspace
+    for (int ig = 0; ig < GlobalC::wfcpw->ngmw; ig++)
+    {
+        if (GlobalC::wfcpw->ndirect[ig] == G)
+        {
+            phase[GlobalC::wfcpw->ng2fftw[ig]] = std::complex<double>(1.0, 0.0);
+            break;
+        }
+    }
+    // (2) fft and get value
+    GlobalC::wfcpw->nFT_wfc.FFT3D(psir, 1);
+    GlobalC::wfcpw->nFT_wfc.FFT3D(phase, 1);
+
+    for (int ir = 0; ir < GlobalC::wfcpw->nrxx; ir++)
+    {
+        psir_2[ir] = conj(psir[ir]) * phase[ir];
+    }
+
+    for (int ir = 0; ir < GlobalC::wfcpw->nrxx; ir++)
+    {
+        psir[ir] = psir[ir] * phase[ir];
+    }
+
+    GlobalC::wfcpw->nFT_wfc.FFT3D(psir, -1);
+    GlobalC::wfcpw->nFT_wfc.FFT3D(psir_2, -1);
+
+    std::complex<double> result(0.0, 0.0);
+
+    for (int ig = 0; ig < GlobalC::kv.ngk[0]; ig++)
+    {
+        // result = result + conj(psir_2[GlobalC::wfcpw->ng2fftw[GlobalC::wf.igk(0, ig)]]) * wfc_pw[0](ib_R, ig)
+        //          + psir[GlobalC::wfcpw->ng2fftw[GlobalC::wf.igk(0, ig)]] * conj(wfc_pw[0](ib_R, ig));
+        // std::complex<double> tem = std::complex<double>(abs(wfc_pw[0](ib_R, ig)), 0.0);
+        result = result + conj(psir[GlobalC::wfcpw->ng2fftw[GlobalC::wf.igk(0, ig)]]); // * tem;
+    }
+
+    delete[] phase;
+    delete[] psir;
+    delete[] psir_2;
+
+    return result;
 }
+*/
 
-// std::complex<double> toWannier90::gamma_only_cal(const int &ib_L, const int &ib_R, const ModuleBase::ComplexMatrix *wfc_pw, const ModuleBase::Vector3<double> G)
-// {
-// 	std::complex<double> *phase = new std::complex<double>[GlobalC::wfcpw->nrxx];
-// 	std::complex<double> *psir = new std::complex<double>[GlobalC::wfcpw->nrxx];
-// 	std::complex<double> *psir_2 = new std::complex<double>[GlobalC::wfcpw->nrxx];
-// 	ModuleBase::GlobalFunc::ZEROS( phase, GlobalC::wfcpw->nrxx);
-// 	ModuleBase::GlobalFunc::ZEROS( psir, GlobalC::wfcpw->nrxx);
-// 	ModuleBase::GlobalFunc::ZEROS( psir_2, GlobalC::wfcpw->nrxx);
-
-//     for (int ig = 0; ig < GlobalC::kv.ngk[0]; ig++)
-//     {
-//         //psir[ GlobalC::wfcpw->ng2fftw[ GlobalC::wf.igk(0,ig) ] ] = wfc_pw[0](ib_L, ig);
-// 		psir[ GlobalC::wfcpw->ng2fftw[ GlobalC::wf.igk(0,ig) ] ] = std::complex<double> ( abs(wfc_pw[0](ib_L, ig)), 0.0 );
-//     }
-	
-// 	// get the phase value in realspace
-// 	for (int ig = 0; ig < GlobalC::wfcpw->ngmw; ig++)
-// 	{
-// 		if (GlobalC::wfcpw->ndirect[ig] == G)
-// 		{
-// 			phase[ GlobalC::wfcpw->ng2fftw[ig] ] = std::complex<double>(1.0,0.0);
-// 			break;
-// 		}
-// 	}
-// 	// (2) fft and get value
-//     GlobalC::wfcpw->nFT_wfc.FFT3D(psir, 1);
-// 	GlobalC::wfcpw->nFT_wfc.FFT3D(phase, 1);
-	
-// 	for (int ir = 0; ir < GlobalC::wfcpw->nrxx; ir++)
-// 	{
-// 		psir_2[ir] = conj(psir[ir]) * phase[ir];
-// 	}
-	
-// 		for (int ir = 0; ir < GlobalC::wfcpw->nrxx; ir++)
-// 	{
-// 		psir[ir] = psir[ir] * phase[ir];
-// 	}
-	
-// 	GlobalC::wfcpw->nFT_wfc.FFT3D( psir, -1);
-// 	GlobalC::wfcpw->nFT_wfc.FFT3D( psir_2, -1);
-	
-// 	std::complex<double> result(0.0,0.0);
-	
-// 	for (int ig = 0; ig < GlobalC::kv.ngk[0]; ig++)
-// 	{
-// 		//result = result + conj(psir_2[ GlobalC::wfcpw->ng2fftw[GlobalC::wf.igk(0,ig)] ]) * wfc_pw[0](ib_R,ig) + psir[ GlobalC::wfcpw->ng2fftw[ GlobalC::wf.igk(0,ig)] ] * conj(wfc_pw[0](ib_R,ig));
-// 		//std::complex<double> tem = std::complex<double>( abs(wfc_pw[0](ib_R,ig)), 0.0 );
-// 		result = result +  conj(psir[ GlobalC::wfcpw->ng2fftw[ GlobalC::wf.igk(0,ig)] ]);// * tem;
-// 	}
-	
-// 	delete[] phase;
-// 	delete[] psir;
-// 	delete[] psir_2;
-	
-// 	return result;
-	
-// }
-
-// use the lcao_in_pw method to convert the LCAO basis to the PW basis
 #ifdef __LCAO
 void toWannier90::lcao2pw_basis(const int ik, ModuleBase::ComplexMatrix &orbital_in_G)
 {
-	this->table_local.create(GlobalC::ucell.ntype, GlobalC::ucell.nmax_total, GlobalV::NQX);
-	Wavefunc_in_pw::make_table_q(GlobalC::ORB.orbital_file, this->table_local);
-	Wavefunc_in_pw::produce_local_basis_in_pw(ik, orbital_in_G, this->table_local);
+    this->table_local.create(GlobalC::ucell.ntype, GlobalC::ucell.nmax_total, GlobalV::NQX);
+    Wavefunc_in_pw::make_table_q(GlobalC::ORB.orbital_file, this->table_local);
+    Wavefunc_in_pw::produce_local_basis_in_pw(ik, orbital_in_G, this->table_local);
 }
 
-
-// build the periodic part unk of the wavefunction in the PW basis from the LCAO basis: unk_inLcao[ik](ib,ig), where ig ranges over GlobalC::kv.ngk[ik]
 void toWannier90::getUnkFromLcao()
 {
-	std::complex<double>*** lcao_wfc_global = new std::complex<double>**[num_kpts];
-	for(int ik = 0; ik < num_kpts; ik++)
-	{
-		lcao_wfc_global[ik] = new std::complex<double>*[GlobalV::NBANDS];
-		for(int ib = 0; ib < GlobalV::NBANDS; ib++)
-		{
-			lcao_wfc_global[ik][ib] = new std::complex<double>[GlobalV::NLOCAL];
-			ModuleBase::GlobalFunc::ZEROS(lcao_wfc_global[ik][ib], GlobalV::NLOCAL);
-		}
-	}
-	
-	
-	if(this->unk_inLcao != nullptr)
-	{
-		delete this->unk_inLcao;
-	}
-	this->unk_inLcao = new psi::Psi<std::complex<double>>(num_kpts, GlobalV::NBANDS, GlobalC::wf.npwx, nullptr);
-	ModuleBase::ComplexMatrix *orbital_in_G = new ModuleBase::ComplexMatrix[num_kpts];
-
-	for(int ik = 0; ik < num_kpts; ik++)
-	{
-		// get the global LCAO wavefunction coefficients
-		get_lcao_wfc_global_ik(lcao_wfc_global[ik], this->wfc_k_grid[ik]);
-
-		int npw = GlobalC::kv.ngk[ik];
-		orbital_in_G[ik].create(GlobalV::NLOCAL,npw);
-		this->lcao2pw_basis(ik,orbital_in_G[ik]);
-	
-	}
-
-	// convert unk from the LCAO basis to unk in the PW basis
-	for(int ik = 0; ik < num_kpts; ik++)
-	{
-		for(int ib = 0; ib < GlobalV::NBANDS; ib++)
-		{
-			for(int ig = 0; ig < GlobalC::kv.ngk[ik]; ig++)
-			{
-				for(int iw = 0; iw < GlobalV::NLOCAL; iw++)
-				{
-					unk_inLcao[0](ik,ib,ig) += orbital_in_G[ik](iw,ig)*lcao_wfc_global[ik][ib][iw];
-				}
-			}
-		}
-	}
-
-	// normalize
-	for(int ik = 0; ik < num_kpts; ik++)
-	{
-		for(int ib = 0; ib < GlobalV::NBANDS; ib++)
-		{
-			std::complex<double> anorm(0.0,0.0);
-			for(int ig = 0; ig < GlobalC::kv.ngk[ik]; ig++)
-			{
-				anorm = anorm + conj( unk_inLcao[0](ik,ib,ig) ) * unk_inLcao[0](ik,ib,ig);
-			}
-			
-			std::complex<double> anorm_tem(0.0,0.0);
-			#ifdef __MPI
-			MPI_Allreduce(&anorm , &anorm_tem , 1, MPI_DOUBLE_COMPLEX , MPI_SUM , POOL_WORLD);
-			#endif
-
-			for(int ig = 0; ig < GlobalC::kv.ngk[ik]; ig++)
-			{
-				unk_inLcao[0](ik,ib,ig) = unk_inLcao[0](ik,ib,ig) / sqrt(anorm_tem);
-			}
-			
-		}
-	}
-	
-
-	for(int ik = 0; ik < GlobalC::kv.nkstot; ik++)
-	{
-		for(int ib = 0; ib < GlobalV::NBANDS; ib++)
-		{
-			delete[] lcao_wfc_global[ik][ib];
-		}
-		delete[] lcao_wfc_global[ik];
-	}
-	delete[] lcao_wfc_global;
-	
-	delete[] orbital_in_G;
+    std::complex<double> ***lcao_wfc_global = new std::complex<double> **[num_kpts];
+    for (int ik = 0; ik < num_kpts; ik++)
+    {
+        lcao_wfc_global[ik] = new std::complex<double> *[GlobalV::NBANDS];
+        for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+        {
+            lcao_wfc_global[ik][ib] = new std::complex<double>[GlobalV::NLOCAL];
+            ModuleBase::GlobalFunc::ZEROS(lcao_wfc_global[ik][ib], GlobalV::NLOCAL);
+        }
+    }
+
+    if (this->unk_inLcao != nullptr)
+    {
+        delete this->unk_inLcao;
+    }
+    this->unk_inLcao = new psi::Psi<std::complex<double>>(num_kpts, GlobalV::NBANDS, GlobalC::wf.npwx, nullptr);
+    ModuleBase::ComplexMatrix *orbital_in_G = new ModuleBase::ComplexMatrix[num_kpts];
+
+    for (int ik = 0; ik < num_kpts; ik++)
+    {
+
+        get_lcao_wfc_global_ik(lcao_wfc_global[ik], this->wfc_k_grid[ik]);
+
+        int npw = GlobalC::kv.ngk[ik];
+        orbital_in_G[ik].create(GlobalV::NLOCAL, npw);
+        this->lcao2pw_basis(ik, orbital_in_G[ik]);
+    }
+
+    for (int ik = 0; ik < num_kpts; ik++)
+    {
+        for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+        {
+            for (int ig = 0; ig < GlobalC::kv.ngk[ik]; ig++)
+            {
+                for (int iw = 0; iw < GlobalV::NLOCAL; iw++)
+                {
+                    unk_inLcao[0](ik, ib, ig) += orbital_in_G[ik](iw, ig) * lcao_wfc_global[ik][ib][iw];
+                }
+            }
+        }
+    }
+
+    for (int ik = 0; ik < num_kpts; ik++)
+    {
+        for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+        {
+            std::complex<double> anorm(0.0, 0.0);
+            for (int ig = 0; ig < GlobalC::kv.ngk[ik]; ig++)
+            {
+                anorm = anorm + conj(unk_inLcao[0](ik, ib, ig)) * unk_inLcao[0](ik, ib, ig);
+            }
+
+            std::complex<double> anorm_tem = anorm; // serial fallback; overwritten by MPI_Allreduce under __MPI
+#ifdef __MPI
+            MPI_Allreduce(&anorm, &anorm_tem, 1, MPI_DOUBLE_COMPLEX, MPI_SUM, POOL_WORLD);
+#endif
+
+            for (int ig = 0; ig < GlobalC::kv.ngk[ik]; ig++)
+            {
+                unk_inLcao[0](ik, ib, ig) = unk_inLcao[0](ik, ib, ig) / sqrt(anorm_tem);
+            }
+        }
+    }
+
+    for (int ik = 0; ik < num_kpts; ik++) // same bound as the allocation loop above
+    {
+        for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+        {
+            delete[] lcao_wfc_global[ik][ib];
+        }
+        delete[] lcao_wfc_global[ik];
+    }
+    delete[] lcao_wfc_global;
+
+    delete[] orbital_in_G;
 
 #ifdef __MPI
-	MPI_Barrier(MPI_COMM_WORLD);
+    MPI_Barrier(MPI_COMM_WORLD);
 #endif
 
-	return;
+    return;
 }
 
-// get the global LCAO wavefunction coefficients
 void toWannier90::get_lcao_wfc_global_ik(std::complex<double> **ctot, std::complex<double> **cc)
 {
-	std::complex<double>* ctot_send = new std::complex<double>[GlobalV::NBANDS*GlobalV::NLOCAL];
+    std::complex<double> *ctot_send = new std::complex<double>[GlobalV::NBANDS * GlobalV::NLOCAL];
 
 #ifdef __MPI
-	MPI_Status status;
+    MPI_Status status;
 #endif
 
-	for (int i=0; i<GlobalV::DSIZE; i++)
-	{
-		if (GlobalV::DRANK==0)
-		{
-			if (i==0)
-			{
-				// get the wave functions from 'ctot',
-				// save them in the matrix 'c'.
-				for (int iw=0; iw<GlobalV::NLOCAL; iw++)
-				{
-					const int mu_local = GlobalC::GridT.trace_lo[iw];
-					if (mu_local >= 0)
-					{
-						for (int ib=0; ib<GlobalV::NBANDS; ib++)
-						{
-							//ctot[ib][iw] = cc[ib][mu_local];
-							ctot_send[ib*GlobalV::NLOCAL+iw] = cc[ib][mu_local];
-						}
-					}
-				}
-			}
-			else
-			{
-				int tag;
-				// receive lgd2
-				int lgd2 = 0;
-				tag = i * 3;
-				#ifdef __MPI
-				MPI_Recv(&lgd2, 1, MPI_INT, i, tag, DIAG_WORLD, &status);
-				#endif
-				if(lgd2==0)
-				{
-
-				}
-				else
-				{
-					// receive trace_lo2
-					tag = i * 3 + 1;
-					int* trace_lo2 = new int[GlobalV::NLOCAL];
-					#ifdef __MPI
-					MPI_Recv(trace_lo2, GlobalV::NLOCAL, MPI_INT, i, tag, DIAG_WORLD, &status);
-					#endif
-
-					// receive crecv
-					std::complex<double>* crecv = new std::complex<double>[GlobalV::NBANDS*lgd2];
-					ModuleBase::GlobalFunc::ZEROS(crecv, GlobalV::NBANDS*lgd2);
-					tag = i * 3 + 2;
-					#ifdef __MPI
-					MPI_Recv(crecv,GlobalV::NBANDS*lgd2,mpicomplex,i,tag,DIAG_WORLD, &status);
-					#endif
-					for (int ib=0; ib<GlobalV::NBANDS; ib++)
-					{
-						for (int iw=0; iw<GlobalV::NLOCAL; iw++)
-						{
-							const int mu_local = trace_lo2[iw];
-							if (mu_local>=0)
-							{
-								//ctot[ib][iw] = crecv[mu_local*GlobalV::NBANDS+ib];
-								ctot_send[ib*GlobalV::NLOCAL+iw] = crecv[mu_local*GlobalV::NBANDS+ib];
-							}
-						}
-					}
-
-					delete[] crecv;
-					delete[] trace_lo2;
-				}
-			}
-		}// end GlobalV::DRANK=0
-		else if ( i == GlobalV::DRANK)
-		{
-			int tag;
-
-			// send GlobalC::GridT.lgd
-			tag = GlobalV::DRANK * 3;
-			#ifdef __MPI
-			MPI_Send(&GlobalC::GridT.lgd, 1, MPI_INT, 0, tag, DIAG_WORLD);
-			#endif
-
-			if(GlobalC::GridT.lgd != 0)
-			{
-				// send trace_lo
-				tag = GlobalV::DRANK * 3 + 1;
-				#ifdef __MPI
-				MPI_Send(GlobalC::GridT.trace_lo, GlobalV::NLOCAL, MPI_INT, 0, tag, DIAG_WORLD);
-				#endif
-
-				// send cc
-				std::complex<double>* csend = new std::complex<double>[GlobalV::NBANDS*GlobalC::GridT.lgd];
-				ModuleBase::GlobalFunc::ZEROS(csend, GlobalV::NBANDS*GlobalC::GridT.lgd);
-
-				for (int ib=0; ib<GlobalV::NBANDS; ib++)
-				{
-					for (int mu=0; mu<GlobalC::GridT.lgd; mu++)
-					{
-						csend[mu*GlobalV::NBANDS+ib] = cc[ib][mu];
-					}
-				}
-	
-				tag = GlobalV::DRANK * 3 + 2;
-				#ifdef __MPI
-				MPI_Send(csend, GlobalV::NBANDS*GlobalC::GridT.lgd, mpicomplex, 0, tag, DIAG_WORLD);
-				#endif
-			
-
-				delete[] csend;
-
-			}
-		}// end i==GlobalV::DRANK
-		#ifdef __MPI
-		MPI_Barrier(DIAG_WORLD);
-		#endif
-	}
-	#ifdef __MPI
-	MPI_Bcast(ctot_send,GlobalV::NBANDS*GlobalV::NLOCAL,mpicomplex,0,DIAG_WORLD);
-	#endif
-
-	for(int ib = 0; ib < GlobalV::NBANDS; ib++)
-	{
-		for(int iw = 0; iw < GlobalV::NLOCAL; iw++)
-		{
-			ctot[ib][iw] = ctot_send[ib*GlobalV::NLOCAL+iw];
-		}
-	}
-
-	delete[] ctot_send;
-
-	return;
-}
+    for (int i = 0; i < GlobalV::DSIZE; i++)
+    {
+        if (GlobalV::DRANK == 0)
+        {
+            if (i == 0)
+            {
+                // copy this rank's local coefficients from 'cc'
+                // into the gather buffer 'ctot_send'.
+                for (int iw = 0; iw < GlobalV::NLOCAL; iw++)
+                {
+                    const int mu_local = GlobalC::GridT.trace_lo[iw];
+                    if (mu_local >= 0)
+                    {
+                        for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+                        {
+                            // ctot[ib][iw] = cc[ib][mu_local];
+                            ctot_send[ib * GlobalV::NLOCAL + iw] = cc[ib][mu_local];
+                        }
+                    }
+                }
+            }
+            else
+            {
+                int tag;
+                // receive lgd2
+                int lgd2 = 0;
+                tag = i * 3;
+#ifdef __MPI
+                MPI_Recv(&lgd2, 1, MPI_INT, i, tag, DIAG_WORLD, &status);
+#endif
+                if (lgd2 == 0)
+                {
+                }
+                else
+                {
+                    // receive trace_lo2
+                    tag = i * 3 + 1;
+                    int *trace_lo2 = new int[GlobalV::NLOCAL];
+#ifdef __MPI
+                    MPI_Recv(trace_lo2, GlobalV::NLOCAL, MPI_INT, i, tag, DIAG_WORLD, &status);
+#endif
 
+                    // receive crecv
+                    std::complex<double> *crecv = new std::complex<double>[GlobalV::NBANDS * lgd2];
+                    ModuleBase::GlobalFunc::ZEROS(crecv, GlobalV::NBANDS * lgd2);
+                    tag = i * 3 + 2;
+#ifdef __MPI
+                    MPI_Recv(crecv, GlobalV::NBANDS * lgd2, mpicomplex, i, tag, DIAG_WORLD, &status);
+#endif
+                    for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+                    {
+                        for (int iw = 0; iw < GlobalV::NLOCAL; iw++)
+                        {
+                            const int mu_local = trace_lo2[iw];
+                            if (mu_local >= 0)
+                            {
+                                // ctot[ib][iw] = crecv[mu_local*GlobalV::NBANDS+ib];
+                                ctot_send[ib * GlobalV::NLOCAL + iw] = crecv[mu_local * GlobalV::NBANDS + ib];
+                            }
+                        }
+                    }
+
+                    delete[] crecv;
+                    delete[] trace_lo2;
+                }
+            }
+        } // end GlobalV::DRANK=0
+        else if (i == GlobalV::DRANK)
+        {
+            int tag;
+
+            // send GlobalC::GridT.lgd
+            tag = GlobalV::DRANK * 3;
+#ifdef __MPI
+            MPI_Send(&GlobalC::GridT.lgd, 1, MPI_INT, 0, tag, DIAG_WORLD);
 #endif
 
+            if (GlobalC::GridT.lgd != 0)
+            {
+                // send trace_lo
+                tag = GlobalV::DRANK * 3 + 1;
+#ifdef __MPI
+                MPI_Send(GlobalC::GridT.trace_lo, GlobalV::NLOCAL, MPI_INT, 0, tag, DIAG_WORLD);
+#endif
 
+                // send cc
+                std::complex<double> *csend = new std::complex<double>[GlobalV::NBANDS * GlobalC::GridT.lgd];
+                ModuleBase::GlobalFunc::ZEROS(csend, GlobalV::NBANDS * GlobalC::GridT.lgd);
+
+                for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+                {
+                    for (int mu = 0; mu < GlobalC::GridT.lgd; mu++)
+                    {
+                        csend[mu * GlobalV::NBANDS + ib] = cc[ib][mu];
+                    }
+                }
+
+                tag = GlobalV::DRANK * 3 + 2;
+#ifdef __MPI
+                MPI_Send(csend, GlobalV::NBANDS * GlobalC::GridT.lgd, mpicomplex, 0, tag, DIAG_WORLD);
+#endif
+
+                delete[] csend;
+            }
+        } // end i==GlobalV::DRANK
+#ifdef __MPI
+        MPI_Barrier(DIAG_WORLD);
+#endif
+    }
+#ifdef __MPI
+    MPI_Bcast(ctot_send, GlobalV::NBANDS * GlobalV::NLOCAL, mpicomplex, 0, DIAG_WORLD);
+#endif
+
+    for (int ib = 0; ib < GlobalV::NBANDS; ib++)
+    {
+        for (int iw = 0; iw < GlobalV::NLOCAL; iw++)
+        {
+            ctot[ib][iw] = ctot_send[ib * GlobalV::NLOCAL + iw];
+        }
+    }
+
+    delete[] ctot_send;
+
+    return;
+}
+
+#endif
diff --git a/source/src_io/to_wannier90.h b/source/src_io/to_wannier90.h
index 2377135ad9..2e14cca1a1 100644
--- a/source/src_io/to_wannier90.h
+++ b/source/src_io/to_wannier90.h
@@ -3,101 +3,109 @@
 
 #include <iostream>
 using namespace std;
-#include <vector>
-#include <algorithm>
-#include <cmath>
-#include <cstdlib>
+#include "../module_base/complexmatrix.h"
 #include "../module_base/global_function.h"
 #include "../module_base/global_variable.h"
+#include "../module_base/lapack_connector.h"
 #include "../module_base/matrix.h"
 #include "../module_base/matrix3.h"
-#include "../module_base/complexmatrix.h"
-#include "../module_base/lapack_connector.h"
 #include "../src_lcao/wavefunc_in_pw.h"
 #include "module_psi/psi.h"
 
+#include <algorithm>
+#include <cmath>
+#include <cstdlib>
+#include <vector>
+
 #ifdef __LCAO
 #include "../src_lcao/local_orbital_wfc.h"
 #endif
 
-
 class toWannier90
 {
-public:
-	//const int k_supercell = 5;                                                              // default the k-space supercell
-	//const int k_cells = (2 * k_supercell + 1)*(2 * k_supercell + 1)*(2 * k_supercell + 1);  // the primitive cell number in k-space supercell
-	//const int k_shells = 12;                                                                // default the shell numbers
-	//const double large_number = 99999999.0;
-	//const double small_number = 0.000001;
-	//std::vector<ModuleBase::Vector3<double>> lmn;                                                            // primitive-cell index of each k point
-	//std::vector<double> dist_shell;                                                              // distance to the neighbouring k points in each shell
-	//std::vector<int> multi;                                                                      // number of neighbouring k points in each shell
-	//int num_shell_real;                                                                     // final number of shells that actually satisfy the B1 condition (counted from 1)
-	//int *shell_list_real;                                                                   // labels of the non-parallel, inequivalent shells among shells 1 to 12; length num_shell_real
-	//double *bweight;                                                                        // bweight of each shell; length num_shell_real
-	
-	int num_kpts;                                                                           // k�����Ŀ
-	int cal_num_kpts;                                                                       // ��Ҫ�����k����Ŀ������nspin=2ʱ���ô�
-	ModuleBase::Matrix3 recip_lattice;
-	std::vector<std::vector<int>> nnlist;                                                             //ÿ��k��Ľ���k�����
-	std::vector<std::vector<ModuleBase::Vector3<double>>> nncell;                                                 //ÿ��k��Ľ���k�����ڵ�ԭ�����
-	int nntot = 0;                                                                          //ÿ��k��Ľ���k����Ŀ   
-	int num_wannier;																		//��Ҫ����wannier�����ĸ���
-	int *L;																					//��̽����Ľ�������ָ��,����Ϊnum_wannier
-	int *m;																					//��̽����Ĵ�������ָ��,����Ϊnum_wannier
-	int *rvalue;																			//��̽����ľ��򲿷ֺ�����ʽ,ֻ��������ʽ,����Ϊnum_wannier
-	double *alfa;																			//��̽����ľ��򲿷ֺ����еĵ��ڲ���,����Ϊnum_wannier
-	ModuleBase::Vector3<double> *R_centre;																//��̽�����������,����Ϊnum_wannier,cartesian����
-	std::string wannier_file_name = "seedname";                                                  // .mmn,.amn�ļ���
-	int num_exclude_bands = 0;																// �ų�������ܴ���Ŀ��-1��ʾû����Ҫ�ų����ܴ�
-	int *exclude_bands;                                                                     // �ų��ܴ���index
-	bool *tag_cal_band;																		// �ж�GlobalV::NBANDS�ܴ���һ����Ҫ����
-	int num_bands;																		   	// wannier90 �е�num_bands
-	bool gamma_only_wannier = false;														// ֻ��gamma������wannier����
-	std::string wannier_spin = "up";                                                             // spin��������up,down��������
-	int start_k_index = 0;                                                                  // ����forѭ��Ѱ��k��ָ�꣬spin=2ʱ��ʼ��index�ǲ�һ����
-
-	
-	// ������lcao�����µ�wannier90�������
-	ModuleBase::realArray table_local;
-	psi::Psi<std::complex<double>> *unk_inLcao = nullptr;                                                             // lcao�����²����������ڲ���unk
+  public:
+    // const int k_supercell = 5;
+    // const int k_cells = (2 * k_supercell + 1)*(2 * k_supercell + 1)*(2 * k_supercell + 1);
+    // const int k_shells = 12;
+    // const double large_number = 99999999.0;
+    // const double small_number = 0.000001;
+    // std::vector<ModuleBase::Vector3<double>> lmn;
+    // std::vector<double> dist_shell;
+    // std::vector<int> multi;
+    // int num_shell_real;
+    // int *shell_list_real;
+    // double *bweight;
 
+    int num_kpts;
+    int cal_num_kpts;
+    ModuleBase::Matrix3 recip_lattice;
+    std::vector<std::vector<int>> nnlist;
+    std::vector<std::vector<ModuleBase::Vector3<double>>> nncell;
+    int nntot = 0;
+    int num_wannier;
+    int *L;
+    int *m;
+    int *rvalue;
+    double *alfa;
+    ModuleBase::Vector3<double> *R_centre;
+    std::string wannier_file_name = "seedname";
+    int num_exclude_bands = 0;
+    int *exclude_bands;
+    bool *tag_cal_band;
+    int num_bands;
+    bool gamma_only_wannier = false;
+    std::string wannier_spin = "up";
+    int start_k_index = 0;
 
+    ModuleBase::realArray table_local;
+    psi::Psi<std::complex<double>> *unk_inLcao = nullptr;
 
     toWannier90(int num_kpts, ModuleBase::Matrix3 recip_lattice);
-    toWannier90(int num_kpts,ModuleBase::Matrix3 recip_lattice, std::complex<double>*** wfc_k_grid_in);
+    toWannier90(int num_kpts, ModuleBase::Matrix3 recip_lattice, std::complex<double> ***wfc_k_grid_in);
     ~toWannier90();
 
-	//void kmesh_supercell_sort(); //������ԭ��ľ����С��������lmn
-	//void get_nnkpt_first();      //������12��shell�Ľ���k��ľ���͸���
-	//void kmesh_get_bvectors(int multi, int reference_kpt, double dist_shell, std::vector<ModuleBase::Vector3<double>>& bvector);  //��ȡָ��shell�㣬ָ���ο�k��Ľ���k���bvector
-	//void get_nnkpt_last(); //��ȡ���յ�shell��Ŀ��bweight
-    //void get_nnlistAndnncell();
+    // void kmesh_supercell_sort();
+    // void get_nnkpt_first();
+    // void kmesh_get_bvectors(int multi, int reference_kpt, double dist_shell, std::vector<ModuleBase::Vector3<double>>& bvector);
+    // void get_nnkpt_last();
 
-	void init_wannier(const psi::Psi<std::complex<double>>* psi=nullptr);
-	void read_nnkp();
-	void outEIG();
-	void cal_Amn(const psi::Psi<std::complex<double>>& wfc_pw);
-	void cal_Mmn(const psi::Psi<std::complex<double>>& wfc_pw);
-	void produce_trial_in_pw(const int &ik, ModuleBase::ComplexMatrix &trial_orbitals_k);
-	void get_trial_orbitals_lm_k(const int wannier_index, const int orbital_L, const int orbital_m, ModuleBase::matrix &ylm, 
-										ModuleBase::matrix &dr, ModuleBase::matrix &r, ModuleBase::matrix &psir, const int mesh_r, 
-										ModuleBase::Vector3<double> *gk, const int npw, ModuleBase::ComplexMatrix &trial_orbitals_k);
-	void integral(const int meshr, const double *psir, const double *r, const double *rab, const int &l, double* table);
-	void writeUNK(const psi::Psi<std::complex<double>>& wfc_pw);
-	// void ToRealSpace(const int &ik, const int &ib, const ModuleBase::ComplexMatrix *evc, std::complex<double> *psir, const ModuleBase::Vector3<double> G);
-	// std::complex<double> unkdotb(const std::complex<double> *psir, const int ikb, const int bandindex, const ModuleBase::ComplexMatrix *wfc_pw);
-	std::complex<double> unkdotkb(const int &ik, const int &ikb, const int &iband_L, const int &iband_R, const ModuleBase::Vector3<double> G, const psi::Psi<std::complex<double>>& wfc_pw);
-	// std::complex<double> gamma_only_cal(const int &ib_L, const int &ib_R, const ModuleBase::ComplexMatrix *wfc_pw, const ModuleBase::Vector3<double> G);
-	
-	// lcao����
-	void lcao2pw_basis(const int ik, ModuleBase::ComplexMatrix &orbital_in_G);
-	void getUnkFromLcao();
-    void get_lcao_wfc_global_ik(std::complex<double>** ctot, std::complex<double>** cc);
+    void init_wannier(const psi::Psi<std::complex<double>> *psi = nullptr);
+    void read_nnkp();
+    void outEIG();
+    void cal_Amn(const psi::Psi<std::complex<double>> &wfc_pw);
+    void cal_Mmn(const psi::Psi<std::complex<double>> &wfc_pw);
+    void produce_trial_in_pw(const int &ik, ModuleBase::ComplexMatrix &trial_orbitals_k);
+    void get_trial_orbitals_lm_k(const int wannier_index,
+                                 const int orbital_L,
+                                 const int orbital_m,
+                                 ModuleBase::matrix &ylm,
+                                 ModuleBase::matrix &dr,
+                                 ModuleBase::matrix &r,
+                                 ModuleBase::matrix &psir,
+                                 const int mesh_r,
+                                 ModuleBase::Vector3<double> *gk,
+                                 const int npw,
+                                 ModuleBase::ComplexMatrix &trial_orbitals_k);
+    void integral(const int meshr, const double *psir, const double *r, const double *rab, const int &l, double *table);
+    void writeUNK(const psi::Psi<std::complex<double>> &wfc_pw);
+    // void ToRealSpace(const int &ik, const int &ib, const ModuleBase::ComplexMatrix *evc, std::complex<double> *psir,
+    //                  const ModuleBase::Vector3<double> G);
+    // std::complex<double> unkdotb(const std::complex<double> *psir, const int ikb, const int bandindex, const ModuleBase::ComplexMatrix *wfc_pw);
+    std::complex<double> unkdotkb(const int &ik,
+                                  const int &ikb,
+                                  const int &iband_L,
+                                  const int &iband_R,
+                                  const ModuleBase::Vector3<double> G,
+                                  const psi::Psi<std::complex<double>> &wfc_pw);
+    // std::complex<double> gamma_only_cal(const int &ib_L, const int &ib_R, const ModuleBase::ComplexMatrix *wfc_pw,
+    // const ModuleBase::Vector3<double> G);
 
-private:
-    std::complex<double>*** wfc_k_grid;
+    void lcao2pw_basis(const int ik, ModuleBase::ComplexMatrix &orbital_in_G);
+    void getUnkFromLcao();
+    void get_lcao_wfc_global_ik(std::complex<double> **ctot, std::complex<double> **cc);
 
+  private:
+    std::complex<double> ***wfc_k_grid;
 };
 
 #endif
diff --git a/source/src_lcao/LCAO_gen_fixedH.cpp b/source/src_lcao/LCAO_gen_fixedH.cpp
index eb0522a0b1..eba45bd222 100644
--- a/source/src_lcao/LCAO_gen_fixedH.cpp
+++ b/source/src_lcao/LCAO_gen_fixedH.cpp
@@ -24,13 +24,6 @@ LCAO_gen_fixedH::~LCAO_gen_fixedH()
 void LCAO_gen_fixedH::calculate_NL_no(double* HlocR)
 {
     ModuleBase::TITLE("LCAO_gen_fixedH","calculate_NL_no");
-	if(GlobalV::NSPIN==4)
-	{
-		this->build_Nonlocal_mu(HlocR, false);
-		return;
-		//ModuleBase::WARNING_QUIT("LCAO_gen_fixedH::calculate_NL_no","noncollinear case shoule be complex<double>* type");
-	} 
-
 	if(GlobalV::GAMMA_ONLY_LOCAL)
 	{
 	  	//for gamma only.
@@ -64,16 +57,6 @@ void LCAO_gen_fixedH::calculate_NL_no(double* HlocR)
     return;
 }
 
-/*void LCAO_gen_fixedH::calculate_NL_no(std::complex<double>* HlocR)
-{
-    ModuleBase::TITLE("LCAO_gen_fixedH","calculate_NL_no");
-	if(GlobalV::NSPIN!=4) ModuleBase::WARNING_QUIT("LCAO_gen_fixedH::calculate_NL_no","complex<double>* type shoule be noncollinear case");
-
-	this->build_Nonlocal_mu(HlocR, false);
-
-    return;
-}*/
-
 void LCAO_gen_fixedH::calculate_T_no(double* HlocR)
 {
     ModuleBase::TITLE("LCAO_gen_fixedH","calculate_T_no");
@@ -733,39 +716,51 @@ void LCAO_gen_fixedH::build_Nonlocal_mu_new(double* NLloc, const bool &calc_deri
 								{
 									std::vector<double> nlm_1=(*nlm_cur1_e)[iw1_all];
 									std::vector<double> nlm_2=(*nlm_cur2_e)[iw2_all];
-									double nlm_tmp = 0.0;
-
-									const int nproj = GlobalC::ucell.infoNL.nproj[T0];
-									int ib = 0;
-									for (int nb = 0; nb < nproj; nb++)
+									if(GlobalV::NSPIN==4)
 									{
-										const int L0 = GlobalC::ucell.infoNL.Beta[T0].Proj[nb].getL();
-										for(int m=0;m<2*L0+1;m++)
+										std::complex<double> nlm_tmp = ModuleBase::ZERO;
+										int is0 = (j-j0*GlobalV::NPOL) + (k-k0*GlobalV::NPOL)*2;
+										for (int no = 0; no < GlobalC::ucell.atoms[T0].non_zero_count_soc[is0]; no++)
 										{
-											if(nlm_1[ib]!=0.0 && nlm_2[ib]!=0.0)
+											const int p1 = GlobalC::ucell.atoms[T0].index1_soc[is0][no];
+											const int p2 = GlobalC::ucell.atoms[T0].index2_soc[is0][no];
+											nlm_tmp += nlm_1[p1] * nlm_2[p2] * GlobalC::ucell.atoms[T0].d_so(is0, p2, p1);
+										}
+										this->LM->Hloc_fixedR_soc[nnr+nnr_inner] += nlm_tmp;
+									}
+									else
+									{
+										double nlm_tmp = 0.0;
+										const int nproj = GlobalC::ucell.infoNL.nproj[T0];
+										int ib = 0;
+										for (int nb = 0; nb < nproj; nb++)
+										{
+											const int L0 = GlobalC::ucell.infoNL.Beta[T0].Proj[nb].getL();
+											for(int m=0;m<2*L0+1;m++)
 											{
-												nlm_tmp += nlm_1[ib]*nlm_2[ib]*GlobalC::ucell.atoms[T0].dion(nb,nb);
+												if(nlm_1[ib]!=0.0 && nlm_2[ib]!=0.0)
+												{
+													nlm_tmp += nlm_1[ib]*nlm_2[ib]*GlobalC::ucell.atoms[T0].dion(nb,nb);
+												}
+												ib+=1;
 											}
-											ib+=1;
 										}
-									}
-									assert(ib==nlm_1.size());
+										assert(ib==nlm_1.size());
 
-									if(GlobalV::GAMMA_ONLY_LOCAL)
-									{
-										// mohan add 2010-12-20
-										if( nlm_tmp!=0.0 )
+										if(GlobalV::GAMMA_ONLY_LOCAL)
 										{
-											// GlobalV::ofs_running << std::setw(10) << iw1_all << std::setw(10) 
-											// << iw2_all << std::setw(20) << nlm[0] << std::endl; 
-											this->LM->set_HSgamma(iw1_all,iw2_all,nlm_tmp,'N', NLloc);//N stands for nonlocal.
+											// mohan add 2010-12-20
+											if( nlm_tmp!=0.0 )
+											{
+												this->LM->set_HSgamma(iw1_all,iw2_all,nlm_tmp,'N', NLloc);//N stands for nonlocal.
+											}
 										}
-									}
-									else
-									{
-										if( nlm_tmp!=0.0 )
+										else
 										{
-											NLloc[nnr+nnr_inner] += nlm_tmp;
+											if( nlm_tmp!=0.0 )
+											{
+												NLloc[nnr+nnr_inner] += nlm_tmp;
+											}
 										}
 									}
 								}// calc_deri
@@ -874,12 +869,10 @@ void LCAO_gen_fixedH::build_Nonlocal_mu_new(double* NLloc, const bool &calc_deri
 
 	if(!GlobalV::GAMMA_ONLY_LOCAL)
 	{
-	//		std::cout << " nr="  << nnr << std::endl;
-	//		std::cout << " pv->nnr=" << pv->nnr << std::endl;
-	//		GlobalV::ofs_running << " nr="  << nnr << std::endl;
-	//		GlobalV::ofs_running << " pv->nnr=" << pv->nnr << std::endl;
 		if( nnr!=pv->nnr)
 		{
+			GlobalV::ofs_running << " nr="  << nnr << std::endl;
+			GlobalV::ofs_running << " pv->nnr=" << pv->nnr << std::endl;
 			ModuleBase::WARNING_QUIT("LCAO_gen_fixedH::build_Nonlocal_mu_new","nnr!=LNNR.nnr");
 		}
 	}
@@ -1025,6 +1018,13 @@ void LCAO_gen_fixedH::build_Nonlocal_mu(double* NLloc, const bool &calc_deri)
 									if(!calc_deri)
 									{
 										int is0 = (j-j0*GlobalV::NPOL) + (k-k0*GlobalV::NPOL)*2;
+										//Note: the old implementation of the SOC nonlocal
+										//pseudopotential had a bug, although it does not seem
+										//to affect the converged results.
+										//It does, however, cause a discrepancy in the integration
+										//test case 240*soc when checked against the new method.
+										//The bug originates from a mismatch between the indices
+										//of <psi|beta> and d_so.
 										GlobalC::UOT.snap_psibeta(
 												GlobalC::ORB,
 												GlobalC::ucell.infoNL,
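Reviewer sketch, not part of the patch: the `NSPIN==4` branch added to `build_Nonlocal_mu_new` above picks a spin block `is0` from the two orbital indices and then sums only the non-zero `d_so` entries of that block. The standalone version below illustrates that pattern under stated assumptions. `SocBlock`, `spin_block`, and `soc_nonlocal_term` are hypothetical names, and `j0`/`k0` are assumed to equal `j / NPOL` and `k / NPOL`, which the hunk does not show.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one spin block of the per-atom SOC data
// (index1_soc[is0], index2_soc[is0] and the matching d_so(is0, p2, p1) values).
struct SocBlock
{
    std::vector<int> index1;                 // p1 of each non-zero entry
    std::vector<int> index2;                 // p2 of each non-zero entry
    std::vector<std::complex<double>> value; // d_so(is0, p2, p1) of each entry
};

// Spin-block index in 0..3, mirroring
//   is0 = (j - j0*NPOL) + (k - k0*NPOL)*2
// under the assumption j0 == j / npol and k0 == k / npol.
inline int spin_block(const int j, const int k, const int npol)
{
    return (j % npol) + 2 * (k % npol);
}

// Accumulate sum_no nlm_1[p1] * nlm_2[p2] * d_so over the non-zero entries only.
inline std::complex<double> soc_nonlocal_term(const std::vector<double>& nlm_1,
                                              const std::vector<double>& nlm_2,
                                              const SocBlock& blk)
{
    assert(blk.index1.size() == blk.value.size());
    assert(blk.index2.size() == blk.value.size());
    std::complex<double> sum(0.0, 0.0);
    for (std::size_t no = 0; no < blk.value.size(); ++no)
    {
        sum += nlm_1[blk.index1[no]] * nlm_2[blk.index2[no]] * blk.value[no];
    }
    return sum;
}
```

In the patch the result is added to `Hloc_fixedR_soc[nnr+nnr_inner]`; the sketch only returns it, so it can be checked in isolation.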
diff --git a/source/src_pdiag/pdiag_double.cpp b/source/src_pdiag/pdiag_double.cpp
index 5fb375362e..4b1b43a727 100644
--- a/source/src_pdiag/pdiag_double.cpp
+++ b/source/src_pdiag/pdiag_double.cpp
@@ -13,7 +13,6 @@
 extern "C"
 {
     #include "../module_base/blacs_connector.h"
-    #include "my_elpa.h"
 	#include "../module_base/scalapack_connector.h"
 }
 #include "pdgseps.h"
@@ -27,47 +26,6 @@ extern "C"
 #include "diag_cusolver.cuh"
 #endif
 
-#ifdef __MPI
-inline int set_elpahandle(elpa_t &handle, const int *desc,const int local_nrows,const int local_ncols, const int nbands)
-{
-  int error;
-  int nprows, npcols, myprow, mypcol;
-  Cblacs_gridinfo(desc[1], &nprows, &npcols, &myprow, &mypcol);
-  elpa_init(20210430);
-  handle = elpa_allocate(&error);
-  elpa_set_integer(handle, "na", desc[2], &error);
-  elpa_set_integer(handle, "nev", nbands, &error);
-
-  elpa_set_integer(handle, "local_nrows", local_nrows, &error);
-
-  elpa_set_integer(handle, "local_ncols", local_ncols, &error);
-
-  elpa_set_integer(handle, "nblk", desc[4], &error);
-
-  elpa_set_integer(handle, "mpi_comm_parent", MPI_Comm_c2f(MPI_COMM_WORLD), &error);
-
-  elpa_set_integer(handle, "process_row", myprow, &error);
-
-  elpa_set_integer(handle, "process_col", mypcol, &error);
-
-  elpa_set_integer(handle, "blacs_context", desc[1], &error);
-
-  elpa_set_integer(handle, "cannon_for_generalized", 0, &error);
-   /* Setup */
-  elpa_setup(handle);   /* Set tunables */
-  return 0;
-}
-#endif
-
-
-inline bool ifElpaHandle(const bool& newIteration, const bool& ifNSCF)
-{
-    int doHandle = false;
-	if(newIteration) doHandle = true;
-	if(ifNSCF) doHandle = true;
-	return doHandle;
-}
-
 #ifdef __CUSOLVER_LCAO
 template <typename T>
 void cusolver_helper_gather(const T* mat_loc, T* mat_glb, const Parallel_Orbitals* pv){
diff --git a/source/src_pw/energy.cpp b/source/src_pw/energy.cpp
index f22c8b2cfe..3bac1627d6 100644
--- a/source/src_pw/energy.cpp
+++ b/source/src_pw/energy.cpp
@@ -181,6 +181,15 @@ void energy::print_etot(
 				this->print_format("E_sol_el", esol_el);
 				this->print_format("E_sol_cav", esol_cav);
 			}
+			if (GlobalV::comp_chg)
+			{
+                std::vector<double> ecomp(3, 0.0);
+                GlobalC::solvent_model.cal_Acomp(GlobalC::ucell, GlobalC::rhopw, GlobalC::CHR.rho, ecomp);
+				this->print_format("E_comp_self", ecomp[0]);
+				this->print_format("E_comp_electron", ecomp[1]);
+				this->print_format("E_comp_nuclear", ecomp[2]);
+                this->print_format("E_comp_tot", ecomp[0] + ecomp[1] + ecomp[2]);
+            }
 #ifdef __DEEPKS
 			if (GlobalV::deepks_scf)	//caoyu add 2021-08-10
 			{
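Reviewer sketch, not part of the patch: the `source/src_pw/forces.cpp` diff that follows changes the "impose total force = 0" step in `Forces::init`. The mean force is still subtracted per Cartesian direction, but the subtraction is skipped along `comp_dim` when `comp_chg` is set and `comp_q` is non-zero. The standalone function below illustrates that rule. `remove_net_force` and `skip_dim` are hypothetical names, with `skip_dim < 0` meaning that no direction is skipped.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Subtract the per-direction mean force so that the total force vanishes,
// except along skip_dim (pass a negative value to correct all three directions).
// force[iat][ipol] holds the force on atom iat along direction ipol.
inline void remove_net_force(std::vector<std::array<double, 3>>& force, const int skip_dim)
{
    assert(skip_dim < 3);
    if (force.empty())
    {
        return;
    }
    for (int ipol = 0; ipol < 3; ++ipol)
    {
        if (ipol == skip_dim)
        {
            continue; // leave the net force along this direction untouched
        }
        double sum = 0.0;
        for (const auto& f: force)
        {
            sum += f[ipol];
        }
        const double compen = sum / static_cast<double>(force.size());
        for (auto& f: force)
        {
            f[ipol] -= compen;
        }
    }
}
```

Calling it with `skip_dim = -1` reproduces the unconditional correction that the old code applied.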
diff --git a/source/src_pw/forces.cpp b/source/src_pw/forces.cpp
index 3a38806b18..3822e798ae 100644
--- a/source/src_pw/forces.cpp
+++ b/source/src_pw/forces.cpp
@@ -1,82 +1,83 @@
 #include "forces.h"
+
+#include "../module_symmetry/symmetry.h"
 #include "global.h"
 #include "vdwd2.h"
-#include "vdwd3.h"				  
-#include "../module_symmetry/symmetry.h"
+#include "vdwd3.h"
 // new
-#include "../module_xc/xc_functional.h"
 #include "../module_base/math_integral.h"
-#include "../src_parallel/parallel_reduce.h"
 #include "../module_base/timer.h"
 #include "../module_surchem/efield.h"
 #include "../module_surchem/surchem.h"
 
-double Forces::output_acc = 1.0e-8; // (Ryd/angstrom).	
+double Forces::output_acc = 1.0e-8; // (Ryd/angstrom).
 
 Forces::Forces()
 {
 }
 
-Forces::~Forces() {}
+Forces::~Forces()
+{
+}
 
 #include "../module_base/mathzone.h"
 void Forces::init(ModuleBase::matrix& force, const psi::Psi<std::complex<double>>* psi_in)
 {
-	ModuleBase::TITLE("Forces", "init");
-	this->nat = GlobalC::ucell.nat;
-	force.create(nat, 3);
-	
-	ModuleBase::matrix forcelc(nat, 3);
-	ModuleBase::matrix forceion(nat, 3);
-	ModuleBase::matrix forcecc(nat, 3);
-	ModuleBase::matrix forcenl(nat, 3);
-	ModuleBase::matrix forcescc(nat, 3);
+    ModuleBase::TITLE("Forces", "init");
+    this->nat = GlobalC::ucell.nat;
+    force.create(nat, 3);
+
+    ModuleBase::matrix forcelc(nat, 3);
+    ModuleBase::matrix forceion(nat, 3);
+    ModuleBase::matrix forcecc(nat, 3);
+    ModuleBase::matrix forcenl(nat, 3);
+    ModuleBase::matrix forcescc(nat, 3);
     this->cal_force_loc(forcelc, GlobalC::rhopw);
     this->cal_force_ew(forceion, GlobalC::rhopw);
     this->cal_force_nl(forcenl, psi_in);
-	this->cal_force_cc(forcecc, GlobalC::rhopw);
-	this->cal_force_scc(forcescc, GlobalC::rhopw);
+    this->cal_force_cc(forcecc, GlobalC::rhopw);
+    this->cal_force_scc(forcescc, GlobalC::rhopw);
 
-	ModuleBase::matrix stress_vdw_pw;//.create(3,3);
+    ModuleBase::matrix stress_vdw_pw; //.create(3,3);
     ModuleBase::matrix force_vdw;
     force_vdw.create(nat, 3);
-	if(GlobalC::vdwd2_para.flag_vdwd2)													//Peize Lin add 2014.04.03, update 2021.03.09
-	{
-        Vdwd2 vdwd2(GlobalC::ucell,GlobalC::vdwd2_para);
-		vdwd2.cal_force();
-		for(int iat=0; iat<GlobalC::ucell.nat; ++iat)
-		{
-			force_vdw(iat,0) = vdwd2.get_force()[iat].x;
-			force_vdw(iat,1) = vdwd2.get_force()[iat].y;
-			force_vdw(iat,2) = vdwd2.get_force()[iat].z;
-		}
-		if(GlobalV::TEST_FORCE)
-		{
-			Forces::print("VDW      FORCE (Ry/Bohr)", force_vdw);
-		}
-	}
-	else if(GlobalC::vdwd3_para.flag_vdwd3)													//jiyy add 2019-05-18, update 2021-05-02
-	{
-        Vdwd3 vdwd3(GlobalC::ucell,GlobalC::vdwd3_para);
-		vdwd3.cal_force();
-		for(int iat=0; iat<GlobalC::ucell.nat; ++iat)
-		{
-			force_vdw(iat,0) = vdwd3.get_force()[iat].x;
-			force_vdw(iat,1) = vdwd3.get_force()[iat].y;
-			force_vdw(iat,2) = vdwd3.get_force()[iat].z;
-		}
-		if(GlobalV::TEST_FORCE)
-		{
-			Forces::print("VDW      FORCE (Ry/Bohr)", force_vdw);
-		}
-	}
+    if (GlobalC::vdwd2_para.flag_vdwd2) // Peize Lin add 2014.04.03, update 2021.03.09
+    {
+        Vdwd2 vdwd2(GlobalC::ucell, GlobalC::vdwd2_para);
+        vdwd2.cal_force();
+        for (int iat = 0; iat < GlobalC::ucell.nat; ++iat)
+        {
+            force_vdw(iat, 0) = vdwd2.get_force()[iat].x;
+            force_vdw(iat, 1) = vdwd2.get_force()[iat].y;
+            force_vdw(iat, 2) = vdwd2.get_force()[iat].z;
+        }
+        if (GlobalV::TEST_FORCE)
+        {
+            Forces::print("VDW      FORCE (Ry/Bohr)", force_vdw);
+        }
+    }
+    else if (GlobalC::vdwd3_para.flag_vdwd3) // jiyy add 2019-05-18, update 2021-05-02
+    {
+        Vdwd3 vdwd3(GlobalC::ucell, GlobalC::vdwd3_para);
+        vdwd3.cal_force();
+        for (int iat = 0; iat < GlobalC::ucell.nat; ++iat)
+        {
+            force_vdw(iat, 0) = vdwd3.get_force()[iat].x;
+            force_vdw(iat, 1) = vdwd3.get_force()[iat].y;
+            force_vdw(iat, 2) = vdwd3.get_force()[iat].z;
+        }
+        if (GlobalV::TEST_FORCE)
+        {
+            Forces::print("VDW      FORCE (Ry/Bohr)", force_vdw);
+        }
+    }
 
     ModuleBase::matrix force_e;
-    if(GlobalV::EFIELD_FLAG)
+    if (GlobalV::EFIELD_FLAG)
     {
         force_e.create(GlobalC::ucell.nat, 3);
         Efield::compute_force(GlobalC::ucell, force_e);
-        if(GlobalV::TEST_FORCE)
+        if (GlobalV::TEST_FORCE)
         {
             Forces::print("EFIELD      FORCE (Ry/Bohr)", force_e);
         }
@@ -93,33 +94,42 @@ void Forces::init(ModuleBase::matrix& force, const psi::Psi<std::complex<double>
         }
     }
 
-    //impose total force = 0
+    ModuleBase::matrix forcecomp;
+    if (GlobalV::comp_chg)
+    {
+        forcecomp.create(GlobalC::ucell.nat, 3);
+        GlobalC::solvent_model.cal_comp_force(forcecomp, GlobalC::rhopw);
+    }
+
+    // impose total force = 0
     int iat = 0;
-	for (int ipol = 0; ipol < 3; ipol++)
-	{
-		double sum = 0.0;
-		iat = 0;
+    for (int ipol = 0; ipol < 3; ipol++)
+    {
+        double sum = 0.0;
+        iat = 0;
 
-		for (int it = 0;it < GlobalC::ucell.ntype;it++)
-		{
-			for (int ia = 0;ia < GlobalC::ucell.atoms[it].na;ia++)
-			{
-				force(iat, ipol) =
-					forcelc(iat, ipol)
-					+ forceion(iat, ipol)
-					+ forcenl(iat, ipol)
-					+ forcecc(iat, ipol)
-					+ forcescc(iat, ipol);
-
-				if(GlobalC::vdwd2_para.flag_vdwd2 || GlobalC::vdwd3_para.flag_vdwd3)		//linpz and jiyy added vdw force, modified by zhengdy
-				{
+        for (int it = 0; it < GlobalC::ucell.ntype; it++)
+        {
+            for (int ia = 0; ia < GlobalC::ucell.atoms[it].na; ia++)
+            {
+                force(iat, ipol) = forcelc(iat, ipol) + forceion(iat, ipol) + forcenl(iat, ipol) + forcecc(iat, ipol)
+                                   + forcescc(iat, ipol);
+
+                if (GlobalC::vdwd2_para.flag_vdwd2
+                    || GlobalC::vdwd3_para.flag_vdwd3) // linpz and jiyy added vdw force, modified by zhengdy
+                {
                     force(iat, ipol) += force_vdw(iat, ipol);
-                }																										   
-					
-				if(GlobalV::EFIELD_FLAG)
-				{
-					force(iat,ipol) = force(iat, ipol) + force_e(iat, ipol);
-				}
+                }
+
+                if (GlobalV::EFIELD_FLAG)
+                {
+                    force(iat, ipol) = force(iat, ipol) + force_e(iat, ipol);
+                }
+
+                if (GlobalV::comp_chg)
+                {
+                    force(iat, ipol) = force(iat, ipol) + forcecomp(iat, ipol);
+                }
 
                 if(GlobalV::imp_sol)
                 {
@@ -128,99 +138,125 @@ void Forces::init(ModuleBase::matrix& force, const psi::Psi<std::complex<double>
 
 				sum += force(iat, ipol);
 
-				iat++;
-			}
-		}
+                iat++;
+            }
+        }
 
-		double compen = sum / GlobalC::ucell.nat;
-		for(int iat=0; iat<GlobalC::ucell.nat; ++iat)
-		{
-			force(iat, ipol) = force(iat, ipol) - compen;
-		}	
-	}
-	
-	if(ModuleSymmetry::Symmetry::symm_flag)
-	{
-		double *pos;
-		double d1,d2,d3;
-		pos = new double[GlobalC::ucell.nat*3];
-		ModuleBase::GlobalFunc::ZEROS(pos, GlobalC::ucell.nat*3);
-		int iat = 0;
-		for(int it = 0;it < GlobalC::ucell.ntype;it++)
-		{
-			//Atom* atom = &GlobalC::ucell.atoms[it];
-			for(int ia =0;ia< GlobalC::ucell.atoms[it].na;ia++)
-			{
-				pos[3*iat  ] = GlobalC::ucell.atoms[it].taud[ia].x ;
-				pos[3*iat+1] = GlobalC::ucell.atoms[it].taud[ia].y ;
-				pos[3*iat+2] = GlobalC::ucell.atoms[it].taud[ia].z;
-				for(int k=0; k<3; ++k)
-				{
-					GlobalC::symm.check_translation( pos[iat*3+k], -floor(pos[iat*3+k]));
-					GlobalC::symm.check_boundary( pos[iat*3+k] );
-				}
-				iat++;				
-			}
-		}
-		
-		for(int iat=0; iat<GlobalC::ucell.nat; iat++)
-		{
-			ModuleBase::Mathzone::Cartesian_to_Direct(force(iat,0),force(iat,1),force(iat,2),
-                                        GlobalC::ucell.a1.x, GlobalC::ucell.a1.y, GlobalC::ucell.a1.z,
-                                        GlobalC::ucell.a2.x, GlobalC::ucell.a2.y, GlobalC::ucell.a2.z,
-                                        GlobalC::ucell.a3.x, GlobalC::ucell.a3.y, GlobalC::ucell.a3.z,
-                                        d1,d2,d3);
-			
-			force(iat,0) = d1;force(iat,1) = d2;force(iat,2) = d3;
-		}
-		GlobalC::symm.force_symmetry(force , pos, GlobalC::ucell);
-		for(int iat=0; iat<GlobalC::ucell.nat; iat++)
-		{
-			ModuleBase::Mathzone::Direct_to_Cartesian(force(iat,0),force(iat,1),force(iat,2),
-                                        GlobalC::ucell.a1.x, GlobalC::ucell.a1.y, GlobalC::ucell.a1.z,
-                                        GlobalC::ucell.a2.x, GlobalC::ucell.a2.y, GlobalC::ucell.a2.z,
-                                        GlobalC::ucell.a3.x, GlobalC::ucell.a3.y, GlobalC::ucell.a3.z,
-                                        d1,d2,d3);
-			force(iat,0) = d1;force(iat,1) = d2;force(iat,2) = d3;
-		}
-		// std::cout << "nrotk =" << GlobalC::symm.nrotk << std::endl;
-		delete[] pos;
-		
-	}
+        if(!(GlobalV::comp_chg && GlobalC::solvent_model.comp_q!=0 && ipol==GlobalC::solvent_model.comp_dim))
+        {
+            double compen = sum / GlobalC::ucell.nat;
+            for (int iat = 0; iat < GlobalC::ucell.nat; ++iat)
+            {
+                force(iat, ipol) = force(iat, ipol) - compen;
+            }
+        }
+    }
 
- 	GlobalV::ofs_running << std::setiosflags(ios::fixed) << std::setprecision(6) << std::endl;
-	/*if(GlobalV::TEST_FORCE)
-	{
-		Forces::print("LOCAL    FORCE (Ry/Bohr)", forcelc);
-		Forces::print("NONLOCAL FORCE (Ry/Bohr)", forcenl);
-		Forces::print("NLCC     FORCE (Ry/Bohr)", forcecc);
-		Forces::print("ION      FORCE (Ry/Bohr)", forceion);
-		Forces::print("SCC      FORCE (Ry/Bohr)", forcescc);
-		if(GlobalV::EFIELD) Forces::print("EFIELD   FORCE (Ry/Bohr)", force_e);
-	}*/
-	
-/*
-	Forces::print("   TOTAL-FORCE (Ry/Bohr)", force);
-	
-	if(INPUT.out_force)                                                   // pengfei 2016-12-20
-	{
-		std::ofstream ofs("FORCE.dat");
-		if(!ofs)
-		{
-			std::cout << "open FORCE.dat error !" <<std::endl;
-		}
-		for(int iat=0; iat<GlobalC::ucell.nat; iat++)
-		{
-			ofs << "   " << force(iat,0)*ModuleBase::Ry_to_eV / 0.529177 
-				<< "   " << force(iat,1)*ModuleBase::Ry_to_eV / 0.529177 
-				<< "   " << force(iat,2)*ModuleBase::Ry_to_eV / 0.529177 << std::endl;
-		}
-		ofs.close();
-	}
-*/
-		
-	// output force in unit eV/Angstrom
-	GlobalV::ofs_running << std::endl;
+    if (ModuleSymmetry::Symmetry::symm_flag)
+    {
+        double* pos;
+        double d1, d2, d3;
+        pos = new double[GlobalC::ucell.nat * 3];
+        ModuleBase::GlobalFunc::ZEROS(pos, GlobalC::ucell.nat * 3);
+        int iat = 0;
+        for (int it = 0; it < GlobalC::ucell.ntype; it++)
+        {
+            // Atom* atom = &GlobalC::ucell.atoms[it];
+            for (int ia = 0; ia < GlobalC::ucell.atoms[it].na; ia++)
+            {
+                pos[3 * iat] = GlobalC::ucell.atoms[it].taud[ia].x;
+                pos[3 * iat + 1] = GlobalC::ucell.atoms[it].taud[ia].y;
+                pos[3 * iat + 2] = GlobalC::ucell.atoms[it].taud[ia].z;
+                for (int k = 0; k < 3; ++k)
+                {
+                    GlobalC::symm.check_translation(pos[iat * 3 + k], -floor(pos[iat * 3 + k]));
+                    GlobalC::symm.check_boundary(pos[iat * 3 + k]);
+                }
+                iat++;
+            }
+        }
+
+        for (int iat = 0; iat < GlobalC::ucell.nat; iat++)
+        {
+            ModuleBase::Mathzone::Cartesian_to_Direct(force(iat, 0),
+                                                      force(iat, 1),
+                                                      force(iat, 2),
+                                                      GlobalC::ucell.a1.x,
+                                                      GlobalC::ucell.a1.y,
+                                                      GlobalC::ucell.a1.z,
+                                                      GlobalC::ucell.a2.x,
+                                                      GlobalC::ucell.a2.y,
+                                                      GlobalC::ucell.a2.z,
+                                                      GlobalC::ucell.a3.x,
+                                                      GlobalC::ucell.a3.y,
+                                                      GlobalC::ucell.a3.z,
+                                                      d1,
+                                                      d2,
+                                                      d3);
+
+            force(iat, 0) = d1;
+            force(iat, 1) = d2;
+            force(iat, 2) = d3;
+        }
+        GlobalC::symm.force_symmetry(force, pos, GlobalC::ucell);
+        for (int iat = 0; iat < GlobalC::ucell.nat; iat++)
+        {
+            ModuleBase::Mathzone::Direct_to_Cartesian(force(iat, 0),
+                                                      force(iat, 1),
+                                                      force(iat, 2),
+                                                      GlobalC::ucell.a1.x,
+                                                      GlobalC::ucell.a1.y,
+                                                      GlobalC::ucell.a1.z,
+                                                      GlobalC::ucell.a2.x,
+                                                      GlobalC::ucell.a2.y,
+                                                      GlobalC::ucell.a2.z,
+                                                      GlobalC::ucell.a3.x,
+                                                      GlobalC::ucell.a3.y,
+                                                      GlobalC::ucell.a3.z,
+                                                      d1,
+                                                      d2,
+                                                      d3);
+            force(iat, 0) = d1;
+            force(iat, 1) = d2;
+            force(iat, 2) = d3;
+        }
+        // std::cout << "nrotk =" << GlobalC::symm.nrotk << std::endl;
+        delete[] pos;
+    }
+
+    GlobalV::ofs_running << std::setiosflags(ios::fixed) << std::setprecision(6) << std::endl;
+    /*if(GlobalV::TEST_FORCE)
+    {
+        Forces::print("LOCAL    FORCE (Ry/Bohr)", forcelc);
+        Forces::print("NONLOCAL FORCE (Ry/Bohr)", forcenl);
+        Forces::print("NLCC     FORCE (Ry/Bohr)", forcecc);
+        Forces::print("ION      FORCE (Ry/Bohr)", forceion);
+        Forces::print("SCC      FORCE (Ry/Bohr)", forcescc);
+        if(GlobalV::EFIELD) Forces::print("EFIELD   FORCE (Ry/Bohr)", force_e);
+    }*/
+
+    /*
+        Forces::print("   TOTAL-FORCE (Ry/Bohr)", force);
+
+        if(INPUT.out_force)                                                   // pengfei 2016-12-20
+        {
+            std::ofstream ofs("FORCE.dat");
+            if(!ofs)
+            {
+                std::cout << "open FORCE.dat error !" <<std::endl;
+            }
+            for(int iat=0; iat<GlobalC::ucell.nat; iat++)
+            {
+                ofs << "   " << force(iat,0)*ModuleBase::Ry_to_eV / 0.529177
+                    << "   " << force(iat,1)*ModuleBase::Ry_to_eV / 0.529177
+                    << "   " << force(iat,2)*ModuleBase::Ry_to_eV / 0.529177 << std::endl;
+            }
+            ofs.close();
+        }
+    */
+
+    // output force in unit eV/Angstrom
+    GlobalV::ofs_running << std::endl;
 
 	if(GlobalV::TEST_FORCE)
 	{
@@ -237,211 +273,227 @@ void Forces::init(ModuleBase::matrix& force, const psi::Psi<std::complex<double>
     return;
 }
 
-void Forces::print_to_files(std::ofstream &ofs, const std::string &name, const ModuleBase::matrix &f)
+void Forces::print_to_files(std::ofstream& ofs, const std::string& name, const ModuleBase::matrix& f)
 {
     int iat = 0;
     ofs << " " << name;
     ofs << std::setprecision(8);
-	//ofs << std::setiosflags(ios::showpos);
-   
-	double fac = ModuleBase::Ry_to_eV / 0.529177;// (eV/A)
+    // ofs << std::setiosflags(ios::showpos);
 
-	if(GlobalV::TEST_FORCE)
-	{
-		std::cout << std::setiosflags(ios::showpos);
-		std::cout << " " << name;
-		std::cout << std::setprecision(8);
-	}
+    double fac = ModuleBase::Ry_to_eV / 0.529177; // converts Ry/Bohr to eV/Angstrom
 
-    for (int it = 0;it < GlobalC::ucell.ntype;it++)
+    if (GlobalV::TEST_FORCE)
     {
-        for (int ia = 0;ia < GlobalC::ucell.atoms[it].na;ia++)
-        {
-            ofs << " " << std::setw(5) << it
-            << std::setw(8) << ia+1
-            << std::setw(20) << f(iat, 0)*fac
-            << std::setw(20) << f(iat, 1)*fac
-            << std::setw(20) << f(iat, 2)*fac << std::endl;
-			
-			if(GlobalV::TEST_FORCE)
-			{
-            	std::cout << " " << std::setw(5) << it
-            	<< std::setw(8) << ia+1
-            	<< std::setw(20) << f(iat, 0)*fac
-            	<< std::setw(20) << f(iat, 1)*fac
-            	<< std::setw(20) << f(iat, 2)*fac << std::endl;
-			}
+        std::cout << std::setiosflags(ios::showpos);
+        std::cout << " " << name;
+        std::cout << std::setprecision(8);
+    }
+
+    for (int it = 0; it < GlobalC::ucell.ntype; it++)
+    {
+        for (int ia = 0; ia < GlobalC::ucell.atoms[it].na; ia++)
+        {
+            ofs << " " << std::setw(5) << it << std::setw(8) << ia + 1 << std::setw(20) << f(iat, 0) * fac
+                << std::setw(20) << f(iat, 1) * fac << std::setw(20) << f(iat, 2) * fac << std::endl;
+
+            if (GlobalV::TEST_FORCE)
+            {
+                std::cout << " " << std::setw(5) << it << std::setw(8) << ia + 1 << std::setw(20) << f(iat, 0) * fac
+                          << std::setw(20) << f(iat, 1) * fac << std::setw(20) << f(iat, 2) * fac << std::endl;
+            }
             iat++;
         }
     }
 
-	GlobalV::ofs_running << std::resetiosflags(ios::showpos);
-	std::cout << std::resetiosflags(ios::showpos);
+    GlobalV::ofs_running << std::resetiosflags(ios::showpos);
+    std::cout << std::resetiosflags(ios::showpos);
     return;
 }
 
-
-
-void Forces::print(const std::string &name, const ModuleBase::matrix &f, bool ry)
+void Forces::print(const std::string& name, const ModuleBase::matrix& f, bool ry)
 {
-	ModuleBase::GlobalFunc::NEW_PART(name);
+    ModuleBase::GlobalFunc::NEW_PART(name);
 
-	GlobalV::ofs_running << " " << std::setw(8) << "atom" << std::setw(15) << "x" << std::setw(15) << "y" << std::setw(15) << "z" << std::endl;
-	GlobalV::ofs_running << std::setiosflags(ios::showpos);
+    GlobalV::ofs_running << " " << std::setw(8) << "atom" << std::setw(15) << "x" << std::setw(15) << "y"
+                         << std::setw(15) << "z" << std::endl;
+    GlobalV::ofs_running << std::setiosflags(ios::showpos);
     GlobalV::ofs_running << std::setprecision(8);
 
-	const double fac = ModuleBase::Ry_to_eV / 0.529177;
-	
-	if(GlobalV::TEST_FORCE)
-	{
-		std::cout << " --------------- " << name << " ---------------" << std::endl;
-		std::cout << " " << std::setw(8) << "atom" << std::setw(15) << "x" << std::setw(15) << "y" << std::setw(15) << "z" << std::endl;
-		std::cout << std::setiosflags(ios::showpos);
-		std::cout << std::setprecision(6);
-	}
+    const double fac = ModuleBase::Ry_to_eV / 0.529177;
+
+    if (GlobalV::TEST_FORCE)
+    {
+        std::cout << " --------------- " << name << " ---------------" << std::endl;
+        std::cout << " " << std::setw(8) << "atom" << std::setw(15) << "x" << std::setw(15) << "y" << std::setw(15)
+                  << "z" << std::endl;
+        std::cout << std::setiosflags(ios::showpos);
+        std::cout << std::setprecision(6);
+    }
 
     int iat = 0;
-    for (int it = 0;it < GlobalC::ucell.ntype;it++)
+    for (int it = 0; it < GlobalC::ucell.ntype; it++)
     {
-        for (int ia = 0;ia < GlobalC::ucell.atoms[it].na;ia++)
+        for (int ia = 0; ia < GlobalC::ucell.atoms[it].na; ia++)
         {
-			std::stringstream ss;
-			ss << GlobalC::ucell.atoms[it].label << ia+1;
+            std::stringstream ss;
+            ss << GlobalC::ucell.atoms[it].label << ia + 1;
+
+            if (ry) // output Rydberg Unit
+            {
+                GlobalV::ofs_running << " " << std::setw(8) << ss.str();
+                if (abs(f(iat, 0)) > Forces::output_acc)
+                    GlobalV::ofs_running << std::setw(15) << f(iat, 0);
+                else
+                    GlobalV::ofs_running << std::setw(15) << "0";
+                if (abs(f(iat, 1)) > Forces::output_acc)
+                    GlobalV::ofs_running << std::setw(15) << f(iat, 1);
+                else
+                    GlobalV::ofs_running << std::setw(15) << "0";
+                if (abs(f(iat, 2)) > Forces::output_acc)
+                    GlobalV::ofs_running << std::setw(15) << f(iat, 2);
+                else
+                    GlobalV::ofs_running << std::setw(15) << "0";
+                GlobalV::ofs_running << std::endl;
+            }
+            else
+            {
+                GlobalV::ofs_running << " " << std::setw(8) << ss.str();
+                if (abs(f(iat, 0)) > Forces::output_acc)
+                    GlobalV::ofs_running << std::setw(15) << f(iat, 0) * fac;
+                else
+                    GlobalV::ofs_running << std::setw(15) << "0";
+                if (abs(f(iat, 1)) > Forces::output_acc)
+                    GlobalV::ofs_running << std::setw(15) << f(iat, 1) * fac;
+                else
+                    GlobalV::ofs_running << std::setw(15) << "0";
+                if (abs(f(iat, 2)) > Forces::output_acc)
+                    GlobalV::ofs_running << std::setw(15) << f(iat, 2) * fac;
+                else
+                    GlobalV::ofs_running << std::setw(15) << "0";
+                GlobalV::ofs_running << std::endl;
+            }
+
+            if (GlobalV::TEST_FORCE && ry)
+            {
+                std::cout << " " << std::setw(8) << ss.str();
+                if (abs(f(iat, 0)) > Forces::output_acc)
+                    std::cout << std::setw(15) << f(iat, 0);
+                else
+                    std::cout << std::setw(15) << "0";
+                if (abs(f(iat, 1)) > Forces::output_acc)
+                    std::cout << std::setw(15) << f(iat, 1);
+                else
+                    std::cout << std::setw(15) << "0";
+                if (abs(f(iat, 2)) > Forces::output_acc)
+                    std::cout << std::setw(15) << f(iat, 2);
+                else
+                    std::cout << std::setw(15) << "0";
+                std::cout << std::endl;
+            }
+            else if (GlobalV::TEST_FORCE)
+            {
+                std::cout << " " << std::setw(8) << ss.str();
+                if (abs(f(iat, 0)) > Forces::output_acc)
+                    std::cout << std::setw(15) << f(iat, 0) * fac;
+                else
+                    std::cout << std::setw(15) << "0";
+                if (abs(f(iat, 1)) > Forces::output_acc)
+                    std::cout << std::setw(15) << f(iat, 1) * fac;
+                else
+                    std::cout << std::setw(15) << "0";
+                if (abs(f(iat, 2)) > Forces::output_acc)
+                    std::cout << std::setw(15) << f(iat, 2) * fac;
+                else
+                    std::cout << std::setw(15) << "0";
+                std::cout << std::endl;
+            }
 
-			if(ry) // output Rydberg Unit
-			{
-				GlobalV::ofs_running << " " << std::setw(8) << ss.str();
-				if( abs(f(iat,0)) > Forces::output_acc) GlobalV::ofs_running << std::setw(15) << f(iat,0);
-				else GlobalV::ofs_running << std::setw(15) << "0";
-				if( abs(f(iat,1)) > Forces::output_acc) GlobalV::ofs_running << std::setw(15) << f(iat,1);
-				else GlobalV::ofs_running << std::setw(15) << "0";
-				if( abs(f(iat,2)) > Forces::output_acc) GlobalV::ofs_running << std::setw(15) << f(iat,2);
-				else GlobalV::ofs_running << std::setw(15) << "0";
-				GlobalV::ofs_running << std::endl;
-			}
-			else
-			{
-				GlobalV::ofs_running << " " << std::setw(8) << ss.str();
-				if( abs(f(iat,0)) > Forces::output_acc) GlobalV::ofs_running << std::setw(15) << f(iat,0)*fac;
-				else GlobalV::ofs_running << std::setw(15) << "0";
-				if( abs(f(iat,1)) > Forces::output_acc) GlobalV::ofs_running << std::setw(15) << f(iat,1)*fac;
-				else GlobalV::ofs_running << std::setw(15) << "0";
-				if( abs(f(iat,2)) > Forces::output_acc) GlobalV::ofs_running << std::setw(15) << f(iat,2)*fac;
-				else GlobalV::ofs_running << std::setw(15) << "0";
-				GlobalV::ofs_running << std::endl;
-			}
-
-			if(GlobalV::TEST_FORCE && ry)
-			{
-				std::cout << " " << std::setw(8) << ss.str();
-				if( abs(f(iat,0)) > Forces::output_acc) std::cout << std::setw(15) << f(iat,0);
-				else std::cout << std::setw(15) << "0";
-				if( abs(f(iat,1)) > Forces::output_acc) std::cout << std::setw(15) << f(iat,1);
-				else std::cout << std::setw(15) << "0";
-				if( abs(f(iat,2)) > Forces::output_acc) std::cout << std::setw(15) << f(iat,2);
-				else std::cout << std::setw(15) << "0";
-				std::cout << std::endl;
-			}
-			else if (GlobalV::TEST_FORCE)
-			{
-				std::cout << " " << std::setw(8) << ss.str();
-				if( abs(f(iat,0)) > Forces::output_acc) std::cout << std::setw(15) << f(iat,0)*fac;
-				else std::cout << std::setw(15) << "0";
-				if( abs(f(iat,1)) > Forces::output_acc) std::cout << std::setw(15) << f(iat,1)*fac;
-				else std::cout << std::setw(15) << "0";
-				if( abs(f(iat,2)) > Forces::output_acc) std::cout << std::setw(15) << f(iat,2)*fac;
-				else std::cout << std::setw(15) << "0";
-				std::cout << std::endl;
-			}	
-				
             iat++;
         }
     }
 
-	GlobalV::ofs_running << std::resetiosflags(ios::showpos);
-	std::cout << std::resetiosflags(ios::showpos);
+    GlobalV::ofs_running << std::resetiosflags(ios::showpos);
+    std::cout << std::resetiosflags(ios::showpos);
     return;
 }
 
-
 void Forces::cal_force_loc(ModuleBase::matrix& forcelc, ModulePW::PW_Basis* rho_basis)
 {
-	ModuleBase::timer::tick("Forces","cal_force_loc");
+    ModuleBase::timer::tick("Forces", "cal_force_loc");
 
-    std::complex<double> *aux = new std::complex<double>[rho_basis->nmaxgr];
+    std::complex<double>* aux = new std::complex<double>[rho_basis->nmaxgr];
     ModuleBase::GlobalFunc::ZEROS(aux, rho_basis->nrxx);
 
     // now, in all pools , the charge are the same,
     // so, the force calculated by each pool is equal.
-    
-	for(int is=0; is<GlobalV::NSPIN; is++)
-	{
-		for (int ir=0; ir<rho_basis->nrxx; ir++)
-		{
-        	aux[ir] += std::complex<double>( GlobalC::CHR.rho[is][ir], 0.0 );
-		}
-	}
 
-	// to G space.
-    rho_basis->real2recip(aux,aux);
+    for (int is = 0; is < GlobalV::NSPIN; is++)
+    {
+        for (int ir = 0; ir < rho_basis->nrxx; ir++)
+        {
+            aux[ir] += std::complex<double>(GlobalC::CHR.rho[is][ir], 0.0);
+        }
+    }
 
+    // to G space.
+    rho_basis->real2recip(aux, aux);
 
     int iat = 0;
-    for (int it = 0;it < GlobalC::ucell.ntype;it++)
+    for (int it = 0; it < GlobalC::ucell.ntype; it++)
     {
-        for (int ia = 0;ia < GlobalC::ucell.atoms[it].na;ia++)
+        for (int ia = 0; ia < GlobalC::ucell.atoms[it].na; ia++)
         {
-            for (int ig = 0; ig < rho_basis->npw ; ig++)
+            for (int ig = 0; ig < rho_basis->npw; ig++)
             {
                 const double phase = ModuleBase::TWO_PI * (rho_basis->gcar[ig] * GlobalC::ucell.atoms[it].tau[ia]);
-                const double factor = GlobalC::ppcell.vloc(it, rho_basis->ig2igg[ig]) *
-									  ( cos(phase) * aux[ig].imag()
-                                      + sin(phase) * aux[ig].real()); 
+                const double factor = GlobalC::ppcell.vloc(it, rho_basis->ig2igg[ig])
+                                      * (cos(phase) * aux[ig].imag() + sin(phase) * aux[ig].real());
                 forcelc(iat, 0) += rho_basis->gcar[ig][0] * factor;
                 forcelc(iat, 1) += rho_basis->gcar[ig][1] * factor;
                 forcelc(iat, 2) += rho_basis->gcar[ig][2] * factor;
             }
-            for (int ipol = 0;ipol < 3;ipol++)
+            for (int ipol = 0; ipol < 3; ipol++)
             {
                 forcelc(iat, ipol) *= (GlobalC::ucell.tpiba * GlobalC::ucell.omega);
             }
             ++iat;
         }
     }
-    //this->print(GlobalV::ofs_running, "local forces", forcelc);
+    // this->print(GlobalV::ofs_running, "local forces", forcelc);
     Parallel_Reduce::reduce_double_pool(forcelc.c, forcelc.nr * forcelc.nc);
     delete[] aux;
-	ModuleBase::timer::tick("Forces","cal_force_loc");
+    ModuleBase::timer::tick("Forces", "cal_force_loc");
     return;
 }
 
 #include "H_Ewald_pw.h"
 void Forces::cal_force_ew(ModuleBase::matrix& forceion, ModulePW::PW_Basis* rho_basis)
 {
-	ModuleBase::timer::tick("Forces","cal_force_ew");
+    ModuleBase::timer::tick("Forces", "cal_force_ew");
 
     double fact = 2.0;
-    std::complex<double> *aux = new std::complex<double> [rho_basis->npw];
+    std::complex<double>* aux = new std::complex<double>[rho_basis->npw];
     ModuleBase::GlobalFunc::ZEROS(aux, rho_basis->npw);
 
-    for (int it = 0;it < GlobalC::ucell.ntype;it++)
+    for (int it = 0; it < GlobalC::ucell.ntype; it++)
     {
         for (int ig = 0; ig < rho_basis->npw; ig++)
         {
-            if(ig == rho_basis->ig_gge0)   continue;
+            if (ig == rho_basis->ig_gge0)
+                continue;
             aux[ig] += static_cast<double>(GlobalC::ucell.atoms[it].zv) * conj(GlobalC::sf.strucFac(it, ig));
         }
     }
 
-	// calculate total ionic charge
+    // calculate total ionic charge
     double charge = 0.0;
-    for (int it = 0;it < GlobalC::ucell.ntype;it++)
+    for (int it = 0; it < GlobalC::ucell.ntype; it++)
     {
-        charge += GlobalC::ucell.atoms[it].na * GlobalC::ucell.atoms[it].zv;//mohan modify 2007-11-7
+        charge += GlobalC::ucell.atoms[it].na * GlobalC::ucell.atoms[it].zv; // mohan modify 2007-11-7
     }
-	
-	double alpha = 1.1;
-	double upperbound ;
+
+    double alpha = 1.1;
+    double upperbound;
     do
     {
         alpha -= 0.10;
@@ -450,200 +502,210 @@ void Forces::cal_force_ew(ModuleBase::matrix& forceion, ModulePW::PW_Basis* rho_
 
         if (alpha <= 0.0)
         {
-            ModuleBase::WARNING_QUIT("ewald","Can't find optimal alpha.");
+            ModuleBase::WARNING_QUIT("ewald", "Can't find optimal alpha.");
         }
-        upperbound = 2.0 * charge * charge * sqrt(2.0 * alpha / ModuleBase::TWO_PI) *
-                     erfc(sqrt(GlobalC::ucell.tpiba2 * rho_basis->ggecut / 4.0 / alpha));
-    }
-    while (upperbound > 1.0e-6);
-//	std::cout << " GlobalC::en.alpha = " << alpha << std::endl;
-//	std::cout << " upperbound = " << upperbound << std::endl;
-	
-
+        upperbound = 2.0 * charge * charge * sqrt(2.0 * alpha / ModuleBase::TWO_PI)
+                     * erfc(sqrt(GlobalC::ucell.tpiba2 * rho_basis->ggecut / 4.0 / alpha));
+    } while (upperbound > 1.0e-6);
+    // std::cout << " GlobalC::en.alpha = " << alpha << std::endl;
+    // std::cout << " upperbound = " << upperbound << std::endl;
 
     for (int ig = 0; ig < rho_basis->npw; ig++)
     {
-        if(ig == rho_basis->ig_gge0)   continue;
-        aux[ig] *= exp(-1.0 * rho_basis->gg[ig] * GlobalC::ucell.tpiba2 / alpha / 4.0) / (rho_basis->gg[ig] * GlobalC::ucell.tpiba2);
+        if (ig == rho_basis->ig_gge0)
+            continue;
+        aux[ig] *= exp(-1.0 * rho_basis->gg[ig] * GlobalC::ucell.tpiba2 / alpha / 4.0)
+                   / (rho_basis->gg[ig] * GlobalC::ucell.tpiba2);
     }
 
     int iat = 0;
-    for (int it = 0;it < GlobalC::ucell.ntype;it++)
+    for (int it = 0; it < GlobalC::ucell.ntype; it++)
     {
-        for (int ia = 0;ia < GlobalC::ucell.atoms[it].na;ia++)
+        for (int ia = 0; ia < GlobalC::ucell.atoms[it].na; ia++)
         {
             for (int ig = 0; ig < rho_basis->npw; ig++)
             {
-                if(ig == rho_basis->ig_gge0)   continue;
+                if (ig == rho_basis->ig_gge0)
+                    continue;
                 const ModuleBase::Vector3<double> gcar = rho_basis->gcar[ig];
                 const double arg = ModuleBase::TWO_PI * (gcar * GlobalC::ucell.atoms[it].tau[ia]);
-                double sumnb =  -cos(arg) * aux[ig].imag() + sin(arg) * aux[ig].real();
+                double sumnb = -cos(arg) * aux[ig].imag() + sin(arg) * aux[ig].real();
                 forceion(iat, 0) += gcar[0] * sumnb;
                 forceion(iat, 1) += gcar[1] * sumnb;
                 forceion(iat, 2) += gcar[2] * sumnb;
             }
-            for (int ipol = 0;ipol < 3;ipol++)
+            for (int ipol = 0; ipol < 3; ipol++)
             {
-                forceion(iat, ipol) *= GlobalC::ucell.atoms[it].zv * ModuleBase::e2 * GlobalC::ucell.tpiba * ModuleBase::TWO_PI / GlobalC::ucell.omega * fact;
+                forceion(iat, ipol) *= GlobalC::ucell.atoms[it].zv * ModuleBase::e2 * GlobalC::ucell.tpiba
+                                       * ModuleBase::TWO_PI / GlobalC::ucell.omega * fact;
             }
 
-	//		std::cout << " atom" << iat << std::endl;
-	//		std::cout << std::setw(15) << forceion(iat, 0) << std::setw(15) << forceion(iat,1) << std::setw(15) << forceion(iat,2) << std::endl; 
+            // std::cout << " atom" << iat << std::endl;
+            // std::cout << std::setw(15) << forceion(iat, 0) << std::setw(15) << forceion(iat, 1)
+            //           << std::setw(15) << forceion(iat, 2) << std::endl;
             iat++;
         }
     }
-    delete [] aux;
-
+    delete[] aux;
 
-	// means that the processor contains G=0 term.
+    // true only on the processor that holds the G=0 term.
     if (rho_basis->ig_gge0 >= 0)
     {
         double rmax = 5.0 / (sqrt(alpha) * GlobalC::ucell.lat0);
         int nrm = 0;
-		
-        //output of rgen: the number of vectors in the sphere
+
+        // output of rgen: the number of vectors in the sphere
         const int mxr = 50;
         // the maximum number of R vectors included in r
-        ModuleBase::Vector3<double> *r  = new ModuleBase::Vector3<double>[mxr];
-        double *r2 = new double[mxr];
-		ModuleBase::GlobalFunc::ZEROS(r2, mxr);
-        int *irr = new int[mxr];
-		ModuleBase::GlobalFunc::ZEROS(irr, mxr);
+        ModuleBase::Vector3<double>* r = new ModuleBase::Vector3<double>[mxr];
+        double* r2 = new double[mxr];
+        ModuleBase::GlobalFunc::ZEROS(r2, mxr);
+        int* irr = new int[mxr];
+        ModuleBase::GlobalFunc::ZEROS(irr, mxr);
         // the square modulus of R_j-tau_s-tau_s'
 
-		int iat1 = 0;
+        int iat1 = 0;
         for (int T1 = 0; T1 < GlobalC::ucell.ntype; T1++)
         {
-			Atom* atom1 = &GlobalC::ucell.atoms[T1]; 
+            Atom* atom1 = &GlobalC::ucell.atoms[T1];
             for (int I1 = 0; I1 < atom1->na; I1++)
             {
-				int iat2 = 0; // mohan fix bug 2011-06-07
+                int iat2 = 0; // mohan fix bug 2011-06-07
                 for (int T2 = 0; T2 < GlobalC::ucell.ntype; T2++)
                 {
                     for (int I2 = 0; I2 < GlobalC::ucell.atoms[T2].na; I2++)
                     {
                         if (iat1 != iat2)
                         {
-                            ModuleBase::Vector3<double> d_tau = GlobalC::ucell.atoms[T1].tau[I1] - GlobalC::ucell.atoms[T2].tau[I2];
+                            ModuleBase::Vector3<double> d_tau
+                                = GlobalC::ucell.atoms[T1].tau[I1] - GlobalC::ucell.atoms[T2].tau[I2];
                             H_Ewald_pw::rgen(d_tau, rmax, irr, GlobalC::ucell.latvec, GlobalC::ucell.G, r, r2, nrm);
 
-                            for (int n = 0;n < nrm;n++)
+                            for (int n = 0; n < nrm; n++)
                             {
-								const double rr = sqrt(r2[n]) * GlobalC::ucell.lat0;
+                                const double rr = sqrt(r2[n]) * GlobalC::ucell.lat0;
 
-                                double factor = GlobalC::ucell.atoms[T1].zv * GlobalC::ucell.atoms[T2].zv * ModuleBase::e2 / (rr * rr)
-                                                * (erfc(sqrt(alpha) * rr) / rr
-                                    + sqrt(8.0 * alpha / ModuleBase::TWO_PI) * exp(-1.0 * alpha * rr * rr)) * GlobalC::ucell.lat0;
+                                double factor
+                                    = GlobalC::ucell.atoms[T1].zv * GlobalC::ucell.atoms[T2].zv * ModuleBase::e2
+                                      / (rr * rr)
+                                      * (erfc(sqrt(alpha) * rr) / rr
+                                         + sqrt(8.0 * alpha / ModuleBase::TWO_PI) * exp(-1.0 * alpha * rr * rr))
+                                      * GlobalC::ucell.lat0;
 
-								forceion(iat1, 0) -= factor * r[n].x;
+                                forceion(iat1, 0) -= factor * r[n].x;
                                 forceion(iat1, 1) -= factor * r[n].y;
                                 forceion(iat1, 2) -= factor * r[n].z;
 
-//								std::cout << " r.z=" << r[n].z << " r2=" << r2[n] << std::endl;
-						//		std::cout << " " << iat1 << " " << iat2 << " n=" << n
-						//		 << " rn.z=" << r[n].z 
-						//		 << " r2=" << r2[n] << " rr=" << rr << " fac=" << factor << " force=" << forceion(iat1,2) 
-						//		 << " new_part=" << factor*r[n].z <<  std::endl;
+                                // std::cout << " r.z=" << r[n].z << " r2=" << r2[n] << std::endl;
+                                // std::cout << " " << iat1 << " " << iat2 << " n=" << n
+                                //           << " rn.z=" << r[n].z
+                                //           << " r2=" << r2[n] << " rr=" << rr << " fac=" << factor
+                                //           << " force=" << forceion(iat1, 2)
+                                //           << " new_part=" << factor * r[n].z << std::endl;
                             }
                         }
 
                         ++iat2;
                     }
-                }//atom b
+                } // atom b
 
-//				std::cout << " atom" << iat1 << std::endl;
-//				std::cout << std::setw(15) << forceion(iat1, 0) << std::setw(15) << forceion(iat1,1) << std::setw(15) << forceion(iat1,2) << std::endl; 
+                // std::cout << " atom" << iat1 << std::endl;
+                // std::cout << std::setw(15) << forceion(iat1, 0) << std::setw(15) << forceion(iat1, 1)
+                //           << std::setw(15) << forceion(iat1, 2) << std::endl;
 
                 ++iat1;
             }
-        }//atom a
-        delete []r;
-        delete []r2;
-        delete []irr;
+        } // atom a
+        delete[] r;
+        delete[] r2;
+        delete[] irr;
     }
 
     Parallel_Reduce::reduce_double_pool(forceion.c, forceion.nr * forceion.nc);
 
-    //this->print(GlobalV::ofs_running, "ewald forces", forceion);
+    // this->print(GlobalV::ofs_running, "ewald forces", forceion);
 
-	ModuleBase::timer::tick("Forces","cal_force_ew");
+    ModuleBase::timer::tick("Forces", "cal_force_ew");
 
     return;
 }
 
 void Forces::cal_force_cc(ModuleBase::matrix& forcecc, ModulePW::PW_Basis* rho_basis)
 {
-	// recalculate the exchange-correlation potential.
-	
+    // recalculate the exchange-correlation potential.
+
     ModuleBase::matrix v(GlobalV::NSPIN, rho_basis->nrxx);
 
-	if(XC_Functional::get_func_type() == 3)
-	{
+    if (XC_Functional::get_func_type() == 3)
+    {
 #ifdef USE_LIBXC
-    	const auto etxc_vtxc_v = XC_Functional::v_xc_meta(
-            rho_basis->nrxx, rho_basis->nxyz, GlobalC::ucell.omega,
-            GlobalC::CHR.rho, GlobalC::CHR.rho_core, GlobalC::CHR.kin_r);
-        
+        const auto etxc_vtxc_v = XC_Functional::v_xc_meta(rho_basis->nrxx,
+                                                          rho_basis->nxyz,
+                                                          GlobalC::ucell.omega,
+                                                          GlobalC::CHR.rho,
+                                                          GlobalC::CHR.rho_core,
+                                                          GlobalC::CHR.kin_r);
+
         GlobalC::en.etxc = std::get<0>(etxc_vtxc_v);
         GlobalC::en.vtxc = std::get<1>(etxc_vtxc_v);
         v = std::get<2>(etxc_vtxc_v);
 #else
-        ModuleBase::WARNING_QUIT("cal_force_cc","to use mGGA, compile with LIBXC");
+        ModuleBase::WARNING_QUIT("cal_force_cc", "to use mGGA, compile with LIBXC");
 #endif
-	}
-	else
-	{	
-    	const auto etxc_vtxc_v = XC_Functional::v_xc(
-            rho_basis->nrxx, rho_basis->nxyz, GlobalC::ucell.omega,
-            GlobalC::CHR.rho, GlobalC::CHR.rho_core);
-        
+    }
+    else
+    {
+        const auto etxc_vtxc_v = XC_Functional::v_xc(rho_basis->nrxx,
+                                                     rho_basis->nxyz,
+                                                     GlobalC::ucell.omega,
+                                                     GlobalC::CHR.rho,
+                                                     GlobalC::CHR.rho_core);
+
         GlobalC::en.etxc = std::get<0>(etxc_vtxc_v);
         GlobalC::en.vtxc = std::get<1>(etxc_vtxc_v);
-	    v = std::get<2>(etxc_vtxc_v);
-	}
+        v = std::get<2>(etxc_vtxc_v);
+    }
 
-	const ModuleBase::matrix vxc = v;
-    std::complex<double> * psiv = new std::complex<double> [rho_basis->nmaxgr];
+    const ModuleBase::matrix vxc = v;
+    std::complex<double>* psiv = new std::complex<double>[rho_basis->nmaxgr];
     ModuleBase::GlobalFunc::ZEROS(psiv, rho_basis->nrxx);
     if (GlobalV::NSPIN == 1 || GlobalV::NSPIN == 4)
     {
-        for (int ir = 0;ir < rho_basis->nrxx;ir++)
+        for (int ir = 0; ir < rho_basis->nrxx; ir++)
         {
-            psiv[ir] = std::complex<double>(vxc(0, ir),  0.0);
+            psiv[ir] = std::complex<double>(vxc(0, ir), 0.0);
         }
     }
     else
     {
-        for (int ir = 0;ir < rho_basis->nrxx;ir++)
+        for (int ir = 0; ir < rho_basis->nrxx; ir++)
         {
-            psiv[ir] = 0.5 * (vxc(0 ,ir) + vxc(1, ir));
+            psiv[ir] = 0.5 * (vxc(0, ir) + vxc(1, ir));
         }
     }
 
-	// to G space
+    // to G space
     rho_basis->real2recip(psiv, psiv);
 
-    //psiv contains now Vxc(G)
-    double * rhocg = new double [rho_basis->ngg];
+    // psiv now contains Vxc(G)
+    double* rhocg = new double[rho_basis->ngg];
     ModuleBase::GlobalFunc::ZEROS(rhocg, rho_basis->ngg);
     int iat = 0;
-    for (int T1 = 0;T1 < GlobalC::ucell.ntype;T1++)
+    for (int T1 = 0; T1 < GlobalC::ucell.ntype; T1++)
     {
         if (GlobalC::ucell.atoms[T1].nlcc)
         {
-            //call drhoc
-            GlobalC::CHR.non_linear_core_correction(
-                GlobalC::ppcell.numeric,
-                GlobalC::ucell.atoms[T1].msh,
-                GlobalC::ucell.atoms[T1].r,
-                GlobalC::ucell.atoms[T1].rab,
-                GlobalC::ucell.atoms[T1].rho_atc,
-                rhocg,
-                rho_basis);
-
-
-			std::complex<double> ipol0, ipol1, ipol2;
-            for (int I1 = 0;I1 < GlobalC::ucell.atoms[T1].na;I1++)
+            // call drhoc
+            GlobalC::CHR.non_linear_core_correction(GlobalC::ppcell.numeric,
+                                                    GlobalC::ucell.atoms[T1].msh,
+                                                    GlobalC::ucell.atoms[T1].r,
+                                                    GlobalC::ucell.atoms[T1].rab,
+                                                    GlobalC::ucell.atoms[T1].rho_atc,
+                                                    rhocg,
+                                                    rho_basis);
+
+            std::complex<double> ipol0, ipol1, ipol2;
+            for (int I1 = 0; I1 < GlobalC::ucell.atoms[T1].na; I1++)
             {
                 for (int ig = 0; ig < rho_basis->npw; ig++)
                 {
@@ -656,7 +718,7 @@ void Forces::cal_force_cc(ModuleBase::matrix& forcecc, ModulePW::PW_Basis* rho_b
                     const std::complex<double> expiarg = std::complex<double>(sin(arg), cos(arg));
 
                     ipol0 = GlobalC::ucell.tpiba * GlobalC::ucell.omega * rhocgigg * gv.x * psiv_conj * expiarg;
-                    forcecc(iat, 0) +=  ipol0.real();
+                    forcecc(iat, 0) += ipol0.real();
 
                     ipol1 = GlobalC::ucell.tpiba * GlobalC::ucell.omega * rhocgigg * gv.y * psiv_conj * expiarg;
                     forcecc(iat, 1) += ipol1.real();
@@ -667,38 +729,40 @@ void Forces::cal_force_cc(ModuleBase::matrix& forcecc, ModulePW::PW_Basis* rho_b
                 ++iat;
             }
         }
-        else{
+        else
+        {
             iat += GlobalC::ucell.atoms[T1].na;
         }
     }
     assert(iat == GlobalC::ucell.nat);
-    delete [] rhocg;
-	delete [] psiv; // mohan fix bug 2012-03-22
-    Parallel_Reduce::reduce_double_pool(forcecc.c, forcecc.nr * forcecc.nc); //qianrui fix a bug for kpar > 1
-	return;
+    delete[] rhocg;
+    delete[] psiv; // mohan fix bug 2012-03-22
+    Parallel_Reduce::reduce_double_pool(forcecc.c, forcecc.nr * forcecc.nc); // qianrui fix a bug for kpar > 1
+    return;
 }
 
 #include "../module_base/complexarray.h"
 #include "../module_base/complexmatrix.h"
 void Forces::cal_force_nl(ModuleBase::matrix& forcenl, const psi::Psi<complex<double>>* psi_in)
 {
-	ModuleBase::TITLE("Forces","cal_force_nl");
-	ModuleBase::timer::tick("Forces","cal_force_nl");
+    ModuleBase::TITLE("Forces", "cal_force_nl");
+    ModuleBase::timer::tick("Forces", "cal_force_nl");
 
     const int nkb = GlobalC::ppcell.nkb;
-	if(nkb == 0) return; // mohan add 2010-07-25
-	
-	// dbecp: conj( -iG * <Beta(nkb,npw)|psi(nbnd,npw)> )
-	ModuleBase::ComplexArray dbecp( 3, GlobalV::NBANDS, nkb);
-    ModuleBase::ComplexMatrix becp( GlobalV::NBANDS, nkb);
-    
-	
-	// vkb1: |Beta(nkb,npw)><Beta(nkb,npw)|psi(nbnd,npw)>
-	ModuleBase::ComplexMatrix vkb1( nkb, GlobalC::wf.npwx );
-
-    for (int ik = 0;ik < GlobalC::kv.nks;ik++)
+    if (nkb == 0)
+        return; // mohan add 2010-07-25
+
+    // dbecp: conj( -iG * <Beta(nkb,npw)|psi(nbnd,npw)> )
+    ModuleBase::ComplexArray dbecp(3, GlobalV::NBANDS, nkb);
+    ModuleBase::ComplexMatrix becp(GlobalV::NBANDS, nkb);
+
+    // vkb1: |Beta(nkb,npw)><Beta(nkb,npw)|psi(nbnd,npw)>
+    ModuleBase::ComplexMatrix vkb1(nkb, GlobalC::wf.npwx);
+
+    for (int ik = 0; ik < GlobalC::kv.nks; ik++)
     {
-        if (GlobalV::NSPIN==2) GlobalV::CURRENT_SPIN = GlobalC::kv.isk[ik];
+        if (GlobalV::NSPIN == 2)
+            GlobalV::CURRENT_SPIN = GlobalC::kv.isk[ik];
         const int nbasis = GlobalC::kv.ngk[ik];
         // generate vkb
         if (GlobalC::ppcell.nkb > 0)
@@ -708,62 +772,62 @@ void Forces::cal_force_nl(ModuleBase::matrix& forcenl, const psi::Psi<complex<do
 
         // get becp according to wave functions and vkb
         // important here ! becp must set zero!!
-		// vkb: Beta(nkb,npw)
-		// becp(nkb,nbnd): <Beta(nkb,npw)|psi(nbnd,npw)>
+        // vkb: Beta(nkb,npw)
+        // becp(nkb,nbnd): <Beta(nkb,npw)|psi(nbnd,npw)>
         becp.zero_out();
         psi_in[0].fix_k(ik);
         char transa = 'C';
         char transb = 'N';
         ///
-        ///only occupied band should be calculated.
+        /// only occupied bands need to be calculated.
         ///
         int nbands_occ = GlobalV::NBANDS;
-        while(GlobalC::wf.wg(ik, nbands_occ-1) < ModuleBase::threshold_wg)
+        while (nbands_occ > 0 && GlobalC::wf.wg(ik, nbands_occ - 1) < ModuleBase::threshold_wg)
         {
             nbands_occ--;
         }
         int npm = GlobalV::NPOL * nbands_occ;
         zgemm_(&transa,
-            &transb,
-            &nkb,
-            &npm,
-            &nbasis,
-            &ModuleBase::ONE,
-            GlobalC::ppcell.vkb.c,
-            &GlobalC::wf.npwx,
-            psi_in[0].get_pointer(),
-            &GlobalC::wf.npwx,
-            &ModuleBase::ZERO,
-            becp.c,
-            &nkb);
-        Parallel_Reduce::reduce_complex_double_pool( becp.c, becp.size);
-
-        //out.printcm_real("becp",becp,1.0e-4);
-        // Calculate the derivative of beta,
-        // |dbeta> =  -ig * |beta>
+               &transb,
+               &nkb,
+               &npm,
+               &nbasis,
+               &ModuleBase::ONE,
+               GlobalC::ppcell.vkb.c,
+               &GlobalC::wf.npwx,
+               psi_in[0].get_pointer(),
+               &GlobalC::wf.npwx,
+               &ModuleBase::ZERO,
+               becp.c,
+               &nkb);
+        Parallel_Reduce::reduce_complex_double_pool(becp.c, becp.size);
+
+        // out.printcm_real("becp",becp,1.0e-4);
+        //  Calculate the derivative of beta,
+        //  |dbeta> =  -ig * |beta>
         dbecp.zero_out();
-        for (int ipol = 0; ipol<3; ipol++)
+        for (int ipol = 0; ipol < 3; ipol++)
         {
-			for (int i = 0;i < nkb;i++)
-			{
-                std::complex<double>* pvkb1 = &vkb1(i,0);
-                std::complex<double>* pvkb = &GlobalC::ppcell.vkb(i,0);
-				if (ipol==0)
-				{
-					for (int ig=0; ig<nbasis; ig++)
-                        pvkb1[ig] = pvkb[ig] * ModuleBase::NEG_IMAG_UNIT * GlobalC::wfcpw->getgcar(ik,ig)[0];
+            for (int i = 0; i < nkb; i++)
+            {
+                std::complex<double>* pvkb1 = &vkb1(i, 0);
+                std::complex<double>* pvkb = &GlobalC::ppcell.vkb(i, 0);
+                if (ipol == 0)
+                {
+                    for (int ig = 0; ig < nbasis; ig++)
+                        pvkb1[ig] = pvkb[ig] * ModuleBase::NEG_IMAG_UNIT * GlobalC::wfcpw->getgcar(ik, ig)[0];
                 }
-				if (ipol==1)
-				{
-					for (int ig=0; ig<nbasis; ig++)
-                        pvkb1[ig] = pvkb[ig] * ModuleBase::NEG_IMAG_UNIT * GlobalC::wfcpw->getgcar(ik,ig)[1];
+                if (ipol == 1)
+                {
+                    for (int ig = 0; ig < nbasis; ig++)
+                        pvkb1[ig] = pvkb[ig] * ModuleBase::NEG_IMAG_UNIT * GlobalC::wfcpw->getgcar(ik, ig)[1];
                 }
-				if (ipol==2)
-				{
-					for (int ig=0; ig<nbasis; ig++)
-                        pvkb1[ig] = pvkb[ig] * ModuleBase::NEG_IMAG_UNIT * GlobalC::wfcpw->getgcar(ik,ig)[2];
+                if (ipol == 2)
+                {
+                    for (int ig = 0; ig < nbasis; ig++)
+                        pvkb1[ig] = pvkb[ig] * ModuleBase::NEG_IMAG_UNIT * GlobalC::wfcpw->getgcar(ik, ig)[2];
                 }
-			}
+            }
             std::complex<double>* pdbecp = &dbecp(ipol, 0, 0);
             zgemm_(&transa,
                 &transb,
@@ -842,27 +906,27 @@ void Forces::cal_force_nl(ModuleBase::matrix& forcenl, const psi::Psi<complex<do
 
     // sum up forcenl from all processors
     Parallel_Reduce::reduce_double_all(forcenl.c, forcenl.nr * forcenl.nc);
-//  this->print(GlobalV::ofs_running, "nonlocal forces", forcenl);
-	ModuleBase::timer::tick("Forces","cal_force_nl");
+    //  this->print(GlobalV::ofs_running, "nonlocal forces", forcenl);
+    ModuleBase::timer::tick("Forces", "cal_force_nl");
     return;
 }
 
 void Forces::cal_force_scc(ModuleBase::matrix& forcescc, ModulePW::PW_Basis* rho_basis)
 {
-    std::complex<double>* psic = new std::complex<double> [rho_basis->nmaxgr];
+    std::complex<double>* psic = new std::complex<double>[rho_basis->nmaxgr];
 
     if (GlobalV::NSPIN == 1 || GlobalV::NSPIN == 4)
     {
-        for (int i = 0;i < rho_basis->nrxx;i++)
+        for (int i = 0; i < rho_basis->nrxx; i++)
         {
-            psic[i] = GlobalC::pot.vnew(0,i);
+            psic[i] = GlobalC::pot.vnew(0, i);
         }
     }
     else
     {
         int isup = 0;
         int isdw = 1;
-        for (int i = 0;i < rho_basis->nrxx;i++)
+        for (int i = 0; i < rho_basis->nrxx; i++)
         {
             psic[i] = (GlobalC::pot.vnew(isup, i) + GlobalC::pot.vnew(isdw, i)) * 0.5;
         }
@@ -870,7 +934,7 @@ void Forces::cal_force_scc(ModuleBase::matrix& forcescc, ModulePW::PW_Basis* rho
 
     int ndm = 0;
 
-    for (int it = 0;it < GlobalC::ucell.ntype;it++)
+    for (int it = 0; it < GlobalC::ucell.ntype; it++)
     {
         if (ndm < GlobalC::ucell.atoms[it].msh)
         {
@@ -878,29 +942,30 @@ void Forces::cal_force_scc(ModuleBase::matrix& forcescc, ModulePW::PW_Basis* rho
         }
     }
 
-    //work space
+    // work space
     double* aux = new double[ndm];
     ModuleBase::GlobalFunc::ZEROS(aux, ndm);
 
     double* rhocgnt = new double[rho_basis->ngg];
     ModuleBase::GlobalFunc::ZEROS(rhocgnt, rho_basis->ngg);
 
-    rho_basis->real2recip(psic,psic);
+    rho_basis->real2recip(psic, psic);
 
     int igg0 = 0;
     const int ig0 = rho_basis->ig_gge0;
-    if (rho_basis->gg_uniq [0] < 1.0e-8)  igg0 = 1;
+    if (rho_basis->gg_uniq[0] < 1.0e-8)
+        igg0 = 1;
 
     double fact = 2.0;
-    for (int nt = 0;nt < GlobalC::ucell.ntype;nt++)
+    for (int nt = 0; nt < GlobalC::ucell.ntype; nt++)
     {
-//		Here we compute the G.ne.0 term
+        //		Here we compute the G.ne.0 term
         const int mesh = GlobalC::ucell.atoms[nt].msh;
 
-        for (int ig = igg0 ; ig < rho_basis->ngg; ++ig)
+        for (int ig = igg0; ig < rho_basis->ngg; ++ig)
         {
             const double gx = sqrt(rho_basis->gg_uniq[ig]) * GlobalC::ucell.tpiba;
-            for (int ir = 0;ir < mesh;ir++)
+            for (int ir = 0; ir < mesh; ir++)
             {
                 if (GlobalC::ucell.atoms[nt].r[ir] < 1.0e-8)
                 {
@@ -912,19 +977,20 @@ void Forces::cal_force_scc(ModuleBase::matrix& forcescc, ModulePW::PW_Basis* rho
                     aux[ir] = GlobalC::ucell.atoms[nt].rho_at[ir] * sin(gxx) / gxx;
                 }
             }
-            ModuleBase::Integral::Simpson_Integral(mesh , aux, GlobalC::ucell.atoms[nt].rab , rhocgnt [ig]);
+            ModuleBase::Integral::Simpson_Integral(mesh, aux, GlobalC::ucell.atoms[nt].rab, rhocgnt[ig]);
         }
 
         int iat = 0;
-        for (int it = 0;it < GlobalC::ucell.ntype;it++)
+        for (int it = 0; it < GlobalC::ucell.ntype; it++)
         {
-            for (int ia = 0;ia < GlobalC::ucell.atoms[it].na;ia++)
+            for (int ia = 0; ia < GlobalC::ucell.atoms[it].na; ia++)
             {
                 if (nt == it)
                 {
-                    for (int ig = 0;ig < rho_basis->npw; ++ig)
+                    for (int ig = 0; ig < rho_basis->npw; ++ig)
                     {
-                        if(ig==ig0)     continue;
+                        if (ig == ig0)
+                            continue;
                         const ModuleBase::Vector3<double> gv = rho_basis->gcar[ig];
                         const ModuleBase::Vector3<double> pos = GlobalC::ucell.atoms[it].tau[ia];
                         const double rhocgntigg = rhocgnt[GlobalC::rhopw->ig2igg[ig]];
@@ -935,21 +1001,19 @@ void Forces::cal_force_scc(ModuleBase::matrix& forcescc, ModulePW::PW_Basis* rho
                         forcescc(iat, 1) += fact * rhocgntigg * GlobalC::ucell.tpiba * gv.y * cpm.real();
                         forcescc(iat, 2) += fact * rhocgntigg * GlobalC::ucell.tpiba * gv.z * cpm.real();
                     }
-					//std::cout << " forcescc = " << forcescc(iat,0) << " " << forcescc(iat,1) << " " << forcescc(iat,2) << std::endl;
+                    // std::cout << " forcescc = " << forcescc(iat,0) << " " << forcescc(iat,1) << " " <<
+                    // forcescc(iat,2) << std::endl;
                 }
                 iat++;
             }
         }
     }
-    
-	Parallel_Reduce::reduce_double_pool(forcescc.c, forcescc.nr * forcescc.nc);
 
-	delete[] psic; //mohan fix bug 2012-03-22
-	delete[] aux; //mohan fix bug 2012-03-22
-	delete[] rhocgnt;  //mohan fix bug 2012-03-22
+    Parallel_Reduce::reduce_double_pool(forcescc.c, forcescc.nr * forcescc.nc);
+
+    delete[] psic; // mohan fix bug 2012-03-22
+    delete[] aux; // mohan fix bug 2012-03-22
+    delete[] rhocgnt; // mohan fix bug 2012-03-22
 
     return;
 }
-
-
-
diff --git a/source/src_pw/potential.cpp b/source/src_pw/potential.cpp
index cde1ad785f..20514fe19f 100644
--- a/source/src_pw/potential.cpp
+++ b/source/src_pw/potential.cpp
@@ -7,9 +7,9 @@
 #include "global.h"
 #include "math.h"
 // new
+#include "../module_surchem/efield.h"
 #include "../module_surchem/surchem.h"
 #include "H_Hartree_pw.h"
-#include "../module_surchem/efield.h"
 #ifdef __LCAO
 #include "../src_lcao/ELEC_evolve.h"
 #endif
@@ -39,8 +39,10 @@ void Potential::allocate(const int nrxx)
     assert(nrxx >= 0);
 
     delete[] this->vltot;
-    if(nrxx > 0)    this->vltot = new double[nrxx];
-    else            this->vltot = nullptr;
+    if (nrxx > 0)
+        this->vltot = new double[nrxx];
+    else
+        this->vltot = nullptr;
     ModuleBase::Memory::record("Potential", "vltot", nrxx, "double");
 
     this->vr.create(GlobalV::NSPIN, nrxx);
@@ -59,8 +61,10 @@ void Potential::allocate(const int nrxx)
     }
 
     delete[] this->vr_eff1;
-    if(nrxx > 0)    this->vr_eff1 = new double[nrxx];
-    else            this->vr_eff1 = nullptr;
+    if (nrxx > 0)
+        this->vr_eff1 = new double[nrxx];
+    else
+        this->vr_eff1 = nullptr;
 #ifdef __CUDA
     cudaMalloc((void **)&this->d_vr_eff1, nrxx * sizeof(double));
 #endif
@@ -69,7 +73,7 @@ void Potential::allocate(const int nrxx)
     this->vnew.create(GlobalV::NSPIN, nrxx);
     ModuleBase::Memory::record("Potential", "vnew", GlobalV::NSPIN * nrxx, "double");
 
-    if (GlobalV::imp_sol)
+    if (GlobalV::imp_sol || GlobalV::comp_chg)
     {
         GlobalC::solvent_model.allocate(nrxx, GlobalV::NSPIN);
     }
@@ -266,7 +270,7 @@ void Potential::init_pot(const int &istep, // number of ionic steps
 void Potential::set_local_pot(double *vl_pseudo, // store the local pseudopotential
                               const int &ntype, // number of atom types
                               ModuleBase::matrix &vloc, // local pseduopotentials
-                              ModulePW::PW_Basis* rho_basis,
+                              ModulePW::PW_Basis *rho_basis,
                               ModuleBase::ComplexMatrix &sf // structure factors
 ) const
 {
@@ -359,17 +363,11 @@ ModuleBase::matrix Potential::v_of_rho(const double *const *const rho_in, const
         v += H_Hartree_pw::v_hartree(GlobalC::ucell, GlobalC::rhopw, GlobalV::NSPIN, rho_in);
         if(GlobalV::comp_chg)
         {
-            v += GlobalC::solvent_model.v_compensating(GlobalC::ucell, GlobalC::rhopw);
+            v += GlobalC::solvent_model.v_compensating(GlobalC::ucell, GlobalC::rhopw, GlobalV::NSPIN, rho_in);
         }
         if (GlobalV::imp_sol)
         {
             v += GlobalC::solvent_model.v_correction(GlobalC::ucell, GlobalC::rhopw, GlobalV::NSPIN, rho_in);
-            /*
-            // test energy outside
-            cout << "energy Outside: " << endl;
-            GlobalC::solvent_model.cal_Ael(GlobalC::ucell, GlobalC::rhopw);
-            GlobalC::solvent_model.cal_Acav(GlobalC::ucell, GlobalC::rhopw);
-            */
         }
     }
 
@@ -381,6 +379,11 @@ ModuleBase::matrix Potential::v_of_rho(const double *const *const rho_in, const
         v += Efield::add_efield(GlobalC::ucell, GlobalC::rhopw, GlobalV::NSPIN, rho_in);
     }
 
+    // test get ntot_reci
+    // complex<double> *tmpn = new complex<double>[GlobalC::rhopw->npw];
+    // ModuleBase::GlobalFunc::ZEROS(tmpn, GlobalC::rhopw->npw);
+    // GlobalC::solvent_model.get_totn_reci(GlobalC::ucell, GlobalC::rhopw, tmpn);
+    // delete[] tmpn;
 
     ModuleBase::timer::tick("Potential", "v_of_rho");
     return v;
diff --git a/source/src_ri/exx_abfs.cpp b/source/src_ri/exx_abfs.cpp
index 90acf22a18..995ad78e1d 100644
--- a/source/src_ri/exx_abfs.cpp
+++ b/source/src_ri/exx_abfs.cpp
@@ -490,7 +490,6 @@ std::cout<<"I"<<std::endl;
 
 void Exx_Abfs::cal_exx() const
 {
-	// ȫ����ֻһ��
 std::cout<<"A"<<std::endl;
 
 	const std::vector<std::vector<std::vector<Numerical_Orbital_Lm>>>
@@ -639,7 +638,6 @@ std::cout<<"E"<<std::endl;
 
 std::cout<<"F"<<std::endl;
 
-	// ÿһ�����Ӳ�
 	const std::map<size_t,std::map<size_t,std::map<size_t,std::map<size_t,ModuleBase::matrix>>>>
 		&&ms_abfs_abfs = m_abfs_abfs.cal_overlap_matrix( index_abfs, index_abfs );
 ofs_ms("ms_abfs_abfs",ms_abfs_abfs);
@@ -661,7 +659,6 @@ ofs_ms("ms_C",ms_C);
 
 std::cout<<"H"<<std::endl;
 
-	// ÿһ�����Ӳ�
 	timeval t_begin;
 	gettimeofday( &t_begin, NULL);
 
diff --git a/tests/integrate/117_PW_comp_H2O/INPUT b/tests/integrate/117_PW_comp_H2O/INPUT
new file mode 100644
index 0000000000..bbd6325ed6
--- /dev/null
+++ b/tests/integrate/117_PW_comp_H2O/INPUT
@@ -0,0 +1,21 @@
+INPUT_PARAMETERS
+#Parameters (1.General)
+suffix      	        autotest
+calculation             scf
+pseudo_dir              ../tools/PP_ORB/
+ntype                   2
+nbands                  20
+ecutwfc                 100
+scf_nmax                50
+symmetry                1
+cal_force               1
+
+#Parameters (Compensating charge)
+
+comp_chg               1
+comp_q                 1
+comp_l                 1
+comp_center            5
+comp_dim               2
+
+nelec                  9
diff --git a/tests/integrate/117_PW_comp_H2O/KPT b/tests/integrate/117_PW_comp_H2O/KPT
new file mode 100644
index 0000000000..c289c0158a
--- /dev/null
+++ b/tests/integrate/117_PW_comp_H2O/KPT
@@ -0,0 +1,4 @@
+K_POINTS
+0
+Gamma
+1 1 1 0 0 0
diff --git a/tests/integrate/117_PW_comp_H2O/README b/tests/integrate/117_PW_comp_H2O/README
new file mode 100644
index 0000000000..f84c049e93
--- /dev/null
+++ b/tests/integrate/117_PW_comp_H2O/README
@@ -0,0 +1,4 @@
+This test is for: compensating charge energy and force correction
+*H2O
+*PW
+*kpoints 1*1*1
diff --git a/tests/integrate/117_PW_comp_H2O/STRU b/tests/integrate/117_PW_comp_H2O/STRU
new file mode 100644
index 0000000000..72fef51c06
--- /dev/null
+++ b/tests/integrate/117_PW_comp_H2O/STRU
@@ -0,0 +1,29 @@
+ATOMIC_SPECIES
+H 1.008 H_ONCV_PBE-1.0.upf
+O 15.9994 O_ONCV_PBE-1.0.upf
+
+NUMERICAL_ORBITAL
+H_gga_6au_60Ry_2s1p.orb
+O_gga_6au_60Ry_2s2p1d.orb
+
+LATTICE_CONSTANT
+1
+
+LATTICE_VECTORS
+10 0 0
+0 10 0
+0 0 10
+
+ATOMIC_POSITIONS
+Cartesian    # Cartesian(Unit is LATTICE_CONSTANT)
+
+H
+0.0
+2
+0.000 0.000 1.815 0 0 0
+0.057 1.710 -0.605 0 0 0
+O
+0.0
+1
+0.000 0.000 0.000 0 0 0
+
diff --git a/tests/integrate/117_PW_comp_H2O/jd b/tests/integrate/117_PW_comp_H2O/jd
new file mode 100644
index 0000000000..74044e4804
--- /dev/null
+++ b/tests/integrate/117_PW_comp_H2O/jd
@@ -0,0 +1,2 @@
+test compensating charge correction for H2O
+
diff --git a/tests/integrate/117_PW_comp_H2O/result.ref b/tests/integrate/117_PW_comp_H2O/result.ref
new file mode 100644
index 0000000000..5839754eb1
--- /dev/null
+++ b/tests/integrate/117_PW_comp_H2O/result.ref
@@ -0,0 +1,8 @@
+etotref -468.6754437192308274
+etotperatomref -156.2251479064
+totalforceref 4.151456
+ecompselfref +1.15402457399
+ecompelectronref -10.4346177053
+ecompnuclearref +10.5545672604
+ecomptotref +1.27397412918
+totaltimeref 17.27178
diff --git a/tests/integrate/240_NO_KP_15_SO/result.ref b/tests/integrate/240_NO_KP_15_SO/result.ref
index c13b46fb68..12e44d2296 100644
--- a/tests/integrate/240_NO_KP_15_SO/result.ref
+++ b/tests/integrate/240_NO_KP_15_SO/result.ref
@@ -1,3 +1,3 @@
-etotref -1870.568879523735
-etotperatomref -935.2844397619
-totaltimeref 16.605
+etotref -1870.520882454763
+etotperatomref -935.2604412274
+totaltimeref 8.5263
diff --git a/tests/integrate/240_NO_KP_15_SO_average/INPUT b/tests/integrate/240_NO_KP_15_SO_average/INPUT
new file mode 100644
index 0000000000..88ac7b38db
--- /dev/null
+++ b/tests/integrate/240_NO_KP_15_SO_average/INPUT
@@ -0,0 +1,47 @@
+INPUT_PARAMETERS
+#Parameters	(General)
+suffix	            autotest
+pseudo_dir          ../tools/PP_ORB/
+ntype               2
+#nbands              40
+pseudo_type         upf201
+gamma_only          0
+
+
+calculation         scf
+symmetry             1
+
+#test_force          1
+relax_nmax               1
+force_thr_ev        0.001
+out_level           ie
+relax_method           cg
+out_chg          1
+#out_band            1
+#init_chg        file
+
+smearing_method            gaussian
+smearing_sigma             0.001
+#Parameters (3.PW)
+ecutwfc             20
+scf_thr                 1e-6
+scf_nmax               100
+
+
+#cal_stress              1
+#noncolin             1
+lspinorb             1
+
+#Parameters (LCAO)
+basis_type lcao
+ks_solver           genelpa
+chg_extrap       second-order
+out_dm             0
+pw_diag_thr                0.00001
+
+
+mixing_type         pulay
+mixing_beta         0.4
+mixing_gg0          1.5
+
+soc_lambda 0.0
diff --git a/tests/integrate/240_NO_KP_15_SO_average/KPT b/tests/integrate/240_NO_KP_15_SO_average/KPT
new file mode 100644
index 0000000000..28006d5e2d
--- /dev/null
+++ b/tests/integrate/240_NO_KP_15_SO_average/KPT
@@ -0,0 +1,4 @@
+K_POINTS
+0
+Gamma
+2 2 2  0 0 0
diff --git a/tests/integrate/240_NO_KP_15_SO_average/README b/tests/integrate/240_NO_KP_15_SO_average/README
new file mode 100644
index 0000000000..1a6b704cc2
--- /dev/null
+++ b/tests/integrate/240_NO_KP_15_SO_average/README
@@ -0,0 +1,17 @@
+This test is for:
+*GaAs-soc
+*LCAO
+*kpoints 2*2*2
+*sg15 pseudopotential
+*smearing_method gauss
+*ks_solver genelpa
+*mixing_type pulay-kerker
+*mixing_beta 0.4
+
+Compared with 240_NO_KP_15_SO, this test case adds the parameter soc_lambda = 0,
+which means the calculation is performed in the nspin = 4 framework
+but with zero SOC strength.
+Therefore, the result should be consistent with a calculation that turns off
+SOC (i.e., sets lspinorb to 0 in the INPUT file).
+This was not the case for the old implementation of the SOC nonlocal PP (build_Nonlocal_mu),
+but the new implementation (build_Nonlocal_mu_new) fixes it.
diff --git a/tests/integrate/240_NO_KP_15_SO_average/STRU b/tests/integrate/240_NO_KP_15_SO_average/STRU
new file mode 100644
index 0000000000..fb74bacd4c
--- /dev/null
+++ b/tests/integrate/240_NO_KP_15_SO_average/STRU
@@ -0,0 +1,27 @@
+ATOMIC_SPECIES
+As 1   As_ONCV_PBE_FR-1.1.upf 
+Ga 1   Ga_ONCV_PBE_FR-1.0.upf.txt
+
+LATTICE_CONSTANT
+1  // add lattice constant, 10.58 ang
+
+NUMERICAL_ORBITAL
+../tools/PP_ORB/As_gga_8au_60Ry_2s2p1d.orb
+../tools/PP_ORB/Ga_gga_9au_60Ry_2s2p2d.orb
+
+LATTICE_VECTORS
+5.34197 5.34197  0.0
+0.0  5.34197 5.34197
+5.34197  0.0  5.34197
+ATOMIC_POSITIONS
+Direct //Cartesian or Direct coordinate.
+
+As
+0
+1
+0.2500000          0.2500000          0.25000000     0 0 0
+
+Ga              //Element Label
+0
+1              //number of atom
+0.00000          0.00000          0.000000     0 0 0
diff --git a/tests/integrate/240_NO_KP_15_SO_average/jd b/tests/integrate/240_NO_KP_15_SO_average/jd
new file mode 100644
index 0000000000..e78cd9feb5
--- /dev/null
+++ b/tests/integrate/240_NO_KP_15_SO_average/jd
@@ -0,0 +1 @@
+Func:LCAO/SOC(lspinorb=1); Sys:AsGa; Ref:energy
diff --git a/tests/integrate/240_NO_KP_15_SO_average/result.ref b/tests/integrate/240_NO_KP_15_SO_average/result.ref
new file mode 100644
index 0000000000..23bba32b0d
--- /dev/null
+++ b/tests/integrate/240_NO_KP_15_SO_average/result.ref
@@ -0,0 +1,3 @@
+etotref -1870.759635935754
+etotperatomref -935.3798179679
+totaltimeref 8.4023
diff --git a/tests/integrate/Autotest.sh b/tests/integrate/Autotest.sh
index e273fed663..fe447df6b4 100755
--- a/tests/integrate/Autotest.sh
+++ b/tests/integrate/Autotest.sh
@@ -63,7 +63,7 @@ check_out(){
 	#------------------------------------------------------
 	if test -e "jd"; then
 		jd=`cat jd`
- 		echo " [  ------  ] $jd"
+ 		echo "[----------] $jd"
 	fi
 
 	#------------------------------------------------------
@@ -100,20 +100,20 @@ check_out(){
 		# deviation should be positively defined
 		#--------------------------------------------------
 		if [ ! -n "$deviation" ]; then
-            echo -e "\e[1;31m [  FAILED  ]  Fatal Error: key $key not found in output. \e[0m"
-			let failed++
-			failed_case_list+=$dir
+            echo -e "\e[0;31m[ERROR     ] Fatal Error: key $key not found in output.\e[0m"
+			let failed++ fatal++
+			fatal_case_list+=$dir'\n'
 			break
         else
 			if [ $(echo "sqrt($deviation*$deviation) < $threshold"|bc) = 0 ]; then
-				echo -e "\e[1;33m [  FAILED  ] \e[0m"\
+				echo -e "[WARNING   ] "\
 					"$key cal=$cal ref=$ref deviation=$deviation"
 				let failed++
-				failed_case_list+=$dir
+				failed_case_list+=$dir'\n'
 				if [ $(echo "sqrt($deviation*$deviation) < $fatal_threshold"|bc) = 0 ]; then
 					let fatal++
 					fatal_case_list+=$dir
-					echo -e "\e[1;31m [  FATAL   ] \e[0m"\
+					echo -e "\e[0;31m[ERROR     ] \e[0m"\
 						"An unacceptable deviation occurs."
 					calculation=`grep calculation INPUT | awk '{print $2}' | sed s/[[:space:]]//g`
 					running_path=`echo "OUT.autotest/running_$calculation"".log"`
@@ -123,7 +123,7 @@ check_out(){
 			else
 				#echo "$key cal=$cal ref=$ref deviation=$deviation"
 				#echo "[ PASS ] $key"
-				echo -e "\e[1;32m [      OK  ] \e[0m $key"
+				echo -e "\e[0;32m[      OK  ] \e[0m $key"
 			fi
 		fi
 		let ok++
@@ -156,8 +156,8 @@ fi
 
 for dir in $testdir; do
 	cd $dir
-	echo -e "\e[1;32m [  RUN     ]\e[0m $dir"
-	TIMEFORMAT=' [  ------  ] Time elapsed: %R seconds'
+	echo -e "\e[0;32m[ RUN      ]\e[0m $dir"
+	TIMEFORMAT='[----------] Time elapsed: %R seconds'
 	#parallel test
 	time {
 		if [ "$sanitize" == true ]; then
@@ -178,9 +178,9 @@ for dir in $testdir; do
 		then
 			../tools/catch_properties.sh result.out
 			if [ $? == 1 ]; then
-				echo -e "\e[1;31m [  FAILED  ]  Fatal Error in catch_properties.sh \e[0m"
-				let failed++
-				failed_case_list+=$dir
+				echo -e "\e[0;31m[ERROR     ] Fatal Error in catch_properties.sh\e[0m"
+				let failed++ fatal++
+				fatal_case_list+=$dir'\n'
 				break
 			else
 				check_out result.out
@@ -205,14 +205,14 @@ if [ -z $g ]
 then
 if [ $failed -eq 0 ]
 then
-	echo -e "\e[1;32m [  PASSED  ] \e[0m $ok test cases passed."
+	echo -e "\e[0;32m[ PASSED   ] \e[0m $ok test cases passed."
 else
-	echo -e "\e[1;33m [  FAILED  ] \e[0m $failed test cases out of $[ $failed + $ok ] failed."
-	echo $failed_case_list
+	echo -e "[WARNING   ] $failed test cases out of $[ $failed + $ok ] failed."
+	echo -e $failed_case_list
 	if [ $fatal -gt 0 ]
 	then
-		echo -e "\e[1;31m [  FAILED  ] \e[0m $fatal test cases out of $[ $failed + $ok ] produced fatal error."
-		echo $fatal_case_list
+		echo -e "\e[0;31m[ERROR     ]\e[0m $fatal test cases out of $[ $failed + $ok ] produced fatal error."
+		echo -e $fatal_case_list
 		exit 1
 	fi
 fi
diff --git a/tests/integrate/CASES b/tests/integrate/CASES
index 8c78e47153..7aca91ada5 100644
--- a/tests/integrate/CASES
+++ b/tests/integrate/CASES
@@ -59,6 +59,7 @@
 115_PW_sol_H2
 115_PW_sol_H2O
 116_PW_scan_Si2
+117_PW_comp_H2O
 120_PW_KP_MD_ADS
 120_PW_KP_MD_FIRE
 120_PW_KP_MD_LGV
@@ -143,6 +144,7 @@
 220_NO_KP_MD_NVE
 #230_NO_KP_MD_TD
 240_NO_KP_15_SO
+240_NO_KP_15_SO_average
 250_NO_KP_CR_VDW2
 250_NO_KP_CR_VDW3
 260_NO_15_PK_PU_AF
diff --git a/tests/integrate/tools/catch_properties.sh b/tests/integrate/tools/catch_properties.sh
index 9d11bc46dd..8d90f1aa2d 100755
--- a/tests/integrate/tools/catch_properties.sh
+++ b/tests/integrate/tools/catch_properties.sh
@@ -44,6 +44,7 @@ out_dm=`grep out_dm INPUT | awk '{print $2}' | sed s/[[:space:]]//g`
 out_mul=`grep out_mul INPUT | awk '{print $2}' | sed s/[[:space:]]//g`
 gamma_only=`grep gamma_only INPUT | awk '{print $2}' | sed s/[[:space:]]//g`
 imp_sol=`grep imp_sol INPUT | awk '{print $2}' | sed s/[[:space:]]//g`
+comp_chg=`grep comp_chg INPUT | awk '{print $2}' | sed s/[[:space:]]//g`
 #echo $running_path
 base=`grep -En '(^|[[:space:]])basis_type($|[[:space:]])' INPUT | awk '{print $2}' | sed s/[[:space:]]//g`
 word="driver_line"
@@ -290,6 +291,17 @@ if ! test -z "$imp_sol" && [ $imp_sol == 1 ]; then
 	echo "esolcavref $esol_cav" >>$1
 fi
 
+if ! test -z "$comp_chg" && [ $comp_chg == 1 ]; then
+	ecomp_self=`grep E_comp_self $running_path | awk '{print $3}'`
+	ecomp_electron=`grep E_comp_electron $running_path | awk '{print $3}'`
+	ecomp_nuclear=`grep E_comp_nuclear $running_path | awk '{print $3}'`
+	ecomp_tot=`grep E_comp_tot $running_path | awk '{print $3}'`
+	echo "ecompselfref $ecomp_self" >>$1
+	echo "ecompelectronref $ecomp_electron" >>$1
+	echo "ecompnuclearref $ecomp_nuclear" >>$1
+	echo "ecomptotref $ecomp_tot" >>$1
+fi
+
 #echo $total_band
 ttot=`grep $word $running_path | awk '{print $3}'`
 echo "totaltimeref $ttot" >>$1